It was an interesting question, and one I am sure many companies face, whether or not they explicitly acknowledge it.
It is important to understand that every company has good data, and not to complain that data is unavailable or unusable. There is certainly opportunity to enlarge the data set and improve its veracity, but waiting for that to happen would only delay your analytics initiative further and further - not a good idea. Here is an option to consider, with a few steps to implement it:
- Use an unsupervised machine learning algorithm - Apply the algorithm to the data set as it is available, disregarding any data quality issues to begin with. An unsupervised machine learning algorithm would cluster the data into different categories or classes based on the information that is available. The resulting clusters may then be analyzed to reveal the insight and information inherent in the data. Remember that the data may be in its raw or native form in schema-less storage (think Hadoop), in transactional systems or data warehouses, or in a combination of these.
- Present the initial findings - Polish the analytical output, express it in human-understandable, business-centric language and present it to the stakeholders. Allow them to chew on the findings as you explain them. Remember, the good thing here is that you take no blame for your ability or inability to extract the insight the client is looking for; it is their data, with no quality profiling or transformations applied!
- Gather feedback from the client - The feedback would most probably come back as a combination of the following:
  - Pleasantly surprised with the initial insight and its value.
  - The insight is not complete and requires further tuning.
- Perform data analysis - At this point, given the type of analytical insight being sought, it is time to look more closely at the data. Analyze the completeness and quality of the data and determine which subsets have issues (quality and completeness, among others) that need to be fixed. This is when proper data profiling and transformations need to be put in place to go after the expected insight.
- Add more data sources - Depending on the nature of the insight being sought, the initial analysis and assessment may show that a different data source or data set is required. I'd recommend not performing full-scale data profiling on the new data set; just bring it in! Instead, focus your data quality efforts on the original data sets.
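To make the first step concrete, here is a minimal sketch of clustering raw records without cleaning them first. It uses a plain k-means written with the Python standard library; the choice of k-means, the sample records, the field meanings and the mean-fill for gaps are all assumptions made for illustration, not a prescription. In practice you would more likely reach for a library implementation.

```python
import random
from statistics import mean

def fill_missing(rows):
    """Replace None with the column mean so distances can be computed.
    This is a stopgap to get started, not a data quality fix."""
    cols = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, col_means)] for row in rows]

def kmeans(rows, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # start from k of the rows themselves
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
            for row in rows
        ]
        # Move each centroid to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical raw records: [monthly_spend, visits], with gaps left as found.
raw = [
    [10.0, 1.0], [12.0, None], [11.0, 2.0],
    [95.0, 20.0], [98.0, None], [100.0, 21.0],
]
labels = kmeans(fill_missing(raw), k=2)
print(labels)  # the first three rows share one label, the last three the other
```

The cluster labels themselves carry no meaning; the insight comes from looking at what the members of each cluster have in common, which is what you would polish and present to the stakeholders.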
The advantage of this approach is that it allows you to start very quickly on an analytics solution and then progressively iterate on it until the desired level of prediction accuracy is reached.
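When an iteration reaches the "Perform data analysis" step, a lightweight completeness profile can help decide which subset of the data to fix first. The sketch below is one possible starting point; the record layout, the field names and the 0.9 threshold are hypothetical, and completeness is only one of several quality measures you might profile.

```python
def completeness(records, fields):
    """Fraction of records with a usable (non-empty) value, per field."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in fields
    }

def fields_to_fix(records, fields, threshold=0.9):
    """Fields whose completeness falls below the (assumed) threshold,
    least complete first."""
    scores = completeness(records, fields)
    return sorted((f for f in fields if scores[f] < threshold), key=scores.get)

# Hypothetical customer records; names and values are illustrative only.
records = [
    {"customer_id": 1, "region": "EU", "monthly_spend": 10.0},
    {"customer_id": 2, "region": "", "monthly_spend": None},
    {"customer_id": 3, "region": "US", "monthly_spend": None},
    {"customer_id": 4, "region": "US", "monthly_spend": 95.0},
]
fields = ["customer_id", "region", "monthly_spend"]

scores = completeness(records, fields)
worst_first = fields_to_fix(records, fields)
print(scores)       # {'customer_id': 1.0, 'region': 0.75, 'monthly_spend': 0.5}
print(worst_first)  # ['monthly_spend', 'region']
```

Ranking fields this way keeps the data quality effort focused on the original data sets and on the fields that matter for the insight being sought, rather than profiling everything up front.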
Some may call this approach Agile Analytic Development. I am OK with that. Ask me before using that term though. I may patent it! :-)