Monday, October 1, 2012

Agile Big Data Analytics: How to start a Big Data Analytics project

I was speaking at a client briefing today when a question was asked: How do you start a big data analytics project? Do you first try to get the veracity and quality of all the data right before you embark on such an initiative? Or are there alternative ways of kicking it off?

It was an interesting question, and one I am sure many companies face, whether or not they explicitly acknowledge it.

It is important to recognize that every company has good data, rather than complain that data is unavailable or unusable. There is certainly opportunity to grow the data set and improve its veracity, but waiting for that to happen would only delay your analytics initiative further and further, which is not a good idea. Here is an option to consider, with a few steps to implement it:

  1. Use an unsupervised machine learning algorithm - Apply machine learning algorithms to the available data set, disregarding any data quality issues to begin with. An unsupervised algorithm can cluster the data into different categories or classes based on the information that is available. The resulting clusters, classes or categories may then be analyzed to reveal the insight and information inherent in the data. Remember that the data set may be data in its raw or native form in schema-less storage (think Hadoop), in transactional systems or data warehouses, or in a combination of these.
  2. Present the initial findings - Polish the analytical output, put it into business-centric language that people can readily understand, and present it to the stakeholders. Allow them to chew on the findings as you explain them. Remember, the good thing here is that you cannot be blamed for failing to extract the insight the client is looking for; it is their data, with no quality profiling or transformations applied!
  3. Gather feedback from the client - The feedback will most probably come back as a combination of the following:
    • Pleasantly surprised with the initial insight and its value.
    • The insight is not complete and requires further tuning.
  4. Perform data analysis - At this point, given the type of analytical insight being sought, it is time to look more closely at the data. Analyze the completeness and quality of the data and determine which subsets have issues (quality and completeness, among others) that need to be fixed. This is when proper data profiling and transformations need to be put in place to go after the insight that is expected.
  5. Add more data sources - Depending on the nature of the insight being sought, the initial analysis and assessment may show that a different data source or data set is required. I'd recommend not performing full-scale data profiling on the new data set; just bring it in! Instead, focus your data quality efforts on the original data sets.

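
As a rough illustration of step 1, here is a minimal sketch in Python. It assumes k-means as the clustering technique (the steps above do not prescribe one), uses made-up records, and simply works with whichever rows parse as numbers instead of cleaning the data first. The names `to_number`, `kmeans` and `raw_records` are hypothetical.

```python
import math
import random


def to_number(value):
    """Best-effort parse of a raw field; unusable values become None."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def kmeans(rows, k, iterations=20, seed=0):
    """Plain k-means on numeric rows; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        # Move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Made-up raw records; two of them are unusable and are set aside, not fixed.
raw_records = [
    ("1.0", "1.2"), ("0.8", "1.1"), ("1.1", "0.9"),
    ("8.0", "8.3"), ("7.9", "8.1"), ("8.2", "7.8"),
    ("n/a", "3.0"), ("", ""),
]
parsed = [tuple(to_number(v) for v in record) for record in raw_records]
usable = [row for row in parsed if None not in row]

centroids, labels = kmeans(usable, k=2)
for row, label in zip(usable, labels):
    print(label, row)
```

The point of the sketch is the order of work: cluster what is usable now, and leave the two unparseable records for the later data quality pass.
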
I hope you can detect the pattern here. The idea is to bring the data in, apply analytics to it at the outset, and derive as much value as the raw data is able to provide. Analyze the outcome of the initial analytics and then introduce the necessary data quality, profiling and transformations only for the portions that need them. Repeat this process iteratively until the results are as expected.
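
The targeted profiling pass could start from something as small as a completeness report per field. The sketch below is only one possible shape for it: the field names, the set of values treated as missing, and the 90% threshold are all assumptions made for illustration.

```python
def profile_completeness(records, missing=("", "n/a", "null", None)):
    """Return {field: fraction of records that carry a usable value}."""
    fields = sorted({name for record in records for name in record})
    report = {}
    for name in fields:
        present = sum(1 for record in records if record.get(name) not in missing)
        report[name] = present / len(records)
    return report


# Made-up records with gaps in some fields.
records = [
    {"customer": "A", "region": "east", "spend": "120"},
    {"customer": "B", "region": "", "spend": "95"},
    {"customer": "C", "region": "n/a", "spend": "210"},
    {"customer": "D", "region": "west"},
]

report = profile_completeness(records)
# Only fields below the (assumed) threshold get data quality attention.
needs_attention = [name for name, share in report.items() if share < 0.9]
print(report)
print(needs_attention)
```

Fields that already meet the threshold are left alone, so the data quality effort goes only where the initial findings fell short.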

The advantage of this approach is that it allows you to start very quickly on an analytics solution and then progressively iterate over it until a commensurate level of prediction accuracy is obtained.

Some may call this approach Agile Analytic Development. I am OK with that. Ask me before using that term though. I may patent it! :-)


  1. Dear Tilak,

    I recently had the good fortune of reading all the articles posted on your blog, “Big Data and Analytics – the Currency of 21st Century”. I am immensely impressed with how well written and organized they are, and how timely the topics. These articles reveal better ways to exploit the information revolution and state-of-the-art information, and are devoted to the emerging interdisciplinary field of Web Science. It takes special dedication and perseverance to write about such refreshing and demanding, yet simple, easy-to-understand, thought-provoking and upbeat topics in Information Technology, which is rapidly changing the universe. These articles are a great motivator for many present-day IT professionals and their successors who want to go the extra mile. They are illustrated in such a modest, sound and practical way that they have already aided me and my team a great deal with our work, and I am quite confident that many people will benefit from them for years to come. I am eagerly looking forward to your future posts and offer my sincere gratitude for the enormous effort that you have already made for us.

    Yours Sincerely,
    Sarbani Duttagupta, PhD, PGDBM (International Business)
    Head (R&D)
    Fertin Pharma A/s
    (Bagger-Sørensen Group)
    Dandyvej 19 • 7100 Vejle • Denmark

    1. Sarbani,
      I am happy to hear that you and your team intend to benefit from some of my thoughts and experiences in this emerging field. It is only fair to state that "The Art of the Possible" is yet to be tapped and we are just at the very onset of a fascinating journey powered by insight. The subtle difference here is that I am now confidently speaking about a journey powered by insight and not just by information.
      That is the leap which Big Data and Analytics is empowering!

      If you and/or your team have any specific questions or topics in Big Data Analytics, please feel free to drop me a note or start a thread here and I shall offer my thoughts.