Friday, August 31, 2012

Hadoop Usage Patterns

There has been enough talk about how exactly to use Hadoop in an enterprise. Many companies have jumped into the Big Data bandwagon with their first step typically being to get a Hadoop installation. The more serious players have started to realize some of the real values of Hadoop beyond the usual hype and have put their minds in using it to their business advantage. The more serious usage of Hadoop has resulted in the emergence of a few patterns. Although it is too early to stamp the 'best practice' authority seal on the emerging usage patterns, the usages have some merits which warrants some discussion.

The following are some of the emerging themes and patterns of Hadoop usage:

  • As a data dumping ground - Enterprises today, who have not had their current state analytics leverage the entire gamut of data set, primarily the unstructured type, are provisioning all their ingested data from various unstructured and semi-structured sources into a Hadoop file system. Storage is not too much of an issue as commodity hardware is becoming the usage norm to provision very large data sets. Such large data sets once provisioned are made available for any subsequent analytics to be performed on the same.
  • High speed processing - Traditional data warehouses were not built to support data analysis or querying on data volumes which breaks the terabyte or petabyte barriers. The SQL queries would still work but the time taken for them to return the result sets would typically be in inordinately large (e.g. the order of days). Although the data warehouse technologies are catching up one has to acknowledge the fact that the fundamental design premise of data warehouse were not optimized to handle such ultra large data sets. Data in a Hadoop file system can be processed at very high speeds. The MapReduce technology enables programmers to write massively parallel processing logic which makes the same queries, which takes in the order of days to return results, to execute in a matter of minutes or hours. This multiple order of magnitude improvement cannot get unnoticed and enterprises are using this usage patterns much more regularly and consistently.
  • Storage of only the relevant unstructured data elements - Lot of enterprises have already started to ingest data from non-conventional (read it as - non-structured) data sources. However, they are aware that not all ingested data is of relevance to business decision making. In such scenarios, enterprises are deploying stream computing pre-processing before data is stored in a Hadoop file system. In these cases, data filtering algorithms are used on the real time data ingest. Such algorithms work on the deconstructed data sets and filter out the data elements which bear no importance to any analytical processing for the enterprise. The data elements which pass the processing filters are provisioned into Hadoop. Smart enterprises are keeping their storage and data management and maintenance costs down by employing such real time data filtering technologies.
  • Perform data analytics on the entire data set - Assuming that the most relevant data elements have all their data points stored in Hadoop, enterprises are now running analytics on top of the huge data set. It is quite natural that such analytics (on the data set volume which was not available before) is yielding more insights into patterns which were hitherto unknown or untapped. Patterns are also evolving on how further insights can be developed when such unstructured data elements are correlated with the structured data which already resides in the data warehouses. Once such unstructured data elements are identified by employing sophisticated mathematical and statistical models, the identified data elements are further processed (cleansed, quality-checked, etc.) and then passed on to be stored in the data warehouses. This pattern has the advantage of adding only those data elements to the data warehouse which are enabling enterprises to develop better, more robust predictive models from the transactional data records in the data warehouses.

Other usage patterns for Hadoop will emerge and some will be hardened with repeatable success to be imprinted as best practices. Till then, we continue to innovate a multitude of ways to get the best usage out of Hadoop.

I encourage you to add to this repertoire of Hadoop usage patterns and let us create a compendium for the usage of Hadoop for Big Data Analytics!

No comments:

Post a Comment