Friday, August 31, 2012

Hadoop Usage Patterns

There has been enough talk about how exactly to use Hadoop in an enterprise. Many companies have jumped on the Big Data bandwagon, with their first step typically being a Hadoop installation. The more serious players have started to realize some of the real value of Hadoop beyond the usual hype and have put their minds to using it to their business advantage. This more serious usage of Hadoop has resulted in the emergence of a few patterns. Although it is too early to stamp the 'best practice' seal of authority on these emerging usage patterns, they have merits which warrant some discussion.

The following are some of the emerging themes and patterns of Hadoop usage:

  • As a data dumping ground - Enterprises whose current analytics do not leverage the entire gamut of their data, primarily the unstructured type, are provisioning all data ingested from unstructured and semi-structured sources into a Hadoop file system. Storage is not much of an issue, as commodity hardware has become the norm for provisioning very large data sets. Once provisioned, these large data sets are available for any subsequent analytics to be performed on them.
  • High speed processing - Traditional data warehouses were not built to support analysis or querying on data volumes that break the terabyte or petabyte barrier. The SQL queries would still work, but the time taken to return result sets would typically be inordinately long (e.g. on the order of days). Although data warehouse technologies are catching up, one has to acknowledge that the fundamental design premise of the data warehouse was not optimized for such ultra-large data sets. Data in a Hadoop file system can be processed at very high speeds: the MapReduce model enables programmers to write massively parallel processing logic, so the same queries that take days to return results can execute in a matter of minutes or hours. An improvement of multiple orders of magnitude cannot go unnoticed, and enterprises are applying this usage pattern more regularly and consistently.
  • Storage of only the relevant unstructured data elements - Many enterprises have already started to ingest data from non-conventional (read: unstructured) data sources. However, they are aware that not all ingested data is relevant to business decision making. In such scenarios, enterprises deploy stream-computing pre-processing before data is stored in a Hadoop file system: data-filtering algorithms are applied to the real-time data ingest. These algorithms work on the deconstructed data sets and filter out the data elements that bear no importance to any analytical processing for the enterprise; only the data elements that pass the filters are provisioned into Hadoop. Smart enterprises are keeping their storage, data management, and maintenance costs down by employing such real-time data filtering.
  • Perform data analytics on the entire data set - With the most relevant data elements stored in Hadoop, enterprises are now running analytics on top of this huge data set. It is quite natural that such analytics (on data volumes that were not available before) yields insights into patterns that were hitherto unknown or untapped. Patterns are also evolving around how further insights can be developed when such unstructured data elements are correlated with the structured data that already resides in the data warehouse. Once the relevant unstructured data elements are identified by employing sophisticated mathematical and statistical models, they are further processed (cleansed, quality-checked, etc.) and then stored in the data warehouse. This pattern has the advantage of adding to the data warehouse only those data elements that enable enterprises to build better, more robust predictive models from the transactional records already there.
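The high speed processing pattern above rests on the MapReduce model, which can be sketched as a small local simulation: a mapper emits key/value pairs, the framework groups (shuffles) the pairs by key, and a reducer aggregates each group. The web-log record format and the status-code aggregation below are illustrative assumptions, not part of any Hadoop API; on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict

def mapper(record):
    # Emit (status_code, 1) for each web-log record.
    # Assumes the status code is the last whitespace-separated field.
    status = record.split()[-1]
    yield status, 1

def shuffle(pairs):
    # Group all values by key, as the MapReduce framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Aggregate each group; here, a simple count per status code.
    return key, sum(values)

def run_job(records):
    pairs = [kv for rec in records for kv in mapper(rec)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

logs = [
    "GET /index.html 200",
    "GET /missing 404",
    "GET /home 200",
]
print(run_job(logs))  # → {'200': 2, '404': 1}
```

Because the mapper and reducer are pure functions over independent records and groups, the framework is free to distribute them across machines, which is where the "days to minutes" speedup comes from.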
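The pre-storage filtering pattern above can be sketched as a predicate applied to each record of a (simulated) real-time feed, with only the passing records provisioned for storage. The sensor schema, the tracked sensor names, and the anomaly threshold are all illustrative assumptions.

```python
def relevant(record):
    # Keep only readings from sensors we track whose value looks anomalous.
    # The sensor names and the 3.0 threshold are invented for this sketch.
    return record["sensor"] in {"temp", "pressure"} and abs(record["value"]) > 3.0

def filter_stream(feed):
    # Lazily yield only the records that pass the filter;
    # in a real pipeline each yielded record would be written to HDFS.
    for record in feed:
        if relevant(record):
            yield record

feed = [
    {"sensor": "temp", "value": 5.2},
    {"sensor": "temp", "value": 0.4},
    {"sensor": "humidity", "value": 9.9},
    {"sensor": "pressure", "value": -4.1},
]
kept = list(filter_stream(feed))
print(len(kept))  # → 2
```

The generator keeps memory use constant regardless of feed length, which matches the real-time, record-at-a-time nature of stream pre-processing.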

Other usage patterns for Hadoop will emerge, and some will be hardened through repeatable success into best practices. Till then, we continue to innovate a multitude of ways to get the best out of Hadoop.

I encourage you to add to this repertoire of Hadoop usage patterns and let us create a compendium for the usage of Hadoop for Big Data Analytics!

Thursday, August 9, 2012

Stream Computing (Streams) versus Complex Event Processing (CEP)

There is a general notion among IT professionals that stream computing (a.k.a. Streams) is just another buzz term for traditional complex event processing (CEP). Although there are conceptual similarities between Streams and CEP, and both fall under the analytical discipline of 'Continuous Intelligence', there are a few fundamental differences which put them in different leagues.

CEP is primarily used for analysis/analytics on discrete business events. Events are correlated in time using simple IF/THEN/ELSE logic, and the events need not be of a single type or category. The data encapsulated in the business events is primarily structured in form. Common CEP engines support modest data rates of around 10K messages/second, with latency typically in the 'seconds' range; maximum processing rates can scale up to around 100K events/second.
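The IF/THEN-style temporal correlation that characterizes CEP can be illustrated with a toy rule engine. The rule itself (flag a user with three failed logins inside a 60-second window) is an invented example, not drawn from any particular CEP product.

```python
from collections import defaultdict

def detect_bursts(events, threshold=3, window=60):
    """Flag users who accumulate `threshold` login failures within `window` seconds.

    `events` is an iterable of (timestamp, user, kind) tuples in time order.
    """
    recent_failures = defaultdict(list)
    alerts = []
    for ts, user, kind in events:
        # IF the event is a login failure...
        if kind != "login_failed":
            continue
        recent_failures[user].append(ts)
        # ...keep only failures inside the sliding time window...
        recent_failures[user] = [t for t in recent_failures[user] if ts - t <= window]
        # ...THEN raise an alert once the count crosses the threshold.
        if len(recent_failures[user]) >= threshold:
            alerts.append((user, ts))
    return alerts

events = [
    (0, "alice", "login_failed"),
    (10, "alice", "login_failed"),
    (20, "bob", "login_ok"),
    (30, "alice", "login_failed"),
]
print(detect_bursts(events))  # → [('alice', 30)]
```

Note that the logic is purely rule-based counting and thresholding on structured event fields; there is no statistical model involved, which is exactly the contrast with Streams drawn below.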

Streams, on the other hand, is designed to handle processing rates an order of magnitude higher than CEP: millions of events per second, with built-in linear scalability. Streams data sources are typically of a single event type, e.g. camera feeds from traffic signals, or sensor data generated by a pipeline or medical device. Streams is designed to handle the full gamut of unstructured data and, in contrast to the IF/THEN/ELSE-based logic of CEP, is capable of performing advanced analytics on the data set. Such analytics are limited only by the power of the underlying mathematical and statistical models; Fast Fourier Transforms, the Holt-Winters algorithm, and time-series analysis algorithms would be some real-world examples.
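To make the contrast concrete, here is a minimal sketch of the kind of model-based analytic a Streams operator might apply as each reading arrives: Holt's double exponential smoothing (the level-and-trend core of Holt-Winters), producing a one-step-ahead forecast per tuple. The smoothing parameters and the sensor readings are illustrative assumptions.

```python
def holt_smooth(series, alpha=0.5, beta=0.3):
    """Return one-step-ahead forecasts for a stream of readings.

    alpha smooths the level, beta smooths the trend; both are
    illustrative values, normally tuned to the data.
    """
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level + trend)  # forecast made before seeing x
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts

# A steadily rising sensor reading; the forecasts track the linear trend.
readings = [10.0, 12.0, 14.0, 16.0, 18.0]
print(holt_smooth(readings))
```

Unlike the CEP rule above, this operator carries statistical state (level and trend) and updates a model incrementally per tuple, which is the style of computation Streams is built around.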

To summarize, although both Streams and CEP fall under the category of 'Continuous Intelligence', keep the following image in mind when any of your colleagues engages in the discussion: