Thursday, November 1, 2012

Analytic Governance

Any major IT discipline that is a serious undertaking in an enterprise requires some form of IT governance to regulate its implementation. Different IT sub-disciplines require their own governance: data governance for data architecture, implementation and maintenance; integration governance for middleware architecture and infrastructure; SOA governance for service design, versioning and funding; and so on. Analytics is increasingly assuming its own space and place in the enterprise. The discipline of analytics ties the business and IT organizations together very closely; it is imperative that it does. The strategy of investing in the right areas of the business - where analytical capability and insight can yield a distinctive advantage - should be closely coupled with the data scientists who develop the predictive models and the IT system designers and implementers who orchestrate analytics-driven business processes and deploy the solutions to production.

The critical significance of the analytics discipline warrants a separate, dedicated and focused analytic governance model or framework. Analytic governance is expected to be in its infancy, considering that analytics itself is maturing with time as more and more implementations are undertaken. The maturity of analytic governance should closely follow the maturity of analytics itself.

Although nowhere close to a stable state of maturity, the following are some areas where analytic governance processes and policies ought to be defined, along with the rationale for each:

  • Data Reconciliation - The data sets used by data scientists to develop analytical and predictive models are typically extracts from the real sources, merged from various systems into a flattened, Excel-like format on which the models are then developed. This data may not match the production systems where the models are ultimately deployed. A proper data reconciliation process needs to be established and followed to ensure that the models operate on the right data and generate correct predictions.
  • Model Currency - Given the latency between model development and deployment, the data may have changed; for example, new data types or different data sources may have been introduced. The currency of the model and its applicability at the time of deployment need to be assessed, and a well-engineered process for doing so must be developed.
  • Analytics Sandbox - Data scientists require an analytic sandbox with the analytical tools and the data they need to perform data mining and exploratory techniques and identify the right algorithms and models. Such activities are often data and compute intensive. Dedicated capacity must be set aside for these computations while ensuring they do not affect the transactional systems. Workload planning, guidelines, infrastructure and best practices must be developed and implemented.
  • Business Rules Vitality - The applicability of outcomes from predictive models is contingent upon regulatory requirements, business policies, mandates and the like, which govern the correct application of model outputs in the context of a business process. Business rules are formulated and codified to bring model outputs into the enterprise's business processes. Such rules need to be revisited for validity and conformity on a periodic basis to ensure that regulatory changes and internal business policies are appropriately enforced.
  • Model Deployment - A proper process must be designed and followed for deploying newer versions of models into production. Model versioning assumes significance when multiple versions of various models are put into production to test for best fit. Guidelines on reducing the latency between model development and deployment must be developed, and commensurate IT infrastructure must be built to support such capabilities.
  • Communication - Using predictive models to predict outcomes or suggest the Next Best Optimized Action requires a paradigm shift from the traditional, human-expertise-and-intuition-driven 'sense and respond' mode of business operations to a real-time, fact-driven 'predict and act' modus operandi. This cultural shift is going to be the hardest one to address, as it requires people to start thinking and behaving differently - to start relying on system predictions more than their own judgement! Unless the value of predictive models is socialized adequately right from the onset, their adoption will face significant cultural and adaptability challenges. A proper education and communication plan needs to be devised and followed.
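To make the data reconciliation point concrete, here is a minimal Python sketch that compares a flattened model-development extract against production rows, flagging missing columns and crude mean drift. The function name, the 10% threshold and the row-dict data format are assumptions made for illustration, not any standard process:

```python
def reconcile(extract, production, tol=0.10):
    """Report divergences between the flattened extract the model was
    developed on and the production data set.  Each data set is a list
    of row dicts; this is a sketch, not a full data profiler."""
    findings = []
    ext_cols = set().union(*(row.keys() for row in extract))
    prod_cols = set().union(*(row.keys() for row in production))
    # 1. Columns the model saw at development time but production lacks.
    missing = ext_cols - prod_cols
    if missing:
        findings.append("columns missing in production: %s" % sorted(missing))
    # 2. Crude drift check: compare column means on shared numeric columns.
    for col in ext_cols & prod_cols:
        ext_vals = [r[col] for r in extract if isinstance(r.get(col), (int, float))]
        prod_vals = [r[col] for r in production if isinstance(r.get(col), (int, float))]
        if ext_vals and prod_vals:
            m_ext = sum(ext_vals) / len(ext_vals)
            m_prod = sum(prod_vals) / len(prod_vals)
            if m_ext and abs(m_prod - m_ext) / abs(m_ext) > tol:
                findings.append("mean of '%s' drifted: %.2f -> %.2f" % (col, m_ext, m_prod))
    return findings

extract = [{"age": 30, "income": 100}, {"age": 50, "income": 100}]
production = [{"age": 31}, {"age": 51}]
issues = reconcile(extract, production)
# flags the 'income' column missing from production; the small shift in
# 'age' stays under the drift threshold
```

A real reconciliation process would of course compare far more than column sets and means, but even a check this simple catches the common case where the production feed lacks a feature the model was trained on.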

I cannot stress enough that analytics and analytic governance are still in their infancy and that both disciplines have a long way to go before they can be etched in stone. Nonetheless, my effort in this post has been to raise awareness among enterprises that analytic governance is fast becoming an imperative.

Sunday, October 28, 2012

Big Data - An Appliance Approach

The growth of data has been exponential, and there is no respite on the near horizon. The IT industry, in an effort to brace itself for the information deluge, has had to devise multiple lines of attack on the problem. One such approach is the use of purpose-built appliances geared towards Massively Parallel Processing (MPP). MPP is driven by the principle of a shared-nothing architecture, which postulates that multiple dedicated banks of [processor, disk, network, memory], arranged to take advantage of parallel processing algorithms, can provide a realistic means of handling analytical processing of massive data sets. One such product is IBM Netezza, a very smart IBM acquisition from November 2010.

The architecture of IBM Netezza is very interesting, and it is what I am going to share today.

IBM Netezza is a data warehouse. A major part of IBM Netezza data warehouse appliance’s performance advantage comes from its unique Asymmetric Massively Parallel Processing (AMPP) architecture, which combines a Symmetric Multi-Processing (SMP) front-end with a shared-nothing Massively Parallel Processing (MPP) back-end for query processing.

Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. Up to a thousand of these customized MPP streams can work together to “divide and conquer” the workload.

Referring to figure above, the following is a brief description of the architectural building blocks of AMPP:
Hosts - The SMP hosts are high-performance IBM servers running Linux, set up in an active-passive configuration for high availability. The active host presents a standardized interface to external tools and applications. It compiles SQL queries into executable code segments called snippets, creates optimized query plans and distributes the snippets to the MPP nodes for execution.

Snippet Blades (S-Blades) - S-Blades are intelligent processing nodes that make up the MPP engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.

Disk enclosures - The disk enclosures contain high-density, high-performance storage disks that are RAID protected. Each disk contains a slice of the data in a database table. The disk enclosures are connected to the S-Blades via high-speed interconnects that allow all the disks in the appliance to simultaneously stream data to the S-Blades at the maximum rate possible.

Network fabric - All system components are connected via a high-speed network fabric. The appliance runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.

Inside the S-Blade
The extreme performance happens at the core of the IBM Netezza appliance: the S-Blade, which contains the FAST Engines.

A dedicated high-speed interconnect from the storage array allows data to be delivered to memory as quickly as it can stream off the disk. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines running in parallel inside the FPGAs uncompress and filter out 95-98 percent of table data at physics speed, keeping only the data that is relevant to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors running in an IBM Netezza data warehouse appliance.

The FPGA is a critical enabler of the price-performance advantages of IBM Netezza data warehouse appliances. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.

The FAST Engine includes:
Compress engine – uncompresses data at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse - the hard disk.

Project and Restrict engines - further enhance performance by filtering out columns and rows respectively, based on the parameters in the SELECT and WHERE clauses of an SQL query.

Visibility engine - plays a critical role in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. It filters out rows that should not be “seen” by a query; e.g. rows belonging to a transaction that is not yet committed.

How a query is optimized
Let us take an example of how IBM Netezza optimizes the queries in its core query processing engine.

Referring to the figure above, one can think of the way data streaming works in IBM Netezza as similar to an assembly line. The IBM Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operates concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases performance relative to other architectures.

Compressed data gets streamed from disk onto the assembly line at the fastest rate that the physics of the disk would allow. The data could also be cached, in which case it gets served right from memory instead of disk.

Here are the high level steps which are followed:
1. The first stage in the assembly line, the Compress Engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse—the disk.
2. The disk block is then passed on to the Project Engine or stage, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed.
3. The assembly line then moves the data block to the Restrict Engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause.
4. The Visibility Engine also feeds in additional parameters to the Restrict engine, to filter out rows that should not be “seen” by a query e.g. rows belonging to a transaction that is not committed yet.
5. The Processor Core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. It finally assembles all the intermediate results together from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
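The five assembly-line stages above can be sketched as a chain of Python generators. This is only a toy, in-process model of the idea - the stage names, row shapes and the SUM snippet are illustrative, not the appliance's actual internals:

```python
# Each stage consumes the previous stage's stream, so disk, FPGA and
# CPU work can overlap just like stations on an assembly line.

def uncompress(blocks):
    # Compress Engine: each compressed disk block expands into rows.
    for block in blocks:
        yield from block

def project(rows, columns):
    # Project Engine: keep only the columns named in the SELECT clause.
    for row in rows:
        yield {c: row[c] for c in columns}

def restrict(rows, predicate):
    # Restrict + Visibility Engines: drop rows that fail the WHERE
    # clause or belong to an uncommitted transaction.
    for row in rows:
        if predicate(row):
            yield row

def snippet(blocks):
    # CPU core stage: aggregate the filtered stream (here, a SUM).
    rows = restrict(project(uncompress(blocks), ["amount"]),
                    lambda r: r["amount"] > 0)
    return sum(r["amount"] for r in rows)

blocks = [[{"amount": 5, "id": 1}, {"amount": -2, "id": 2}],
          [{"amount": 7, "id": 3}]]
total = snippet(blocks)   # 5 + 7 = 12; the -2 row is filtered out
```

Because every stage is a generator, no stage waits for the whole data set; each row flows through as soon as the previous stage releases it, which is the essence of the streaming design described above.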

I hope this gives you a decent overview of the built-in architectural smarts of IBM Netezza, which make it one of the strongest data warehouse and analytical appliances in the industry.


Thursday, October 18, 2012

The Optimization Edge - Using Big Data Analytics

A Google search on Big Data these days returns an overwhelming amount of content, possibly too much to consume. There are many valuable treatments demonstrating the value of Big Data technologies and solutions. And I remain a student trying to pick up as much as my few gray cells will accommodate! :-)

In this post I wanted to highlight a paradigm shift which I have been noticing and which I wanted to share.

The traditional approach to analytics has been based primarily on building applications to automate business processes. The information generated by these applications across the enterprise is consolidated in data warehouses for historical analysis of specific domains and business subject areas. Such analytical efforts have been leveraged over time and have enabled companies to macro-optimize parts of their operations. Production planning, capital investment, marketing strategy and budgeting are some examples of organizational strategies that take advantage of analytics - all of which deliver macro optimization to enterprise operations.

The approach that underpins traditional macro optimization is still rooted in a human-expertise-and-intuition-driven mode of business operations. I believe the holy grail of competitive advantage lies in the mindset, ability and strategy to break away from this system-enabled, human-driven approach to macro optimization, towards one in which the next generation of efficiencies is achieved by providing precise, contextual analytics at the point of business impact - thereby adopting a 'predict and act' modus operandi.

The ability to inject analytics at the point of business impact enables a more real-time, fact-driven mode of business operations, which attempts to continuously determine the 'Next Best Action' that minimizes risk and maximizes opportunity. This paradigm assumes that analytics is performed on the data much closer to the space and time where it is generated, as opposed to the traditional method of provisioning the data first and executing analytics on it afterwards. This is a paradigm shift from the traditional approach, and it attempts to provide micro optimization of enterprise business operations. The "micro" qualifier refers both to the latency between data generation and its transformation into insightful information, and to the smaller scales at which optimization can affect the business - it need not always be at scales as big as capital investment or the next year's marketing strategy.

With Big Data tools, techniques, technologies and solutions, we cannot help but relate their potential to the "Art of the Possible". Let me leave you with a few scenarios.
Imagine if ...:

  • We could predict the onset of fatal diseases in premature babies well before they are hit by them, and save their lives!
  • We could leverage social information to identify customer churn and take immediate remedial action!
  • We could offer discount coupons to a customer and display them on her car dashboard screen or her smartphone as she drives by an area where a shop exists!
  • We could detect the imminent failure of a costly machine part well ahead of the next scheduled maintenance window and proactively repair it, saving costly production downtime!

Yes, I started by saying "Imagine If ...". But let me tell you that current day technologies, tools and capabilities are shrinking the time warp and making this futuristic "Art of the Possible", real today, right here, right now!

To summarize, Big Data technologies and solutions enable analytics to assist in business transformation through iterative micro optimization of the enterprise value chain. And this is made possible by developing and providing a platform for fact-driven, real-time, optimized decision systems.

P.S.: The above imaginations are all implemented today as real-time solutions, micro-optimizing our connected enterprises at every opportunity possible.


Monday, October 1, 2012

Agile Big Data Analytics: How to start a Big Data Analytics project

I was at a client briefing today where I was speaking and a question was asked: How do you start a big data analytics project? Do you first try to get the veracity and quality of all the data correct before you embark on such an initiative? Or are there alternative ways of kicking off such an initiative?

It was an interesting question and one I am sure many companies are faced with regardless of whether they explicitly accept it or not.

It is important to understand that every company has good data, and not to crib about data being unavailable or unusable. Indeed, there is opportunity to improve the data set and the data veracity, but waiting for that to happen would only delay your analytics initiative further and further - not a good idea. Here is an option to consider, with a few steps to implement it:

  1. Use unsupervised machine learning - a technique that employs machine learning algorithms on the available data set, disregarding any data quality issues to begin with. An unsupervised algorithm clusters the data into different categories or classes based on the information that is available. Those clusters, classes or categories may then be analyzed to reveal the insight inherent in the data. Remember that the data set may be raw or native data in schema-less (think Hadoop) storage, data in transactional systems or data warehouses, or a combination of both.
  2. Present the initial findings - Polish the analytical output, make it presentable in human-understandable, business-centric lingo, and present it to the stakeholders. Allow them to chew on the findings as you explain them. Remember, the good thing here is that you take no blame for your ability or inability to extract the insight the client is looking for; it is their data, with no quality profiling or transformations!
  3. Gather feedback from the client - The feedback will most probably come back as a combination of the following:
    • Pleasantly surprised by the initial insight and its value.
    • The insight is not complete and requires further tuning.
  4. Perform data analysis - At this point, given the type of analytical insight being sought, it is time to look more closely at the data. The completeness and quality of the data must be analyzed to determine which subset may have issues (quality, completeness, among others) that need to be fixed. This is the time when proper data profiling and transformations need to be put in place to go after the expected insight.
  5. Add more data sources - Depending on the nature of the insight being sought, the initial analysis and assessment may indicate that a different data source or data set is required. I'd recommend not performing full-scale data profiling on the new data set; just bring it in! Focus your data quality efforts on the original data sets instead.
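Step 1 can be illustrated with a bare-bones clustering sketch. In practice you would likely reach for a library such as scikit-learn; this minimal k-means (naive initialisation, toy 2-D points - all assumptions for illustration) simply shows how raw rows group themselves without any cleansing up front:

```python
def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=2, iters=10):
    """Minimal k-means: assign points to nearest centroid, then move
    each centroid to the mean of its members, and repeat."""
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]   # naive init
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid ...
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # ... then recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
# labels -> [0, 0, 0, 1, 1, 1]: two natural groupings emerge from the
# raw points, with no profiling or transformation done beforehand
```

The point of the exercise is exactly the one made in step 2: whatever structure falls out of the raw data becomes the first conversation with the stakeholders, before any investment in data quality.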
I hope you can detect the pattern here. The idea is to bring the data in, apply analytics to it at the onset, and derive as much value as the raw data is able to provide. Analyze the outcome of the initial analytics, and then introduce the necessary data quality, profiling and transformations to the portions that need them. Repeat this process iteratively until the results are as expected.

The advantage of this approach is its nature of allowing you to start very quickly on an analytics solution and then progressively iterate over it until the commensurate level of prediction accuracy is obtained.

Some may call this approach Agile Analytic Development. I am OK with that. Ask me before using the term, though. I may patent it! :-)

Tuesday, September 25, 2012

Difference between Descriptive, Predictive and Prescriptive Analytics

Analytics as a discipline has matured beyond its age, propelled by a new era in which the human-machine network is more Instrumented, Interconnected and Intelligent. This new era has made data more accessible than ever before - data which the science of analytics can leverage not only to increase the accuracy of future predictions, but also to up the ante by one level and start optimizing for the best outcome from a set of predicted possibilities.

Analytics has a maturity curve, or more of a roadmap, which starts with Descriptive Analytics, works its way up to Predictive Analytics and arrives ultimately at Prescriptive Analytics.

Descriptive Analytics, often called after-the-fact analytics, reports on what happened and the frequency of occurrence of a certain event or action, and provides drill-down capabilities to get to the root cause of a problem. It provides various reporting views based on user roles: summary views for executive dashboards, metric views for mid-level managers, and drill-down root cause analysis for engineers and domain experts. Descriptive Analytics is rooted in what is known as traditional BI reporting.

Predictive Analytics focuses on simulating what could happen in the future given the conditions of the recent past, and on forecasting the next possible events if the current trend continues for a given period of time. Predictive Analytics is rooted in building supervised and unsupervised machine learning algorithms and models.

Prescriptive Analytics builds on top of Predictive Analytics and focuses on evaluating the various possible outcomes from predictive models and coming up with the best possible outcome by employing optimization algorithms. Such algorithms are also capable of factoring in the effects of variability. Prescriptive Analytics leverages stochastic optimization algorithms and models.
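To make the three levels concrete, here is a toy Python sketch on a daily demand series. The numbers, the naive linear trend standing in for a predictive model, and the tiny profit maximization standing in for a prescriptive optimizer are all assumptions made for illustration:

```python
# Toy illustration of the three analytics maturity levels.
demand = [100, 110, 120, 130, 140]           # units sold per day

# Descriptive: what happened?
average = sum(demand) / len(demand)          # 120.0

# Predictive: what could happen next?  (naive linear trend)
trend = (demand[-1] - demand[0]) / (len(demand) - 1)
forecast = demand[-1] + trend                # 150.0

# Prescriptive: what should we do about it?  Choose the stocking level
# that maximizes profit under the forecast.
def profit(stock, expected_demand, price=5, cost=3):
    # Revenue is capped by demand; cost is paid on everything stocked.
    return price * min(stock, expected_demand) - cost * stock

best_stock = max(range(100, 201, 10),
                 key=lambda s: profit(s, forecast))   # 150
```

Real prescriptive analytics would, as noted above, use stochastic optimization that factors in variability rather than a single point forecast; the sketch only shows how each level builds on the one below it.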

It is imperative to realize that there is no shortcut for any enterprise to achieve the highest maturity level in analytics (i.e., Prescriptive Analytics) without first developing a solid and sound foundation of descriptive analytics, followed by predictive analytics.

Enterprises also need to realize that just being in the new era of the instrumented, interconnected and intelligent human-machine network does not give them a free ticket to the data - the data that is required for analytics to be useful. A solid foundation of data access, with a key focus on ensuring that the veracity and viscosity of the data are of superior quality, is the very first step to reaping the benefits of modern-day analytic processing.

Friday, September 14, 2012

Data Virtualization - Virtualize more than Consolidate

Data consolidation continues to be a persistent IT challenge, a source of constant frustration and IT spend. The days of full-time IT spend on continuous data consolidation against an ever-moving target of data sources and data types should be over. Well, even if "over" is too strong, at a minimum enterprises should be seriously considering the alternatives. This is where Data Virtualization comes to the party!

Data virtualization provides an abstraction layer with the hooks needed to take a business-centric query and deconstruct it into a set of atomic queries. Each atomic query focuses on a subset of the data elements and types from the original business-centric query and determines which data source(s) to go against to retrieve the data. The Data Virtualization layer executes each atomic query, and the returned data sets are then processed (joined) to form the final consolidated result set, which is returned as the answer to the business-centric query. The mode of data return can be standard SQL, Web Services or any other format standard enough to be consumable by business and/or enterprise applications.
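The decomposition just described can be sketched in a few lines of Python. The catalog, the source names (`crm`, `billing`) and the join key are illustrative assumptions for the sketch, not any particular product's API:

```python
# A catalog maps each source to the attributes it owns; the "sources"
# here are just in-memory row lists standing in for real systems.
CATALOG = {
    "crm":     {"customer_id", "name"},
    "billing": {"customer_id", "balance"},
}
SOURCES = {
    "crm":     [{"customer_id": 1, "name": "Ada"}],
    "billing": [{"customer_id": 1, "balance": 250}],
}

def route(attributes):
    # Decompose: decide which source serves each requested attribute.
    plan = {}
    for attr in attributes:
        for source, cols in CATALOG.items():
            if attr in cols:
                plan.setdefault(source, set()).add(attr)
                break
    return plan

def virtual_query(attributes, key="customer_id"):
    # Execute one "atomic query" per source, then join on the key.
    plan = route(set(attributes) | {key})
    merged = {}
    for source, cols in plan.items():
        for row in SOURCES[source]:
            merged.setdefault(row[key], {}).update({c: row[c] for c in cols})
    return list(merged.values())

result = virtual_query(["name", "balance"])
# one consolidated row, joined across the two sources
```

The consumer sees a single business-centric result set; whether `name` and `balance` live in one system or five is hidden behind the catalog, which is the essence of the virtualization approach.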

The technology is real today. It is only important that enterprises take a close look at Data Virtualization and consider leveraging the same as a part of their overall enterprise data architecture strategy.

And yes, the technologies today can virtualize across both structured and unstructured data spread across databases and schema-less file systems.

Saturday, September 1, 2012

The Genesis of Big Data

Big Data, big data, big data! It is the hype that has taken the IT industry by storm. The term, formed from a combination of two of the simplest words - big and data - has had a profound impact. Enterprises are intrigued by Big Data, and all of them feel that there is something in it for them.

There is no doubt that data has grown, and grown in huge proportions. If you think about it from a different angle, though, this data was already there. What has changed is that technology now allows enterprises to get access to this huge ocean of data. The fundamental shift is that enterprises traditionally had access to structured data sets, primarily generated from business transactions and internal business process executions. Such data resided mainly in databases and data warehouses, where it was captured in well-structured form. However, with the new era of social computing and of data feeds from a myriad of sources external to the enterprise, the enterprise is suddenly exposed to an internet of things that was traditionally not under its control. The industry has come to the realization that such data has a profound impact on the way businesses are, and will be, run in the future.

The ability to capture customer sentiments, desires, feelings, product feedback and intentions in real time, as they happen, and to influence the next business action or decision, is going to provide the competitive advantage that has the potential to make or break product brands and improve our lifestyles through real-time, up-to-date decision support systems. Some examples:

  • Capturing a customer segment's negative sentiment and take prompt decisions to take corrective action
  • Predicting customer movements e.g. commenting on making a move from one mobile carrier to another based on bad experiences
  • Providing location-based product offers, e.g. offering $2 off a Subway sandwich to a customer driving by a Subway store
  • Informing rush hour travelers on the optimum route to take to their destination based on real-time data feeds from traffic surveillance cameras
  • ...
The list is endless, and each industry can come up with its own list of untapped potential.

This non-traditional data does not follow the norms of database structures and designs; it is typically semi-structured textual data from social networks like Facebook, Twitter and LinkedIn, or unstructured data from audio and video feeds.
The popular belief is that the combination of semi-structured and unstructured data sets forms around 80% of the world's current data. Enterprises have realized that their business decisions have traditionally been based on only 20% of the data (the structured forms); the remaining 80% is untapped!

This four-fold increase in data - its sheer Volume and Variety across the entire gamut of semi-structured and unstructured data - is going to be a force to reckon with. Throw in the fact that the rate at which the non-traditional data is created is staggering and uncontrolled, i.e. its Velocity has not been dealt with before. The IT industry has come to the realization that, based on the sheer Volume, Variety and Velocity of the untapped data, this is something very Big - a phenomenon that our traditional technologies and infrastructure were incapable of handling. Big, in this context, means beyond current comprehension: data so big that we have never seen it before, nor had a need to handle and process it. That is the genesis of Big Data!

Technologies are catching up with the 3 V's, and we have started to realize that what was Big Data a year or two back may not be that big anymore, i.e. we are now capable of handling it. It is important to understand that the term Big Data is temporal: what is big today may not, and will not, be 'big' tomorrow. Another important concept has unfolded recently. Although we have come to terms with the volume, variety and velocity of the data, with this huge influx enterprises are faced with yet another challenge: how do I know that the data I am gathering from non-conventional sources is indeed authentic and truthful? How do I believe in the Veracity of the data? So add yet another V, and we get the 4 V's of Big Data: Volume, Variety, Velocity and Veracity.

I strongly feel that the focus going forward is not about whether the data is big, medium or small; it is about what we can do with the data - how we can empower the business with the next best decision, and with a level of confidence that lets executives decide with conviction. Think about it! Let's talk about it another day.

Friday, August 31, 2012

Hadoop Usage Patterns

There has been plenty of talk about how exactly to use Hadoop in an enterprise. Many companies have jumped on the Big Data bandwagon, their first step typically being a Hadoop installation. The more serious players have started to realize some of the real value of Hadoop beyond the usual hype and have put their minds to using it to their business advantage. This more serious usage has resulted in the emergence of a few patterns. Although it is too early to stamp the 'best practice' seal of authority on them, these emerging usage patterns have merits that warrant some discussion.

The following are some of the emerging themes and patterns of Hadoop usage:

  • As a data dumping ground - Enterprises today, who have not had their current state analytics leverage the entire gamut of data set, primarily the unstructured type, are provisioning all their ingested data from various unstructured and semi-structured sources into a Hadoop file system. Storage is not too much of an issue as commodity hardware is becoming the usage norm to provision very large data sets. Such large data sets once provisioned are made available for any subsequent analytics to be performed on the same.
  • High speed processing - Traditional data warehouses were not built to support data analysis or querying on data volumes that break the terabyte or petabyte barriers. The SQL queries would still work, but the time taken to return result sets would typically be inordinately large (e.g. on the order of days). Although data warehouse technologies are catching up, one has to acknowledge that the fundamental design premises of the data warehouse were not optimized for such ultra-large data sets. Data in a Hadoop file system can be processed at very high speeds. The MapReduce programming model enables programmers to write massively parallel processing logic which makes the same queries, which took days to return results, execute in a matter of minutes or hours. Such a multiple-order-of-magnitude improvement cannot go unnoticed, and enterprises are applying this usage pattern much more regularly and consistently.
  • Storage of only the relevant unstructured data elements - A lot of enterprises have already started to ingest data from non-conventional (read: non-structured) data sources. However, they are aware that not all ingested data is relevant to business decision making. In such scenarios, enterprises are deploying stream computing as a pre-processing step before data is stored in a Hadoop file system. Data filtering algorithms are applied to the real-time data ingest; these algorithms work on the deconstructed data sets and filter out the data elements which bear no importance to any analytical processing for the enterprise. Only the data elements which pass the filters are provisioned into Hadoop. Smart enterprises are keeping their storage, data management and maintenance costs down by employing such real-time data filtering technologies.
  • Perform data analytics on the entire data set - Assuming the most relevant data elements have all their data points stored in Hadoop, enterprises are now running analytics on top of this huge data set. It is quite natural that such analytics (on data volumes that were simply not available before) yields more insight into patterns which were hitherto unknown or untapped. Patterns are also evolving around how further insights can be developed when such unstructured data elements are correlated with the structured data already residing in the data warehouses. Once the relevant unstructured data elements are identified by employing sophisticated mathematical and statistical models, they are further processed (cleansed, quality-checked, etc.) and then stored in the data warehouses. This pattern has the advantage of adding only those data elements to the data warehouse which enable enterprises to develop better, more robust predictive models from the transactional records in the warehouse.
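To make the high-speed-processing pattern above concrete, here is a minimal, purely illustrative sketch of the MapReduce model that Hadoop parallelizes across a cluster. The example counts word frequencies in log lines (the canonical MapReduce example); the in-process "shuffle" stands in for what Hadoop does across many machines, and all names here are my own, not any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit (key, 1) pairs for each word in an input record.
    for word in line.lower().split():
        yield (word, 1)

def reducer(key, values):
    # Reduce phase: aggregate all values for a key into one result.
    return (key, sum(values))

def map_reduce(records):
    # Apply the mapper to every record.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle phase: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # One reducer call per distinct key.
    return dict(reducer(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

counts = map_reduce(["error disk full", "error network down", "ok"])
print(counts["error"])  # → 2
```

In a real cluster the mapper and reducer run on many nodes in parallel over HDFS blocks, which is where the minutes-instead-of-days speedup comes from.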

Other usage patterns for Hadoop will emerge, and some will be hardened by repeatable success into best practices. Till then, we continue to innovate a multitude of ways to get the best use out of Hadoop.

I encourage you to add to this repertoire of Hadoop usage patterns and let us create a compendium for the usage of Hadoop for Big Data Analytics!

Thursday, August 9, 2012

Stream Computing (Streams) versus Complex Event Processing (CEP)

There is a general notion among IT professionals that stream computing (a.k.a. Streams) is just another buzz term for traditional complex event processing (CEP). Although there are conceptual similarities between Streams and CEP, and acknowledging that both fall under the analytical discipline of 'Continuous Intelligence', there are a few fundamental differences which put them in different leagues.

CEP is primarily used for analysis/analytics on discrete business events. Events are correlated in time using simple IF/THEN/ELSE logic. The events need not be of a single type or category, but the data encapsulated in them is primarily structured in form. Common CEP engines support modest data rates of around 10K messages/second, with latency typically in the 'seconds' range. Maximum processing rates can scale up to around 100K events/second.

Streams, on the other hand, is designed to handle processing rates an order of magnitude higher than CEP: millions of events per second, with built-in linear scalability. Streams data sources are typically of a single event type, e.g. camera feeds from traffic signals, or sensor data generated by a pipeline or medical device. Streams is designed to handle the full gamut of unstructured data and, in contrast to the IF/THEN/ELSE-based logic of CEP, is capable of performing advanced analytics on the data set. Examples of such advanced analytics are limited only by the power of the mathematical and statistical models: Fast Fourier Transforms, the Holt-Winters algorithm and time series analysis algorithms are some real-world examples.
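The contrast can be sketched in a few lines of code. The CEP rule below is a stand-in for discrete-event IF/THEN correlation, while the stream analytic maintains state over a window and applies a statistical model to a single high-rate feed. Both are hypothetical toy examples of my own making, not the API of any CEP or Streams product.

```python
from collections import deque
from statistics import mean, pstdev

# CEP style: a discrete business event evaluated with IF/THEN logic.
def cep_rule(event):
    if event["type"] == "withdrawal" and event["amount"] > 10_000:
        return "ALERT"
    return "OK"

# Streams style: continuous analytics over a single-type sensor feed,
# flagging readings more than 3 sigma away from a rolling baseline.
class StreamAnalytic:
    def __init__(self, window=50):
        self.window = deque(maxlen=window)

    def score(self, reading):
        anomaly = False
        if len(self.window) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.window), pstdev(self.window)
            anomaly = sigma > 0 and abs(reading - mu) > 3 * sigma
        self.window.append(reading)
        return anomaly
```

The CEP rule fires per event with no memory; the stream analytic carries statistical state across the feed, which is the essential difference described above (and real Streams operators would run far richer models than a rolling z-score).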

To summarize: although both Streams and CEP fall under the category of 'Continuous Intelligence', keep the following image in mind whenever one of your colleagues engages in the discussion:


Saturday, July 7, 2012

Combining Data At-Rest Analytics with Data In-Motion Analytics

Take a look at this short video first.

The video provides a sneak peek at the 'Art of the Possible' - how traditional analytics based on existing data in data warehouses and data marts can be combined with real-time analytics based on streaming data feeds to develop a closed-loop, continuously improving feedback system.

The structured data residing in traditional data warehouses and marts accounts for only ~20% of the world's data. The remaining 80% is the world of unstructured, ambiguous, naturally unrelated data.

The fundamental premise of combining at-rest analytics with that of in-motion analytics is the following:
1. Leverage the wealth of existing data to develop statistical models which can be used to detect patterns on unknown data as well as predict future outcomes with a high degree of confidence and certainty.
2. Deploy such parametrized models to a stream computing environment where data comes in real time and is primarily unstructured i.e. textual data, video, audio and any other form of unstructured data feeds.
3. Allow the real time data, typically as single records or small data sets captured in short time windows, to be fed as parameters to the predictive models.
4. Allow the models to track the real time data feeds and provide real time predictions.
5. If the models cannot detect patterns and the proportion of 'unknowns' rises over a certain threshold, then trigger a mechanism to recalibrate the original statistical model.
6. The recalibrated statistical models will ideally leverage not only the existing data in the warehouses and marts, but also more current data and other relevant, related data from other sources. The expectation is that the model will be able to predict more events and detect more patterns.
7. Deploy the recalibrated model back into the streaming computing environment and expect the models to detect more events that are happening in real time.
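The loop above (particularly steps 3-5 and 7) can be sketched as follows. This is a hypothetical skeleton of my own design, not the actual implementation behind the video: the "model" is a trivial stand-in for a real statistical model, and the threshold and window sizes are arbitrary.

```python
class DeployedModel:
    """Stand-in for a parametrized statistical model (step 2)."""
    def __init__(self, known_patterns):
        self.known = set(known_patterns)

    def predict(self, record):
        # Returns (label, confident?) — a proxy for a real scoring call.
        label = record.get("pattern")
        return label, label in self.known

class ClosedLoop:
    """Scores real-time records and triggers recalibration (steps 3-5, 7)."""
    def __init__(self, model, unknown_threshold=0.3, window=100):
        self.model = model
        self.threshold = unknown_threshold
        self.window = window
        self.seen = self.unknown = 0
        self.recalibrations = 0

    def process(self, record):
        label, confident = self.model.predict(record)
        self.seen += 1
        self.unknown += 0 if confident else 1
        # Evaluate the 'unknowns' ratio over each window of records.
        if self.seen >= self.window:
            if self.unknown / self.seen > self.threshold:
                self.recalibrate()
            self.seen = self.unknown = 0
        return label

    def recalibrate(self):
        # In practice: retrain on warehouse + fresher data (step 6)
        # and redeploy into the streaming environment (step 7).
        self.recalibrations += 1
```

The key design point is that the streaming side only counts low-confidence predictions; the expensive retraining happens offline, against the data at rest, and the refreshed model is pushed back into the stream.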

In a subsequent blog, when I find some time, I will explain how what you saw in the video was implemented using a set of products and techniques.

Stay tuned!

And, as usual, drop me a note with any questions or topics on Big Data that you want to discuss.

Saturday, June 9, 2012

Real Time Analytics in Smart Grids - The Art of the Possible

The next generation of forward-thinking utility companies will differentiate themselves from their competition when they create an infrastructure that enables them to obtain what I call the 'two-second advantage'. Such an infrastructure depends on taking the base components - the Distribution Management System, Outage Management System, Energy Management System, Power Generation and Mobile Workforce Management System, to name a few - and making them adaptive to real-time changes across the end-to-end value chain. This responsiveness to changes, made effectual in real time and on demand, is where I feel utility companies will differentiate themselves.
Let me walk through a step-by-step example of how analytics based on both information at rest and information in motion can be combined to provide a compelling value proposition:
  1. Using historical data (data at rest) from the consumer premises, gathered through the Meter Data Management systems (smart meters, concentrators, etc.), one can create predictive models of load usage for each region, territory and other geographic spread. Predictive models can also be developed for usage profiles on an hourly (or any other granular time unit) basis, at a per-consumer level or at any higher level of aggregation.
  2. Using historical data from power generation units, develop predictive models of generation trends and forecasts for each generation unit.
  3. Using historical data, develop predictive models of the geo-areas which are more likely to be outage prone. Based on these, develop predictive models of how the Mobile Workforce Management system should distribute or locate responders and dispatches.
  4. Using historical data from Asset Management systems, develop predictive models of which assets will require maintenance at given times. Such models form the analytical basis of a Condition Based Maintenance (CBM) plan.
Points 1, 2, 3 and 4 above are just a few examples of how predictive models, built on historical data, can be used to develop optimal Distribution Management System (DMS), Demand Forecasting, CBM and other plans.
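As a flavor of what the first step might look like, here is a deliberately simple sketch of deriving an hourly load-usage profile per region from historical smart-meter readings. The field names and data shapes are my own assumptions, not any Meter Data Management product's schema, and a real predictive model (regression, time series) would go well beyond these averages.

```python
from collections import defaultdict

def hourly_profile(readings):
    """readings: iterable of (region, hour_of_day, kwh) tuples.
    Returns {region: {hour: average_kwh}} — a simple historical
    baseline that a real predictive model would refine."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for region, hour, kwh in readings:
        sums[region][hour] += kwh
        counts[region][hour] += 1
    return {r: {h: sums[r][h] / counts[r][h] for h in sums[r]}
            for r in sums}

# Toy data: two evening readings and one overnight reading.
profile = hourly_profile([
    ("north", 18, 2.0), ("north", 18, 4.0), ("north", 3, 0.5),
])
print(profile["north"][18])  # → 3.0
```

The same aggregation idea extends to per-consumer profiles or any higher level of aggregation simply by changing the grouping key.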

Now, the above are all good first steps in empowering the Smart Grid with predictive analytics. The differentiated capability for the next-generation Smart Grid comes from the next level of evolution, wherein real-time streaming information (information in motion) is overlaid on the parametric predictive models described above, enabling dynamic adaptation of those models and their real-time redeployment so that immediate, real-time actions can be taken. Such actions will empower a utility company to gain a significant advantage over its competitors and enable a truly "Smart" grid framework for society. And the technology to make this happen exists today.

Let me cite an example of how real-time stream computing can facilitate what I just described. Real-time streaming data from consumer smart meters, flowing through the adaptive analytical engine, can detect intraday shifts in usage patterns at households. Streaming weather-forecast data can also predict hotter or colder intraday climates which, along with streaming data from the concentrators, can dynamically recalibrate the static predictive models (see points 1-4 above) to create real-time intraday demand forecasts. These dynamically recalibrated load-usage predictions then become a dynamic input to the generation unit, which can reroute power in real time based on actual intraday consumer usage. And this is just a start! In a related scenario, real-time streaming feeds from weather reports, social networking data and other sources can be correlated with data from the "Alternate Source" generation units to get a real-time intraday profile of power generation. That information can be used to recalibrate the distribution models at the generation units, dynamically finding pockets of regions where real-time load levels differ from predicted ones, so that the models can work in conjunction, in real time, to conserve and reroute power. And think about how such real-time, sub-millisecond dynamism can assist energy trading! In yet another scenario, real-time streaming textual data from social sites, capturing sentiment and attitudinal signals, can provide an on-the-fly read of customer sentiment to feed the Customer Insight and Support system, so that proactive measures can be taken to safeguard brand image and drive higher customer satisfaction.
And lastly, advanced Condition Based Maintenance is a current reality: by mashing up real-time data from asset sensors with advanced business rules, the component life of critical and costly assets can be extended.
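The Condition Based Maintenance idea reduces to something like the following sketch: instead of fixed-calendar servicing, a rule watches streaming sensor readings and raises a maintenance flag only when an asset shows sustained stress. The threshold, window size and field names here are illustrative assumptions, not any asset-management product's rules.

```python
from collections import deque

class CbmMonitor:
    """Flags maintenance when vibration stays above a limit for
    several consecutive readings (a simple business rule)."""
    def __init__(self, vib_limit=7.0, sustained=5):
        self.vib_limit = vib_limit   # vibration threshold (mm/s), assumed
        self.sustained = sustained   # consecutive breaches required
        self.recent = deque(maxlen=sustained)

    def ingest(self, vibration_mm_s):
        """Feed one sensor reading; return True when maintenance is due."""
        self.recent.append(vibration_mm_s)
        return (len(self.recent) == self.sustained
                and all(v > self.vib_limit for v in self.recent))

mon = CbmMonitor()
readings = [3.0, 8.1, 8.2, 7.9, 8.5, 8.0, 8.3]
alerts = [mon.ingest(v) for v in readings]
print(alerts[-1])  # → True
```

Requiring a sustained breach rather than a single spike is what keeps a rule like this from generating a work order on every transient sensor blip.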

The power of real-time stream computing coupled with data at rest is where the real value of information analytics comes to the fore. This requires not only a futuristic look through the technology lens, but also a culture and attitude within the company that is responsive and adaptive to change.