Thursday, November 1, 2012

Analytic Governance

Any major discipline in IT which is considered a serious undertaking in an enterprise requires some form of IT governance to regulate its implementation. Each IT sub-discipline requires its own governance: data governance for data architecture, implementation and maintenance; integration governance for middleware architecture and infrastructure; SOA governance for service design, versioning and funding; and so on. Analytics is increasingly assuming its own space and place in the enterprise. The discipline of analytics ties the business and IT organizations very closely together; it is imperative that it does. The strategy to invest in the right areas of the business, where analytical capabilities and insight are required to gain a distinctive advantage, should be closely coupled with the data scientists who develop the predictive models and the IT system designers and implementers who orchestrate analytics-driven business processes and deploy the solutions to production.

The critical significance of the analytics discipline warrants a separate, dedicated and focused analytic governance model or framework. Analytic governance is expected to be in its infancy, considering that analytics itself is still maturing as more and more implementations are undertaken. The maturity of analytic governance is expected to closely follow the maturity of analytics itself.

Although the discipline is nowhere close to a stable state of maturity, the following are some areas, with rationale, where analytic governance processes and policies ought to be defined:

  • Data Reconciliation - The data sets used by data scientists to develop analytical and predictive models are typically extracts from the real sources, merged from various systems into a flattened, Excel-like format which is then used to develop the models. This data may not be the same as in the production systems where the models are ultimately deployed. A proper data reconciliation process needs to be established and followed to ensure that the models work on the right data so that correct predictions are generated.
  • Model Currency - Given the latency between model development and deployment, the data may have changed; as an example, more data types and/or different data sources may have been introduced. The currency of the model and its applicability at the time of deployment need to be assessed, and hence a well-engineered process for the same must be developed.
  • Analytics Sandbox - The data scientists require an analytic sandbox where they have the analytical tools and the data required to perform data mining and exploratory techniques and to identify the right algorithms and models. Such activities often require data- and compute-intensive executions. Proper compute capacity must be dedicated to such computations while ensuring that these workloads do not affect the transactional systems. Workload planning, guidelines, infrastructure and best practices must be developed and implemented.
  • Business Rules Vitality - The applicability of outcomes from predictive models is contingent upon the regulatory requirements, business policies and mandates, etc., which contextualize the correct application of model outputs in the context of a business process. Business rules are formulated and codified to bring model outputs into the enterprise business processes. Such business rules need to be revisited for validity and conformity on a periodic basis to ensure that any regulatory changes or internal business policies are appropriately enforced.
  • Model Deployment - A proper process must be designed and followed for deploying newer versions of models into production. Model versioning assumes significance when multiple versions of various models are put into production to test for best fit. Guidelines on how to reduce the latency between model development and deployment must be developed, and commensurate IT infrastructure must be put in place to support such capabilities.
  • Communication - The use of predictive models to predict the outcome or suggest the Next Best Optimized Action will require a paradigm shift from the traditional human expertise and intuition driven 'sense and respond' mode of business operations to one which adopts a real-time, fact-driven 'predict and act' modus operandi. This cultural shift is going to be the hardest one to address as it requires humans to start thinking and behaving differently - to start relying on system predictions more than their own judgement! Unless the value of predictive models is socialized adequately right from the very onset, their adoption will pose significant cultural and adaptability challenges. A proper education and communication plan needs to be devised and followed.
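
The data-reconciliation idea above can be sketched in a few lines. The following is a hypothetical Python check (the function and column names are illustrative assumptions, not from any product) that compares the schema of the model-development extract against what production actually supplies, before a model is promoted:

```python
def reconcile_schemas(training_columns, production_columns):
    """Compare the columns a model was trained on against what
    production actually supplies, before the model is deployed."""
    training = set(training_columns)
    production = set(production_columns)
    return {
        "missing_in_production": sorted(training - production),
        "new_in_production": sorted(production - training),
        "ok": training <= production,
    }

# Hypothetical example: the training extract had a merged field
# that the production system does not provide.
report = reconcile_schemas(
    ["customer_id", "age", "avg_spend", "churn_flag"],
    ["customer_id", "age", "churn_flag", "last_login"],
)
print(report["ok"])                     # False
print(report["missing_in_production"])  # ['avg_spend']
```

A governance process would run a check of this shape as a gate in the deployment pipeline, failing the promotion when the production feed cannot supply what the model expects.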

I cannot stress enough that analytics and analytic governance are still in their infancy, and both disciplines have a long way to go before they can be etched in stone. Nonetheless, my effort in this post has been to raise awareness among enterprises that analytic governance is fast becoming a growing imperative.

Sunday, October 28, 2012

Big Data - An Appliance Approach

The growth of data has been exponential and there is no respite on the near horizon. The IT industry, in an effort to brace itself for the information deluge, has had to devise multiple lines of attack at approaching the problem. One such approach is the use of purpose-built appliances geared towards Massively Parallel Processing (MPP). MPP is driven by the principle of a shared-nothing architecture, which postulates that multiple dedicated banks of [processor, disk, network, memory] arranged in ways to take advantage of parallel processing algorithms can provide a realistic means of addressing analytical processing of massive data sets. One such product, offered by IBM, is IBM Netezza, a very smart IBM acquisition from November 2010.

The architecture of IBM Netezza is very interesting, and it is what I am going to share today.

IBM Netezza is a data warehouse. A major part of IBM Netezza data warehouse appliance’s performance advantage comes from its unique Asymmetric Massively Parallel Processing (AMPP) architecture, which combines a Symmetric Multi-Processing (SMP) front-end with a shared-nothing Massively Parallel Processing (MPP) back-end for query processing.

Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. Up to a thousand of these customized MPP streams can work together to “divide and conquer” the workload.

Referring to figure above, the following is a brief description of the architectural building blocks of AMPP:
Hosts - The SMP hosts are high-performance IBM servers running Linux that are set up in an active-passive configuration for high-availability. The active host presents a standardized interface to external tools and applications. It compiles SQL queries into executable code segments called snippets, creates optimized query plans and distributes the snippets to the MPP nodes for execution.

Snippet Blades (S-Blades) - S-Blades are intelligent processing nodes that make up the MPP engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.

Disk enclosures - The disk enclosures contain high-density, high-performance storage disks that are RAID protected. Each disk contains a slice of the data in a database table. The disk enclosures are connected to the S-Blades via high-speed interconnects that allow all the disks in the appliance to simultaneously stream data to the S-Blades at the maximum rate possible.

Network fabric - All system components are connected via a high-speed network fabric. IBM Netezza runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.

Inside the S-Blade
The extreme performance happens at the core of the IBM Netezza appliance: the S-Blade which contains a FAST Engine.

A dedicated high-speed interconnect from the storage array allows data to be delivered to memory as quickly as it can stream off the disk. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines running in parallel inside the FPGAs uncompress and filter out 95-98 percent of table data at physics speed, keeping only the data that is relevant to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors running in an IBM Netezza data warehouse appliance.

The FPGA is a critical enabler of the price-performance advantages of IBM Netezza data warehouse appliances. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.

The FAST Engine includes:
Compress engine - uncompresses data at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse - the hard disk.

Project and Restrict engines - further enhance performance by filtering out columns and rows respectively, based on the parameters in the SELECT and WHERE clauses of an SQL query.

Visibility engine - plays a critical role in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. It filters out rows that should not be “seen” by a query; e.g. rows belonging to a transaction that is not yet committed.

How a query is optimized
Let us take an example of how IBM Netezza optimizes the queries in its core query processing engine.

Referring to the figure above, one can think of the way that data streaming works in the IBM Netezza as similar to an assembly line. The IBM Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operate concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases the performance relative to other architectures.

Compressed data gets streamed from disk onto the assembly line at the fastest rate that the physics of the disk would allow. The data could also be cached, in which case it gets served right from memory instead of disk.

Here are the high level steps which are followed:
1. The first stage in the assembly line, the Compress Engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse—the disk.
2. The disk block is then passed on to the Project Engine or stage, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed.
3. The assembly line then moves the data block to the Restrict Engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause.
4. The Visibility Engine also feeds in additional parameters to the Restrict engine, to filter out rows that should not be “seen” by a query e.g. rows belonging to a transaction that is not committed yet.
5. The Processor Core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. It finally assembles all the intermediate results together from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
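
To make the five-stage flow above concrete, here is a toy Python rendition of the assembly line. It is purely illustrative - the real engines run in FPGA and CPU hardware, and the query, table and column names are made up. The sketch also applies the restriction before the projection so the transaction id is still available to the visibility check:

```python
import json
import zlib

def uncompress(block):                 # 1. Compress Engine
    return json.loads(zlib.decompress(block))

def project(rows, columns):            # 2. Project Engine: keep SELECT columns
    return [{c: r[c] for c in columns} for r in rows]

def restrict(rows, predicate):         # 3. Restrict Engine: apply the WHERE clause
    return [r for r in rows if predicate(r)]

def visible(rows, committed):          # 4. Visibility Engine: hide uncommitted rows
    return [r for r in rows if r.pop("txn") in committed]

def process(rows, key):                # 5. Processor core: aggregate
    return sum(r[key] for r in rows)

# Sketch of: SELECT SUM(amount) FROM sales WHERE region = 'EU'
raw = zlib.compress(json.dumps([
    {"region": "EU", "amount": 10, "txn": 1},
    {"region": "US", "amount": 99, "txn": 1},
    {"region": "EU", "amount": 5,  "txn": 2},   # txn 2 not yet committed
]).encode())

rows = uncompress(raw)
rows = restrict(rows, lambda r: r["region"] == "EU")
rows = visible(rows, committed={1})
rows = project(rows, ["amount"])
print(process(rows, "amount"))         # prints 10
```

Each stage consumes the output of the previous one, which is exactly the assembly-line concurrency the appliance exploits: while one block is being aggregated, the next is already being uncompressed and filtered.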

I hope this gives you a decent overview of the built-in architectural smarts of IBM Netezza which make it one of the strongest data warehouse and analytical appliances in the industry.


Thursday, October 18, 2012

The Optimization Edge - Using Big Data Analytics

A Google search on Big Data these days will return an overwhelming amount of content, possibly too much to consume. There are many valuable treatments demonstrating the value of Big Data technologies and solutions. And I remain a student, trying to pick up as much as my few gray cells will accommodate! :-)

In this post I want to share a paradigm shift which I have been noticing.

The traditional approach to analytics has been primarily based on building applications to automate business processes. The information generated from these applications across the enterprise is consolidated in data warehouses for historical analysis of specific domains and business subject areas. Such analytical efforts have been leveraged over time and have enabled companies to macro-optimize parts of their enterprise operations. Production planning, capital investments, marketing strategy and budgeting are some examples of organizational strategies which take advantage of analytics - all of which helps deliver macro optimization of enterprise operations.

The approach that underpins traditional macro optimization is still rooted in a human-expertise- and intuition-driven mode of business operations. I believe the holy grail of competitive advantage lies in the mindset, ability and strategy to break away from this system-enabled, human-driven approach to macro optimization, towards one in which the next generation of efficiencies is achieved by providing precise, contextual analytics at the point of business impact - thereby adopting a more 'predict and act' modus operandi.

The ability to provide and inject analytics at the point of business impact enables a more real-time, fact-driven mode of business operations, which attempts to continuously develop the 'Next Best Action' that minimizes risk and maximizes opportunity. This paradigm assumes that such analytics is performed on the data much closer to the space and time where it is generated, as opposed to the traditional method of provisioning the data first before executing any analytics on it. This is a paradigm shift from the traditional approach and attempts to provide micro optimization of enterprise business operations. The "micro" qualifier here refers both to the latency between data being generated and its transformation into insightful information, and to the smaller scales at which optimization can affect the business - it need not always be at scales as big as capital investment or next year's marketing strategy.

With Big Data tools, techniques, technologies and solutions, one cannot help but relate their potential to the "Art of the Possible". Let me leave you with a few scenarios.
Imagine if ...:

  • We could predict the onset of fatal diseases in premature babies, well before they are hit by them, and save their lives!
  • Leverage social information to identify customer churn and take immediate remedial action!
  • Offer and send discount coupons to a customer, displayed on her car dashboard screen or her smartphone, as she drives by an area where a shop exists!
  • Detect the imminent failure of a costly machine part well ahead of the next scheduled maintenance window and proactively fix or repair it, saving costly production downtime!

Yes, I started by saying "Imagine if ...". But let me tell you that current-day technologies, tools and capabilities are shrinking the time warp, making this futuristic "Art of the Possible" real today - right here, right now!

To summarize, Big Data technologies and solutions enable analytics to assist in business transformation through iterative micro optimization of the enterprise value chain. And this is made possible by developing and providing a platform for fact-driven, real-time, optimized decision systems.

P.S.: The scenarios imagined above are all implemented today as real-time solutions, enabling our connected enterprises to be micro-optimized at every possible opportunity.


Monday, October 1, 2012

Agile Big Data Analytics: How to start a Big Data Analytics project

I was speaking at a client briefing today when a question was asked: How do you start a big data analytics project? Do you first try to get the veracity and quality of all the data correct before you embark on such an initiative? Or are there alternative ways of kicking off such an initiative?

It was an interesting question and one I am sure many companies are faced with regardless of whether they explicitly accept it or not.

It is important to understand that every company has good data, and not to crib about data being unavailable or unusable. Indeed there is opportunity to increase the data set and the data veracity, but waiting for that to happen would only delay your analytics initiative further and further - not a good idea. Here is an option to consider, with a few steps to implement it:

  1. Use an unsupervised machine learning algorithm - a technique which works on the available data set, disregarding any data quality issues to begin with. An unsupervised machine learning algorithm clusters the data into different categories or classes based on the information that is available. These clusters, classes or categories may be analyzed to reveal the insight and information inherent in the data. Remember that the data set may be data in its raw or native form in schema-less (think Hadoop) storage, in transactional systems or data warehouses, or a combination of these.
  2. Present the initial findings - Polish the analytical output, make it presentable in human-understandable, business-centric lingo, and present it to the stakeholders. Allow them to chew on the findings as you explain them. Remember, the good thing here is that you take no blame for your ability or inability to extract the insight the client is looking for; it is their data, with no quality profiling or transformations!
  3. Gather feedback from client - The feedback would most probably come back as a combination of the following:
    • Pleasantly surprised with the initial insight and its value.
    • The insight is not complete and requires further tuning.
  4. Perform data analysis - At this point, given the type of analytical insight being sought, it is time to look more closely at the data. The completeness and quality of the data require analysis to determine which subsets may have issues (quality, completeness, among others) that need to be fixed. This is when the proper data profiling and transformations will need to be put in place to go after the insight that is expected.

  5. Add more data sources - Depending on the nature of the insight being sought, the initial analysis and assessment may show that a different data source or data set is required. I'd recommend not performing a full-scale data profiling exercise on the new data set; just bring it in! However, focus your data quality efforts on the original data sets.
I hope you can detect the pattern here. The idea is to bring the data in, apply analytics to it at the onset, and derive as much value as the raw data is able to provide. Analyze the outcome of the initial analytics, and then introduce the necessary data quality, profiling and transformations to the portions that require them. Repeat this process in an iterative manner till the results are as expected.
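
Step 1 above can be sketched as follows. This is a minimal k-means written from scratch (the customer data and cluster count are invented for illustration); in practice a library implementation would do the job, but the point is that it runs on the raw extract with no profiling pass first:

```python
import numpy as np

def kmeans(data, k=2, iters=20, seed=0):
    """A tiny k-means: cluster raw rows with no cleansing pass first."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each row to its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its rows (keep it if empty).
        centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Hypothetical raw extract: [monthly_spend, support_calls] per customer,
# used as-is, with no profiling or quality fixes applied first.
data = np.array([[20., 1], [22., 0], [25., 2],
                 [90., 8], [95., 7], [88., 9]])
labels, centers = kmeans(data)
print(labels)   # two natural groups emerge from the raw data
print(centers)  # cluster profiles to walk the stakeholders through
```

The cluster centers are exactly the kind of initial finding step 2 asks you to polish and present: two customer profiles surfaced from the data as it stands today.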

The advantage of this approach is that it allows you to start very quickly on an analytics solution and then progressively iterate over it till the commensurate level of prediction accuracy is obtained.

Some may call this approach as Agile Analytic Development. I am OK with that. Ask me before using that term though. I may patent it! :-)

Tuesday, September 25, 2012

Difference between Descriptive, Predictive and Prescriptive Analytics

Analytics as a discipline has matured beyond its age, propelled by the new era in which the human-machine network is more Instrumented, Interconnected and Intelligent. This new era has made data more accessible than ever before, which can be leveraged by the science of analytics not only to increase the accuracy of future predictions but also to up the ante one level and start optimizing for the best outcome from a set of predicted possibilities.

Analytics has a maturity curve, or rather a roadmap, which starts from Descriptive Analytics, works its way up to Predictive Analytics, and ultimately reaches Prescriptive Analytics.

Descriptive Analytics, often called after-the-fact analytics, reports on what happened and the frequency of occurrence of a certain event or action, and provides drill-down capabilities to get to the root cause of a problem. It provides various reporting views based on user roles: summary views for executive dashboards, metric views for mid-level managers, and drill-down root cause analysis details for engineers and domain experts. Descriptive Analytics is rooted in what is known as traditional BI reporting.

Predictive Analytics focuses on simulating what could happen in the future, given the conditions of the recent past, and forecasting the next possible events if the current trend continues for a given period of time. Predictive Analytics is rooted in building supervised and unsupervised machine learning algorithms and models.

Prescriptive Analytics builds on top of Predictive Analytics and focuses on evaluating the various possible outcomes from predictive models and coming up with the best possible outcome by employing optimization algorithms. Such algorithms are also capable of factoring in the effects of variability. Prescriptive Analytics leverages stochastic optimization algorithms and models.
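
As a miniature illustration of the three levels, assuming made-up demand numbers and cost parameters, the following sketch summarizes what happened (descriptive), fits a trend to forecast the next month (predictive), and then searches for the stocking level that maximizes profit under that forecast (prescriptive):

```python
# Toy monthly demand figures - illustrative numbers only.
demand = [100, 110, 125, 135, 150]               # units sold in months 1-5

# Descriptive: what happened?
print("average demand:", sum(demand) / len(demand))     # 124.0

# Predictive: least-squares trend line to forecast month 6.
n = len(demand)
xs = range(1, n + 1)
x_mean, y_mean = sum(xs) / n, sum(demand) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, demand))
         / sum((x - x_mean) ** 2 for x in xs))
forecast = y_mean + slope * (6 - x_mean)
print("forecast for month 6:", forecast)                # 161.5

# Prescriptive: pick the stocking level that maximizes expected profit
# under the forecast (assumed unit margin 5, holding cost 2 per unsold unit).
def profit(stock):
    sold = min(stock, forecast)
    return 5 * sold - 2 * max(0, stock - forecast)

best = max(range(100, 201), key=profit)
print("recommended stock level:", best)
```

A real prescriptive system would use stochastic optimization over many predicted scenarios rather than a single point forecast, but the progression - report, predict, then choose the best action - is the same.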

It is imperative to realize that there is no shortcut for any enterprise to achieve the highest maturity level in Analytics (i.e. Prescriptive Analytics) without developing a solid and sound foundation of descriptive analytics followed by predictive analytics.

Enterprises also need to realize that just being in the new era of the instrumented, interconnected and intelligent human-machine network does not give them a free ticket to accessing the data - the data that is required for analytics to be useful. A solid foundation of data access, with a key focus on ensuring the veracity and viscosity of the data are of superior quality, is the very first step to reaping the benefits of modern-day analytic processing.

Friday, September 14, 2012

Data Virtualization - Virtualize more than Consolidate

Data consolidation continues to be a persistent IT challenge, a source of constant frustration and IT spend. The days of full-time IT spend on continuous data consolidation against an ever-moving target of data sources and data types should be over. Well, even if "over" is too strong, at a minimum enterprises should be seriously considering the alternatives. This is where Data Virtualization comes to the party!

Data virtualization provides an abstraction layer with the necessary hooks to take a business-centric query and deconstruct it into a set of atomic queries. Each atomic query focuses on a subset of the data elements/types from the original business-centric query and determines which data source(s) to go against to retrieve the data. Each atomic query is executed by the Data Virtualization layer, and the returned data sets are then processed (joined) to form the final consolidated result set, which is returned as the result of the business-centric query. The mode of data return can be standard SQL, Web Services or any other format standard enough to be consumable by business and/or enterprise applications.
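
A minimal sketch of that deconstruct-execute-join flow, with two in-memory "sources" standing in for real databases (all names and values are illustrative):

```python
# Two independent "sources" a virtualization layer might federate.
crm = {101: {"name": "Alice"}, 102: {"name": "Bob"}}          # source A
billing = {101: {"balance": 250.0}, 102: {"balance": 40.0}}   # source B

def virtual_query(customer_ids):
    """Deconstruct a business query into one atomic query per source,
    execute each, then join the partial results by customer id."""
    names = {cid: crm[cid]["name"] for cid in customer_ids}            # atomic query 1
    balances = {cid: billing[cid]["balance"] for cid in customer_ids}  # atomic query 2
    return [  # join step: one consolidated row per customer
        {"id": cid, "name": names[cid], "balance": balances[cid]}
        for cid in customer_ids
    ]

print(virtual_query([101, 102]))
# [{'id': 101, 'name': 'Alice', 'balance': 250.0},
#  {'id': 102, 'name': 'Bob', 'balance': 40.0}]
```

A real product would generate the atomic queries from SQL and push work down to the sources; the sketch only shows the deconstruct-execute-join shape of the layer.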

The technology is real today. It is now only a matter of enterprises taking a close look at Data Virtualization and considering leveraging it as part of their overall enterprise data architecture strategy.

And yes, the technologies today can virtualize across both structured and unstructured data spread across databases and schema-less file systems.

Saturday, September 1, 2012

The Genesis of Big Data

Big Data, big data, big data! It is the hype that has taken the IT industry by storm. The term Big Data, formed from a combination of two of the simplest words - big and data - has had a profound impact. Enterprises are intrigued by Big Data, and all of them feel that there is something in it for them.

There is no doubt that data has grown, and grown in huge proportions. If you think about it from a different angle, this data was already there. What has changed is that technology now allows enterprises to get access to this huge ocean of data. The fundamental shift is that enterprises traditionally had access to the structured data sets which were primarily generated from business transactions and internal business process executions. Such data resided primarily in databases and data warehouses, where it was captured in a well-structured form. However, with the new era of social computing and of data feeds from a myriad of sources external to the enterprise, the enterprise is all of a sudden exposed to an internet of things that was traditionally not under its control. The industry has come to the realization that such data has a profound impact on the way businesses are, and will be, run in the future.

The ability to capture customer sentiments, desires, feelings, product feedback and intentions in real time, as they happen, and to influence the next business action or decision, is going to provide the competitive advantage which has the potential to make or break product brands and improve our lifestyle through real-time, up-to-date decision support systems. Some examples may be:

  • Capturing a customer segment's negative sentiment and taking prompt decisions for corrective action
  • Predicting customer movements e.g. commenting on making a move from one mobile carrier to another based on bad experiences
  • Providing location-based product offers e.g. offering a $2 off on a subway sandwich if she is driving by a Subway sandwich store
  • Informing rush hour travelers on the optimum route to take to their destination based on real-time data feeds from traffic surveillance cameras
  • ...
The list is endless, and each industry can come up with its own such list of untapped potential.

This non-traditional data does not follow the norms of database structures and designs; it is typically in the form of semi-structured textual data in social networks like Facebook, Twitter and LinkedIn, or unstructured data from audio and video feeds.

The popular belief is that the combination of semi-structured and unstructured data sets forms around 80% of the world's current data. Enterprises have realized that their business decisions have traditionally been developed based on only 20% of the data (the structured forms), and the remaining 80% is untapped!

This four-fold increase in data, its sheer Volume and Variety across the entire gamut of semi-structured and unstructured data, is going to be a force to reckon with. Throw in the fact that the rate at which non-traditional data is created is staggering and uncontrolled, i.e. its Velocity has not been dealt with before. The IT industry has come to the realization that, based on the sheer Volume, Variety and Velocity of the untapped data, it is something very Big - a phenomenon which our traditional technologies and infrastructure were incapable of handling. Big, in this context, means that it is beyond current comprehension: data so Big that we had neither seen it before nor had a need to handle and process it. That is the genesis of Big Data!

Technologies are catching up with the 3 V's, and we have started to realize that what was Big Data a year or two back may not be that big anymore, i.e. we are now capable of handling it. It is important to understand that the term Big Data is temporal: what is big today may not, and will not, be 'big' tomorrow. Another important concept has unfolded recently. Although we have come to terms with the volume, variety and velocity of the data, with this huge influx enterprises are faced with yet another challenge - how do I know that the data I am gathering from non-conventional sources is indeed authentic and truthful? How do I believe in the Veracity of the data? So, add yet another V, and we get the 4 V's of Big Data: Volume, Variety, Velocity, Veracity.

I strongly feel that the focus for us going forward is not about whether the data is big, medium or small; it is more about what we can do with the data. How can we empower the business with the next best decision, and with a level of confidence that allows executives to decide decisively? Think about it! Let's talk about it another day.