Sunday, October 28, 2012

Big Data - An Appliance Approach

The growth of data has been exponential, and there is no respite on the near horizon. The IT industry, in an effort to brace itself for the information deluge, has devised multiple lines of attack on the problem. One such approach is the use of purpose-built appliances geared towards Massively Parallel Processing (MPP). MPP is driven by the principle of a shared-nothing architecture, which postulates that multiple dedicated banks of processor, disk, network and memory, arranged to take advantage of parallel processing algorithms, can provide a realistic means of addressing analytical processing of massive data sets. One such product is offered by IBM and is called IBM Netezza, a very smart IBM acquisition completed in November 2010.

The architecture of IBM Netezza is very interesting, and it is what I am going to share today.

IBM Netezza is a data warehouse appliance. A major part of the IBM Netezza data warehouse appliance's performance advantage comes from its unique Asymmetric Massively Parallel Processing (AMPP) architecture, which combines a Symmetric Multi-Processing (SMP) front end with a shared-nothing Massively Parallel Processing (MPP) back end for query processing.

Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. Up to a thousand of these customized MPP streams can work together to “divide and conquer” the workload.


Referring to the figure above, the following is a brief description of the architectural building blocks of AMPP:
Hosts - The SMP hosts are high-performance IBM servers running Linux that are set up in an active-passive configuration for high availability. The active host presents a standardized interface to external tools and applications. It creates optimized query plans, compiles SQL queries into executable code segments called snippets, and distributes the snippets to the MPP nodes for execution (a toy sketch of this snippet idea appears after these building-block descriptions).

Snippet Blades (S-Blades) - S-Blades are intelligent processing nodes that make up the MPP engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.

Disk enclosures - The disk enclosures contain high-density, high-performance storage disks that are RAID protected. Each disk contains a slice of the data in a database table. The disk enclosures are connected to the S-Blades via high-speed interconnects that allow all the disks in the appliance to simultaneously stream data to the S-Blades at the maximum rate possible.

Network fabric - All system components are connected via a high-speed network fabric. IBM Netezza runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.
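To make the snippet idea from the Hosts description concrete, here is a minimal, purely illustrative sketch of a host compiling a query into snippets and handing the scan work to worker nodes. The names (Snippet, compile_to_snippets, distribute) are hypothetical and do not correspond to any real Netezza interface.

```python
# Illustrative sketch only: models the idea of the host compiling a query into
# snippets and handing the scan work to the MPP worker nodes. All names here
# (Snippet, compile_to_snippets, distribute) are hypothetical, not Netezza APIs.
from dataclasses import dataclass

@dataclass
class Snippet:
    snippet_id: int
    operation: str      # e.g. "scan+filter" or "aggregate"
    sql_fragment: str

def compile_to_snippets(sql):
    """Pretend compilation: split a query into a scan phase and a merge phase."""
    return [
        Snippet(1, "scan+filter", sql),                    # pushed to every S-Blade
        Snippet(2, "aggregate", "merge partial results"),  # final step on the host
    ]

def distribute(snippets, n_blades):
    """Give each S-Blade a copy of the scan snippet; the host keeps the merge."""
    scan_work = [s for s in snippets if s.operation == "scan+filter"]
    return {blade: list(scan_work) for blade in range(n_blades)}

plan = compile_to_snippets("SELECT region, SUM(sales) FROM orders GROUP BY region")
work = distribute(plan, n_blades=4)
print({blade: [s.operation for s in snips] for blade, snips in work.items()})
```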

Inside the S-Blade
The extreme performance happens at the core of the IBM Netezza appliance: the S-Blade, which contains the FAST Engines.

A dedicated high-speed interconnect from the storage array allows data to be delivered to memory as quickly as it can stream off the disk. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines running in parallel inside the FPGAs uncompress and filter out 95-98 percent of table data at physics speed, keeping only the data that is relevant to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors running in an IBM Netezza data warehouse appliance.

The FPGA is a critical enabler of the price-performance advantages of IBM Netezza data warehouse appliances. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.

The FAST Engine includes:
Compress Engine - uncompresses data at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse: the hard disk.

Project and Restrict Engines - further enhance performance by filtering out columns and rows, respectively, based on the parameters in the SELECT and WHERE clauses of an SQL query.

Visibility Engine - plays a critical role in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. It filters out rows that should not be "seen" by a query, e.g. rows belonging to a transaction that has not yet been committed (a toy illustration of these three filters follows).
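The sketch below is ordinary Python acting on in-memory rows, purely for illustration; in the appliance this work happens inside the FPGA engines, and the column names, transaction ids and predicate are invented.

```python
# Toy model of the Project, Restrict and Visibility engines acting on a row
# stream. In the appliance this runs inside the FPGA; here it is plain Python
# with invented columns and transaction ids, just to show what each stage keeps.
COMMITTED_TXNS = {101, 102}  # hypothetical set of committed transaction ids

rows = [
    {"txn_id": 101, "region": "EMEA", "sales": 1200, "notes": "..."},
    {"txn_id": 102, "region": "APAC", "sales": 300,  "notes": "..."},
    {"txn_id": 103, "region": "EMEA", "sales": 950,  "notes": "..."},  # uncommitted
]

def visibility(stream):
    """Drop rows belonging to transactions that are not yet committed."""
    return (r for r in stream if r["txn_id"] in COMMITTED_TXNS)

def restrict(stream, predicate):
    """Apply the WHERE clause: keep only rows matching the predicate."""
    return (r for r in stream if predicate(r))

def project(stream, columns):
    """Apply the SELECT list: keep only the requested columns."""
    return ({c: r[c] for c in columns} for r in stream)

# Roughly: SELECT region, sales FROM t WHERE sales > 500
result = project(restrict(visibility(rows), lambda r: r["sales"] > 500),
                 ["region", "sales"])
print(list(result))  # [{'region': 'EMEA', 'sales': 1200}]
```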


How a query is optimized
Let us take an example of how IBM Netezza optimizes queries in its core query processing engine.


Referring to the figure above, one can think of the way that data streaming works in the IBM Netezza appliance as similar to an assembly line. The IBM Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operates concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases the performance relative to other architectures.

Compressed data gets streamed from disk onto the assembly line at the fastest rate that the physics of the disk allows. The data could also be cached, in which case it gets served right from memory instead of disk.

Here are the high-level steps that are followed (a small sketch of the same pipeline appears after the list):
1. The first stage in the assembly line, the Compress Engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse—the disk.
2. The uncompressed block is then passed on to the Project Engine, the next stage, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed.
3. The assembly line then moves the data block to the Restrict Engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause.
4. The Visibility Engine also feeds additional parameters to the Restrict Engine, to filter out rows that should not be "seen" by a query, e.g. rows belonging to a transaction that has not yet been committed.
5. The Processor Core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. It finally assembles all the intermediate results together from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
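The sketch below chains these stages as Python generators so that blocks flow through decompression, filtering and aggregation one at a time, mimicking the assembly-line idea. It is only a conceptual model over invented data; the real stages run in FPGA engines and CPU cores, not in Python.

```python
# Conceptual "assembly line" sketch: compressed blocks stream through
# uncompress, filter and aggregate stages chained as Python generators, so each
# stage works on one block while the earlier stages fetch the next. The data,
# compression scheme and query are invented; the real stages run in FPGA
# engines and CPU cores, not in Python.
import json
import zlib

def read_blocks(table):
    """Stage 0: stream compressed blocks 'off disk'."""
    for block in table:
        yield block

def uncompress(blocks):
    """Stage 1: Compress Engine analogue - expand each block in memory."""
    for block in blocks:
        yield json.loads(zlib.decompress(block))

def project_and_restrict(blocks, predicate, columns):
    """Stages 2-4: keep only the needed columns and the qualifying rows."""
    for rows in blocks:
        yield [{c: r[c] for c in columns} for r in rows if predicate(r)]

def aggregate(blocks):
    """Stage 5: CPU-core analogue - fold the filtered blocks into one result."""
    return sum(r["sales"] for rows in blocks for r in rows)

# Build a tiny compressed "table" of two blocks, then run the pipeline.
raw = [[{"region": "EMEA", "sales": 100}, {"region": "APAC", "sales": 40}],
       [{"region": "EMEA", "sales": 60}]]
table = [zlib.compress(json.dumps(rows).encode("utf-8")) for rows in raw]

total = aggregate(project_and_restrict(uncompress(read_blocks(table)),
                                       lambda r: r["region"] == "EMEA",
                                       ["region", "sales"]))
print(total)  # 160
```

The point of the generator chain is that no stage waits for the whole table: each block is decompressed, filtered and folded into the running aggregate as soon as it arrives, which is the essence of the assembly line described above.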

I hope this gives you a decent overview of the built-in architectural smarts of IBM Netezza, which make it one of the strongest data warehouse and analytical appliances in the industry.


Thursday, October 18, 2012

The Optimization Edge - Using Big Data Analytics

A Google search on Big Data these days will return an overwhelming amount of content, possibly too much to consume. There are many valuable treatments demonstrating the value of Big Data technologies and solutions. And I remain a student, trying to pick up as much as my few gray cells will accommodate! :-)

In this post I want to highlight a paradigm shift that I have been noticing and that I wanted to share.

The traditional approach to analytics has been primarily based on building applications to automate business processes. The information generated by these applications across the enterprise is consolidated in data warehouses for historical analysis of specific domains and business subject areas. Such analytical efforts have been leveraged over time and have enabled companies to macro-optimize parts of their enterprise operations. Production planning, capital investments, marketing strategy and budgeting are some examples of organizational strategies that take advantage of analytics - all of which helps deliver macro optimization of enterprise operations.

The approach that underpins traditional macro optimization is still rooted in a mode of business operations driven by human expertise and intuition. I believe the holy grail of competitive advantage lies in the mindset, ability and strategy to break away from this system-enabled, human-driven macro optimization towards an approach in which the next generation of efficiencies is achieved by providing precise, contextual analytics at the point of business impact, thereby adopting a more predict-and-act modus operandi.

The ability to inject analytics at the point of business impact enables a more real-time, fact-driven mode of business operations, one which attempts to continuously develop the 'Next Best Action' that minimizes risk and maximizes opportunity. This paradigm assumes that analytics is performed much closer in space and time to where the data is generated, as opposed to the traditional method of provisioning the data first and then executing analytics on the data set. This is a paradigm shift from the traditional approach, and it attempts to provide micro optimization of enterprise business operations. The "micro" qualifier refers both to the reduced latency between data being generated and being transformed into insightful information, and to the smaller scales at which optimization can affect the business, i.e. it need not always be at scales as big as capital investment or next year's marketing strategy.
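As a purely illustrative toy, the following sketch shows one way a 'Next Best Action' decision rule might be expressed: score each candidate action by expected benefit minus weighted risk and pick the best. The actions, numbers and weighting are all invented.

```python
# Purely illustrative "next best action" scorer: rank candidate actions by
# expected benefit minus weighted risk and pick the best. Actions, numbers and
# the weighting are invented to show the decision rule, nothing more.
candidate_actions = {
    "offer_retention_discount": {"expected_benefit": 120.0, "risk": 30.0},
    "schedule_service_call":    {"expected_benefit": 80.0,  "risk": 10.0},
    "do_nothing":               {"expected_benefit": 0.0,   "risk": 0.0},
}

def next_best_action(actions, risk_weight=1.0):
    """Return the action with the highest benefit minus weighted risk."""
    return max(actions,
               key=lambda a: actions[a]["expected_benefit"]
                             - risk_weight * actions[a]["risk"])

print(next_best_action(candidate_actions))  # offer_retention_discount
```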

With Big Data tools, techniques, technologies and solutions, we cannot help but relate their potential to the "Art of the Possible". Let me leave you with a few scenarios.
Imagine if we could ...:

  • Predict the onset of fatal diseases in premature babies well before they are hit, and save their lives!
  • Leverage social information to identify customer churn and take immediate remedial action!
  • Offer discount coupons to a customer, displayed on her car dashboard screen or her smartphone, as she drives by an area where the shop is located!
  • Detect the imminent failure of a costly machine part well ahead of the next scheduled maintenance window and proactively fix or repair it, thereby saving costly production downtime!

Yes, I started by saying "Imagine if ...". But let me tell you that current-day technologies, tools and capabilities are shrinking the time warp and making this futuristic "Art of the Possible" real today, right here, right now!

To summarize, Big Data technologies and solutions enable analytics to assist in business transformation through iterative micro optimization of the enterprise value chain. And this is made possible by developing and providing a platform for fact-driven, real-time, optimized decision systems.

P.S.: The scenarios imagined above are all implemented today as real-time solutions and are enabling the connected enterprise to be micro-optimized at every possible opportunity.


Monday, October 1, 2012

Agile Big Data Analytics: How to start a Big Data Analytics project

I was speaking at a client briefing today and a question was asked: how do you start a big data analytics project? Do you first try to get the veracity and quality of all the data correct before you embark on such an initiative? Or are there alternative ways of kicking off such an initiative?

It was an interesting question, and one I am sure many companies are faced with, whether they explicitly accept it or not.

It is important to understand that every company has good data, and not to crib about data being unavailable or unusable. Indeed, there is an opportunity to increase the data set and the data veracity, but waiting for that to happen would only delay your analytics initiative further and further - not a good idea. Here is an option to consider, with a few steps to implement it:

  1. Use an unsupervised machine learning algorithm - a technique that works on the available data set as-is, disregarding any data quality issues to begin with. An unsupervised machine learning algorithm will cluster the data into different categories or classes based on the information that is available, and the resulting clusters, classes or categories can be analyzed to reveal the insight and information inherent in the data. Remember that the data set may be data in its raw or native form in schema-less storage (think Hadoop), in transactional systems or data warehouses, or a combination of these. (A minimal clustering sketch appears after this list.)
  2. Present the initial findings - Polish the analytical output, make it presentable in human-understandable, business-centric lingo, and present it to the stakeholders. Allow them to chew on the findings as you explain them. Remember, the good thing here is that you take no blame for your ability or inability to extract the insight the client is looking for; it is their data, with no quality profiling or transformations!
  3. Gather feedback from the client - The feedback will most probably come back as a combination of the following:
    • Pleasantly surprised with the initial insight and its value.
    • The insight is not complete and requires further tuning.
  4. Perform data analysis - At this point, given the type of analytical insight being sought, it is time to look more closely at the data. Analyze the completeness and quality of the data and determine which subsets have issues (quality, completeness, among others) that need to be fixed. This is when the proper data profiling and transformations need to be put in place to go after the insight that is expected.

  5. Add more data sources - Depending on the nature of the insight being sought, the initial analysis and assessment may show that a different data source or data set is required. I'd recommend not performing full-scale data profiling on the new data set; just bring it in! Focus your data quality efforts on the original data sets instead.
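As a minimal sketch of step 1, here is what a first, no-cleanup clustering pass might look like, assuming pandas and scikit-learn are available. The file name, column names and cluster count are invented for illustration.

```python
# Minimal sketch of step 1: cluster whatever data is available without any
# up-front quality work. Assumes pandas and scikit-learn are installed; the
# file name, columns and cluster count are invented for illustration.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("transactions_raw.csv")                       # raw extract, warts and all
features = df[["order_value", "visits_per_month"]].fillna(0)   # crude gap fill, no profiling

X = StandardScaler().fit_transform(features)                   # put columns on one scale
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

df["segment"] = labels
print(df.groupby("segment")[["order_value", "visits_per_month"]].mean())
```

Even a crude segmentation like this gives the stakeholders something concrete to react to in step 2, which is exactly the point of starting before the data is perfect.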
I hope you can detect the pattern here. The idea is to bring the data in, apply analytics to it at the onset, and derive as much value as the raw data is able to provide. Analyze the outcome of the initial analytics and then introduce the necessary data quality, profiling and transformations to the portions that need them. Repeat this process in an iterative manner till the results are as expected.

The advantage of this approach is that it allows you to start very quickly on an analytics solution and then progressively iterate over it till the commensurate level of prediction accuracy is obtained.

Some may call this approach Agile Analytic Development. I am OK with that. Ask me before using that term, though. I may patent it! :-)