Sunday, October 28, 2012

Big Data - An Appliance Approach

The growth of data has been exponential, and there is no respite on the near horizon. To brace itself for the information deluge, the IT industry has to devise multiple lines of attack on the problem. One such approach is the use of purpose-built appliances geared towards Massively Parallel Processing (MPP). MPP is based on the principle of a shared-nothing architecture, which postulates that multiple dedicated banks of [processor, disk, network, memory], arranged to take advantage of parallel processing algorithms, can provide a realistic means of analyzing massive data sets. One such product is IBM Netezza, which came to IBM through a very smart acquisition in November 2010.

The architecture of IBM Netezza is very interesting, and it is what I am going to share today.

IBM Netezza is a data warehouse appliance. A major part of its performance advantage comes from its unique Asymmetric Massively Parallel Processing (AMPP) architecture, which combines a Symmetric Multi-Processing (SMP) front-end with a shared-nothing Massively Parallel Processing (MPP) back-end for query processing.

Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. Up to a thousand of these customized MPP streams can work together to “divide and conquer” the workload.

Referring to the figure above, the following is a brief description of the architectural building blocks of AMPP:
Hosts - The SMP hosts are high-performance IBM servers running Linux that are set up in an active-passive configuration for high availability. The active host presents a standardized interface to external tools and applications. It compiles SQL queries into executable code segments called snippets, creates optimized query plans and distributes the snippets to the MPP nodes for execution.

Snippet Blades (S-Blades) - S-Blades are intelligent processing nodes that make up the MPP engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.

Disk enclosures - The disk enclosures contain high-density, high-performance storage disks that are RAID protected. Each disk contains a slice of the data in a database table. The disk enclosures are connected to the S-Blades via high-speed interconnects that allow all the disks in the IBM Netezza appliance to simultaneously stream data to the S-Blades at the maximum rate possible.

Network fabric - All system components are connected via a high-speed network fabric. IBM Netezza runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.
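
To make the "divide and conquer" idea concrete, here is a minimal Python sketch of the shared-nothing pattern described above: a host hands the same snippet to every node, each node works only on its own data slice, and the host merges the partial results. This is an illustration only, not Netezza code. The function names (`distribute`, `snippet`, `run_query`), the modulo-based row placement and the sample table are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_slices, key):
    """Place each row on a data slice. Modulo on an integer key is an
    assumed stand-in for whatever distribution scheme the appliance uses."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row[key] % num_slices].append(row)
    return slices


def snippet(data_slice):
    """Work done by one node on its own slice only: filter early,
    then compute a partial aggregate."""
    return sum(r["amount"] for r in data_slice if r["region"] == "EU")


def run_query(slices):
    """Host role: send the same snippet to every slice, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(snippet, slices))
    return sum(partials)


# Hypothetical table: 100 rows, even ids belong to region "EU".
rows = [
    {"id": i, "region": "EU" if i % 2 == 0 else "US", "amount": i}
    for i in range(100)
]
slices = distribute(rows, num_slices=4, key="id")
total = run_query(slices)
print(total)
```

No slice ever reads another slice's rows, which is the essence of shared-nothing: adding more slices adds more independent workers rather than more contention.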

Inside the S-Blade
The extreme performance happens at the core of the IBM Netezza appliance: the S-Blade, which contains the FAST Engines.

A dedicated high-speed interconnect from the storage array allows data to be delivered to memory as quickly as it can stream off the disk. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines running in parallel inside the FPGAs uncompress and filter out 95-98 percent of table data at physics speed, keeping only the data that is relevant to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors running in an IBM Netezza data warehouse appliance.

The FPGA is a critical enabler of the price-performance advantages of IBM Netezza data warehouse appliances. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.

The FAST Engine includes:
Compress engine - uncompresses data at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse: the hard disk.

Project and Restrict engines - further enhance performance by filtering out columns and rows respectively, based on the parameters in the SELECT and WHERE clauses in an SQL query.

Visibility engine - plays a critical role in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. It filters out rows that should not be “seen” by a query; e.g. rows belonging to a transaction that is not yet committed.
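
A quick back-of-the-envelope calculation shows why these two figures matter together: the 4-8x decompression multiplies the effective scan rate, and the 95-98 percent filtering shrinks what the CPU cores must handle. The sketch below is only arithmetic. The 100 MB/s raw disk rate is a made-up input for illustration, not a Netezza specification, and `effective_rates` is a hypothetical helper name.

```python
def effective_rates(disk_mb_s, compression_ratio, filtered_percent):
    """Return (uncompressed scan rate, rate reaching the CPU) in MB/s.

    disk_mb_s         raw rate at which compressed data leaves the disk
    compression_ratio blocks in memory per block on disk (4 to 8 in the text)
    filtered_percent  share of table data dropped in the FPGA (95 to 98)
    """
    scan_rate = disk_mb_s * compression_ratio
    to_cpu = scan_rate * (100 - filtered_percent) / 100
    return scan_rate, to_cpu


# Assumed raw disk rate of 100 MB/s, purely for illustration.
low = effective_rates(100, compression_ratio=4, filtered_percent=95)
high = effective_rates(100, compression_ratio=8, filtered_percent=98)
print(low)   # (400, 20.0)
print(high)  # (800, 16.0)
```

Under these assumed inputs, the disk behaves as if it were four to eight times faster, while the CPU cores see only a small fraction of the uncompressed stream.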

How a query is optimized
Let us take an example of how IBM Netezza optimizes the queries in its core query processing engine.

Referring to the figure above, one can think of the way that data streaming works in IBM Netezza as similar to an assembly line. The IBM Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operates concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases the performance relative to other architectures.

Compressed data gets streamed from disk onto the assembly line at the fastest rate that the physics of the disk allows. The data could also be cached, in which case it gets served right from memory instead of disk.

Here are the high-level steps that are followed:
1. The first stage in the assembly line, the Compress Engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse: the disk.
2. The data block is then passed on to the Project Engine, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed.
3. The assembly line then moves the data block to the Restrict Engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause.
4. The Visibility Engine also feeds in additional parameters to the Restrict Engine, to filter out rows that should not be “seen” by a query, e.g. rows belonging to a transaction that is not committed yet.
5. The Processor Core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. It finally assembles all the intermediate results together from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
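
The five steps above can be sketched as a chain of Python generators, where each stage consumes blocks from the previous one. This is a software analogy only: the real stages run in FPGA hardware and CPU cores. `zlib` stands in for the appliance's compression, the boolean `committed` flag is a simplified stand-in for real transaction visibility, and all function names and sample rows are hypothetical. The sketch also keeps the `committed` column through projection so the later stage can use it.

```python
import json
import zlib


def compress_engine(disk_blocks):
    """Step 1: uncompress each block as it streams off the 'disk'."""
    for block in disk_blocks:
        yield json.loads(zlib.decompress(block))


def project_engine(blocks, columns):
    """Step 2: keep only the requested columns."""
    for rows in blocks:
        yield [{c: row[c] for c in columns} for row in rows]


def restrict_engine(blocks, where, is_visible):
    """Steps 3 and 4: drop rows failing the WHERE predicate or the
    visibility check supplied by the visibility stage."""
    for rows in blocks:
        yield [row for row in rows if where(row) and is_visible(row)]


def processor_core(blocks):
    """Step 5: aggregate what survives (here, a simple SUM(amount))."""
    return sum(row["amount"] for rows in blocks for row in rows)


# Hypothetical table; row 3 belongs to an uncommitted transaction.
table = [
    {"id": 1, "region": "EU", "amount": 10, "committed": True},
    {"id": 2, "region": "US", "amount": 20, "committed": True},
    {"id": 3, "region": "EU", "amount": 30, "committed": False},
    {"id": 4, "region": "EU", "amount": 40, "committed": True},
]
# Two compressed 'disk blocks' of two rows each.
disk_blocks = [
    zlib.compress(json.dumps(table[i:i + 2]).encode("utf-8"))
    for i in range(0, len(table), 2)
]

stream = compress_engine(disk_blocks)
stream = project_engine(stream, ["region", "amount", "committed"])
stream = restrict_engine(
    stream,
    where=lambda row: row["region"] == "EU",
    is_visible=lambda row: row["committed"],
)
result = processor_core(stream)
print(result)  # 50
```

Because generators are lazy, each block flows through every stage before the next block is read, which loosely mirrors how the assembly line handles data as a stream rather than materializing the whole table first. Unlike the hardware, though, these stages run one after another rather than concurrently.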

I hope this gives you a decent overview of the built-in architectural smarts of IBM Netezza, which make it one of the strong data warehouse and analytical appliances in the industry.
