Source: http://www.ironsidegroup.com/
August 7, 2012
What is Netezza?
Netezza is a dedicated data warehouse appliance that uses a proprietary architecture called Asymmetric Massively Parallel Processing (AMPP), which combines open blade-based servers and disk storage with a proprietary data filtering process based on field-programmable gate arrays (FPGAs). Netezza integrates database, server, and storage, all interconnected by a powerful network fabric, into a single, easy-to-manage system that requires minimal setup and ongoing administration, leading to shorter deployment cycles and faster time to value for business analytics. The focus of this article is to explain how Netezza works and the importance of data distribution in achieving optimal performance on a Netezza system.
Let’s first look at several key benefits associated with Netezza systems:
- Speed
- Simple
- Scalable
- Smart
Before we discuss the data distribution mechanism, let us first understand how Netezza stores data on disk. Each snippet processor in a Snippet Processing Unit (SPU) has a dedicated hard drive, and the data on that drive is called a data slice. Each disk is divided into three partitions: Primary (user data), Mirror, and Temp (intermediate processing data). The user data in each primary partition is copied to the mirror partition on another disk, a process called replication. Tables are split across SPUs and data slices; within a data slice, data is stored in row groups, while identical column values are compressed together (columnar compression).
The actual distribution of data across disks is determined by the distribution key specified in the table definition. There are two distribution methods: hash and random (round-robin). If no DISTRIBUTE ON clause is specified, the system defaults to hash distribution on the first column of the table.
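As a minimal sketch, the two options look like this in Netezza DDL; the table and column names (sales, customer_id, web_clicks, and so on) are invented for the example:

-- Hash distribution: each row is assigned to a data slice by hashing customer_id
CREATE TABLE sales
(
    sale_id     BIGINT,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- Random (round-robin) distribution: rows are spread evenly with no key
CREATE TABLE web_clicks
(
    click_time  TIMESTAMP,
    page_url    VARCHAR(500)
)
DISTRIBUTE ON RANDOM;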
A distribution key can contain at most four columns. When records are inserted, the system assigns each one to a logical data slice based on its distribution key value.
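For example, a composite distribution key (again with hypothetical column names) hashes the combined values of the listed columns, up to the four-column limit:

-- Two-column distribution key; Netezza hashes the combination of both values
CREATE TABLE order_items
(
    order_id    BIGINT,
    line_number INTEGER,
    product_id  INTEGER,
    quantity    INTEGER
)
DISTRIBUTE ON (order_id, line_number);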
The performance of the system is directly related to how uniformly user data is distributed across all of the data slices in the system. As depicted in the graphic below, overall response time is governed by the slowest-performing data slice (S-Blade); in the diagram this would be slice 2.
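A quick way to check how evenly a table is spread is to count rows per data slice. The sketch below assumes the DATASLICEID pseudo-column is available in your Netezza release and reuses the hypothetical sales table from the earlier example:

-- Rows per data slice; a large gap between the highest and lowest counts indicates skew
SELECT datasliceid, COUNT(*) AS row_count
FROM sales
GROUP BY datasliceid
ORDER BY row_count DESC;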
Processing skew occurs when the chosen distribution key is a Boolean column, e.g. True/False or Y/N values. Because such a key yields only two hash values, all of the data lands on just two data slices.
A distribution method that spreads data evenly across all data slices is the single most important factor influencing overall performance. A poor distribution method or key leads to uneven distribution of a table across data slices and SPUs (skew) and forces data to be redistributed or broadcast at query time, resulting in bottlenecks and poor performance. It is therefore extremely important to identify the right distribution key and method to ensure effective data distribution.
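If a table turns out to be poorly distributed, one common remedy is to rebuild it with a better key using CREATE TABLE ... AS SELECT and then swap the names. The statements below are a sketch using the hypothetical sales table; verify the new distribution before dropping the original:

-- Rebuild the table with a more selective distribution key
CREATE TABLE sales_redist AS
SELECT *
FROM sales
DISTRIBUTE ON (customer_id);

-- After validating row counts and distribution, replace the original table
DROP TABLE sales;
ALTER TABLE sales_redist RENAME TO sales;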