Wednesday, January 30, 2008

Partitioning a Powerful Tool for DBAs

Partitioning tables and indexes in a database is one of the most powerful features that can help the performance and manageability of a database with large tables. With a large database, there are usually only a handful of very large tables. These large tables typically get most of the I/O activity and have the biggest impact on performance and management.

Partitioning tables and indexes breaks a big problem (large tables) into a number of smaller manageable pieces (table partitions). When a table is partitioned, it is broken into separate physical chunks of data (partitions). For example, there is a large sales table with 10 million records. When generating a monthly or quarterly report, enough records are accessed so an index scan is not efficient, so a full table scan has to be performed. So every month more records get added to the table and every month the monthly report takes longer to run because it is generating a full table scan to go through the data.

An example of how partitioning can have an impact: a sales table can be divided into partitions (i.e. monthly increments) based on the sales date (partition key). So one table partition will have January's data, another partition can have February's data and so on. When a monthly report is generated, an I/O scan will occur on one month of data instead of the entire table. The table partitioning is transparent to the applications. While processing the SQL statement, the optimizer understands the table is partitioned and checks to see if individual partitions can be scanned instead of the entire table.

Partition Pruning
Partition pruning allows a query to run on specific partitions instead of an entire table. Leveraging this benefit requires understanding the data and how the data is accessed to determine the best partitioning method and what column(s) should be the partition key. The partition key is the column(s) used to divide the data into separate partitions. Sufficient time needs to be spent determining the correct column(s) to define as the partition key and defining the correct partition type for the table.

Partitioning offers a number of advantages:
  • Partition pruning allows I/O to be performed on smaller data sets instead of the entire table. As tables grow larger, it can be a big advantage to perform partition scans.
  • Being able to perform administration operations at the partition level versus the table level. This can decrease maintenance times and improve availability.
  • Partitioning is transparent at the application level. Applications (SQL and stored procedure code) will not be impacted by going to a partitioning solution. The optimizer will consider partition pruning (selecting individual partitions) during the optimization phase of processing a SQL command if the partition key column(s) is used in the WHERE clause.
  • Supports information lifecycle management so as data gets older it can be truncated or deleted in older partitions and not impact the more current partitions.
  • Partitioning on indexes breaks one big index into smaller indexes which can significantly reduce I/O on the indexes.
Some examples of using partitioning:
Here are two examples where partitioning can have a positive impact.
  1. A key transaction table is expected to grow by 30% over the next year. It may not be acceptable for the I/O performance to decrease as more data is added to a table. Being able to internally break a table into separate chunks of data that contain data for an individual month or region can offer significant advantages. Monthly reports or regional reports can perform partition scans on specific months or regions and not have to perform full table scans.
  2. A table needs to store data for 13 months. Every month when the oldest partition expires it can be truncated instead of going through and deleting all the old records that have expired.
Different types of partitioning
Range or list partitioning is good to use when there are recognized patterns for how the data is accessed or organizing the data into defined data sets makes sense from an administration perspective. Hash and key partitioning is good to use when someone is just as likely to access one record as another or there is not a clear recognized way to divide the data but there are so many records in a table the benefits of partitioning can be leveraged.
* RANGE
* LIST
* HASH
* KEY
* Composite partitioning- RANGE AND LIST partitions can be sub-partitioned by HASH or KEY.
- RANGE-HASH
- RANGE-KEY
- LIST-HASH
- LIST-KEY

Code examples:
In this example, we have selected the sdate column as the partition key. The table will be broken into separate data chunks (partitions) based on the VALUES LESS THAN clause. Here we have stated that all data before January 1, 2008 will go in the first partition. Partition ranges are not inclusive. Which means Partition Jan2008 will store records where the sdate value is 2008-01-01 or greater and less than 2008-02-01. For 2008 the data is broken into monthly increments. The partition key is defined by the columns listed in the partition clause. The values in the partition key columns determine which partitions the records will be put in.

Range Example:
Range partition example has one partition for dates before 2008 and monthly partitions after that. The MAXVALUE keyword will accept any records with a sales date that is 2008-04-01 or greater. Once we get into May, a new partition for the new month can be created.
CREATE TABLE sales (
invoice_id INT NOT NULL AUTO_INCREMENT,
sdate DATE NOT NULL,
samt DECIMAL (10,2) UNSIGNED NOT NULL,
region_id INT (3),
cust_id INT,
INDEX ( invoice_id),
INDEX ( sdate) ) ENGINE=innodb
PARTITION BY RANGE (to_days(sdate)) (
PARTITION before2008 VALUES LESS THAN (to_days('2008-01-01')) ,
PARTITION Jan2008 VALUES LESS THAN (to_days('2008-02-01')) ,
PARTITION Feb2008 VALUES LESS THAN (to_days('2008-03-01')) ,
PARTITION Mar2008 VALUES LESS THAN (to_days('2008-04-01')) ,
PARTITION plast VALUES LESS THAN MAXVALUE );

List Example:
List partition example partitions based on the region_id column. The value for the region_id will determine which partition the data goes into.
CREATE TABLE sales (
invoice_id INT NOT NULL AUTO_INCREMENT,
sdate DATE NOT NULL,
samt DECIMAL (10,2) UNSIGNED NOT NULL,
region_id INT (3),
cust_id INT,
INDEX ( invoice_id),
INDEX ( region_id) ) ENGINE=myisam
PARTITION BY LIST (region_id) (
PARTITION North VALUES IN (1, 2, 5, 6),
PARTITION South VALUES IN (7,8,15,16 ),
PARTITION West VALUES IN (20, 21, 22, 23, 24),
PARTITION East VALUES IN (30, 31, 32, 34, 35) );

Hash example:
Hash table example generates 16 partitions that will be used to try and evenly divide the data. Hash partitioning uses the modulus operator to distribute the data.
CREATE TABLE download(
id INT NOT NULL,
pname VARCHAR(60),
ddate DATE NOT NULL,
pcode INT,
region_id INT ) ENGINE=myisam
PARTITION BY HASH(region_id)
PARTITIONS 16;

Key partition example:
Key table table breaks the data into 16 partitions. With key partitioning, the MySQL server will use an internal algorithm to try to evenly distribute the data.
CREATE TABLE download(
id INT NOT NULL,
pname VARCHAR(60),
ddate DATE NOT NULL,
pcode INT,
region_id INT ) ENGINE=innodb
PARTITION BY KEY (id)
PARTITIONS 16;

Composite Partitioning
If the number of records that go into a range or list partition are still too too big, then a composite partitioning can be performed. Range or list partitioned tables can have a lower level of partitioning performed. This is called sub partitioning.

RANGE-HASH partitioning
This RANGE-HASH partitioning will have four has subpartitions for every range.
CREATE TABLE sales (
invoice_id INT NOT NULL AUTO_INCREMENT,
sdate DATE NOT NULL,
samt DECIMAL (10,2) UNSIGNED NOT NULL,
region_id INT (3),
cust_id INT,
INDEX ( invoice_id),
INDEX ( sdate) ) ENGINE=innodb
PARTITION BY RANGE (to_days(sdate))
SUBPARTITION BY HASH( region_id)
SUBPARTITIONS 4(
PARTITION before2008 VALUES LESS THAN (to_days('2008-01-01')) ,
PARTITION Jan2008 VALUES LESS THAN (to_days('2008-02-01')) ,
PARTITION Feb2008 VALUES LESS THAN (to_days('2008-03-01')) ,
PARTITION Mar2008 VALUES LESS THAN (to_days('2008-04-01')) ,
PARTITION plast VALUES LESS THAN MAXVALUE );

This RANGE-HASH partition table defines the number of subpartitions to be defined for each range. Different numbers of subpartitions can be defined when different months can have different volumes of sales. So it may be necesssary for some some ranges to have a larger or smaller number of partitions.

CREATE TABLE sales (
invoice_id INT NOT NULL AUTO_INCREMENT,
sdate DATE NOT NULL,
samt DECIMAL (10,2) UNSIGNED NOT NULL,
region_id INT (3),
cust_id INT,
INDEX ( invoice_id),
INDEX ( sdate) ) ENGINE=innodb
PARTITION BY RANGE (to_days(sdate))
SUBPARTITION BY HASH( region_id) (
PARTITION before2008 VALUES LESS THAN (to_days('2008-01-01'))
( SUBPARTITION s0,
SUBPARTITION s1 ),
PARTITION Jan2008 VALUES LESS THAN (to_days('2008-02-01'))
( SUBPARTITION s3,
SUBPARTITION s4 ),
PARTITION Feb2008 VALUES LESS THAN (to_days('2008-03-01'))
( SUBPARTITION s5,
SUBPARTITION s6 ),
PARTITION Mar2008 VALUES LESS THAN (to_days('2008-04-01'))
( SUBPARTITION s7,
SUBPARTITION s8,
SUBPARTITION s9,
SUBPARTITION s10),
PARTITION plast VALUES LESS THAN MAXVALUE
(SUBPARTITION s11,
SUBPARTITION s12)
);

To leverage partition pruning, the partition key column (sdate) needs to be in a WHERE clause, for the optimizer to consider performing partition scans. For example, the following query will only access data that is in the Jan2008 instead of the full table. So if we use the partition key in the WHERE clause then monthly and quarterly reports will perform partition scans and not full table scans.

SELECT * FROM sales
WHERE sdate BETWEEN '2008-01-01' AND '2008-01-01';

A few points about partitioning in MySQL 5.1
This is the first release of partitioning in MySQL so the focus is on key features of partitioning. It's a little tough due to some of the restrictions (listed below) but they are offset by the fact that MySQL offers partitioning for free. This can be an expensive license for other database vendors. This allows MySQL users the ability to leverage the great benefits of partitioning without any additional licensing costs.

People always ask me how big should a table be before we start considering partitioning. I then give my favorite answer, "it depends". Once a table gets to the size that the following become issues, it is time to consider partitioning:
  • Full tables scans are too expense.
  • Maintenance operations on an entire table take too long.
  • Availability is being impacted by the table being too large.
  • ILM can be leveraged by going to partitioning.

Some restrictions of using partitioning in MySQL 5.1
  • Partitioned tables must be partitioned on the partition key using an integer or must contain an expression that evaluates to an integer expression (key partitioning is an exception to this). Partition pruning can still be performed on date columns if you use the TO_DAYS() or YEAR() functions.
  • There can be a maximum of 1024 partitions.
  • Foreign keys are not supported.
  • If a table contains a primary key, the columns in the partition key must be part of the primary key.

No comments: