I've got a really large table (10+ million rows) that is starting to show signs of performance degradation for queries. Since this table will probably double or triple in size relatively soon I'm looking into partitioning the table to squeeze out some query performance.
The table looks something like this:
CREATE TABLE [my_data] (
[id] [int] IDENTITY(1开发者_Go百科,1) NOT NULL,
[topic_id] [int] NULL,
[data_value] [decimal](19, 5) NULL
)
So, a bunch of values for any given topic. Queries on this table will always be by topic ID, so there's a clustered index on (id, topic_id).
Anyway, since topic IDs aren't bounded (any number of topics could be added) I'd like to try partitioning this table on a modulus function of the topic IDs. So something like:
topic_id % 4 == 0 => partition 0
topic_id % 4 == 1 => partition 1
topic_id % 4 == 2 => partition 2
topic_id % 4 == 3 => partition 3
However, I haven't seen any way to tell "create partition function" or "create partition scheme" to perform this operation when deciding on a partition.
Is this even possible? How can we make a partition function based on an operation performed on the input value?
You just need to create your modulus column as a PERSISTED computed column.
Blue Peter style, here's one I made earlier (although I'm not 100% sure I have the partition values clause right):
CREATE PARTITION FUNCTION [PF_PartitonFour] (int)
AS RANGE RIGHT
FOR VALUES (
0,
1,
2)
GO
CREATE PARTITION SCHEME [PS_PartitionFourScheme]
AS PARTITION [PF_PartitonFour]
TO ([TestPartitionGroup1],
[TestPartitionGroup2],
[TestPartitionGroup3],
[TestPartitionGroup4])
GO
CREATE TABLE [my_data] (
[id] [int] IDENTITY(1,1) NOT NULL,
[topic_id] [int] NULL,
[data_value] [decimal](19, 5) NULL
[PartitionElement] AS [topic_id] % 4 PERSISTED,
) ON [PS_PartitionFourScheme] (PartitionElement);
GO
Hash partitioning is not available in SQL Server 2005/2008. You must use range partitioning.
That being said, you should be aware that partitioning is primarily a storage option, see Partitioned Table and Index Concepts:
Partitioning makes large tables or indexes more manageable, because partitioning enables you to manage and access subsets of data quickly and efficiently, while maintaining the integrity of a data collection. By using partitioning, an operation such as loading data from an OLTP to an OLAP system takes only seconds, instead of the minutes and hours the operation takes in earlier versions of SQL Server. Maintenance operations that are performed on subsets of data are also performed more efficiently because these operations target only the data that is required, instead of the whole table.
As you can see, the introduction of partitioning in MSDN focuses on maintenance, manageability and data load. In my experience partitioning gives, at best, 0 performance gain. Specially in SQL 2005. Usualy it gives performance degradation. To improve performance you should use a correct clustered index and properly designed non-clustered indexes.
In SQL 2008 there are improvements in the parallel operators in regard to partitions if they are properly distributed from an IO point of view, see Designing Partitions to Improve Query Performance. Their benefit are marginal though and overshadowed by the benefits of a properly designed set of clustered and non-clustered indexes. Case in point a clustered index in (id, topic_id) where id is an identity is usefull solely for single item lookup by id. On the other hand a clustered index by (topic_id, id) would benefit any queries that look for specific topic(s). I don't know your system requirements and the queries you run, but 10M rows performance problems on such a narrow table smell like indexing ands querying issue, no partitioning issue.
From the documentation, it seems like you have to give values to the function:
To create 4 partitions...
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);
Couldn't you just do your computations above this call and find the proper values to split on? Substitute the values into the call? Or am I missing why you want to use the modulus? Based on the possibility of your ID's having gaps, you may need to use some statistics math to find out where to partition.
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (@low, @Med, @High);
10 million rows isn't that many for SQL server to handle; regular index design would probably solve this without the need for partitioning. As has been noted, try clustering on different sets of columns; clustering on topicid, id seems like something to test out, especially if most queries have topicid as a criterion. A clustered index like that has approximately the same effect as paritioning, at least in that it groups the related rows of data together on disk and allows a range scan to fetch them quickly.
If that design works, all you have to worry about is fragmentation from inserts, but that's manageable. After getting the indexing right, make sure you have enough RAM, and that you don't have a disk bottleneck.
精彩评论