Postgres Partition by Character Prefix - database

Good Day,
I would like to check what the best way is to partition a Postgres table on a column's prefix. I have a large table (roughly 300 to 750 million rows x 10 columns) and I would like to partition it on a prefix of column 1.
Data looks like:
ABCDEF1xxxxxxxx
ABCDEF1xxxxxxxy
ABCDEF1xxxxxxxz
ABCDEF2xxxxxxxx
ABCDEF2xxxxxxxy
ABCDEF2xxxxxxxz
ABCDEF3xxxxxxxx
ABCDEF3xxxxxxxz
ABCDEF4xxxxxxxx
ABCDEF4xxxxxxxy
There will only ever be 10 partitions, i.e. ABCDEF0... -> ABCDEF9...
What I've currently done is make tables like:
CREATE TABLE public.mydata_ABCDEF1 (
CHECK ( col1 like 'ABCDEF1%' )
) INHERITS (public.mydata);
CREATE TABLE public.mydata_ABCDEF2 (
CHECK ( col1 like 'ABCDEF2%' )
) INHERITS (public.mydata);
etc. Then the trigger with similar logic:
IF ( NEW.col1 like 'ABCDEF1%' ) THEN
INSERT INTO public.mydata_ABCDEF1 VALUES (NEW.*);
ELSIF ( NEW.col1 like 'ABCDEF2%' ) THEN
INSERT INTO public.mydata_ABCDEF2 VALUES (NEW.*);
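For context, the complete trigger function is roughly the following sketch (the function and trigger names are shortened placeholders, and the ELSE branch is just one way of catching unexpected prefixes):
CREATE OR REPLACE FUNCTION public.mydata_insert_router()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.col1 LIKE 'ABCDEF1%' ) THEN
        INSERT INTO public.mydata_ABCDEF1 VALUES (NEW.*);
    ELSIF ( NEW.col1 LIKE 'ABCDEF2%' ) THEN
        INSERT INTO public.mydata_ABCDEF2 VALUES (NEW.*);
    -- ... one ELSIF per remaining prefix ...
    ELSE
        RAISE EXCEPTION 'col1 value % does not match any partition', NEW.col1;
    END IF;
    RETURN NULL;  -- keep the routed row out of the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mydata_insert_trigger
    BEFORE INSERT ON public.mydata
    FOR EACH ROW EXECUTE PROCEDURE public.mydata_insert_router();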
I'm wondering whether partitioning in this way will actually speed up query time, whether I should instead partition on substr() (not sure how), or whether I should add a new column containing the prefix and partition on that column.
Any advice is appreciated.

I know this is an old question, but I am adding this answer in case anyone else needs a solution.
Postgres 10 supports declarative range partitioning: https://www.postgresql.org/docs/10/static/ddl-partitioning.html.
While the examples in the docs use date ranges, you can also use string ranges, since Postgres (mostly) uses ASCII ordering. The code below creates a parent table and then two child tables which, depending on your specific codes, should automatically bin any alphanumeric value based on the prefixes provided. The ranges do have to be non-overlapping, which is why I can't simply create a range from ABCDEF1 to ABCDEF2.
CREATE TABLE mydata (...) PARTITION BY RANGE (col1);
CREATE TABLE mydata_abcdef1 PARTITION OF mydata
FOR VALUES FROM ('ABCDEF1') TO ('ABCDEF1z');
CREATE TABLE mydata_abcdef2 PARTITION OF mydata
FOR VALUES FROM ('ABCDEF2') TO ('ABCDEF2z');
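Since there are only ten prefixes, you could also generate all of the child tables in one go. A minimal sketch, assuming the mydata parent above and bounds following the same 'ABCDEFn' to 'ABCDEFnz' pattern (which covers the lowercase suffixes in the sample data):
DO $$
BEGIN
    FOR i IN 0..9 LOOP
        EXECUTE format(
            'CREATE TABLE mydata_abcdef%s PARTITION OF mydata
                 FOR VALUES FROM (%L) TO (%L)',
            i, 'ABCDEF' || i, 'ABCDEF' || i || 'z');
    END LOOP;
END
$$;
Note that the planner can only prune partitions when the query filters on col1 with predicates it can match against the bounds (equality or range conditions on col1).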

It will significantly speed up your queries when each of the partitions has its own appropriately restricted index, e.g.:
CREATE INDEX ON public.mydata_ABCDEF1 (...) WHERE col1 like 'ABCDEF1%';

The short answer is "probably not," but it really depends on exactly what your queries are.
The question is really: what are you trying to accomplish with the partitioning? Generally speaking, PostgreSQL's btree indexes are very fast and efficient at finding the specific records you are asking for, faster than PostgreSQL is at figuring out which table out of a set of partitioned tables your data is stored in.
Where partitioning is extremely useful is when it helps with data management. The reason is that you can often partition based on time and then, once the data has aged enough, simply drop the older partition instead of having to issue DELETE queries that mark records as deleted, which then have to be VACUUM'd to reclaim the space and which end up causing bloat in the table and indexes.
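For example, retiring a month of time-partitioned data becomes a near-instant metadata operation instead of a mass DELETE. The table names below are made up; DETACH PARTITION applies to declarative partitioning in Postgres 10+, while with inheritance-based partitioning you simply drop the child table:
-- Declarative partitioning (Postgres 10+): detach, then drop
ALTER TABLE events DETACH PARTITION events_2015_01;
DROP TABLE events_2015_01;

-- Inheritance-based partitioning: drop the child table directly
DROP TABLE events_2015_01;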
300M records is about the point where I might consider partitioning, but I wouldn't jump to partitioning the data at that point without a clear reason why having the data partitioned will be helpful.
Also, be aware that PostgreSQL's query planner does not handle very large numbers of partitions very well; hundreds or thousands of partitions will slow down planning time. That's not very obvious with pre-9.5 versions, but in 9.5 an "EXPLAIN ANALYZE" will report the planning time required for a given query:
=*> explain analyze select * from downloads;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on downloads (cost=0.00..38591.76 rows=999976 width=193) (actual time=23.863..2088.732 rows=
Planning time: 0.219 ms
Execution time: 2552.878 ms
(3 rows)

Related

SQL Server table: create partitions?

I have a SQL database where, among other things, I have a prices table with one price per product per store.
There are 50 stores and over 500,000 products, so this table will easily have 25 to 30 million records.
This table is fed daily overnight with price updates, and has heavy read activity during the day. Reads are made with read-only intent.
All queries contain storeid as part of identifying the record to update or read.
I'm not yet able to determine how this will behave, since I'm still waiting on the external supply of prices, but I expect performance issues at least on read operations, even though indexes are in place for now...
My question is whether I should consider partitioning the table by store, since the store is always part of the queries. But then I have indexes where storeid is not the only column in the index.
Based on this scenario, would you recommend partitioning? The alternative I see is having 50 tables, one per store, but that seems painful and better avoided if possible.
if I should consider table partition by store since it is always part of queries
Yes. That sounds promising.
But then I have indexes where storeid is not the only column that is part of the index.
That's fine. So long as the partitioning column is one of the clustered index columns, you can partition by it. In fact with partitioning, you can get partition elimination for a trailing column of the clustered index, then a clustered index seek within the target partition.
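A rough sketch of what that could look like (all object names here are made up; the stores are grouped into bands of ten, and storeid is a trailing key of the clustered index, which still allows partition elimination):
CREATE PARTITION FUNCTION pf_StoreId (int)
    AS RANGE RIGHT FOR VALUES (11, 21, 31, 41);

CREATE PARTITION SCHEME ps_StoreId
    AS PARTITION pf_StoreId ALL TO ([PRIMARY]);

-- The clustered index is created on the partition scheme; storeid must be part
-- of the clustered key, but it does not have to be the leading column.
CREATE CLUSTERED INDEX cix_Prices
    ON dbo.Prices (productid, storeid)
    ON ps_StoreId (storeid);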
Hi all, and thank you for your replies.
I was able to gather meaningful measurements in a contained environment, where I confirmed that I can achieve excellent performance indicators by using only the appropriate indexes.
So for now we will keep it "as is" and have the partition strategy on hand just in case.
Thanks again, nice tips guys

Partition or Index large table in SQL Server

I have a large table consisting of 4 Billion+ rows and 50 columns, most of which are either datetime or numeric except a few which are varchar.
Data will be inserted into the table on a weekly basis (about 20 million rows).
I expect queries with WHERE clauses on some of the datetime columns and a couple of the varchar columns. There is no primary key in the table.
There are no indexes, nor is the table partitioned. I am using SQL Server 2016.
I understand that I need to partition or index the table, but I am not sure which approach to take or both in-fact.
Since the table is large, should I create the indexes first or the partitions first? And if I create the indexes and then the partitions, what should I do to maintain them as new data comes in weekly?
EDIT: Also, minimal updates and deletes are expected on the table
I understand that I need to partition or index the table
You need to understand what you would gain from partitioning. It is not at all the case that SQL Server requires partitioning on big tables to function adequately; SQL Server scales to arbitrary table sizes without any inherent issues.
Common benefits of partitioning are:
Mass deletion in constant time
Different storage for older partitions
Not backing up old partitions
Sometimes in special situations (e.g. columnstore), partitioning can help as a strategy to speed up queries. Normally, indexing is better for that.
Essentially, partitioning splits the table physically into multiple sub tables. Most often this has a negative effect on query plans. Indexes are perfectly capable of restricting the set of data that needs to be touched. Partitions are worse for that.
Most of the queries will be filtering on the datetime columns and on some of the varchar columns, e.g. get the data for a certain date range for a certain entity. With indexes, there will be a lot of fragmentation because of the new inserts, and rebuilding/reorganising the indexes will also consume a lot of time. I can do it, but again I'm not sure which approach to take.
It seems you can best solve this by indexing:
Index according to the queries you expect.
Maintain the indexes properly. This is not too hard. For example, rebuild them after the weekly load.
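For example (the table name here is hypothetical):
-- After the weekly load:
ALTER INDEX ALL ON dbo.BigTable REBUILD;
-- or, if the maintenance window is tight:
ALTER INDEX ALL ON dbo.BigTable REORGANIZE;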
Since the table is large, should I create the indexes first or should I create the partitions first?
Set up the partitioning objects first. Then create or rebuild the clustered index on the new partition scheme. If possible, drop the other indexes first and recreate them afterwards (this might not work due to availability restrictions).
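A sketch of that rebuild step, assuming a partition function and scheme on the load date already exist (all names here are hypothetical):
CREATE CLUSTERED INDEX cix_BigTable
    ON dbo.BigTable (LoadDate, Id)
    WITH (DROP_EXISTING = ON)
    ON ps_ByLoadDate (LoadDate);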
what should I do to maintain these with new data coming in weekly.
What concerns do you have? New data will be stored in the appropriate partitions automatically. Make sure to create new partitions before loading the data; keep partitions ready two weeks in advance. The latest partition must always be empty to avoid costly splits.
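Adding next week's empty partition ahead of the load is a two-statement operation (the names and the boundary date are placeholders):
ALTER PARTITION SCHEME ps_ByLoadDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_ByLoadDate() SPLIT RANGE ('2016-07-04');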
There is no primary key in the table.
Most often this is not a good design. Most tables should have a primary key and a clustered index. If there is no natural key, use an artificial one such as a bigint identity.
You definitely can apply partitioning, but my feeling is that it will not gain you what you expect. It will force you to take on additional maintenance burdens, may reduce performance, and carries the risk of mistakes that threaten availability. Simplicity is important.

Approaches to table partitioning in SQL Server

The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size, and I've researched and evaluated the solutions I've come across, but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect usage statistics (the queries actually run), look at the engine's own statistics such as sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most frequently run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition (or not) and which partitioning function to use. Your clustered indexes on the two tables are the most likely candidates (there is not much sense in partitioning just a non-clustered index and not the clustered one), so unless you're considering a redesign of your clustered keys, the question is really which partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier is generated, you might need to allocate ids manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single result.
It's not as good as partitioning the tables based on some logical separation that corresponds to the expected queries, but it's better than hitting the size limit of a table.
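A rough sketch of the manual scheme (the column list is trimmed and only the first two buckets are shown; Case02 through Case09 and their UNION ALL branches follow the same pattern):
CREATE TABLE Case00 (PrimaryKey bigint PRIMARY KEY, CaseYear int, CaseType varchar(20), CaseStatus varchar(20));
CREATE TABLE Case01 (PrimaryKey bigint PRIMARY KEY, CaseYear int, CaseType varchar(20), CaseStatus varchar(20));
-- a row with key @pk is stored in the table numbered @pk % 10
GO
CREATE VIEW AllCases AS
    SELECT * FROM Case00
    UNION ALL
    SELECT * FROM Case01;
GO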
Another possible thing to look at (before partitioning) is your model.
Are you working with a normalized database? Are there further steps which could improve performance through different choices in normalization, denormalization, or partial normalization? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.

SQL split/merge of table partitions: What is the best approach to implement?

Microsoft, in its MSDN entry about altering SQL Server 2005 partitions, lists a few possible approaches:
Create a new partitioned table with the desired partition function, and then insert the data from the old table into the new table by using an INSERT INTO...SELECT FROM statement.
Create a partitioned clustered index on a heap
Drop and rebuild an existing partitioned index by using the Transact-SQL CREATE INDEX statement with the DROP EXISTING = ON clause.
Perform a sequence of ALTER PARTITION FUNCTION statements.
Any idea which will be the most efficient way for a large-scale DB (millions of records) with partitions based on the dates of the records (something like monthly partitions), where the data spreads over 1-2 years?
Also, if I mostly access (for reading) recent information, does it make sense to keep one partition for the last X days and put all the rest of the data in another partition? Or is it better to partition the rest of the data too (for random access based on arbitrary date ranges)?
I'd recommend the first approach - creating a new partitioned table and inserting into it - because it gives you the luxury of comparing your old and new tables. You can test query plans against both styles of tables and see if your queries are indeed faster before cutting over to the new table design. You may find there's no improvement, or you may want to try several different partitioning functions/schemes before settling on your final result. You may want to partition on something other than date range - date isn't always effective.
I've done partitioning with 300-500m row tables with data spread over 6-7 years, and that table-insert approach was the one I found most useful.
You asked about how to partition - the best answer is to try to design your partitions so that your queries will hit a single partition. If you tend to concentrate queries on recent data, AND if you filter on that date field in your where clauses, then yes, have a separate partition for the most recent X days.
Be aware that you do have to specify the partitioned field in your where clause. If you aren't specifying that field, then the query is probably going to hit every partition to get the data, and at that point you won't have any performance gains.
Hope that helps! I've done a lot of partitioning, and if you want to post a few examples of table structures & queries, that'll help you get a better answer for your environment.
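To make the first approach concrete, here is a sketch under assumed names (a monthly RANGE RIGHT function on a RecordDate column; the real column names, boundary dates and filegroups will differ):
CREATE PARTITION FUNCTION pf_Monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2008-01-01', '2008-02-01', '2008-03-01');

CREATE PARTITION SCHEME ps_Monthly
    AS PARTITION pf_Monthly ALL TO ([PRIMARY]);

-- New table built on the scheme, then loaded from the old one so query plans
-- can be compared against both tables before cutting over.
CREATE TABLE dbo.Orders_Partitioned (
    OrderId    bigint        NOT NULL,
    RecordDate datetime      NOT NULL,
    Amount     decimal(18,2) NULL,
    CONSTRAINT PK_Orders_Partitioned PRIMARY KEY CLUSTERED (RecordDate, OrderId)
) ON ps_Monthly (RecordDate);

INSERT INTO dbo.Orders_Partitioned (OrderId, RecordDate, Amount)
SELECT OrderId, RecordDate, Amount FROM dbo.Orders;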

What columns generally make good indexes?

As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...
Indexes can play an important role in query optimization and in retrieving results quickly from tables. The most important step is to select which columns are to be indexed. There are two major places to consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, index the columns against which you are required to search for particular records. Suppose we have a table named buyers, queried with a SELECT like the one below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. The below is another example little different from the above one:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries, the first_name and last_name columns can be indexed, as they are located in the WHERE clause. An additional field, country_id from the country table, can also be considered for indexing because it is in a JOIN clause. So indexing can be considered for every field in a WHERE clause or a JOIN clause.
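In index form, the suggestions above would look roughly like this (index names are made up, and the composite index on (first_name, last_name) also serves searches on first_name alone):
CREATE INDEX idx_buyers_name    ON buyers (first_name, last_name);
CREATE INDEX idx_buyers_country ON buyers (country_id);
-- country.country_id is usually already the primary key, and therefore indexed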
The following list also offers a few tips that you should always keep in mind when you intend to create indexes on your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take advantage of MySQL's "index prefix" or "multi-column index" features. If you create an index such as INDEX(first_name, last_name), don't also create INDEX(first_name). However, index prefixes or multi-column indexes are not recommended in all search cases.
Use the NOT NULL attribute for the columns you intend to index, so that NULL values are never stored.
Use the --log-long-format option to log queries that aren't using indexes. You can then examine this log file and adjust your queries accordingly.
The EXPLAIN statement helps you see how MySQL will execute a query. It shows how and in what order tables are joined. This can be very useful for determining how to write optimized queries and whether columns need to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
Depending on your indexes (how many and of what type), searches behave differently. If your search time goes up because of an index, that's a bad index.
Like the index page of a book, which lists where each chapter, topic and sub-topic starts: some detail in the index page helps, but an overly detailed index can confuse or overwhelm you. Indexes also take up memory.
Choose indexes wisely. Keep in mind that not every column needs an index.
Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
Good luck!
It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.
Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like "O'Brian%"
The columns do not need to be unique. In fact, you can get really good performance from an index on a binary (yes/no) column when searching for the exceptions.
select * from tblOrder where paidYN='N'
In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
My suggestion would be to not use any indexes initially and profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevant attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)
It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being run against your database and can tell you which indexes would have improved performance.
Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER by frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.
Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005, are set on using uniqueidentifiers (GUIDs), and do NOT need them to be of a random nature, check out sequential uniqueidentifiers (NEWSEQUENTIALID()).
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.
A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, ie sorted (integer, date etc).
It does not matter whether the data in a column is generally increasing. If you create an index on the column, the index will build its own data structure that simply references the actual items in your table without concern for stored order (a non-clustered index). Then, for example, a binary search can be performed over the index's data structure to provide fast retrieval.
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.
The ol' rule of thumb was columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seemed to be used in joins frequently. Keep in mind I'm referring to indexes, NOT Primary Key
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data
It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index (binary search), you can find the physical location of the record you are looking for in O(lg n) time instead of searching sequentially in O(n) time. This is because you don't know in advance where records are located in your table.
Best index depends on the contents of the table and what you are trying to accomplish.
Take as an example a member database with a primary key of the member's Social Security Number. We choose the SSN because the application primarily refers to the individual in this way, but you also want to create a search function that uses the member's first and last name. I would then suggest creating an index over those two fields.
You should first find out what data you will be querying and then make the determination of which data you need indexed.
