Let's say I have one table:
RecordingsByAccountaId (AccountId, a, b, c, x, y, z)
Partitioning key : AccountId
Clustering key : a,b
I need to fetch data for one account inside my code, so I am performing:
Select * from RecordingsByAccountaId where accountId = 'accountId';
Is it a costly operation?
The objective is to update 2-3 rows of this table, but I don't have any information other than the accountId.
Is querying one row almost the same as querying the whole partition? The fetch times I observed for 200 rows and for one row differed by only 20-30 milliseconds.
It mostly depends on the size of your partition - how many rows it includes. Another factor is how fragmented your partition is - whether it is located in a single SSTable (it has been compacted) or in multiple SSTables, in which case you will read data from multiple files.
But usually, reading a partition inside a single file is a sequential operation, as all rows that belong to the same partition are written sequentially, and if the partition size isn't very big, then the performance shouldn't suffer dramatically (though this may depend on your hardware as well).
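For illustration, a minimal CQL sketch of the read-then-update flow, assuming the clustering values for a and b are taken from the rows fetched by the first query (the literal values here are hypothetical):

-- Single-partition read: sequential within each SSTable that holds the partition
SELECT * FROM RecordingsByAccountaId WHERE AccountId = 'accountId';

-- Updates must specify the full primary key: the partition key plus clustering columns a and b
UPDATE RecordingsByAccountaId SET x = 'newX' WHERE AccountId = 'accountId' AND a = 'a1' AND b = 'b1';
UPDATE RecordingsByAccountaId SET x = 'newX' WHERE AccountId = 'accountId' AND a = 'a2' AND b = 'b2';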
P.S. How do you decide which rows you'll update?
I have an SQL database where, among other things, I have a prices table with one price per product per store.
There are 50 stores and over 500,000 products, so this table will easily have 25 to 30 million records.
This table is fed daily overnight with price updates, and has huge read operations during the day. Reads are made with read-only intent.
All queries contain storeid as part of identifying the record to update or read.
I am not yet able to determine how this will behave, since I am expecting an external supply of prices, but I am expecting performance issues at least on read operations, even though indexes are in place for now...
My question is whether I should consider partitioning the table by store, since it is always part of the queries. But then I have indexes where storeid is not the only column that is part of the index.
Based on this scenario, would you recommend partitioning? The alternative I see is having 50 tables, one per store, but that seems painful and better avoided if possible.
whether I should consider partitioning the table by store, since it is always part of the queries
Yes. That sounds promising.
But then I have indexes where storeid is not the only column that is part of the index.
That's fine. So long as the partitioning column is one of the clustered index columns, you can partition by it. In fact with partitioning, you can get partition elimination for a trailing column of the clustered index, then a clustered index seek within the target partition.
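A minimal sketch of what that could look like (the table, column, and object names here are hypothetical, and the boundary values are arbitrary):

-- Partition function and scheme on the store id; one boundary per store is also possible
CREATE PARTITION FUNCTION pfStoreId (int)
    AS RANGE RIGHT FOR VALUES (10, 20, 30, 40);

CREATE PARTITION SCHEME psStoreId
    AS PARTITION pfStoreId ALL TO ([PRIMARY]);

-- The clustered index can keep ProductId first; partitioning is on the trailing StoreId
CREATE CLUSTERED INDEX CIX_Prices
    ON dbo.Prices (ProductId, StoreId)
    ON psStoreId (StoreId);

A query with WHERE ProductId = 123 AND StoreId = 7 then gets partition elimination on StoreId plus a clustered index seek on ProductId within the target partition.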
Hi all, and thank you for your replies.
I was able to generate significant information in a contained environment, where I confirmed that I can achieve excellent performance indicators by using only the appropriate indexes.
So for now we will keep it "as is" and have the partition strategy on hand, just in case.
Thanks again, nice tips guys.
I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamps. Clients will then request data for rows with incorrect version numbers. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if I should create a non-clustered index on my RegionId column and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is a rowversion/timestamp column and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index, since the index will need to be updated at every insert/update/delete?
Since the current approach results in n index scans rather than n index seeks, it might be better to read all the data once, then group by RegionId and fill in empty lists of rows where a RegionId doesn't have any data.
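For comparison, a sketch of the read-everything-once variant (same columns as above, one scan instead of n):

SELECT RegionId, DatabaseId, VersionTimestamp, OperationId
FROM TableX
ORDER BY RegionId; -- group into per-region lists in application code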
The real-life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I have not yet looked at including one-to-many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to use them better. Since I am going to read all the data from the table in any case, it is probably cheaper to load it all at once. However, reading it with the query above makes my code a lot cleaner, at least for this simple no-relationship example.
Edit:
Alternative 2
Another option that came to mind is creating a covering index on RegionId and including my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns with WHERE DatabaseId IN (list, of, databaseIds).
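A sketch of this alternative (the index name and the id list below are hypothetical):

CREATE NONCLUSTERED INDEX [IX_TableX_RegionId_DatabaseId]
    ON [dbo].[TableX] ([RegionId])
    INCLUDE ([DatabaseId]);

-- Second query: fetch the full rows only for the ids that actually need syncing
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE DatabaseId IN (1, 2, 3);

This keeps the volatile VersionTimestamp column out of the covering index, at the cost of a second round trip.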
For the current scenario, there are only thousands of rows in the table at most, not millions. The network traffic for the two (x n) queries would most likely outweigh the benefits of using the indexes, making this premature optimization.
I have over 500 million records in Azure SQL Data Warehouse.
I'm trying to do some benchmarking in order to understand which way to keep the records: rowstore or columnstore.
I'll not join the table with other tables; it's not an analytical fact table.
Both tables are distributed as round robin, both contain 17 partitions, and both have 45 columns.
When I query the sum of two columns, I expect the columnstore table to perform much better than the rowstore; however, the reality is that I get my sum result from the rowstore in around 2.5 minutes and from the columnstore in around 10 minutes. I don't use any filter or group by.
On the other hand, when I query count(*), the columnar table performs much, much better than the rowstore.
EDIT
Though I can't share all the details with you because they're private, here are some of them, just to give an understanding of what's going on.
I run queries on smallrc and 100 DWU.
The table is loaded with one CTAS and contains pre-joined information from several tables, and it is going to serve queries over a custom-defined protocol (sort/group/filter/paging) from our internal application.
The domain is gambling, and of the 45 columns, 43 could be used as filters. The output set usually contains 3 to 4 columns plus two sum columns, with no more than 1000 rows per query.
I partitioned both tables monthly via EventDate, assuming each month gets a new partition. Most of my queries contain EventDate as a filter.
My rowstore table has EventDate as a clustered index, in addition to the partitions, which are the same as for the columnstore.
Adding EventDate as a secondary index on the columnstore gave some improvement, but performance is still far behind the rowstore.
EventDate is in int format and its values follow the pattern yyyyMMdd (e.g. 20180101).
Every DW optimized for elasticity has 60 distributions, and the lower SKUs of DW optimized for compute also have 60 distributions.
SQL Server's columnstore creates row groups based on row count (as opposed to Parquet, for example, where row groups are created based on disk size). Row groups should ideally have 1M rows (see the link that @GregGalloway added), but row groups can get COMPRESSED if they have at least 100k rows loaded in a single bulk load. When a row group is not compressed, it is stored in row format in delta stores (these are regular B-trees, with metadata/access overhead since they are part of the clustered columnstore index; note that you cannot specify their indexing).
I assume that you have 500M rows in 60 distributions, that is 8.3M rows per distribution; assuming your partitioning is homogeneous with 17 partitions you'd have ~490k rows per partition.
When bulk loading into a partitioned table, you need to be careful about the memory requirements/resource class you're loading with, since the sort iterator on top of the bulk load does not spill, so it will feed the bulk load only as many rows as it can sort.
Ensure that your index has good quality. If you'll only do aggregates over the table without much filtering, then 1 partition is ideal; even if you do filtering, remember that columnstore does segment elimination, so if your data is loaded in the right order you'll be fine.
You should ensure that you have at least a few million rows per partition and COMPRESSED row groups to get good performance. Given your scan results, you have most if not all of your columnstore data in OPEN row groups (delta stores).
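As a quick check, a sketch against the row-group catalog view (this aggregates over all columnstore tables in the database; filter via the PDW mapping views to narrow it to one table):

-- OPEN/CLOSED row groups are delta stores; COMPRESSED is the columnar format you want
SELECT state_desc, COUNT(*) AS row_group_count, SUM(total_rows) AS total_rows
FROM sys.pdw_nodes_column_store_row_groups
GROUP BY state_desc;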
What do you mean by much better performance in the case of count(*)?
Also, were these runs cold or warm? If it's a warm run, for count(*) the columnstore might just be grabbing the row-group metadata and adding up the row counts - though in both cases the compiled plans show a full table scan.
I got confused when learning the concept of indexes.
For example, I have this simple query:
select productId,productName from product where productId='11107' and productName='Watch';
product is a very large table; productId and productName are two attributes of the product table, and 11107 and Watch are two values.
I consider a primary index on productId and a secondary index on productName, assuming that 1000 records satisfy the condition productId='11107', 50 records satisfy the condition productName='Watch', each datapage can store 100 records, and the cost of a random I/O is 10 times that of a sequential I/O.
Now, which of the two indexes will be used to evaluate this query?
Solution:
As per my understanding, it should be the primary index, because the primary index attribute "productId" returns multiple records (say 1000 here) compared to the secondary index attribute "productName", which returns only 50 records.
Also, as each datapage stores 100 records, for the primary index we need 10 pages and for the secondary index 1 page.
As the table "product" is very large, only the smaller set of records (say 50) satisfies the condition for sequential access (records are scanned one at a time).
Is my evaluation correct, or does anything need to be added? Any suggestions?
There are many things that Oracle considers while evaluating the best execution plan:
Type of index (unique, normal, etc.)
The statistics below, from the dba_indexes view:
LEAF_BLOCKS
DISTINCT_KEYS
CLUSTERING_FACTOR
NUM_ROWS
For example, for an equality condition, Oracle gives a lower cost to the index whose DISTINCT_KEYS is closer to NUM_ROWS.
In your case, assuming both are normal indexes and all statistics are current, the index which has more distinct keys might be preferred over the other.
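A sketch for pulling those statistics for both indexes (dba_indexes is the standard view; the table name matches the question):

SELECT index_name, uniqueness, leaf_blocks, distinct_keys, clustering_factor, num_rows
FROM dba_indexes
WHERE table_name = 'PRODUCT';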
If I understand you correctly, it seems like your basic logic is backwards. You seem to be saying that the primary index will be used because it will return more rows, which is the opposite of the basic rule of thumb - the more selective index is generally preferred.
However, another potential flaw in your logic is here:
as each datapage stores 100 records then for primary index we need 10 pages and for secondary index 1 page
You should say "between 10 and 1000" and "between 1 and 50". Just because n records can fit into a single "datapage" (or block, in Oracle terminology) does not mean that any n records you are looking for will actually be in the same block. In your example, 10 blocks is the minimum to hold 1000 rows; but it is possible that the 1000 rows for the given productId are actually in 1000 different blocks. (Assuming that the table is at least 1000 blocks in size.)
The question is not really about how many rows each index would return ("row selectivity"), but how many different blocks those rows are in ("block selectivity"). The optimizer uses the CLUSTERING_FACTOR value for each index to estimate how closely row and block selectivity match up with one another; a low clustering factor generally means better block selectivity.
Going a bit beyond the scope of your question, it is also entirely possible that the optimizer would use neither index, or both.
At some point, the effort required to scan the index (which also requires I/O) then read the corresponding table blocks can be more than the effort required to simply read the entire table. Again, CLUSTERING_FACTOR and other statistics figure into this decision.
In some cases, including your example, it is also possible that the optimizer will choose to do scans on both indexes and join the resulting index entries by the ROWID values, without ever accessing the table blocks at all. This is possible because the query only uses the columns that are in the indexes; if you added another column to your select list, the query would have to read the table blocks to get that data.
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size, and I've researched and evaluated the solutions I've come across, but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partition scheme that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course, you'll have to compromise.
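For example, a sketch of reading those usage statistics (standard DMVs, scoped to the current database):

-- Seeks vs scans vs lookups vs updates per index
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID();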
Also don't forget that partitioning is per index (in this sense a 'table' is just one of its indexes), not per table, so the question is not what to partition on, but which indexes to partition or not, and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates, obviously (there is not much sense in partitioning just a non-clustered index and not partitioning the clustered one), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year'), the most natural partition is the sliding window.
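A minimal sliding-window sketch on Year (object names are hypothetical):

-- One partition per year
CREATE PARTITION FUNCTION pfCaseYear (int)
    AS RANGE RIGHT FOR VALUES (2006, 2007, 2008);

CREATE PARTITION SCHEME psCaseYear
    AS PARTITION pfCaseYear ALL TO ([PRIMARY]);

-- Slide the window: mark the filegroup for the new partition, then split in the new year
ALTER PARTITION SCHEME psCaseYear NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfCaseYear() SPLIT RANGE (2009);

Old years are removed with ALTER TABLE ... SWITCH PARTITION followed by MERGE RANGE.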
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on your UniqueIdentifier, you might need to start allocating ids manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single query result, as sketched below.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
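A sketch of the fan-out query (only three of the ten tables shown; the filter values are hypothetical):

SELECT * FROM Case00 WHERE Year = 2008 AND Status = 'Open'
UNION ALL
SELECT * FROM Case01 WHERE Year = 2008 AND Status = 'Open'
-- ... Case02 through Case08 ...
UNION ALL
SELECT * FROM Case09 WHERE Year = 2008 AND Status = 'Open';

UNION ALL is enough here, because the modulo split guarantees each record lives in exactly one table.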
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance through different choices in normalization, denormalization, or partial normalization? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.