Cassandra Partition Key and Clustering Column Size - database

How does Cassandra calculate the size of the partition key and clustering key? We have tables with relatively large partition keys (a UUID, or a combination of UUIDs) along with large clustering keys, for example:
mydb/parent/6E219A7E21044B48B8816B931925CCDB/child1/29E6E709854D49CFAC72ECD5E1AEBFA3/
mydb/parent/6E219A7E21044B48B8816B931925CCDB/child2/29E6E709854D49CFAC72ECD5E1AEBFA4/
mydb/parent/6E219A7E21044B48B8816B931925CCDB/child3/29E6E709854D49CFAC72ECD5E1AEBFA5/
Here the partition key is 6E219A7E21044B48B8816B931925CCDB
and the clustering column is /child1/29E6E709854D49CFAC72ECD5E1AEBFA3/.
We have child levels nested down to an nth level (right now we go up to 100 levels).
Does having large keys have a performance impact when we have a large amount of data (~300 million rows), and what will the impact be on disk usage?

Having a large partition key or clustering key is not an issue. It has no impact on performance.
The only thing you should avoid is having large partitions. For example, in your case you have 100 rows in a single partition, so if the size of all rows combined is within 10 MB (the ideal size of a Cassandra partition is 10 MB or less, with a maximum of 100 MB), then you are doing fine. You can refer to this link for calculating your partition size.
If your partition size is large, then you have to refine your data model to reduce it. The following are some of the techniques generally applied for reducing partition size:
Bucketing - introduce a number as part of your partition key. This is generally applied to time-series data (more can be read here); see the sketch after this list.
Introducing another column from your table as part of the partition key.
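As a rough illustration of bucketing (the keyspace, table, and column names below are hypothetical, not taken from the question), a bucket number computed by the application can be added to the partition key so that one logical parent is spread across several smaller physical partitions:
-- Hypothetical sketch: spread one parent's children over a fixed number of buckets
CREATE TABLE mydb.children_by_parent (
    parent_id uuid,
    bucket    int,      -- e.g. hash(child_path) % 10, computed by the application
    path      text,     -- clustering column such as /child1/<uuid>/
    payload   text,
    PRIMARY KEY ((parent_id, bucket), path)
);

-- Reads for one parent then fan out across the fixed set of buckets:
SELECT * FROM mydb.children_by_parent WHERE parent_id = ? AND bucket = ?;
Each physical partition stays smaller, at the cost of issuing one query per bucket when the whole parent is needed.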

Related

Is it costly to read one full partition from Cassandra?

Let's say I have one table:
RecordingsByAccountaId (AccountId, a, b, c, x, y, z)
Partition key: AccountId
Clustering key: a, b
I need to fetch the data for one account inside my code, so I am performing:
Select * from RecordingsByAccountaId where accountId = 'accountId';
Is this a costly operation?
The objective is to update 2-3 rows of this table, but I don't have any information other than the accountId.
Is querying one row almost the same as querying the whole partition? The difference I measured between fetching 200 rows and fetching one row was only 20-30 milliseconds.
It mostly depends on the size of your partition - how many rows it includes. Another factor is how fragmented your partition is - whether it is located in a single SSTable (it has been compacted) or spread across multiple SSTables, in which case you will read data from multiple files.
But usually, reading a partition inside a single file is a sequential operation, as all rows that belong to the same partition are written sequentially, and if the partition size isn't very big, performance shouldn't suffer dramatically (though this may depend on your hardware as well).
P.S. How do you decide which rows you'll update?
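Regarding the update itself, here is a minimal sketch assuming the schema from the question (the column x being updated and the values 'a1' and 'b1' are hypothetical). A Cassandra UPDATE must name the full primary key, so the usual pattern is to read the partition, decide in application code which rows to change, and then update each row by its complete key:
-- Read the whole partition, pick the 2-3 rows to change in application code...
SELECT * FROM RecordingsByAccountaId WHERE AccountId = 'accountId';

-- ...then update each chosen row by partition key + clustering columns:
UPDATE RecordingsByAccountaId SET x = 'new value'
WHERE AccountId = 'accountId' AND a = 'a1' AND b = 'b1';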

SQL - define keys to table

Are there any considerations when defining keys for a table that already has a lot of records, and where most of the operations performed on it are inserts?
Key definition ultimately comes down to how you can uniquely and efficiently identify any specific row in a table. If a business key value fulfills that requirement, then it is a suitable candidate. An ideal key is also skinny. A GUID is horrible for this (IMHO) because it is far larger than it needs to be.
If insert performance is the most important priority and a suitable business key is not available, you can use an integer based identity key. If you expect more than 2.1 billion records within a few years, use bigint (9 quintillion records) instead.
Keep in mind that every index you make on the table will always include the PK. Having a skinny PK can make your indexes more efficient, using less storage, memory and CPU.
Insert speed is affected by the clustered index sort order as well as the number and sort order of all non-clustered indexes on the table. Column-store indexes are not sorted and have minimal overhead on inserts.
A PK that stores a business ID number is heavier than an auto-increment number, so when you define the key, keep in mind that it is better to create a separate auto-increment column to serve as the PK.
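As a minimal sketch of the advice above (the table and column names are made up for illustration), a skinny auto-increment surrogate key alongside the business key might look like this:
-- Hypothetical example: narrow surrogate PK, business key kept as a separate unique column
CREATE TABLE dbo.Orders (
    OrderId     bigint IDENTITY(1,1) NOT NULL,  -- narrow, ever-increasing surrogate key
    BusinessRef varchar(40)          NOT NULL,  -- business identifier, still enforced as unique
    CreatedAt   datetime2            NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId),
    CONSTRAINT UQ_Orders_BusinessRef UNIQUE (BusinessRef)
);
Every nonclustered index on the table then carries the 8-byte OrderId rather than a wide business key or GUID.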

UUID Primary Key in Postgres, What Insert Performance Impact?

I am wondering about the performance impact of using a non-sequential UUID as the primary key in a table that will become quite large, in PostgreSQL.
In DBMS's that use clustered storage for table records it is a given that using a UUID is going to increase the cost of inserts due to having to read from disk to find the data page into which to perform the insert, once the table is too big to hold in memory. As I understand it, Postgres does not maintain row clustering on inserts, so I imagine that in Postgres using a UUID PK does not hurt the performance of that insert.
But I would think that it makes the insert into the index that the primary key constraint creates much more expensive once the table is large, because it will have to constantly be read from disk to update the index on insertion of new data. Whereas with a sequential key the index will only be updated at the tip which will always be in memory.
Assuming that I understand the performance impact on the index correctly, is there any way to remedy that or are UUIDs simply not a good PK on a large, un-partitioned table?
As I understand it, Postgres does not maintain row clustering on inserts
Correct at the moment. Unfortunately.
so I imagine that in Postgres using a UUID PK does not hurt the performance of that insert.
It still does have a performance cost because of the need to maintain the PK, and because the inserted tuple is bigger.
The uuid is 4 times as wide as a typical 32-bit integer synthetic key, so the row to write is 12 bytes bigger and you can fit fewer rows into a given amount of RAM
The b-tree index that implements the primary key will be 4x as large (vs a 32-bit key), taking longer to search and requiring more memory to cache. It also needs more frequent page splits.
Writes will tend to be random within indexes, not appends to hot, recently accessed rows
is there any way to remedy [the performance impact on the index] or are UUIDs simply not a good PK on a large, un-partitioned table?
If you need a UUID key, you need a UUID key. You shouldn't use one if you don't require one, but if you cannot rely on a central source of synthetic keys and there is no suitable natural key to use, it's still the way to go.
Partitioning won't help much unless you can confine writes to one partition. Also, you won't be able to usefully use constraint exclusion on searches for the key if writing only to one partition at a time, so you'll still have to search all the partitions' indexes for a key when doing queries. I can only see it being useful if your UUID forms part of a composite key and you can partition on the other part of the composite key.
It should be mentioned that you will get more WAL generated if you have a B-tree index on a UUID column with the full_page_writes option enabled. This happens because of UUID randomness - the values are not sequential, so each insert is likely to touch a completely new leaf index page. You can read more in the On the impact of full-page writes article.
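For comparison, here is a minimal sketch (with hypothetical table names) of the two key choices discussed above in PostgreSQL - a sequential bigint identity key versus a random UUID generated with gen_random_uuid() (available from the pgcrypto extension, or built in since PostgreSQL 13):
-- Sequential surrogate key: the PK index grows at its right-hand edge and stays cache-friendly
CREATE TABLE events_seq (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload text
);

-- Random UUID key: inserts land on effectively random leaf pages of the PK index
CREATE TABLE events_uuid (
    id      uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    payload text
);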

mysql 5.1 partitioning - do I have to remove the index/key element?

I have a table with several indexes. All of them contain a specific integer column.
I'm moving to MySQL 5.1 and am about to partition the table by this column.
Do I still have to keep this column as a key in my indexes, or can I remove it, since partitioning will take care of searching only the relevant partition's data efficiently without needing to specify it as a key?
The partition field must be part of the index, so the answer is that I have to keep the partitioning column in my index.
Partitioning will only slice the values/ranges of that index into separate partitions according to how you set it up. You'd still want to have indexes on that column so the index can be used after partition pruning has been done.
Keep in mind that the number of distinct values has a big impact on how many partitions you can have: if you have an integer column with only 4 distinct values in it, you might create 4 partitions, and an index would likely not benefit you much, depending on your queries.
If you have 10,000 distinct values in your integer column, you will hit system limits if you try to create 10k partitions - you'll have to partition on larger ranges (e.g. 0-1000, 1001-2000, etc.), and in such a case you'll benefit from an index (again, depending on how you query the tables).
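As a rough sketch of range partitioning on an integer column in MySQL (the table and column names are hypothetical), note that the partitioning column must be part of every unique key on the table, including the primary key:
-- Hypothetical example: RANGE partitioning on an integer column
CREATE TABLE measurements (
    id        INT NOT NULL,
    sensor_id INT NOT NULL,
    reading   DOUBLE,
    PRIMARY KEY (id, sensor_id),   -- partitioning column must appear in every unique key
    KEY idx_sensor (sensor_id)     -- still useful after partition pruning
)
PARTITION BY RANGE (sensor_id) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN (2000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);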

Cluster the index on ever-increasing datetime column on logging table?

I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
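For concreteness, a minimal T-SQL sketch of that plan (the table, column, and index names below are made up for illustration, not from the question) could look like:
-- Hypothetical log table: clustered index on the ever-increasing timestamp
CREATE TABLE dbo.AppLog (
    LoggedAt datetime2      NOT NULL,
    Severity tinyint        NOT NULL,
    Message  nvarchar(4000) NULL
);

CREATE CLUSTERED INDEX CX_AppLog_LoggedAt
    ON dbo.AppLog (LoggedAt) WITH (FILLFACTOR = 100);

-- Nonclustered indexes on other columns used as query criteria
CREATE NONCLUSTERED INDEX IX_AppLog_Severity
    ON dbo.AppLog (Severity);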
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (about in the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
I agree with putting the clustered index on the timestamp column. My query would be on the fill factor - 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fill factor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fill factor.
Finally, yes, put nonclustered indexes on other appropriate columns, but only ones that are reasonably selective, e.g. not bit fields. But remember: the more indexes you have, the more write performance is affected.
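A short sketch of that maintenance step, assuming the hypothetical dbo.AppLog table from the earlier example:
-- Rebuild after the bulk delete; REBUILD also reapplies each index's fill factor
ALTER INDEX ALL ON dbo.AppLog REBUILD;

-- Refresh optimizer statistics for the table
UPDATE STATISTICS dbo.AppLog;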
There are two "best practice" ways to index a high-traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
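For illustration, a minimal sketch of both options (the table names are hypothetical):
-- Option 1: integer identity column as the clustered primary key
CREATE TABLE dbo.LogEntries_Int (
    LogId    bigint IDENTITY(1,1) NOT NULL,
    LoggedAt datetime2      NOT NULL,
    Message  nvarchar(4000) NULL,
    CONSTRAINT PK_LogEntries_Int PRIMARY KEY CLUSTERED (LogId)
);

-- Option 2: uniqueidentifier primary key with a sequential default
CREATE TABLE dbo.LogEntries_Guid (
    LogId    uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    LoggedAt datetime2      NOT NULL,
    Message  nvarchar(4000) NULL,
    CONSTRAINT PK_LogEntries_Guid PRIMARY KEY CLUSTERED (LogId)
);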
The obvious answer is that it depends on how you will query it. The point of an index is to lessen the number of comparisons when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64 KB block with one read). If you include an ID and a datetime as the primary key but do not use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk loading data.
