Cassandra Table Compaction for time-series data

My program generates a large amount of time-series data into the following table:
CREATE TABLE AccountData
(
    PartitionKey text,
    RowKey text,
    AccountId uuid,
    UnitId uuid,
    ContractId uuid,
    Id uuid,
    LocationId uuid,
    ValuesJson text,
    PRIMARY KEY (PartitionKey, RowKey)
)
WITH CLUSTERING ORDER BY (RowKey ASC);
The PartitionKey is a dictionary value (one of 10) and the RowKey is a DateTime converted to a long.
Now, due to the huge amount of data being generated by the program, every ContractId has its own retention policy in the code. The code deletes old data based on the retention period for that specific ContractId.
I am now running into problems where a SELECT statement reads too many tombstones and fails with an error.
What Table Compaction strategy should I use to solve this Tombstone problem?

PartitionKey is a dictionary value (one of 10)
I think this is likely your problem. Basically, all of the data in the cluster is ending up on just 10 partitions, and those are going to get extremely large as time progresses. In general, you want to keep your partitions between 1 MB and 10 MB in size; the lower, the better.
I would recommend splitting the partition up. If it's time related, take a time unit which makes the most sense to your query pattern. For example, if most of the queries are month-based, perhaps something like this might work:
PRIMARY KEY ((month,PartitionKey),RowKey)
That will create a partition for each combination of month and the current PartitionKey.
Likewise, most time series use cases tend to query the most recent data more often. To that end, it usually makes sense to sort data in the partitions by time, in descending order. That is, of course, if RowKey is indeed a date/time value.
WITH CLUSTERING ORDER BY (RowKey DESC)
Also, a nice little side-effect of this model is that any old data which is tombstoned is now at the "bottom" of the partition. So, depending on the delete patterns, tombstones will still exist; but if the data is clustered in descending order, the tombstones are rarely (if ever) queried.
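Putting both suggestions together, the remodeled table might look something like this. This is only a sketch: the month column, its int type, and its bucket format are assumptions, so pick whatever time unit actually matches your query pattern.
CREATE TABLE AccountData
(
    Month int,           -- time bucket, e.g. 202401 for January 2024 (format is an assumption)
    PartitionKey text,
    RowKey text,
    AccountId uuid,
    UnitId uuid,
    ContractId uuid,
    Id uuid,
    LocationId uuid,
    ValuesJson text,
    PRIMARY KEY ((Month, PartitionKey), RowKey)
)
WITH CLUSTERING ORDER BY (RowKey DESC);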
What Table Compaction strategy should I use to solve this Tombstone problem?
So I do not believe that simply changing the compaction strategy will be the silver bullet that solves this problem. That being said, I suggest looking into the TimeWindowCompactionStrategy (TWCS). That compaction strategy groups its SSTable files by a designated time period (window), which helps prevent files full of old, obsoleted, or tombstoned data from being queried.
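For reference, switching an existing table over is a single statement; the window unit and size below are only illustrative and should be tuned to your retention and delete pattern.
ALTER TABLE AccountData
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};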

Related

Cassandra: How to use 'ORDER BY' without PRIMARY KEY restricted by EQ or IN?

I have a table in Scylla (a Cassandra compatible database) defined as the following:
create table s.items (time timeuuid, name text, primary key (time));
I want to run a query that gets all items after a certain time, similar to the following:
select * from s.items where time>7e204790-43bf-11e9-9759-000000000004 order by time asc;
But I am told that ORDER BY is only supported when the partition key is restricted by an EQ or an IN. To get around this I can make a table and query similar to the following:
create table s.items (yes boolean, time timeuuid, name text, primary key (yes, time));
select * from s.items where yes=true and time>7e204790-43bf-11e9-9759-000000000004 order by time asc;
While this works, it doesn't seem like the best solution. As I'm fairly new to Scylla and CQL, is there a better/proper way to do this?
Your solution of adding that one boolean key and always setting it to true basically creates one huge partition with all your data. This is rarely what you really want. If this one partition holds your entire data set, it means that even if you have a 10-node cluster with 8 CPUs on each node, just 3 CPUs out of all 80 in your cluster will be doing any work (because each partition belongs to a certain CPU, and with RF=3 there are three replicas).
If you're wondering why your original solution didn't work and Scylla refused the "ORDER BY": the problem is that although Scylla can scan the entire table to look for entries after time X (you'll need to add 'ALLOW FILTERING' to the query), it has no efficient way to sort what it finds by time. Internally, the different partitions are not sorted by the partition key, but rather by a "token", a hash function of the partition key. This hashing, with its randomizing effect, is important for balancing the load between all CPUs in the cluster, but it prevents Scylla (or Cassandra) from reading the partitions in the original key order.
One thing you can do is what Alex suggested above, which is a middle ground between your original setup and your proposed solution: don't have one item per partition, or all the items in a single partition, but something in the middle. For example, imagine that in your workload you collect 100MB of data every day. So you use the day number as the partition key (instead of your bool). All the data of one particular day will sit in one partition. Inside each day's partition, the different entries (rows) will be sorted by the clustering-key order, which will be time. With this setup, to retrieve all the items after some specific day, just query each individual day, one by one: query day 134, then day 135, then day 136, and so on. Inside each day, the results will already be sorted. So problem solved.
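A minimal sketch of that idea (the day column and the example day value are made up for illustration):
create table s.items (day int, time timeuuid, name text, primary key (day, time));

-- query one day-partition at a time; within each day the rows are already sorted by time
select * from s.items where day = 134 and time > 7e204790-43bf-11e9-9759-000000000004 order by time asc;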
This technique is a fairly well-known "time series" data modeling pattern. Scylla (and Cassandra) even have a special compaction strategy tuned for this kind of modeling: TWCS (time-window compaction strategy).

CQL data model to bypass secondary index issue

I have a model which looks like this:
StateChange:
    row_id
    group_name
    timestamp
    user_id
I aim to query as follows:
Query 1 = Find all state changes with row_id = X ORDER BY Timestamp DESC
Query 2 = Find all state changes with row_id = X and group_name = Y ORDER BY Timestamp DESC
Using my limited CQL knowledge, the only way to do so was to create 2 query tables, one for each query mentioned above.
For query 1:
CREATE TABLE state_change (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY (row_id, timestamp)
);
For query 2:
CREATE TABLE state_change_by_group_name (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY ((row_id, group_name), timestamp)
);
This does solve the problem but I have duplicated data in Cassandra now.
Note: Creating a group_name index on the table works, but then I cannot ORDER BY timestamp anymore when querying through the secondary index.
Looking for a solution which requires only one table.
The solution you're looking for does not exist. Two different queries require two different tables (or at least a secondary index, which creates a table under the hood). Denormalization is the norm in Cassandra, so you should not think of data duplication as an anti-pattern -- indeed, it's the suggested pattern.
Carlo is correct in that your multiple table solution is the proper approach here.
This does solve the problem but I have duplicated data in Cassandra now.
...
Looking for a solution which requires only one table.
Planet Cassandra recently posted an article on this topic: Escaping From Disco-Era Data Modeling
(Full disclosure: I am the author)
But two of the last paragraphs really address your point (especially the last sentence):
That is a very 1970s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte's worth of storage space, you could still expect to spend around a million dollars. Today (2014), you can buy a terabyte drive for sixty bucks. Disk space is cheap; operation time is the expensive part. And overuse of secondary indexes will increase your operation time.
Therefore, in Cassandra, you should take a query-based modeling approach. Essentially (Patel, 2014), model your column families according to how it makes sense to query your data. This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data. Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that's OK.

Clustered index considerations with regard to distinct values and large result sets, and a single vertical table for auditing

I've been researching best practices for creating clustered indexes, and I'm trying to fully understand these two suggestions that are listed in pretty much every blog or article on the matter:
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem slightly contradictory, or maybe it just depends on how you're accessing the table, or my interpretation of what "large result sets" means is wrong.
Unless you're doing range queries over the clustered column, it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered index to the PK, you're rarely going to fulfill the large-result-set suggestion, though of course it does satisfy the large number of distinct values.
To give the question a little more context: this question stems from a vertical auditing table we have that has a column for TABLE. Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non-distinct. Each result set of table names is rather large, which seems to fulfill that second condition, but it's definitely not largely unique. Which means all that other stuff happens with having to add the 4-byte uniqueifier, which makes the table a lot larger, etc.
This situation has come up a few times for me when I've come upon DBs that have, say, all the contacts or some accounts normalized into a single table, separated only by a TYPE column that is on every query.
In the case of the audit table, the queries are typically not that exciting either; they are just sorted by date modified, sometimes filtered by column, the user that made the change, etc.
My other thought with this auditing scenario was to just make the auditing table a heap so that inserting is fast and there's no contention between the tables being audited, and then to generate indexed views over the data.
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable; it is essentially wasted space, given that the order of your rows within the non-unique groupings is not defined, so trying to narrow down to a single row is still a range operation.
As is mentioned everywhere, the clustered index is the physical ordering of the data, so you want to cater to what needs the best I/O. This also relates to the point directly above: non-unique clustered indexes have an order, but if the data is truly non-unique (as opposed to unique data that is merely missing the UNIQUE keyword on the index), then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being kept only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also be ordering on the Date, so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster, as the data is already both grouped (TableName) and ordered (Date) as the queries expect. If you put Date first, the data is still ordered properly, but the rows needed to satisfy the query are likely spread out across the data file(s), which would require more I/O to get. And more data pages to satisfy the same query means more pages in the buffer pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to include the Date field in all queries, as any queries using only TableName (and possibly other filters but NOT the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName first.
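As a concrete sketch of that suggestion (the table and column names AuditLog, TableName, DateModified, and UserIDModified are assumptions based on the description above):
CREATE UNIQUE CLUSTERED INDEX CX_AuditLog
    ON dbo.AuditLog (TableName, DateModified, UserIDModified);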
I would be wary of the heap plus indexed view model. Yes, it might be optimized for the inserts, but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again, you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.

Using a meaningless ID as my clustered index rather than my primary key

I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
    [CampaignID] int NOT NULL,
    [ZipCode] int NOT NULL,
    [ReportDate] date NOT NULL,
    [PerformanceMetric1] int NOT NULL,
    [PerformanceMetric2] int NOT NULL,
    [PerformanceMetric3] int NOT NULL,
    and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional clustered key will also mean your inserts will be faster. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
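A rough sketch of that layout, using the columns from the question (constraint and index names are made up, and the metric columns are abbreviated):
CREATE TABLE dbo.Zip_Perf_by_Day (
    PerformanceID      int IDENTITY(1,1) NOT NULL,
    CampaignID         int NOT NULL,
    ZipCode            int NOT NULL,
    ReportDate         date NOT NULL,
    PerformanceMetric1 int NOT NULL,  -- remaining metric columns as in the question
    CONSTRAINT PK_Zip_Perf_by_Day PRIMARY KEY CLUSTERED (PerformanceID),
    CONSTRAINT UQ_Zip_Perf_by_Day UNIQUE (CampaignID, ZipCode, ReportDate)
);

-- nonclustered indexes to support filtering on single fields of the natural key
CREATE NONCLUSTERED INDEX IX_ZipPerf_ReportDate ON dbo.Zip_Perf_by_Day (ReportDate);
CREATE NONCLUSTERED INDEX IX_ZipPerf_ZipCode    ON dbo.Zip_Perf_by_Day (ZipCode);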
On the information given, (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive, but given the sizes I would expect for this table (thousands or tens of thousands rather than millions of rows per day) that seems unlikely to be an issue.
If I understood all you have written correctly, you are opting out of natural clustering due to fragmentation penalties.
For this purpose you are considering a meaningless ID, which will:
avoid insert penalties for the clustered index when inserting out-of-order batches (great for write performance)
guarantee that your data is scattered with respect to reads that put conditions on the natural key (not so good for read performance)
JNK points out that fragmentation can be a real issue; however, you need to establish a baseline against which you will measure, and you need to establish whether reading or writing is more important to you (or how important each is in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.

What's your approach for optimizing large tables (+1M rows) on SQL Server?

I'm importing Brazilian stock market data into a SQL Server database. Right now I have a table with price information for three kinds of assets: stocks, options, and forwards. I'm still on 2006 data and the table already has over half a million records. I have 12 more years of data to import, so the table will exceed a million records for sure.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL
Now, my second approach for optimization was to make a clustered index. Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
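In DDL terms, that corresponds to something like the following (the table name here is just a placeholder):
ALTER TABLE dbo.DailyQuotes
    ADD CONSTRAINT PK_DailyQuotes PRIMARY KEY CLUSTERED (Stock, [Date]);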
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Now for a third approach I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter too much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
Another approach would be using the partitioning feature of SQL Server. I don't know much about it, but I think it's normally used when tables are large and you can span them across multiple disks to reduce I/O latency, am I right? Would partitioning be at all helpful in this case? I believe I can partition the newest values (latest years) and the oldest values into different tables. The probability of seeking the newest data is higher, and with a small partition it will probably be faster, right?
What would be other good approaches to make this as fast as possible? The main SELECT usage of the table will be for seeking a specific range of records for a specific asset, like the latest 3 months of asset X. There will be other usages, but this will be the most common, possibly executed by more than 3k users concurrently.
At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.
Now, my second approach for optimization was to make a clustered index. Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
It's logically true - the clustered index defines the logical ordering of the records on disk, which is all you should be concerned about. SQL Server may forgo the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case, and the optimizer doesn't benefit much from unordered data within a page read.
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.
Now for a third approach I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter too much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
No! This kind of optimization is so premature it's probably stillborn.
Another approach would be using the partition feature of SQL Server.
Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.
What would be other good approaches to make this as fast as possible?
The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.
A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?
As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.
I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.
Check your execution plan on that query first. Make sure your indexes are being used. I've found that a million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it, and our entire query, which joined tons of tables and did lots of calculations, could run in under 200 ms. We found that on a quad-proc 64-bit server we could have significantly more records, so we never bothered partitioning.
You can use SQL Profiler to see the execution plan, or just run the query from SQL Server Management Studio or Query Analyzer.
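For example, a quick way to see the I/O and timing of a query from Management Studio (the table name and filter values below are placeholders, not the asker's actual schema):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT [Date], [Open], [High], [Low], [Close], Trades, Quantity, Volume
FROM dbo.DailyQuotes
WHERE Stock = 42
  AND [Date] BETWEEN '2006-01-01' AND '2006-03-31';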
Re-evaluate the indexes... that's the most important part. The size of the data doesn't really matter; well, it does, but not entirely for speed purposes.
My recommendation is to rebuild the indexes for that table, making a composite one for the columns you'll need the most. Now that you have only a few records, play with the different indexes; otherwise it'll get quite annoying to try new things once you have all the historical data in the table.
After you do that, review your query, make the query plan evaluator your friend, and check whether the engine is using the right index.
I just read your last post, and there's one thing I don't get: you are querying the table while you insert data, at the same time? What for? By inserting, do you mean one record or hundreds of thousands? How are you inserting? One by one?
But again, the key here is the indexes; don't mess with partitioning and such yet, especially with a million records. That's nothing; I have tables with 150 million records, and returning 40k specific records takes the engine about 1500 ms...
I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 million records large. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date, which keeps all of a student's records ordered by that. We've taken a hit on inserts to this table in the event that an old record for a student is inserted, but it is a worthwhile risk for our purposes.
With regards to select speed, I would certainly take advantage of caching in your circumstance.
You've mentioned that your primary key is compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to go into the middle of the table, and this can cause the other rows to be pushed out to other pages (page splits).
I would recommend trying to reverse the primary key to (Date, Stock), and adding a non-clustered index on Stock to facilitate quick lookups for a specific Stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in order of date), won't affect the rest of the table, and lowers the chance of page splits.
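A sketch of that arrangement using the columns from the question (the table name is an assumption):
CREATE TABLE dbo.DailyQuotes (
    Stock    int NOT NULL,
    [Date]   smalldatetime NOT NULL,
    [Open]   smallmoney NOT NULL,
    [High]   smallmoney NOT NULL,
    [Low]    smallmoney NOT NULL,
    [Close]  smallmoney NOT NULL,
    Trades   int NOT NULL,
    Quantity bigint NOT NULL,
    Volume   money NOT NULL,
    CONSTRAINT PK_DailyQuotes PRIMARY KEY CLUSTERED ([Date], Stock)
);

-- non-clustered index for quick lookups on a specific stock
CREATE NONCLUSTERED INDEX IX_DailyQuotes_Stock ON dbo.DailyQuotes (Stock);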
The execution plan shows it's using the clustered index just fine, but I forgot an extremely important fact: I'm still inserting data! The insert is probably locking the table too often. Is there a way we can see this bottleneck?
The execution plan doesn't seem to show anything about lock issues.
Right now this data is only historical; when the importing process is finished the inserts will stop and become much less frequent. But I will soon have a larger table for real-time data that will suffer from this constant-insert problem and will be bigger than this table. So any approach to optimizing this kind of situation is very welcome.
Another solution would be to create a historical table for each year, put all these tables in a historical database, fill them all in, and then create the appropriate indexes for them. Once you are done with this you won't have to touch them ever again. Why would you have to keep on inserting data? To query all those tables you just "union all" them :p
The current-year table should be very different from these historical tables. From what I understood, you are planning to insert records on the go? I'd plan something different, like doing a bulk insert or something similar every now and then during the day. Of course, all this depends on what you want to do.
The problem here seems to be in the design. I'd go for a new design; the one you have now, from what I understand, is not suitable.
Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
Indexes in SQL Server are always sorted by column order in index. So an index on [stock,date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.
When doing a query, you should always include the first column(s) of an index in the WHERE clause, or else the index cannot be used efficiently.
For your specific problem: if date-range queries for stocks are the most common usage, then make the primary key [date, stock], so the data will be stored sequentially by date on disk and you should get the fastest access. Build up other indexes as needed. Do an index rebuild/statistics update after inserting lots of new data.
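For the maintenance step after a large load, something along these lines (the table name is again a placeholder):
ALTER INDEX ALL ON dbo.DailyQuotes REBUILD;
UPDATE STATISTICS dbo.DailyQuotes WITH FULLSCAN;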
