I am modelling data for CrateDB.
I have an average of 400 customers, and they produce different amounts of time-series data every day (between 5K and 500K rows; avg. ~15K).
Later I should be able to query per customer_year_month and per customer_year_calendar_week.
That means that I will only query for two intervals: week and month.
Now I'm asking myself how to partition this table.
I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year and month?
So, the question of how to partition a table is quite complex and should take a lot of things into account, among others:
What queries should be run?
The way the data is inserted
Available hardware resources
Cluster size
Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot.
So in your case, ~15k inserts a day on average is not too much. However, the distribution of inserts might cause problems: a customer whose partition grows by 500k inserts a day will run into performance problems much earlier than one adding 5k a day. As a consequence, I would use weekly partitioning only.
create table "customer-logging" (
  customer_id long,
  log string,
  ts timestamp,
  -- generated column: each row gets the start of its calendar week
  week as date_trunc('week', ts)
) clustered into 8 shards    -- 8 shards per partition
  partitioned by (week)      -- one partition per calendar week
Please only use 8 shards if you have an appropriate number of CPU cores ;)
Docs: date_trunc(), partitioned tables
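For illustration, here is roughly how the two query patterns look against this table (hypothetical customer id and dates). Filtering on the generated week column restricts a weekly query to a single partition, and because week is derived from ts, CrateDB can also narrow a monthly ts range down to the handful of partitions overlapping that month:

-- one customer, one calendar week (a single partition)
select count(*) as entries
from "customer-logging"
where customer_id = 42
  and week = '2023-05-08';

-- one customer, one calendar month
select count(*) as entries
from "customer-logging"
where customer_id = 42
  and ts >= '2023-05-01' and ts < '2023-06-01';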
Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)
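A quick way to spot such a shard is to query sys.shards directly; a sketch (adjust the table name, and check the sys.shards docs for your version's exact columns):

-- largest shards first, with their partition and node
select table_name, partition_ident, node['name'] as node_name, size
from sys.shards
where table_name = 'customer-logging'
order by size desc
limit 20;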
Cheers, Claus
Related
I am very new to Cassandra, I have worked with Oracle SQL and Mongo DB and I am trying to learn Apache Cassandra to use it in a project I am working on.
I have a certain number of sensors (let's say 20) that might increase in the future. They send data to be stored every 10 seconds. I am aware of bucketing to deal with this type of situation, but I am wondering which of these is better:
PRIMARY KEY ((sensor_id, day_month_year), reported_at);
PRIMARY KEY ((sensor_id, month_year), reported_at);
I don't know whether using month_year puts too much data in a single partition; on the other hand, I think that if I use day_month_year it creates too many partitions, which slows down reads too much, since fetching a range of data has to hit multiple partitions.
Which one should I use? If you have other good suggestions or just some explanations for me I'd like to hear them.
Posting my answer here since you also asked on https://community.datastax.com/questions/10596/.
Sensor data collected every 10 seconds is equivalent to:
6 entries per minute
360 entries per hour
8,640 entries per day
260K entries per month
Depending on what other data you store in each row, it will be difficult to keep the size of each partition to the recommended 100 MB. This isn't a hard limit, so your partitions can go beyond 100 MB, but you are trading off performance the larger your partitions get.
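For illustration, a sketch of the daily-bucket variant, assuming a single double value per reading (the table name and value column are placeholders):

-- ~8,640 rows per partition keeps each partition well below the ~100 MB guideline
CREATE TABLE sensor_readings_by_day (
    sensor_id      int,
    day_month_year text,        -- e.g. '2021-03-27'
    reported_at    timestamp,
    value          double,
    PRIMARY KEY ((sensor_id, day_month_year), reported_at)
) WITH CLUSTERING ORDER BY (reported_at DESC);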
On its own, Cassandra isn't ideal for performing analytics queries because it is optimised for OLTP workloads where you are reading one partition for each app request. If you need to do OLAP, you will need to do it in Spark for efficiency. Cheers!
There is a table with 5 columns and no more. The size of each row is less than 200 bytes, but the number of rows may grow to several tens of billions over time.
The application will be storing data at a rate of 100 rows per second or more. Once these rows are stored, they will never be updated, but they will be removed after 1 year. They will not be read often, but they may be queried by selecting within a time range, e.g. selecting rows for a given hour of a given day.
Questions
Which type of Nosql database is suited for this?
Which of these databases would be best suited? (Doesn't have to be listed)
If your Oracle license includes the partitioning option, partition by month or year; if most or all of your queries include the date column you partitioned on, that will help dramatically. It also makes dropping a year's worth of data take a few seconds.
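For example (a sketch with hypothetical table and column names), yearly range partitions make the annual purge a near-instant metadata operation:

CREATE TABLE events (
    event_time  DATE          NOT NULL,
    payload     VARCHAR2(200)
)
PARTITION BY RANGE (event_time) (
    PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01'),
    PARTITION p2025 VALUES LESS THAN (DATE '2026-01-01')
);

-- a year later, removing the expired year:
ALTER TABLE events DROP PARTITION p2023 UPDATE GLOBAL INDEXES;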
As others noted in the comments, it depends on how much data is being returned by a query. If a query returns millions of rows, then yes, it may take 15 minutes. Oracle can handle queries against billion-row tables in a few seconds if the criteria are restrictive enough, appropriate indexes are present, and statistics are gathered appropriately.
So how many rows are returned by your 15 minute query?
I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never change. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, with each set of data in its own table in a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds like a lot for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course, this only leads into my next, similar, issue: audit logs...
300 rows about 50 times a day for 6 months is not a big load for any DB. Which DB are you going to use? Most will handle this volume very easily. There are a couple of techniques for handling data fragmentation if the row count exceeds a few hundred million per table, but with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place that match the queries you will issue to fetch that data. Whatever you have in the WHERE clause should have an appropriate index in the DB.
If your row counts per table exceed many millions, you should look at partitioning the tables. DBs actually store data on the filesystem as files, so partitioning helps by splitting the data into smaller groups of data files based on some predicate, e.g. a date or some unique column. You would still see it as a single table, but on the filesystem the DB would store the data in different file groups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like a date.
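As a rough sketch of date-based range partitioning (MySQL-flavoured syntax as an assumption, since the DBMS isn't specified; names are placeholders):

-- one partition per year; the index matches WHERE clauses on the date column
CREATE TABLE reports (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    value_1    INT NOT NULL,
    value_2    INT NOT NULL,
    created_on DATETIME NOT NULL,
    PRIMARY KEY (id, created_on),   -- MySQL requires the partition column in the PK
    KEY idx_created_on (created_on)
)
PARTITION BY RANGE (YEAR(created_on)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);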
Hope this helps.
You are overthinking this. 300k rows is not significant. Just about any relational or NoSQL database will have no problem with it.
Your design sounds fine; however, I highly advise that you use the database's facility to add a primary key for each row, using whatever mechanism is available to you. Typically this involves AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL store like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
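A minimal sketch of that design (MySQL-flavoured syntax as an assumption; the non-key column names are placeholders):

CREATE TABLE table_a (
    tableA_id  BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    value_1    SMALLINT NOT NULL,      -- smallest integer type that fits the data
    value_2    SMALLINT NOT NULL,
    CreatedOn  DATETIME NOT NULL
);

CREATE TABLE table_b (
    tableB_id  BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    col_1      VARCHAR(255) NOT NULL,  -- plus the remaining string columns
    CreatedOn  DATETIME NOT NULL
);

-- supports GROUP BY / range queries on date boundaries
CREATE INDEX idx_a_created_on ON table_a (CreatedOn);
CREATE INDEX idx_b_created_on ON table_b (CreatedOn);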
We have a database that is currently 1.5 TB in size and grows by about a gigabyte of data (a text file of roughly 5 million records) every day.
It has many columns, but a notable one is START_TIME, which holds the date and time.
We run many queries against a date range.
We keep 90 days' worth of records in our database, and we have a larger table which has ALL of the records.
Queries run against the 90 days' worth of records are pretty fast, but queries run against ALL of the data are slow.
I am looking for some very high-level answers and best practices.
We are THINKING about upgrading to SQL Server Enterprise and using table partitioning, splitting the partitions based on month (12) or day (31).
What's the best way to do this?
Virtual or physical servers, a SAN, how many disks, how many partitions, etc.?
Sas
You don't want to split by day, because you will touch all partitions every month. Partitioning allows you not to touch certain data.
Why do you want to partition? Can you clearly articulate why? If not (which I assume), you shouldn't do it. Partitioning does not improve performance per se. It improves performance in some scenarios and costs performance in others.
You need to understand what you gain and what you lose. Here is what you gain:
Fast deletion of whole partitions
Read-Only partitions can run on a different backup-schedule
Here is what you lose:
Productivity
Standard Edition
Lower performance for non-aligned queries (in general)
Here is what stays the same:
Performance for partition-aligned queries and indexes
If you want to partition, you will probably want to do it on date or month, but in a continuous way. So don't make your key month(date). Make it (year(date) + '-' + month(date)). Never touch old partitions again.
If your old partitions are truly read-only, put each of them in a read-only filegroup and exclude it from the backup. That will give you really fast, and smaller, backups.
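A sketch of that setup (hypothetical database and filegroup names):

-- freeze a historical filegroup, then back up only the read/write filegroups
ALTER DATABASE BigDb MODIFY FILEGROUP FG_2019 READ_ONLY;

BACKUP DATABASE BigDb
    READ_WRITE_FILEGROUPS
    TO DISK = N'X:\Backups\BigDb_rw.bak';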
Because you only keep 90 days of data, you probably want one partition per day. Every day at midnight you throw away the oldest partition and alter the partition function to make room for a new day.
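A daily maintenance sketch under those assumptions (all object names are placeholders; it assumes a RANGE RIGHT partition function and scheme on the date column, plus an empty staging table with the same structure and indexes):

-- 1. switch the oldest day's partition out (a metadata-only operation), then discard it
ALTER TABLE dbo.Events SWITCH PARTITION 2 TO dbo.Events_Purge;   -- partition 2 = oldest populated day under RANGE RIGHT
TRUNCATE TABLE dbo.Events_Purge;

-- 2. merge away the now-empty boundary at the old end
ALTER PARTITION FUNCTION pf_daily() MERGE RANGE ('2024-01-01');

-- 3. split in a new boundary for tomorrow at the new end
ALTER PARTITION SCHEME ps_daily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_daily() SPLIT RANGE ('2024-04-02');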
There is not enough information here to answer anything about hardware.
I have a scenario in which there's a huge amount of status data about an item.
The item's status is updated every minute, and there will be about 50,000 items in the near future, so in one month there will be about 2,232,000,000 rows of data. I must keep at least 3 months in the main table before archiving older data.
I must plan for quick queries based on a specific item (its ID) and a date range (usually up to one month), e.g. select A, B, C from Table where ItemID = 3000 and Date between '2010-10-01' and '2010-10-31 23:59:59.999'
So my question is how to design a partitioning structure to achieve that?
Currently, I'm partitioning based on the item's unique identifier (an int) mod the number of partitions, so that all partitions are equally distributed. But this has the drawback of requiring one additional column on the table to act as the partitioning column for the partition function, mapping each row to its partition, which adds a little extra storage. Also, each partition is mapped to a different filegroup.
Partitioning is never done for query performance. With partitioning, performance will always be worse; the best you can hope for is no big regression, never an improvement.
For query performance, anything a partition can do, an index can do better, and that should be your answer: index appropriately.
Partitioning is useful for IO path control cases (distribute on archive/current volumes) or for fast switch-in switch-out scenarios in ETL loads. So I would understand if you had a sliding window and partition by date so you can quickly switch out the data that is no longer needed to be retained.
Another narrow case for partitioning is last page insert latch contention, like described in Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.
Your partition scheme and use case do not seem to fit any of the scenarios in which partitioning would be beneficial (maybe the last one, but that is not clear from the description), so most likely it hurts performance.
I do not really agree with Remus Rusanu. I think partitioning may improve performance if there's a logical reason for it (related to your use cases). My guess is that you could partition ONLY on the itemId. The alternative would be to use the date as well, but if you cannot guarantee that a queried date range will stay within the boundaries of a given partition (not every query is certain to cover a single month), then I would stick to itemId partitioning.
If there are only a few fields you need to retrieve, another option is a covering index: define an INDEX on your main differentiation field (the itemId) which INCLUDEs the fields you need.
CREATE INDEX idxTest ON dbo.YourTable (itemId) INCLUDE (quantity);  -- table name and included column are placeholders
Applicative partitioning actually CAN be beneficial for query performance. In your case you have 50K items and roughly 2G rows per month. You could, for example, create 500 tables, each named status_nnn where nnn is between 001 and 500, and "partition" your item statuses equally among these tables, where nnn is a function of the item id. This way, given an item id, you can limit your search a priori to 0.2% of the whole data (ca. 4M rows).
This approach has a lot of disadvantages, as you'll probably have to deal with dynamic SQL and other unpleasant issues, especially if you need to aggregate data from different tables. BUT it will definitely improve performance for certain queries, such as the ones you mention.
Essentially, applicative partitioning is similar to creating a very wide and flat index, optimized for very specific queries, without duplicating the data.
Another benefit of applicative partitioning is that you could in theory (depending on your use case) distribute your data among different databases and even different servers. Again, this depends very much on your specific requirements, but I've seen and worked with huge data sets (billions of rows) where applicative partitioning worked very well.
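To make the dynamic-SQL point above concrete, here is a sketch of how a request could be routed to one of the 500 tables (the table and column names follow the example query in the question; the modulo mapping is just one possible choice for the function of the item id):

-- route the request to status_nnn, where nnn = (ItemID % 500) + 1
DECLARE @ItemID int = 3000;
DECLARE @suffix char(3) = RIGHT('000' + CAST(@ItemID % 500 + 1 AS varchar(3)), 3);
DECLARE @sql nvarchar(max) =
    N'SELECT A, B, C FROM dbo.status_' + @suffix +
    N' WHERE ItemID = @id AND [Date] BETWEEN @from AND @to;';

EXEC sp_executesql @sql,
    N'@id int, @from datetime2, @to datetime2',
    @id = @ItemID, @from = '2010-10-01', @to = '2010-10-31 23:59:59.999';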