I have a database of ~2 billion documents (~8 TB). Documents are kept for 90 days and then dropped. However, several of the fields contain much more data than the rest, and I only need those for a shorter time, say 30 days. After 30 days I want to clear those fields out to free up space, before the document itself is archived later on.
It doesn't seem that MongoDB has native functionality for TTL on individual fields.
The database is both write and read heavy.
I'm thinking about writing a script that runs against Mongo every minute and does a query like:
timestamp: $gt (now - 30 days - 1 hour) AND $lt (now - 30 days), followed by an updateMany that writes "" to these fields.
So essentially: run a script every minute with a rolling one-hour window (just to ensure no documents escape) and do an updateMany, roughly like the sketch below.
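Roughly what I have in mind, as a sketch (assuming pymongo; the connection string, collection name, and the large field names "payload" and "raw_text" are placeholders):

# Sketch of the proposed cleanup job (run every minute, e.g. from cron).
# Assumptions: pymongo, a collection named "docs", a "timestamp" field,
# and two hypothetical large fields "payload" and "raw_text".
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["docs"]

def clear_expired_fields():
    now = datetime.now(timezone.utc)
    window_end = now - timedelta(days=30)            # documents that just turned 30 days old
    window_start = window_end - timedelta(hours=1)   # rolling one-hour window

    result = coll.update_many(
        {
            "timestamp": {"$gte": window_start, "$lt": window_end},
            # Skip documents whose fields were already cleared, so reruns stay cheap.
            "payload": {"$exists": True},
        },
        # $unset removes the fields entirely; use $set with "" instead if the
        # fields must remain present on the document.
        {"$unset": {"payload": "", "raw_text": ""}},
    )
    print(f"matched={result.matched_count} modified={result.modified_count}")

if __name__ == "__main__":
    clear_expired_fields()

This assumes there's an index on timestamp so the range query stays cheap.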
Is this a decent approach? Are there any design considerations I should be aware of when addressing this problem?
Related
I have about 1 billion events daily. I need to keep these events in the database for the last 30 days, so it's about 30 billion rows.
Let's say it is an athletes database; each row has only 4 columns (athlete name, athlete's discipline, athlete rank, date). I need to retrieve data only by athlete name and date, for example to build a graph over the last 30 days for a particular athlete.
Initially I was using Google BigQuery, which is a great tool: extremely cheap, with daily sharding out of the box and linear scalability, but with a few drawbacks. Querying a 3-billion-row table takes about 5 seconds, which is too much for my case. Also, when data is inserted it first sits in the "streaming buffer" and can't be queried for some time (about 5-10 minutes).
Another approach is to use Postgres and store all the data in one table with proper indexes. I could also use daily sharding (automatically create a new table at the beginning of each day), but I have concerns about whether Postgres can handle billions of rows. Also, if I shard the data that way and want historical data for the last 30 days, I have to run 30 SELECT queries.
I don't want to bother with over-complicated solutions like Cassandra (though I have never tried it). I also don't think I would get any benefit from a column-oriented database, because I have only 4 columns.
I'm looking for something similar to BigQuery but without the mentioned drawbacks. I think the data can be stored on a single node.
The data can be stored on a single node. Actually, 1 billion rows per day is not much; it averages only about 12K writes/second. For comparison, Akumuli can handle about 1.5 million inserts/second on an m4.xlarge AWS instance with local SSD (almost half of that with an EBS volume at default settings, but you can provision more IOPS). To store 30B data points you will need less than 200GB of disk space (it depends on your data, but it's safe to assume a data point takes less than 5 bytes on disk).
The data model is simple in your case. The series name would look like this:
athlete_rank name=<Name> discipline=<Discipline>
You will be able to query the data by name:
{
    "select": "athlete_rank",
    "range": {
        "from": "20170501T000000",
        "to": "20170530T000000"
    },
    "where": { "name": <Name> }
}
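For instance, with a default akumulid setup you could POST that query to the HTTP API from Python; a rough sketch (the host, port 8181, and the /api/query path are the defaults I'd expect, and the athlete name is a placeholder):

# Rough sketch: send the query above to akumulid's HTTP query endpoint.
# Assumes a default local akumulid instance (HTTP API on port 8181, /api/query);
# adjust host/port/path to your configuration.
import requests

query = {
    "select": "athlete_rank",
    "range": {"from": "20170501T000000", "to": "20170530T000000"},
    "where": {"name": "Usain Bolt"},  # placeholder athlete name
}

resp = requests.post("http://localhost:8181/api/query", json=query)
resp.raise_for_status()

# The response is streamed as text; print it line by line.
for line in resp.text.splitlines():
    print(line)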
You shouldn't choose Akumuli if you have large cardinality (many unique series). It consumes about 12KB of RAM per series, so to handle a database with 1 million series you will need a server with at least 16GB of RAM (the actual number depends on series size). This will be improved eventually, but at the moment this is what we've got.
Disclaimer: I'm the author of Akumuli so I'm a bit biased. But I'll be happy to get any feedback, good or bad.
I'm working on an application that involves a very high volume of update/select queries against the database.
I have a base table (A) which holds about 500 records for an entity per day. For every user in the system, a variation of this entity is created based on some of the user's preferences and stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep only one day of data in these tables, and at midnight I archive the historical data to HBase. This setup is working fine and I've had no performance issues so far.
There has been a change in the business requirements lately: some attributes in base table A (for 15-20 records) now change every 20 seconds, and based on that I have to recalculate some values for all of the corresponding variation records in table B for all users. Even though only ~20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds; by then the next update has arrived, eventually resulting in all SELECT queries getting queued up. I'm getting about 3 GET requests per 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does NoSQL + a relational database help me here? Are there any platforms/datastores which will let me update data frequently without locking and at the same time give me the flexibility of running select queries on various fields of an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application currently uses SQL, there's no reason to move away from that to NoSQL. The performance requirements you describe can certainly be met by an in-memory SQL-capable DBMS.
From what I understand, you are updating 200K records every 20 seconds, so within about 10 minutes you will have rewritten almost all of your data. In that case, why write that state to the database at all if it is updated so frequently? I don't know your exact requirements, but why don't you just calculate it on demand from the data in table A, along the lines of the sketch below?
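Something like this (a purely hypothetical sketch; the table/field names and the derivation formula are placeholders for your real preference logic):

# Hypothetical sketch of "calculate it on demand from table A" instead of
# persisting 200K pre-computed rows in table B. All names and the derivation
# formula are placeholders for whatever your real preference logic is.
from dataclasses import dataclass

@dataclass
class BaseRecord:      # a row of table A (~500 per day)
    entity_id: int
    base_value: float

@dataclass
class UserPrefs:       # the per-user preferences the midnight job currently uses
    multiplier: float
    offset: float

def load_table_a() -> list[BaseRecord]:
    """Placeholder: read the ~500 current rows of table A."""
    raise NotImplementedError

def load_prefs(user_id: int) -> UserPrefs:
    """Placeholder: read this user's preferences."""
    raise NotImplementedError

def records_for_user(user_id: int) -> list[dict]:
    """What an API request would call instead of selecting from table B."""
    prefs = load_prefs(user_id)
    return [
        {"entity_id": rec.entity_id,
         # placeholder for the real recalculation
         "value": rec.base_value * prefs.multiplier + prefs.offset}
        for rec in load_table_a()
    ]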
Ten other students and I are doing a big project where we need to receive temperature data from hardware in the form of nodes, which should be uploaded and stored on a server. As we are all embedded systems engineers with only minor database knowledge, I am turning to you guys.
I want to receive data from the nodes, let's say every 30 seconds. A table storing [nodeId, time, temp] rows would quickly become very long. Do you have any suggestions for storing the data another way?
A solution could be to store it as described for a period of time and then "compress" it somehow into a matrix of some sort? I still want to be able to reach old data.
One row every 30 seconds is not a lot of data. It's 2880 rows per day per node. I once designed a database which had 32 million rows added per day, every day. I haven't looked at it for a while but I know it's currently got more than 21 billion rows in it.
The only thing to bear in mind is that you need to think about how you're going to query it, and make sure it has appropriate indexes.
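For example, a plain table plus a composite index matching your typical query goes a long way; here's a minimal sketch using SQLite purely for illustration (the names are made up, and the same schema works in MySQL/Postgres/etc.):

# Minimal sketch: one row per reading plus an index matching the typical query
# ("give me node X between t1 and t2"). SQLite is used here only as an example;
# the table and index look the same in any relational database.
import sqlite3

conn = sqlite3.connect("temperatures.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS readings (
    node_id INTEGER NOT NULL,
    ts      INTEGER NOT NULL,   -- Unix timestamp of the measurement
    temp    REAL    NOT NULL
);
-- Composite index so per-node time-range queries don't scan the whole table.
CREATE INDEX IF NOT EXISTS idx_readings_node_ts ON readings (node_id, ts);
""")

# One reading every 30 seconds per node is 2880 rows/day/node.
conn.execute("INSERT INTO readings (node_id, ts, temp) VALUES (?, ?, ?)",
             (42, 1700000000, 21.5))

# Typical query: the last day of data for one node.
rows = conn.execute(
    "SELECT ts, temp FROM readings "
    "WHERE node_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (42, 1700000000 - 86400, 1700000000),
).fetchall()
conn.commit()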
Have fun!
We have a database that is currently 1.5TB in size and grows by about a gigabyte of data every day, loaded from a text file of roughly 5 million records.
It has many columns, but a notable one is START_TIME, which holds the date and time.
We run many queries against a date range.
We keep 90 days' worth of records in our database, and we have a larger table which has ALL of the records.
Queries run against the 90 days' worth of records are pretty fast, but queries run against ALL of the data are slow.
I am looking for some very high-level answers and best practices.
We are THINKING about upgrading to SQL Server Enterprise and using table partitioning, splitting the partitions by month (12) or by day (31).
What's the best way to do this?
Virtual or physical, a SAN, how many disks, how many partitions, etc.?
Sas
You don't want to split by day, because you will touch all partitions every month. Partitioning allows you not to touch certain data.
Why do you want to partition? Can you clearly articulate why? If not (which I assume), you shouldn't do it. Partitioning does not improve performance per se; it improves performance in some scenarios and costs performance in others.
You need to understand what you gain and what you lose. Here is what you gain:
Fast deletion of whole partitions
Read-Only partitions can run on a different backup-schedule
Here is what you lose:
Productivity
Standard Edition
Lower performance for non-aligned queries (in general)
Here is what stays the same:
Performance for partition-aligned queries and indexes
If you want to partition, you will probably want to do it on date or month, but in a continuous way. So don't make your key month(date); make it something like year(date) + '-' + month(date). Never touch old partitions again.
If your old partitions are truly read-only, put each of them in a read-only filegroup and exclude it from backups. That will give you really fast, smaller backups.
Because you only keep 90 days of data, you probably want to have one partition per day. Every day at midnight you kill the oldest partition and alter the partition function to make room for a new day, along the lines of the sketch below.
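The nightly job then looks roughly like this. It is only a sketch, driven from Python via pyodbc here; the object names (pf_daily, ps_daily, dbo.Events, dbo.Events_staging, the PRIMARY filegroup) are placeholders, and it assumes a RANGE RIGHT partition function with daily boundaries on START_TIME plus a staging table with identical structure on the same filegroup as the partition being retired:

# Sketch of the daily sliding-window maintenance, assuming SQL Server Enterprise
# with a RANGE RIGHT partition function pf_daily / partition scheme ps_daily.
# All object names are placeholders; dbo.Events_staging must have the same
# columns/indexes and live on the filegroup of the partition being switched out.
from datetime import date, timedelta

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()

oldest = (date.today() - timedelta(days=90)).isoformat()  # boundary to retire
newest = (date.today() + timedelta(days=1)).isoformat()   # boundary to add

# Dates are inlined as literals because these are DDL statements; a real job
# would live in a stored procedure.
statements = [
    # 1. Switch the oldest partition out to staging (metadata-only), then empty it.
    f"ALTER TABLE dbo.Events SWITCH PARTITION "
    f"$PARTITION.pf_daily('{oldest}') TO dbo.Events_staging",
    "TRUNCATE TABLE dbo.Events_staging",
    # 2. Remove the now-empty boundary from the partition function.
    f"ALTER PARTITION FUNCTION pf_daily() MERGE RANGE ('{oldest}')",
    # 3. Make room for the next day: pick the filegroup, then add the new boundary.
    "ALTER PARTITION SCHEME ps_daily NEXT USED [PRIMARY]",
    f"ALTER PARTITION FUNCTION pf_daily() SPLIT RANGE ('{newest}')",
]
for stmt in statements:
    cur.execute(stmt)
conn.commit()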
There is not enough information here to answer anything about hardware.
I have a database containing records collected every 0.1 seconds, and I need to time-average a given day's data into 20-minute buckets, i.e. return 24*3 = 72 values for the day.
Currently I do a separate AVG call to the database for each 20-minute period within the day, which is 24*3 calls. My connection to the database seems a little slow (it is remote) and it takes ~5 minutes to do all the averages. Would it be faster to do a single query that pulls the entire day's worth of data and then average it into 20-minute buckets? If it helps to answer the question, I have to do some arithmetic on the data before averaging, namely multiplying several table columns together.
You can calculate the number of minutes since midnight like:
datepart(hh,datecolumn)*60 + datepart(mi,datecolumn)
If you divide that by 20, you get the number of the 20 minute interval. For example, 00:10 would fall in interval 0, 00:30 in interval 1, and 15:30 in interval 46, and so on. With this formula, you can group on 20 minute intervals like:
select
(datepart(hh,datecolumn)*60 + datepart(mi,datecolumn)) / 20 as IntervalNr
, avg(value)
from YourTable
group by (datepart(hh,datecolumn)*60 + datepart(mi,datecolumn)) / 20
You can do math inside the avg call, like:
avg(col1 * col2 - col3 / col4)
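And if you do decide to pull the whole day across and average client-side instead, the same bucketing is easy to reproduce; a small Python sketch (the column names are made up, and rows is whatever your driver returns):

# Sketch of the same 20-minute bucketing done client-side, in case you fetch the
# whole day in one query and average locally. Column names (ts, col1..col4) are
# placeholders for your real schema.
from collections import defaultdict
from datetime import datetime

def average_by_20min(rows):
    """rows: iterable of (ts: datetime, col1, col2, col3, col4)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, c1, c2, c3, c4 in rows:
        interval = (ts.hour * 60 + ts.minute) // 20   # 0..71, same formula as above
        sums[interval] += c1 * c2 - c3 / c4           # the pre-average arithmetic
        counts[interval] += 1
    return {i: sums[i] / counts[i] for i in sorted(sums)}

# Example: 15:30 falls in interval 46, and 2*3 - 4/2 averages to 4.0.
print(average_by_20min([(datetime(2024, 1, 1, 15, 30), 2.0, 3.0, 4.0, 2.0)]))
# {46: 4.0}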
In general, reducing the number of queries is a good idea. Aggregate and do whatever arithmetic/filtering/grouping you can in the query (i.e. in the database), and then do any 'iterative' computations in the application code (e.g. in PHP).
To be sure whether it would be faster or not, you should measure it.
However, it should be faster: since you have a slow connection to the database, the number of round trips has a bigger impact on the total execution time.
How about a stored procedure on your database? If your database engine doesn't support one, how about having a script or something do the math and populate a separate 'averages' table on your database server? Then you only have to read the averages from the remote client once a day.
Computation in one single query would be slightly faster. Think of the overhead of multiple requests: setting up the connection, parsing the query or loading the stored procedure, etc.
But also make sure that you have accurate indexes, which can result in a huge performance increase. Some operations on huge databases can last from minutes to hours.
If you are sending a lot of data, and the connection is the bottleneck, how and when you group and send the data doesn't matter. There is no good way to send 100MB every 10 minutes over a 56k modem. Figure out the size of your data and bandwidth and be sure you can even send it.
That said:
First be certain the network is the bottleneck. If so, try to work with a smaller data set if possible, and test different scenarios. In general, 1 large record set will use less bandwidth than 2 recordsets that are half the size.
If possible, add columns to your table and compute and store the column product and interval index (see Andomar's post) every time you insert data into the database.