TL;DR: How do I store a status for each user on each entity of my application while avoiding a Cartesian product?
Say I have 10K entities and 10K users, and each user has a status per entity. To manage my users it feels like I'm going to have to save an entry per user-entity pair in some Cartesian table, which will have 100 million entries, and that seems irrational.
I've been thinking this table could be sorted by the user's primary key so my queries would be more efficient, but it still seems like a bad way to handle the situation.
Any solution would be highly appreciated, whether it's the DB choice itself (relational or not), a different ERD, or anything else.
Thanks!
Firstly: Based on your question, every user can have a status for every entity, and all of this information has to be saved. That gives you 100 million entries with the same metadata (entityID, userID, statusID), i.e. a genuine Cartesian product between User and Entity. There is no ERD improvement to be had from data-modeling patterns here.
Secondly: You have 100 million entries at MAXIMUM. Your data is not BIG enough and does not meet the 3 V's of Big Data (Volume: amount of data, Velocity: speed of data in and out, Variety: range of data types and sources). You can handle it with a relational DBMS. However, if you end up with far more than 100 million entries (for example, 100 million new entries per day or per week), you should look at Big Data technologies.
Thirdly: For maximum query performance you can use an in-memory DBMS. You have 100 million entries, each holding 3 long (BIGINT) IDs, so you need at most roughly 100 million x 3 x 8 bytes = 2,400 MB = 2.4 GB of memory. (I'm sure you have a big enough server if you're handling 10K users and 10K entities.)
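To make the first and third points concrete, here is a minimal sketch of the plain junction table with a composite primary key, using Python's sqlite3 purely for illustration (the table and column names are hypothetical). With the composite key plus one secondary index, per-user and per-entity lookups stay cheap even at 100 million rows:

# Minimal sketch of the user/entity status table; SQLite used only for illustration,
# and all table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_entity_status (
        user_id   INTEGER NOT NULL,
        entity_id INTEGER NOT NULL,
        status_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, entity_id)  -- composite key covers per-user lookups
    );
    -- Secondary index for the reverse direction: all users' statuses for one entity.
    CREATE INDEX idx_status_by_entity ON user_entity_status (entity_id, user_id);
""")

conn.execute("INSERT INTO user_entity_status VALUES (42, 7, 1)")

# Point lookup: one user's status on one entity touches a single index entry,
# regardless of the total table size.
print(conn.execute(
    "SELECT status_id FROM user_entity_status WHERE user_id = ? AND entity_id = ?",
    (42, 7),
).fetchone())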
Related
I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins, so each customer has a balance attribute. The balance is managed by a 3rd party service that keeps the up-to-date value, and the balance attribute in my table is just a cached version of it. The 3rd party service requires its own id of the customer as input, so the customers table also contains an externalId attribute, which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with the balance greater than some specified constant value. They need to be sorted by the balance.
Perform some processing for all of the customers - the processing must be performed in proper order - starting from the customer with the greatest balance in descending order (by balance).
Question: which database is the most suitable one for this use case?
My analysis:
In terms of cost it looks quite similar, i.e. paying for capacity units in the case of DynamoDB vs. paying for hours of a micro instance in the case of RDS. I'm not sure whether a micro RDS instance is enough for this purpose - I'm going to check, but I guess it should be.
In terms of performance I'm not sure. It's something I will need to check, but I wanted to ask here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB, which looks like something I really don't want. The first scan can be limited to the externalId attribute; then balances are queried from the 3rd party service and updated in the table. The second scan requires a range key defined on the balance attribute to return customers sorted by balance.
I'm not convinced that any kind of index can help here. Basically, there won't be many read operations on the balance - sometimes it will need to be queried for a single customer by its primary key. The number of reads won't be much greater than the number of writes, so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500 000 customers in the database, the average size of a single customer is 200 bytes. So the total size of the customers in the database is 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day) but the necessity to retrieve sorted data is only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4 KB blocks. With eventually consistent reads that's 12,500 read units. If we assume a cost of $0.25 per million (On-Demand mode), that's 12,500 / 1,000,000 x $0.25, or about $0.003 per full table scan. Want to do it 30 times per day? It costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which if in On Demand at $1.25 per million will be about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
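As a sanity check on the arithmetic above, here is the same estimate as a few lines of Python (the per-million prices are just the figures assumed in this answer, not looked-up current pricing):

# Back-of-the-envelope DynamoDB On-Demand cost for the daily job described above.
TABLE_KB = 100_000                 # 100 MB, using decimal units as in the text
READ_PRICE_PER_M = 0.25            # assumed $/1M eventually consistent read units
WRITE_PRICE_PER_M = 1.25           # assumed $/1M write request units

read_units = (TABLE_KB / 4) / 2    # 4 KB per unit, halved for eventual consistency
scan_cost = read_units / 1_000_000 * READ_PRICE_PER_M
print(f"one full scan : {read_units:,.0f} read units -> ${scan_cost:.3f}")
print(f"30 scans/day  : ${30 * scan_cost:.2f}")

write_units = 500_000              # one unit per item, since items are under 1 KB
update_cost = write_units / 1_000_000 * WRITE_PRICE_PER_M
print(f"full table update: ${update_cost:.3f}")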
Regarding performance, DynamoDB can scan a full table faster than just about any single-server database, because the scan is served by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing the scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On-Demand mode, has never hit a high-water mark that auto-increased its capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single-digit seconds.
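For reference, a parallel scan is just the normal Scan call with Segment/TotalSegments set. A rough sketch with boto3 (the table name and segment count are placeholders, and retries/error handling are omitted):

# Rough sketch of a DynamoDB parallel scan with boto3; "customers" and the
# segment count are placeholders for your own values.
from concurrent.futures import ThreadPoolExecutor
import boto3

TABLE = "customers"
SEGMENTS = 8

def scan_segment(segment):
    """Scan one segment of the table, following pagination until it is exhausted."""
    client = boto3.client("dynamodb")        # one client per thread
    items, start_key = [], None
    while True:
        kwargs = dict(TableName=TABLE, Segment=segment, TotalSegments=SEGMENTS)
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = client.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items

with ThreadPoolExecutor(max_workers=SEGMENTS) as pool:
    results = [item for seg in pool.map(scan_segment, range(SEGMENTS)) for item in seg]
print(len(results), "items scanned")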
I have about 1 billion events daily. I need to keep these events in the database for the last 30 days, so it's about 30 billion rows.
Let's say it is an athletes database; each row has only 4 columns (athlete name, discipline, rank, date). I need to retrieve data only by athlete name and date - for example, to build a graph over the last 30 days for a particular athlete.
Initially I was using Google BigQuery, which is a great tool - extremely cheap, with daily sharding out of the box and linear scalability - but it has a few drawbacks. Querying a 3-billion-row table takes about 5 seconds, which is too much for my case, and when data is inserted it sits in the "streaming buffer" and can't be queried for some time (about 5-10 minutes).
Another approach is to use Postgres and store all the data in one table with proper indexes. I could also shard daily (automatically create a new table at the beginning of each day), but I have concerns about whether Postgres can handle billions of rows; also, if I want historical data for the last 30 days, I have to issue 30 SELECT queries when the data is sharded that way.
I don't want to bother with over-complicated solutions like Cassandra (though I have never tried it). I also don't think I will get any benefit from a column-oriented database, because I have only 4 columns.
I'm looking for something similar to BigQuery but without the drawbacks mentioned above. I think the data can be stored on a single node.
The data can be stored on a single node. Actually, 1 billion rows per day is not much - it averages only about 12K writes/second. For comparison, Akumuli can handle about 1.5 million inserts/second on an m4.xlarge AWS instance with local SSD (roughly half of that with an EBS volume at default settings, but you can provision more IOPS). To store 30B data points you will need less than 200 GB of disk space (it depends on your data, but it's safe to assume a data point will take less than 5 bytes on disk).
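The sizing works out roughly like this (a trivial back-of-the-envelope check in Python):

# Quick sanity check on the sizing claims above.
events_per_day = 1_000_000_000
print(events_per_day / 86_400)          # ~11,600 writes/second on average
print(30 * events_per_day * 5 / 1e9)    # ~150 GB for 30 days at <= 5 bytes/point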
The data model is simple in your case. The series name would look like this:
athlete_rank name=<Name> discipline=<Discipline>
You will be able to query the data by name:
{
  "select": "athlete_rank",
  "range": {
    "from": "20170501T000000",
    "to": "20170530T000000"
  },
  "where": { "name": <Name> }
}
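If it helps, the query body above can be built programmatically and posted to the server's HTTP query endpoint. A small Python sketch (the athlete name is hypothetical, and the host/port/path mentioned in the comment are assumptions about a default deployment - check the Akumuli docs for your setup):

# Build the query shown above; the athlete name is a hypothetical example.
import json

query = {
    "select": "athlete_rank",
    "range": {"from": "20170501T000000", "to": "20170530T000000"},
    "where": {"name": "Usain Bolt"},
}
body = json.dumps(query)
print(body)

# The body is then POSTed to Akumuli's HTTP query endpoint, e.g. with the
# requests library (host, port, and path below are assumptions, not confirmed):
# import requests
# print(requests.post("http://localhost:8181/api/query", data=body).text)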
You shouldn't choose Akumuli if you have high cardinality (many unique series). It consumes about 12 KB of RAM per series, e.g. to handle a database with 1 million series you will need a server with at least 16 GB of RAM (the actual number depends on series size). This will be improved eventually, but at the moment this is what we've got.
Disclaimer: I'm the author of Akumuli so I'm a bit biased. But I'll be happy to get any feedback, good or bad.
I have a design question to ask.
Suppose I have a TBL_SESSIONS table where I keep every logged-in user's id and a dirty flag (indicating whether the content the user has was changed since he got it).
Now, every signed-in user polls this table every few seconds for that dirty flag, i.e., there are a lot of reads on the table, and the dirty flag is also changed a lot.
Suppose I'm expecting many users to be logged in at the same time. I was wondering whether there is any reason to create, say, 10 such tables and distribute the users among them (say, according to their user ids).
I'm asking this from two aspects. First, in terms of performance. Second, in terms of scalability.
I would love to hear your opinions
For a million rows, use a single table.
With correct indexing the access time should not be a problem. If you use multiple tables with users distributed across them, you will need additional processing to find which table to read or update.
Also, how many users will you allocate to each table? And what happens when the number of users grows... add more tables? I think you'll end up with a maintenance nightmare.
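As a concrete (hypothetical) illustration of the single-table approach: if the user id is the primary key, the dirty-flag poll is a single indexed lookup no matter how many sessions exist. A minimal sketch using SQLite purely for illustration:

# Hypothetical single-table layout for the dirty-flag poll (SQLite for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TBL_SESSIONS (
        user_id INTEGER PRIMARY KEY,   -- the primary key doubles as the lookup index
        dirty   INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("INSERT INTO TBL_SESSIONS (user_id, dirty) VALUES (?, ?)", (1234, 1))

# The per-user poll and the flag reset are both single-row, index-backed statements.
row = conn.execute("SELECT dirty FROM TBL_SESSIONS WHERE user_id = ?", (1234,)).fetchone()
conn.execute("UPDATE TBL_SESSIONS SET dirty = 0 WHERE user_id = ?", (1234,))
print(row)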
Having several tables makes sense if you really have a huge number of records, like billions, and a load your server cannot handle. In that case you can shard: split one table into several on different servers, e.g. the first 100 million records on server A (ids 1 to 100,000,000), the next 100 million on server B (ids 100,000,001 to 200,000,000), and so on. Another way is to keep the same data on different servers and route queries through some kind of balancer, which may be harder (replication isn't a key feature of every RDBMS, so it depends on the engine).
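If you ever did have to shard like that, the routing logic is the only extra moving part. A minimal sketch (the server names and the 100-million bucket size are made up for illustration):

# Minimal sketch of range-based shard routing; server names and bucket size are illustrative.
SHARD_SIZE = 100_000_000
SERVERS = ["server-a.example.com", "server-b.example.com", "server-c.example.com"]

def server_for_user(user_id: int) -> str:
    """Map a user id to the server holding its range (ids start at 1)."""
    index = (user_id - 1) // SHARD_SIZE
    if index >= len(SERVERS):
        raise ValueError("user id outside the provisioned ranges")
    return SERVERS[index]

print(server_for_user(1))             # server-a.example.com
print(server_for_user(150_000_000))   # server-b.example.com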
I'm a long-time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, with both sets of data in their own tables in a DBMS, the data will reach roughly 300k rows per year. Having little experience with DBMSs, I worry this is a lot for two tables to manage.
I feel as though throwing this information into a database on each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this bleeds into my next, similar issue: audit logs...
300 rows about 50 times a day for 6 months is no big deal for any DB. Which DB are you going to use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the row count exceeds a few hundred million per table, but with effective indexing and cleanup you can achieve the performance you need. I myself deal with heavy tables of more than 200 million rows every week.
Make sure you have indexes in place that match the queries you will issue to fetch the data: whatever appears in the WHERE clause should have an appropriate index behind it.
If your row counts per table grow to many millions, look at table partitioning. DBs ultimately store data as files on the filesystem, so partitioning helps by splitting the data into smaller groups of data files based on some predicate, e.g. a date or some other column. You still see a single table, but on the file system the DB stores the data in different file groups.
You can also try table sharding, which is actually what you mentioned: different tables based on some predicate such as date.
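If you ever get to that point, here is a hedged sketch of what date-based partitioning looks like on PostgreSQL, driven from Python (the connection string, table, and column names are placeholders; SQL Server and others have their own equivalents):

# Hypothetical example of date-based partitioning on PostgreSQL; the DSN and
# all names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=reports user=app")   # placeholder connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE report_rows (
        col_a      integer,
        col_b      integer,
        created_on date NOT NULL
    ) PARTITION BY RANGE (created_on);
""")
# One child table (file group) per month; queries still target report_rows.
cur.execute("""
    CREATE TABLE report_rows_2024_01 PARTITION OF report_rows
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
""")
conn.commit()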
Hope this helps.
You are overthinking this. 300k rows is not significant. Just about any relational or NoSQL database will have no problem with it.
Your design sounds fine; however, I highly advise that you let the database assign a primary key for each row, using whatever facility is available to you. Typically this means AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL store like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn column will facilitate date-range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (days, weeks, months, years).
Make sure you have an index on CreatedOn if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
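Here is a runnable sketch of that two-table layout, using SQLite purely for illustration (names are placeholders and the column types should be sized for your real data, per the note above):

# Sketch of Table A with an auto-incrementing key, a CreatedOn column, and a
# daily GROUP BY; SQLite for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (
        tableA_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        a          INTEGER NOT NULL,
        b          INTEGER NOT NULL,
        created_on TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX idx_table_a_created_on ON table_a (created_on);
""")
conn.execute("INSERT INTO table_a (a, b) VALUES (?, ?)", (1, 2))

# Daily summary driven by the CreatedOn index.
rows = conn.execute("""
    SELECT date(created_on) AS day, COUNT(*), AVG(a), AVG(b)
    FROM table_a
    GROUP BY date(created_on)
    ORDER BY day
""").fetchall()
print(rows)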
I am working on medical software and my goal is to store lots of custom actions in a database. Since it is very important to keep track of who has done what, an action is generated every time a user does something meaningful (e.g. writes a comment, adds some medical information, etc.). The problem is that over time there will be lots of actions - say 10,000 per patient, and there might be 50,000 patients, resulting in a total of 500 million actions (or even more).
Currently database model looks something like this:
[Patient] 1 -- 1 [ActionBlob]
So every patient simply has one big blob which contains all of their actions as a serialized byte array. Of course this won't work when the table grows, because I would have to transfer the whole byte array back and forth between the database and the client all the time.
My next idea was to have a list of individually serialized actions (not one big chunk), i.e.
[Patient] 1 -- * [Action]
but I started to wonder whether this is a good approach. Now when I add a new action I don't have to serialize all the other actions and transfer them to the database; I simply serialize one action and add it to the Actions table. But what about loading data - will it be super slow, since there may be 500 million rows in one table?
So basically the questions are:
Can SQL Server handle loading 10,000 rows from a table with 500 million rows? (These numbers may be even larger.)
Can Entity Framework handle materializing 10,000 entities without being very slow?
Your second idea is correct: hundreds of millions of small rows are not a problem for a SQL database, and if you index the useful columns in the Action table, it will perform well.
Storing actions as a blob is a very bad idea, because every time you would have to deserialize the blob into individual records in order to search, and it offers none of the benefits of searching, indexing, etc.
Properly indexed, a billion records are not a problem at all for SQL Server.
And no user interface ever shows a million records at once; we always page through records, e.g. 1 to 99, 100 to 199, and so on.
We have tables with nearly 10 million rows, and everything is smooth, because frequently searched columns are indexed and foreign keys are indexed.
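A small illustration of that paging pattern (table and column names are made up; any database with LIMIT/OFFSET or an equivalent works the same way):

# Paging through an indexed actions table 100 rows at a time
# (SQLite in memory, hypothetical column names).
import sqlite3

PAGE_SIZE = 100
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (action_id INTEGER PRIMARY KEY, patient_id INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_actions_patient ON actions (patient_id)")
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)",
                 [(i, i % 50, "x") for i in range(1, 1001)])

def fetch_page(patient_id, page):
    """Return one page of a patient's actions; the UI never pulls everything at once."""
    return conn.execute(
        "SELECT action_id, payload FROM actions WHERE patient_id = ? "
        "ORDER BY action_id LIMIT ? OFFSET ?",
        (patient_id, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

print(len(fetch_page(7, 0)))   # first page for patient 7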
Short answer for questions 1 and 2: yes.
But if you're doing these "materializations" in one move, you'd rather use SqlBulkCopy.
I'd recommend you take a look at the following:
How to do a Bulk Insert -- Linq to Entities
http://archive.msdn.microsoft.com/LinqEntityDataReader
About your model: you definitely shouldn't use a blob to store Actions. Have an Action table with a Patient foreign key, and be sure to put a timestamp column on this table.
This way, whenever you have to load the Actions for a given Patient, you can use time as a filtering criterion (for example, load the Actions from the last 2 months).
As you're likely going to fetch Actions for a given Patient, make sure to index the Patient FK.
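To make that concrete, here is a minimal sketch of the Patient/Action layout with a timestamp filter, using SQLite only for illustration (Entity Framework would map onto the same shape; all names are placeholders):

# Minimal sketch of the Patient 1--* Action layout with a time filter
# (SQLite for illustration; names are placeholders).
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY
    );
    CREATE TABLE actions (
        action_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
        created_on TIMESTAMP NOT NULL,
        payload    BLOB
    );
    -- Composite index: "this patient's actions in this time window" is one range scan.
    CREATE INDEX idx_actions_patient_time ON actions (patient_id, created_on);
""")

# Load only the last two months of actions for one patient.
since = (datetime.now() - timedelta(days=60)).strftime("%Y-%m-%d %H:%M:%S")
rows = conn.execute(
    "SELECT action_id, created_on, payload FROM actions "
    "WHERE patient_id = ? AND created_on >= ? ORDER BY created_on",
    (1, since),
).fetchall()
print(len(rows), "actions loaded")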
Hope this helps.
Regards,
Calil