Entity Framework with lots of rows - sql-server

I am working on medical software, and my goal is to store lots of custom actions in the database. Since it is very important to keep track of who has done what, an action is generated every time a user does something meaningful (e.g. writes a comment, adds some medical information, etc.). Now the problem is that over time there will be lots of actions, let's say 10,000 per patient, and there might be 50,000 patients, resulting in a total of 500 million actions (or even more).
Currently database model looks something like this:
[Patient] 1 -- 1 [ActionBlob]
So every patient simply has one big blob which contains all of their actions as one big serialized byte array. Of course this won't work once the table grows big, because I would have to transfer the whole byte array back and forth between the database and the client all the time.
My next idea was to have a list of individually serialized actions (not one big chunk), i.e.
[Patient] 1 -- * [Action]
but I started to wonder whether this is a good approach or not. Now when I add a new action I don't have to serialize all the other actions and transfer them to the database; I simply serialize one action and add it to the Actions table. But what about loading data: will it be super slow, since there may be 500 million rows in one table?
So basically the questions are:
Can SQL Server handle loading 10,000 rows from a table with 500 million rows? (These numbers may be even larger.)
Can Entity Framework handle materializing 10,000 entities without being very slow?

Your second idea is correct; millions of small rows are not a problem for a SQL database, and if you index the useful columns in the Action table, performance will be faster still.
Storing actions as a blob is a very bad idea: every time you want to search, you will have to deserialize the blob into individual records, and you get none of the benefits of SQL search.
A billion properly indexed records is not a problem at all for SQL Server.
And no user interface ever shows a million records at once; you always page through records, 1 to 99, 100 to 199, and so on.
We have tables with nearly 10 million rows and everything is smooth, because the frequently searched columns and the foreign keys are indexed.
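To make that concrete, here is a minimal sketch of such an Action table and index; the table and column names are made up for illustration, not taken from the question:

    -- Hypothetical Actions table; names are illustrative only.
    CREATE TABLE dbo.Actions (
        ActionId   BIGINT IDENTITY(1,1) PRIMARY KEY,
        PatientId  INT            NOT NULL REFERENCES dbo.Patients (PatientId),
        ActionType TINYINT        NOT NULL,
        CreatedOn  DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME(),
        Payload    VARBINARY(MAX) NULL  -- one serialized action, not the whole history
    );

    -- Index the foreign key plus the column you filter on most often, so
    -- "all actions of patient X" becomes an index seek instead of a table scan.
    CREATE INDEX IX_Actions_PatientId_CreatedOn
        ON dbo.Actions (PatientId, CreatedOn);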

Short answer for questions 1 and 2: yes.
But if you're doing these inserts in one big batch, you'd rather use SqlBulkCopy.
I'd recommend you take a look at the following:
How to do a Bulk Insert -- Linq to Entities
http://archive.msdn.microsoft.com/LinqEntityDataReader
About your model, you definitely shouldn't use a blob to store Actions. Have an Action table with a Patient foreign key, and be sure to have a timestamp column on this table.
This way, whenever you have to load the Actions for a given Patient, you can use time as a filtering criterion (for example, load the Actions from the last 2 months).
As you're likely going to fetch Actions for a given Patient, make sure to put an index on the Patient FK.
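For example, a sketch of such a query, assuming an Actions table with a PatientId FK and a CreatedOn timestamp (hypothetical names):

    -- Load only the last two months of actions for one patient; with an
    -- index on (PatientId, CreatedOn) this is a narrow range seek.
    SELECT ActionId, ActionType, CreatedOn, Payload
    FROM dbo.Actions
    WHERE PatientId = @PatientId
      AND CreatedOn >= DATEADD(MONTH, -2, SYSUTCDATETIME())
    ORDER BY CreatedOn DESC;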
Hope this helps.
Regards,
Calil

Related

Fast data retrieval without indexes on a table with data insertions every 10 seconds (short time span)

I am fetching data from a table with 20K rows in a third-party source, where the way the table is filled can't be changed.
On the third-party side, the table is filled as follows:
New data arrives every 15 seconds, approximately 7K rows.
At any given time only the last three timestamps are available; the rest of the data is deleted.
There is no index on the table. One can't be requested either, due to unavoidable reasons and possible slowness of the inserts.
I am aware of the following:
Row locks, and other locks up the hierarchy, are taken while data is inserted.
The problem persists even with SELECT ... WITH (NOLOCK).
There is no join with any other table while fetching; we join the tables locally, once the data is with us in a temp table.
When data insertion at the third party is stopped, the data comes back in 100 ms to 122 ms.
When the service is on, it takes 3 to 5 seconds.
Any help/suggestion/approach is appreciated in advance.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15-second intervals
For a given dataload (i.e. once per 15 seconds), sketched in T-SQL below the list:
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (data for only three time periods present at a time, right?)
Drop that "retired" partition
While potentially efficient and effective, this would be very messy. The big problem I see is that if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On each dataload, create a new table, load data into that table, then ALTER VIEW to include the newest table and remove the oldest one (see the sketch below).
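A sketch of that rotation, with hypothetical names:

    -- Identically structured per-interval tables.
    CREATE TABLE dbo.Load_001 (LoadTime DATETIME2 NOT NULL, Payload VARCHAR(100));
    CREATE TABLE dbo.Load_002 (LoadTime DATETIME2 NOT NULL, Payload VARCHAR(100));
    CREATE TABLE dbo.Load_003 (LoadTime DATETIME2 NOT NULL, Payload VARCHAR(100));
    GO
    CREATE VIEW dbo.CurrentData AS
        SELECT LoadTime, Payload FROM dbo.Load_001
        UNION ALL SELECT LoadTime, Payload FROM dbo.Load_002
        UNION ALL SELECT LoadTime, Payload FROM dbo.Load_003;
    GO
    -- On the next dataload: fill a fresh dbo.Load_004, then rotate the view
    -- to include it and exclude the oldest table, which can then be dropped.
    ALTER VIEW dbo.CurrentData AS
        SELECT LoadTime, Payload FROM dbo.Load_002
        UNION ALL SELECT LoadTime, Payload FROM dbo.Load_003
        UNION ALL SELECT LoadTime, Payload FROM dbo.Load_004;
    GO
    DROP TABLE dbo.Load_001;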
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.

Which data store is best for my scenario

I'm working on an application that involves a very high volume of update/select queries against the database.
I have a base table (A) which will have about 500 records for an entity per day. And for every user in the system, a variation of this entity is created based on some of the user's preferences; these variations are stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been some change in the business requirements lately, and now some attributes in base table A (for 15-20 records) will change every 20 seconds, and based on that I have to recalculate some values for all the corresponding variation records in table B, for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds; by then the next update occurs, eventually resulting in all SELECT queries getting queued up. I'm getting about 3 GET requests per 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does NoSQL + a relational database help me here? Are there any platforms/datastores that will let me update data frequently without locking, and at the same time give me the flexibility of running select queries on various fields of an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application currently uses SQL, then there's no reason to move away from that to NoSQL. The performance requirements you describe can certainly be met by an in-memory SQL-capable DBMS.
From what I understand, you are updating 200K records every 20 seconds, so in about 10 minutes you will have updated almost all of your data. In that case, why write that state to the database at all if it changes so frequently? I don't know your exact requirements, but why don't you just calculate it on demand from the data in table A?
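If the values in B really are a pure function of table A and the user's preferences, a view (or inline table-valued function) computed at read time would avoid the 200K-row write storm entirely. This is only a sketch under that assumption; the table names, column names, and formula are invented:

    -- Hypothetical: derive each user's variation at query time instead of
    -- materializing 5M rows in table B.
    CREATE VIEW dbo.UserEntityValues AS
    SELECT u.UserId,
           a.EntityId,
           a.BaseValue * u.PreferenceFactor AS AdjustedValue
    FROM dbo.TableA AS a
    CROSS JOIN dbo.UserPreferences AS u;

    -- An API request then reads only the handful of rows it needs:
    SELECT EntityId, AdjustedValue
    FROM dbo.UserEntityValues
    WHERE UserId = @UserId;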

One large SQL Server table or several smaller ones?

I have some design question to ask.
Suppose I have a TBL_SESSIONS table where I keep every logged-in user's ID and a Dirty flag (indicating whether the content he has was changed since he got it).
Now, this table will be polled every few seconds by every signed-in user to check that Dirty flag, i.e., there are a lot of reads on that table, and the Dirty flag is also changed a lot.
Suppose I'm expecting many users to be logged in at the same time. I was wondering whether there is any reason to create, say, 10 such tables and distribute the users between them (say, according to their user IDs).
I'm asking this from two aspects. First, in terms of performance. Second, in terms of scalability.
I would love to hear your opinions
For 1m rows use a single table.
With correct indexing, access time should not be a problem. If you use multiple tables with users distributed across them, you will need additional processing to find which table to read/update.
Also, how many users will you allocate to each table? And what happens when the number of users grows... add more tables? I think you'll end up with a maintenance nightmare.
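To make the single-table answer concrete, here is a minimal sketch (the table name is from the question; the columns are assumed):

    CREATE TABLE dbo.TBL_SESSIONS (
        UserId  INT NOT NULL PRIMARY KEY,  -- one row per signed-in user
        IsDirty BIT NOT NULL DEFAULT 0
    );

    -- The poll every few seconds is a single-row seek on the clustered key:
    SELECT IsDirty FROM dbo.TBL_SESSIONS WHERE UserId = @UserId;

    -- Flipping the flag is an equally cheap single-row update:
    UPDATE dbo.TBL_SESSIONS SET IsDirty = 1 WHERE UserId = @UserId;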
Having several tables makes sense if you really have lots of records, like billions, and a load your server cannot handle. In that case you can perform sharding: split one table into several on different servers, e.g. the first 100 million records on server A (id 1 to 100,000,000), the next 100 million on server B (100,000,001 to 200,000,000), and so on. The other way is having the same data on different servers and querying through some kind of balancer, which may be harder (replication isn't a key feature of an RDBMS, so it depends on the engine).

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, with both sets of data in their own tables in a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds like a lot for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melds into my next, similar issue: audit logs...
300 rows about 50 times a day for 6 months is not a big load for any DB. Which DB are you going to use? Most will handle this very easily. There are a couple of techniques for handling data fragmentation if the rows exceed more than a few hundred million per table, but with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables of more than 200 million rows every week.
Make sure you have indexes in place for the queries you will issue to fetch that data: whatever you have in the WHERE clause should have an appropriate index in the DB for it.
If your row counts per table exceed many millions, you should look at partitioning of tables. DBs actually store data in the filesystem as files, so partitioning helps by making smaller groups of data files based on some predicate, e.g. a date or some unique column. You would still see it as a single table, but on the file system the DB would store the data in different filegroups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
Hope this helps.
You are overthinking this. 300k rows is not significant; just about any relational or NoSQL database will handle it without problems.
Your design sounds fine; however, I highly advise that you use the database's facility for adding a primary key to each row, whatever is available to you. Typically this involves AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL store like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn column will facilitate date-range queries that limit the data for summarization purposes and allow you to GROUP BY on date boundaries (days, weeks, months, years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
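Putting those points together, here is a sketch of Table A; only CreatedOn comes from the design above, while the other names and the SMALLINT choice are assumptions:

    CREATE TABLE dbo.TableA (
        tableA_id INT IDENTITY(1,1) PRIMARY KEY,   -- auto-incrementing surrogate key
        A         SMALLINT  NOT NULL,              -- smallest integer type that fits the range
        B         SMALLINT  NOT NULL,
        CreatedOn DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );
    CREATE INDEX IX_TableA_CreatedOn ON dbo.TableA (CreatedOn);

    -- Date-bounded rollup that groups on day boundaries:
    SELECT CAST(CreatedOn AS DATE) AS [Day],
           COUNT(*) AS NumRows,
           SUM(A)   AS TotalA
    FROM dbo.TableA
    WHERE CreatedOn >= DATEADD(MONTH, -1, SYSUTCDATETIME())
    GROUP BY CAST(CreatedOn AS DATE)
    ORDER BY [Day];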

Database design--billions of records in one table?

Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistake to create one giant table to store the messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables that grow with application usage, you're effectively using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table are trivial on a modern database. At that level, your only performance concerns are good indexes and how you do joins.
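For instance, a single messages table whose room FK leads a composite index keeps per-room reads cheap at any total row count (a sketch, with assumed names):

    CREATE TABLE dbo.Messages (
        MessageId BIGINT IDENTITY(1,1) PRIMARY KEY,
        RoomId    INT NOT NULL REFERENCES dbo.Rooms (RoomId),
        UserId    INT NOT NULL,
        Body      NVARCHAR(MAX) NOT NULL,
        SentAt    DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Fetching one room's recent messages is an index seek no matter how
    -- many billions of rows the table holds in total.
    CREATE INDEX IX_Messages_RoomId_SentAt ON dbo.Messages (RoomId, SentAt);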
Billions of records?
Assuming you constantly have 1,000 active users each sending 1 message per minute, that is 1,000 × 60 × 24 ≈ 1.5 million messages per day, and roughly 500 million messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room is possible, every database has limits on the number of tables that can be created, so given an infinite number of chat rooms you would be required to create an infinite number of tables, which is not going to work.
You can, on the other hand, store billions of rows of data; storage is not normally the issue, given the space. Retrieving the information within a sensible time frame is, however, and requires careful planning.
You could partition the messages by date range, and with some planning you can use LUN migration to move older data onto slower storage, whilst leaving more recent data on the faster storage.
Strictly speaking, your design is right: a single table. Fields with low entropy (e.g. 'userid') you want to link out to ID tables, i.e. following normal database normalization patterns.
You might want to think about range-based partitioning, e.g. 'copies' of your table with a year prefix, or maybe even just a 'current' table and an archive table.
Both of these approaches mean that your query semantics are more complex (consider if someone did a multi-year search): you would have to query multiple tables.
However, the upside is that your 'current' table will remain at a roughly constant size, and archiving is more straightforward: you can just drop table 2005_Chat when you want to archive the 2005 data (see the sketch below).
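A sketch of that archive step, reusing the year-prefix naming; the Chat table and SentAt column are assumed:

    -- End of year: move 2005's rows out of the hot table. Assumes
    -- dbo.[2005_Chat] already exists with the same structure as dbo.Chat.
    INSERT INTO dbo.[2005_Chat]
    SELECT * FROM dbo.Chat
    WHERE SentAt < '2006-01-01';

    DELETE FROM dbo.Chat
    WHERE SentAt < '2006-01-01';

    -- ...and retiring a whole year later is one cheap statement:
    DROP TABLE dbo.[2005_Chat];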
-Ace
