Here is my situation,
I have an application in which I need to store information about the results of different tests made on blood samples. I am currently using ASP.Net core for the web application and SQL Server for the database. (Might switch to Postgres as I will surely host on Linux and SQL Server for Linux is not totally available yet)
All the tests have some information in common, who performed it, at what time, any other related information for tracking purposes. But then all of them also have specific variables that I need to save for reporting/further calculations.
As of now I have about 20 different types of tests we perform on the samples we receive. The question I have is what would be the best way to save that data?
The two options I see are the following:
Have 20 different tables, all containing the general sample tracking info + specific test variables. This way, when I need to fetch the info, everything for a specific type of test is easily accessible. But then I need to query all these tables by join queries whenever I want to generate a report or modify sample results information (as all the test results/variables entry forms are in a single page). There if very few moments where I need to query only a specific type of test, most of the time, I need to retrieve them all at once, which means that I will always (mostly) query the 20+ tables every time I need to access sample data.
Have one big table containing all the results for the different tests performed and serialize (JSON format) only the specific test variables. So I would have all tracking information available (queryable, searchable, etc....) but the variables and results of each test would be in a single serialized column.
It is important to know that the variables/results won't be queried directly, I don't need to filter by them or anything like that (yet at the very least).
Now I wonder what would give me the best performance in the long term between using the multiple tables with join queries vs using serialization/deserialization that needs to take place whenever I access the data.
Also, I am aware that by serializing the test results/variables, I am losing ability to query by the information they contain (except for SQL server 2016 that now includes a way to query JSON information if I'm not mistaken...).
I also try to follow best practices by normalizing the database but I'm not a pro and I don't know what would be the best approach between my two options (or any other option if there is a better alternative, I'm totally open to better ideas)
So what would be the best approach and why?
Usage estimate
There might be around 15 to 30 millions tests performed every year. Of which I would say 2/3 would be of 5 different blood tests and the other third would be all the other tests performed.
Different table for different test is a good idea to work with.
Reason 1:If only 10 tests are performed on the sample rest of the column will unnecessary waste DB space.
Reason 2:Creating report will be easy in future according to samples
Reason 3:Filtering of data will be easy
Reason 4:maintenance will be easy
If in case of tests are mandatory go with 1 table
Related
I'm trying to design a database schema for Djabgo rest framework web application.
At some point, I have two choces:
1- Choose a schema in which in one or several apies, I have to get a queryset from database and iterate and order it with python. (For example, I can store some datas in an array-data-typed column, get them from database and sort them with python.)
2- store the data in another table and insert a kind of big number of rows with each insert. This way, I can get the data in my favorite format in much less lines with orm codes.
I tried some basic tests and benchmarking to see which way is faster, and letting database handle more of the job (second way) didn't let me down. But I don't have the means of setting a more real situatuin and here's the question:
Is it still a good idea to let database handle the job when it also has to handle hundreds of requests from other apies and clients each second?
Is database (and orm) usually faster and more reliable than backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system and so it has access to more resources.
Databases are designed to handle large data, so they are not limited by the memory in a single thread.
When this question comes up, often more data needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something.
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.
I am working on a group project and we are having a discussion about whether to calculate data that we want from an existing database and store it in a new database to query later, or calculate the data from the existing database every time we need to use it. I was wondering what the pros and cons may be for either implementation. Is there any advice you could give?
Edit: Here is more elaborate explanation. We have a large database that has a lot of information being submitted to it daily. We are building a system to track certain points of data. For example, we are getting the count of how many times a user does something that is entered in the database. Using this example (are actual idea is a bit more complex), we are discussing to methods of getting the count of actions per users. The first method is to create a database that stores the users and their action count, and query this database every time we need the action count. The second method would be to query the large database and count the actions per user every time we need to use it. I hope this explanation helps explain. Thoughts?
Edit 2: Two more things that may be useful to point out is 1: I only have read access to the large database and 2: My ultimate goal is to display this information on a web page for end users.
This is a generic question about optimization by caching. The following was my answer to essentially the same question. Even though that question provided a bunch of different details, none of them were specific enough to merit a non-generic answer either:
The more you want to calculate at query time, the more you want views,
calculated columns and stored or user routines. The more you want to
calculate at normalized base update time, the more you want cascades
and triggers. The more you want to calculate at some other (scheduled
or ad hoc) time, the more you use snapshots aka materialized views and
updated denormalized bases. You can combine these. Any time the
database is accessed it can be enabled by and restricted by stored
routines or other api.
Until you can show that they are in adequate, views and calculated
columns are the simplest.
The whole idea of a DBMS is to store a representation of your
application state as the database (which normalization reduces the
redundancy of) and then you query and let the DBMS implement and
optimize calculation of the answer. You haven't presented a reason for
not doing that in the most straightforward way possible.
[sic]
Always make sure an application is reading its own personal ("external") database that is a view of "the" ("conceptual") database so that when you change the implemention of the former (plus the rest of some combined interfact) by the latter (plus the rest of some compbined mechanisms) your applications do not have to change ("logical independence"). Here the applications are your users' and your trackers'.
Ultimately you must instrument and guestimate. When it is worth it you start caching. Preferably as much as possible in terms of high-level notions like views and snapshots and as little as possible in non-DBMS code. One of he benefits of the relational model is that it is easy to describe a strightforward relational interface in terms of another straightforward relational interface. You protect your applications from change by offering an interface that hides secrets of implementation or which of a family of interfaces is the current one.
I'm currently working on a home-automation project which provides the user with the possibility to view their energy usage over a period of time. Currently we request data every 15 minutes and we are expecting around 2000 users for our first big pilot.
My boss is requesting we that we store at least half a year of data. A quick sum leads to estimates of around 35 million records. Though these records are small (around 500bytes each) I'm still wondering whether storing these in our database (Postgres) is a correct decision.
Does anyone have some good reference material and/or advise about how to deal with this amount of information?
For now, 35M records of 0.5K each means 37.5G of data. This fits in a database for your pilot, but you should also think of the next step after the pilot. Your boss will not be happy when the pilot will be a big success and that you will tell him that you cannot add 100.000 users to the system in the next months without redesigning everything. Moreover, what about a new feature for VIP users to request data at each minutes...
This is a complex issue and the choice you make will restrict the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheap as possible --> ok for a database. But tell you boss that you cannot open the service like that and that you will have to change things before getting 10.000 new users per week.
One thing for the next release: have many data repositories: one for your user data that is updated frequently, one for you queries/statistics system, ...
You could look at RRD for your next release.
Also keep in mind the update frequency: 2000 users updating data each 15 minutes means 2.2 updates per seconds --> ok; 100.000 users updating data each 5 minutes means 333.3 updates per seconds. I am not sure a simple database can keep up with that, and a single web service server definitely cannot.
We frequently hit tables that look like this. Obviously structure your indexes based on usage (do you read or write a lot, etc), and from the start think about table partitioning based on some high level grouping of the data.
Also, you can implement an archiving idea to keep the live table thin. Historical records are either never touched, or reported on, both of which are no good to live tables in my opinion.
It's worth noting that we have tables around 100m records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
First of all, I would suggest that you make a performance test - write a program that generates test entries that corresponds to the number of entries you'll see over half a year, insert them and check results to see if query times are satisfactory. If not, try indexing as suggested by other answers. It is, btw, also worth trying write performance to ensure that you can actually insert the amount of data you're generating in 15 minutes in.. 15 minutes or less.
Making a test will avoid the mother of all problems - assumptions :-)
Also think about production performance - your pilot will have 2000 users - will your production environment have 4000 users or 200000 users in a year or two?
If we're talking a really big environment, you need to think about a solution that allows you to scale out by adding more nodes instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application by keeping track on which out of multiple database machines is hosting details for a specific user, or you can use one of the Postgresql clustering methods, or you could go a completely different path - the NoSQL approach, where you walk away completely from RDBMS and use systems which are built to scale horizontally.
There are a number of such systems. I only have personal experience of Cassandra. You have to think completely different compared to what you're used to from the RDBMS world which is something of a challenge - think more about how you want
to access the data rather than how to store it. For your example, I think storing the data with the user-id as key and then add a column with the column name being the timestamp and the column value being your data for that timestamp would make sense. You can then ask for slices of those columns for example for graphing results in a Web UI - Cassandra has good enough response times for UI applications.
The upside of investing time in learning and using a nosql system is that when you need more space - you just add a new node. Same thing if you need more write performance, or more read performance.
Are you not better off not keeping individual samples for the full period? You could possibly implement some sort of consolidation mechanism, which concatenates weekly/monthly samples into one record. And run said consolidation on a schedule.
You decision has to depend on the type of queries you need to be able to run on the database.
There are lots of techniques to handle this problem. you will only get performance if you touch minimum number of records. in your case you can use following techniques.
Try to keep old data in separate table here your can use table partitioning or can use a different kind of approach where you can store your old data in file system and can serve them directly from your application without connecting to database, this way your database will be free. I am doing this for one of my project and it already has more than 50GB of data but it is running very smoothly.
Try to index table columns but be careful as it will affect your insertion speed.
Try batch processing for your insertion or select queries. you can handle this issue very smartly here.
Example: suppose you are getting request to insert record in any table after every 1 second then you make a mechanism where you process this request in batch of 5 record in this way you will hit your database after 5 second which is much better. Yes, you can make users to wait for 5 second to wait for their record inserted like in Gmail where you send email and it ask you to wait/processing. for select you can put your resultset periodically in file system and can serve them directly to user without touching database like most stock market data company do.
You can also use some ORM like Hibernate. They will use some caching techniques to boost speed of your data.
For any further query you can mail me on ranjeet1985#gmail.com
I work with an application that it switching from filebased datastorage to database based. It has a very large amount of code that is written specifically towards the filebased system. To make the switch I am implementing functionality that will work as the old system, the plan is then making more optimal use of the database in new code.
One problem is that the filebased system often was reading single records, and read them repeatedly for reports. This have become alot of queries to the database, which is slow.
The idea I have been trying to flesh out is using two datasets. One dataset to retrieve an entire table, and another dataset to query against the first, thereby decreasing communication overhead with the database server.
I've tried to look at the DataSource property of TADODataSet but the dataset still seems to require a connection, and it asks the database directly if Connection is assigned.
The reason I would prefer to get the result in another dataset, rather than navigating the first one, is that there is already implemented a good amount of logic for emulating the old system. This logic is based on having a dataset containing only the results as queried with the old interface.
The functionality only have to support reading data, not writing it back.
How can I use one dataset to supply values for another dataset to select from?
I am using Delphi 2007 and MSSQL.
You can use a ClientDataSet/DataSetProvider pair to fetch data from an existing DataSet. You can use filters on the source dataset, filters on the ClientDataSet and provider events to trim the dataset only to the interesting records.
I've used this technique with success in a couple of migrating projects and to mitigate similar situation where a old SQL Server 7 database was queried thousands of times to retrieve individual records with painful performance costs. Querying it only one time and then fetching individual records to the client dataset was, at the time, not only an elegant solution but a great performance boost to that particular application: The most great example was an 8 hour process reduced to 15 minutes... poor users loved me that time.
A ClientDataSet is just a TDataSet you can seamlessly integrate into existing code and UI.
Spreading test data across multiple small data sets seems to me to create a maintenance headache whenever the schema is tweaked. Anybody see a problem with create a single larger test data set? By "larger" I'm still only talk about a couple hundred records in total.
I would not use a unique large dataset (you want to avoid any overhead if you don't need it) and follow DbUnit's Best Practices recommendations:
Use multiple small datasets
Most of your tests do not require the
entire database to be re-initialized.
So, instead of putting your entire
database data in one large dataset,
try to break it into many smaller
chunks.
These chunks could roughly
corresponding to logical units, or
components. This reduces the overhead
caused by initializing your database
for each test. This also facilitates
team development since many developers
working on different components can
modify datasets independently.
For integrated testing, you can still
use the CompositeDataSet class to
logically combine multiple datasets
into a large one at run time.
Some more feedback from the Unitils folks:
Automatic test database maintenance
When writing database tests, keep in mind following guidelines:
Use small sets of test data, containing as few data as possible. In your data files, only specify columns that are used in join columns or the where clause of the tested query.
Make data sets test class specific. Don't reuse data sets between different test classes, for example do not use 1 big domain data set for all your test classes. Doing so will make it very difficult to make changes to your test data for a test without braking anything for another test. You are writing a unit test and such a test should be independent of other tests.
Don't use too many data sets. The more data sets you use, the more maintenance is needed. Try to reuse the testclass data set for all tests in that testclass. Only use method data sets if it makes your tests more understandable and clear.
Limit the use of expected result data sets. If you do use them, only include the tables and columns that are important for the test and leave out the rest.
Use a database schema per developer. This allows developers to insert test data and run tests without interfering with each other.
Disable all foreign key and not null constraints on the test databases. This way, the data files need to contain no more data than absolutely necessary
Using small datasets with just enough data has worked decently for us in the past. Sure, there is some maintenance if you tweak the database but this is manageable with some organization.