What is the best way to optimize a database with millions of records and hundreds of simultaneous queries?
The database holds two significant tables in a one-to-many relationship (table1 has a column holding the key of table2).
Indexing has been applied for the relevant columns.
Caching is not very effective because each record is read only a few times after it has been inserted or updated, and not in a known time frame.
The data can be logically arranged to be distributed across different databases without the need for cross-database queries.
What is the best database engine and configuration for this table structure?
Can something be done in other layers of the application?
Related
A lot of people don't want to use ClickHouse just to do analytics for their company or project. They want to use it as the backbone for SaaS data/analytics products, which most of the time requires supporting semi-structured JSON data, and that can mean creating a lot of columns for each user you have.
Now, some experienced ClickHouse users say fewer tables means more performance, so having a separate table for each user is not an option.
Also, putting the data of too many users into the same database results in a very large number of columns, which some experiments suggest can make ClickHouse unresponsive.
So what about something like 20 users per database, with each user limited to 50 columns?
But what if you have thousands of users? Should you create thousands of databases?
What is the best solution possible to this problem?
Note: In our case, data isolation is not an issue; we are solving it at the application level.
There is no difference between 1000 tables in a single database and 1000 databases with a single table.
There is ALMOST no difference between 1000 tables and a single table with 1000 partitions, i.e. PARTITION BY (tenant_id, <some expression derived from a datetime column>).
The problem is the overhead of the MergeTree and ReplicatedMergeTree engines, and the number of files you need to create / read (it's really a data-locality problem, not a filesystem problem; it would be the same without a filesystem).
If you have 1000 tenants, the only way is to use ORDER BY (tenant_id, ...) plus restrictions enforced with row policies or at the application level.
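To illustrate, here is a minimal sketch of that approach (the table, column, policy, and user names are hypothetical, not from the answer): one shared MergeTree table with tenant_id leading the sorting key, plus an optional per-tenant row policy.

```sql
-- Hypothetical shared table: all tenants share one MergeTree table,
-- with tenant_id first in the sorting key so per-tenant reads stay local.
CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time);

-- Optional: restrict a tenant-specific user to its own rows with a row policy
-- (the alternative is enforcing the tenant filter at the application level).
CREATE ROW POLICY tenant_42_only ON events
FOR SELECT USING tenant_id = 42 TO tenant_42_user;
```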
I have experience with customers who have 700 replicated tables: it's a constant struggle with replication, you need to adjust the background pools, and there are problems with ZooKeeper (huge DB size, enormous network traffic between ClickHouse and ZooKeeper). ClickHouse takes 4 hours to start because it needs to read metadata from all 1,000,000 parts. Partition pruning works more slowly because it has to iterate through all parts during query analysis for every query.
The source of the issue is the original design; they had about 3 tables in Metrika, I guess.
Check this for example https://github.com/ClickHouse/ClickHouse/issues/31919
The chunk partitioning for hypertables is a key feature of TimescaleDB.
You can also create relational tables instead of hypertables, but without the chunk partitioning.
So if you have a database with around 5 relational tables and 1 hypertable, does it lose the performance and scalability advantage of chunk partitioning?
One of the key advantages of TimescaleDB over other time-series products is that time-series data and relational data can be stored in the same database and then queried and joined together. So, "by design", a database with several normal tables and a hypertable is expected to perform well. The usual PostgreSQL considerations about tables and other database objects, e.g., how shared memory is affected, apply here.
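For illustration, a minimal sketch of that mixed layout (table and column names are made up; the create_hypertable call assumes a TimescaleDB version that still accepts the column-name form). Only the time-series table gets chunk partitioning; the relational table behaves exactly as it would in vanilla PostgreSQL.

```sql
-- Ordinary relational table: stays a plain PostgreSQL table.
CREATE TABLE devices (
    device_id INTEGER PRIMARY KEY,
    location  TEXT
);

-- Time-series table: becomes a hypertable, chunk-partitioned on the time column.
CREATE TABLE measurements (
    time        TIMESTAMPTZ NOT NULL,
    device_id   INTEGER REFERENCES devices (device_id),
    temperature DOUBLE PRECISION
);
SELECT create_hypertable('measurements', 'time');

-- Relational and time-series data joined in one query, in one database.
SELECT d.location, avg(m.temperature) AS avg_temp
FROM measurements m
JOIN devices d USING (device_id)
WHERE m.time > now() - INTERVAL '1 day'
GROUP BY d.location;
```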
I'm an analyst preparing Tableau reports with analysis for other teams. I would like to get some of the workload off my shoulders by creating a data source so optimized that the users will be able to use it to get the data they need and do the analysis by themselves.
Current situation:
We use Amazon Redshift. We have tables with raw data coming directly from the systems. Also we have some transformed tables for easier work. All in all, it's tens and tens of tables. We are using Tableau desktop and Tableau server.
Desired situation:
I would like to retain access to the raw data so I can backtrack any potential issues back to the original source. From the raw data, I would like to create transformed tables that will allow users to make queries on them (two-layer system). The tables should contain all the data a user might need, yet be simple enough for a beginner-level SQL user.
I see two ways of approaching this:
Small number of very large tables with all the data. If there are just a couple of tables that contain the maximum amount of data, the user can just query one table and ask for the columns they need, or, if necessary, join one or two more tables to it.
Many small and very specialized tables. Users will have to do multiple joins to get the data they need, but each table will be very simple, so it should not be difficult.
Also, access permissions to the data need to be considered.
What do you think is a good approach to solving my issue? Is it any of the two above mentioned solutions? Do you have any other solution? What would you recommend?
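As a hedged sketch of the first option (a small number of wide tables built on top of the raw layer): the schema names (raw, reporting), tables, and columns below are made up, but the two-layer shape and the access-permission split are the ones described above.

```sql
-- Raw layer stays untouched; a transformed, denormalized table is built from it.
CREATE SCHEMA IF NOT EXISTS reporting;

CREATE TABLE reporting.orders_wide AS
SELECT o.order_id,
       o.order_date,
       c.customer_name,
       c.customer_segment,
       p.product_name,
       o.quantity,
       o.quantity * p.unit_price AS revenue
FROM raw.orders o
JOIN raw.customers c ON c.customer_id = o.customer_id
JOIN raw.products  p ON p.product_id  = o.product_id;

-- Access control: analysts see only the reporting layer, not the raw layer.
GRANT USAGE ON SCHEMA reporting TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO GROUP analysts;
```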
We had this problem and we solved it with AWS Athena. You pay only for the data that is scanned by your queries; if nothing is queried, you pay nothing and no data is touched.
With AWS Athena you can create any set of tables with different attributes, and the role permissions are easy to maintain.
Last part to cover: Tableau has a direct interface to Athena, so there is no need for any intermediate storage.
Also, any time you no longer want a table, just delete it and remove it from the roles; the rest is taken care of automatically.
On an additional note, we tried Redshift Spectrum on JSON data and it does not work with nested JSON yet, so all your attributes should be only one level deep.
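For illustration, a minimal sketch of such an Athena table over flat (one-level-deep) JSON in S3; the database, table, columns, and bucket path are made up:

```sql
-- Hypothetical Athena external table over flat JSON files in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.user_events (
    user_id  string,
    event    string,
    amount   double,
    event_ts string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/user-events/';
```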
Hope it helps.
EDIT1:
Redshift is a columnar database, so there is no real difference between small tables and big tables if you can avoid joins to smaller tables. Even if the table is bigger, your query speed depends on the columns involved in the query; a column that is not required by the query is never touched when you query the data.
I prefer to have all related data in one bigger table, so there is no need to duplicate relations or join other tables.
You also need to ensure there is not too much duplication of data when you store it in a bigger table.
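To make the column-pruning point concrete (reusing the hypothetical wide table from the sketch above): only the columns named in the query are read from storage.

```sql
-- Only customer_segment and revenue are scanned; the other columns
-- of the wide table are never touched.
SELECT customer_segment,
       SUM(revenue) AS total_revenue
FROM reporting.orders_wide
GROUP BY customer_segment;
```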
For more about database normalization, see: MySQL: multiple tables or one table with many columns?
I'm a senior database developer who has always practiced creating four auditing columns on most database tables, if not all, as follows:
DATE_INSERTED
USER_INSERTED
DATE_UPDATED
USER_UPDATED
The reason I wish to capture this information is NOT to comply with some external auditing requirement like Sarbanes-Oxley. It's simply for troubleshooting purposes: when Development is asked to investigate some data scenario in the database, knowing who originally inserted a record and when, as well as who last updated it and when, strongly aids the troubleshooting effort. Storing every version/state of a record that ever existed is probably overkill, but it might be useful in some cases.
I'm new to both Azure SQL Database (essentially SQL Server) and its memory-optimized tables (in-memory tables), which I'm planning to implement. The number of bytes you can store in memory database-wide is limited, so I'm being very conscious of limiting the number of bytes I put into memory, especially for large tables. These four auditing columns will eat up a significant number of in-memory bytes if I add them to large in-memory tables, and I'd love to avoid that. The audit columns will never be queried or displayed in app code or reports, but they will likely be populated by a combination of column default values, triggers and app code.
My question is whether there's a good data modeling strategy to keep these four columns out of memory while keeping the remainder of the table in memory. The worst-case scenario seems to be creating the main in-memory table and then creating a separate on-disk table with the 4 audit columns and a UNIQUE FOREIGN KEY column pointing to the in-memory table (thereby creating a 1-to-1 FK instead of 1-to-many). But I was hoping there was a more elegant approach to accomplishing this in-memory/on-disk split, perhaps leveraging some SQL Server feature I haven't been able to find. As a bonus, it would be nice if the main table and the audit columns appeared to SQL as a single table without having to implement a database view.
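For reference, a minimal T-SQL sketch of the "worst-case" split described above (all names are hypothetical). One caveat worth noting: SQL Server does not support foreign keys between memory-optimized and disk-based tables, so the 1-to-1 link below is only a UNIQUE key, kept consistent by whatever defaults/triggers/app code populate the audit values.

```sql
-- Hypothetical main table, memory-optimized, without the audit columns.
-- Assumes the database supports In-Memory OLTP (e.g. an Azure SQL tier that includes it).
CREATE TABLE dbo.Orders
(
    OrderId    INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Disk-based side table holding only the four audit columns, one row per order.
-- No FOREIGN KEY: cross-container foreign keys (memory-optimized <-> disk-based)
-- are not supported, so the 1-to-1 relationship is expressed as a UNIQUE key.
CREATE TABLE dbo.Orders_Audit
(
    OrderId       INT       NOT NULL UNIQUE,
    DATE_INSERTED DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    USER_INSERTED SYSNAME   NOT NULL DEFAULT SUSER_SNAME(),
    DATE_UPDATED  DATETIME2 NULL,
    USER_UPDATED  SYSNAME   NULL
);
```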
Thanks in advance for any suggestions!
We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
So much data is arriving in the database tables, and so much data is being read from them, at such high velocity that the system has started to fail.
The tables have indexes on them; one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is smaller, with only a few million records, and is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading from and writing to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition, so we can't use partitioning or memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display that data to thousands of them on their dashboards in real time.
The requirements and specifics are way too broad to give a concrete answer, but one suggestion would be to set up a second database and do log shipping over to it, so the original database becomes the "write" database and the new one the "read" database (a minimal restore sketch is at the end of this answer).
Cons
- Disk space
- The read database would be out of date by the length of time the log transfer takes
Pros
- You could possibly drop some of the indexes on the "write" database, which could increase performance
- You could then summarize the tables in the "read" database to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
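For context, the piece that makes the "read" copy queryable is restoring each shipped log backup in STANDBY mode; a hedged sketch follows (server, database, and file paths are placeholders):

```sql
-- On the secondary ("read") server: restore the shipped log backup in STANDBY mode
-- so the database stays readable between restores. The read copy is only as fresh
-- as the last restored backup, which is the lag mentioned in the cons above.
RESTORE LOG MyAppDb
FROM DISK = N'\\fileshare\logship\MyAppDb_20170101_1200.trn'
WITH STANDBY = N'D:\logship\MyAppDb_undo.tuf';
```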
Here are some ideas, some more complicated than others; their usefulness depends heavily on the usage pattern, which isn't fully described in the question (a few of them are sketched in T-SQL after the list). Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, one file for each core on your system; even if a query runs entirely in memory, it can still block on the number of I/O threads).
[Simple] Use the SIMPLE recovery model instead of FULL for the transaction log.
[Simple] Write the transaction log to a separate spindle from the rest of the data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try to put data that is not updated into a separate table so the indexes on the static data don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH NOLOCK on tables and remove row and page locks from indices. You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove inserts with transactional locks
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
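A few of these, sketched in T-SQL; the database, file, table, and index names are placeholders, not from the question.

```sql
-- [Simple] Extra tempdb data file (repeat per core, per the rule of thumb above).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 8GB);

-- [Simple] Switch the user database to the SIMPLE recovery model.
ALTER DATABASE MyAppDb SET RECOVERY SIMPLE;

-- [Complicated] Covering index: small key, extra columns carried as INCLUDEd columns.
CREATE NONCLUSTERED INDEX IX_Readings_ReadingTime
ON dbo.Readings (ReadingTime)
INCLUDE (DeviceId, HeartRate);

-- [Complicated] Turn off row and page locks on that index.
ALTER INDEX IX_Readings_ReadingTime ON dbo.Readings
SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);

-- [Complicated] Read query that tolerates missing in-flight rows
-- in exchange for not blocking (or being blocked by) writers.
SELECT DeviceId, AVG(HeartRate) AS AvgHeartRate
FROM dbo.Readings WITH (NOLOCK)
WHERE ReadingTime >= DATEADD(HOUR, -1, SYSUTCDATETIME())
GROUP BY DeviceId;
```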
That's a fairly generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation, which is why DBAs have jobs. Some of the "complicated" items could be simple if your architecture already supports them.