I was reviewing the Oracle setup at our place and have a few doubts about best practices. We are using Oracle 12c and planning to move to 19c by EOY, so I'm open to answers for both versions if there is a difference.
In a microservice architecture we have 10 apps, each interacting with its own 10 tables. Is it better to have 10 different databases, or 10 different users/schemas in the same database?
All the tables together hold 31 TB of data. It's mentioned that an Oracle 12c bigfile tablespace can only grow to 32 TB. Is that a true limitation of Oracle, that it can't grow any further?
For tablespaces, currently all objects are stored in one tablespace and all indexes in a second one. Is that a good strategy? Is there something better, such as a separate tablespace per user/microservice, or storing CLOBs in one tablespace and everything else in another?
Is there an out-of-the-box purge or archive solution? I have seen the row-level archival option, which basically just turns a flag on or off per row. Ideally I would like functionality where, every weekend, data older than a year gets purged or archived automatically.
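Right now the only thing I can think of is hand-rolling it with DBMS_SCHEDULER; a minimal sketch of what I mean (ORDERS and CREATED_DATE are placeholder names, and the delete could just as well be an insert into an archive table):

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'PURGE_OLD_ORDERS',
    job_type        => 'PLSQL_BLOCK',
    -- remove (or archive) everything older than a year
    job_action      => 'BEGIN DELETE FROM orders WHERE created_date < ADD_MONTHS(SYSDATE, -12); COMMIT; END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=WEEKLY; BYDAY=SAT; BYHOUR=2',
    enabled         => TRUE);
END;
/

So I'm wondering whether there is something more out of the box than this.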
There is a table of orders that need to be fulfilled within the next 3 months; once they are past their delivery date they stay in the table for SELECTs only and are never updated again. I was thinking of applying partitioning to such tables. There is about 100 GB of data in each of them. Will partitioning help, and what kind of strategy will work best for this use case?
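What I had in mind is something like monthly interval partitioning on the delivery date; a rough sketch (table and column names are only illustrative):

CREATE TABLE orders_part (
  order_id      NUMBER       NOT NULL,
  delivery_date DATE         NOT NULL,
  status        VARCHAR2(20)
)
PARTITION BY RANGE (delivery_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01'));

-- queries on a delivery-date range would prune to the relevant partitions,
-- and an old month could be removed or archived as a unit, e.g.
ALTER TABLE orders_part DROP PARTITION FOR (DATE '2024-03-15') UPDATE INDEXES;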
A lot of people don't want to use ClickHouse just to do analytics for their company or project; they want to use it as the backbone for SaaS data/analytics products. That usually means supporting semi-structured JSON data, which can result in creating a lot of columns for each user you have.
Now, some experienced ClickHouse users say that fewer tables means better performance, so having a separate table for each user is not an option.
Also, putting the data of too many users into the same database results in a very large number of columns, which some experiments suggest can make ClickHouse unresponsive.
So what about something like 20 users per database, with each user limited to 50 columns?
But what if you got thousands of users? Should you create thousands of databases?
What is the best solution possible to this problem?
Note: In our case, data isolation is not an issue, we are solving it on the application level.
There is no difference between 1000 tables in a single database and 1000 databases with a single table.
There is ALMOST no difference between 1000 tables and a single table with 1000 partitions, PARTITION BY (tenant_id, <some_expression_from_datetime>).
The problem is the overhead of the MergeTree and ReplicatedMergeTree engines, and the number of files you need to create/read (really a data-locality problem rather than a file problem; it would be the same without a filesystem).
If you have 1000 tenants, the only way is to use ORDER BY (tenant_id, ...) plus restrictions enforced with row policies or at the application level.
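A minimal sketch of that shape (the table, columns and user below are invented for illustration):

CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time);

-- optional: enforce the per-tenant restriction inside ClickHouse instead of the application
CREATE ROW POLICY tenant_42_policy ON events
    FOR SELECT USING tenant_id = 42 TO tenant_42_user;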
I have experience with customers who have 700 replicated tables: it's a constant struggle with replication, background pools need adjusting, and there are problems with ZooKeeper (huge DB size, enormous network traffic between ClickHouse and ZooKeeper). ClickHouse takes 4 hours to start because it has to read the metadata of all 1,000,000 parts. Partition pruning works slower because it iterates over all parts during query analysis for every query.
The source of the issue is the original design; they had about 3 tables in Metrika, I guess.
Check this for example https://github.com/ClickHouse/ClickHouse/issues/31919
I'm currently trying to develop a data-logging solution for a SCADA application, using SQL Server 2012 Express. The SCADA application is configured to execute a stored procedure on SQL Server to push data into the DB.
The data flow is, IMHO, quite heavy (1.4-1.9 million rows per day, averaging 43 bytes in length, after some tweaks). The table which stores the data has one clustered index on three columns. For now our focus is to store this data as compactly as possible and without generating too much fragmentation (SELECTs are not of major interest right now).
Currently the DB occupies ~250 MB (I have pre-allocated 5120 MB for it) and holds only this data table, one other table which can be ignored, and the transaction log.
My questions are:
How can I set up index maintenance on this DB? Being the Express edition, I can't use SQL Server Agent, so I'll use Task Scheduler, but should I rebuild or reorganize? Is it advisable to use a fill factor under 100? Should I configure the scheduled task to run at intervals such that it only ever reorganizes (fragmentation under 30%)? Once the DB reaches its maximum storage space, is rebuilding an increasingly expensive operation (if the index is rebuilt on day x, will day x+1 take less time to rebuild than rebuilding only once every 2 days)?
Also, SQL Server Express limits the data capacity to 10 GB, and I'm trying to squeeze as much as I can into that amount. I'm planning to build a ring buffer: can I set up the DB so that, once the event log shows the message that the ALTER DATABASE expand failed, the stored procedure will UPDATE the oldest rows as its way of inserting new data? (My fear is that even updates will take some new space, and at that point I'll have to shrink the DB fairly aggressively.)
I have also considered storing the DB files on a compressed Windows partition, or using a free, unlimited DB such as MySQL for storage and SQL Server only as a frontend (the SCADA app must be configured against SQL Server). Is that worth considering?
To optimize inserts I'm using a global temp table which holds up to 1k rows (counted with a sequence) as a form of buffer; I then push the data into the main table and truncate the temp table. Is that efficient? Should I use transactions for efficiency instead? I've tried beginning a named transaction in the stored procedure if one doesn't exist, and committing it once the sequence reaches 1k. Does increasing the threshold to 10k rows lead to less fragmentation?
If you're thinking I'm unfamiliar with databases, you are right. At the moment there is only one SCADA application using SQL Server, but the application is set up redundantly, so in the end everything will take twice the resources (and each instance of the SCADA application will get its own storage). I should also mention that I can't just upgrade to a higher edition of SQL Server, but I do have the freedom to use any piece of free software.
Most of the answers cut across your 4 numbered questions, so I've just put my responses in bullets to help:
Indexes should probably be maintained, but in your case they can be prohibitive. Beyond the clustered index on the table, indexes (the nonclustered type) generally exist for querying the data.
To let a scheduled task do the work, since no Agent can be used, use the sqlcmd utility (https://technet.microsoft.com/en-us/library/ms165702%28v=sql.105%29.aspx). The command-line tools may have to be installed, but they let you write a batch script that runs the SQL commands.
With an app doing as much inserting as you describe, I would design a 2-step process. First, a basic table with no nonclustered indexes to accept the inserts. Second, a table you'd query the data from. Then use a scheduled task to call a stored proc that transfers the transactional data from table 1 to table 2, perhaps hourly or daily depending on your query needs (and also removes the transferred data from table 1; that part should definitely be done in a transaction).
Otherwise, every insert has to write not only the raw row into the table, but also an entry into every nonclustered index.
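A rough sketch of that 2-step shape (all object names here are invented, and dbo.ScadaLog stands for your existing indexed main table):

-- step 1: narrow heap with no nonclustered indexes, the cheapest possible insert target
CREATE TABLE dbo.ScadaLog_Staging (
    TagId    int          NOT NULL,
    LogTime  datetime2(0) NOT NULL,
    TagValue real         NOT NULL
);
GO
-- step 2: proc the scheduled task calls to move data into the queryable table
CREATE PROCEDURE dbo.MoveStagingToMain
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        INSERT INTO dbo.ScadaLog (TagId, LogTime, TagValue)
        SELECT TagId, LogTime, TagValue
        FROM dbo.ScadaLog_Staging WITH (TABLOCKX, HOLDLOCK);  -- hold off new inserts until the move commits

        TRUNCATE TABLE dbo.ScadaLog_Staging;
    COMMIT TRANSACTION;
END;
GO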
Due to the quantity of your inserts, high fill factors should probably be avoided (probably set to less than 50%). A 100% fill factor means the nonclustered indexes don't leave any free space in their pages for newly inserted records, so every record you insert can force pages to be split and reorganized. A lower fill factor leaves space in each page so new records can be inserted into the indexes without reorganizing them.
To optimize your inserts, I would use the 2-step process above and insert records straight into your first table. If your app can use SQL Bulk Copy, I would explore that as well.
To optimize space, you can explore a few things:
Do you need all the records accessible in real time? Perhaps you can work with the business to create a data retention policy in which you keep every record in the database for 24 hours, then a summary by minute or something for 1 week, hourly for 2 weeks, daily for 6 months, etc. You could enhance this with a daily backup so that you could restore any particular day in its entirety if needed.
Consider changing the database recovery model from full to simple or bulk-logged. This can keep your transaction log under control with the bulk inserts you may be doing.
More Info: https://technet.microsoft.com/en-us/library/ms190692%28v=sql.105%29.aspx
You'll have to work hard to manage your transaction log. Take frequent checkpoints and transaction log backups.
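For illustration, the recovery model change would look like this (ScadaDb is just a placeholder name):

ALTER DATABASE ScadaDb SET RECOVERY SIMPLE;       -- log truncates at checkpoints, no log backups possible
-- or, to keep log backups while minimally logging bulk operations:
ALTER DATABASE ScadaDb SET RECOVERY BULK_LOGGED;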
Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data with one group key is totally independent of the data with other group keys. Reports are built using only one group key at a time, so every query has a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively by parallel processes which add data to different groups, but I don't need real-time reports; one update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I found that some reports take significant time to generate (up to 15 seconds) because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about this is:
Splitting the one database into several, one database per group key. Then I would get rid of "WHERE groupKey = X" (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once every 15 minutes. (Can I actually do that, or would I have to write custom code?)
I don't want to switch to a NoSQL database, because I would have to rewrite all the SQL queries and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but could get one if I had to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process: a db function (or script) gets run via your job scheduler of choice to refresh the data every N minutes (see the sketch after these two points).
2). Regarding replication, if you are using streaming replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only), which you can then use for reporting.
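A sketch of the summary-table idea from point 1, assuming PostgreSQL 9.4 or later and made-up table/column names (a plain aggregate table refreshed by a function would work the same way on older versions):

CREATE MATERIALIZED VIEW report_summary AS
SELECT group_key,
       date_trunc('day', created_at) AS day,
       count(*)    AS row_count,
       sum(amount) AS total_amount
FROM   source_data
GROUP  BY group_key, date_trunc('day', created_at);

-- a unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX report_summary_key ON report_summary (group_key, day);

-- run this every N minutes from your scheduler; readers are not blocked during the refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY report_summary;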
Tune the report queries. Use EXPLAIN. Avoid procedures when you can do the work in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade your PostgreSQL version.
Run VACUUM.
Of these 4, only the first requires significant changes in the application.
It takes about 5-10 minutes to refresh a prepared reporting table. We want to refresh this table constantly (maybe once every 15 minutes or continuously).
We query this reporting table very frequently (many times per minute) and I can't have it unavailable for any length of time. It is okay if the data is 15 minutes old.
I can't drop the table and recreate it. I can't delete the table's contents and recreate it.
Is there a technique I should be using, like swapping between two tables (read from one while we build the other) or do I put this 5-10 minute process in a large transaction?
Use synonyms? On creation, the synonym points to tableA:
CREATE SYNONYM ReportingTable FOR dbo.tableA;
15 minutes later you create tableB and redefine the synonym
DROP SYNONYM ReportingTable;
CREATE SYNONYM ReportingTable FOR dbo.tableB;
The synonym is merely a pointer to the actual table: this way the handling of the actual table renames etc. is simplified and abstracted away, and all code/clients just use ReportingTable.
Edit, 24 Nov 2011
Synonyms are available in all edition: partition switching is Enterprise/Developer only.
Edit, Feb 2012
You can switch whole tables in standard edition (maybe Express, untested)
ALTER TABLE .. SWITCH ..
This would be more elegant than synonyms if the target table is empty.
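A sketch of the whole-table switch (names are placeholders; both tables must have identical structure and live on the same filegroup, and the target must be empty):

TRUNCATE TABLE dbo.ReportingTable;   -- target must be empty for the switch
ALTER TABLE dbo.ReportingTable_New SWITCH TO dbo.ReportingTable;   -- metadata-only, effectively instant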
Edit, Feb 2012 (2)
Also, you can rotate via schemas as per Caching joined tables in SQL Server
Yes, you should swap tables, and if not already done, consider using a different server or other physical partitioning for the reporting table.
The recommended approach for near real-time reporting is to offload reads from the operational system, and separate write activity from read activity in the reporting system.
You have done the first part, at least logically, by having a prepared table. Swapping between a read-only table for users and a separate table for updates eliminates write-read conflicts between transactions. The cost is the cache latency for users, but if required, it should be possible to take steps to minimize the preparation time and swap the tables more often.
For more information on design choices in real-time reporting, I recommend a well written paper by Wayne Eckerson, Best Practices in Operational BI.
Having two tables sounds like the simplest solution.
In our project, we used two tables and CREATE/ALTER VIEW to switch between them.
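Something along these lines (view and table names are just illustrative):

CREATE VIEW dbo.ReportingView AS SELECT * FROM dbo.tableA;
GO
-- after rebuilding dbo.tableB, repoint readers without touching either table:
ALTER VIEW dbo.ReportingView AS SELECT * FROM dbo.tableB;
GO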
Hi all!
My client currently has a SQL Server database that performs 3-4 million inserts a day, about as many updates, and even more reads, every day. The current DB is laid out weirdly, IMHO: incoming data goes into a "Current" table, then nightly the records are moved to corresponding monthly tables (MarchData, AprilData, MayData, etc.) that are exact copies of the Current table (schema-wise, I mean). Reads are done from a view that UNIONs all the monthly tables and the Current table; inserts and updates go only to the Current table. It was explained to me that the separation of data into 13 tables was motivated by the fact that all those tables use separate data files and those data files are written to 13 physical hard drives, so each table gets its own hard drive, supposedly speeding up the view's performance.
What I'm noticing is that the nightly move of records to the monthly tables (which runs every 2 minutes throughout the night, for 8 hours) coincides with the full backup, and the DB starts crawling, the web site times out, etc.
I was wondering: is this approach really the best one out there, or should we consider a different approach? Please keep in mind that the database is about 300-400 GB and growing by 1.5-2 GB a day. Every so often we move records that are more than 12 months old to a separate archive database.
Any insight is highly appreciated.
If you are using MS SQL Server, consider Partitioned Tables and Indexes.
In short: you can group your rows by some value, e.g. by year and month. Each group is accessible as if it were a separate table with its own index, so you can list, summarize and edit February 2011 sales without touching all the rows. Partitioned tables complicate the database, but for extremely long tables they can give significantly better performance. They also support filegroups, so different partitions can be stored on different disks.
This Microsoft-provided feature seems very similar to your setup, with one important difference: it doesn't move records over night.
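A rough sketch of what monthly partitioning looks like (names and boundary dates are invented; on SQL Server of that era this is an Enterprise/Developer feature):

CREATE PARTITION FUNCTION pfMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);   -- or map each partition to its own filegroup/disk

CREATE TABLE dbo.SalesData (
    SaleId   bigint   NOT NULL,
    SaleDate datetime NOT NULL,
    Amount   money    NOT NULL
) ON psMonthly (SaleDate);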
It was explained to me that the separation of data into 13 tables was motivated by the fact that all those tables use separate data files and those data files are written to 13 physical hard drives. So each table gets its own hard drive,
There is one statement for that: IDIOTS AT WORK.
Tables are not stored on discs but in filegroups, which can span multiple data files. Note this: you can have one filegroup with 13 data files on 13 discs, and a table will be DISTRIBUTED OVER ALL 13 DISCS. No need to play stupid silly games to distribute the load; it is already possible just by reading the documentation.
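What that looks like in practice (database name, file names and paths are made up):

ALTER DATABASE BigDb ADD FILEGROUP DataFG;
ALTER DATABASE BigDb ADD FILE (NAME = Data01, FILENAME = 'D:\Data\Data01.ndf', SIZE = 50GB) TO FILEGROUP DataFG;
ALTER DATABASE BigDb ADD FILE (NAME = Data02, FILENAME = 'E:\Data\Data02.ndf', SIZE = 50GB) TO FILEGROUP DataFG;
-- any table created ON DataFG is striped across both files by proportional fill
CREATE TABLE dbo.BigTable (Id bigint NOT NULL, Payload varchar(100) NOT NULL) ON DataFG;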
Even then, I seriously doubt 13 discs are fast. Really. I run a smaller database privately (merely 800 GB) that has 6 discs for the data alone, and my current work assignment is into three digits of discs (that is, 100+). Please do not call 13 discs a large database.
Anyhow, SHOULD the need arise to distribute data, partitioned tables (again a standard SQL Server feature, albeit Enterprise edition only), not a UNION, are the way to go.
Please mind, that the database is about 300-400 GB and growing by 1.5-2 GB a day.
Get a decent server.
I was wondering is this approach really the best approach out there?
Oh, hardware. Get one of the SuperMicro boxes built for databases, 2 to 4 rack units high, SAS backplane, 24 to 72 slots for discs. Yes, in one computer.
Scrap that monthly blabla-table crap that someone came up with who obviously should not be working with databases. Put it all in one table. Use filegroups and multiple data files to distribute the load for all tables across the various discs. Unless...
...you actually realize that running discs like that is gross neglect. A RAID 5, RAID 6 or RAID 10 is in order; otherwise your server is possibly down when a disc fails, which will happen, and restoring a 600 GB database takes time. I run RAID 10 for my data discs, but then I privately have tables with about a billion rows (and at work we add about that many a day). Given the SMALL size of the database, a couple of SSDs would also help: their IOPS budget would mean you could go down to possibly 2-3 discs and get a lot more speed out of them. If that is not possible, my bet is that those discs are slow 3.5" discs at 7200 RPM; an upgrade to enterprise-level discs would help. I personally use 300 GB VelociRaptors for databases, but there are 15k SAS discs to be had ;)
Anyhow, this sounds really badly set up. So bad that I would either be happy my trainee came up with something that smart (as it would definitely be over a trainee's head), or my developer would stop working for me the moment I found out (on grounds of gross incompetence; feel free to challenge it in court).
Reorganize it. Also be careful with any batch processing: those jobs NEED to be time-staggered so they do not overlap with backups. There is only so much IO a simple low-speed disc can deliver.