Hi all!
My client currently has a SQL Server database that performs 3-4 million inserts a day, about as many updates, and even more reads, every day. The current DB is laid out oddly, IMHO: incoming data goes into a "Current" table, then nightly the records are moved to the corresponding monthly tables (e.g. MarchData, AprilData, MayData, etc.), which are exact copies of the Current table (schema-wise, I mean). Reads are done from a view that UNIONs all the monthly tables and the Current table; inserts and updates are done only to the Current table.

It was explained to me that the separation of data into 13 tables was motivated by the fact that all those tables use separate data files, and those data files are written to 13 physical hard drives. So each table gets its own hard drive, supposedly speeding up the view's performance. What I'm noticing is that the nightly move to the monthly tables (which runs every 2 minutes over the 8-hour night window) coincides with the full backup, and the DB starts crawling, the web site times out, etc.
I was wondering, is this approach really the best one out there, or should we consider a different approach? Please mind that the database is about 300-400 GB and growing by 1.5-2 GB a day. Every so often we move records that are more than 12 months old to a separate archive database.
Any insight is highly appreciated.
If you are using MS SQL Server, consider Partitioned Tables and Indexes.
In short: you can group your rows by some value, e.g. by year and month. Each group is then accessible almost like a separate table with its own indexes, so you can list, summarize and edit February 2011 sales without touching all the other rows. Partitioned tables complicate the database, but in the case of extremely long tables they can lead to significantly better performance. Partitioning also works with filegroups, so different groups can be stored on different disks.
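As an illustration, here is a minimal sketch of monthly partitioning; the table, column and boundary values are assumptions for the example, not taken from your schema:

    -- Partition function splitting rows by month (boundary dates are just examples)
    CREATE PARTITION FUNCTION pfMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2011-02-01', '2011-03-01', '2011-04-01');

    -- Partition scheme mapping every partition to a filegroup (PRIMARY here for simplicity)
    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- Hypothetical table created on the partition scheme; rows land in the right partition automatically
    CREATE TABLE dbo.Sales
    (
        SaleId   bigint   NOT NULL,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL,
        CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleDate, SaleId)
    ) ON psMonthly (SaleDate);

Queries that filter on SaleDate only touch the relevant partitions, and each partition (or all of them) can live on its own filegroup instead of [PRIMARY].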
This MS-provided solution is very similar to yours, except for one important thing: it doesn't move records over night.
It was explained to me that the separation of data into 13 tables was motivated by the fact that all those tables use separate data files and those data files are written to 13 physical hard drives. So each table gets its own hard drive...
There is one statement for that: IDIOTS AT WORK.
Tables are not stored on discs, but in filegroups, which can span multiple data files. Note this... you can have one filegroup that has 13 data files on 13 discs, and a table would be DISTRIBUTED OVER ALL 13 DISCS. No need to play stupid silly games to distribute the load; it is already possible just by reading the documentation.
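A minimal sketch of what that looks like (the database name, file names and drive letters are made up for illustration):

    -- One filegroup spread over several data files that sit on different physical discs
    ALTER DATABASE MyDb ADD FILEGROUP fgData;

    ALTER DATABASE MyDb ADD FILE (NAME = fgData1, FILENAME = 'D:\Data\fgData1.ndf') TO FILEGROUP fgData;
    ALTER DATABASE MyDb ADD FILE (NAME = fgData2, FILENAME = 'E:\Data\fgData2.ndf') TO FILEGROUP fgData;
    ALTER DATABASE MyDb ADD FILE (NAME = fgData3, FILENAME = 'F:\Data\fgData3.ndf') TO FILEGROUP fgData;

    -- Any table created ON that filegroup is striped across all of its files automatically
    CREATE TABLE dbo.IncomingData (Id bigint NOT NULL, Payload nvarchar(400)) ON fgData;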
Even then, I seriously doubt 13 discs are fast. Really. I run a smaller database privately (merely 800 GB) that has 6 discs for the data alone, and my current work assignment is into three digits of discs (that is, 100+). Please do not call a 13-disc setup a large database.
Anyhow, SHOULD the need arise to distribute data, partitioned tables (again a standard SQL Server feature, albeit Enterprise Edition only) are the way to go, not a UNION.
Please mind that the database is about 300-400 GB and growing by 1.5-2 GB a day.
Get a decent server.
I was wondering, is this approach really the best one out there?
Oh, and hardware: get one of the SuperMicro boxes made for databases, 2 to 4 rack units high, with a SAS backplane and 24 to 72 disc slots. Yes, in one computer.
Scrap that monthly-table crap that someone came up with who obviously should not work with databases. Put it all in one table. Use filegroups and multiple data files to distribute the load for all tables across the various discs. Unless...
...you actually realize that running discs like that is gross neglect. A RAID 5, RAID 6 or RAID 10 is in order; otherwise your server is possibly down when a disc fails, which will happen, and restoring a 600 GB database takes time. I run RAID 10 for my data discs, but then privately I have tables with about a billion rows (and at work we add about that many a day). Given the SMALL size of the database, a couple of SSDs would also help... their IOPS budget would mean you could possibly go down to 2-3 discs and get a lot more speed out of them. If that is not possible, my bet is that those discs are slow 3.5" discs at 7200 RPM... an upgrade to enterprise-level discs would help. I personally use 300 GB VelociRaptors for databases, but there are 15k SAS discs to be had ;)
Anyhow, this sounds really badly set up. So bad that I would either be happy my trainee came up with something that smart (as it would definitely be over the head of a trainee), or my developer would stop working for me the moment I find that out (based on gross incompetence; feel free to challenge it in court).
Reorganize it. Also be careful with any batch processing - those jobs NEED to be time-staggered so they do not overlap with backups. There is only so much IO a simple low-speed disc can deliver.
I have an Access database for which I am the only user. It's the first database I've built. It has 16 related tables, around 40 select queries, and a dozen or so update/delete queries. It is already 512 MB and will at least double in size as additional data is added to the tables and more queries and reports are created over the next 12 months. The largest table (which is accessed by most of the queries) is around 800k rows by 11 fields. This table will most likely grow to over 2M rows over the useful life of the database (c. 12 months).
Queries that ran in under 30 seconds a month ago are starting to run slower as the tables have grown, with some queries that include calculations now taking 10 minutes or so to complete (and yes, I am using stacked queries as much as possible).
Does anyone have solid advice one way or the other as to the performance boost I could expect from splitting the database into a front end and a back end?
Thanks
No, you wouldn't get a performance boost from splitting; only query optimization and careful indexing can speed up the query time.
That said, you should split it anyway (create a backup and run the splitting wizard), if for nothing else than to ease backing up your data and to make it independent of your ongoing development of the frontend.
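On the indexing side, a minimal sketch; the table and field names are assumptions, and the same idea applies whether the index lives in Access or in a server back end:

    -- Index the fields the big table is filtered and joined on most often
    CREATE INDEX idxMainTableEntryDate ON MainTable (EntryDate);
    CREATE INDEX idxMainTableCategory  ON MainTable (CategoryID);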
We work for a small company that cannot afford to pay for a SQL DBA or for consulting.
What started as a small project has now become a full-scale system with a lot of data.
I need someone to help me sort out performance improvements. I realise no one will be able to help directly and nail this issue completely, but I just want to make sure I have covered all the bases.
OK, the problem is basically that we are experiencing time-outs with our queries on cached data. I have increased the time-out value in the C# code, but I can only go so far before it becomes ridiculous.
The current setup is a database that has data inserted every 5-10 seconds, constantly! During this process we populate tables from CSV files. Overnight we run data-caching processes that reduce the load on the "inserted" tables. Originally we were able to condense 10+ million rows into, say, 400,000 rows, but as users want more filtering we had to include more data, which of course increased the cached tables from 400,000 to 1-3 million rows.
On my SQL development server (which does not have data inserted every 5 seconds), queries on a cached table with 5 million rows used to take 30 seconds; with indexing and some improvements they now take 17 seconds. The live server runs SQL Server Standard and used to take 57 seconds, now 40 seconds.
We have 15+ instances running, with the same number of databases.
So far we have outlined the following ways of improving the system:
Indexing on some of the cached tables - but the database is now bloated and the overnight processes have slowed down.
Increased CommandTimeout
Moved databases to SSD
Improvements we are likely to make next:
Realised we will have to move the CSV files to another hard disk, not the same SSD drive the SQL Server databases reside on.
Possibly use filegroups for indexes or cached tables - filegroups are available in all SQL Server editions, including Standard (see the sketch after this list).
Enterprise Edition and partitioned tables - the customer may pay for this, but we certainly can't afford it ourselves.
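A minimal sketch of putting an index on its own filegroup; the database, table, index and file names are made up, and this works on Standard Edition:

    -- Dedicated filegroup for nonclustered indexes, ideally on a separate physical drive
    ALTER DATABASE CacheDb ADD FILEGROUP fgIndexes;
    ALTER DATABASE CacheDb ADD FILE (NAME = fgIndexes1, FILENAME = 'E:\SQLIndexes\fgIndexes1.ndf') TO FILEGROUP fgIndexes;

    -- Build the index on that filegroup so index I/O is separated from table I/O
    CREATE NONCLUSTERED INDEX IX_CachedData_FilterDate
        ON dbo.CachedData (FilterDate)
        ON fgIndexes;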
As I said, I'm looking for rough guidelines and realise no one may be able to fix this issue completely. We are a small team and no one has extensive SQL Server experience. The customer wants answers and we've tried everything we know. Incidentally, they had a small-scale version in Excel and said they found no issues, so why are we having them?!?
Hope someone can help.
We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
There is so much data arriving in the database tables, and so much data being read from them at such high velocity, that the system has started to fail.
The tables have indexes on them; one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other is a smaller one, with only a few million records, and is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading from and writing to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
The requirements and specifics are way too broad to give a concrete answer, but one suggestion would be to set up a second database and do log shipping over to it. The original DB would be the "write" database and the new DB would be the "read" database.
Cons
- Disk space
- The read DB would be out of date by the length of time the log transfer takes

Pros
- You could possibly drop some of the indexes on the "write" DB, which would/could increase performance
- You could then summarize the tables in the "read" database in order to increase query performance (see the sketch below)
https://msdn.microsoft.com/en-us/library/ms187103.aspx
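For the summarize-in-the-read-database idea, a minimal sketch; the table and column names are guesses at the fitness-tracker data, and the results go into a separate reporting database because a log-shipped standby copy is itself read-only:

    -- Rebuild a small aggregate table from the log-shipped "read" copy, e.g. on a schedule
    TRUNCATE TABLE Reporting.dbo.ReadingsHourlySummary;

    INSERT INTO Reporting.dbo.ReadingsHourlySummary (DeviceId, HourBucket, ReadingCount, AvgValue)
    SELECT DeviceId,
           DATEADD(HOUR, DATEDIFF(HOUR, 0, ReadingTime), 0) AS HourBucket,
           COUNT(*) AS ReadingCount,
           AVG(Value) AS AvgValue
    FROM ReadDb.dbo.Readings
    GROUP BY DeviceId, DATEADD(HOUR, DATEDIFF(HOUR, 0, ReadingTime), 0);

The dashboards would then query the small summary table instead of the billion-row raw table.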
Here are some ideas, some more complicated than others; their usefulness depends really heavily on the usage, which isn't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, one file for each core on your system; even if a query is being done entirely in memory, it can still block on the number of I/O threads) - see the first sketch after this list.
[Simple] Use the SIMPLE recovery model instead of FULL for the transaction log.
[Simple] Write the transaction log to a separate spindle from the rest of the data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try to put data which is not updated into a separate table, so indexes on the static data don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (an auto-incrementing PK/clustered index should already give you this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH (NOLOCK) on tables and remove row and page locks from indexes. You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading - see the second sketch after this list.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove transactional locks around inserts
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
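For the tempdb point above, a minimal sketch; the drive letter, file names and sizes are made up, and the idea is simply one data file per core, all sized the same:

    -- Extra tempdb data files so parallel workers don't queue on a single file
    ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);
    ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdb3.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);
    ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdb4.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);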
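And for the NOLOCK point, a minimal sketch of what a read query could look like; the table and columns are assumptions:

    -- Dirty read: takes no shared locks, at the cost of possibly seeing (or missing) in-flight rows
    SELECT DeviceId, COUNT(*) AS Readings
    FROM dbo.SensorData WITH (NOLOCK)
    WHERE ReadingTime >= DATEADD(HOUR, -1, GETUTCDATE())
    GROUP BY DeviceId;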
That's kind of a generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation... which is why DBAs have jobs. Some of the 'complicated' answers could be simple if your architecture already supports them.
I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8-hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, e.g. "How many transactions of type A did each user run this month?" (it could be more complex, but that's the general idea).
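For what it's worth, a summary like that is a single aggregate query; here's a minimal sketch in SQL against a hypothetical transactions table, just to pin down the shape of the workload:

    -- Monthly billing summary: transactions of type 'A' per user for March 2011
    SELECT UserId, COUNT(*) AS TransactionCount
    FROM Transactions
    WHERE TxType = 'A'
      AND TxTime >= '2011-03-01' AND TxTime < '2011-04-01'
    GROUP BY UserId;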
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar process at present (a web domain crawling database) with similarly significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the amount of CPU used per "write" and the speed at which the writes can be committed.
(Network performance is also often an early bottleneck.)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up to date the summaries need to be, and this is critical. I would try to push back from "real-time" and suggest overnight (or, if you can get away with it, monthly) summary calculations.
Finally, we're using a 2-CPU, 4 GB RAM Windows 2003 virtual SQL Server 2005 machine and a single-CPU, 1 GB RAM IIS web server as our test system, and we can ingest 20 million records in a 10-hour period (with the storage being RAID 5 on a shared SAN). We get ingest rates of up to 160 records per second, batched in blocks of 40 records per network round trip.
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
Greenplum or Teradata would be a good option. These databases are MPP (massively parallel processing) and can handle petabyte-scale data. Greenplum is a distributed PostgreSQL DB and also has its own MapReduce. Hadoop may solve your storage problem, but it wouldn't be helpful for performing summary queries on your data.
There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows in all the tables is in the millions - perhaps 100 million rows with all the various tables combined. The schema is more or less normalized. The project's designers have decided on SSIS to execute this, and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Deletes the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M-row transfer in SSIS to take 24 hours, or you can write it to take 5 months. The problem is what you write (i.e. the workflow, in the SSIS case), not SSIS itself.
There isn't anything specific to SSIS that would dictate that the transfer cannot be done faster than 5 months.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processes, batch-commit changes, don't transfer more data than necessary over the wire, use set-based processing as much as possible, be able to suspend and resume, etc.) can be implemented in SSIS just as well as in any other technology (if not better).
For the record, the ETL world speed record stands at about 2 TB per hour, using SSIS. And just as a matter of fact, I just finished a transfer of 130M rows, ~200 GB of data, which took some 24 hours (I'm lazy and not shooting for the ETL record).
I would understand 5 months for development, testing and deployment, but not 5 months for the actual processing. That is something like 7 rows a second, which is really, really lame.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for just deleting records there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches.
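A minimal sketch of that batched-delete approach; the table, column and batch size are assumptions made for illustration:

    -- Delete the archived customers' rows in small batches so locks and log growth stay manageable
    DECLARE @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        DELETE TOP (5000) FROM dbo.BillingTransactions
        WHERE CustomerId IN (SELECT CustomerId FROM dbo.ArchivedCustomers);

        SET @rows = @@ROWCOUNT;
    END

Run from a SQL Agent job, this keeps each transaction short and lets you stop and resume without losing much work.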