Where I'm at there is a main system that runs on a big AIX mainframe. To facility reporting and operations there is nightly dump from the mainframe into SQL Server, such that each of our 50-ish clients is in their own database with identical schemas. This dump takes about 7 hours to finish each night, and there's not really anything we can do about it: we're stuck with what the application vendor has provided.
After the dump into sql server we use that to run a number of other daily procedures. One of those procedures is to import data into a kind of management reporting sandbox table, which combines records from a particularly important table from across the different databases into one table that managers who don't know sql so can use to run ad-hoc reports without hosing up the rest of the system. This, again, is a business thing: the managers want it, and they have the power to see that we implement it.
The import process for this table takes a couple hours on it's own. It filters down about 40 million records spread across 50 databases into about 4 million records, and then indexes them on certain columns for searching. Even at a coupld hours it's still less than a third as long as the initial load, but we're running out of time for overnight processing, we don't control the mainframe dump, and we do control this. So I've been tasked with looking for ways to improve one the existing procedure.
Currently, the philosophy is that it's faster to load all the data from each client database and then index it afterwards in one step. Also, in the interest of avoiding bogging down other important systems in case it runs long, a couple of the larger clients are set to always run first (the main index on the table is by a clientid field). One other thing we're starting to do is load data from a few clients at a time in parallel, rather than each client sequentially.
So my question is, what would be the most efficient way to load this table? Are we right in thinking that indexing later is better? Or should we create the indexes before importing data? Should we be loading the table in index order, to avoid massive re-ordering of pages, rather than the big clients first? Could loading in parallel make things worse by causing to much disk access all at once or removing our ability to control the order? Any other ideas?
Update
Well, something is up. I was able to do some benchmarking during the day, and there is no difference at all in the load time whether the indexes are created at the beginning or at the end of the operation, but we save the time building the index itself (it of course builds nearly instantly with no data in the table).
I have worked with loading bulk sets of data in SQL Server quite a bit and did some performance testing on the Index on while inserting and the add it afterwards. I found that BY FAR it was much more efficient to create the index after all data was loaded. In our case it took 1 hour to load with the index added at the end, and 4 hours to add it with the index still on.
I think the key is to get the data moved as quick as possible, I am not sure if loading it in order really helps, do you have any stats on load time vs. index time? If you do, you could start to experiment a bit on that side of things.
Loading with the indexes dropped is better as a live index will generate several I/O's for every row in the database. 4 million rows is small enough that you would not expect to get a significant benefit from table partitioning.
You could get a performance win by using bcp to load the data into the staging area and running several tasks in parallel (SSIS will do this). Write a generic batch file wrapper for bcp that takes the file path (and table name if necessary) and invoke a series of jobs in half a dozen threads with 'Execute Process' tasks in SSIS. For 50 jobs it's probably not worth trying to write a data-driven load controller process. Wrap these tasks up in a sequence container so you don't have to maintain all of the dependencies explicitly.
You should definitely drop and re-create the indexes as this will greatly reduce the amount of I/O during the process.
If the 50 sources are being treated identically, try loading them into a common table or building a partitioned view over the staging tables.
Index at the end, yes. Also consider setting the log level setting to BULK LOGGED to minimize writes to the transaction log. Just remember to set it back to FULL after you've finished.
To the best of my knowledge, you are correct - it's much better to add the records all at once and then index once at the end.
Related
I currently have trouble updating SQLite database records at scale within a healthy amount of time.
I have a small database of about 70,000 records and I have some office personal using Navicat to filter the records and make some bulk edits to commit. When trying to perform an UPDATE on a large amount of records in one field everything comes to a crawl, when I look at the raw SQL query I can see that the program is using an UPDATE ..SET.. WHERE to perform the operations.
My question is what can I do to help this query run faster? I have a SQLite auto index on the column used for matching the record to update, but I have read and searched all over and have yet to see any kind of resolve other then what I already have in place. Updating one field in 20,000 records is literally taking close to 3 hours...regardless of using Navicat or not. The whole database is only 30mb so I have to be doing something wrong.
All other database operations run nice and speedy, not sure whats wrong and looking for some guidance from a SQLite vet.
There are very few reasons why SQLite can perform as poorly as you describe, but I've experienced pretty much the same thing.
First, if you are performing a table-wide update on an indexed column, you're gonna thrash the whole index, all while making the index less useless. Delete the index before you update, then recreate the index.
Additionally, you can get some serious perf boost by changing the synchronous PRAGMA, but it comes with a non-zero amount of extra risk of corrupt data on a crash (extremely unlikely in my experience)
PRAGMA schema.synchronous = 0
I have a problem where I need to load alot of data (5+ billion rows) into a database very quickly (ideally less than an 30 min but quicker is better), and I was recently suggested to look into postgresql (I failed with mysql and was looking at hbase/cassandra). My setup is I have a cluster (currently 8 servers) that generates alot of data, and I was thinking of running databases locally on each machine in the cluster it writes quickly locally and then at the end (or throughout the data generating) data is merged together. The data is not in any order so I don't care which specific server its on (as long as its eventually there).
My questions are , is there any good tutorials or places to learn about PostgreSQL auto sharding (I found results of firms like sykpe doing auto sharding but no tutorials, I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order I was going to use auto-incrementing ID number, will that cause a conflict if data is merged (this is not a big issue anymore)?
Update: Frank's idea below kind of eliminated the auto-incrementing conflict issue I was asking about. The question is basically now, how can I learn about auto sharding and would it support distributed uploads of data to multiple servers?
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
If you do need to insert directly into a relational database: That's (part of) what PgPool-II and (especeially) PgBouncer are for. Configure PgBouncer to load-balance across different nodes and you should be pretty much sorted.
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without battery backup RAID controller). This is the default if you do nothing special and if you don't wrap your INSERTs in a BEGIN and COMMIT.
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
Here are a few things that might help:
The DB on each server should have a small meta data table with that server's unique characteristics. Such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint ids (or UUID or the like). With bigints, you could allocate a generous range for each server, and set its sequence up to use it. E.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000 etc.
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
Use citus for PostgreSQL auto sharding. Also this link is helpful.
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
Load one eight of your data into a PG instance on each of the servers
For optimum load speed, don't use inserts but the COPY method
When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement to query all databases at once (or the right one to satisfy your query)
As already noted, keys might be an issue. Use non-overlapping sequences or uuids or sequence numbers with a string prefix, shouldn't be too hard to solve.
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent Postgresql version, you can try using unlogged tables which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
You could use mySQL - which supports auto-sharding across a cluster.
I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass deletes anything preexisting and mass inserts the fresh data. When a delta comes in, the system does a bunch of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.
Unfortunately the current solution isn't scaling very well. The transactions are so large and long running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail for how the deltas are applied, making it difficult to troubleshoot issues causing the local and remote versions of the dataset to get out of sync.
One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
This just leaves the issue of how exactly to compose a view of a dataset up to a given version from the initial and deltas. (Apple's TimeMachine does something similar, using hard links on the file system to create "view" of a certain point in time.)
Does anyone have experience solving this kind of problem or implementing this particular solution?
Thanks!
have one writer and several reader databases. You send the write to the one database, and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day. It is very scalable. You can even put a hardware router in front of all the read databases to load balance them.
Thanks to those who tried.
For anyone else who ends up here, I'm benchmarking a solution that adds a "dataset_version_id" and "dataset_version_verb" column to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when retrieving specific records. If the latest version of the record has dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
This approach has an average ~ 80% performance hit so far, which may be acceptable for our purposes.
I'm scraping a website (scripting responsibly by throttling my scraping and with permission) and I'm going to be gathering statistics on 300,000 users.
I plan on storing this data in a SQL Database, and I plan on scraping this data once a week. My question is, how often should I be doing inserts on the database as results come in from the scraper?
Is it best practice to wait till all results are in (keeping them all in memory), and insert them all when the scraping is finished? Or is it better to do an insert on every single result that comes in (coming in at a decent rate)? Or something in between?
If someone could point me in the right direction on how often/when I should be doing this I would appreciate it.
Also, would the answer change if I was storing these results in a flat file vs a database?
Thank you for your time!
You might get a performance increase by batching up several hundred, if your database supports inserting multiple rows per query (both MySQL and PostgreSQL do). You'll also probably get more performance by batching multiple inserts per transaction (except with transactionless databases, such as MySQL with MyISAM).
The benefits of the batching will rapidly fall as the batch size increases; you've already reduced the query/commit overhead 99% by the time you're doing 100 at a time. As you get larger, you will run into various limits (example: longest allowed query).
You also run into another large tradeoff: If your program dies, you'll lose any work you haven't yet save to the database. Losing 100 isn't so bad; you can probably redo that work in a minute or two. Losing 300,000 would take quite a while to redo.
Summary: Personally, I'd start with one value/one query, as it'll be the easiest to implement. If I found insert time was a bottleneck (very much doubt it, scrape will be so much slower), I'd move to 100 values/query.
PS: Since the site admin has given you permission, have you asked if you can just get a database dump of the relevant data? Would save a lot of work!
My preference is to write bulk data to the database every 1,000 rows, when I have to do it the way you're describing. It seems like a good volume. Not too much re-work if I do have a crash and need to re-generate some data (re-scraping in your case). But it's a good healthy bite that can reduce overhead.
As #derobert points out, wrapping a bunch of inserts in a transaction also helps reduce overhead. But don't put everything in a single transaction - some vendors of RDBMS like Oracle maintain a "redo log" during a transaction, so if you do too much work this can cause congestion. Breaking up the work into large, but not too large, chunks is best. I.e. 1,000 rows.
Some SQL implementations support multi-row INSERT (#derobert also mentions this) but some don't.
You're right that flushing raw data to a flat file and batch-loading it later is probably worthwhile. Each SQL vendor supports this kind of bulk-load differently, for instance LOAD DATA INFILE in MySQL or ".import" in SQLite, etc. You'll have to tell us what brand of SQL database you're using to get more specific instructions, but in my experience this kind of mechanism can be 10-20x the performance of INSERT even after improvements like using transactions and multi-row insert.
Re your comment, you might want to take a look at BULK INSERT in Microsoft SQL Server. I don't usually use Microsoft, so I don't have first-hand experience with it, but I assume it's a useful tool in your scenario.
I want to replicate data from a boat offshore to an onshore site. The connection is sometimes via a satellite link and can be slow and have a high latency.
Latency in our application is important, the people on-shore should have the data as soon as possible.
There is one table being replicated, consisting of an id, datetime and some binary data that may vary in length, usually < 50 bytes.
An application off-shore pushes data (hardware measurements) into the table constantly and we want these data on-shore as fast as possible.
Are there any tricks in MS SQL Server 2008 that can help to decrease the bandwith usage and decrease the latency? Initial testing uses a bandwidth of 100 kB/s.
Our alternative is to roll our own data transfer and initial prototyping here uses a bandwidth of 10 kB/s (while transferring the same data in the same timespan). This is without any reliability and integrity checks so this number is artificially low.
You can try out different replication profiles or create your own. Different profiles are optimized for different network/bandwidth scenarios.
MSDN talks about replication profiles here.
Have you considered getting a WAN accelerator appliance? I'm too new here to post a link, but there are several available.
Essentially, the appliance on the sending end compresses the outgoing data, and the receiving end decompresses it, all on the fly and completely invisibly. This has the benefit of increasing the apparent speed of the traffic and not requiring you to change your server configurations. It should be entirely transparent.
I'd suggest on the fly compression/decompression outside of SQL Server. That is, SQL replicates the data normally but something in the network stack compresses so it's much smaller and bandwidth efficient.
I don't know of anything but I'm sure these exist.
Don't mess around with the SQL files directly. That's madness if not impossible.
Do you expect it to always be only one table that is replicated? Are there many updates, or just inserts? The replication is implemented by calling an insert/update sproc on the destination for each changed row. One cheap optimization is to force the sproc name to be small. By default it is composed from the table name, but IIRC you can force a different sproc name for the article. Given an insert of around 58 bytes for a row, saving 5 or 10 characters in the sproc name is significant.
I would guess that if you update the binary field it is typically a whole replacement? If that is incorrect and you might change a small portion, you could roll your own diff patching mechanism. Perhaps a second table that contains a time series of byte changes to the originals. Sounds like a pain, but could have huge savings of bandwidth changes depending on your workload.
Are the inserts generally done in logical batches? If so, you could store a batch of inserts as one customized blob in a replicated table, and have a secondary process that unpacks them into the final table you want to work with. This would reduce the overhead of these small rows flowing through replication.