How can I minimize the data in a SQL replication - sql-server

I want to replicate data from a boat offshore to an onshore site. The connection is sometimes via a satellite link and can be slow and have a high latency.
Latency in our application is important, the people on-shore should have the data as soon as possible.
There is one table being replicated, consisting of an id, datetime and some binary data that may vary in length, usually < 50 bytes.
An application off-shore pushes data (hardware measurements) into the table constantly and we want these data on-shore as fast as possible.
Are there any tricks in MS SQL Server 2008 that can help to decrease the bandwith usage and decrease the latency? Initial testing uses a bandwidth of 100 kB/s.
Our alternative is to roll our own data transfer and initial prototyping here uses a bandwidth of 10 kB/s (while transferring the same data in the same timespan). This is without any reliability and integrity checks so this number is artificially low.

You can try out different replication profiles or create your own. Different profiles are optimized for different network/bandwidth scenarios.
MSDN talks about replication profiles here.

Have you considered getting a WAN accelerator appliance? I'm too new here to post a link, but there are several available.
Essentially, the appliance on the sending end compresses the outgoing data, and the receiving end decompresses it, all on the fly and completely invisibly. This has the benefit of increasing the apparent speed of the traffic and not requiring you to change your server configurations. It should be entirely transparent.

I'd suggest on the fly compression/decompression outside of SQL Server. That is, SQL replicates the data normally but something in the network stack compresses so it's much smaller and bandwidth efficient.
I don't know of anything but I'm sure these exist.
Don't mess around with the SQL files directly. That's madness if not impossible.

Do you expect it to always be only one table that is replicated? Are there many updates, or just inserts? The replication is implemented by calling an insert/update sproc on the destination for each changed row. One cheap optimization is to force the sproc name to be small. By default it is composed from the table name, but IIRC you can force a different sproc name for the article. Given an insert of around 58 bytes for a row, saving 5 or 10 characters in the sproc name is significant.
I would guess that if you update the binary field it is typically a whole replacement? If that is incorrect and you might change a small portion, you could roll your own diff patching mechanism. Perhaps a second table that contains a time series of byte changes to the originals. Sounds like a pain, but could have huge savings of bandwidth changes depending on your workload.
Are the inserts generally done in logical batches? If so, you could store a batch of inserts as one customized blob in a replicated table, and have a secondary process that unpacks them into the final table you want to work with. This would reduce the overhead of these small rows flowing through replication.

Related

Fill Factor 80 vs 100

I am rebuilding some indexes in Azure SQL using a fill factor of 80 (recommended by the company who developed the application, who are not experts on the database) and after doing this queries got a LOT slower. We noticed that now they were taking a longer time in "Network I/O". Does anybody know what the problem might be?
Fillfactor is not a silver bullet and has it's tradeoffs. https://www.mssqltips.com/sqlservertip/5908/what-is-the-best-value-for-fill-factor-in-sql-server/
It is important to note which effect the lower fillfactor value has on the underlying data pages and index pages, which comprise your table:
There is now 20% more storage allocated for data pages for the same number of records!
This causes increased I/O. Depending on your Azure storage/compute plan you may be hitting a ceiling and need to bump up you IOPS.
Now, if you are not running out of IOPS, there's more to look into. Is it possible that the index rebuild operation had not completed yet and index is not being used for query optimization? A Profiler/Execution plan can confirm this.
I'd say that if you have a very large table and want to speed things up dramatically, your best bet is partitioning on the column most commonly used to address the data.
See also: https://www.sqlshack.com/database-table-partitioning-sql-server/
Try to identify queries returning large data sets to client hosts. Large result sets may lead to unnecessary network utilization and client application processing. Make sure queries are returning only what is needed using filtering and aggregations, and make sure no duplicates are being returned unnecesarily.
Another possible cause of that wait on Azure SQL may be the client application doesn't fetch results fast enough and doesn't notify Azure SQL that the result set has been received. On the client appliation side, store the results in memory first and only then doing more processing. Make sure the lient application is not under stress and that makes it unable to get the results faster.
One last thing, make sure Azure SQL and the appliation are on the same region, and there is not transfer of data over different regions or zones.

How expensive is access to database? How often do we access to it?

I'm about to write an application for Android, and it will use Mysql.
I know that access to DB is really expensive in terms of time, and would like to know how often do applications like instant messaging, online gaming access to databases?
For example in a game, we would like to save the positions of a player in the world, when he's moving all the time.
Is the database access actually not expensive, and there is a way to be connected to it all the time and just do request that are actually not expensive?
Or is IT really expensive in anyway, and there are techniques to access to it for example every X interval of time, and saving it locally in the meantime?
I Know that my question is really general, and it depends always on what we need and want.
My question came out because i made a really simple login application that connects and does 1 request to database, and it takes 1 second (a lot!!) to get the result, so how online applications can be so fast?
Thank you
Before answering this I would recommend simulating the process as much as possible, benchmarking and you can work towards the best solution for your use case.
e.g. If I have an application submitting data to a database simulate the submission so I can easily run multiple submissions at the same time and see what the bottle neck is...and see how it compares when I using caching, replication, indexes, etc.
Also reading company blogs can be helpful as they often share success stories that support the usage of a particular approach
How expensive is access to database?
Accessing a database can be a pretty quick operation
SELECT 1; // 0.005 Secs :D
However there are situations that can lead to poor performance (slow reads, writes and updates) but there are some relatively simple ways to combat this
Indexes
The best way to improve the performance of SELECT operations is to
create indexes on one or more of the columns that are tested in the
query. The index entries act like pointers to the table rows, allowing
the query to quickly determine which rows match a condition in the
WHERE clause, and retrieve the other column values for those rows.
Replication
spreading the load among multiple slaves to improve performance. In
this environment, all writes and updates must take place on the master
server. Reads, however, may take place on one or more slaves. This
model can improve the performance of writes (since the master is
dedicated to updates), while dramatically increasing read speed across
an increasing number of slaves.
How often do we access to it?
If you are solely using a database you will access it every time you n position and every time you need to find out their position.
This is where you would explore options to prevent accessing the database.
Memory caches such as redis or memcache
Replication - Only read from slaves
It depends on your design and requirement.
1) Most of the applications manage Connection Pools to minimize the initialization time.
2) Most of the ORM frameworks have external Cache to improve the reading performance. So if you do heavy data reading in your application then don't worry about storing it in locally. The Cache will be effective in this case.
3) When you store locally either in File (or) some format, then it will also add extra performance delay.
4) If you keep the data in primary memory, then obviously Game performance would be better. That's why Gamers prefer high end graphics card, and huge RAM.
For most databases there is the option of batch insertions. Obviously even a small overhead will accumulate if you have to many connections over time. And performing single insertions will have a greater overhead than on batch. The only issue is how often?.... And you should test how often you wan't to insert and how much information you should store locally before doing a batch insertion.

Auto sharding postgresql?

I have a problem where I need to load alot of data (5+ billion rows) into a database very quickly (ideally less than an 30 min but quicker is better), and I was recently suggested to look into postgresql (I failed with mysql and was looking at hbase/cassandra). My setup is I have a cluster (currently 8 servers) that generates alot of data, and I was thinking of running databases locally on each machine in the cluster it writes quickly locally and then at the end (or throughout the data generating) data is merged together. The data is not in any order so I don't care which specific server its on (as long as its eventually there).
My questions are , is there any good tutorials or places to learn about PostgreSQL auto sharding (I found results of firms like sykpe doing auto sharding but no tutorials, I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order I was going to use auto-incrementing ID number, will that cause a conflict if data is merged (this is not a big issue anymore)?
Update: Frank's idea below kind of eliminated the auto-incrementing conflict issue I was asking about. The question is basically now, how can I learn about auto sharding and would it support distributed uploads of data to multiple servers?
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
If you do need to insert directly into a relational database: That's (part of) what PgPool-II and (especeially) PgBouncer are for. Configure PgBouncer to load-balance across different nodes and you should be pretty much sorted.
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without battery backup RAID controller). This is the default if you do nothing special and if you don't wrap your INSERTs in a BEGIN and COMMIT.
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
Here are a few things that might help:
The DB on each server should have a small meta data table with that server's unique characteristics. Such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint ids (or UUID or the like). With bigints, you could allocate a generous range for each server, and set its sequence up to use it. E.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000 etc.
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
Use citus for PostgreSQL auto sharding. Also this link is helpful.
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
Load one eight of your data into a PG instance on each of the servers
For optimum load speed, don't use inserts but the COPY method
When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement to query all databases at once (or the right one to satisfy your query)
As already noted, keys might be an issue. Use non-overlapping sequences or uuids or sequence numbers with a string prefix, shouldn't be too hard to solve.
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent Postgresql version, you can try using unlogged tables which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
You could use mySQL - which supports auto-sharding across a cluster.

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

I'm scraping a website (scripting responsibly by throttling my scraping and with permission) and I'm going to be gathering statistics on 300,000 users.
I plan on storing this data in a SQL Database, and I plan on scraping this data once a week. My question is, how often should I be doing inserts on the database as results come in from the scraper?
Is it best practice to wait till all results are in (keeping them all in memory), and insert them all when the scraping is finished? Or is it better to do an insert on every single result that comes in (coming in at a decent rate)? Or something in between?
If someone could point me in the right direction on how often/when I should be doing this I would appreciate it.
Also, would the answer change if I was storing these results in a flat file vs a database?
Thank you for your time!
You might get a performance increase by batching up several hundred, if your database supports inserting multiple rows per query (both MySQL and PostgreSQL do). You'll also probably get more performance by batching multiple inserts per transaction (except with transactionless databases, such as MySQL with MyISAM).
The benefits of the batching will rapidly fall as the batch size increases; you've already reduced the query/commit overhead 99% by the time you're doing 100 at a time. As you get larger, you will run into various limits (example: longest allowed query).
You also run into another large tradeoff: If your program dies, you'll lose any work you haven't yet save to the database. Losing 100 isn't so bad; you can probably redo that work in a minute or two. Losing 300,000 would take quite a while to redo.
Summary: Personally, I'd start with one value/one query, as it'll be the easiest to implement. If I found insert time was a bottleneck (very much doubt it, scrape will be so much slower), I'd move to 100 values/query.
PS: Since the site admin has given you permission, have you asked if you can just get a database dump of the relevant data? Would save a lot of work!
My preference is to write bulk data to the database every 1,000 rows, when I have to do it the way you're describing. It seems like a good volume. Not too much re-work if I do have a crash and need to re-generate some data (re-scraping in your case). But it's a good healthy bite that can reduce overhead.
As #derobert points out, wrapping a bunch of inserts in a transaction also helps reduce overhead. But don't put everything in a single transaction - some vendors of RDBMS like Oracle maintain a "redo log" during a transaction, so if you do too much work this can cause congestion. Breaking up the work into large, but not too large, chunks is best. I.e. 1,000 rows.
Some SQL implementations support multi-row INSERT (#derobert also mentions this) but some don't.
You're right that flushing raw data to a flat file and batch-loading it later is probably worthwhile. Each SQL vendor supports this kind of bulk-load differently, for instance LOAD DATA INFILE in MySQL or ".import" in SQLite, etc. You'll have to tell us what brand of SQL database you're using to get more specific instructions, but in my experience this kind of mechanism can be 10-20x the performance of INSERT even after improvements like using transactions and multi-row insert.
Re your comment, you might want to take a look at BULK INSERT in Microsoft SQL Server. I don't usually use Microsoft, so I don't have first-hand experience with it, but I assume it's a useful tool in your scenario.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace
When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Resources