How to sync SQL Server data to MongoDB

How to sync SQL Server data to MongoDB - sql-server

I have a table in SQL Server, and I have to sync the data to MongoDB for r/w separation. The table was inserted/updated/deleted so often that the SQL Server was not able to be stopped.
Is there an experienced way to implement that?

A proper answer to that is probably going to be beyond the scope of the question as posed, but things like workload (lots of data? heavy volume? network toplogy), SLAs (how fast, how often, how foolproof, can you have down time?) and synchronicity (See Brewers Cap Theorem), will dramatically change what is or isnt reasonable as an approach.
A naive answer might be as simple as "write the data to a csv once a day, then feed those into mongo via a script".
On the other side of the coin, there are ETL tools and libraries out there which I'm sure specialize in moving lots and lots of data as quickly as possible between your two storage engines. In fact I believe there's an ODBC driver for mongo which is a standard you can find in products from microsoft, oracle, open source, and everything in between.
A happy middle ground might just be a lightweight application or a script whos job it is to fire off documents one by one (or in batches) off of a queue till its empty. App reads sql, writes mongo, and handles any distributed transaction logic you may wish to impose.

Related

SQL Server table or flat text file for infrequently changed data

I'm building a .net web app that will involve validating an email input field against a list of acceptable email addresses. There could be up to 10,000 acceptable values and they will not change very often. When they do, the entire list would be replaced, not individual entries.
I'm debating the best way to implement this. We have a SQL Server database but since these records will be relatively static and only replaced in bulk, I'm considering just referencing / searching text files containing the string values. Seems like that would make the upload process easier and there is little benefit to having this info in an rdbms.
Feedback appreciated.

If the database is already there then use it. What you are talking about is exactly what databases are designed to do. If down the road you decided you need do something slightly more complex you will be very glad you went with the DB.

I'm going to make a few assumptions about your situation.
Your data set contains more than 1 column of usable data.
Your record set contains more rows than you will always display.
Your data will need to be formatted into some kind of output view (e.g. HTML).
Here are some specific reasons why your semi-static data should stay in SQL and not in a text file.
You will not have to parse the text data every time you wish to read and process it. (String parsing is a relatively heavy memory and CPU load). SQL will store your columns as structured data which are pre-parsed.
You will not have to develop your own row-filtering or searching algorithm (or implement a library that does it for you). SQL is already a sophisticated engine that applies advanced algorithms of caching, query optimization (with many underlying advanced algorithms for seek/search/scan/index/hash/etc.)
You have the option with SQL to expand your solution's robustness and integration with other tools over time. (Putting the data into a text or XML file will limit future possibilities).
Caveats:
Your particular SQL Server implementation can be affected by disk IO and network latency performance. To tune disk IO performance, a well constructed SQL Server places the data files (.mdf) on fast multi-spindle disk arrays tuned for fast reads, and separates them from the Log Files (.log) on spindles that are tuned for fast writes.
You might find that the latency and busy-ness (not a word) of the SQL server can affect performance. If you're running on a very busy or slow SQL server, you might be in a situation where you'd look to a local-file alternative, in which case I would recommend a structured format such as XML. (However, if you find yourself looking for work-arounds to avoid using your SQL server, it would probably be best to invest some time/money into improving your SQL implementation.

Reliable alternative to replication for continous data sync between two databases

I have one central database and 25 client databases and all have same schema.
I want that whenever some changes are done in some tables of the central database then these changes flow down to the client database.
The databases used is SQL Express so I cannot use replication.
The solution that I have today is to make keep track of the changes in the central database and then a program makes a text file with these changes and sends them down to the client databases.Another program reads these text files and updates the client database.
There are three problems with this:-
1. The files get lost or arrive in jumbled order which messes up the client data
2. the process is slow
3. the programs are sometimes shutdown so the whole sync flow gets stopped.
Is there a reliable alternative that is fast and secure ?
I wonder how banking software are made ...they never lose transactions and they are fast.

Add an UpdateDate column to all the entities that need to be replicated. At each client add a linked server to the central repository. Now, every 5 minutes or so, poll your central repository for changes using the last UpdateDate of a client entity and grab the delta.
Then use merge or insert and update to merge data on the client. That's a very reliable way of doing homebrew replication. To keep track of deleted elements you would either want to mark them as deleted or have another table to keep track of entity kind and its reference, again combined with UpdateDate for replication.
Update
Then you mention transactions and banking software. When you do your replication via files, we ain't talkin' about no transactional replication here, not by a long shot.
If you need transactional consistency you need to subscribe to the transaction flow of the data warehouse.

I don't want to be unhelpful and you haven't given any background about your business needs, but you have to decide if your priority is really "fast and secure" or if it's actually "cheap". Replicating changes between multiple databases in a reliable, consistent way is not easy (as you know) and it's highly unlikely that you will be able to develop a solution yourself that has the features, stability and performance of SQL Server replication.
SQL Express can be a replication subscriber, by the way, so it's not clear why it doesn't meet your needs. But if it doesn't, you should estimate the cost to your business (or customer) of dealing with issues caused by an unreliable solution: your time, business downtime, finding and correcting incorrect data, customer complaints, lost business etc. Then compare that to the cost of 25 SQL Server licenses (you should certainly be able to get a good discount when you order that volume), additional hardware (if any) and the costs of training, consulting and/or learning how to use replication. Then extrapolate those costs over 5 years or so. You may find that it's cheaper just to buy the solution you need. And of course buying the full SQL Server edition means you get a lot of other new features that might be useful to you.
If you (or your boss) is really determined to get something for nothing, you might want to investigate PostgreSQL or MySQL. They both have free replication solutions that seem to be widely enough used to be reliable for many companies. Of course, you then need to calculate the costs of switching to a new database platform.

If you have one central database and 25 clients, you can easily do it with one (yes only one) SQL server licence for the main database. Subscribers to this database can run SQL express. As long as users access the the client databases, you are not even obliged to buy SQL CALs.
Back to banking software, be sure that they are paying good money for their server licenses! So don't be surprised if these are reliable and fast ...

Databases Software for Rapid Queries

I'm writing a Comet application that has to keep track of each open connection to the server. I want to write an entry to the database for each connection, and I will have to search the database for the proper connections every time the application receives new data (often), which is why I don't want to start off on the wrong foot by choosing slow database software. Any suggestions for a database that favors rapid, small pieces of data (rather than occasional large pieces of data)?

I suggest rather using a server platform that allows the creation of persistent servers, that keep all such info in the memory. Thus all database access will be limited to writing (if you want to actually save any information permanently), which usually is signifficantly less in typical Comet-apps (such as chats/games).
Databases are not made to keep such data. Accessing a database directly always means composing query strings, often sending them to a db server (sometimes even over the network), db lookup, serialization of the results, sending back, deserialization and traversing the fetched results. There is no way this can be even nearly as fast as just retrieving a value from memory.
If you really want to stick with PHP, then I suggest you have a look at memcached and similar caching servers.
greetz
back2dos

SQL Server 2008 has a FileStream data type that can be used for rapid, small pieces of data. McLaren Electronic Systems uses it to capture and analyze telemetry/sensor data from Formula One race cars.

Hypersonic: http://hsqldb.org/
MySQL (for webapps)

Would it ever be wise to have a SQL server per web server?

I'm wondering if, under the circumstances that
You get lots more reads than writes
Your SQL server of choice is cheap/free and offers a fast mirroring/replication service
Your database isn't insanely large
rather than having separate SQL servers it would be better to have an instance of SQL on each machine getting instant updates from the master. This way there would be no network latency when doing all the read queries, but there would be a per box performance hit as the SQL instance has to execute. Would this be better overall for performance? Are there any other pros/cons that might come up?

Your SQL Server should always be on a different box to the webserver, of that there is no question.
How many DB servers and webservers you have, and how they mirror (or otherwise) is up to how you scale your application.
You have SQL Server on a different machine because it needs (and deserves) a lot of RAM.

It's quite a common architectural pattern to have read-only replicas of a database. We accept some degree of stalesness in them, perhaps they are even only updated once a day.
The general rule will be that multiple copies will introduce complexity in terms of operations and management and tend to introduce the possibilities of inconsistency of data - almost inevitably the copies will not be perfectly is step (or the costs of making them soo will be too high.)
An example: what happens if your replication processing breaks a bit. So that some, but not all copies become stale. Now your users start to see radically different views of the world. How much might that matter to you? If it's a site with low value data (eg. celebrity sightings in London suberbs) then perhaps that's fine. If it's on hand inventory, and being out of date means that your customers can't place orders, then maybe you care rather more.
My advice: things that sound simple at a boxed on paper sort of level don't always work out that way when you're sitting in an operations room at 3AM. Be very sure that you can easily operate your solution.

How would your SQL Server be cheap/free? I should have said the licensing costs for this setup would be crippling. At retail prices you're looking at $6000 per server. See also Jeff's comments about costs. Scale out the web servers by all means, but not your SQL Server until it's pretty much on its' knees.
You might instead want to think about a distributed cache like Velocity or NCache.
Either way, run your site first with one SQL server and see how it copes with the load, then think about mirroring/replication across servers, otherwise you're just optimising prematurely. Measure first!

An immediate con is that there is no distributed lock co-ordinator in SQL Server so you can get merge conflicts as updates can change the same row on two different servers at the same time.
Depending on the size of the database and the disks in the web servers, you will find your network latency is smaller than the disk latency you will start suffering as the web server disks will not usually be as performant as the disk array you give to the database. If you wanted that kind of performance, you would be buying it per web server.
Replication performance is not without latency either, the distribution of the transactions isn't 'free' and careful maintenance of the transaction log would have to be planned to ensure you did not get log fragmentation (too many vlog's wthin the transaction log) which kills replication performance.

What is a good SQL Server 2008 solution for handling massive writes so that they don't slow down reads for users of the database?

We have large SQL Server 2008 databases. Very often we'll have to run massive data imports into the databases that take a couple hours. During that time everyone else's read and small write speeds slow down a ton.
I'm looking for a solution where maybe we setup one database server that is used for bulk writing and then two other database servers that are setup to be read and maybe have small writes made to them. The goal is to maintain fast small reads and writes while the bulk changes are running.
Does anyone have an idea of a good way to accomplish this using SQL Server 2008?

Paul. There's two parts to your question.
First, why are writes slow?
When you say you have large databases, you may want to clarify that with some numbers. The Microsoft teams have demonstrated multi-terabyte loads in less than an hour, but of course they're using high-end gear and specialized data warehousing techniques. I've been involved with data warehousing teams that regularly loaded so much data overnight that the transaction log drives had to be over a terabyte just to handle the quick bursts, but not a terabyte per hour.
To find out why writes are slow, you'll want to compare your load methods to data warehousing techniques. For example, have you tried using staging tables? Table partitioning? Data and log files on different arrays? If you're not sure where to start, check out my Perfmon tutorial to measure your system looking for bottlenecks:
http://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/
Second, how do you scale out?
You asked how to set up multiple database servers so that one handles the bulk load while others handle reads and some writes. I would heavily, heavily caution against taking the multiple-servers-for-writes approach because it gets a lot more complicated quickly, but using multiple servers for reads is not uncommon.
The easiest way to do it is with log shipping: every X minutes, the primary server takes a transaction log backup and then that log backup is applied to the read-only reporting server. There's some catches with this - the data is a little behind, and the restore process has to kick all connections out of the database to apply the restore. This can be a perfectly acceptable solution for things like data warehouses, where the end users want to keep running their own reports while the new day's data loads. You can simply not do transaction log restores while the data warehouse is loading, and the users can maintain connections the whole time.
To help find out what solution is right for you, consider adding the following to your question:
The size of your database (GB/TB in size, # of millions of rows in the largest table that's having the writes)
The size of your server & storage (a box with 10 drives has different solutions available than a box hooked up to a SAN)
The method of loading data (is it single-record inserts, are you using bulk load, are you using table partitioning, etc)

Why not use MemCached to eliminate the reads, I've got the same situation where I work and we've been using memcached on Windows with great results. I was supprised how trivial it was to get my code running with it too. There are open-source wrapping libraries for virtually every mainstream language, and using it could result in 99% of your reads, not even touching the database (becasue you set the memcache values on the write operation of the database).
Memcached, is really just a giant hash table store (and can even be clustered or run on any machine you like since it uses sockets to read and store the hashes).
When reading the memcached value, simply check if its null (return if its not) or do your ussual database read and return. It can store just about everything, so long as each memcached key/value pair is less than 1MB.

The easiest way would be to slow down the rate at which writes occur, and feed them in one record at a time. They'll be slower, but it would make things faster for users. If the batches take "a couple hours", you perhaps can spread them out more.

This is just an idea. Create a view over your "active" tables. Then BCP in the data into a "staging" table. When it is done, update the view to include the "staging" tables. Just an idea.

I'm not sure what you mean when you say everyone else's read and write slows down. Does it slow down when they read & write to the same database where the data is currently being imported or from different databases on the same server?
If it is the same database, you could always use the "with (nolock)" hint to do the reads even when the table is locked for writes/inserts. However, please be aware that the reads can be dirty reads. I am not sure how you can do faster quick writes when the table is locked because a write is already in progress. You can keep the transaction small to make the writes faster and release the locks. The other option is to have a separate database for bulk inserts and another database for reading.