DB Sharding: Data migration to another machine

I have some data that is sharded and kept in DB1, DB2 and DB3 hosted on Machine1. Now, to scale the system, I need to move shard DB1 from Machine1 to Machine2. Once the move is complete, all requests to shard DB1 will be routed to Machine2.
Assume that we have reads, writes and updates coming to DB1 all the time. How can I do the migration without any downtime to read/write/update?
We can make DB1 readonly during the migration window and copy the data to Machine2. Once copy is complete, we can route traffic to Machine2 and allow writes.
But what if we want to do the same while writes are also happening?

After some research I found a couple of ways to do this.
Solution 1
For the purpose of copying the shard to another physical machine, break it into several small segments. Trigger a script that copies the segments one by one from M1 to M2. While the copy is happening, replicate incoming writes to both M1 and M2.
1. All newly written rows are written to the new copy as well.
2. Any updates to segments which have not been copied yet are taken care of when that segment is copied to M2 later.
3. Updates to already-copied segments are applied to M2 as well, because M2 also has that segment.
4. Writes to a particular segment are blocked while that segment is being copied. Once the copy is complete, the write completes and the situation is the same as #3 above.
5. Once all the segments are copied successfully, stop writes to M1. At this point, DB1 on M1 is stale and can be deleted safely.
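A minimal sketch of Solution 1's write path, under the assumption of a segment-level routing function and a bulk-copy helper (copy_segment, segment_of and the m1/m2 handles are all hypothetical, not a real driver API):

# Illustrative sketch of Solution 1 (dual writes during the copy).
import threading

class RetryLater(Exception):
    """Raised when a write targets the segment currently being copied."""

copied = set()                  # segments already present on M2
in_flight = None                # segment currently being copied
lock = threading.Lock()

def migrate(segments, m1, m2):
    """Copy the shard segment by segment from M1 to M2."""
    global in_flight
    for seg in segments:
        with lock:
            in_flight = seg             # writes to this segment now block
        copy_segment(m1, m2, seg)       # hypothetical bulk-copy helper
        with lock:
            copied.add(seg)
            in_flight = None

def write(row, m1, m2):
    """Replicate incoming writes to both machines while migrating."""
    seg = segment_of(row)               # hypothetical routing function
    with lock:
        if seg == in_flight:
            raise RetryLater()          # #4: retry once the copy finishes
        m1.write(row)                   # M1 still owns the shard
        if seg in copied:
            m2.write(row)               # #3: keep already-copied segments fresh
        # not-yet-copied segments need nothing extra: #2 picks them up later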
Solution 2
1. As before, break the shard into several smaller segments.
2. Schedule a script to copy the segments one by one to M2.
3. Any updates to segments which have not been copied yet are taken care of when that segment is copied to M2 later.
4. When there is an update to a segment that has already been copied, mark that segment as dirty so it is copied again.
5. Newly created segments are marked as dirty by default.
6. Once the first pass is complete, start another pass to copy all the dirty segments again.
7. Repeat the passes until the number of dirty segments is below a certain threshold. At that point, queue incoming writes (which increases write latency), copy the remaining dirty segments, apply the queued writes to the new machine, change the configuration to route writes to M2, and start accepting writes again.
I feel Solution 2 is better because it doesn't write to two places and hence client write requests will be faster.
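For comparison, a sketch of Solution 2's dirty-segment passes, again with hypothetical helpers (copy_segment, queue_incoming_writes, apply_queued_writes, route_shard_to) standing in for real tooling:

# Illustrative sketch of Solution 2 (repeated passes over dirty segments).
THRESHOLD = 10                          # acceptable size of the final pass

dirty = set()                           # segments touched since their last copy

def on_write(seg):
    # Write-path hook: any updated or newly created segment becomes dirty.
    dirty.add(seg)

def migrate(segments, m1, m2):
    to_copy = set(segments)             # first pass copies everything
    while to_copy:
        for seg in to_copy:
            dirty.discard(seg)          # clean unless a write hits it again
            copy_segment(m1, m2, seg)   # hypothetical bulk-copy helper
        if len(dirty) <= THRESHOLD:
            break                       # few enough left for a short pause
        to_copy = set(dirty)            # next pass: only what got re-dirtied
    queue_incoming_writes()             # brief pause; write latency goes up
    for seg in set(dirty):              # copy the last few dirty segments
        copy_segment(m1, m2, seg)
    apply_queued_writes(m2)             # replay the queued writes on M2
    route_shard_to(m2)                  # DB1 traffic now goes to Machine2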

Related

How exactly does streaming data to PostgreSQL through STDIN work?

Let's say I am using COPY to stream data into my database.
COPY some_table FROM STDIN
I noticed that AFTER the stream had finished, the database needs a significant amount of time to process this data and insert the rows into the table. In pgAdmin's monitoring I can see that there are nearly 0 table writes throughout the streaming process, and then suddenly everything is written in one peak.
Some statistics:
I am inserting 450k rows into one table without indexes or keys,
table has 28 fields,
I am sending all NULLs to every field
I am worried that there are problems with my implementation of streams. Is this how streaming works? Is the database waiting to gather all the text and then execute one gigantic command?
COPY inserts the rows as they are sent, so the data are really streamed. But PostgreSQL doesn't write them to disk immediately: rather, it only writes transaction log (WAL) information to disk, and the actual rows are written to the shared memory cache. The data are persisted later, during the next checkpoint. There is a delay between the start of COPY and actual writing to disk, which could explain what you observe.
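For illustration, here is a minimal sketch of streaming rows this way with psycopg2 (the connection string, table name and column count are assumptions): the rows are sent and inserted as they are produced, but the table files only catch up at the next checkpoint.

# Minimal sketch: stream 450k all-NULL rows with COPY ... FROM STDIN via psycopg2.
import io
import psycopg2

conn = psycopg2.connect("dbname=test")      # connection string is an assumption
buf = io.StringIO()
for _ in range(450_000):
    # 28 empty fields -> 28 NULLs in the default COPY text format (\N marks NULL)
    buf.write("\t".join([r"\N"] * 28) + "\n")
buf.seek(0)

with conn, conn.cursor() as cur:
    # Rows are inserted as they arrive; WAL is written first, and the dirty
    # pages are only flushed to the table files at the next checkpoint.
    cur.copy_expert("COPY some_table FROM STDIN", buf)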
The monitoring charts provided in pgAdmin are not fit for the purpose you are putting them to. Most of that data is coming from the stats collector, and that is generally only updated once per statement or transaction, at the end of the statement or transaction. So they are pretty much useless for monitoring rare, large, ongoing operations.
The type of monitoring you want to do is best done with OS tools, like top, vmstat, or sar.

Why aren't read replica databases just as slow as the main database? Do they not suffer the same "write burden" as they must be in sync?

My understanding: a read replica database exists to allow read volumes to scale.
So far, so good, lots of copies to read from - ok, that makes sense, share the volume of reads between a bunch of copies.
However, the things I'm reading seem to imply "tada! magic fast copies!". How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
Good question.
First, the writes to the replicas may be more efficient than the writes to the primary if the replicas are maintained by replaying the Write-Ahead Logs into the secondaries (sometimes called a "physical replica"), instead of replaying the queries into the secondaries (sometimes called a "logical replica"). A physical replica doesn't need to do any query processing to stay in sync, and may not need to read the target database blocks/pages into memory in order to apply the changes, leaving more of the memory and CPU free to process read requests.
Even a logical replica might be able to apply changes more cheaply than the primary, since a query on the primary of the form
update t set status = 'a' where status = 'b'
might get replicated as a series of
update t set status = 'a' where id = ?
saving the replica from having to identify which rows to update.
Second, the secondaries allow the read workload to scale across more physical resources. So total read workload is spread across more servers.

Fetching and saving to database - Time vs Memory

Suppose I have a long list of URLs. Now, I need to write a script to do the following -
Go to each of the URLs
Get the data returned
And store it in a database
I know two ways of doing this -
Pop one URL from the list, download the data and save it in the database. Pop the next URL, download data and save it in the db, and repeat...
This will require way too many disk writes, so the other way is to
Download the data from each of the URLs and save it to the memory. And finally, save it all to the database in one disk write.
But this will require carrying a huge chunk of data in memory, so there's a possibility that the program may just terminate with an OOM error.
Is there any other way, which is some kind of intermediate between these methods?
(In particular, I am writing this script in Julia and using MongoDB)
We can extend Trifon's solution a little bit with concurrency. You could simultaneously run two threads:
A thread which fetches the data from the URLs, and stores them in a channel in the memory.
A thread which reads from the channel and writes the data to the disk.
Make sure that the channel has some bounded capacity, so that Thread 1 is blocked in case there are too many consecutive channel writes without Thread 2 consuming them.
Julia is supposed to have good support for parallel computing.
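A minimal sketch of that producer/consumer idea, written here in Python with queue.Queue standing in for a bounded Julia Channel (the URL list and the save_to_db helper are made up):

# Thread 1 fetches pages into a bounded queue; the main thread drains it to the DB.
import queue
import threading
import urllib.request

urls = ["https://example.com/a", "https://example.com/b"]   # hypothetical list
channel = queue.Queue(maxsize=100)   # bounded: the fetcher blocks when it is full
DONE = object()                      # sentinel to signal end of the stream

def fetcher():
    for url in urls:
        data = urllib.request.urlopen(url).read()
        channel.put((url, data))     # blocks if the writer has fallen behind
    channel.put(DONE)

def writer():
    while True:
        item = channel.get()
        if item is DONE:
            break
        url, data = item
        save_to_db(url, data)        # hypothetical helper that does the insert

threading.Thread(target=fetcher).start()
writer()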
Write results to the database in batches, say every 1000 URLs.
This solution is something between 1 & 2 of the two ways you are describing above.
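And a sketch of the batching approach with pymongo (the database, collection and URL names are made up); each insert_many is one round trip instead of a thousand single-document writes:

# Accumulate documents in memory and flush to MongoDB every 1000 URLs.
import urllib.request
from pymongo import MongoClient

urls = ["https://example.com/a", "https://example.com/b"]      # hypothetical
coll = MongoClient()["scrape"]["pages"]                        # assumed names

batch = []
for url in urls:
    body = urllib.request.urlopen(url).read()
    batch.append({"url": url, "body": body})
    if len(batch) >= 1000:
        coll.insert_many(batch)        # one bulk write instead of 1000 inserts
        batch = []
if batch:                              # flush whatever is left over
    coll.insert_many(batch)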

SQL Server Hekaton Reclaim Memory Used by Dropped Memory-Optimized Tables

I'm using
Microsoft SQL Server Enterprise: Core-based Licensing (64-bit) Version
12.0.4100.1
I have code which creates many non-durable memory-optimized tables, uses them for some things, and drops them when it doesn't need them anymore.
However, it seems as though the dropped tables are still consuming RAM, since if I run
SELECT pool_id, Name, min_memory_percent, max_memory_percent, max_memory_kb/1024 AS max_memory_mb,
used_memory_kb/1024 AS used_memory_mb, target_memory_kb/1024 AS target_memory_mb
FROM sys.dm_resource_governor_resource_pools
WHERE Name='InMemoryObjects'
I get the following:
pool_id Name min_memory_percent max_memory_percent max_memory_mb used_memory_mb target_memory_mb
256 InMemoryObjects 50 50 380003 233239 380003
Notice how high the "used_memory_mb" column is. There are no memory-optimized tables in the server at the time I ran this, so I figure it must be data from the dropped memory-optimized tables still somehow taking up RAM.
Similarly, when I run
SELECT type, name, memory_node_id, pages_kb/1024 AS pages_MB
FROM sys.dm_os_memory_clerks WHERE type LIKE '%xtp%'
I get the following output:
type name memory_node_id pages_MB
MEMORYCLERK_XTP Default 0 1055
MEMORYCLERK_XTP DB_ID_19 0 6
MEMORYCLERK_XTP DB_ID_33 0 6
MEMORYCLERK_XTP DB_ID_41 0 56
MEMORYCLERK_XTP DB_ID_47 0 0
MEMORYCLERK_XTP DB_ID_32 0 233240
MEMORYCLERK_XTP Default 1 0
MEMORYCLERK_XTP Default 64 0
See how DB_ID_32 is taking up the same ~240gb of RAM?
I need some way to clear this out. When I run more than one instance of the code, I get the error
"There is insufficient system memory in resource pool 'InMemoryObjects' to run this query". So I think this memory has to actually be tied up, and doesn't release itself when it gets full. The resource pool 'InMemoryObjects' was made just for this one code, and there are no other memory-optimized objects in the entire server besides the ones this code creates (and subsequently drops). The memory-optimized tables it creates are all reasonably small (a few gb apiece).
I know the garbage collector is supposed to run every minute, but it has been over a day since any memory-optimized tables have existed in the database, and the memory used hasn't decreased at all. I've tried forcing garbage collection, checkpoints, resetting statistics, etc., and haven't been able to get this memory back.
The only thing I've found that works is taking the entire database offline and bringing it back online. But I really can't do that each time I run the code, so I need a better solution.
Any ideas would be immensely appreciated. Thank you!
I think this might be the reason, but I'm surprised the forced garbage collection doesn't pick it up. From the MSDN documentation on in-memory garbage collection:
After a user transaction commits, it identifies all queued items associated with the scheduler it ran on and then releases the memory. If the garbage collection queue on the scheduler is empty, it searches for any non-empty queue in the current NUMA node. If there is low transactional activity and there is memory pressure, the main garbage-collection thread can access garbage collect rows from any queue. If there is no transactional activity after (for example) deleting a large number of rows and there is no memory pressure, the deleted rows will not be garbage collected until the transactional activity resumes or there is memory pressure.
https://msdn.microsoft.com/en-us/library/dn643768.aspx
You might try the filestream garbage collection? I think I read somewhere that the memory-optimized filegroup that a DB must have uses the filestream technology. Don't quote me on that.

How to efficiently utilize 10+ computers to import data

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this takes too long, we want to use something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing? The main worry is that a lot of time will be spent replicating the dimension tables across all the nodes, as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than using a single one, due to the extensive network communication and coordination between the servers.
Does anyone have a suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1
You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^25 (over 30 million) items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; the drawback is that the resulting hash would no longer fit in a 64-bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html
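As an illustration of deriving a stable 64-bit ID from a dimension value, here is a small sketch using FNV-1a (chosen only as an example; MurmurHash or any of the functions linked above would do equally well):

# Hash dimension values to deterministic 64-bit IDs; every node computes the
# same ID with no coordination between servers.
FNV_OFFSET = 14695981039346656037
FNV_PRIME = 1099511628211

def dim_id(value: str) -> int:
    h = FNV_OFFSET
    for byte in value.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) % 2**64
    return h - 2**63            # shift into signed bigint range for the DB column

print(dim_id("avalue"), dim_id("anothervalue"))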
Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow you to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file.
[EDIT] After checking with your edits, here is what you should try:
Write a small program which extracts the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory, and then reads the huge CSV files and replaces the values with the IDs (see the sketch after this list).
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
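A minimal sketch of that replacement step, assuming the ID-value files were produced by the cut/sort/nawk pipeline above and the column layout matches the sample CSV (all file names are made up):

# Cache the small "id;value" files in memory, then rewrite the big CSV with IDs.
import csv

def load_ids(path):
    """Read an 'id;value' file produced by the cut | sort | nawk pipeline."""
    with open(path, newline="") as f:
        return {value: int(id_) for id_, value in csv.reader(f, delimiter=";")}

dim1 = load_ids("dimension1_ids.csv")
dim2 = load_ids("dimension2_ids.csv")

with open("huge.csv", newline="") as src, open("facts.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=";")
    for col1, col2 in reader:
        writer.writerow([dim1[col1], dim2[col2]])   # values replaced by IDs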
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.
This is a very generic question and does not take the database backend into account. Firing with 40 or 1000 machines at a database backend that cannot handle the load will give you nothing. Such a problem is truly too broad to answer in a specific way. You should get in touch with people inside your organization with enough skills on the DB level first, and then come back with a more specific question.
Assuming N computers, X files at about 50GB each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to manage the overall process, and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases
Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
When up and running, a "ready" worker database polls the master database for a file to process. The master database doles out files to worker systems, ensuring that no file gets processed by more than one at a time. (You have to track success/failure of loading a given file, watch for timeouts (worker failed), and manage retries.)
Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, the master database is updated with a “ready” flag for that worker, and the worker goes into waiting mode.
The master database has its own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At the start of the process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
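A back-of-the-envelope sketch of the master's file-assignment bookkeeping described above, with plain dicts standing in for master-database tables; the file names, timeout value and retry policy are assumptions:

# Master-side bookkeeping: hand each file to exactly one worker, re-queue on timeout.
import time

files = {f"part_{i:03}.csv": "pending" for i in range(40)}   # hypothetical names
assigned = {}                                                # file -> (worker, start time)
TIMEOUT = 2 * 60 * 60                                        # assumed 2h per file

def next_file(worker):
    """Called when a worker polls for work; never hands a file to two workers."""
    for name, state in files.items():
        if state == "pending":
            files[name] = "in_progress"
            assigned[name] = (worker, time.time())
            return name
    return None

def report_done(worker, name):
    files[name] = "loaded"               # master will merge this worker's schema next

def reap_timeouts():
    """Files held by workers that died get re-queued for another worker."""
    for name, (worker, started) in list(assigned.items()):
        if files[name] == "in_progress" and time.time() - started > TIMEOUT:
            files[name] = "pending"
            del assigned[name]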
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all on one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?
The simplest thing is to make one computer responsible for handing out new dimension item IDs. You can have one for each dimension. If the dimension-handling computers are on the same network, you can have them broadcast the IDs. That should be fast enough.
What database did you plan on using with a 23-dimension star schema? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materialization issues.
You should investigate if there are highly correlating dimensions.
In general, with a 23-dimension star schema with large dimensions, a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely badly on data warehouse queries. To avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be another option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL Server). How long are you willing to wait for query results?
Rohita,
I'd suggest you eliminate a lot of the work from the load by summarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more manageable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimension tables) and the Fact table. Just associate each new-name-seen with an incrementor-for-this-dimension, then write the Fact.
If you do this through named pipes you can push, process remotely, and read back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.
On another note, you could utilize the Windows Hyper-V Cloud Computing add-on for Windows Server: http://www.microsoft.com/virtualization/en/us/private-cloud.aspx
It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
A proper implementation on a modern single server (2x Xeon 5690 CPUs + enough RAM for ALL the dimensions loaded in hash tables + 8GB) should give you at least 10 times better speed, i.e. at least 10 MB/sec.
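As a rough sketch of what "all dimensions loaded in hash tables" looks like on a single loader, with dicts standing in for the hash tables (the file name and the db_load call are assumptions):

# One pass over the CSV: dicts assign dimension IDs on first sight, and the
# resulting fact rows stream straight to a hypothetical bulk loader.
import csv

dim1, dim2 = {}, {}                       # value -> surrogate ID, kept in RAM

def dim_id(table, value):
    if value not in table:
        table[value] = len(table) + 1     # sequential IDs assigned on the fly
    return table[value]

with open("huge.csv", newline="") as f:
    for col1, col2 in csv.reader(f, delimiter=";"):
        db_load(dim_id(dim1, col1), dim_id(dim2, col2))   # hypothetical bulk loader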
