File Replication Solutions

Thinking about a Windows-hosted build process that will periodically drop files to disk to be replicated to several other Windows Servers in the same datacenter. The other machines would run IIS, and serve those files to the masses.
The total corpus size would be millions of files, hundreds of GB of data. It would have to deal with possible contention on the target servers, latent links (e.g. over a WAN), and cold-starting clean servers.
Solutions I've thought about so far:
A queued system with daemons that either wake periodically and copy, or run as services.
SAN - expensive, complex, more expensive.
ROBOCOPY on a timed job - simple but effective. Lots of internal/indeterminate state, e.g. where it's at in the copy, errors.
Off-the-shelf replication software - less expensive than a SAN but still expensive.
UNC shared folders and no replication. Higher latency, lower cost - still need a clustering solution too.
DFS Replication.
What else have other folks used?

I've used rsync scripts with good success for this type of work - thousands of machines in our case. I believe there is an rsync server for Windows, but I have not used it on anything other than Linux.
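For what it's worth, a rough sketch of the kind of driver script we use - here in Python rather than shell, assuming an rsync binary is on the PATH (e.g. via cwRsync on Windows), with made-up server names and a made-up daemon module name:

    import subprocess

    # Hypothetical drop directory and target servers -- adjust to your environment.
    SOURCE_DIR = "D:/build-drop/"
    TARGETS = ["web01", "web02", "web03"]

    def mirror_to(host: str) -> bool:
        """Mirror SOURCE_DIR to one target via rsync; returns True on success."""
        # -a         preserve attributes and recurse
        # --delete   remove files on the target that no longer exist in the source
        # --partial  keep partially transferred files so a flaky WAN link can resume
        result = subprocess.run(
            ["rsync", "-a", "--delete", "--partial",
             SOURCE_DIR, f"{host}::content/"],  # '::content' assumes an rsync daemon module named 'content'
            capture_output=True, text=True)
        if result.returncode != 0:
            print(f"{host}: rsync failed ({result.returncode}): {result.stderr.strip()}")
            return False
        return True

    if __name__ == "__main__":
        for target in TARGETS:
            mirror_to(target)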

Though we do not have millions of files or hundreds of gigabytes to manage, we send and collect lots of files overnight between our main company and its agencies abroad. We have been using Allway Sync for a while. It handles folder/FTP synchronization, has a nice interface that allows folder and file analysis and comparison, and it can of course be scheduled.

UNC shared folders and no replication has many downsides, especially if IIS is going to use UNC paths as home directories for sites. Under stress, you will run into http://support.microsoft.com/default.aspx/kb/810886 because of the number of simultaneous sessions against the server sharing the folder. Also, you will experience slow IIS site startups since IIS is going to want to scan/index/cache (depending on IIS version and ASP settings) the UNC folder.
I've seen tests with DFS that are very promising, exhibiting none of the above restrictions.

We use ROBOCOPY in my organization to pass files around. It runs very smoothly and I feel it is worth a recommendation.
Additionally, you are not doing anything too crazy. If you are also proficient in Perl, I am sure you could write a quick script that will fulfill your needs; see the sketch below for the general shape.
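Not Perl, but to give an idea of the kind of quick script meant here - a Python wrapper around ROBOCOPY that at least surfaces the exit code and keeps a per-run log, so the "where is it at / did it fail" state is less opaque. Paths and share names below are made up:

    import subprocess
    from datetime import datetime

    SOURCE = r"D:\build-drop"
    TARGETS = [r"\\web01\content", r"\\web02\content"]   # hypothetical UNC shares

    for target in TARGETS:
        log = rf"D:\logs\robocopy-{datetime.now():%Y%m%d-%H%M%S}.log"
        # /MIR mirror the tree, /R:3 /W:5 limit retries, /NP no progress spam, /LOG+ append to a log file
        rc = subprocess.run(
            ["robocopy", SOURCE, target, "/MIR", "/R:3", "/W:5", "/NP", f"/LOG+:{log}"]
        ).returncode
        # ROBOCOPY exit codes: 0-7 mean success or partial success, 8 and above indicate failures.
        if rc >= 8:
            print(f"Replication to {target} failed with exit code {rc}, see {log}")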

Related

Filesystem performance issue with lots of sub-directories for attachments

All the attachments from issues are put into a #hashed directory. Each attachment creates its own hashed sub-directory. Over time this can create a lot of sub-directories in a single directory: a project with 100K issues averaging 10 images each can lead to 1 million of them. This can be a performance bottleneck. What does GitLab suggest self-hosting users do about this? Thanks
Answer here https://forum.gitlab.com/t/filesystem-performance-issue-with-lots-of-sub-directories-for-attachments/76004
Basically quoting #gitlab-greg from the GitLab forum:
Usually bottlenecks are caused by excessive read/write operations where input/output is the limiting factor. Attachments usually take up minimal storage space, so it’s actually much easier and more common for someone to hit max I/O by downloading/pushing a lot of Git Repo, LFS, or (Package|Container) Registry data in parallel. Alternatively, if storage space is filled up (90%+ full), you might see some performance issues.
I’ve not heard of any situations where the existence of high number of image uploads or a large amount of sub-directories in #hashed directory has caused a performance bottleneck in itself. GitLab uses a database to keep track of where issue attachments are stored, so it’s not like it’s crawling or looking through every directory in #hashed every time an upload or image is requested.
I’ve worked with GitLab admins who have (tens of) thousands of users and I’ve never seen subdirectories for attachments result in a performance bottleneck. If you find it does cause a bottleneck, please create an issue.

CouchDB parallel replications cause high CPU usage

I have a per-user DB architecture like so:
There are around 200 user DBs and each has a continuous replication link to the master couch (all within the same Couch instance). The problem is that CPU usage at any given time is always close to 100%.
The DBs are idle, so no data is being written to or read from them. There are only a few KB of data per DB, so I don't think the load is an issue at this point. The master DB size is less than 10 MB.
How can I go about debugging this performance issue?
You should have a look at https://github.com/redgeoff/spiegel - it's a tool to handle many CouchDB replications in a scalable way. Basically it achieves that by listening to the _global_changes endpoint and creating replications only when needed.
In recent CouchDB versions (2.1.0+), the replicator has been improved, but I think for replicating per-user databases it still makes sense to use an external mechanism like Spiegel to manage the number of active replications.
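A very rough sketch of that "only replicate when something changed" idea, listening to the _db_updates feed (which needs the _global_changes database to exist) and posting one-shot jobs to _replicator. The credentials, host, and the userdb- naming convention are assumptions:

    import json
    import requests

    COUCH = "http://admin:password@localhost:5984"   # hypothetical admin credentials

    def replicate_once(db_name: str) -> None:
        """Create a one-shot replication from a user DB into the master DB."""
        requests.post(f"{COUCH}/_replicator", json={
            "source": f"{COUCH}/{db_name}",
            "target": f"{COUCH}/master",
            "continuous": False,   # one-shot: no idle replication process hanging around
        }).raise_for_status()

    # Listen to the global changes feed and only react to user DBs that actually changed,
    # instead of keeping 200 continuous replications alive.
    with requests.get(f"{COUCH}/_db_updates", params={"feed": "continuous"}, stream=True) as feed:
        for line in feed.iter_lines():
            if not line:
                continue   # heartbeat newline
            event = json.loads(line)
            if event.get("type") == "updated" and event["db_name"].startswith("userdb-"):
                replicate_once(event["db_name"])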
Just as a reminder, there are some security flaws in CouchDB 2.1.0 and you might need to upgrade to 2.1.1. Maybe you've even been hacked, as has happened to others.

Does LevelDB support hot backups (or equivalent)?

Currently we are evaluating several key+value data stores to replace an older ISAM currently in use by our main application (for 20-something years!).
The problem is that our current ISAM doesn't support crash recovery.
So LevelDB seemed OK to us (we're also checking BerkeleyDB, etc.).
But we ran into the question of hot backups, and, given the fact that LevelDB is a library and not a server, it is odd to ask for a 'hot backup', as it would intuitively imply an external backup process.
Perhaps someone would like to propose options (or known solutions) ?
For example:
- Hot backup through an inner thread of the main application?
- Hot backup by merely copying the LevelDB data directory?
Thanks in advance
You can do a snapshot iteration through a LevelDB, which is probably the best way to make a hot copy (don't forget to close the iterator).
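A minimal sketch of that snapshot iteration using the plyvel Python bindings (API assumed; any LevelDB binding with snapshots works the same way). Note it has to run inside the application process, since LevelDB only allows one process to hold the DB open:

    import plyvel

    db = plyvel.DB("/path/to/source-db")
    backup = plyvel.DB("/path/to/backup-copy", create_if_missing=True)

    # A snapshot gives a consistent point-in-time view while the application keeps writing.
    snapshot = db.snapshot()
    it = snapshot.iterator()
    try:
        for key, value in it:
            backup.put(key, value)
    finally:
        it.close()        # don't forget to close the iterator
        snapshot.close()
        backup.close()
        db.close()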
To backup a LevelDB via the filesystem I have previously used a script that creates hard links to all the .sst files (which are immutable once written), and normal copies of the log (and MANIFEST, CURRENT etc) files, into a backup directory on the same partition. This is fast since the log files are small compared to the .sst files.
The DB must be closed (by the application) while the backup runs, but the time taken will obviously be much less than the time taken to copy the entire DB to a different partition or upload it to S3; that slower copy can be done from the backup directory once the DB has been re-opened by the application.
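Roughly what that script looks like in Python, assuming the backup directory is on the same partition (hard links don't cross filesystems) and that the application has closed the DB before this runs - paths are placeholders:

    import os
    import shutil

    DB_DIR = "/data/leveldb"
    BACKUP_DIR = "/data/leveldb-backup"   # must be on the same filesystem for os.link to work

    os.makedirs(BACKUP_DIR, exist_ok=True)
    for name in os.listdir(DB_DIR):
        src = os.path.join(DB_DIR, name)
        dst = os.path.join(BACKUP_DIR, name)
        if name.endswith(".sst") or name.endswith(".ldb"):
            # table files are immutable once written, so a hard link is a safe, instant "copy"
            os.link(src, dst)
        else:
            # log, MANIFEST-*, CURRENT etc. are small, so a normal copy is cheap
            shutil.copy2(src, dst)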
LMDB is an embedded key value store, but unlike LevelDB it supports multi-process concurrency so you can use an external backup process. The mdb_copy utility will make an atomic hot backup of a database, your app doesn't need to stop or do anything special while the backup runs. http://symas.com/mdb/
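For completeness, driving mdb_copy from an application or scheduled job is just a process invocation; paths are placeholders and the destination directory is assumed to exist already:

    import subprocess

    # mdb_copy takes the LMDB environment directory and a destination; -c compacts while copying.
    subprocess.run(["mdb_copy", "-c", "/data/lmdb-env", "/backups/lmdb-env-copy"], check=True)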
I am coming a bit late to this question, but there are forks of LevelDB that offer good live backup capability, such as HyperLevelDB and RocksDB. Both of these are available as npm modules, i.e. level-hyper and level-rocksdb. For more discussion, see How to backup RocksDB? and HyperDex Question.

Deploying to more than one application server

We have reached that point where one application server is not enough.
Apart from the performance benefits, we want to have some sort of failover scenario, as we cannot afford 30 minutes of downtime just because the server needs a reboot for a new kernel.
The first issue that has to be resolved is where to store the shared files so that both application servers can access it all the time. Possible solutions are:
DRBD with NFS on both servers.
A NAS with NFS
SAN with a clustered filesystem
We want to keep the solution as simple as possible and we don't want to spend a fortune on it. DRBD synchronization is cheap but too fragile. SANs are expensive and I've heard some horror stories about clustered filesystems. A NAS with NFS seems the best fit.
How do you handle shared storage from multiple servers? Have you encountered any problems with NFS?
Bonus question: Sun has a Startup Essentials program that offers significant discounts, and we can grab their Sun 7110 Unified Storage for about 6k. Does anyone have any experience with unified storage? Are there any fairly priced alternatives?
I only have experience with NFS mounts (for ten app servers) using a NetApp[1] storage product and can't say anything negative.
[1] http://www.netapp.com

Advice on using a web server as a cache

I'd like advice on the following design. Is it reasonable? Is it stupid/insane?
Requirements:
We have some distributed calculations that work on chunks of data that are sometimes up to 50 MB in size.
Because the calculations take a long time, we like to parallelize the calculations on a small grid (around 20 nodes)
We "produce" around 10000 of these "chunks" of binary data each day - and want to keep them around for up to a year... Most of the items aren't 50Mb in size though, so the total daily space requirement is more around 5Gb... But we'd like to keep stuff around for as long as possible, (a year or more)... But hey, you can get 2TB hard disks nowadays.
Although we'd like to keep the data around, this is essentially a "cache". It's not the end of the world if we lose data - it just has to get recalculated, which just takes some time (an hour or two).
We need to be able to efficiently get a list of all "chunks" that were produced on a particular day.
We often need to, from a support point of view, delete all chunks created on a particular day or remove all chunks created within the last hour.
We're a Windows shop - we can't easily switch to Linux/some other OS.
We use SQLServer for existing database requirements.
However, it's a large and reasonably bureaucratic company that has some policies that limit our options: for example, conventional database space using SQLServer is charged internally at extremely expensive prices. Allocating 2 terabytes of SQL Server space is prohibitively expensive. This is mainly because our SQLServer instances are backed up, archived for 7 years, etc. etc. But we don't need this "gold-plated" functionality because we can just recreate the stuff if it goes missing. At heart, it's just a cache, that can be recreated on demand.
Running our own SQLServer instance on a machine that we maintain is not allowed (all SQLServer instances must be managed by a separate group).
We do have a fairly small transactional requirement: if a process that was producing a chunk dies halfway through, we'd like to be able to detect such "failed" transactions.
I'm thinking of the following solution, mainly because it seems like it would be very simple to implement:
We run a web server on top of a windows filesystem (NTFS)
Clients "save" and "load" files by using HTTP requests, and when processes need to send blobs to each other, they just pass the URLs.
Filenames are allocated using GUIDs - but with a directory for each date. So all of the files created on 12th November 2010 would go in a directory called "20101112" or something like that. This way, by getting a "directory" for a date, we can find (and delete or copy) all of the files produced for that date using normal file operations.
Indexing is done by a traditional SQL Server table, with a "URL" column instead of a "varbinary(max)" column.
To preserve the transactional requirement, a process that is creating a blob only inserts the corresponding "index" row into the SQL Server table after it has successfully finished uploading the file to the web server. So if it fails or crashes halfway, such a file "doesn't exist" yet, because the corresponding row used to find it does not exist in the SQL Server table(s). (See the sketch below for this flow.)
I like the fact that the large chunks of data can be produced and consumed over a TCP socket.
In summary, we implement "blobs" on top of SQL Server much the same way that they are implemented internally - but in a way that does not use very much actual space on an actual SQL server instance.
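To make the transactional part concrete, a rough sketch of the "upload first, index second" flow. The URL scheme, table name, and connection string are invented for illustration, and it assumes an HTTP-PUT-capable web server (e.g. with WebDAV enabled) and pyodbc for the SQL Server side:

    import uuid
    from datetime import date

    import pyodbc
    import requests

    WEB_ROOT = "http://chunkcache.internal"          # hypothetical cache web server
    SQL_CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=dbserver;DATABASE=ChunkIndex;Trusted_Connection=yes")

    def save_chunk(data: bytes) -> str:
        """Upload a chunk, then record it in the index table; returns the chunk URL."""
        day = date.today().strftime("%Y%m%d")        # one directory per day, e.g. 20101112
        url = f"{WEB_ROOT}/{day}/{uuid.uuid4()}.bin"

        # 1) Upload the blob. If this fails or the process dies here, no index row exists,
        #    so the chunk simply "doesn't exist" as far as consumers are concerned.
        requests.put(url, data=data).raise_for_status()

        # 2) Only after a successful upload, insert the index row that makes it visible.
        conn = pyodbc.connect(SQL_CONN)
        try:
            conn.execute(
                "INSERT INTO Chunks (ChunkUrl, CreatedDate) VALUES (?, ?)",
                url, day)
            conn.commit()
        finally:
            conn.close()
        return url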
So my questions are:
Does this sound reasonable? Is it insane?
How well do you think this would work on top of a typical Windows NTFS filesystem? (5000 files per "dated" directory, several hundred directories, one for each day.) There would eventually be many hundreds of thousands of files (but not too many directly underneath any one particular directory). Would we start to have to worry about hard disk fragmentation, etc.?
What about if 20 processes are all, via the one web server, trying to write 20 different "chunks" at the same time - would that start thrashing the disk?
What web server would be best to use? It needs to be rock solid, run on Windows, and able to handle lots of concurrent users.
As you might have guessed, outside of the corporate limitations, I would probably set up a SQLServer instance and just have a table with a "varbinary(max)" column... But given that is not an option, how well do you think this would work?
This is all somewhat out of my usual scope so I freely admit I'm a bit of a Noob in this department. Maybe this is an appalling design... but it seems like it would be very simple to understand how it works, and to maintain and support it.
Your reasons behind the design are insane, but they're not yours :)
NTFS can handle what you're trying to do. This shouldn't be much of a problem. Yes, you might eventually have fragmentation problems if you run low on disk space, but make sure that you have copious amounts of space and you shouldn't have a problem. If you're a Windows shop, just use IIS.
I really don't think you will have much of a problem with this architecture. Just keep it simple like you're doing and things should be fine.
