Reed-Solomon Erasure Encoding and Replication Factor - filesystems

I'm researching distributed file system architectures and designs. Quite a few of the DFSs I've come across share the following architecture:
A namenode or metadata server used to manage the location of data blocks / chunks as well as the hierarchy of the filesystem.
A data node or data server used to store chunks or blocks of data belonging to one or more logical files.
A client that talks to a namenode to find appropriate data nodes to read/write from/to.
Many of these systems expose two primary tunable parameters: a block size and a replication factor.
My question is:
Are Replication Factor and Forward Error Correction techniques like Reed-Solomon Erasure Encoding compatible here? Does it make sense to use both to ensure high availability of data? Or is it enough to use one or the other (and what are the trade-offs)?

Whether you can mix and match plain old replication and erasure codes depends on what the distributed file system in question offers in its feature set, but the two are usually mutually exclusive.
Replication is simple in the sense that the file/object is replicated as a whole to 'n' (the replication factor) data nodes. Writes go to all nodes. Reads can be served from any one of the nodes individually, since each hosts the whole file, so you can distribute different reads among multiple nodes. There is no intermediate math involved, and the workload is mostly I/O bound. Also, for a given file size, the disk usage is higher (since there are 'n' full copies).
Erasure codes are complex in the sense that parts of the file/object are encoded and spread among the 'n' data nodes during writes. Reads need to fetch data from more than one node, decode it, and reconstruct the data. So math is involved, and the workload can become CPU bound. Compared to replication, the disk usage is lower, and the fault tolerance is set by the number of parity fragments rather than the number of full copies: an RS(k, m) scheme survives the loss of any m of its k + m fragments.
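To make the trade-off concrete, here is a minimal sketch in C; the replication factor of 3 and the RS(6, 3) layout are illustrative parameters I picked, not the defaults of any particular file system:

```c
#include <stdio.h>

/* Back-of-the-envelope comparison of n-way replication vs. a
 * Reed-Solomon RS(k, m) layout (k data fragments + m parity fragments).
 * The numbers below are illustrative, not defaults of any real DFS. */
int main(void) {
    int n = 3;          /* replication factor */
    int k = 6, m = 3;   /* RS data and parity fragment counts */

    /* n full copies; data survives as long as at least one copy remains. */
    printf("replication x%d: overhead %.2fx, tolerates %d lost nodes\n",
           n, (double)n, n - 1);

    /* k + m fragments, each 1/k of the file; any k of them suffice to
     * reconstruct the data, so any m fragments may be lost. */
    printf("RS(%d,%d):       overhead %.2fx, tolerates %d lost nodes\n",
           k, m, (double)(k + m) / k, m);
    return 0;
}
```

With these illustrative numbers, triple replication pays 3x storage to survive two lost nodes, while RS(6, 3) pays 1.5x to survive three - at the price of the encoding/decoding work and multi-node reads described above.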

Related

Does a "rescale()" operation cause serialization?

If I call a rescale() operation in Flink, I assume that there is NO serialization/deserialization (since the data is not crossing nodes), right? Further, is it correct to assume that objects are not copied/deep copied when rescale() is called?
I ask because I'm passing some large objects, 99% of which are common between multiple threads, so it would be a tremendous RAM waste if the objects were recopied in each thread after a rescale(). Instead, all the different threads should point to the same single object in the java heap for that node.
(Of course, if I call a rebalance, I would expect that there would be ONE serialization of the common objects to the other nodes, even if there are dozens of threads on each of the other nodes? That is, on the other nodes, there should only be 1 copy of a common object that all the threads for that node can share, right?)
Based on the rescale() documentation, there will be network traffic (and thus serialization/deserialization), just not as much as a rebalance(). But as several Flink committers have noted, data skew can make the reduction in network traffic insignificant compared to the cost of unbalanced data, which is why rebalance() is the default action when the stream topology changes.
Also, if you're passing around a lot of common data, then maybe look at using a broadcast stream to more efficiently share that across nodes.
Finally, it's conceptually easier to think about sub-tasks than threads. Each operator instance runs as a sub-task, which (on one Task Manager) is indeed executed in its own thread, but the operator instances are separate, which means you don't have to worry about multi-threading at the operator level (unless you use class variables, which is usually a Bad Idea).

what is a sequential write and what is a random write

I want to know what exactly a sequential write is and what a random write is, by definition. An example would be even more helpful. I tried to Google it, but didn't find much of an explanation.
Thanks
When you write two blocks that are next to each other on disk, you have a sequential write.
When you write two blocks that are located far away from each other on disk, you have random writes.
With a spinning hard disk, the second pattern is much slower (it can be orders of magnitude slower), because the head has to be moved around to the new position.
Database technology is (or has been; it is perhaps less important with SSDs) to a large part about optimizing disk access patterns. So what you often see, for example, is trading direct updates of data in their on-disk location (random access) for writes to a transaction log (sequential access). This makes it more complicated and time-consuming to reconstruct the actual value, but makes for much faster commits (and you have checkpoints to eventually consolidate the logs that build up).
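As a rough illustration, here is a minimal C sketch of the two patterns; the file name, block size, and block count are arbitrary choices. On a spinning disk, the first loop tends to run far faster than the second.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   4096
#define NBLOCKS 1024

int main(void) {
    char buf[BLOCK];
    memset(buf, 'x', sizeof buf);

    int fd = open("pattern.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Sequential: each block lands immediately after the previous one,
     * so the disk head barely moves between writes. */
    for (int i = 0; i < NBLOCKS; i++)
        if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }

    /* Random: each block goes to an arbitrary offset, forcing a head
     * movement (seek) before nearly every write on a spinning disk. */
    for (int i = 0; i < NBLOCKS; i++) {
        off_t off = (off_t)(rand() % NBLOCKS) * BLOCK;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); return 1; }
    }

    close(fd);
    return 0;
}
```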

c linux msync(MS_ASYNC) flush order

Is the order of page flushes with msync(MS_ASYNC) on Linux guaranteed to be the same as the order in which the pages were written?
If it depends on circumstances, is there a way for me (full server access) to make sure they are in the same order?
Background
I'm currently using OpenLDAP Symas MDB as a persistent key/value store, and without MDB_MAPASYNC - which results in using msync(MS_ASYNC) (I looked through the source code) - the writes are so slow that, even while processing data, a single core is permanently waiting on I/O, sometimes at < 1 MB/s. After analyzing, the problem seems to be many small I/O operations. Using MDB_MAPASYNC I can easily hit the maximum rate of my disk, but the documentation of MDB states that in that case the database can become corrupted.
Unfortunately the code is too complex for me, and I currently don't have the time to work through the whole codebase step by step to find out why this would be. Also, I don't need many of the features MDB provides (transactions, cursors, ACID compliance), so I was thinking of writing my own KV store backed by mmap, using msync(MS_ASYNC) and making sure to write in such a way that an un-flushed page would only lose the last-touched data, and not corrupt the database or lose any other data.
But for that I'd need an answer to my question, which I can't find by googling or going through the Linux mailing lists, unfortunately (I've found a few mails regarding msync patches, but nothing else).
As a side note, I've looked through dozens of other available persistent KV stores and wasn't able to find a better fit for me (fast writes; easy to use; embedded, so no HTTP services or the like; deterministic speed, so no garbage collection or randomly run compaction as in leveldb; sane space requirements, so no append-only databases; variable key lengths; binary keys and data), but if you know of one which could help me out here, I'd also be very thankful.
msync(MS_ASYNC) doesn't guarantee the ordering of the stores, because the I/O elevator algorithms operating in the background merge and reorder writes to maximize throughput to the device.
From man 2 msync:
Since Linux 2.6.19, MS_ASYNC is in fact a no-op, since the kernel properly tracks dirty pages and flushes them to storage as necessary.
Unfortunately, the only mechanism to sync a mapping with its backing storage is the blocking MS_SYNC, which also does not have any ordering guarantees (if you sync a 1 MiB region, the 256 4 KiB pages can propagate to the drive in any order -- all you know is that if msync returns, all of the 1 MiB has been synced).
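If you need ordering between dependent updates, the usual approach is to use the blocking msync(MS_SYNC) as a write barrier between them. A minimal sketch, assuming a hypothetical two-page file layout (page 0 holds a header, page 1 holds data):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long pg = sysconf(_SC_PAGESIZE);

    int fd = open("kv.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, 2 * pg) < 0) { perror("setup"); return 1; }

    char *map = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* 1. Write the new record into the data page (page 1). */
    strcpy(map + pg, "value");

    /* 2. Barrier: block until the data page is durably on disk. */
    if (msync(map + pg, pg, MS_SYNC) < 0) { perror("msync data"); return 1; }

    /* 3. Only now point the header (page 0) at the new record. A crash
     * before this sync leaves the old header referencing old, intact data. */
    strcpy(map, "header -> value");
    if (msync(map, pg, MS_SYNC) < 0) { perror("msync header"); return 1; }

    munmap(map, 2 * pg);
    close(fd);
    return 0;
}
```

The price is one blocking sync per ordering point, so such barriers are typically batched across many updates.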

When are sequential seeks with small reads slower than reading a whole file?

I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably, on a local disk, the OS could when seeking even send a message to the drive to seek without transferring any data back across the bus, for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
The magnitude of the difference will depend on the ratio of the data actually read (and the number of seeks) to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there are rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round-trip time. The network protocol may play a role too. Even solid-state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movement (perhaps naively, even for storage devices without a head). A single bulk request might give you greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
Depending on the filesystem, the specific lseek implementation may create some overhead.
For example, I believe that when using NFS, lseek acquires a kernel lock by calling remote_llseek().
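One way to test the prefetching hypothesis is to tell the kernel the access pattern is random, which shrinks or disables readahead on Linux. A minimal sketch, with a hypothetical file name and arbitrary offsets:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint that access will be random; on Linux this shrinks or
     * disables the readahead window for this file descriptor. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (rc != 0) fprintf(stderr, "posix_fadvise: error %d\n", rc);

    /* Small reads at widely spaced offsets, as in the question. */
    char buf[512];
    for (off_t off = 0; off < 500L * 1024 * 1024; off += 50L * 1024 * 1024) {
        ssize_t n = pread(fd, buf, sizeof buf, off);
        if (n < 0) { perror("pread"); break; }
        printf("read %zd bytes at offset %lld\n", n, (long long)off);
    }

    close(fd);
    return 0;
}
```

If the gap versus reading the whole file closes with the hint in place, readahead was the culprit rather than a bug in your code.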

How can a storage system handle different write streams to the same place?

Normally, if two applications send two write requests to the same place (LBA) on the disk, the applications or the file system will take a lock, so that only one request is handled at a time.
But now there is a difficult problem. There may be multiple write requests that should be written to the same place, but the applications don't handle locking. There is no file system, because the data is written directly to the raw disk. What I can do is modify the code of the storage system. Things are very complicated now. Suppose there are two requests, A and B. The data finally stored at the corresponding LBA may then be one of three results:
1. All data are from A.
2. All data are from B.
3. Parts of the data are from A; parts are from B.
In my opinion, results 1 and 2 are acceptable, but result 3 is not. But some people don't think so. What are your opinions?
I agree that it should be all of one or all of the other. This can be done quite easily by using a form of storage system manager and writing to the manager in large enough chunks. The manager can do the appropriate locking internally so that only one request's blocks are written at a time, and you don't get overlaps.
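A minimal sketch of such a manager in C, with hypothetical names, using a single mutex so that concurrent requests to the raw device are serialized whole:

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

struct store {
    int fd;                 /* raw disk (here: an ordinary file) */
    pthread_mutex_t lock;   /* serializes whole requests */
};

/* Write an entire request under the lock: concurrent callers are
 * serialized, so the target LBAs end up holding all of one request's
 * data or all of the other's, never an interleaving. */
ssize_t store_write(struct store *s, const void *buf, size_t nblocks, off_t lba)
{
    pthread_mutex_lock(&s->lock);
    ssize_t n = pwrite(s->fd, buf, nblocks * BLOCK, lba * BLOCK);
    pthread_mutex_unlock(&s->lock);
    return n;
}

int main(void) {
    struct store s = { open("raw.img", O_RDWR | O_CREAT, 0644),
                       PTHREAD_MUTEX_INITIALIZER };
    char a[BLOCK], b[BLOCK];
    memset(a, 'A', sizeof a);
    memset(b, 'B', sizeof b);

    /* Two requests aimed at the same LBA; with the manager in between,
     * the final contents are all 'A' or all 'B', never a mix. */
    store_write(&s, a, 1, 0);
    store_write(&s, b, 1, 0);

    close(s.fd);
    return 0;
}
```

A real manager would likely shard the lock by LBA range so unrelated requests don't serialize against each other, and would handle short writes from pwrite.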
