I want to know what exactly sequential write and random write are, by definition. An example would be even more helpful. I tried to Google it, but didn't find much of an explanation.
Thanks
When you write two blocks that are next to each other on disk, you have a sequential write.
When you write two blocks that are located far away from each other on disk, you have random writes.
With a spinning hard disk, the second pattern is much slower (it can be orders of magnitude slower), because the head has to be moved to the new position.
Database technology is (or has been; it is maybe not that important with SSDs anymore) to a large part about optimizing disk access patterns. So what you often see, for example, is trading direct updates of data in their on-disk location (random access) for appends to a transaction log (sequential access). That makes it more complicated and time-consuming to reconstruct the actual value, but makes for much faster commits (and you have checkpoints to eventually consolidate the logs that build up).
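To make the two patterns concrete, here is a minimal sketch in Python; the file name, block size, and block count are arbitrary choices for illustration, not anything a real database uses.

```python
import os
import random

BLOCK = 4096  # one 4 KiB block

def write_sequential(path, nblocks):
    """Sequential pattern: each block lands immediately after the previous one."""
    buf = b"x" * BLOCK
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
    try:
        for _ in range(nblocks):
            os.write(fd, buf)  # file offset simply advances block by block
    finally:
        os.close(fd)

def write_random(path, nblocks):
    """Random pattern: jump to an arbitrary block before each write.
    On a spinning disk, every jump costs a head seek."""
    buf = b"y" * BLOCK
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    try:
        for _ in range(nblocks):
            pos = random.randrange(nblocks) * BLOCK
            os.pwrite(fd, buf, pos)  # write at an arbitrary offset
    finally:
        os.close(fd)
```

Note that on a modern OS both calls usually land in the page cache first, so to actually observe the seek penalty on a spinning disk you would need to bypass or flush the cache.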
Related
What happens once a graph exceeds the available RAM? Persistence is guaranteed through snapshots and WAL, but at some point we will likely hit a limit on how much of the graph can be held in memory.
Is Memgraph aware of the completeness of the graph in memory? If so, does it have strategies to offload less-used paths from memory? And if that’s the case, how would it guarantee that a query returns a complete result (and might not have missed offloaded segments of the graph)?
If file storage-based queries are part of the strategy, how much of a performance hit is to be expected?
Memgraph stores the entire graph inside RAM. In other words, there has to be enough memory to store the whole dataset. That was an early strategic decision because we didn’t want to sacrifice performance. At this point, there is no other option. Memgraph is durable because of snapshots and WALs, and there are memory limits in place (when the limit is reached, Memgraph will stop accepting writes). A side note, but essential to mention: graph algorithms usually use the entire dataset multiple times, which means you have to bring the data into memory either way (Memgraph is really optimized for that case).
There are drawbacks to the above approach. In the great majority of cases, you can find enough RAM, but it’s not the cheapest.
The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log to disk and return to the client
Write the data to disk, cache or memory asynchronously
There are two benefits:
If some exception occurs (e.g., power loss), we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations.
Why not just write the data to disk directly? Make every write go straight to disk. On success, you tell the client it succeeded; if the write failed, you return a failure response or time out.
In this way, you still get those two benefits:
You do not need to recover anything after a power loss, because every success response returned to the client means the data is really on disk.
Performance should be the same. Although we touch the disk frequently, so does the WAL (every successful WAL write means it succeeded on disk).
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again, so those data writes never need to be performed; only the log writes are performed, for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. That is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes, and large sequential writes, still help.
Update:
SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is "free" (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates into different internal block sizes.
As you note a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
write all records to the end of some file
the files are somehow structured
If you go with 1), it is needless to say that the cost of a read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM tree, whose files are internally sorted by key. For that reason, "directly writing to disk" would mean possibly having to rewrite every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
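A toy sketch of that write path; a Python dict stands in for the balanced tree or skiplist of a real memtable, and `flush_sstable` is a made-up name for the flush step, not a RocksDB API.

```python
def memtable_put(memtable: dict, key: str, value: str):
    """In-memory update: cheap, and no on-disk files are rewritten.
    (Durability comes from the WAL write done alongside this.)"""
    memtable[key] = value

def flush_sstable(memtable: dict, path: str):
    """Flush: write all entries sorted by key in one sequential pass.
    Many individual updates are amortized into a single file rewrite."""
    with open(path, "w") as f:
        for key in sorted(memtable):
            f.write(f"{key}\t{memtable[key]}\n")
    memtable.clear()  # the memtable can be reused after the flush
```

Because the flushed file is sorted, point reads can binary-search it, and later compactions can merge several such files in a single sequential pass.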
I have a guess.
Making every write to disk directly does not need recovery after power loss. But the performance issue needs to be discussed in two ways.
situation 1:
All your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write, while the write-data-to-disk operation is a random write. Random write performance is far worse than sequential write performance on a spinning disk.
situation 2:
All your devices are SSDs. Then there may not be much of a performance difference, because sequential and random writes have almost the same performance on an SSD.
Is the order of page flushes with msync(MS_ASYNC) on Linux guaranteed to be the same as the order the pages were written in?
If it depends on circumstances, is there a way for me (full server access) to make sure they are in the same order?
Background
I'm currently using OpenLDAP Symas MDB as a persistent key/value store. Without MDB_MAPASYNC - which results in using msync(MS_ASYNC) (I looked through the source code) - the writes are so slow that even while processing data, a single core is permanently waiting on IO, sometimes at < 1 MB/s. After analyzing, the problem seems to be many small IO ops. Using MDB_MAPASYNC I can easily hit the max rate of my disk, but the MDB documentation states that in that case the database can become corrupted. Unfortunately, the code is too complex for me / I currently don't have the time to work through the whole codebase step by step to find out why that would be. Also, I don't need many of the features MDB provides (transactions, cursors, ACID compliance), so I was thinking of writing my own KV store backed by mmap, using msync(MS_ASYNC) and making sure to write in a way that an un-flushed page would only lose the last-touched data, and not corrupt the database or lose any other data.
But for that I'd need an answer to my question, which I totally can't find by googling or going through the Linux mailing lists, unfortunately (I've found a few mails regarding msync patches, but nothing else).
On a side note, I've looked through dozens of other available persistent KV stores and wasn't able to find a better fit for me (fast writes; easy to use; embedded, so no HTTP services or the like; deterministic speed, so no garbage collection or randomly run compaction like LevelDB; sane space requirements, so no append-only databases; variable key lengths; binary keys and data), but if you know of one which could help me out here, I'd also be very thankful.
msync(MS_ASYNC) doesn't guarantee the ordering of the stores, because the I/O elevator algorithms operating in the background try to maximize efficiency by merging and reordering the writes to maximize throughput to the device.
From man 2 msync:
Since Linux 2.6.19, MS_ASYNC is in fact a no-op, since the kernel properly tracks dirty pages and flushes them to storage as necessary.
Unfortunately, the only mechanism to sync a mapping with its backing storage is the blocking MS_SYNC, which also does not have any ordering guarantees (if you sync a 1 MiB region, the 256 4 KiB pages can propagate to the drive in any order; all you know is that when msync returns, all of the 1 MiB has been synced).
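For completeness, here is what a blocking sync of a mapping looks like from Python, where `mmap.flush()` calls msync(MS_SYNC) under the hood on POSIX systems; the file name and offsets are arbitrary. It demonstrates the blocking call, not any per-page ordering.

```python
import mmap
import os

def update_and_sync(path, offset, data: bytes):
    """Write through a memory mapping, then block until the touched
    pages have reached backing storage (msync(MS_SYNC))."""
    fd = os.open(path, os.O_RDWR)
    try:
        m = mmap.mmap(fd, 0)  # map the whole file, read-write
        m[offset:offset + len(data)] = data
        m.flush()  # blocking MS_SYNC; pages may still hit the drive in any order
        m.close()
    finally:
        os.close(fd)
```

If you need one page to be durable before another is, you have to issue a separate flush for the first region and wait for it to return before writing the second.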
I have an I/O-intensive simulation program that logs the simulation trace/data to a file at every iteration. As the simulation runs for a million or more iterations and logs the data to a file on disk (overwriting the file each time), I am curious to know whether that would wear out the hard disk, since most storage devices have an upper limit on write/erase cycles (e.g., flash disks allow up to 100,000 write/erase cycles). Would splitting the file into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the OS is likely to keep the disk image in memory without writing to disk at all in the interim (unless, perhaps, you're using a journalled file system). If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or erase cycles, to be more accurate) to any particular block. It's probably more than 100K to 1 million to any given block, and SSDs use "wear levelling" to avoid unnecessary writes to the same block every time. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or t'other, so it doesn't help much].
Real hard disks are more of a mechanical device, so there isn't so much of a problem with how many times you write to the same place; it's more the head movements that matter.
I wouldn't worry too much about it. Writing one file should be fine, it has little consequence whether you have one file or many.
I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably on local disk when seeking the OS could even send a message to the drive to seek without sending any data back across the bus for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
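The access pattern in question, sketched in Python with a way to probe the prefetching hypothesis: `os.posix_fadvise` with `POSIX_FADV_RANDOM` tells a Linux kernel to skip readahead for the descriptor, so comparing the two modes is one way to see whether readahead explains the difference (the helper name and parameters are made up for illustration, and `posix_fadvise` is not available on every platform).

```python
import os

def seek_and_read(path, offsets, chunk=512, disable_readahead=False):
    """Read a small chunk at each offset, skipping everything between."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if disable_readahead and hasattr(os, "posix_fadvise"):
            # Hint: access is random, so don't prefetch ahead of each read.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        # pread combines the lseek+read pair into one positioned read.
        return [os.pread(fd, chunk, off) for off in offsets]
    finally:
        os.close(fd)
```

If the timings converge once readahead is disabled, prefetching (rather than a bug in your code) is the likely explanation.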
The magnitude of the difference will depend on the ratio of the seek count and data actually read to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there's rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round trip time. The network protocol may play a role too. Even solid state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movements (perhaps naively even for storage devices without a head). A single bulk request might give you some greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
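A sketch of the memory-map approach for this access pattern: the file is mapped read-only and chunks are sliced out directly, so pages are faulted in on demand (still subject to kernel readahead policy) instead of being read sequentially.

```python
import mmap
import os

def read_chunks_mmap(path, offsets, chunk=512):
    """Map the file read-only and slice out a chunk at each offset;
    only the touched pages are faulted in."""
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return [bytes(m[off:off + chunk]) for off in offsets]
        finally:
            m.close()
```

For sparse reads of a large file this avoids both the per-call syscall overhead and the kernel-to-userspace copy of data you never look at.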
Depending on the filesystem, the specific lseek implementation may create some overhead.
For example, I believe that when using NFS, lseek locks the kernel by calling remote_llseek().