Implementing snapshot functionality for latency demanding program - c

We are developing program, that works on data located in shared memory. Program is latency demanding and processes huge amount of data.
If program fails, we must return to last working state FAST.
One way is to read and process data from transaction log, which contains transactions from the start of the day. But this is not fast at all, considering size of transaction log (hundreds of gigabytes).
We are now looking the way to create snapshots of data that can be written to disk and read very fast if program fails. But snapshot creation must not lock program execution and data in that snapshot must be consistent.
If we were using local memory for keeping data instead of shared memory, solution will be easy:
fork()
write data to disk
Because of copy-on-write on linux, only changed data will be copied, so it is very fast.
But we are using posix shared memory.
Is there any way to do it with speed and consistency in mind?

If you can spare enough CPU cycles for a memcpy(), you could:
fork()
lock shared memory
memcpy (shared_mem -> some_buffer)
unlock shared memory
write data to disk taking the time you like

Related

why mmap is faster than traditional file io [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

How could WAL (write ahead log) have better performance than write directly to disk?

The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log to disk and return to the client
Write the data to disk, cache or memory asynchronously
There are two benefits:
If some exception occurs (i.e. power loss) we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations
Why not just write the data into disk directly? You make every write directly to disk. On success, you tell client success, if the write failed you return a failed response or timeout.
In this way, you still have those two benefits.
You do not need to recover anything in case of power off. Because every success response returned to client means data really on disk.
Performance should be the same. Although we touch disk frequently, but WAL is the same too (Every success write for WAL means it is success on disk)
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. These writes do not need to be performed, with only the log writes performed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology because seek times and rotational latency were a bit issue. This is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes, and large sequential writes still help.
Update:
SSDs also have better performance with large sequential writes but for different reasons. It is not as simple as saying "no seek time or rotational latency therefore just randomly write". For example, writing large blocks into space the SSD knows is "free" (eg. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially mapping updates into different internal block sizes.
As you note a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
write all records to the end of some file
the files are somehow structured
If you go with 1) it is needless to say that the cost of read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM, which uses files that are internally sorted by key. For that reason, "directly writing to disk" means that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
I have some guess.
Make every write to disk directly do not need recovery on power off. But the performance issue need to discuss in two way.
situation 1:
All your storage device is spinning disk. The WAL way will have better performance. Because when you write WAL it is sequential write. The write data to disk operation is random write. The performance for random write is very poor than sequential write for spinning disk.
situation 2:
All your device is SSD. Then the performance may not be too much difference. Because sequential write and random write have almost the same performance for SSD.

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

Alternative to reduce large number of binary files reading access time from hard disk

In my first prototype of application, I have to read around 400,000 files (each 4KB file, around total 1.5 GB data) from hard disk sequentially, and do some operation over the data read from each files, and store the results over RAM. Through this mechanism, I were first accessing I/O for a file and then utilizing CPU for operation, and keep going for another file, but it was very slow process.
To work around, now we first read all the files, and stored all the files data in the RAM, and now doing operation (utilizing CPU). It gave significant improvement.
But in my second phase of development, I have to read 20 GB of data, which now I cannot store in RAM. And, single reading operation with CPU utilization is very time consuming operation.
Can someone please suggest some method to work around this problem?
I am developing this application on Windows in C, with Visual Studio compiler.
There's a technique called Asynchronous I/O (AIO) that lets you keep doing some processing with the CPU while a file is read in the background. You can use this to read the next few files at the same time as you're processing a file.
The various AIO calls are OS-specific. On Windows, Microsoft call it "Overlapped I/O". See this Wikipedia page or this MSDN page for more info.
To work around, now we first read all the files, and stored all the files data in the RAM, and now doing operation (utilizing CPU).
(Assuming files can be processed independently...)
You are half-way there. Instead of waiting until all files have been loaded to RAM, start processing as soon as any file is loaded. That would be a form of pipelining.
You'll need three components:
A thread1 that reads files ("producer").
A thread2 that processes the files ("consumer").
A message queue3 between them.
The producer reads the files the way you are already doing it, but instead of processing them, just enqueues them to the message queue. The consumer thread waits until it can dequeue the file from the queue, processes it, and then immediately frees the memory that has been occupied by the file and resumes waiting to the queue.
In case you can process files by sequentially traversing them start-to-finish, you could even devise a more fine-grained "streaming", where files wold be both read and processed in chunks, which could lower the peak memory consumption even more (e.g. if you have some extra-large files that would no longer need to be kept whole in the memory).
1 Or a set of threads to parallelize the I/O, if you anticipate reading from multiple physical disks.
2 Or a set of threads to saturate the CPU cores, if processing the file is not cheaper than reading it.
3 You don't need a fancy persistent distributed message queue for that. Just a
straight in-memory queue, a-la BlockingCollection in .NET (I'm sure you'll find something similar for pure C).
Create threads (in loop) which will read files into RAM.
Work with the data in RAM in separate thread[s] and free RAM after processing.
Keep limits and a poll of records about files (read and processed) in the shared object protected by mutex.
Use semaphore for resources (files in RAM) production/utilisation synchronisation.

having database in memory - C

I am programming a server daemon from which users can query data in C. The data can also be modified from clients.
I thought about keeping the data in memory.
For every new connection I do a fork().
First thing I thought about that this will generate a copy of the db every time a connection takes places, which is a waste of memory.
Second problem I have is that I don't know how to modify the database in the parent process.
What concepts are there to solve these problems?
Shared memory and multi-threading are two ways of sharing memory between multiple execution units. Check out POSIX Threads for multi-threading, and don't forget to use mutexes and/or semaphores to lock the memory areas from writing when someone is reading.
All this is part of the bigger problem of concurrency. There are multiple books and entire university courses about the problems of concurrency so maybe you need to sit down and study it a bit if you find yourself lost. It's very easy to introduce deadlocks and race conditions into concurrent C programs if you are not careful.
What concepts are there to solve these problems?
Just a few observations:
fork() only clones the memory of the process it executes at the time of execution. If you haven't opened or loaded your database at this stage, it won't be cloned into the child processes.
Shared memory - that is, memory mapped with mmap() and MAP_SHARED will be shared between processes and will not be duplicated.
The general term for communicating between processes is Interprocess communication of which there are several types and varieties, depending on your needs.
Aside On modern Linux systems, fork() implements copy-on-write copying of process memory. Actually, you won't end up with two copies of a process in memory - you'll end up with one copy that believes it has been copied twice. If you write to any of the memory, then it will be copied. This is an efficiency saving that makes use of the fact that the majority of processes alter only a small fraction of their memory as they run, so in fact even if you went for the copy the whole database approach, you might find the memory usage less that you expect - although of course that wouldn't fix your synchronisation problems!

Resources