How does the fio benchmark tool perform sequential disk reads?

I use fio to test the read/write bandwidth of my disks.
Even for the sequential read test, I can have it run multiple threads.
What does it mean to run multiple threads in a sequential read test?
Does it perform multiple sequential reads? (Each thread is assigned a file offset from which to start its sequential scan.)
Do the threads share a file offset? (Each thread issues sequential reads against a single file offset that is shared by all of the threads.)
I tried to read fio's source code, but I couldn't really figure it out.
Can anyone give me an idea?

Sadly you didn't include a jobfile with your question and didn't say what platform you're running on. Here's a stab at answers:
Yes, it does multiple sequential reads, though wouldn't it have to do this even with a single thread?
No, each thread has its own offset, but unless you use the offset and size options they will all work inside the same "region".
On Linux fio actually defaults to using separate processes per job and each process has its own file descriptor (for ioengines that use files) for each file used. Further, some ioengines (e.g. libaio, pvsync but there are many others) use syscalls that take the offset you want to do the I/O at with the request itself so even if they do share a descriptor their offset is not impacted by others using the same descriptor.
There may be problems if you use the sync ioengine, ask fio to use threads rather than processes and have those threads work on the same file. That ioengine has to use lseek prior to doing its I/O so perhaps there's a chance for another thread's lseek to sneak in before the I/O is submitted. Note that the sync I/O engine is not the default one used with recent fio versions.
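To illustrate the difference between an offset that travels with the request and one stored in the descriptor, here is a minimal sketch (not fio's actual code). It assumes a POSIX system; the names `struct reader`, `scan_region` and `scan_in_two_halves` are made up for this example. Two threads share one descriptor, but each keeps a private offset and passes it to `pread()`, so neither can disturb the other the way an `lseek()` followed by `read()` could.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One sequential reader: shared descriptor, private start and length. */
struct reader {
    int fd;
    off_t start;
    size_t len;
    unsigned long sum;   /* sum of the bytes read, so the result is checkable */
    int failed;
};

static void *scan_region(void *arg)
{
    struct reader *r = arg;
    unsigned char buf[4096];
    off_t pos = r->start;        /* private offset, never stored in the fd */
    size_t left = r->len;

    while (left > 0) {
        size_t want = left < sizeof buf ? left : sizeof buf;
        /* pread() carries the offset in the call itself, so no lseek() is
         * needed and the descriptor's shared file position is untouched. */
        ssize_t got = pread(r->fd, buf, want, pos);
        if (got <= 0) {          /* error or unexpected end of file */
            r->failed = 1;
            break;
        }
        for (ssize_t i = 0; i < got; i++)
            r->sum += buf[i];
        pos += got;
        left -= (size_t)got;
    }
    return NULL;
}

/* Reads the first `size` bytes of `path` with two threads, each scanning one
 * half sequentially. Stores the byte sum in *total; returns 0 on success. */
int scan_in_two_halves(const char *path, size_t size, unsigned long *total)
{
    assert(path != NULL && total != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct reader r[2] = {
        { .fd = fd, .start = 0, .len = size / 2 },
        { .fd = fd, .start = (off_t)(size / 2), .len = size - size / 2 },
    };
    pthread_t tid[2];
    int started = 0;
    for (; started < 2; started++)
        if (pthread_create(&tid[started], NULL, scan_region, &r[started]) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    close(fd);

    if (started < 2 || r[0].failed || r[1].failed)
        return -1;
    *total = r[0].sum + r[1].sum;
    return 0;
}
```

Replacing the `pread()` call with `lseek(fd, pos, SEEK_SET)` followed by `read()` would reintroduce the window described above, because the file position would then be state shared by both threads.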
Perhaps the fio mailing list can say more?


Multiple threads on different cores reading same set of files

I have a multi threaded process, where each thread runs on one core. I am reading the same set of files from each of the threads and processing them. Will reading the same set of files by multiple threads affect the performance of the process?
Not necessarily, but there are a few factors to take into account.
When you open a file for READING you don't need to put a read lock on it.
That means multiple threads can be reading from the same file.
In fact, all threads in a process share the process's memory, so you can use that to your benefit by caching the whole set (or part of it, depending on its size) in process memory. That will reduce access time.
Otherwise, if we assume all files are on the same device, the problem is that reading multiple files simultaneously from one device is slow and, depending on the number of threads and the storage type, it can be noticeably slower.
Reading the same set of files from each thread may reduce the performance of the process, because I/O requests are normally costly and slow, and because the same read operation is repeated for each thread.
One possible solution to deal with this is having one thread dealing with the IO reads/writes and the rest processing the data, for example as a producer consumer.
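As a rough sketch of that producer-consumer layout (one of many possible designs, assuming POSIX threads): the calling thread is the only one that reads the file and feeds a small bounded queue, while worker threads take chunks off the queue and process them. All names here (`struct queue`, `process_file`, the chunk and queue sizes) are invented for the example, and summing the bytes is just a stand-in for real processing.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define CHUNK 4096
#define SLOTS 8
#define MAX_WORKERS 16

struct chunk { size_t len; unsigned char data[CHUNK]; };

struct queue {
    struct chunk slot[SLOTS];        /* bounded ring buffer */
    int head, tail, count;
    int done;                        /* set once the producer has finished */
    unsigned long total;             /* result accumulated by the workers */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

/* Producer: the only code that touches the file. */
static void read_and_enqueue(struct queue *q, FILE *in)
{
    struct chunk c;
    while ((c.len = fread(c.data, 1, CHUNK, in)) > 0) {
        pthread_mutex_lock(&q->lock);
        while (q->count == SLOTS)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->slot[q->tail] = c;
        q->tail = (q->tail + 1) % SLOTS;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }
    pthread_mutex_lock(&q->lock);
    q->done = 1;                     /* wake every worker so it can exit */
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer: dequeues a chunk, then processes it outside the lock. */
static void *worker_thread(void *arg)
{
    struct queue *q = arg;
    struct chunk c;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0) {         /* queue drained and producer finished */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        c = q->slot[q->head];
        q->head = (q->head + 1) % SLOTS;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);

        unsigned long sum = 0;       /* stand-in for the real processing */
        for (size_t i = 0; i < c.len; i++)
            sum += c.data[i];

        pthread_mutex_lock(&q->lock);
        q->total += sum;
        pthread_mutex_unlock(&q->lock);
    }
}

/* Reads `path` once and processes it with up to `nworkers` threads.
 * Read errors are treated like end of file in this sketch. */
int process_file(const char *path, int nworkers, unsigned long *total)
{
    assert(path && total && nworkers > 0 && nworkers <= MAX_WORKERS);
    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;
    struct queue q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                       .not_empty = PTHREAD_COND_INITIALIZER,
                       .not_full = PTHREAD_COND_INITIALIZER };
    pthread_t worker[MAX_WORKERS];
    int started = 0;
    while (started < nworkers &&
           pthread_create(&worker[started], NULL, worker_thread, &q) == 0)
        started++;
    if (started > 0)                 /* without workers the queue would fill up */
        read_and_enqueue(&q, in);
    for (int i = 0; i < started; i++)
        pthread_join(worker[i], NULL);
    fclose(in);
    *total = q.total;
    return started > 0 ? 0 : -1;
}
```

The file is read exactly once no matter how many workers there are, which is the point of the design; the cost is one extra copy of each chunk through the queue.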
You may consider Memory-Mapped Files for concurrent read access.
It will avoid overhead of copying data into every process address space.

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
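For reference, this is roughly what the aligned-buffer requirement and the fadvise approach look like on Linux. It is only a sketch: whether IBM i honours O_DIRECT or POSIX_FADV_DONTNEED in the same way is not something this example can confirm. The 4096-byte alignment is an assumption (the real value depends on the device and filesystem), and the helper names `alloc_dio_buffer`, `read_direct` and `drop_cached_pages` are made up.

```c
#define _GNU_SOURCE                  /* O_DIRECT is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0                   /* not available: reads stay buffered */
#endif

#define DIO_ALIGN 4096               /* assumed alignment, see note above */

/* Allocates a buffer whose address and length are multiples of DIO_ALIGN.
 * Stores the rounded-up length in *got; the caller frees the buffer. */
void *alloc_dio_buffer(size_t want, size_t *got)
{
    assert(got != NULL);
    size_t len = (want + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
    void *buf = NULL;
    if (len == 0 || posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    assert((uintptr_t)buf % DIO_ALIGN == 0);
    *got = len;
    return buf;
}

/* Reads `len` bytes at `offset`, bypassing the page cache where O_DIRECT is
 * supported. `buf`, `len` and `offset` must all be DIO_ALIGN-aligned.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_direct(const char *path, void *buf, size_t len, off_t offset)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;                   /* e.g. filesystem rejects O_DIRECT */
    ssize_t n = pread(fd, buf, len, offset);
    close(fd);
    return n;
}

/* Alternative without O_DIRECT: flush dirty pages first, then ask the
 * kernel to drop this file's cached pages. The request is advisory. */
int drop_cached_pages(int fd)
{
    if (fsync(fd) != 0)
        return -1;
    /* posix_fadvise() returns an error number rather than setting errno. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
}
```

Calling `posix_fadvise()` without the preceding `fsync()` tends to have little effect on freshly written data, because dirty pages cannot be dropped until they have been written back.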
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
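The memory-chewing loop described above might look something like this. It is a sketch under assumptions: the function names are invented, the stride of 256 comes from the description, and how much memory to allocate (and how many passes to run) depends entirely on the machine.

```c
#include <assert.h>
#include <stdlib.h>

#define STRIDE 256   /* "every 256th entry", as in the description above */

/* One pass: writes to every STRIDE-th int starting at index `phase`.
 * Returns the number of entries written. */
size_t touch_pass(int *mem, size_t n, size_t phase)
{
    assert(mem != NULL && phase < STRIDE);
    size_t writes = 0;
    for (size_t i = phase; i < n; i += STRIDE) {
        mem[i]++;
        writes++;
    }
    return writes;
}

/* Allocates `n` ints and runs `passes` passes over them, moving the starting
 * index one step forward each time. Returns the buffer (caller frees it),
 * or NULL if the allocation failed. */
int *chew_memory(size_t n, size_t passes)
{
    int *mem = calloc(n, sizeof *mem);
    if (mem == NULL)
        return NULL;
    for (size_t p = 0; p < passes; p++)
        touch_pass(mem, n, p % STRIDE);
    return mem;
}
```

With 4-byte ints a stride of 256 entries is 1 KiB, so on a system with 4 KiB pages every page is written several times per pass, which is what keeps the whole allocation resident. To keep pressure on the cache, the passes would be run in a long-lived loop rather than a fixed number of times.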
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess "none of them" is a slight exaggeration - I know which disks belong to SYSBASE, and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar with IBM i. In the picture below, you will see that the write requests are driving the % busyness up, but the read requests are not, even though they are targeting the same files.
...can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory-starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow, and use a busy machine so that other tasks are demanding disk I/O as well.

Reduce number of disk access while writing to file in C

I am writing a multi-threaded application and as of now I have this idea. I have a FILE*[n] where n is a number determined at runtime. I open all n files for reading, and then multiple threads can read from them. The computation on the data of each file is equivalent, i.e. if serial execution is assumed then each file will remain in memory for the same amount of time.
Each file can be arbitrarily large, so one should not assume that they can be loaded into memory.
In such a scenario I want to reduce the number of disk I/Os that occur. It would be great if someone could suggest a shared memory model for this scenario (I don't know if I am already using one, because I have very little idea of how things are implemented). I am not sure how I should achieve this. In other words, I just want to know the most efficient model for implementing such a scenario. I am using C.
EDIT: A more detailed scenario.
The actual problem is that I have n bloom filters for data contained in n files, and once all the elements from a file are inserted into the corresponding bloom filter I need to do membership testing. Since membership testing is a read-only process on the data files, I can read a file from multiple threads and the problem can be easily parallelized. The number of data files is fairly large (around 20k; note that the number of files equals the number of bloom filters), so I chose to spawn a thread per bloom filter, i.e. each bloom filter will have its own thread, which will read every other file one by one and test the membership of its data against the bloom filter. I want to minimize disk I/O in such a case.
At the start use the mmap() function to map the files into memory, instead of opening/reading FILE*'s. After that spawn the threads which read the files.
In that way the OS buffers the accesses in memory, only performing disk io when the cache becomes full.
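A minimal sketch of that approach, assuming POSIX `mmap()` and pthreads: the file is mapped once, read-only, and every thread scans the same mapping. The names (`struct scan`, `scan_mapped`) are invented, and counting occurrences of a byte is only a placeholder for the real membership test against a bloom filter.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_THREADS 64

struct scan {
    const unsigned char *data;   /* mapping shared by every thread */
    size_t len;
    unsigned char needle;        /* placeholder for a real membership test */
    size_t hits;
};

static void *count_byte(void *arg)
{
    struct scan *s = arg;
    for (size_t i = 0; i < s->len; i++)
        if (s->data[i] == s->needle)
            s->hits++;
    return NULL;
}

/* Maps `path` once and lets `nthreads` threads scan the same pages.
 * hits[i] receives the number of bytes equal to needles[i].
 * Returns 0 on success, -1 on error (including an empty file). */
int scan_mapped(const char *path, const unsigned char *needles,
                size_t *hits, int nthreads)
{
    assert(path && needles && hits);
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping outlives the descriptor */
    if (map == MAP_FAILED)
        return -1;

    struct scan job[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    int started = 0;
    for (; started < nthreads; started++) {
        job[started] = (struct scan){ map, len, needles[started], 0 };
        if (pthread_create(&tid[started], NULL, count_byte, &job[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        hits[i] = job[i].hits;
    }
    munmap(map, len);
    return started == nthreads ? 0 : -1;
}
```

Because all threads read through the same mapping, a page that one thread has faulted in is normally served from memory for the others, as long as it has not been evicted in the meantime.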
If your program is multi-threaded, all the threads are sharing memory unless you take steps to create thread-local storage. You don't need o/s shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that results files are only written once each.
How you do that depends on the processing you're doing.
If each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in memory, you can't necessarily do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory mapped I/O. You still want to keep the number of times that the data is accessed to a minimum. Similar concepts apply to the output files.

Simulating file system access

I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools, as my requirements are different. So to test the file system I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a process, i.e. it spawns a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename etc.) and selects the arguments to this operation from the list generated by ftw(). The thread performs a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread execution can overlap just as real processes do. As operations are performed by the threads, files get inserted, deleted and renamed, and this is reflected in the list of files.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the way it will spawn threads over a period of time. Should I be using some random delay to do this?
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
That sounds about right for a decent test case just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have each do an operation, wait a bit, then do another operation, and so on. IMO if you hit it hard with a lot of requests and it works, then there's a good chance your filesystem will do just fine. Take an example from PostMark, which does little more than append like crazy to different files, and from other benchmarks that do random-access reads/writes in different locations to make sure that the page has to be read from disk.
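The two suggestions above (a statistical distribution over operations and a delay between spawns) could be sketched as follows. The operation mix of 60/25/10/5 percent is purely hypothetical and would be replaced by figures from a real workload; the function names are likewise invented. `rand_r()` is used so that each simulated process can keep its own seed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>

enum fs_op { OP_READ, OP_WRITE, OP_RENAME, OP_DELETE, OP_COUNT };

/* Hypothetical workload mix in percent; the entries must sum to 100. */
static const int op_weight[OP_COUNT] = { 60, 25, 10, 5 };

/* Maps a roll in [0, 100) to an operation according to op_weight. */
enum fs_op op_for_roll(int roll)
{
    assert(roll >= 0 && roll < 100);
    int upper = 0;
    for (int op = 0; op < OP_COUNT; op++) {
        upper += op_weight[op];
        if (roll < upper)
            return (enum fs_op)op;
    }
    return OP_READ;   /* not reached while the weights sum to 100 */
}

/* Picks the next operation for one simulated process. Each thread passes
 * its own seed, so threads do not share random-number state. */
enum fs_op pick_op(unsigned *seed)
{
    assert(seed != NULL);
    return op_for_roll(rand_r(seed) % 100);
}

/* Delay before spawning the next simulated process, in microseconds,
 * drawn roughly uniformly from [min_us, max_us]. */
unsigned spawn_delay_us(unsigned *seed, unsigned min_us, unsigned max_us)
{
    assert(seed != NULL && min_us <= max_us);
    assert(max_us - min_us < (unsigned)RAND_MAX);
    return min_us + (unsigned)rand_r(seed) % (max_us - min_us + 1);
}
```

The simulator's main loop would then sleep for `spawn_delay_us(...)` microseconds (for example with `nanosleep()`) before creating each thread, and each thread would call `pick_op()` in a loop with its own seed.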

Is there a posix-way to ensure two files are flushed in sequence without blocking?

In my program, I hold two files open for writing: a content-file, containing chunks of data, and an index-file, containing a map of which chunks of data have been written so far.
I would like to flush them both to disk as efficiently as possible, with the only constraint that the blocks in the data-file must be written before the corresponding blocks in the map-file (naturally).
The catch is that I would like to avoid blocking, i.e. doing an fsync, for both latency and throughput reasons.
Any ideas?
I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible. This can be non-blocking. The second thread would then write any new data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and whether you need to report the write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to have the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
It could be worth looking at the source of any of the transactional databases to see whether this is encapsulated well enough to be useful to you. You could also investigate the sync option when you mount the file system for the content-file.
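The ordering step that the second thread would perform might look like the sketch below. It only shows the "write data, fsync, then write index" sequence; the queue between the main program and the writer thread is left out. The index record layout (`struct index_entry`) and the function names are hypothetical, not taken from the question.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>               /* open() flags for the caller */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical index record: where a chunk lives in the content-file. */
struct index_entry {
    uint64_t offset;
    uint64_t length;
};

/* Writes all `len` bytes at `off`, retrying after short writes. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Persists one chunk and then its index entry, in that order. Intended to
 * run in a dedicated writer thread so the producer never waits on fsync().
 * Returns 0 on success, -1 on error. */
int commit_chunk(int data_fd, int index_fd, uint64_t slot,
                 off_t data_off, const void *chunk, size_t len)
{
    assert(chunk != NULL && data_off >= 0);

    if (write_all(data_fd, chunk, len, data_off) != 0)
        return -1;
    /* Barrier: the chunk must be durable before the index mentions it. */
    if (fsync(data_fd) != 0)
        return -1;

    struct index_entry e = { (uint64_t)data_off, (uint64_t)len };
    if (write_all(index_fd, &e, sizeof e, (off_t)(slot * sizeof e)) != 0)
        return -1;
    /* Flushing the index can be deferred or batched; done eagerly here. */
    return fsync(index_fd) == 0 ? 0 : -1;
}
```

A writer thread could batch several chunks before the first `fsync()` and then write all of their index entries, which keeps the ordering guarantee while paying for fewer flushes.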