Multiple threads on different cores reading same set of files

Multiple threads on different cores reading same set of files - c

I have a multi threaded process, where each thread runs on one core. I am reading the same set of files from each of the threads and processing them. Will reading the same set of files by multiple threads affect the performance of the process?

Not necessarily, but there are a few factors to be taken on account.
When you open a file for READING you don't need to put a read lock on it.
That means multiple threads can be reading from the same file.
In fact all threads from a process share the process memory, so you can use that for your benefit by caching the whole set (or part of it, depending on the size) on the process memory. That will reduce access time.
Otherwise if we assume all files are in the same device, the problem is that reading multiple files simultaneously from the same device at the same time is slow and, depending on the number of threads and the storage type it can be noticeably slower

Reading the same set of files from each different thread may reduce the performance of the process, because the IO request are normally costly and slow, in addition to being repeating the same read operation for each difference thread.
One possible solution to deal with this is having one thread dealing with the IO reads/writes and the rest processing the data, for example as a producer consumer.

You may consider Memory-Mapped Files for concurrent read access.
It will avoid overhead of copying data into every process address space.

Related

Resources associated to an aio_context

The semantics of Linux's Asynchronous file IO (AIO) is well described in the man page of io_setup(2), io_submit(2) and io_getevents(2).
However, without diving in the block IO subsystem, the operational side of the implementation is a little less clear.
An aio_context allocates a queue for sending back io_events to a specific client in user-space. But is there more to it ?
Let be a file read sequentially chunks by chunks. Can requests, especially in Direct IO (DIO), be collated ? What if requests for two files are interleaved into one aio_context ? What if requests for one file are sent to two different aio_contexts ?
How requests are prioritized and scheduled in the above cases, with one or multiple aio_contexts ?
Is it possible that requests from two aio_contexts get interleaved at some point ? (Occasioning more seek latencies than intended.)
Does the thread or the CPU calling io_submit influence how it is scheduled ? Is the NUMA node containing the target buffer taken into consideration ?
More broadly, to which hardware resources (NUMA nodes, CPU cores, physical drives, file-systems and files) aio_contexts should be assigned, and at which level of granularity ?
Maybe it doesn't really matter and aio_contexts are no more than an abstraction for user-space programs.
I'm asking since I have observed a performance decrease when concurrently reading multiples files, each with it's own aio_context, compared to a manual Round-robin serialization of chunks requests into a single aio_context.

You can mix requests freely in a single context and I would do so. Otherwise you have to poll two separate contexts doubling the number of syscalls.
Requests to a context are passed to the kernels async IO VFS layer. Multiple files, multiple contexts, multiple processes or users doing the requests it all ends up in the same layer. The VFS layer then sends the requests to the relevant filesystems or block devices and all the usual collation and such happens naturally.
Requests to the same file to one or more context at the same time I think are undefined behavior if they overlap. They could be ordered one way or the other. The later request could be processed first for example. So you need to write your own synchronization if strict ordering is required. Same as one or more threads doing read/write calls in parallel.
Prioritization and scheduling will depend on the lower layers. Afaik block devices will reorder requests so they happen in increasing block numbers (elevator code) to minimize seek times on rotating disks.
Yes, requests from different contexts and normal read/write calls will get interleaved.
I think the requesting process and NUMA and such is completely ignored.
Note: When dealing with files make sure the filesystem supports the linux async IO hooks and you might need to use O_DIRECT on open() with all it's consequences.
A way to simply test this I found is to make lots of requests to a file in one io_submit() call and then check if the all finish simultaneously. If the filesystem falls back to sync IO then everything submitted will finish at the same time.

How fio benchmark tool performs sequential disk reads?

I use fio to test read/write bandwidth of my disks.
Even for the sequential read test, I can let it run the multiple threads.
What does it mean by running multiple threads on sequential read test?
Does it perform multiple sequential reads? (each thread is assigned a file offset to start the sequential scanning from)
Do the multiple threads share a file offset? (Each thread invokes sequential reads using a single file offset that is shared by the multiple threads)
I tried to read the open source codes of fio, but I couldn't really figure it out.
Can any one give me an idea?

Sadly you didn't include a jobfile with your question and didn't say what platform you're running on. Here's a stab at answers:
Yes it does multiple sequential reads though wouldn't it have to do this even with a single thread?
No each thread has its own offset but unless you use offset and size they will all work inside the same "region".
On Linux fio actually defaults to using separate processes per job and each process has its own file descriptor (for ioengines that use files) for each file used. Further, some ioengines (e.g. libaio, pvsync but there are many others) use syscalls that take the offset you want to do the I/O at with the request itself so even if they do share a descriptor their offset is not impacted by others using the same descriptor.
There may be problems if you use the sync ioengine, ask fio to use threads rather than processes and have those threads work on the same file. That ioengine has to use lseek prior to doing its I/O so perhaps there's a chance for another thread's lseek to sneak in before the I/O is submitted. Note that the sync I/O engine is not the default one used with recent fio versions.
Perhaps the fio mailing list can say more?

Alternative to reduce large number of binary files reading access time from hard disk

In my first prototype of application, I have to read around 400,000 files (each 4KB file, around total 1.5 GB data) from hard disk sequentially, and do some operation over the data read from each files, and store the results over RAM. Through this mechanism, I were first accessing I/O for a file and then utilizing CPU for operation, and keep going for another file, but it was very slow process.
To work around, now we first read all the files, and stored all the files data in the RAM, and now doing operation (utilizing CPU). It gave significant improvement.
But in my second phase of development, I have to read 20 GB of data, which now I cannot store in RAM. And, single reading operation with CPU utilization is very time consuming operation.
Can someone please suggest some method to work around this problem?
I am developing this application on Windows in C, with Visual Studio compiler.

There's a technique called Asynchronous I/O (AIO) that lets you keep doing some processing with the CPU while a file is read in the background. You can use this to read the next few files at the same time as you're processing a file.
The various AIO calls are OS-specific. On Windows, Microsoft call it "Overlapped I/O". See this Wikipedia page or this MSDN page for more info.

To work around, now we first read all the files, and stored all the files data in the RAM, and now doing operation (utilizing CPU).
(Assuming files can be processed independently...)
You are half-way there. Instead of waiting until all files have been loaded to RAM, start processing as soon as any file is loaded. That would be a form of pipelining.
You'll need three components:
A thread1 that reads files ("producer").
A thread2 that processes the files ("consumer").
A message queue3 between them.
The producer reads the files the way you are already doing it, but instead of processing them, just enqueues them to the message queue. The consumer thread waits until it can dequeue the file from the queue, processes it, and then immediately frees the memory that has been occupied by the file and resumes waiting to the queue.
In case you can process files by sequentially traversing them start-to-finish, you could even devise a more fine-grained "streaming", where files wold be both read and processed in chunks, which could lower the peak memory consumption even more (e.g. if you have some extra-large files that would no longer need to be kept whole in the memory).
1 Or a set of threads to parallelize the I/O, if you anticipate reading from multiple physical disks.
2 Or a set of threads to saturate the CPU cores, if processing the file is not cheaper than reading it.
3 You don't need a fancy persistent distributed message queue for that. Just a
straight in-memory queue, a-la BlockingCollection in .NET (I'm sure you'll find something similar for pure C).

Create threads (in loop) which will read files into RAM.
Work with the data in RAM in separate thread[s] and free RAM after processing.
Keep limits and a poll of records about files (read and processed) in the shared object protected by mutex.
Use semaphore for resources (files in RAM) production/utilisation synchronisation.

Reduce number of disk access while writing to file in C

I am writing a multi-threaded application and as of now I have this idea. I have a FILE*[n] where n is a number determined at runtime. I open all the n files for reading and then multiple threads can access to read it. The computation on the data of each file is equivalent i.e. if serial execution is supposed then each file will remain in memory for the same time.
Each files can be arbitrarily large so on should not assume that they can be loaded in memory.
Now in such a scenario I want to reduce the number of disk IO's that occur. It would be great if someone can suggest any shared memory model for such scenario (I don't know if I am using one because I have very less idea of how things are implemented) .I am not sure how should I achieve this. In other words i just want to know what is the most efficient model to implement such a scenario. I am using C.
EDIT: A more detailed scenario.
The actual problem is I have n bloom filters for data contained in n files and once all the elements from a file are inserted in the corresponding bloom filter I need to need to do membership testing. Since membership testing is a read-only process on data file I can read file from multiple threads and this problem can be easily parallelized. Now the number of files having data are fairly large(around 20k and note that number of files equals number of bloom filter) so I choose to spawn a thread for testing against a bloom-filter i.e. each bloom filter will have its own thread and that will read every other file one by one and test the membership of data against the bloom filter. I wan to minimize disk IO in such a case.

At the start use the mmap() function to map the files into memory, instead of opening/reading FILE*'s. After that spawn the threads which read the files.
In that way the OS buffers the accesses in memory, only performing disk io when the cache becomes full.

If your program is multi-threaded, all the threads are sharing memory unless you take steps to create thread-local storage. You don't need o/s shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that results files are only written once each.
How you do that depends on the processing you're doing.
f each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in memory, you can't necessarily do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory mapped I/O. You still want to keep the number of times that the data is accessed to a minimum. Similar concepts apply to the output files.

Is there a posix-way to ensure two files are flushed in sequence without blocking?

In my program, I hold two files open for writing, a content-file, containing chunks of data, and an index-file, containing a map over which chunks of data has been written so far.
I would like to flush them both to disc, as performant as possible, with the only constraint that the blocks in the data-file must be written before the corresponding blocks in the map-file (naturally).
The catch is that I would like to avoid blocking I.E. doing an fsync, both for latency and throughput-reasons.
Any ideas?

I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible. This can be non-blocking. The second thread would then write any new data to the data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and if you need to report the write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to have the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
It could be worth looking to see if this is well encapsulated so as to be useful to you in the source of any of the transactional databases. You could also investigate the sync option when you mount the file system for the content-file.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight