parallel strlen? - c

I'm wondering if there would be any merit in trying to code a strlen function to find the \0 sequence in parallel. If so, what should such a function take into account? Thanks.

strlen() is sequential in spirit - reading even one byte past the null terminator is undefined behavior, and the null terminator can be anywhere - the first character or the one millionth - so you have to scan sequentially.

You'd have to make sure the NUL found by a thread is the first NUL in the string, which means the threads would need to synchronize on the lowest NUL location found so far. So while it could be done, the overhead of that synchronization would be far more expensive than any potential gain from parallelization.
Also, there's the issue of caching. A single thread can read a string contiguously, which is cache friendly. Multiple threads run the risk of stepping on each other's toes.

It would be possible on some parallel architectures, but only if one could guarantee that a substantial amount of the memory beyond the string could be accessed safely; it would only be practical if the strings were expected to be quite long and thread communication and synchronization were cheap.

For example, if one had sixteen processors and knew that one could safely access 256KB beyond the end of a string, one could start by dispatching the sixteen processors to handle sixteen 4K chunks. Each time a processor finished without finding a zero, it could either start on the next 4K chunk (if that chunk was within 256KB of the lowest chunk still in progress) or wait for the lowest processor to complete. In practice, unless strings were really huge, synchronization delays and wasted work would negate any gains from parallelism, but if one needed to find the length of a multi-megabyte string, the task could be done in parallel.

To parallelize tasks, you have to split input data and dispatch it to multiple threads. Without knowing the length of the string in advance, you cannot split up the data.
So you must know the allocated size of the input data (which is not necessarily identical to the string length) in advance; then it will work.
Your threads might find multiple NUL values. Your function can only know that the correct (first) NUL has been found once every thread processing data that comes before a found NUL has completed.
Say we have our string split up into 8 chunks (0-7). If we find a NUL value in chunk 3, we cannot know whether there are other NUL values in chunks 0-2, so we have to wait for those threads to finish, and we can immediately stop the threads working on chunks 4-7. If a NUL value is then found in chunk 1, we only have to wait for chunk 0 to complete to get a definitive answer. A sketch of this scheme is below.
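A minimal sketch of this chunked scan, assuming POSIX threads and that the whole allocation is safely readable; parallel_strlen, scan_chunk, and NCHUNKS are illustrative names, not a standard API:

    /* Chunked parallel strlen sketch. Assumes the allocated size of
     * the buffer (not the string length) is known in advance and the
     * entire allocation may be read safely. */
    #include <pthread.h>
    #include <string.h>

    #define NCHUNKS 8

    struct chunk {
        const char *base;   /* start of this chunk         */
        size_t      len;    /* bytes in this chunk         */
        size_t      found;  /* offset of first NUL, or len */
    };

    static void *scan_chunk(void *arg)
    {
        struct chunk *c = arg;
        const char *p = memchr(c->base, '\0', c->len);
        c->found = p ? (size_t)(p - c->base) : c->len;
        return NULL;
    }

    /* Returns the string length; 'alloc' is the allocated buffer size. */
    size_t parallel_strlen(const char *s, size_t alloc)
    {
        pthread_t tid[NCHUNKS];
        struct chunk ch[NCHUNKS];
        size_t per = alloc / NCHUNKS, off = 0;

        for (int i = 0; i < NCHUNKS; i++) {
            ch[i].base = s + off;
            ch[i].len  = (i == NCHUNKS - 1) ? alloc - off : per;
            off += ch[i].len;
            pthread_create(&tid[i], NULL, scan_chunk, &ch[i]);
        }

        /* Joining in chunk order implements the "wait for lower chunks"
         * rule: the first chunk reporting a NUL gives the answer,
         * because every earlier chunk has already been checked. */
        size_t base_off = 0;
        for (int i = 0; i < NCHUNKS; i++) {
            pthread_join(tid[i], NULL);
            if (ch[i].found < ch[i].len)
                return base_off + ch[i].found;
            base_off += ch[i].len;
        }
        return alloc; /* no NUL found within the allocation */
    }

This version simply joins all threads; a production version would also signal the higher-numbered threads to stop early once a NUL is found, as described above.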

You could use this on FIXED-WIDTH strings, but not much more than that.

It depends on the architecture. Nothing wrong with having multiple compute units hunt for that first null character, but you will have to keep them fed with a steady stream of data from memory. You will probably want to perform platform-specific tuning of the exact parameters, keeping cache boundaries in mind.
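One architecture-dependent flavor of this needs no threads at all: the word-at-a-time (SWAR) scan used by several C library implementations, which tests one machine word per iteration. A sketch assuming 64-bit words and an 8-byte-aligned pointer (real libraries also handle the alignment and strict-aliasing concerns this glosses over):

    #include <stdint.h>
    #include <stddef.h>

    /* Word-at-a-time strlen sketch: tests 8 bytes per iteration using
     * the classic "has-zero-byte" bit trick. Assumes s is 8-byte
     * aligned; reading up to the next 8-byte boundary past the NUL is
     * assumed tolerable. */
    size_t swar_strlen(const char *s)
    {
        const uint64_t *w = (const uint64_t *)s;
        uint64_t v;

        /* (v - 0x01..01) & ~v & 0x80..80 is nonzero iff v has a zero byte */
        for (;;) {
            v = *w;
            if ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL)
                break;
            w++;
        }

        const char *p = (const char *)w;
        while (*p) p++;          /* locate the exact NUL within this word */
        return (size_t)(p - s);
    }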

Related

Cache pre-fetching while traversing an array: what if some memory pages have been swapped out?

The biggest advantage of an array of, for example, int is that, if you read it sequentially, it can be effectively preloaded into cache, because the CPU checks the memory access pattern and pre-fetches the locations about to be read, so the "next" element of a vector is always in cache.
To what extent is that statement just "theory"? Thinking about timing, for it to be true the pre-fetcher must know how long it will take to bring the next cache line into cache (which implies knowing how "slow" the RAM is), and how much time is left before that data is next read by the CPU (which implies knowing how time-consuming the remaining instructions are), so that the first sequence of actions takes no longer than the second.
The specific case I have in mind: let's assume that the first 5 pages of a 10-page-long array are in RAM and the last 5 are in swap. If the next address the pre-fetcher wants to load is the first address of the first swapped-out page, that pre-loading time will be unpredictably long and the pre-fetcher will fail in its mission.
I know CPUs try to do a lot of guessing and speculation about the future of a process, such as cache pre-fetching, branch prediction, and probably a lot of other techniques I'm not aware of, some of them probably cooperating with the OS (and I'm eager to know more about this; it surprises me every time).
So, do CPUs and/or OSes try to solve that kind of timing-guessing problem, for example by trying to answer the pre-fetcher's question of: how much time do I need in advance for my speculative pre-fetching to cause zero delay?
So, do CPUs and/or OSes try to solve that kind of timing-guessing problem
No.
This is handled solely by the CPU hardware. If you read some memory location, the CPU simply brings the whole cache line containing that location into the cache.
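If you want to issue hints yourself rather than rely on the hardware, GCC and Clang expose the __builtin_prefetch builtin; a sketch of its use (the lookahead distance of 16 elements is an assumption you would tune per platform):

    #include <stddef.h>

    /* Sum an int array while hinting the prefetcher ~16 elements ahead.
     * __builtin_prefetch is a GCC/Clang builtin; the lookahead distance
     * is a guess that must be tuned per platform. */
    long sum_with_prefetch(const int *a, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
            total += a[i];
        }
        return total;
    }

Note that, on x86 at least, a prefetch hint to a non-resident page is simply dropped rather than faulting, so neither hardware nor software prefetching can hide the page-in cost in the swapped-out scenario described above.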

how can I ensure "nice" (or at least "ok") cache behavior with a volatile mutex'd array?

I have two threads which share a large array of data. One thread writes to it, and the other reads from it. Because the array cannot be in an "incompletely-updated" state when read, I mutex all array operations (reads/writes).
I also try to "play nicely with the cache"- when I read/write large amounts of data, I get one mutex, and read/write as much as required in sequence, then relinquish the mutex.
[edit: to clarify, this is the cache behavior I would like to preserve. If you read/write large swathes of memory in sequence, the cache can pull large lines of data in from memory (slow!) only once, then operate on the cache (fast!) without hitting memory again until that cache line is exhausted.]
One thing I would like to protect against is this: one thread writes to a small part of the array, and then another thread (after receiving the mutex) reads from that small part before it has been flushed out of the first thread/core's cache to memory, resulting in an outdated read. So the solution would be to mark the array as "volatile" (right?).
Am I correct to worry that marking the array as "volatile" will totally kill my ability to read/write large chunks in accordance with well-behaved cache use? Or will every read/write then go to/from memory?
In a perfect world, what I think I'd want is the ability to: 1. grab a mutex, 2. load data from memory (as though it were volatile), 3. read/write to the array (as though it weren't volatile - it should be safe to rely on my own cache because of the mutex), 4. (in the case of a write) flush any remaining cached writes to memory, 5. relinquish the mutex.
Can I accomplish this? Are there any glaring misunderstandings here on my part?
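For reference, steps 1-5 are essentially what a correctly used mutex already provides: POSIX guarantees that pthread_mutex_lock/unlock synchronize memory, so writes made under the lock are visible to the next thread that takes it, and volatile is neither necessary nor sufficient here. A minimal sketch of that standard pattern (the array size and function names are illustrative):

    #include <pthread.h>
    #include <string.h>

    #define N 4096

    static int data[N];                       /* shared array, not volatile */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Writer: update a large contiguous range under one lock acquisition. */
    void write_range(const int *src, size_t off, size_t len)
    {
        pthread_mutex_lock(&lock);            /* acquire: see prior writes   */
        memcpy(&data[off], src, len * sizeof *src);  /* cache-friendly copy  */
        pthread_mutex_unlock(&lock);          /* release: publish our writes */
    }

    /* Reader: same idea; the unlock in the writer guarantees we see a
     * fully updated range, with no explicit cache flushing needed. */
    void read_range(int *dst, size_t off, size_t len)
    {
        pthread_mutex_lock(&lock);
        memcpy(dst, &data[off], len * sizeof *dst);
        pthread_mutex_unlock(&lock);
    }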

Why does fread have thread-safety requirements that slow down its calls?

I am writing a function to read binary files that are organized as a succession of (key, value) pairs, where keys are small ASCII strings and values are int or double stored in binary format.
If implemented naively, this function makes a lot of calls to fread to read very small amounts of data (usually no more than 10 bytes). Even though fread internally uses a buffer, I implemented my own buffering and observed a speed-up by a factor of 10 on both Linux and Windows. The buffer size used by fread is large enough, and the function call overhead alone cannot be responsible for such a slowdown. So I dug into the GNU implementation of fread and discovered a lock on the file, among many other things, such as verifying that the file is open with read access. No wonder fread is so slow.
But what is the rationale behind fread being thread-safe? It seems that multiple threads can call fread on the same file, which is mind-boggling to me. These requirements make it slow as hell. What are the advantages?
Imagine you have a file where each 5 bytes can be processed in parallel (let's say, pixel by pixel in an image):
123456789A
One thread needs to pick 5 bytes "12345", the next one the next 5 bytes "6789A".
If it were not thread-safe, different threads could pick up the wrong chunks. For example: "12367" and "4589A", or even worse (unexpected behaviour, repeated bytes, or worse).
As suggested by nemequ:
Note that if you're on glibc you can use the _unlocked variants (*e.g., fread_unlocked). On Windows you can define _CRT_DISABLE_PERFCRIT_LOCKS
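A sketch combining those two suggestions for the batch-of-small-reads case, assuming glibc (flockfile/funlockfile are POSIX; fread_unlocked is a glibc extension; read_records is an illustrative name):

    #define _GNU_SOURCE
    #include <stdio.h>

    /* Read many small fixed-size records while paying for the stream
     * lock only once: take it explicitly, then use unlocked reads. */
    size_t read_records(FILE *f, unsigned char *buf, size_t reclen, size_t count)
    {
        size_t done = 0;

        flockfile(f);                  /* one lock for the whole batch */
        while (done < count &&
               fread_unlocked(buf + done * reclen, reclen, 1, f) == 1)
            done++;
        funlockfile(f);

        return done;                   /* number of records read */
    }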
Stream I/O is already as slow as molasses. Programmers think of a read from main memory (1000x longer than a CPU cycle) as taking ages; a read from the physical disk or a network may as well be an eternity.
I don't know if that's the #1 reason why the library implementers were ok with adding the lock overhead, but I guarantee it played a significant part.
Yes, it slows it down, but as you discovered, you can manually buffer the read and use your own handling to increase the speed when performance really matters. (That's the key--when you absolutely must read the data as fast as possible. Don't bother manually buffering in the general case.)
That's a rationalization. I'm sure you could think of more!

Interruptible in-place sorting algorithm

I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.
Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX
Any hints on accomplishing this, or notes in general?
Thanks!
To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.
Following up (2 years later): Just for posterity, I have installed the SIGINT handler and it worked great. This does not protect me against power failure, but that is a risk I can handle. Code at https://code.google.com/p/pawnsbfs/source/browse/trunk/hsort.c and https://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c
Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:
1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.
2) Copy A to the scratch space, record that you've done so, flush.
3) Copy B over A, then record in the scratch space that you have done so, flush
4) Copy from the scratch space over B.
5) Remove the record.
This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.
In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process; it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device-driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts, you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not: with a UPS, or if you think power cuts are rare enough that you don't mind losing data, that's fine.
On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.
Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.
Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.
Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.
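A sketch of the swap journal in steps 1-5 above, using POSIX file descriptors and taking "flush" in its strongest sense (fsync); the journal layout, field names, and helpers are illustrative, and error handling is omitted:

    #include <unistd.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>

    #define RECLEN 16  /* 128-bit records */

    /* Journal kept in a separate file: a phase marker, the two indices,
     * and a scratch copy of record A. */
    struct journal {
        uint64_t phase;            /* 0 = idle, 1 = A saved, 2 = B copied */
        uint64_t a, b;             /* indices being swapped               */
        unsigned char scratch[RECLEN];
    };

    static void put(int fd, off_t off, const void *p, size_t n)
    {
        pwrite(fd, p, n, off);
        fsync(fd);                 /* commit the change "far enough" */
    }

    /* Swap records a and b of data_fd, journaled so that a crash at any
     * point leaves enough information to finish or undo the swap. */
    void journaled_swap(int data_fd, int jrn_fd, uint64_t a, uint64_t b)
    {
        struct journal j = { .a = a, .b = b };
        unsigned char bufA[RECLEN], bufB[RECLEN];

        pread(data_fd, bufA, RECLEN, a * RECLEN);
        pread(data_fd, bufB, RECLEN, b * RECLEN);

        /* 1-2) record the intent and save a copy of A, then flush */
        j.phase = 1;
        memcpy(j.scratch, bufA, RECLEN);
        put(jrn_fd, 0, &j, sizeof j);

        /* 3) copy B over A, record progress, flush */
        put(data_fd, a * RECLEN, bufB, RECLEN);
        j.phase = 2;
        put(jrn_fd, 0, &j, sizeof j);

        /* 4) copy the saved A over B */
        put(data_fd, b * RECLEN, bufA, RECLEN);

        /* 5) mark the journal idle again */
        j.phase = 0;
        put(jrn_fd, 0, &j, sizeof j);
    }

On restart, phase 1 means redo the copy of B over A and then restore B from the scratch copy; phase 2 means only the scratch-over-B copy remains to be done.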
Install a handler for SIGINT that just sets a "process should exit soon" flag.
In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.
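A minimal sketch of that handler and loop, using sigaction and the async-signal-safe sig_atomic_t flag:

    #include <signal.h>
    #include <string.h>

    static volatile sig_atomic_t should_exit = 0;

    static void on_sigint(int sig)
    {
        (void)sig;
        should_exit = 1;   /* setting a sig_atomic_t flag is all we do here */
    }

    /* Illustrative sort loop: check the flag after every swap. */
    void sort_loop(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);

        while (!should_exit /* && not yet sorted */) {
            /* ... swap two records here ... */
        }
        /* finish any in-flight swap, flush, and exit gracefully */
    }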
The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);.
As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.
Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.
Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.
Back up whatever you plan to change. Then put a flag that marks a successful sort. If everything is OK, keep the result; otherwise restore the backup.
Assuming a 64-bit OS (you said it is a 64-bit machine, but it could still be running a 32-bit OS), you could use mmap to map the file to an array, then use qsort on the array.
Add a handler for SIGINT to call msync and munmap to allow the app to respond to Ctrl-C without losing data.
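A sketch of that approach for the ~40GB file of 128-bit records described above; the record type and comparator are assumptions (the real ordering depends on the record format), and error handling is abbreviated:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* A 128-bit record, compared here as raw bytes via memcmp;
     * substitute the comparison your record format actually needs. */
    typedef struct { unsigned char b[16]; } rec128;

    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, sizeof(rec128));
    }

    int sort_file(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0) return -1;

        struct stat st;
        fstat(fd, &st);
        size_t n = (size_t)st.st_size / sizeof(rec128);

        rec128 *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (recs == MAP_FAILED) { close(fd); return -1; }

        qsort(recs, n, sizeof(rec128), cmp_rec);

        msync(recs, st.st_size, MS_SYNC);   /* push dirty pages to the file */
        munmap(recs, st.st_size);
        close(fd);
        return 0;
    }

Note that an interruption mid-qsort leaves the file half-sorted (every record still present exactly once, but in no particular order), so this protects data, not progress.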

Thread safe char string in C

In C:
If I have 3 threads,
2 threads that are appending strings to a global char string (char*),
and 1 thread that is reading from that char string.
Let's say that the 2 threads are appending about 8 000 strings per second each and the 3rd thread is reading pretty often too.
Is there any possibility that they will append at exactly the same time and overwrite each other's data or read at the same time and get an incomplete string?
Yes, this will get corrupted pretty quickly.
You should protect access to this string with a mutex or read/write locking mechanism.
I'm not sure what platform you're on but have a look at the pthreads library if you're on a *nix platform.
I don't develop for Windows, so I can't point you at any threading capabilities there (though I know there are plenty of good threading APIs in Win32).
Edit
@OP: have you considered the memory implications of appending 8,000 strings (you don't state how large each string is) per second? You're going to run out of memory pretty quickly if you never remove data from your global string. You might want to consider bounding the size of this string somehow, and also setting up some kind of system to remove data from it (the reader thread would be the best place for this). If you're already doing that, then ignore the above. A sketch of the locking itself is below.
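As mentioned above, a minimal sketch of the mutex-protected append and read, using pthreads; the fixed capacity and function names are illustrative:

    #include <pthread.h>
    #include <string.h>

    #define CAP (1 << 20)               /* illustrative fixed bound */

    static char buf[CAP];
    static size_t used = 0;
    static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by the two appender threads. Returns 1 on success. */
    int append_str(const char *s)
    {
        size_t len = strlen(s);
        int ok = 0;

        pthread_mutex_lock(&buf_lock);
        if (used + len < CAP) {         /* leave room for the NUL */
            memcpy(buf + used, s, len + 1);
            used += len;
            ok = 1;
        }
        pthread_mutex_unlock(&buf_lock);
        return ok;
    }

    /* Called by the reader thread: copies out a consistent snapshot,
     * never a half-appended string. */
    size_t read_snapshot(char *dst, size_t cap)
    {
        pthread_mutex_lock(&buf_lock);
        size_t n = used < cap - 1 ? used : cap - 1;
        memcpy(dst, buf, n);
        dst[n] = '\0';
        pthread_mutex_unlock(&buf_lock);
        return n;
    }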
Is there any possibility that they will append at exactly the same time and overwrite each other's data or read at the same time and get an incomplete string?
When dealing with concurrency issues, you must always protect your data. You can never leave this kind of thing to chance. Even if there's a 0.1% chance of trouble, it will happen.