I have a list of strings that I want to output to different files according to a key for each entry (this key is present in the list, so if the key in a given node is 1 the string needs to be written to file 1.txt, if the key is 2 the output should be redirected to 2.txt, and so on...).
What I was thinking is to assign each list member a unique key, which makes it a unique record, and then spawn multiple threads depending on the number of processors available in the system. Each thread redirects the output of a node from a pool of nodes (that is, my list) to the concerned file. I was skeptical whether this is a good design for batch processing, or whether I should just have one thread do the whole output.
ps - Before I get bashed or anything let me tell you I am just a curious learner.
Make it single-threaded. Then run it and find out what your bottleneck is. If you find that your bottleneck is CPU and not disk I/O, then enable parallel processing.
As I understand it, your processing steps are:
select file by the key
write item to file
I don't think this is a case where parallel processing will improve performance. If you want to speed up this code, use buffering and asynchronous I/O:
for each file, maintain a flag: write-in-progress
when you want to write something to a file, check this flag
if write-in-progress is False:
set write-in-progress = True
add your item to the buffer
start writing this buffer to the file asynchronously
if write-in-progress is True:
add your item to the buffer
when the pending asynchronous operation completes:
check if there is a nonempty buffer; if so, start an async write
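A minimal C sketch of this buffered asynchronous scheme, assuming POSIX AIO (aio_write) is available; the file_writer type and function names are illustrative, and completion detection (e.g. polling aio_error) is left out:

#include <aio.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int fd;                    /* file descriptor, assumed opened with O_WRONLY|O_APPEND */
    int write_in_progress;     /* the flag from the steps above */
    char *pending;             /* items collected while a write is in flight */
    size_t pending_len;
    char *in_flight;           /* buffer currently handed to the kernel */
    struct aiocb cb;           /* POSIX AIO control block */
    pthread_mutex_t lock;
} file_writer;

/* Hand the accumulated buffer to the kernel and reset the pending buffer. */
static void start_async_write(file_writer *w) {
    w->in_flight = w->pending;
    memset(&w->cb, 0, sizeof w->cb);
    w->cb.aio_fildes = w->fd;
    w->cb.aio_buf = w->in_flight;
    w->cb.aio_nbytes = w->pending_len;
    w->pending = NULL;
    w->pending_len = 0;
    w->write_in_progress = 1;
    aio_write(&w->cb);         /* returns immediately; the write happens in the background */
}

/* Producer side: buffer the item and start a write if none is in flight. */
void writer_add_item(file_writer *w, const char *item, size_t len) {
    pthread_mutex_lock(&w->lock);
    w->pending = realloc(w->pending, w->pending_len + len);
    memcpy(w->pending + w->pending_len, item, len);
    w->pending_len += len;
    if (!w->write_in_progress)
        start_async_write(w);
    pthread_mutex_unlock(&w->lock);
}

/* Completion side (e.g. after aio_error(&w->cb) returns 0): release the
 * finished buffer and start the next write if items queued up meanwhile. */
void writer_on_complete(file_writer *w) {
    pthread_mutex_lock(&w->lock);
    free(w->in_flight);
    w->in_flight = NULL;
    w->write_in_progress = 0;
    if (w->pending_len > 0)
        start_async_write(w);
    pthread_mutex_unlock(&w->lock);
}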
There is a simpler approach: use buffering and synchronous I/O. It will be slower than the asynchronous approach described above, but not by much. You can start several threads and traverse the list in each thread independently. Each thread must process only some unique set of keys. For example, you can use two threads: the first thread writes only items with odd keys, the second only items with even keys.
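A rough sketch of that key-partitioning idea; the node type is illustrative, and the per-file buffering is simplified away (real code would keep files open and buffer writes):

#include <pthread.h>
#include <stdio.h>

struct node {
    int key;                 /* item goes to "<key>.txt" */
    const char *text;
    struct node *next;
};

struct worker_arg {
    struct node *head;       /* shared, read-only list */
    int remainder;           /* 0 = even keys, 1 = odd keys */
};

void *write_partition(void *p) {
    struct worker_arg *arg = p;
    char path[64];
    for (struct node *n = arg->head; n; n = n->next) {
        if ((n->key & 1) != arg->remainder)
            continue;                          /* not this thread's keys */
        snprintf(path, sizeof path, "%d.txt", n->key);
        FILE *f = fopen(path, "a");            /* simplified: open/close per item */
        if (f) {
            fprintf(f, "%s\n", n->text);
            fclose(f);
        }
    }
    return NULL;
}

Starting one thread with remainder 0 and another with remainder 1 covers all keys, and no two threads ever write to the same file.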
You need a concurrency model for that - however serious it sounds :)
First analyze what can be done at the same time and is unrelated to each other. Imagine each step of your program being executed on a different machine, with some form of communication between them, e.g. an IP network.
Then draw a flow between these instances (actions/machines). Mark what resources actions need to perform, e.g. a list, a file. Mark resources as separate instances (same as actions and machines).
Put the file system in your picture to see whether writing separate files can actually be sped up, or whether it all ends up in the same file system and is thus serialized again.
Connect the instances. And see if you get any benefit. It could look like that:
list
|
list reader
/ \ \
/ \ ----------\
file file file
writer writer writer
| | |
file 1 file 2 file 3
\ / |
\ / |
file system 1 file system 2
In the example you can see that it may make sense to get some parallel execution.
TLDR
Is there a clean way to handle 1 to 65535 files throughout an entire program without allocating global variables, most of which may never be used, and without using linked lists? (mingw-w64 on Windows)
Long Story
I have a TCP server which receives data from a lot of clients (up to 65535) and saves it in a kind of database. The "database" is a directory/file structure which looks like this: data\%ADDR%\%ADDR%-%DATATYPE%-%UTCTIME%.wwss where %ADDR% is the address, %DATATYPE% is the type of data and %UTCTIME% is the UTC time in seconds when the first data packet arrived on this socket. So every time a new connection is accepted, it should create this file as specified.
How do I handle 65535 FILE handles correctly? First thought: Global variable.
FILE * PV_WWSS_FileHandles[0x10000];
//...
void tcpaccepted(uint16_t u16addr, uint16_t u16dataType, int64_t s64utc) {
    char cPath[MAX_PATH];
    snprintf(cPath, MAX_PATH, "c:\\%05u\\%05u-%04x-%I64d.wwss", u16addr, u16addr, u16dataType, s64utc);
    PV_WWSS_FileHandles[u16addr] = fopen(cPath, "wb+");
}
This seems very lazy: it will likely never happen that all addresses are connected at the same time, so it allocates memory that is never used.
Second thought: create a linked list which stores the handles. The bad thing here is that it could be quite CPU intensive, because I want to do this in a multithreaded environment, and when e.g. 400 threads receive new data at the same time they all have to go through the entire list to find their FILE handle.
You really should look at other people's code. Apache comes to mind. Let's assume you can open 2^16 file handles on your machine. That's a matter of tuning.
Now... consider first what a file handle is. It's generally a construct of your C standard library... which is keeping an array (the file handle is the index to that array) of open files. You're probably going to want to keep an array, too, if you want to keep other information on those handles.
If you're concerned about the resources you're occupying, consider that each open network filehandle causes the OS to keep a 4k or 8k (it's configurable) buffer x2 (in and out) along with the file handle structure. That's easily a gigabyte of memory in use at the OS level.
When you do your equivalent of select(), if your OS is smart, you'll get the filehandle back --- so you can use that to index your array of "what to do" for that file handle. If your select() is not smart, you'll have to check every open filehandle ... which would make any attempt at performance a laugh.
I said "look at other people's solutions." I mean it. The original Apache used one file handle per process (effectively). When select()s were dumb, this was a good strategy. Bad in that, typically, dumb OSes would wake too many processes --- but that was circa 1999. These days Apache defaults to its hybrid MPM model... which is a hybrid of multi-threading and multi-tasking. It services a certain number of clients per process (threads) and has multiple processes. This keeps the number of files per process more reasonable.
If you go back further, for simplicity, there's the inetd approach. Fork one (say) ftp process per connect. The world's largest ftp server (ftp.freebsd.org) ran that way for many years.
Do not store file handles in files (silly). Do not store file handles in linked lists (your most popular code route will kill you). Take advantage of the fact that file handles are small integers and use an array. realloc() can help here.
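A minimal sketch of that array-plus-realloc() approach, indexed by the 16-bit address; the names and the growth policy are illustrative:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Growable table of FILE* indexed by address; only as many slots as the
 * highest address seen so far are allocated. */
static FILE **g_handles = NULL;
static size_t g_capacity = 0;

/* Grow the table on demand so unused address ranges cost nothing. */
static int ensure_capacity(size_t idx) {
    if (idx < g_capacity)
        return 0;
    size_t new_cap = g_capacity ? g_capacity : 64;
    while (new_cap <= idx)
        new_cap *= 2;
    FILE **tmp = realloc(g_handles, new_cap * sizeof *tmp);
    if (!tmp)
        return -1;
    for (size_t i = g_capacity; i < new_cap; i++)
        tmp[i] = NULL;                 /* zero the newly added slots */
    g_handles = tmp;
    g_capacity = new_cap;
    return 0;
}

FILE *handle_for(uint16_t u16addr) {
    if (ensure_capacity(u16addr) != 0)
        return NULL;
    return g_handles[u16addr];         /* NULL means not open yet */
}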
Heh... I see other FreeBSD people have chipped in ... in the comments. Anyways... look up FreeBSD and kqueue() if you're going to try keeping that many things open in one process.
I'm developing a small piece of software in C that reads and writes messages on a notice board. Every message is a .txt file named with a progressive number.
The software is multithreading, with many users that can do concurrent operations.
The operations that a user can do are:
Read the whole notice-board (concatenation of all the .txt file contents)
Add a message (add a file named "id_max++.txt")
Remove a message. When a message is removed there will be a hole in that number (e.g, "1.txt", "2.txt", "4.txt") that will never be filled up.
Now, I'd like to know if there is some I/O problem (*) that I should manage (and how), or whether the OS (Unix-like) does it all by itself.
(*) such as 2 users that want to read and delete the same file
As you are on a Unix-like system, the OS will take care of deleting a file while it is open in another thread: the directory entry is removed immediately, and the file itself (inode) is deleted on the last close.
The only problem I can see is between the directory scan and the opening of a file: a race condition could mean the file has already been deleted by the time you open it.
IMHO you simply must treat a "file does not exist" error as normal, and simply go on to the next file.
What you describe is not really bad, since it is analogous to MH folders for mail, and it can be accessed by many different processes, even if locking is involved. But depending on the load and on the size of the messages, you could consider using a database. Rule of thumb (my opinion):
few concurrent accesses and big files: keep using the file system
many accesses and small files (a few kB max.): use a database
Of course, you must use a mutex-protected routine to find the next number when creating a new message (credit should be given to #merlin2011 for noticing the problem).
You said in a comment that your specs do not allow a database. By analogy with mail handling, you could also use a single file (like the traditional mail format):
one single file
each message is preceded with a fixed size header saying whether it is active or deleted
read access need not be synchronized
write accesses must be synchronized
It would be a poor man's database where all synchronization is done by hand, but you have only one file descriptor per thread and save all open and close operations. It makes sense where there are many reads and few writes or deletes.
A possible improvement would be (still like mail readers do) to build an index with the offset and status of each message. The index could be on disk or in memory depending on your requirements.
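For illustration, a sketch of what such a fixed-size header and an in-memory index entry might look like; the field names and sizes are assumptions:

#include <stdint.h>

/* Fixed-size record header stored in front of each message in the single file. */
struct msg_header {
    uint32_t length;     /* size of the message body in bytes */
    uint8_t  deleted;    /* 0 = active, 1 = logically deleted */
    uint8_t  reserved[3];
};

/* In-memory index entry: where the message starts and whether it is live. */
struct msg_index_entry {
    long    offset;      /* byte offset of the header within the file */
    uint8_t deleted;
};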
The easier solution is to use a database like SQLite or MySQL, both of which provide transactions that you can use to achieve consistency. If you still want to go down the file-based route, read on.
The issue is not an I/O problem; it's a concurrency problem if you do not implement proper monitors. Consider the following scenario (it is not the only problematic one, but it is one example).
User 1 reads the maximum id and stores it in a local variable.
Meanwhile, User 2 reads the same maximum id and stores it in a local variable also.
User 1 writes first, and then User 2 overwrites what User 1 just wrote, because it had the same idea of what the maximum id was.
This particular scenario can be solved by keeping the current maximum id as a variable that is initialized when the program is initialized, and protecting the get_and_increment operation with a lock. However, this is not the only problematic scenario that you will need to reason through if you go with this approach.
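A minimal sketch of that protected counter; the names are illustrative, and the initial value would come from scanning the directory at startup:

#include <pthread.h>

static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
static long current_max_id = 0;   /* initialized from the directory scan at program start */

long get_and_increment(void) {
    pthread_mutex_lock(&id_lock);
    long id = ++current_max_id;   /* reserve the next message number atomically */
    pthread_mutex_unlock(&id_lock);
    return id;
}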
I have an algorithm that takes 7 days to run to completion (and a few more algorithms too).
Problem: In order to successfully run the program, I need a continuous power supply. If, out of luck, there is a power loss in the middle, I need to restart it all over again.
So I would like to ask for a way to make my program execute in phases (say each phase generates results A, B, C, ...) so that, in case of a power loss, I can somehow use these intermediate results and continue/resume the run from that point.
Problem 2: How do I prevent a file from being reopened every time a loop iterates? (fopen was placed in a loop that runs nearly a million times; this was needed as the file is being changed with each iteration.)
You can separate it into several source files and use make.
When each result phase is complete, branch off to a new universe. If the power fails in the new universe, destroy it and travel back in time to the point at which you branched. Repeat until all phases are finished, and then merge your results into the original universe via a transcendental wormhole.
Well, couple of options, I guess:
You split your algorithm along sensible lines, with a defined output from each phase that can be the input to the next phase. Then configure your algorithm as a workflow (ideally soft-configured through some declaration file).
You add logic to your algorithm by which it knows what it has successfully completed (committed). Then, on failure, you can restart the algorithm; it bins all uncommitted data and restarts from the last commit point.
Note that both these options may draw out your 7-day run time even further!
So, to improve the overall runtime, could you also separate your algorithm so that it has "worker" components that can work on "jobs" in parallel? This usually means pulling out some "dumb" but intensive logic (such as a computation) that can be parameterised. Then you have the option of running your algorithm on a grid/space/cloud/whatever, which at least gives you options to reduce the run time. It doesn't even need to be a space... just use queues (IBM MQ Series has a C interface) and have listeners on other boxes listening to your jobs queue, processing them, and persisting the results. You can still phase the algorithm as discussed above too.
Problem 2: Opening the file on each iteration of the loop because it's changed
I may not be the best qualified to answer this, but doing fopen (and fclose) on each iteration seems wasteful and slow. To answer, or to have anyone more qualified answer, I think we'd need to know more about your data.
For instance:
Is it text or binary?
Are you processing records or a stream of text? That is, is it a file of records or a stream of data? (you aren't cracking genes are you? :-)
I ask because, judging by your comment "because it's changed each iteration", you might be better off using a random-access file. By this, I'm guessing you're re-opening the file to fseek to a point that you may have already passed (in your stream of data) and making a change. However, if you open the file in binary mode, you can seek anywhere in it using fseek or fsetpos. That is, you can "seek" backwards.
Additionally, if your data is record-based or somehow organised, you could also create an index for it. With this, you could use fsetpos to set the position to the record you're interested in and traverse from there, saving time in finding the area of data to change. You could even persist your index in an accompanying index file.
Note that you can write plain text to a binary file. Perhaps worth investigating?
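For illustration, a sketch of opening the file once outside the loop and updating fixed-size records in place; the record size and the modification itself are placeholders:

#include <stdio.h>

#define RECORD_SIZE 64

int update_records(const char *path, long n_iterations) {
    FILE *f = fopen(path, "rb+");        /* open once, read/write, binary */
    if (!f) return -1;
    for (long i = 0; i < n_iterations; i++) {
        long record = i % 1000;          /* whichever record this iteration touches */
        char buf[RECORD_SIZE];
        fseek(f, record * RECORD_SIZE, SEEK_SET);
        if (fread(buf, 1, RECORD_SIZE, f) != RECORD_SIZE) break;
        buf[0] ^= 1;                     /* placeholder for the real modification */
        fseek(f, record * RECORD_SIZE, SEEK_SET);   /* reposition before writing */
        fwrite(buf, 1, RECORD_SIZE, f);
        fflush(f);                       /* required between a write and a later read */
    }
    fclose(f);
    return 0;
}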
Sounds like a classic batch-processing problem to me.
You will need to define checkpoints in your application and store the intermediate data until a checkpoint is reached.
Checkpoints could be the row number in a database, or the position inside a file.
Your processing might take longer than now, but it will be more reliable.
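As a rough illustration of such a checkpoint, assuming the work can be split into numbered phases; the file name, phase count, and format are placeholders:

#include <stdio.h>

/* Return the last completed phase, or 0 if there is no usable checkpoint. */
int load_checkpoint(void) {
    int phase = 0;
    FILE *f = fopen("checkpoint.dat", "rb");
    if (f) {
        if (fread(&phase, sizeof phase, 1, f) != 1)
            phase = 0;                   /* corrupt/partial checkpoint: start over */
        fclose(f);
    }
    return phase;
}

/* Write the checkpoint to a temp file, then atomically replace the old one. */
void save_checkpoint(int completed_phase) {
    FILE *f = fopen("checkpoint.tmp", "wb");
    if (!f) return;
    fwrite(&completed_phase, sizeof completed_phase, 1, f);
    fclose(f);
    rename("checkpoint.tmp", "checkpoint.dat");   /* atomic replace on POSIX */
}

int main(void) {
    int phase = load_checkpoint();
    for (int p = phase + 1; p <= 10; p++) {       /* 10 phases assumed */
        /* run_phase(p); ... write this phase's intermediate results ... */
        save_checkpoint(p);
    }
    return 0;
}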
In general you should think about the bottleneck in your algo.
For problem 2, you must use two files; your application might be days faster if you call fopen a million times less...
I've written two relatively small programs in C. Both of them communicate with each other using textual data. Program A generates some problems from a given input, B evaluates them and creates input for another iteration of A.
Here's a bash script that I currently use:
for i in {1..1000}
do
./A data > data2;
./B data2 > data;
done
The problem is that since what A and B do is not very time consuming, most of the time is spent (as I suppose) in starting the apps up. When I measure the time the script takes to run I get:
$ time ./bash.sh
real 0m10.304s
user 0m4.010s
sys 0m0.113s
So my main question is: is there any way to communicate data between those two apps faster? I don't want to integrate them into one application, because I'm trying to build a toolset with independent, easily communicating tools (as was suggested in "The Art of Unix Programming", from which I'm learning how to write reusable software).
PS. The data and data2 files contain sets of data that each application needs in whole, all at once (so communicating, e.g., one line of data at a time is impossible).
Thanks for any suggestions.
cheers,
kajman
Can you create a named pipe?
mkfifo data1
mkfifo data2
./A data1 > data2 &
./B data2 > data1
If your application is reading and writing in a loop, this could work :)
If you used pipes to transfer the stdout of program A to the stdin of program B you would remove the need to write the file "data2" each loop.
./A data1 | ./B > data1
Program B would need to have the capability of using input from stdin rather than a specified file.
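For instance, a minimal sketch of B falling back to stdin when no file argument is given; the actual evaluation step is a placeholder:

#include <stdio.h>

int main(int argc, char **argv) {
    FILE *in = stdin;                     /* default: read from the pipe */
    if (argc > 1) {
        in = fopen(argv[1], "r");         /* still works when given a file */
        if (!in) { perror(argv[1]); return 1; }
    }
    char line[4096];
    while (fgets(line, sizeof line, in)) {
        /* evaluate_line(line); ... */
        fputs(line, stdout);              /* placeholder: echo through */
    }
    if (in != stdin) fclose(in);
    return 0;
}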
If you want to make a program run faster, you need to understand what is making the program run slowly. The field of computer science dedicated to measuring the performance of a running program is called profiling.
Once you discover which internal portion of your program is running slow, you can generally speed it up. How you go about speeding up that item depends heavily on what "the slow part" is doing and how it is "being done".
Several people have recommended pipes for moving the data directly from the output of one program into the input of another program. Assuming you rewrite your tools to handle input and output in a piped manner, this might improve performance. Again, it depends on what you are doing and how you are doing it.
For example, if your tool just fixes windows style end-of-lines into unix style end-of-lines, the program might read in one line, waiting for it to be available, check the end-of-line and write out the line with the desired end-of-line. Or the tool might read in all of the data, do a replacement call on each "wrong" end-of-line in memory, and then write out all of the data. With the first solution, piping speeds things up. With the second solution piping doesn't speed up anything.
The reason it is truly so hard to answer such a question is that the fix you need really depends on the code you have, the problem you are trying to solve, and the means by which you are solving it now. In the end, there isn't always a 100% guarantee that the code can be sped up; however, virtually every piece of code has opportunities to be sped up. Use profiling to speed up the parts that are slow, instead of wasting your time working on a part of your program that is only called once and represents 0.001% of the program's runtime.
Remember if you speed up something that is 0.001% of your program's runtime by 50%, you actually only sped up your entire program by 0.0005%. Use profiling to determine the block of code that's taking up 90% of your runtime and concentrate on it.
I do have to wonder why, if A and B depend on each other to run, do you want them to be part of an independent toolset.
One solution is a compromise between the two:
Create a library that contains A.
Create a library that contains B.
Create a program that spawns two threads, 1 containing A and 2 containing B.
Create a semaphore that tells A to run and another that tells B to run.
After the function that calls A in 1, increment B's semaphore.
After the function that calls B in 2, increment A's semaphore.
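A rough sketch of that thread/semaphore handshake; run_A and run_B stand in for the (hypothetical) library entry points, and the iteration count is arbitrary:

#include <pthread.h>
#include <semaphore.h>

sem_t sem_a, sem_b;

void *thread_a(void *arg) {
    for (int i = 0; i < 1000; i++) {
        sem_wait(&sem_a);     /* wait until it is A's turn */
        /* run_A(); generate problems from the shared data */
        sem_post(&sem_b);     /* tell B it may run */
    }
    return NULL;
}

void *thread_b(void *arg) {
    for (int i = 0; i < 1000; i++) {
        sem_wait(&sem_b);
        /* run_B(); evaluate and produce input for the next iteration of A */
        sem_post(&sem_a);
    }
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    sem_init(&sem_a, 0, 1);   /* A goes first */
    sem_init(&sem_b, 0, 0);
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}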
Another possibility is to use file locking in your programs:
Make both A and B execute in infinite loops (or however many times you're processing data)
Add code to attempt to lock both files at the beginning of the infinite loop in A and B (if the lock can't be acquired, sleep and try again so that you don't do anything until you have the lock).
Add code to unlock and sleep for longer than the sleep in step 2 at the end of each loop.
Either of these solves the problem of having the overhead of launching the program between runs.
It's almost certainly not application startup which is the bottleneck. Linux will end up caching large portions of your programs, which means that launching will progressively get faster (to a point) the more times you start your program.
You need to look elsewhere for your bottleneck.
I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools as my requirements are different. So to test the file system I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a process, i.e., it spawns a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename, etc.) and selects arguments to this operation from the list (generated by ftw()). The thread does a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread executions can overlap just as real processes do. As operations are performed by threads, files get inserted, deleted, and renamed, and this is reflected in the list of files.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator... how will it spawn threads over a period of time? Should I be using some random delay to do this?
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
That sounds about right for a decent test case, just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have each do an operation, then wait a bit, then do another operation, etc. IMO if you hit it hard with a lot of requests and it works, then there's a good chance your filesystem will do just fine. Take an example from PostMark, which does nothing but append like crazy to different files, and from other benchmarks that do random-access reads/writes in different locations to make sure that the page has to be read from disk.
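As a rough sketch of such a spawning loop with random delays; the thread count, operation counts, and delays are arbitrary placeholders:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define N_PROCESSES 100

/* One "process": perform a few random operations, pausing between them. */
void *simulated_process(void *arg) {
    unsigned seed = (unsigned)(uintptr_t)arg;      /* per-thread RNG seed */
    int ops = 1 + rand_r(&seed) % 20;              /* operations this "process" performs */
    for (int i = 0; i < ops; i++) {
        /* pick a random operation (read/write/rename/...) and a random
         * file from the ftw()-generated list, then perform it */
        usleep(rand_r(&seed) % 100000);            /* pause between operations */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_PROCESSES];
    for (int i = 0; i < N_PROCESSES; i++) {
        pthread_create(&threads[i], NULL, simulated_process, (void *)(uintptr_t)(i + 1));
        usleep(rand() % 500000);                   /* random delay before the next "process" starts */
    }
    for (int i = 0; i < N_PROCESSES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}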