Edit all files via multi-threading in C

If you had a base directory with an unknown number of files, plus additional folders containing more files, and needed to rename every file to append the date it was created on,
i.e. filename.ext -> filename_09_30_2021.ext
Assuming the renaming function was already written and returned 1 on success, 0 on failure and -1 on error,
int rename_file(char *filename)
I'm having trouble understanding how you would write the multi-threaded file-parsing section to increase the speed.
Would it have to first break the entire file tree down into, say, four char arrays of filenames and then create four threads to tackle each section?
Wouldn't that be counterproductive, and slower than a single thread walking the file tree and renaming files as it finds them instead of listing them first for any multi-threading?

Wouldn't that be counterproductive, and slower than a single thread walking the file tree and renaming files as it finds them instead of listing them first for any multi-threading?
In general, you get better performance from multithreading for CPU-intensive operations. In this case you'll probably see little to no improvement; it's even quite possible that it gets slower.
The bottleneck here is not the CPU. It's reading from the disk.
Related: An answer I wrote about access times in general https://stackoverflow.com/a/45819202/6699433
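For illustration only, here is a minimal sketch of the split-into-chunks approach the question describes, using POSIX threads. rename_file() is the function assumed by the question; the paths array would have to come from a prior walk of the tree (for example with nftw()), which is not shown.
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Assumed to exist, as in the question: 1 = success, 0 = fail, -1 = error. */
int rename_file(char *filename);

struct slice {
    char **paths;    /* this thread's share of the full path list */
    size_t count;
};

static void *worker(void *arg)
{
    struct slice *s = arg;
    for (size_t i = 0; i < s->count; i++) {
        if (rename_file(s->paths[i]) != 1)
            fprintf(stderr, "rename failed: %s\n", s->paths[i]);
    }
    return NULL;
}

/* paths/total would come from a prior directory walk (not shown). */
void rename_all(char **paths, size_t total)
{
    pthread_t tid[NUM_THREADS];
    struct slice s[NUM_THREADS];
    size_t chunk = total / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        s[t].paths = paths + t * chunk;
        s[t].count = (t == NUM_THREADS - 1) ? total - t * chunk : chunk;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}
As the answer above notes, both the directory walk and the rename operations are disk-bound, so this layout is unlikely to beat a single thread that renames files as it finds them.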

Related

Hadoop: Sending Files or File paths to a map reduce job

Suppose I had N files to process using Hadoop MapReduce; let's assume they are large, well beyond the block size, and that there are only a few hundred of them. Now I would like to process each of these files; let's assume the word-counting example.
My question is: what is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files, as opposed to sending each of the files directly to the map function, i.e. concatenating all the files and pushing them into different mappers [EDIT].
Are these both valid approaches?
Are there any drawbacks to them?
Thanks for the prompt answers. I've included a detailed description of my problem, since my abstraction may have missed a few important points:
I have N small files on Hadoop HDFS in my application and I just need to process each file. So I am using a map function to apply a Python script to each file (actually an image; I've already looked at all the Hadoop image-processing links out there). I am aware of the small-files problem, and the typical recommendation is to group the smaller files so we avoid the overhead of moving files around (the basic recommendation being sequence files or creating your own data structures, as in the case of HIPI).
This makes me wonder: can't we just tell each mapper to look for files that are local to it and operate on those?
I haven't found a solution to that issue, which is why I was looking at either sending the paths to the files to each mapper, or the files themselves.
Creating a list of path names for each collection of images seems to be OK, but as mentioned in the comments, I lose the data-locality property.
Now, when I looked at the Hadoop streaming interface, it mentions that the different pieces communicate via stdin and stdout, typically used for text files. That's where I get confused: if I am just sending a list of path names, this shouldn't be an issue, since each mapper would just try to find the collection of images it is assigned. But when I look at the word-count example, the input is the file itself, which then gets split up across the mappers, so that's where I was confused as to whether I should concatenate images into groups and then send those concatenated groups, just like the text document, to the different mappers, or whether I should instead concatenate the images, leave them in HDFS, and just pass their path to the mapper... I hope this makes sense... maybe I'm completely off here...
Thanks again!
Both are valid. But the latter would incur extra overhead and performance would go down, because you are talking about concatenating all the files into one and feeding it to just one mapper. By doing that you would go against one of the most basic principles of Hadoop: parallelism. Parallelism is what makes Hadoop so efficient.
FYI, if you really need to do that, you have to set isSplittable to false in your InputFormat class, otherwise the framework will split the file (based on your InputFormat).
And as far as the input path is concerned, you just need to give the path of the input directory. Each file inside this directory will be processed without human intervention.
HTH
In response to your edit :
I think you have misunderstood this a bit. You don't have to worry about locality; Hadoop takes care of that. You just have to run your job and the data will be processed on the node where it is present. The size of the file has nothing to do with it. You don't have to tell the mappers anything. The process goes like this:
You submit your job to the JobTracker (JT). The JT directs the TaskTracker (TT) running on the node that has the block of data required by the job to start the mappers. If the slots are occupied by some other process, then the same thing takes place on some other node that has the data block.
The bottleneck will be there if you are processing the whole concatenated file in a single mapper, as you have mentioned.
It won't be a problem if you are providing the concatenated file as input to Hadoop, since the large file formed will obviously be distributed in HDFS (I assume you are using HDFS) and will be processed by multiple mappers and reducers concurrently.
My question is: What is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files as opposed to sending each of the files directly to the map function i.e. concatenating all the files and pushing them into a single mapper.
By listing the file paths in a text file and (I assume) manually opening them in the mapper, you'll be defeating data locality (that is, Hadoop will try to run your mapper code where the data is, rather than moving the data to where your code executes). With 1000 files, this will also probably be processed by a single mapper instance (I imagine 1000 lines of text should be less than your block size).
If you concatenate all the files first and then use that as input, this will usually be less efficient, mainly because you're copying all the files to a single node (to concatenate them) and then pushing the data back up to HDFS as a single file. This is even before you get to process the file again in a mapper (or more, depending on the splittability of your input format / compression codec).
If you were going to process this concatenated file multiple times, and each file were smaller than the block size, then merging them into a single file might be beneficial, but you've already noted that each file is larger than the default block size.
Is there a particular reason you want all files to flow through a single mapper (which is what it sounds like you are trying to achieve with these two options)?

Loading thousands of files into the same memory chunk in C

Box: Linux, gcc.
Problem :
Finding out the file signature of everything in a home folder, which contains thousands of items, by scanning the folder recursively.
Done so far:
Using the mmap() system call to load the first 1 KB of each file and check the file's magic number.
The drawback of that method is that for each file encountered I've got to make two system calls (i.e. mmap() and munmap()).
Best solution if possible:
I would like to allocate a single chunk of memory, load each file into this one buffer in turn, and deallocate it when processing is complete, meaning that for each folder scanned I would only use two system calls.
I can't figure out which system call to use in order to achieve that, or even whether this solution is realistic!
Any advice would be greatly appreciated.
Don't worry about performance until you know it isn't enough. Your time is much more valuable than the gains in program run time (except in extremely rare cases). And when the performance isn't enough, measure before digging in. There are numerous war stories about "performance optimizations" that were a complete waste (if not positively harmful).
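For reference, a minimal sketch of the per-file check described in the question, shown with a plain read() into a small reused buffer rather than mmap(): this needs one read per file instead of the mmap()/munmap() pair (open() and close() are required either way). The gzip magic number is purely an illustrative example.
#include <fcntl.h>
#include <unistd.h>

/* Reads the first bytes of 'path' into a reused buffer and checks a magic
 * number (gzip's 0x1f 0x8b is used here only as an example).
 * Returns 1 on a match, 0 on no match, -1 on error. */
static int check_magic(const char *path)
{
    static unsigned char buf[1024];   /* one buffer, reused for every file */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    if (n < 2)
        return -1;
    return buf[0] == 0x1f && buf[1] == 0x8b;
}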

write one big file or multiple small files

I'm wondering what is better in terms of performance: writing to one big text file (something around 10 GB or more), or using a subfolder system with 3 levels of 256 folders each, where the last level holds the text file. Example:
1
    1
    2
    3
        1
        2
        3
        4
    4
2
3
4
It will be heavily accessed (a file will be opened, some text appended, then closed), so I don't know which is better: opening and closing file pointers thousands of times a second, or moving a pointer inside one big file thousands of times.
I'm on a Core i7 with 6 GB of DDR3 RAM and a 60 MB/s disk write speed, under ext4.
You ask a fairly generic question, so the generic answer would be to go with the big file, access it and let the filesystem and its caches worry about optimizing access. Chances are they came up with a more advanced algorithm than you just did (no offence).
To make a decision, you need to know answers to many questions, including:
How are you going to determine which of the many files to access for the information you are after?
When you need to append, will it be to the logical last file, or to the end of whatever file the information should have been found in?
How are you going to know where to look in any given file (large or small) for where the information is?
Your 256³ files (16 million or so if you use all of them) will require a fair amount of directory storage.
You actually don't mention anything about reading the file - which is odd.
If you're really doing write-only access to the file or files, then a single file always opened with O_APPEND (or "a") will probably be best. If you are updating (as well as appending) information, then you get into locking issues (concurrent access; who wins).
So, you have not included enough information in the question for anyone to give a definitive answer. If the comments you've added contain more of the needed information, then you should place that material into the question (edit the question; add the comment material).
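As an aside, a minimal sketch of the single-file append pattern mentioned above, assuming one plain text record per call; with O_APPEND every write() lands at the current end of the file, so no explicit seeking is needed.
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Appends one text record to the big file; with O_APPEND every write()
 * goes to the current end of file, so no lseek() bookkeeping is needed. */
int append_record(const char *path, const char *record)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, record, strlen(record));
    close(fd);
    return n < 0 ? -1 : 0;
}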

Efficient copy of entire directory

I want to copy one directory and the two files under it to another location on shared storage. Is it possible to combine the three (one directory and two files) into a single continuous write and decompose it on the other side, to save the cost? I am limited to the C language and Unix/Linux. I am considering creating a structure with the inode info and reconstructing the data at the receiver.
Thanks!
rsync is what you're looking for. Or tar if you feel like working with the shell on the other side.
The best optimization you can do is to use large buffers for the copy. If that is not enough, then restructure your data to be a single file instead of two files in a directory. The next step is to get faster hardware.
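A minimal sketch of the large-buffer copy suggested above; the 8 MiB buffer size is an arbitrary example, and on Linux sendfile() or copy_file_range() would be alternatives worth measuring.
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define COPY_BUF (8 * 1024 * 1024)   /* 8 MiB buffer; an arbitrary example */

/* Copies src to dst through one large heap buffer. Returns 0 on success.
 * A short write is treated as an error for simplicity. */
int copy_file(const char *src, const char *dst)
{
    int rc = -1;
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = malloc(COPY_BUF);

    if (in >= 0 && out >= 0 && buf != NULL) {
        ssize_t n;
        rc = 0;
        while ((n = read(in, buf, COPY_BUF)) > 0)
            if (write(out, buf, n) != n) { rc = -1; break; }
        if (n < 0)
            rc = -1;   /* read error */
    }
    free(buf);
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    return rc;
}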
There are many file systems in common use for Unix/Linux and you would need to write a custom copy algorithm for each. There is rarely a guarantee of contiguous blocks for even a single file, let alone two. Odds are also good that your block copy routine would bypass and be less efficient than existing file system optimizations.
Reading an entire file into memory before writing it out will give more benefit in terms of minimizing seek times than opening fewer files would, at least for files over a certain size. And not all hardware suffers from seek times.
For some reason, cpio is often preferred over tar for this.
You can, for example, pipe cpio to a ssh session running cpio remotely.

How can you concatenate two huge files with very little spare disk space? [closed]

Suppose that you have two huge files (several GB) that you want to concatenate together, but that you have very little spare disk space (let's say a couple hundred MB). That is, given file1 and file2, you want to end up with a single file which is the result of concatenating file1 and file2 together byte-for-byte, and delete the original files.
You can't do the obvious cat file2 >> file1; rm file2, since in between the two operations, you'd run out of disk space.
Solutions on any and all platforms with free or non-free tools are welcome; this is a hypothetical problem I thought up while I was downloading a Linux ISO the other day, and the download got interrupted partway through due to a wireless hiccup.
time spent figuring out clever solution involving disk-sector shuffling and file-chain manipulation: 2-4 hours
time spent acquiring/writing software to do in-place copy and truncate: 2-20 hours
times median $50/hr programmer rate: $400-$1200
cost of 1TB USB drive: $100-$200
ability to understand the phrase "opportunity cost": priceless
I think the difficulty is determining how the space can be recovered from the original files.
I think the following might work:
1. Allocate a sparse file of the combined size.
2. Copy 100 MB from the end of the second file to the end of the new file.
3. Truncate 100 MB off the end of the second file.
4. Loop steps 2 and 3 until you finish the second file (with step 2 modified to write to the correct place in the destination file).
5. Do steps 2, 3 and 4 again, but with the first file.
This all relies on sparse file support, and file truncation freeing space immediately.
If you actually wanted to do this then you should investigate the dd command, which can do the copying step.
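For illustration, a sketch in C of the copy-a-tail-chunk-then-truncate loop that this answer (and the ones below) rely on; move_into() is a hypothetical helper, CHUNK is whatever free space you can spare, and error handling is omitted for brevity.
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (100L * 1024 * 1024)   /* how much free space you can spare */

/* Moves the contents of 'src' into 'dst' starting at byte offset 'dst_off',
 * working backwards in CHUNK-sized pieces and truncating 'src' after each
 * piece so its blocks are released as you go. Error handling omitted. */
int move_into(const char *src, const char *dst, off_t dst_off)
{
    int in = open(src, O_RDWR);
    int out = open(dst, O_WRONLY | O_CREAT, 0644);
    char *buf = malloc(CHUNK);
    struct stat st;

    fstat(in, &st);
    off_t remaining = st.st_size;
    while (remaining > 0) {
        off_t piece = remaining < CHUNK ? remaining : CHUNK;
        off_t src_off = remaining - piece;

        pread(in, buf, piece, src_off);              /* read the tail piece  */
        pwrite(out, buf, piece, dst_off + src_off);  /* same offset in dst   */
        ftruncate(in, src_off);                      /* give the space back  */
        remaining = src_off;
    }
    free(buf);
    close(in);
    close(out);
    return unlink(src);
}
For the sparse-file scheme above you would call move_into("file2", "file3", size_of_file1) first, then move_into("file1", "file3", 0). As the answer notes, this relies on the filesystem supporting holes and on ftruncate() releasing blocks immediately.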
Someone in another answer gave a neat solution that doesn't require sparse files, but does copy file2 twice:
Copy 100Mb chunks from the end of file 2 to a new file 3, ending up in reverse order. Truncating file 2 as you go.
Copy 100Mb chunks from the end of file 3 into file 1, ending up with the chunks in their original order, at the end of file 1. Truncating file 3 as you go.
Here's a slight improvement over my first answer.
If you have 100MB free, copy the last 100MB from the second file and create a third file. Truncate the second file so it is now 100MB smaller. Repeat this process until the second file has been completely decomposed into individual 100MB chunks.
Now each of those 100MB files can be appended to the first file, one at a time.
With those constraints I expect you'd need to tamper with the file system; directly edit the file size and allocation blocks.
In other words, forget about shuffling any blocks of file content around, just edit the information about those files.
if the files are highly compressible (e.g. logs), note that gzip replaces each input with its .gz counterpart in place:
gzip file1
gzip file2
zcat file1.gz file2.gz | gzip > file3.gz
rm file1.gz file2.gz
gunzip file3.gz
At the risk of sounding flippant, have you considered the option of just getting a bigger disk? It would probably be quicker...
Not very efficient, but I think it can be done.
Open the first file in append mode, and copy blocks from the second file to it until the disk is almost full. For the remainder of the second file, copy blocks from the point where you stopped back to the beginning of the file via random access I/O. Truncate the file after you've copied the last block. Repeat until finished.
Obviously, the economic answer is buy more storage assuming that's a possible answer. It might not be, though--embedded system with no way to attach more storage, or even no access to the equipment itself--say, space probe in flight.
The previously presented answer based on the sparse file system is good (other than the destructive nature of it if something goes wrong!) if you have a sparse file system. What if you don't, though?
Starting from the end of file 2 copy blocks to the start of the target file reversing them as you go. After each block you truncate the source file to the uncopied length. Repeat for file #1.
At this point the target file contains all the data backwards, the source files are gone.
Read a block from the start and a block from the end of the target file, reverse them, and write each to the spot the other came from. Work your way inwards, flipping blocks.
When you are done the target file is the concatenation of the source files. No sparse file system needed, no messing with the file system needed. This can be carried out at zero bytes free as the data can be held in memory.
ok, for theoretical entertainment, and only if you promise not to waste your time actually doing it:
files are stored on disk in pieces
the pieces are linked in a chain
So you can concatenate the files by:
linking the last piece of the first file to the first piece of the last file
altering the directory entry for the first file to change the last piece and file size
removing the directory entry for the last file
cleaning up the first file's end-of-file marker, if any
note that if the last segment of the first file is only partially filled, you will have to copy data "up" the segments of the last file to avoid having garbage in the middle of the file [thanks @Wedge!]
This would be optimally efficient: minimal alterations, minimal copying, no spare disk space required.
now go buy a usb drive ;-)
Two thoughts:
If you have enough physical RAM, you could actually read the second file entirely into memory, delete it, then write it in append mode to the first file. Of course if you lose power after deleting but before completing the write, you've lost part of the second file for good.
Temporarily reduce disk space used by OS functionality (e.g. virtual memory, "recycle bin" or similar). Probably only of use on Windows.
I doubt this is a direct answer to the question. You can consider this as an alternative way to solve the problem.
I think it is possible to consider the 2nd file as part 2 of the first file. In zip applications we often see a huge file split into multiple parts; if you open the first part, the application automatically takes the other parts into account in further processing.
We can simulate the same thing here. As @edg pointed out, tinkering with the file system would be one way.
you could do this:
head --bytes=1024 file2 >> file1 && tail --bytes=+1025 file2 > file2.tmp && mv file2.tmp file2
(Note that redirecting tail's output straight back into file2 would truncate file2 before tail gets to read it, and that the temporary copy needs nearly as much free space as file2 itself, so with only a couple hundred MB spare you would trim from the end of file2 instead, as in the answers above.)
you can increase 1024 according to how much extra disk space you have, then just repeat this until all the bytes have been moved.
This is probably the fastest way to do it (in terms of development time)
You may be able to gain space by compressing the entire file system. I believe NTFS supports this, and I am sure there are flavors of *nix file systems that would support it. It would also have the benefit of after copying the files you would still have more disk space left over than when you started.
OK, changing the problem a little bit. Chances are there's other stuff on the disk that you don't need, but you don't know what it is or where it is. If you could find it, you could delete it, and then maybe you'd have enough extra space.
To find these "tumors", whether a few big ones, or lots of little ones, I use a little sampling program. Starting from the top of a directory (or the root) it makes two passes. In pass 1, it walks the directory tree, adding up the sizes of all the files to get a total of N bytes. In pass 2, it again walks the directory tree, pretending it is reading every file. Every time it passes N/20 bytes, it prints out the directory path and name of the file it is "reading". So the end result is 20 deep samples of path names uniformly spread over all the bytes under the directory.
Then just look at that list for stuff that shows up a lot that you don't need, and go blow it away.
(It's the space-equivalent of the sampling method I use for performance optimization.)
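A sketch of such a two-pass sampler using nftw(); this is a guess at the described program, not the author's actual code. Pass 1 totals the file sizes, pass 2 prints a path every total/20 bytes.
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static long long total, seen, next_mark, stride;

static int pass1(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
    if (type == FTW_F)
        total += st->st_size;        /* pass 1: add up all file sizes */
    return 0;
}

static int pass2(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
    if (type == FTW_F) {
        seen += st->st_size;
        while (seen >= next_mark) {  /* one sample every total/20 bytes */
            printf("%s\n", path);
            next_mark += stride;
        }
    }
    return 0;
}

int main(int argc, char **argv)
{
    const char *root = argc > 1 ? argv[1] : ".";
    nftw(root, pass1, 32, FTW_PHYS);
    stride = total / 20 > 0 ? total / 20 : 1;
    next_mark = stride;
    nftw(root, pass2, 32, FTW_PHYS);
    return 0;
}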
"fiemap"
http://www.mjmwired.net/kernel/Documentation/filesystems/fiemap.txt

Resources