How to efficiently read data that needs seeking from an stdin pipe - C

I'm looking for the best way to read data from an stdin pipe in C.
Problem: I need to seek on this data, i.e. I need to read data from the start of the stream after reading some data at the end of this same stream.
Small use case: gunzip -c 4GbDataFile.gz | myprogram
Another one:
On local host: nc -l -p 1234 | myprogram
On remote host: gunzip -c 4GbDataFile.gz | nc -q 0 theotherhost 1234
I know that data from a fifo can only be read once. So, at the moment:
I slurp everything from stdin to memory and work from this allocated memory.
It's ugly, but it works. An obvious issue is that if someone sends a huge (or continuous) stream to my app, I'll end up with a big allocated memory chunk, or I'll run out of memory (think of an 8 GB file).
What I thought next :
I set a size limit (maybe user-defined) on that memory chunk. Once I've read this much data from stdin:
Either I stop there: "Errr. Out of memory, bazinga. Forget it." style.
Or I start dumping what I am reading to a file and work from this file once all the data is read.
But then, what is the point? I cannot find out the origin of the data that I am reading. If it is a local 8 GB file, I'll be dumping it to another 8 GB file on the same system.
So, my question is:
How do you efficiently read a lot of data from an stdin pipe when you have to seek back and forth in it?
Thanks in advance for your answers.
Edit :
My program needs to read metadata somewhere in the given file (depending on the file format), which may be at the end of the stream. Then it may read back other data at the start of the stream, then at another place, etc. In short: it needs access to any byte of the data.
An example would be reading data from an archive file without knowing the file format before starting to read from stdin: I need to check the archive metadata, find archived file names and offsets, etc.
So I'll make a local copy of stdin content and work from it. Thanks everyone for your inputs ;)

You need to get your requirements clear. If you need to seek(), then obviously you can't take input from stdin; you should take the input file name as an argument instead.

The data structure in your 4GbDataFile just doesn't lend itself to what you want to do. Think outside the box. Don't hammer your program into something it shouldn't even attempt. Try to fix the input format where it is generated so you don't need to seek back 4 GB.
In case you do like hammering: 4GB of in-core memory is pretty expensive. Instead, save the data read from stdin in a file, then open the file (or mmap it) and seek to your heart's content.
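For illustration, a minimal sketch of that save-then-mmap approach; the temp file template, the buffer size and the final printout are arbitrary choices, and error handling is kept to a bare minimum:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/stdin-spool-XXXXXX";   /* arbitrary template */
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }
    unlink(path);                              /* file vanishes when closed */

    /* Spool everything from stdin into the temp file. */
    char buf[65536];
    ssize_t n;
    off_t total = 0;
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
        if (write(fd, buf, (size_t)n) != n) { perror("write"); return 1; }
        total += n;
    }
    if (total == 0) return 0;

    /* Map the spooled stream; "seeking" is now just indexing. */
    unsigned char *data = mmap(NULL, (size_t)total, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("spooled %lld bytes, first byte 0x%02x, last byte 0x%02x\n",
           (long long)total, data[0], data[total - 1]);

    munmap(data, (size_t)total);
    close(fd);
    return 0;
}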

I think you should read the infamous Useless Use of Cat Award.
TL;DR: change cat 4gbfile | yourprogram to yourprogram < 4gbfile.
If you really insist on having it work with data from a pipe, you'll have to store it in a temporary file at startup then replace file descriptor 0 with a copy of the fd for the temp file, using dup2.
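A rough sketch of that trick, under the same assumptions (an arbitrary temp file template, minimal error handling); after the dup2 call the rest of the program can seek on stdin as if it were a regular file:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Spool the real stdin into an unlinked temp file, then make that
 * temp file BE file descriptor 0, so the rest of the program can
 * fseek()/lseek() on "stdin" as usual. */
static void make_stdin_seekable(void)
{
    char path[] = "/tmp/stdin-copy-XXXXXX";    /* arbitrary template */
    int tmpfd = mkstemp(path);
    if (tmpfd < 0) { perror("mkstemp"); exit(1); }
    unlink(path);

    char buf[65536];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        write(tmpfd, buf, (size_t)n);

    lseek(tmpfd, 0, SEEK_SET);      /* rewind the copy              */
    dup2(tmpfd, STDIN_FILENO);      /* fd 0 now refers to the copy  */
    close(tmpfd);
}

int main(void)
{
    make_stdin_seekable();

    /* stdin is now backed by a regular file, so this is legal: */
    fseek(stdin, -1, SEEK_END);
    int c = getchar();
    printf("last byte of the stream: 0x%02x\n", c);
    return 0;
}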

Related

How do I get the disk addresses of files in C/C++?

When a file is saved to a drive, its contents are written and then indexed. I want to get those indexes and access the raw contents of the files.
Any idea how to do this, especially for ext4 & btrfs?
UPDATE: I want to get the addresses of the extents of a file. The information about these addresses must be stored somewhere on the disk. I want to retrieve this info in order to map the physical location of the file contents. Any methods to achieve that?
UPDATE: Hello, all! Thanks for your replies. What I want is a function/command which returns a list of extent addresses. debugfs seems to be the command with the most relevant functionality.
It depends on the filesystem you are using. If you are running Linux you can use debugfs to examine the file in the filesystem.
I have to say that all FSs are mounted through a VFS, a virtual filesystem that acts as a simplified interface with the standard operations (open, close, read...). What does that mean? Neither a filesystem nor its contents (files, directories) are opened directly from disk: when you open something, it is brought into main memory (your RAM), you do your operations on it, and when you close it, it goes back to the disk drive.
Now, the question is: can I get an absolute address within a FS? Yes: if you open the whole filesystem, e.g. open("/dev/sdaX", O_RDONLY);, you can work with addresses relative to your filesystem using lseek in C, for example.
And then... can I get the same for the whole drive? No, because you cannot open the whole drive as a file descriptor. Remember /dev/sdaX in UNIX? Partitions and their contents can be opened like files because they have a virtual interface running on them.
Your last question: can I read the really raw contents? All files are read as they appear on disk; the only thing that changes is the descriptor used by the OS and some data about how the file is indexed, all of this as a kind of "file header".
I hope all your questions are answered.
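To make the open("/dev/sdaX", O_RDONLY) idea concrete, here is a minimal sketch; the partition name and the offset are placeholders (take them from debugfs/filefrag output), and opening a block device like this normally requires root:
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* Placeholders: substitute your partition and a byte offset
     * obtained from debugfs/filefrag; normally needs root. */
    const char *part = "/dev/sdaX";
    off_t offset = 4096;

    int fd = open(part, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char block[512];
    if (lseek(fd, offset, SEEK_SET) < 0) { perror("lseek"); return 1; }
    ssize_t n = read(fd, block, sizeof block);
    printf("read %zd raw bytes at offset %lld\n", n, (long long)offset);

    close(fd);
    return 0;
}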
The current solution/workaround is to call these commands via popen:
filefrag -e /path/to/file
hdparm --fibmap /path/to/filename
Then one simply parses the string output of these programs. It is not a real solution (i.e., results available at the C/C++ level), but I'll accept it for now.
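A hedged sketch of that popen workaround in C; the path is a placeholder, and since filefrag's text output is not a stable interface, the program below only prints the lines it would have to parse:
#include <stdio.h>

/* Run filefrag -e on a file and hand each output line to the caller.
 * The path is a placeholder; real code should parse the extent lines
 * defensively (or use the FIEMAP ioctl instead of scraping text). */
int main(void)
{
    const char *path = "/path/to/file";             /* placeholder */
    char cmd[512];
    snprintf(cmd, sizeof cmd, "filefrag -e '%s'", path);

    FILE *p = popen(cmd, "r");
    if (!p) { perror("popen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, p))
        fputs(line, stdout);                        /* parse extents here */

    return pclose(p);
}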
Sources:
https://unix.stackexchange.com/questions/106802/what-command-do-i-use-to-see-the-start-and-end-block-of-a-file-in-the-file-syste
https://serverfault.com/questions/29886/how-do-i-list-a-files-data-blocks-on-linux

How to create a virtual file with an inode that references storage in memory

Let me explain clearly.
The following is my requirement:
Let's say there is a command which has an option '-f' that takes a filename as an argument.
Now I have 5 files, and I want to create a new file by merging those 5 files and give the new filename as the argument to the above command.
But there is a difference between
reading a single file and
merging all files & reading the merged file.
More IO is generated in the second case (reads from 5 files + a write of the merged file + whatever IO our command does with the given file) than in the first case (just whatever IO our command does with the given file).
Can we reduce this unwanted IO?
In the end, I really don't want the merged file at all. I only create this merged file to let the command read the merged files' content.
That said, I don't actually need this implementation: the file sizes are not that big and the extra IO is negligible. But I am just curious to know whether it can be done.
So in order to implement this, I have following understanding/questions:
Generally, what all commands (that take a filename argument) do is read the file.
In our case, the filename (filepath) is not ready; it's just a virtual/imaginary filename that would exist as the merger of all the files.
So, can we create such a virtual filename?
What is a filename? It's an indirect inode entry for a storage location.
In our case, the individual files have different inode entries, and all those inode entries point to different storage locations. Our virtual/imaginary file has, in fact, no inode, and even if we could create an imaginary inode, it could only point to storage in memory (as there is no on-disk reference from the storage location of one file to the storage location of another).
But let's say that, using advanced programming, we are able to create an imaginary filepath with an imaginary inode that points to storage in memory.
Now, when we give that imaginary filename as the argument and the command tries to open that imaginary file, it finds that its inode entry refers to storage in memory. But the actual content is on disk, not in memory. So the data is not loaded into memory yet unless we read it explicitly. Hence, again, we would need to read the data first.
Simply put, as there is no continuity or reference on disk pointing to the next file's data, the merged data needs to be loaded into memory first.
So, by my deduction, it seems we would at least need to put the data in memory. However, the command itself needs the file to be read anyway (if not the whole file, at least the part of it the command's operation requires, be it parsing or whatever). So, using this method, we could save some significant IO if it's a really big file.
So, how can we create that virtual file?
My first answer is to write the merged file to tmpfs and refer to that file. But is it the only option, or can we actually point to a storage location in memory other than tmpfs? tmpfs is not an option because my script can be run from any server and the solution needs to work on all of them. If my script creates the merged file in /dev/shm, it may fail on a server that doesn't have /dev/shm. So I should be able to load the data into memory directly. But I think a normal user will not have that kind of access to memory, so it seems this cannot be done without shm.
Please let me know your comments, and kindly correct me if my understanding is wrong anywhere. Even if it is complicated for my level, kindly post your answer; I might at least understand it after a few months.
Create a fifo (named pipe) and provide its name as an argument to your program. The process that combines the five input files writes to this fifo:
mkfifo wtf
cat file1 file2 file3 file4 file5 > wtf # this will block...
[from another terminal] cp wtf omg
Here I used cp as your program, and cat as the program combining the five files. You will see that omg contains the output of your program (here: cp), and that the first terminal unblocks once the program is done.
Your program (here: cp) is not even aware that its first argument wtf refers to a fifo; it just opens it and reads from it like it would an ordinary file. (This will fail if the program attempts to seek in the file; seek() is not implemented for pipes and fifos.)
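To illustrate that last point, a toy stand-in for your program could be as simple as the sketch below; it neither knows nor cares that its argument is a fifo, as long as it never seeks:
#include <stdio.h>

/* Toy stand-in for "your program": open the name given on the command
 * line (here the fifo "wtf") and stream it to stdout.  It behaves the
 * same whether the argument is a regular file or a fifo. */
int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, stdout);

    fclose(in);
    return 0;
}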

Easiest way to overwrite a series of files with zeros

I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.
Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
for f in $(<list_of_files.txt)
do
    # block count and block size as reported by stat
    read blocks blocksize < <(stat -c "%b %B" "${f}")
    # overwrite in place; conv=notrunc stops dd from truncating the file first
    dd if=/dev/zero bs=${blocksize} count=${blocks} of="${f}" conv=notrunc
    rm "${f}"
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added conv=notrunc option to dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.
I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes overwriting the file with random data, followed by a final pass overwriting it with zeros. The file will remain on disk, but it will contain only zeros.
shred -n 3 -z -u <filename> - Similar to above, but also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n controls the number of passes of overwriting with random data. Increasing this number makes the shred operation take longer but shreds more thoroughly. I think 3 is enough, but I may be wrong.
The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.
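For illustration only, a bare-bones version of such a program might look like the sketch below (a single pass of zeros, then unlink); real tools like srm and shred also deal with syncing strategies, sparse files and filesystems that don't overwrite blocks in place, which this ignores:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite every byte of each argument with zeros, flush to disk,
 * then unlink it.  Sketch only: no handling of sparse files, hard
 * links, or copy-on-write filesystems. */
int main(int argc, char **argv)
{
    char zeros[65536];
    memset(zeros, 0, sizeof zeros);

    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_WRONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(argv[i]);
            if (fd >= 0) close(fd);
            continue;
        }

        for (off_t left = st.st_size; left > 0; ) {
            size_t chunk = left < (off_t)sizeof zeros ? (size_t)left : sizeof zeros;
            if (write(fd, zeros, chunk) != (ssize_t)chunk) { perror("write"); break; }
            left -= chunk;
        }
        fsync(fd);
        close(fd);
        unlink(argv[i]);
    }
    return 0;
}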
Is there a command like srm that does a single pass of zeros,
Yes. SRM does this with the correct parameters. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one pass with
0xff and a final pass with random values.
-l -l for a second time lessens the security even more: only one
random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use srm -llz [file1] [file2] [file3] to wipe multiple files in this way with a single command.

popen vs. KornShell security

I am writing a C program using some external binaries to achieve a planned goal. I need to run one command which gives me an output, which in turn I need to process, then feed into another program as input. I am using popen, but wonder if that is the same as using a KornShell (ksh) temporary file instead.
For example:
touch myfile && chmod 700 myfile
cat myfile > /tmp/tempfile
process_file < /tmp/tempfile && rm /tmp/tempfile
Since that creates a temporary file which can be readable by root, would it be the same if one used popen in C, knowing that pipes are also files? Or is it safe to assume that the Operating System (OS) will not allow any other process to read your pipe?
You say "that creates a temporary file which can be readable by root", which implies that you are attempting to transfer the data in a way in which the root user cannot read it. That's impossible; in general, the root user has total control of the system, and can thus read any data that is on the system, whether it's in a temporary file or not. Even within a single process, the root user can read the memory of that process.
If you use popen(), there will not be an entry for the file on a filesystem; it creates a pipe, which acts like a file, but doesn't actually write that data to disk, instead it just passes it between two programs.
There will be a file descriptor for it; depending on the system, it may be easier or harder to intercept that data, but it will always be possible to do so. For instance, on Linux, you can just look in /proc/<pid>/fd/ to find all of the open file descriptors and manipulate them (read from or write to them).
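For reference, a minimal popen sketch of the same pipeline with no file on any filesystem; producer and consumer are placeholder command names, not real programs:
#include <stdio.h>

/* Run a producer command, post-process its output in this process,
 * and feed the result into a consumer command; nothing is written to
 * any filesystem.  "producer" and "consumer" are placeholders. */
int main(void)
{
    FILE *in  = popen("producer", "r");
    FILE *out = popen("consumer", "w");
    if (!in || !out) { perror("popen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, in))
        fputs(line, out);            /* transform the data here */

    pclose(in);
    pclose(out);
    return 0;
}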

Search a pattern in .gz file in C

I want to read a .gz file (text.gz), about 300 MB in size, and search for a pattern in it. I opened the file in binary format using fopen with "rb" and stored it in a buffer. When I search for a pattern that I know exists in the text, the result is wrong. When I debug the program, the elements of the buffer are different from what I expect. Do I have to read and store this kind of file some other way?
You might try using zlib and gzread to read the file.
http://zlib.net/manual.html
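A hedged sketch of that suggestion: decompress with gzread and scan each chunk for the pattern. The filename and pattern are placeholders, the code assumes text data, and this simple version can miss a match that straddles a chunk boundary:
#include <stdio.h>
#include <string.h>
#include <zlib.h>           /* link with -lz */

int main(void)
{
    const char *pattern = "needle";          /* placeholder pattern  */
    gzFile gz = gzopen("text.gz", "rb");     /* decompresses on read */
    if (!gz) { perror("gzopen"); return 1; }

    char buf[65536];
    int n;
    long hits = 0;
    while ((n = gzread(gz, buf, sizeof buf - 1)) > 0) {
        buf[n] = '\0';                       /* assumes text data    */
        if (strstr(buf, pattern))
            hits++;
    }

    printf("pattern seen in %ld chunk(s)\n", hits);
    gzclose(gz);
    return 0;
}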
Try this.
gunzip -c file.gz | grep <pattern>
If the program is exiting and failing to read the file, a really common problem is that the file is still open in Notepad or whatever else is using it, and the file I/O fails because it cannot access the file. Make sure nothing else has that file open before you test your program.
