Reading a file using pread - c

The aim of the problem is to use only pread to read a file of integers.
I am trying to devise a generic solution that can read integers of any length, but I think there must be a better approach than my current algorithm.
For the sake of explanation and to guide the algorithm, here is a sample input file. I have explicitly added \r\n to show that they exist in the file.
Input file:
23456\r\n
134\r\n
1\r\n
345678\r\n
Algorithm
1. Read a byte from the file
2. Check whether it is a digit, i.e. '0' <= byte <= '9'
3.1 if yes, increment the offset and read the next byte
3.2 if not, is it \r
3.2.1 if yes, read the next and it should be \n.
Here the line is finished and we can use strtol to convert string to int.
3.2.2 // Error condition
I came up with this algorithm because I found out that pread reads the file as a string and just puts the requested number of bytes in the provided buffer.
Question:
Is there a better way of reading integers from the file using pread() than parsing each byte to determine the end of the string and then converting it to an integer?

Is there a better way of reading integers from the file using pread() than parsing each byte to determine the end of the string and then converting it to an integer?
Yes, read big chunks of data into memory and then do the parsing in memory. Use a big buffer (how big depends on system memory). On a modern system where gigabytes of memory are available, you can go for a buffer in the megabyte range. I would probably start out with a 1 or 2 megabyte buffer and see how it performs.
This will be much more efficient than byte-by-byte reads.
Note: your code needs to handle situations where a chunk from the file stops in the middle of an integer. That adds a little complexity to the code, but it's not that difficult to handle.
where I can read integers of any length
Well, if you actually mean integers greater than the largest integer of your system, it's much more complicated. Standard functions like strtol can't be used. Further, you'll need to define your own way of storing these values. Alternatively, you can fetch a public library that can handle such values.
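A minimal sketch of the chunked approach, assuming the file contains only digits, \r and \n as in the sample input and that every value fits in a long (overflow is not checked). The name read_ints and its parameters are hypothetical. Instead of calling strtol on each line, it accumulates the value digit by digit, so a number cut in half by a chunk boundary simply continues in the next chunk:

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict compiler modes */
#include <assert.h>
#include <fcntl.h>          /* open(), for the caller */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads fd from offset 0 using only pread(), in chunks of
 * bufsize bytes, and stores up to max parsed values in out.
 * Returns the number of values stored, or -1 on a read error or an
 * unexpected byte. Assumes values fit in a long (no overflow check). */
long read_ints(int fd, char *buf, size_t bufsize, long *out, size_t max)
{
    off_t off = 0;
    long value = 0;      /* number being built; survives chunk boundaries */
    int in_number = 0;   /* non-zero once at least one digit was seen */
    size_t count = 0;
    ssize_t n;

    assert(buf != NULL && out != NULL && bufsize > 0);
    while ((n = pread(fd, buf, bufsize, off)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            char c = buf[i];
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                in_number = 1;
            } else if (c == '\r' || c == '\n') {
                if (in_number && count < max)
                    out[count++] = value;   /* line finished */
                value = 0;
                in_number = 0;
            } else {
                return -1;                  /* error condition */
            }
        }
        off += n;                           /* advance by what was actually read */
    }
    if (n < 0)
        return -1;
    if (in_number && count < max)           /* last line without a terminator */
        out[count++] = value;
    return (long)count;
}
```

A caller would open the file with open(), pass a buffer of whatever size performs well (the answer suggests starting around 1 MB), and close the descriptor afterwards.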

Related

Is it possible to count the frequency of a word in a file precisely using two buffers in C?

I have a file of size 1GB. I want to find out how many times the word "sosowhat" occurs in the file. I've written code using fgetc(), which reads one character at a time from the file, but that is far too slow for a file of size 1GB. So I made a buffer of size 1000 (using malloc) to read the file 1000 characters at a time, and I used the strstr() function to count the occurrences of the word "sosowhat". The logic is fine. But the problem is that if the "so" part of "sosowhat" is located at the end of the buffer and the "sowhat" part is in the new buffer, the word will not be counted. So I used two buffers, old_buffer and current_buffer. At the beginning of each buffer I want to check the last few characters of the old buffer. Is this possible? How can I go back to the old buffer? Is it possible without memmove()? As a beginner, I will be more than happy for your help.
Yes, it can be done. There are more possible approaches to this.
The first one, which is the cleanest, is to keep a second buffer, as suggested, of the length of the searched word, where you keep the last chunk of the old buffer. (It needs to be exactly the length of the searched word because you store wordLen - 1 characters plus the NUL terminator.) Then the quickest way is to append the first wordLen - 1 characters of the new buffer to this stored chunk and search for your word there. Then continue with your search normally. Of course, you can instead create a buffer which can hold both chunks (the last bytes from the old buffer and the first bytes from the new one).
Another approach (which I don't recommend, but can turn out to be a bit easier in terms of code) would be to fseek wordLen - 1 bytes backwards in the read file. This will "move" the chunk stored in previous approach to the next buffer. This is a bit dirtier as you will read some of the contents of the file twice. Although that's not something noticeable in terms of performance, I again recommend against it and use something like the first described approach.
Use the same algorithm as with fgetc, only read from the buffers you created. It will be just as efficient, since strstr iterates through the string char by char as well.
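A minimal sketch of the "one buffer that holds both chunks" variant, under some assumptions: the name count_word and its parameters are hypothetical, and memcmp is swapped in for strstr so the buffer needs no NUL terminator and embedded zero bytes are harmless. It does use memmove, but only to carry the last wordLen - 1 bytes to the front of the buffer before the next read:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts occurrences of word in the stream fp.
 * buf is caller-provided scratch space of bufsize bytes; bufsize must be at
 * least twice the word length so every read makes progress. */
long count_word(FILE *fp, const char *word, char *buf, size_t bufsize)
{
    size_t wlen = strlen(word);
    size_t keep = 0;     /* bytes carried over from the previous chunk */
    long count = 0;
    size_t n;

    assert(wlen > 0 && bufsize >= 2 * wlen);
    while ((n = fread(buf + keep, 1, bufsize - keep, fp)) > 0) {
        size_t len = keep + n;           /* carried bytes + fresh bytes */

        for (size_t i = 0; i + wlen <= len; i++)
            if (memcmp(buf + i, word, wlen) == 0)
                count++;

        /* Keep the last wlen - 1 bytes: too short to hold a whole match
         * (so nothing is counted twice), long enough to complete one that
         * straddles the boundary. */
        keep = len < wlen - 1 ? len : wlen - 1;
        memmove(buf, buf + len - keep, keep);
    }
    return count;
}
```

The carried tail is at most wordLen - 1 bytes, so the cost of the memmove is negligible next to the I/O, whatever the buffer size.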

How do fread and fwrite distinguish between different data (types) in C?

I am working with a program in C (with Ubuntu and its bash) and using it to manipulate binary data files. First of all, when I use fopen(filename, "w") it creates a file but without any extension. However, when I use vim filename it opens it up in some binary form.
For this question, when I use fwrite(array, sizeof(some struct), # of structs, filePointer) it writes (which I am not sure how in binary) into the file. When I use fread(anotherArray, sizeof(same struct), same # of structs, anotherFilePointer) it somehow magically knows how to read each struct in binary form and puts it into the array just by knowing its size and how much to read. What happens if I put a decimal value less than the number of structs there are in the # of structs parameter? How would fread know what to read correctly? How does it work in reading data just by looking at the sizes and not knowing what type of data it is?
fwrite writes the bytes of the memory where the object is stored to the output stream and fread reads bytes from the input stream into the memory whose address it gets as an argument. No assumption is made regarding the types and representations of the C objects stored in this memory.
Hence a number of problems can occur:
the representation of basic types can differ from one compiler to another, one machine to another, one OS to another, possibly even depending on compiler switches. Writing the bytes of the memory representation of basic types makes sense only if you know you will be reading the file back into byte-compatible structures.
the mode for accessing the input and output files matters: as you mention, files must be opened in binary mode to avoid any translation between memory representation and file contents such as what happens for text files on legacy systems. For example text mode on MS-Windows causes 0A bytes to convert to 0D 0A sequences on output and 0D bytes to be stripped on input, resulting in different contents for isolated 0D bytes in the initial content.
if the C structure contains pointers, the bytes written to the output represent the value of these pointers, not what they point to. Reading these values back into memory is highly likely to create invalid pointers and very unlikely to make any sense.
if the C structure has a flexible array at the end, its contents are not included in the sizeof(T) bytes written by fwrite or read by fread.
the C structure may contain padding between members, causing the output file to contain non-deterministic bytes, which might be a problem in some circumstances.
if the C structure has arrays with only partial meaningful contents, such as char arrays containing C strings, beware that fwrite will write the bytes beyond the null terminator, which should not be meaningful, but might be sensitive information such as password fragments or other meaningful data. Carefully erasing such arrays may avoid this issue, but padding bytes cannot be erased reliably, so this solution is not perfect.
For all the above reasons and other ones, reading/writing binary data is to be reserved to very specific cases where the programmer knows exactly what is happening. For other purposes, saving as text files in human readable form is much preferred.
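To make the "just bytes" point concrete, here is a minimal sketch. The struct record layout and the roundtrip helper are hypothetical. It writes n structs and then asks fread for fewer than were written: fread simply copies size * count bytes from the start of the stream and returns how many complete elements it delivered, without knowing anything about the type.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type, for illustration only. */
struct record {
    int id;
    double value;
};

/* Writes n records to fp, rewinds, and reads back only the first `want`.
 * Returns the number of complete records fread delivered (0 on write error).
 * fp must be open for both reading and writing in binary mode. */
size_t roundtrip(FILE *fp, const struct record *in, size_t n,
                 struct record *out, size_t want)
{
    assert(want <= n);
    if (fwrite(in, sizeof *in, n, fp) != n)
        return 0;
    rewind(fp);                      /* required between a write and a read */
    return fread(out, sizeof *out, want, fp);
}
```

Reading the same file into a struct with a different layout would succeed just as happily and produce garbage, which is why the format has to be known by the reader.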
In the question comments, from @David C. Rankin:
"Well, fread/fwrite read and write bytes (binary data - if you write out then read in the same number of bytes -- you get the same thing back). If you want to read and write text where you need to worry about line-breaks, etc.., fgets/fputs. or fprintf"
So I guess I can never know what I read in with fread unless I know what I wrote out with fwrite?
"Right, look at the type for your buffer in fwrite(3) - Linux man page it is type void *. It's just a starting address for fwrite to use in writing however many bytes you told it to write. (obviously you know what it is writing) The same for fread -- it just reads bytes -- you have to know what you are reading (or at least the format of it). That's what binary I/O is about, it's all just bytes -- it's up to you, the Programmer, to know what you are writing and reading and how to unpack it. Otherwise, use formatted-I/O and lines, words, etc.."

How to decide a buffer's size

I have a program whose purpose is to read from some input text file and filter all printable chars (i.e., ASCII between 32 and 126) into some other output text file.
I also get an argument "DataAmount", which is the amount of data I need to read. It may be 1B, 1K, 1M, 1G, 80000B, etc. (any natural number can come before the unit).
It is NOT the size of the input file; it is how much I need to read from the input file. And if the input file is smaller than DataAmount, I need to re-read the file until I have read exactly DataAmount bytes.
For the filtering, I read from the input file into a buffer. I filter the printable chars from that buffer into another buffer, and write from that buffer to the output file (both buffers are the same size).
The question is, how can I decide what size is best for those two buffers, so that there will be a minimal number of calls to read() and write()?
(NOTE: I won't write the whole data in one go since it may be too big, and I won't write one byte at a time. I write from the outbuff to the output file only when the buffer is full.)
At the moment, the buffer size depends only on the unit:
If it's B or K, the size will be 1024.
If it's M or G, the size will be 4096.
This is not good at all, since for 1B and 100000B I'll have the same buffer size.
How can I improve this?
My personal experience is that the buffer size does not matter much as long as you are using a few kilobytes.
As you noted in your question, there is overhead in doing system calls, so doing I/O one character at a time is not terribly efficient, and you can cut that overhead down by reading and writing larger blocks. However, there are other things that take time, and any reasonable amount of buffering will drop your system call overhead down to the point where it is the other things that are taking most of the time. At that point larger buffers do not make the program significantly faster. There are also problems with making a buffer too large, so you can err in that direction too.
I would not make the buffer size dynamic as you are doing. It introduces needless complexity into the program. You can verify that by running your program with different buffer sizes, and see what kind of difference it makes.
As for the actual value to use, the stdio.h header file defines the macro BUFSIZ which is the default size for stdio buffers. That macro is a reasonable size to use.
Also note that if you are using the stdio functions to do your I/O, they already provide buffering (if you're not calling the system calls read() and write() directly, you're using stdio.) There isn't really a reason to buffer the data twice, so you can either do the I/O one character at a time and let the stdio buffers take care of it for you, or disable the stdio buffering with setvbuf().
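A minimal sketch of the fixed-size approach using read() and write() directly with two BUFSIZ-sized buffers. The name filter_printable is hypothetical, and the re-reading of a short input file described in the question is left out: this version simply stops at end of input or after amount bytes have been read, whichever comes first.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>      /* BUFSIZ */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads at most `amount` bytes from in_fd and writes the
 * printable ASCII ones (32..126) to out_fd.
 * Returns the number of bytes written, or -1 on an I/O error. */
long filter_printable(int in_fd, int out_fd, size_t amount)
{
    unsigned char in[BUFSIZ], out[BUFSIZ];
    long written = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    while (amount > 0) {
        size_t want = amount < sizeof in ? amount : sizeof in;
        ssize_t n = read(in_fd, in, want);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                          /* end of input */

        size_t kept = 0;
        for (ssize_t i = 0; i < n; i++)
            if (in[i] >= 32 && in[i] <= 126)
                out[kept++] = in[i];

        /* write() may accept fewer bytes than requested, so loop. */
        for (size_t done = 0; done < kept; ) {
            ssize_t w = write(out_fd, out + done, kept - done);
            if (w < 0)
                return -1;
            done += (size_t)w;
        }
        written += (long)kept;
        amount -= (size_t)n;
    }
    return written;
}
```

Because the buffer size is a single constant, trying other sizes to measure the difference only requires changing the two array declarations.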
If you know the input in advance, you can do some statistics and take the average, so the size is not fixed but an approximation.
But my recommendation is: don't worry about the read and write syscalls. If you read very little data from the input and your buffer is large, you waste some bytes. If you get a big input and have a small buffer, you only have to do some extra iterations.
A medium size for the buffer would be good. For example, 512.
Once you decide on the unit, then decide if the number of units needs extra buffer size. Thus, once you have found the B, check the value. That way you would not have to split the smaller units.
You can do a switch statement on the unit indicators, and then process within each case, based on the numeric value of that unit. As an example, for the B do an integer divide of the maximum and set the actual buffer size based on the result (again in a switch/case sequence).

How to determine the actual usage of a malloc'ed buffer

I have some compressed binary data and an API call to decompress it which requires a pre-allocated target buffer. There is not any means via the API that tells me the size of the decompressed data. So I can malloc an oversized buffer to decompress into but I would like to then resize (or copy this to) a memory buffer of the correct size. So, how do I (indeed can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data so I do not know in advance what size to expect and I cannot write a header for the file.)
As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heuristic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is a heuristic because it's perfectly conceivable that the decompressed data could just happen to have an occurrence of your poison pattern. Unless you have exact domain knowledge of what the data will be, and can choose a pattern specifically that you know cannot exist.
Even still, an imperfect solution at best.
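A minimal sketch of the poison heuristic, assuming a single-byte pattern. The POISON value and both function names are hypothetical, and decompression itself is not shown. This version scans backwards from the end of the buffer for the last byte that differs from the pattern, which avoids stopping early on a poison byte in the middle of the data, but it still underestimates when the real data happens to end in poison bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON 0xA5   /* arbitrary byte, assumed unlikely at the end of the data */

/* Fill the whole target buffer with the poison byte before decompressing. */
void poison_fill(unsigned char *buf, size_t size)
{
    memset(buf, POISON, size);
}

/* After decompressing into buf, estimate how many bytes were written:
 * the index just past the last byte that is not the poison value.
 * This is a lower bound, not an exact size. */
size_t estimate_used(const unsigned char *buf, size_t size)
{
    assert(buf != NULL);
    while (size > 0 && buf[size - 1] == POISON)
        size--;
    return size;
}
```

The result can then be passed to realloc to shrink the buffer, keeping in mind that it is only as reliable as the assumption about the pattern.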
Usually this information is supplied at compression time (take a look at 7-Zip's LZMA SDK for example).
There is no way to know the actual size of the decompressed data (or the size of the part that is actually in use) with the information you are giving now.
If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).
A complex way to answer this problem is by decompressing twice into an over-sized buffer.
In both passes, you need a "random pattern". Starting from the end of the buffer, you count the bytes that still match the pattern, and take the point where the buffer first differs as the end of the decompressed sequence.
Or is it? By chance, one of the final bytes of the decompressed sequence may equal the random byte at that exact position, so the real decompressed size might be larger than the detected one. If your pattern is truly random, the difference should not be more than a few bytes.
Next, fill the buffer again with a different random pattern. Ensure that, at each position, the new pattern has a different value from the old one. For speed, you do not have to fill the whole buffer: you may limit the new pattern to a few bytes before and some more bytes after the first detected end. 32 bytes should be enough, since it is improbable that so many bytes match the first random pattern by chance.
Decompress a second time and detect again where the pattern differs. Take the larger of the two detected ends. That is your decompressed size.
You should check how free works for your compiler/OS
and do the same.
free doesn't take the size of the malloc'ed data, but it somehow knows how much to free, right ;)
Usually the size is stored just before the allocated buffer, though exactly how many bytes before depends on the OS/arch/compiler.

What's a good coding style for reading different bits of data from a binary file in C?

I'm a novice programmer and am writing a simple wav-player in C as a pet project. Part of the file loading process requires reading specific data (sampling rate, number of channels,...) from the file header.
Currently what I'm doing is similar to this:
Scan for a sequence of bytes and skip past it
Read 2 bytes into variable a
Check value and return on error
Skip 4 bytes
Read 4 bytes into variable b
Check value and return on error
...and so on. (code see: https://github.com/qgi/Player/blob/master/Importer.c)
I've written a number of helper functions to do the scanning/skipping/reading bit. Still, I'm repeating the reading, checking, skipping part several times, which seems neither very effective nor very smart. It's not a real issue for my project, but as this seems to be quite a common task when handling binary files, I was wondering:
Is there some kind of a pattern on how to do this more effectively with cleaner code?
Most often, people define structs (often with something like #pragma pack(1) to assure against padding) that matches the file's structures. They then read data into an instance of that with something like fread, and use the values from the struct.
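A minimal sketch of the packed-struct approach. The struct fmt_chunk and read_fmt names are hypothetical; the field layout follows the 16-byte body of a PCM WAV "fmt " chunk. Two assumptions are worth stating: #pragma pack is a compiler extension (supported by GCC, Clang and MSVC, but not standard C), and reading the bytes straight into the struct only gives correct values on a little-endian host, since WAV stores its fields little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)            /* no padding between members */
struct fmt_chunk {               /* hypothetical name; body of a WAV "fmt " chunk */
    uint16_t audio_format;       /* 1 = PCM */
    uint16_t num_channels;
    uint32_t sample_rate;
    uint32_t byte_rate;
    uint16_t block_align;
    uint16_t bits_per_sample;
};
#pragma pack(pop)

/* Reads one fmt chunk body from the current position of fp.
 * Returns 0 on success, -1 if fewer than sizeof(struct fmt_chunk) bytes
 * were available. Assumes a little-endian host. */
int read_fmt(FILE *fp, struct fmt_chunk *out)
{
    assert(sizeof *out == 16);   /* the packing worked as intended */
    return fread(out, sizeof *out, 1, fp) == 1 ? 0 : -1;
}
```

One fread replaces the whole read/skip/read sequence, and the checks can then be written against named fields such as out->num_channels.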
The cleanest option that I've come across is the scanf-like function unpack presented by Kernighan & Pike on page 219 of The Practice of Programming, which can be used like
// assume we read the file header into buf
// and the header consists of magic (4 bytes), type (2) and length (4).
// "l" == 4 bytes (long)
// "s" == 2 bytes (short)
unpack(buf, "lsl", &magic, &type, &length);
For efficiency, read into a buffer of, say, 4096 bytes and then parse the data in the buffer. A single forward-only pass over that buffer is the most efficient way to parse it.
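A minimal sketch of forward-only parsing from a buffer, using the magic (4 bytes), type (2) and length (4) header from the unpack example above. The helper and struct names are hypothetical, and little-endian byte order is an assumption (it is what WAV uses). Decoding byte by byte like this avoids both struct padding and host endianness issues:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode little-endian integers from a byte buffer, independent of host order. */
static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical header: magic (4 bytes), type (2), length (4). */
struct header {
    uint32_t magic;
    uint16_t type;
    uint32_t length;
};

/* Parses the 10-byte header at the start of buf in one forward pass.
 * Returns 0 on success, -1 if the buffer is too short. */
int parse_header(const unsigned char *buf, size_t len, struct header *h)
{
    assert(buf != NULL && h != NULL);
    if (len < 10)
        return -1;
    h->magic  = get_le32(buf);
    h->type   = get_le16(buf + 4);
    h->length = get_le32(buf + 6);
    return 0;
}
```

The buffer would typically be filled by a single fread of the header region, after which all checks operate on the decoded struct.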
