How to determine the actual usage of a malloc'ed buffer - c

I have some compressed binary data and an API call to decompress it which requires a pre-allocated target buffer. There is not any means via the API that tells me the size of the decompressed data. So I can malloc an oversized buffer to decompress into but I would like to then resize (or copy this to) a memory buffer of the correct size. So, how do I (indeed can I) determine the actual size of the decompressed binary data in the oversized buffer?
(I do not control the compression of the data so I do not know in advance what size to expect and I cannot write a header for the file.)

As others have said, there is no good way to do this if your API doesn't provide it.
I almost don't want to suggest this for fear that you'll take this suggestion and have some mission-critical piece of your application depend on it, but...
A heurstic would be to fill your buffer with some 'poison' pattern before decompressing into it. Then, after decompression, scan the buffer for the first occurrence of the poison pattern.
This is a heuristic because it's perfectly conceivable that the decompressed data could just happen to have an occurrence of your poison pattern. Unless you have exact domain knowledge of what the data will be, and can choose a pattern specifically that you know cannot exist.
Even still, an imperfect solution at best.

Usually this information is supplied at compression time (take a look at 7-zips LZMA SDK for example).
There is no way to know the actual size of the decompressed data (or the size of the part that is actually in use) with the information you are giving now.

If the decompression step doesn't give you the decompressed size as a return value or "out" parameter in some way, you can't.
There is no way to determine how much data was written in the buffer (outside of debugger/valgrind-type checks).

A complex way to answer this problem is by decompressing twice into an over-sized buffer.
In both cases, you need a "random pattern". Starting from the end, you count the number of bytes which correspond to the pattern, and detect the end of decompressed sequence where it differs.
Or does it ? Maybe, by chance, one of the final byte of the decompressed sequence corresponds to the random byte at this exact position. So the final decompressed size might be larger than the detected one. If your pattern is truly random, it should not be more than a few bytes.
You need to fill again the buffer with a random pattern, but a different one. Ensure that, at each position, the new random pattern has a different value than the old random pattern. For faster speed, you are not obliged to fill the full buffer : you may limit the new pattern to a few bytes before and some more bytes after the 1st detected end. 32 bytes shall be enough, since it is improbable that so many bytes does correspond by chance to the first generated random pattern.
Decompress a second time. Detect again where the pattern differ. Take the larger of the two values between the first and second end detection. It is your decompressed size.

you should check how free works for your compiler/os
and do the same.
free doesn't take the size of the malloced data, but it somehow knows how much to free right ;)
usually the size is stored before the allocated buffer, don't know though exactly how maby bytes before again depending on the os/arch/compiler

Related

How to decide a buffer's size

I have a program which it's purpose is to read from some input text file,filter all chars which are printable (i.e., ASCII between 32 and 126) into some other output text file.
I also get as an argument "DataAmount"-which means whats the amount of data I need to read-May be 1B,1K,1M,1G,80000B, etc.(Any natural number can be before the unit).
It is NOT the size of the input file,it is how much I need to read from the input file.And if the input file is smaller than the DataAmount,I need to re read the file,untill I read exactly DataAmount bytes.
For the filtering,I read from the input file into some buffer.I filter from the buffer into some other buffer the printable chars,and write from that buffer to the output file(both buffers are in the same size).
Ther question is,how can I decide what size is the best for those two buffers,so there will be a minimal calls for read() and write()?
(NOTE: I won't write the whole data in one time since it may be too big,and I won't write each byte at a time.I write from the outbuff to the output file only when the buffer is full).
At the moment,I build the buffer size only depends on the unit:
If it's B or K,the size will be 1024.
If it's M or G,the size will be 4096.
This is not good at all,since for 1B and 100000B I'll have the same size of the buffer.
How can I improve this?
My personal experience is that the buffer size does not matter much as long as you are using a few kilobytes.
As you noted in your question, there is overhead in doing system calls, so doing I/O one character at a time is not terribly efficient, and you can cut that overhead down by reading and writing larger blocks. However, there are other things that take time, and any reasonable amount of buffering will drop your system call overhead down to the point where it is the other other things that are taking most of the time. At that point larger buffers do not make the program significantly faster. There are also problems with making a buffer too large, so you can err in that direction too.
I would not make the buffer size dynamic as you are doing. It introduces needless complexity into the program. You can verify that by running your program with different buffer sizes, and see what kind of difference it makes.
As for the actual value to use, the stdio.h header file defines the macro BUFSIZ which is the default size for stdio buffers. That macro is a reasonable size to use.
Also note that if you are using the stdio functions to do your I/O, they already provide buffering (if you're not calling the system calls read() and write() directly, you're using stdio.) There isn't really a reason to buffer the data twice, so you can either do the I/O one character at a time and let the stdio buffers take care of it for you, or disable the stdio buffering with setvbuf().
If you know the input previously you can to some statistics and get the average, so it's not a fixed size but an approximation.
But I recommend to you: don't worry about read and close syscalls. If you read a very little data from the imput and your buffer is high, you waste some bytes. If you get a big input and have a little buffer, you only have to do some extra iterations.
A medium size for the buffer would be good. For example, 512.
Once you decide on the unit, then decide if the number of units needs extra buffer size. Thus, once you have found the B, check the value. That way you would not have to split the smaller units.
You can do a switch statement on the unit indicators, and then process within each case, based on the numeric value of that unit. As an example, for the B do an integer divide of the maximum and set the actual buffer size based on the result (again in a switch/case sequence).

Zlib minimum deflate size

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to zlib, deflate: How much memory to allocate?, but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers: INPUT and OUTPUT, and we have a BUFFER_SIZE, which is - say, 4096 bytes. (Just a convenient number, no particular reason I choose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
Superficially, we'd assume that the DEFLATED data will always be larger than the uncompressed input data (assuming we use a compression level that is greater than 0.)
Of course, that's not always the case - especially for small values. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.
Thinking about how LZW works, it would seem if our input data is at least 256 bytes (meaning that worst case scenario, every single byte is different and we can't really compress anything), we should realize that input size LESS than 256 bytes + zlib headers could potentially require a LARGER output buffer.
But, generally, realworld applications aren't going to be compressing small sizes like that. So assuming an input/output buffer of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.
is there some way to GUARANTEE that the output compressed data will be
SMALLER than the input data?
Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise you wouldn't be able to compress other input.
(Also, I know about deflateBound, but would rather avoid it because of
the overhead.)
Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.
By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending and delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
due to the overhead of things like headers and dictionaries in the LZW
stream.
deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
Or, to put it another way, is there some minimum buffer size that I
can use for input/output buffers ...
The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need to make more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops to make more calls, provide more input when needed, and disposition the output when needed and provide fresh output space.
Information theory dictates that there must always be pathological cases which "compress" to something larger.
This page starts off with the worst case encoding sizes for zlib - looks like the worst case growth is 6 bytes, plus 5 bytes per started 16KB block. So if you always flush after less than 16KB, having buffers which are 11 bytes plus your flush interval should be safe.
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.

In what situation would compressed data be larger than input?

I need to handle compression of data that's largely UTF-8 HTML content in a utility I'm working on. The utility uses zLib and the deflate algorithm to compress data. Is it safe to assume that if the input data size is over 1 kB, compressed data will always be smaller than uncompressed input? (Input data below 1 kB will not be compressed.)
I'm trying to see situations where this assumption would break but apart from near-perfect random input, it seems a safe assumption to me.
Edit: the reason I'm wondering about this assumption is because I already have a buffer allocated that's as big as the input data. If my assumption holds, I can reuse this same buffer and avoid another memory allocation.
No. You can never assume that the compressed data will always be smaller. In fact, if any sequence is compressed by the algorithm, then you are guaranteed that some other sequence is expanded.
You can use zlib's deflate() function to compress as much as it can into your 1K buffer. Do whatever you need to with that result, then continue with another deflate() call writing into that same buffer.
Alternatively you can allocate a buffer big enough for the largest expansion. The deflateBound() or compressBound() functions will tell you how much that is. It's only a small amount more.
As far as I know, a sequence of 128 bytes with values 0, 1, 2, ..., 127 will not be compressed by zLib. Technically, it's possible to intentionally create an HTML page that will break your compression scheme, but with normal innocent HTML data you should be almost perfectly safe.
But almost perfectly is not perfectly. If you already have a buffer of that size, I'd advise to attempt the compression with this buffer, and if it turns out that the buffer is not enough (I suppose zLib has means of indicating that), then allocate a larger buffer or simply store an uncompressed version. And make sure you are writing these cases into some log so you could see if it ever fires :)

What is more efficient, reading word by word from file or reading a line at a time and splitting the string using C ?

I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well be reading it in as large chunks as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because less file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.
However it would impact in case of word in a way that you would need to allocate buffers M times, while in case of lines it will be N times, where M>N. So if you are adopting word split approach, try to calculate total memory need first, allocate that much chunk (so you don't end up with fragmented M chunks), and later get M buffers from that chunk. Note that same approach can be applied in lines split but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The less calls, the less overhead.
Dynamic memory allocation. Should be done as scarcely as possible. Ultimately, a lot of calls to malloc will cause heap segmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
To one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and upload it to dynamic memory. Or to the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
Not actually answer to your exact question (words vs lines), but if you need all words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read entire file to the buffer, and put '\0' to the extra byte.
make a pass over it and count how many words it has
allocate char* (pointers to words) or int (indexes to buffer) index array, with size matching word count
make 2nd pass over buffer, and store addresses or indexes to the first letters of words to the index array, and overwrite other bytes in buffer with '\0' (end of string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for number of words: (filesize+1) / 2 (one letter words with one space in between, with odd number of bytes in file). Also taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double it's size when word count exceeds current capacity, will be quite efficient (due to doubling=exponential growth, reallocation will not happen many times).

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution, or is takes too long. Here is a list of the situation I'm working with:
All byte strings will be at least
two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs. (I.E. unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
smaller dictionary means smaller files, you have to write the dictionary to your file
smaller number of items can free up space to hold other, more rare sequences if you add a population cap and you hit it with a large file.

Resources