zlib: how to dimension avail_out - c

I would like to deflate a small block of memory (<= 16 KiB) using zlib. The output is stored in a block of memory as well. No disk or database access here.
According to the documentation, I should call deflate() repeatedly until the whole input is deflated. In between, I have to increase the size of the memory block where the output goes.
However, that seems unnecessarily complicated and perhaps even inefficient. As I know the size of the input, can't I predetermine the maximum size needed for the output, and then do everything with just one call to deflate()?
If so, what is the maximum output size? I assume something like: size of input + some bytes overhead

zlib has a function that calculates the maximum size a buffer can deflate to. Your assumption is correct: the returned value is the size of the input buffer plus a small overhead for headers and block framing. After deflation you can realloc the buffer to reclaim the 'wasted' memory.
From zlib.h:
ZEXTERN uLong ZEXPORT deflateBound OF((z_streamp strm, uLong sourceLen));
/*
deflateBound() returns an upper bound on the compressed size after
deflation of sourceLen bytes. It must be called after deflateInit() or
deflateInit2(), and after deflateSetHeader(), if used. This would be used
to allocate an output buffer for deflation in a single pass, and so would be
called before deflate().
*/

Related

How is the buffer allocated by gzbuffer used in zlib?

Trying to understand how gzbuffer is used in zlib. This is cut from the manual (https://www.zlib.net/manual.html):
ZEXTERN int ZEXPORT gzbuffer OF((gzFile file, unsigned size));
Set the internal buffer size used by this library's functions. The default buffer size is 8192 bytes. This function must be called after gzopen() or gzdopen(), and before any other calls that read or write the file. The buffer memory allocation is always deferred to the first read or write. Three times that size in buffer space is allocated. A larger buffer size of, for example, 64K or 128K bytes will noticeably increase the speed of decompression (reading).
The new buffer size also affects the maximum length for gzprintf().
gzbuffer() returns 0 on success, or –1 on failure, such as being called too late.
So three times the buffer size is allocated. When I call gzwrite, is compressed data written to the buffer (compression is done at every call to gzwrite), or is uncompressed data written to the buffer (and compression is then delayed until the buffer is filled and gzflush is called internally, or I call gzflush myself)?
When I continue to call gzwrite, what happens when the buffer is filled? Is there some allocation of new buffer memory at this point or is the buffer simply flushed to the file and then re-used?
When reading, a size input buffer is allocated, and a 2*size output buffer is allocated. When writing, the same thing, but reversed.
If len is less than size in gzwrite(state, buf, len), then the provided data goes into the input buffer. That input buffer is compressed once it has accumulated size bytes. If len is greater than or equal to size what remains in the buffer is compressed, followed by all of the provided len data. If a flush is requested, then all of the data in the input buffer is compressed.
Compressed data is accumulated in the size output buffer, which is written every time size compressed bytes have been accumulated, or when the gzFile is flushed or closed.

How to read a binary into an array

Say I have a 90 megabyte file. It's not encrypted, but it is binary.
I want to store this file into a table as an array of byte values so I can process the file byte by byte.
I can spare up to 2 GB of ram, so something with a thing like jotting down what bytes have been processed, which bytes have yet to be processed, and the processed bytes, would all be good. I don't exactly care about how long it may take to process.
How should I approach this?
Note I've expanded and rewritten this answer due to Egor's comment.
You first need the file open in binary mode. The distinction is important on Windows, where the default text mode will change line endings from CR+LF into C newlines. You do this by specifying a mode argument to io.open of "rb".
Although you can read a file one byte at a time, in practice you will want to work through the file in buffers. Those buffers can be fairly large, but unless you know you are handling only small files in a one-off script, you should avoid reading the entire file into a buffer with file:read"*a" since that will cause various problems with very large files.
Once you have a file open in binary mode, you read a chunk of it using buffer = file:read(n), where n is an integer count of bytes in the chunk. Using a moderately sized power of two will likely be the most efficient. The return value will either be nil, or will be a string of up to n bytes. If less than n bytes long, that was the last buffer in the file. (If reading from a socket, pipe, or terminal, however, reads of less than n may only indicate that no data has arrived yet, depending on lots of other factors too complex to explain in this sentence.)
The string in buffer can be processed any number of ways. As long as #buffer is not too big, then {buffer:byte(1,-1)} will return an array of integer byte values for each byte in the buffer. Too big partly depends on how your copy of Lua was configured when it was built, and may depend on other factors such as available memory as well. #buffer > 1E6 is certainly too big. In the example that follows, I used buffer:byte(i) to access each byte one at a time. That works for any size of buffer, at least as long as i remains an integer.
Finally, don't forget to close the file.
Here's a complete example, lightly tested. It reads a file a buffer at a time, and accumulates the total size and the sum of all bytes. It then prints the size, sum, and average byte value.
-- sum all bytes in a file
local name = ...
assert(name, "Usage: "..arg[0].." filename")
file = assert(io.open(name, "rb"))
local sum, len = 0,0
repeat
  local buffer = file:read(1024)
  if buffer then
    len = len + #buffer
    for i = 1, #buffer do
      sum = sum + buffer:byte(i)
    end
  end
until not buffer
file:close()
print("length:",len)
print("sum:",sum)
print("mean:", sum / len)
Run with Lua 5.1.4 on my Windows box using the example as its input, it reports:
length: 402
sum: 30374
mean: 75.557213930348
To split the contents of a string s into an array of bytes use {s:byte(1,-1)}.

How to save a specific length string from a file and work with it in C

So what I'm trying to do is open a file and read it until the end in blocks that are 256 bytes long each time it is called. My dilemma is using fgets() or fread() to do it.
I was using fgets() initially, because it returns a string of the bytes that were read, which is great because I can store that data and work with it. However, in the particular file that I'm reading, the 256 bytes often span more than 2 lines, which is a problem because fgets() stops reading when it hits a newline character or the end of the file.
I then thought of using fread(), but I don't know how to save the line that I'm referring to with it because fread() returns an int referring to the number of elements successfully read (according to its documentation).
I've searched and thought of solutions for a while now and can't find anything that works with my particular scenario. I would like some guidance on how to go about this issue, how would you go about this in my position?
You can use fread() to read each 256 bytes block and keep a lineCount variable to keep track of the number of new line characters you have encountered so far in the input. Since you have to process the blocks already this wouldn't mean much of an overhead in the processing.
To read a block of 256 chars, which is what I think you are doing, you just need to create a buffer of chars that can hold 256 of them, in other words a char array of size 256.
#define BLOCK_SIZE 256
char block[BLOCK_SIZE];
Then if you check the documentation for fread() it shows the following signature:
Following is the declaration for fread() function.
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream)
Parameters
ptr -- This is the pointer to a block of memory with a minimum size of size*nmemb bytes.
size -- This is the size in bytes of each element to be read.
nmemb -- This is the number of elements, each one with a size of size bytes.
stream -- This is the pointer to a FILE object that identifies an input stream.
So this means it takes a pointer to the buffer where it will write the read information, the size of each element it's supposed to read, the maximum amount of elements you want it to read and the file pointer. In your case it would be:
size_t read = fread(block, sizeof(char), BLOCK_SIZE, file);
This will copy the information from the file to the block array, which you can later process while keeping track of the lines. The characters that were read by fread are in the block array, so the first char in the last read block is block[0], the second is block[1], and so on. The returned value in read indicates how many elements (in your case chars) were stored in block by that call to fread. This number will be equal to BLOCK_SIZE for every call, unless you reach the end of the file or there's an error.
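Putting the pieces above together, a minimal sketch of the block-reading loop might look like this. The function name count_lines_in_blocks is hypothetical, and counting newlines stands in for whatever per-block processing is actually needed.

```c
#include <stdio.h>

#define BLOCK_SIZE 256

/* Read a file in BLOCK_SIZE-byte blocks and count newline characters.
 * Returns the newline count, or -1 if the file cannot be opened or a
 * read error occurs. (count_lines_in_blocks is a hypothetical name.) */
long count_lines_in_blocks(const char *path)
{
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    char block[BLOCK_SIZE];
    long line_count = 0;
    size_t n;

    /* fread returns the number of elements read; it is less than
     * BLOCK_SIZE only for the last block or on error. */
    while ((n = fread(block, sizeof(char), BLOCK_SIZE, file)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (block[i] == '\n')
                line_count++;
        }
    }

    int failed = ferror(file);
    fclose(file);
    return failed ? -1 : line_count;
}
```

Note that block is not null-terminated, so it should be processed using the count n rather than string functions such as strlen.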
I suggest you read some documentation for a full example, play a little with the code and do some reading on pointers in C to gain a better understanding of how everything works in general. If you still have questions after that, we can take it from there or you can create a new SO question.

How to manage scatterlist for Linux crypto api use?

I need to (de)cipher some data at a time. Extra padding bytes may have to be added to the target data bytes at the beginning and at the end. The built-in crypto API works on struct scatterlist objects, as you can see with the definition of the encrypt method of a block cipher :
int (*encrypt)(struct blkcipher_desc *desc, struct scatterlist *dst,
struct scatterlist *src, unsigned int nbytes);
Now the procedure I am following for the ciphering operation :
Get a data buffer buf (length L)
Compute left padding and right padding bytes (lpad and rpad)
Cipher the whole thing (lpad and buf and rpad)
Get rid of the padding bytes in the result
The simplest but inefficient solution would be to allocate L + rpad + lpad bytes and copy the buffer's content into this new area appropriately. But since the API uses those scatterlist objects, I was wondering if there was a way to avoid this pure waste of resources.
I read a couple of articles on LWN about scatterlist chaining but a quick glance at the header file worries me : it looks like I have to manually set up the whole thing, which is a pretty bad practice ...
Any clue on how to use the scatterlist API properly ? Ideally, I would like to do the following :
Allocate buffers for the padding bytes for both input and output
Allocate a "payload" buffer that will only store the "useful" ciphered bytes
Create the scatterlist objects that includes padding buffers and target buffer
Cipher the whole and store the result in output padding buffers + output "payload" buffer
Discard the input and output padding buffers
Return the ciphered "payload" buffer to the user
First, sorry for my poor English; I am not a native speaker. I think you are looking for the kernel API blkcipher_walk_virt. You can see how it is used in crypto_ecb_crypt in ecb.c, and you can also look at padlock_aes.c.
After having investigated through the code, I found a suitable solution. It follows quite well the procedure I listed in my question, though there are some subtle differences.
As suggested by JohnsonDiao, I dived into the scatterwalk.c file to see how the Crypto API was making use of the scatterlist objects.
The problem that arose concerns the "boundary" between two consecutive scatterlist entries. Let's say I have two chained scatterlists. The first one describes a 12-byte buffer, the second a 20-byte buffer. I want to encrypt the two buffers as a whole using AES128-CTR.
In this particular case, the API will :
Encrypt the 12 bytes of the buffer referenced by the first scatterlist.
Increment the counter
Encrypt the 16 first bytes of the second scatterlist
Increment the counter
Encrypt the last remaining 4 bytes
The behaviour I would have expected was :
Encrypt the 12 bytes of first buffer + the 4 first bytes of second buffer
Increment the counter
Encrypt the last 16 bytes of the second buffer
Thus, to enforce this, one must allocate a 16-byte aligned padding buffer in the pattern :
Let npad be the number of padding bytes needed for the requested encryption. Then we have :
Where lbuf is the total length of the padding buffer. Now, the last lbuf - npad bytes must be filled with the first input data bytes. If the input is too short to fill them completely, that does not matter.
Therefore we copy the first lcpy = min(lbuf - npad, ldata) bytes at the offset npad in the padding buffer
In short, here is the procedure :
Allocate the appropriate padding buffer with length lbuf
Copy the first lcpy bytes of the payload buffer at offset npad in the padding buffer
Reference the padding buffer in a scatterlist
Reference the payload buffer in another scatterlist (with a lcpy shift)
Ask for the ciphering
Extract the payload bytes present in the padding buffer
Discard the padding buffer
I tested this and it seemed to work perfectly.
I am also learning this part, and this is my analysis:
If your encryption device needs to cipher 16 bytes at once, you should set the alignment mask to (16-1), just as padlock_aes.c does; see ecb_aes_alg.cra_alignmask. The kernel handles this in blkcipher_next_copy and blkcipher_next_slow.
But I am puzzled: in aes_generic.c the alignmask is 3, so how does the kernel handle this without blkcipher_next_copy?

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?
You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
...EOF or other error...
else
...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.
If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...
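As a rough sketch of that fscanf approach: the scanset stops at the first non-digit, and the width of 100 caps how much is stored, so the buffer needs 101 bytes for the terminator. The names read_digits and MAX_DIGITS are hypothetical.

```c
#include <stdio.h>

#define MAX_DIGITS 100

/* Read up to MAX_DIGITS decimal digits from fp into digits, which must
 * hold MAX_DIGITS + 1 bytes. fscanf null-terminates the result.
 * Returns 0 on success, -1 if no digit was found.
 * (read_digits is a hypothetical helper name.) */
int read_digits(FILE *fp, char digits[MAX_DIGITS + 1])
{
    /* The field width in the format must match MAX_DIGITS. */
    return fscanf(fp, "%100[0123456789]", digits) == 1 ? 0 : -1;
}
```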
One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, 0, SEEK_END);
and then using ftell to get the position of the cursor in the file — this returns the position in bytes so you can then use this value to allocate a suitably large buffer and then read the file into that buffer.
I have seen warnings saying this may not always be 100% accurate, but I've used it in several instances without a problem. I think the issues could be dependent on specific implementations of the functions on certain platforms.
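A minimal sketch of the seek-and-tell approach, assuming the file is a regular file opened in binary mode (where ftell reports a byte offset). The function name read_whole_file is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a malloc'd, null-terminated buffer.
 * Stores the number of bytes read in *size_out and returns the buffer,
 * or returns NULL on failure. The caller frees the buffer.
 * (read_whole_file is a hypothetical helper name.) */
char *read_whole_file(const char *path, long *size_out)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    char *buf = NULL;
    if (fseek(fp, 0, SEEK_END) == 0) {          /* jump to the end */
        long size = ftell(fp);                  /* offset = size in bytes */
        if (size >= 0 && fseek(fp, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);     /* +1 for the terminator */
            if (buf != NULL) {
                size_t got = fread(buf, 1, (size_t)size, fp);
                buf[got] = '\0';
                *size_out = (long)got;
            }
        }
    }
    fclose(fp);
    return buf;
}
```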
Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So,
- start with a variable initialized to 0.
- Read a character, multiply variable by 10 and add new digit.
- Then repeat until done.
Get a bigger sized variable as needed along the way or start with your largest sized variable and then copy it into the smallest size that fits after you finish.
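The steps above can be sketched as follows, starting with the largest unsigned integer type and rejecting input that would overflow it. The function name read_number is hypothetical, and the overflow check is an addition not mentioned in the answer.

```c
#include <limits.h>
#include <stdio.h>

/* Read decimal digits one character at a time and accumulate them.
 * Stops at the first non-digit or EOF. Returns 0 on success, or -1 if
 * there were no digits or the value would not fit.
 * (read_number is a hypothetical helper name.) */
int read_number(FILE *fp, unsigned long long *out)
{
    unsigned long long value = 0;   /* start with a variable set to 0 */
    int digits = 0;
    int c;

    while ((c = getc(fp)) != EOF && c >= '0' && c <= '9') {
        unsigned digit = (unsigned)(c - '0');
        /* Refuse to continue if value * 10 + digit would overflow. */
        if (value > (ULLONG_MAX - digit) / 10)
            return -1;
        value = value * 10 + digit; /* multiply by 10, add the new digit */
        digits++;
    }

    if (digits == 0)
        return -1;
    *out = value;
    return 0;
}
```

This uses only a few bytes of stack regardless of how long the number is, which suits the embedded constraint in the question.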
