openssl aes256 encryption of a file - c

I'd like to encrypt a file with aes256 using OpenSSL with C.
I did find a pretty nice example here.
Should I first read the whole file into a memory buffer and then encrypt it with AES-256, or should I do it in chunks with a ~16K buffer?
Any snippets or hints?

Loading the whole file into a buffer can be anywhere from inefficient to impossible for larger files - do this only if all your files are below some size limit.
OpenSSL's EVP API (which the example you linked also uses) has an EVP_EncryptUpdate function, which can be called multiple times, each call providing some more bytes to encrypt. Use this in a loop together with reading the plaintext from a file into a buffer and writing the ciphertext out to another file (or the same one), roughly as in the sketch below. (Analogously for decryption.)
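For illustration, here is a minimal sketch of such a loop, assuming AES-256-CBC and that the 32-byte key and 16-byte IV have already been derived elsewhere; the function name, buffer size, and abbreviated error handling are placeholders, not a definitive implementation:

    /* Chunked encryption with the EVP API: read ~16K of plaintext,
     * feed it to EVP_EncryptUpdate, write out the ciphertext. */
    #include <stdio.h>
    #include <openssl/evp.h>

    int encrypt_file(FILE *in, FILE *out,
                     const unsigned char *key, const unsigned char *iv)
    {
        unsigned char inbuf[16 * 1024];                       /* plaintext chunk */
        unsigned char outbuf[16 * 1024 + EVP_MAX_BLOCK_LENGTH];
        int outlen;
        size_t n;

        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        if (!ctx)
            return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) != 1)
            goto err;

        while ((n = fread(inbuf, 1, sizeof inbuf, in)) > 0) {
            if (EVP_EncryptUpdate(ctx, outbuf, &outlen, inbuf, (int)n) != 1)
                goto err;
            fwrite(outbuf, 1, (size_t)outlen, out);
        }
        if (EVP_EncryptFinal_ex(ctx, outbuf, &outlen) != 1)   /* final padded block */
            goto err;
        fwrite(outbuf, 1, (size_t)outlen, out);

        EVP_CIPHER_CTX_free(ctx);
        return 0;
    err:
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }

A matching decryption loop would have the same shape, using EVP_DecryptInit_ex, EVP_DecryptUpdate and EVP_DecryptFinal_ex.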
Of course, instead of inventing a new file format (which is effectively what you are doing here), think about implementing the OpenPGP message format (RFC 4880). There is less chance of making a mistake that might destroy your security, and as an added bonus, if your program ever ceases to work, your users can always fall back on the standard tools (PGP or GnuPG) to decrypt the file.

It's better to reuse a fixed buffer, unless you know you'll always process small files - but I don't think that fits your definition of backup files.
I said "better" in a non-cryptographic sense :-) There won't be any difference in the end (for the encrypted file), but your computer might not like (or might not even be able) to load several MB (or GB) into memory.
Crypto-wise, the operations are done in blocks; for AES the block is 128 bits (16 bytes). So, for simplicity, you had better use a buffer size that is a multiple of 16 bytes. Otherwise the choice is yours. I would suggest buffers between 4 KB and 16 KB but, to be honest, I would test several values.

Related

How to properly work with a file when encoding and decoding it?

It doesn't matter exactly how I encrypt and decrypt the files. I operate on the file as a char array, and everything is almost fine until I get a file whose size is not divisible by 8 bytes. I can only encrypt and decrypt the file 8 bytes per round, because of the particulars of the algorithm (the block size must be 64 bits).
So then, for example, I came across a .jpg and tried simply adding spaces to the end of the file; the resulting file can't be opened (of course, with .txt files nothing bad happens).
Is there any way out here?
If you want information about the algorithm: http://en.wikipedia.org/wiki/GOST_(block_cipher).
UPD: I can't store how many bytes were added, because the initial file can be deleted or moved. And what are we supposed to do when we know only the key and have the encrypted file?
You need padding.
The best way to do this would be to use PKCS#7.
However, GOST is not that good; you would be better off using AES-CBC.
There is an ongoing similar discussion in "python-channel".
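For illustration, a minimal sketch of PKCS#7 padding for an 8-byte block cipher such as GOST might look like this; the function names are illustrative, and a real implementation should validate the pad bytes on removal. Each padding byte holds the number of bytes added, so the pad length can be recovered from the last byte alone, with no external bookkeeping:

    #include <string.h>

    /* Pads `len` bytes in `buf` up to the next multiple of `block`;
     * buf must have room for up to `block` extra bytes.
     * Returns the new, padded length. */
    size_t pkcs7_pad(unsigned char *buf, size_t len, size_t block)
    {
        size_t pad = block - (len % block);   /* always 1..block */
        memset(buf + len, (int)pad, pad);
        return len + pad;
    }

    /* Strips the padding after decryption; returns the original length. */
    size_t pkcs7_unpad(const unsigned char *buf, size_t len)
    {
        size_t pad = buf[len - 1];
        return len - pad;                     /* a real version validates pad */
    }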

How to read a large file in UNIX/LINUX with limited RAM?

I have a large text file to be opened (e.g. 5 GB in size), but only limited RAM (say 1 GB). How can I open and read the file without any memory error? I am running on a Linux terminal with the basic packages installed.
This was an interview question, hence please do not look into the practicality.
I do not know whether to look at it at the system level or the programmatic level... It would be great if someone could throw some light on this issue.
Thanks.
Read it character by character... or X bytes by X bytes... it really depends what you want to do with it... As long as you don't need the whole file at once, that works.
(Ellipses are awesome)
What do they want you to do with the file? Are you looking for something? Extracting something? Sorting? This will affect your approach.
It may be sufficient to read the file line by line or character by character if you're looking for something. If you need to jump around the file or analyze sections of it, then you most likely want to memory-map it. Look up mmap(). Here's a short article on the subject: memory mapped i/o
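For example, a minimal sketch of mapping a file read-only with mmap() on a 64-bit POSIX system might look like this; the path and the newline count are purely illustrative. The kernel pages data in on demand, so the whole 5 GB never has to sit in RAM at once:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/path/to/bigfile.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Example access: count newlines without an explicit read loop. */
        size_t lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (data[i] == '\n')
                lines++;
        printf("%zu lines\n", lines);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return 0;
    }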
If you are going to use system calls (open() and read()), then reading character by character will generate a lot of system calls that severely slow down your application. Even with the existence of the buffer cache (for disk files), system calls are expensive.
It is better to read block by block, where the block size "SHOULD" be 1 MB or more. With a 1 MB block size, you will issue 5*1024 = 5120 system calls for the 5 GB file (see the sketch below).
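As an illustration, a minimal block-by-block reading loop with open()/read() and a 1 MB buffer might look like this; the file name is illustrative, error handling is abbreviated, and the "processing" is left as a comment:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLOCK_SIZE (1024 * 1024)   /* 1 MB per system call */

    int main(void)
    {
        static char buf[BLOCK_SIZE];
        int fd = open("/path/to/bigfile.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* process buf[0..n) here */
            total += n;
        }
        if (n < 0) perror("read");

        printf("read %lld bytes\n", total);
        close(fd);
        return 0;
    }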

Performance of OpenSSL on large text

I am currently using OpenSSL in C on Linux. I would like to ask: is the cipher block size always fixed at 16 unsigned chars? I ask because I am encrypting a massively large amount of data. The problem is that, based on the description of SSL, the encrypted cipher block of the previous block is XORed with the next plaintext block when encrypting. Is there a way to increase the size of the cipher block?
For example, if I had a gigabyte of data to encrypt, it would take gigabyte/16 bytes rounds of encryption. Is there a standard way to force the AES_cbc_encrypt call to use an IV that is not 16 bytes and still follows the standard? The reason for this is that the encrypted text will be read by another program on another system that uses another CGC standard library.
No, the AES standard mandates a 16 byte block size. The original algorithm which it was based on, Rijndael, allowed more flexibility, but you can't rely on another AES implementation to support that.
For block ciphers, the block size is a property of the algorithm in question; DES used 8 bytes, for instance. Stream ciphers (RC4, for example), on the other hand, do not use fixed block sizes; they are effectively pseudo-random generators seeded with the key.
But anyway, the performance of ciphers is published with the block size in mind, and AES does about 160 MB/s on a recent CPU. You can measure this using "openssl speed aes".
So, if you want to encrypt 1Gb of data, I'd be more concerned with moving all that data from disk, to memory and back again, rather than the encryption speed of AES itself.
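For reference, a minimal sketch using the low-level AES_cbc_encrypt call from the question might look like this. The 16-byte block size and 16-byte IV are fixed by the standard and cannot be enlarged; the key and IV values here are placeholders, the input length must already be a multiple of 16 (this API does no PKCS padding), and the call is deprecated in favour of EVP in recent OpenSSL versions:

    #include <openssl/aes.h>

    void encrypt_buffer(const unsigned char *plain, unsigned char *cipher,
                        size_t len, const unsigned char key_bytes[32])
    {
        AES_KEY key;
        unsigned char iv[AES_BLOCK_SIZE] = {0};   /* AES_BLOCK_SIZE is always 16 */

        AES_set_encrypt_key(key_bytes, 256, &key);
        AES_cbc_encrypt(plain, cipher, len, &key, iv, AES_ENCRYPT);
    }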

Does fread fail for large files?

I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
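For example, a minimal sketch of saving and restoring a position with fgetpos/fsetpos, which use the opaque fpos_t type instead of long int and so are not limited to 2 GB offsets (the file name is illustrative):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("huge.bin", "rb");
        if (!f) { perror("fopen"); return 1; }

        fpos_t mark;
        char buf[4096];

        fgetpos(f, &mark);                 /* remember the current position */
        fread(buf, 1, sizeof buf, f);      /* read ahead somewhere */
        fsetpos(f, &mark);                 /* and jump back to the mark */

        fclose(f);
        return 0;
    }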
Thanks for the response. I figured out where I was going wrong. fseek() and ftell() do not work for files larger than 4GB. I used _fseeki64() and _ftelli64() and it is working fine now.
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to use (fread(), istream, etc.) and do some test cases to see which works better for you.
If you're on a POSIX-ish system, you'll need to make sure you've built your program with 64-bit file offset support. POSIX mandates (or at least allows, and most systems enforce this) that the implementation deny I/O operations on files whose size doesn't fit in off_t, even if the only I/O being performed is sequential with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
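For example, a minimal sketch of 64-bit seeking with fseeko/ftello on a POSIX system; the macro is defined before any #include (equivalent to passing it on the gcc command line), and the file name and offset are purely illustrative:

    #define _POSIX_C_SOURCE 200809L
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("huge.bin", "rb");
        if (!f) { perror("fopen"); return 1; }

        /* Jump 6 GiB into the file; off_t is 64-bit thanks to the macro. */
        if (fseeko(f, (off_t)6 * 1024 * 1024 * 1024, SEEK_SET) != 0) {
            perror("fseeko");
            return 1;
        }
        printf("now at offset %lld\n", (long long)ftello(f));

        fclose(f);
        return 0;
    }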

Parsing: load into memory or use stream

I'm writing a little parser and I would like to know the advantages and disadvantages of the different ways to load the data to be parsed. The two ways that I thought of are:
Load the file's contents into a string then parse the string (access the character at an array position)
Parse as reading the file stream (fgetc)
The former will allow me to have two functions, one for parse_from_file and one for parse_from_string; however, I believe this approach will take up more memory. The latter will not have that disadvantage of using more memory.
Does anyone have any advice on the matter?
Reading the entire file in or memory mapping it will be faster, but may cause issues if you want your language to be able to #include other files as these would be memory mapped or read into memory as well.
The stdio functions would work well because they usually try to buffer up data for you, but they are also general purpose, so they also try to look out for usage patterns which differ from reading a file from start to finish; still, that shouldn't be too much overhead.
A good balance is to have a large circular buffer (x * 2 * 4096 is a good size) which you load with file data and then have your tokenizer read from. Whenever a block's worth of data has been passed to your tokenizer (and you know that it is not going to be pushed back) you can refill that block with new data from the file and update some buffer location info.
Another thing to consider is if there is any chance that the tokenizer would ever need to be able to be used to read from a pipe or from a person typing directly in some text. In these cases your reads may return less data than you asked for without it being at the end of the file, and the buffering method I mentioned above gets more complicated. The stdio buffering is good for this as it can easily be switched to/from line or block buffering (or no buffering).
Using GNU flex (the fast lexical analyzer generator, not the Adobe Flash thing) or something similar can greatly ease the trouble with all of this. You should look into using it to generate the C code for your tokenizer (lexical analysis).
Whatever you do, you should try to make it so that your code can easily be changed to use a different form of the next-character peek and consume functions (see the sketch below), so that if you change your mind you won't have to start over.
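For illustration, a minimal sketch of such a peek/consume layer over stdio (the names are illustrative); the same two functions could later be re-implemented over an in-memory string or an mmap'd region without touching the tokenizer:

    #include <stdio.h>

    typedef struct {
        FILE *fp;
        int lookahead;        /* next character, or EOF; filled lazily */
        int has_lookahead;
    } Input;

    /* Returns the next character without consuming it. */
    static int input_peek(Input *in)
    {
        if (!in->has_lookahead) {
            in->lookahead = fgetc(in->fp);
            in->has_lookahead = 1;
        }
        return in->lookahead;
    }

    /* Returns the next character and advances past it. */
    static int input_consume(Input *in)
    {
        int c = input_peek(in);
        in->has_lookahead = 0;
        return c;
    }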
Consider using lex (and perhaps yacc, if your language's grammar matches its capabilities). Lex will handle all the fiddly details of lexical analysis for you and produce efficient code. You can probably beat its memory footprint by a few bytes, but how much effort do you want to expend on that?
On a POSIX system, the most efficient approach would probably be neither of the two (or a variant of the first, if you like): just map the file read-only with mmap and parse it from there. Modern systems are quite efficient at this, in that they prefetch data when they detect streaming access, multiple instances of your program parsing the same file will share the same physical pages of memory, and so on. And the interfaces are relatively simple to handle, I think.
