How to encrypt/decrypt long input messages with RSA? [Openssl, C] - c

I wrote a simple test program that encrypts/decrypts a message.
I have a keylength:
int keylength = 1024; // it can also be 2048, 4096
and max input length:
int maxlen = (keylength/8)-11;
and I know that my input size should be < than maxlen, something like this:
if(insize >= maxlen)
printf("cannot encrypt/decrypt!\n");
My question is simple - is it possible (if so, how can I do this) to encrypt/decrypt with RSA messages LONGER than maxlen?
Main code is also, very simple but works only when insize < maxlen:
if((encBytes=RSA_public_encrypt(strlen(buff1)+1, buff1, buff2, keypair, RSA_PKCS1_PADDING)) == -1)
{
printf("error\n");
}
if((decBytes=RSA_private_decrypt(encBytes, buff2, buff3, keypair, RSA_PKCS1_PADDING)) == -1)
{
printf("error\n");
}

Encrypting long messages requires combined scheme - RSA algorithm encrypts session key (i.e. AES key), and data itself is encrypted with that key.
I would recommend to not invent another bicycle and use well established scheme, i.e. PKCS#7/CMS or OpenPGP, depending on your needs.

You would be able to encrypt long messages with RSA the same way as it is done with block ciphers. That is, encrypt the messages in blocks and bind the blocks with an appropriate chaining mode. However, this is not the usual way to do it and you won't find support for it (RSA chaining) in the libraries you use.
Since RSA is quite slow, the usual way to encrypt large messages is using hybrid encryption. In hybrid encryption you use a fast symmetric encryption algorithm (like AES) for encrypting the data with a random key. The random key is then encrypted with RSA and send along with the symmetric key encrypted data.
EDIT:
As fore your implementation, you have insize = 1300 and keylength = 1024 which gives maxlen = 117. To encrypt the full message you those needs 12 encrypts, that each produce 128 bytes, giving an encrypted size of 1536 bytes. In your code you only allocates buffers of 1416 bytes. Also, you don't seem to allow for 128 bytes output as you only increment with 117 in:
RSA_public_encrypt(maxlen, buff1+i, buff2+i, keypair, RSA_PKCS1_PADDING)
and
i += maxlen;

You can use RSA as block cipher in that case. That is break the message to several blocks smaller than maxlen.
Otherwise impossible.

If you want to run RSA in a "block cipher" kind of mode, you would need to run it in a loop.
Like most of the other commenters, I'd like to point out that this is a bad use of RSA - You should just encrypt a AES key with RSA then use AES for the longer message.
However, I'm not one to let practicality get in the way of learning, so here's how you'd do it. This code isn't tested, since I don't know what libraries you are using. It's also a little overly-verbose, for readability.
int inLength = strlen(buff1)+1;
int numBlocks = (inLength / maxlen) + 1;
for (int i = 0; i < numBlocks; i++) {
int bytesDone = i * maxlen;
int remainingLen = inLength - bytesDone;
int thisLen; // The length of this block
if (remainingLen > maxlen) {
thisLen = maxlen;
} else {
thisLen = remainingLen;
}
if((encBytes=RSA_public_encrypt(thisLen, buff1 + bytesDone, buff2 + bytesDone, keypair, RSA_PKCS1_PADDING)) == -1)
{
printf("error\n");
}
// Okay, IDK what the first parameter to this should be. It depends on the library. You can figure it out, hopefully.
if((decBytes=RSA_private_decrypt(idk, buff2 + bytesDone, buff3 + bytesDone, keypair, RSA_PKCS1_PADDING)) == -1)
{
printf("error\n");
}
}

maxlen actually depends on a key length and padding mode. Think newer padding scheme ´OAEP´ e.g. in Java Encryption Engine takes additional 42 bytes instead of 11. Known libraries are not designed for using RSA in a block cipher mode.
For that purpose, beyond fragmentation as answered above, security aspects require further modification of the padding scheme, e,g, https://crypto.stackexchange.com/a/97974/98888

Related

OpenSSL EVP_aes_128_cbc decryption got unexpected result

I encrypted a string with EVP_aes_128_cbc cipher, then changed the 1st byte of the ciphertext, and decrypt this changed ciphertext. Unexpectly, it didn't decrypt error or get a fully wrong result, but got a wrong 1st 16 bytes and same string of rest. Here is the encrypt func:
int do_crypt1(unsigned char *in, unsigned char *outbuf, int inlen, int do_encrypt)
{
unsigned char inbuf[1024];
int outlen;
EVP_CIPHER_CTX *ctx;
/*
* Bogus key and IV: we'd normally set these from
* another source.
*/
unsigned char key[32] = "test";
unsigned char iv[] = "1234567812345678";
/* Don't set key or IV right away; we want to check lengths */
ctx = EVP_CIPHER_CTX_new();
EVP_CipherInit_ex(ctx, EVP_aes_128_cbc(), NULL, NULL, NULL,
do_encrypt);
OPENSSL_assert(EVP_CIPHER_CTX_key_length(ctx) == 16);
OPENSSL_assert(EVP_CIPHER_CTX_iv_length(ctx) == 16);
/* Now we can set key and IV */
EVP_CipherInit_ex(ctx, NULL, NULL, key, iv, do_encrypt);
int update_len = 0;
for (;;) {
memcpy(inbuf, in, inlen);
if (inlen <= 0)
break;
if (!EVP_CipherUpdate(ctx, outbuf, &outlen, inbuf, inlen)) {
/* Error */
EVP_CIPHER_CTX_free(ctx);
return 0;
}
update_len += outlen;
if(inlen <= 1024)
break;
}
if (!EVP_CipherFinal_ex(ctx, outbuf+outlen, &outlen)) {
/* Error */
EVP_CIPHER_CTX_free(ctx);
return 0;
}
EVP_CIPHER_CTX_free(ctx);
update_len += outlen;
return update_len;
}
and in the main:
unsigned char* b = (unsigned char*)malloc(1024);
unsigned char* hmac_code = (unsigned char*)malloc(1024);
int len = do_crypt1(value, hmac_code, strlen(value), AES_ENCRYPT);
printf("%s\n", value);
for (i = 0; i < len; i++)
printf("%x", *(hmac_code + i));
printf("\n");
printf("%d\n", len);
// printf("%s\n", hmac_code);
*hmac_code = 0x23;
len = do_crypt1(hmac_code, b, len, AES_DECRYPT);
printf("%d\n", len);
printf("%s\n", b);
free(hmac_code);
free(b);
result
Could anyone give me the reason and how to resolve this?
Actually as long as your plaintext is at least 17 bytes, the first 17 decrypted bytes should be wrong -- but since you're displaying the invalid decryption as text, some of the bytes may be invisible: in your example, the output is 2 chars shorter, so clearly 2 chars of the first 17 aren't visible. That's the expected result for damage in (at the beginning of) the first block of a 2-block (or more) CBC ciphertext for a 16-byte-block cipher like AES; see the diagram in wikipedia to understand why.
If you want detection of damage to the ciphertext, don't use CBC mode, or at least don't use it alone. Common/standard solutions nowadays are:
add authentication, such as HMAC. (Intriguingly your code already uses the variable name hmac_code even though you don't do anything even remotely related to HMAC.)
use an authenticated-encryption mode, like GCM, which effectively combines encryption and (some type of) MAC internally. Most authenticated modes today, including GCM, also support 'additional' or 'associated' data that is authenticated but not encrypted, and as a result are called AEAD, but you may or may not care about that.
use an error-propagating mode. These were popular decades ago, around the time of presidents Nixon, Ford, Carter, and Reagan, but are now mostly considered obsolete and are not directly supported by OpenSSL.
Also, BTW, your code is completely pants for values longer than 1024 bytes, which your test clearly doesn't exercise. First of all you don't need to break up such a value into chunks at all, but if for some reason you want to, the method you implemented is wrong.
Plus, the IV for CBC should be different and unpredictable every time; using a hardcoded value like this exposes you to two different classes of attacks. In general the advice you will get on Stacks actually related to security is 'don't roll you own'. Cryptographic code written by people who don't know what they're doing, even if/when it produces the correct output, is usually insecure. This is the major difference between crypto/security software and others; you can easily see if your editor or spreadsheet or database produces correct or incorrect output, and as long as the output is correct that's usually all you need, but you can't tell by looking at the output from an encryption/decryption program whether it is secure or not.

Different encryption results between platforms, using OpenSSL

I'm working on a piece of cross-platform (Windows and Mac OS X) code in C that needs to encrypt / decrypt blobs using AES-256 with CBC and a blocksize of 128 bits. Among various libraries and APIs I've chosen OpenSSL.
This piece of code will then upload the blob using a multipart-form PUT to a server which then decrypts it using the same settings in .NET's crypto framework (Aes, CryptoStream, etc...).
The problem I'm facing is that the server decryption works fine when the local encryption is done on Windows but it fails when the encryption is done on Mac OS X - the server throws a "Padding is invalid and cannot be removed exception".
I've looked at this from many perspectives:
I verified that the transportation is correct - the byte array received on the server's decrypt method is exactly the same that is sent from Mac OS X and Windows
The actual content of the encrypted blob, for the same key, is different between Windows and Mac OS X. I tested this using a hardcoded key and run this patch on Windows and Mac OS X for the same blob
I'm sure the padding the correct, since it is taken care of by OpenSSL and since the same code works for Windows. Even so, I tried implementing the padding scheme as it is in Microsoft's reference source for .NET but still, no go
I verified that the IV is the same for Windows and Mac OS X (I thought maybe there was a problem with some of the special characters such as ETB that appear in the IV, but there wasn't)
I've tried LibreSSL and mbedtls, with no positive results. In mbedtls I also had to implement padding because, as far as I know, padding is the responsibility of the API's user
I've been at this problem for almost two weeks now and I'm starting to pull my (ever scarce) hair out
As a frame of reference, I'll post the C client's code for encrypting and the server's C# code for decrypting. Some minor details on the server side will be omitted (they don't interfere with the crypto code).
Client:
/*++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++*/
void
__setup_aes(EVP_CIPHER_CTX *ctx, const char *key, qvr_bool encrypt)
{
static const char *iv = ""; /* for security reasons, the actual IV is omitted... */
if (encrypt)
EVP_EncryptInit(ctx, EVP_aes_256_cbc(), key, iv);
else
EVP_DecryptInit(ctx, EVP_aes_256_cbc(), key, iv);
}
/*++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++*/
void
__encrypt(void *buf,
size_t buflen,
const char *key,
unsigned char **outbuf,
size_t *outlen)
{
EVP_CIPHER_CTX ctx;
int blocklen = 0;
int finallen = 0;
int remainder = 0;
__setup_aes(&ctx, key, QVR_TRUE);
EVP_CIPHER *c = ctx.cipher;
blocklen = EVP_CIPHER_CTX_block_size(&ctx);
//*outbuf = (unsigned char *) malloc((buflen + blocklen - 1) / blocklen * blocklen);
remainder = buflen % blocklen;
*outlen = remainder == 0 ? buflen : buflen + blocklen - remainder;
*outbuf = (unsigned char *) calloc(*outlen, sizeof(unsigned char));
EVP_EncryptUpdate(&ctx, *outbuf, outlen, buf, buflen);
EVP_EncryptFinal_ex(&ctx, *outbuf + *outlen, &finallen);
EVP_CIPHER_CTX_cleanup(&ctx);
//*outlen += finallen;
}
Server:
static Byte[] Decrypt(byte[] input, byte[] key, byte[] iv)
{
try
{
// Check arguments.
if (input == null || input.Length <= 0)
throw new ArgumentNullException("input");
if (key == null || key.Length <= 0)
throw new ArgumentNullException("key");
if (iv == null || iv.Length <= 0)
throw new ArgumentNullException("iv");
byte[] unprotected;
using (var encryptor = Aes.Create())
{
encryptor.Key = key;
encryptor.IV = iv;
using (var msInput = new MemoryStream(input))
{
msInput.Position = 0;
using (
var cs = new CryptoStream(msInput, encryptor.CreateDecryptor(),
CryptoStreamMode.Read))
using (var data = new BinaryReader(cs))
using (var outStream = new MemoryStream())
{
byte[] buf = new byte[2048];
int bytes = 0;
while ((bytes = data.Read(buf, 0, buf.Length)) != 0)
outStream.Write(buf, 0, bytes);
return outStream.ToArray();
}
}
}
}
catch (Exception ex)
{
throw ex;
}
}
Does anyone have any clue as to why this could possibly be happening? For reference, this is the .NET method from Microsoft's reference source .sln that (I think) does the decryption: https://gist.github.com/Metaluim/fcf9a4f1012fdeb2a44f#file-rijndaelmanagedtransform-cs
OpenSSL version differences are messy. First I suggest you explicitly force and veryify the key lengths, keys, IVs and encryption modes on both sides. I don't see that in the code. Then I would suggest you decrypt on the server side without padding. This will always succeed, and then you can inspect the last block whether it is what you expect.
Do this with the Windows-Encryption and MacOS-Encryption variant and you will see a difference, most likely in the padding.
The outlen padding in the C++ code looks odd. Encrypting a 16 byte long plaintext results in 32 bytes of ciphertext, but you only provide a 16 byte long buffer. This will not work. You will write out of bounds. Maybe it works just by chance on Windows because of a more generous memory layout and fails on MacOS.
AES padding scheme has been changed between OpenSSL versions 0.9.8* and 1.0.1* (at least between 0.9.8r and 1.0.1j). If two of your modules uses these different versions of OpenSSL then this could be the reason for your problem. To verify this first check OpenSSL versions. If you hit the described case you may consider on aligning the padding scheme to be the same.

aes-gcm using libgcrypt api in C

I'm playing with libgcrypt (v1.6.1 on Gentoo x64) and i've already implemented (and tested thorugh the AEs test vectors) aes256-cbc and aes256-ctr. Now i am looking at aes256-gcm but i have some doubt about the workflow. Below there is a skeleton of a simple encryption program:
int main(void){
unsigned char TEST_KEY[] = {0x60,0x3d,0xeb,0x10,0x15,0xca,0x71,0xbe,0x2b,0x73,0xae,0xf0,0x85,0x7d,0x77,0x81,0x1f,0x35,0x2c,0x07,0x3b,0x61,0x08,0xd7,0x2d,0x98,0x10,0xa3,0x09,0x14,0xdf,0xf4};
unsigned char TEST_IV[] = {0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f};
unsigned char TEST_PLAINTEXT_1[] = {0x6b,0xc1,0xbe,0xe2,0x2e,0x40,0x9f,0x96,0xe9,0x3d,0x7e,0x11,0x73,0x93,0x17,0x2a};
unsigned char cipher[16] = {0};
int algo = -1, i;
const char *name = "aes256";
algo = gcry_cipher_map_name(name);
gcry_cipher_hd_t hd;
gcry_cipher_open(&hd, algo, GCRY_CIPHER_MODE_GCM, 0);
gcry_cipher_setkey(hd, TEST_KEY, 32);
gcry_cipher_setiv(hd, TEST_IV, 16);
gcry_cipher_encrypt(hd, cipher, 16, TEST_PLAINTEXT_1, 16);
char out[33];
for(i=0;i<16;i++){
sprintf(out+(i*2), "%02x", cipher[i]);
}
out[32] = '\0';
printf("%s\n", out);
gcry_cipher_close(hd);
return 0;
}
In GCM mode there want also these instruction:
gcry_cipher_authenticate (gcry cipher hd t h , const void * abuf , size t abuflen )
gcry_error_t gcry_cipher_gettag (gcry cipher hd t h , void * tag , size t taglen )
So the correct workflow of the encryption program is:
gcry_cipher_authenticate
gcry_cipher_encrypt
gcry_cipher_gettag
But what i haven't undestood is:
abuf is like a salt? (so have i to generate it using gcry_create_nonce or similar?)
If i want to encrypt a file, void *tag is what i have to write to the outfile?
1) gcry_cipher_authenticate is for supporting authenticated encryption with associated data. abuf is data that you need to authenticate but do not need to encrypt. For example, if you are sending a packet, you might want to encrypt the body, but you must send the header unencrypted for the packet to be delivered. The tag generated by the cipher will provide integrity for both the encrypted data and the data sent in plain.
2) The tag is used after decryption to make sure that the data has not been tampered with. You append the tag to the encrypted text. Note, that it is computed on encrypted data and associated (unencrypted) data, so you will need both when decrypting.
You can check these documents for more information on GCM:
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/gcm/gcm-spec.pdf
http://csrc.nist.gov/publications/nistpubs/800-38D/SP-800-38D.pdf
Also, you can probably get faster answers to cryptography questions like this on http://crypto.stackexchange.com.

8 byte missing on EVP_DecryptFinal

this is my first question so please tell me if I do something wrong :).
My problem is I use
EVP_DecryptInit(&ctx1, EVP_des_ecb(), tmpkey, NULL);
EVP_DecryptUpdate(&ctx1, keysigout, &outlu ,keysigin, keysigfilelength);
EVP_DecryptFinal(&ctx1, keysigout, &outlf);
printf("DECLEN:%i",outlu + outlf);
to decrypt a binary file. The file is 248 bytes long but the printf only tells me EVP decrypted 240 bytes. keysigfilelength is 248 and should tell the update that 248 bytes need to be decrypted.
I dont understand why this doesnt work and would be happy if you can enlighten me.
Edit:
I just encrypted a file manually with the command
openssl enc -e -des-ecb -in test.txt -out test.bin -K 00a82b209cbeaf00
and it grew by 8 bytes :O. I still don't know where they come from but I don't think the general error I have in my program is caused by this.
The context of this whole problem is an information security course at my university. We got similar Tasks with different algorithms, but even someone who has done his program successfully couldnt figure out where the problem in my program is.
Is it ok to post my whole program for you?
I hope its fine to answer my own question.
EVP_DecryptUpdate(&ctx1, keysigout, &outlu ,keysigin, keysigfilelength);
EVP_DecryptFinal(&ctx1, keysigout + outlu, &outlf);
The problem was the missing outlu, DecryptFinal tried to decrypt the whole block again. When i added the outlu i got 7 byte in outlf, and it worked.
For future reference i add the whole function below.
It expects the key and iv to be one block of data.
int decrypt(const EVP_CIPHER *cipher,unsigned char *key, unsigned char *encryptedData, int encryptedLength,unsigned int * length, unsigned char ** decryptedData)
{
int decryptedLength = 0, lastDecryptLength = 0, ret;
unsigned char * iv = NULL;
EVP_CIPHER_CTX *cryptCtx = EVP_CIPHER_CTX_new();
EVP_CIPHER_CTX_init(cryptCtx);
*decryptedData = malloc (encryptedLength * sizeof(char));
if(cipher->iv_len != 0) iv = key + cipher->key_len;
EVP_DecryptInit_ex(cryptCtx, cipher, NULL, key, iv);
EVP_DecryptUpdate(cryptCtx, *decryptedData, &decryptedLength, encryptedData, encryptedLength);
ret = EVP_DecryptFinal_ex(cryptCtx, *decryptedData + decryptedLength, &lastDecryptLength);
*length = decryptedLength + lastDecryptLength;
EVP_CIPHER_CTX_free(cryptCtx);
EVP_cleanup();
return ret;
}
Because block ciphers only really want to work on an input that is a multiple of their block size, the input is normally padded to meet this requirement.
The default for many programs (including openssl enc is to use PKCS #5 padding
If the plaintext is not a multiple of 8 bytes then padding bytes are added so that it is. If it is already a multiple of 8 bytes then 8 bytes of padding are added. Thus it is entirely normal for the encrypted data to be longer than the plaintext.

C Library for compressing sequential positive integers

I have the very common problem of creating an index for an in-disk array of strings. In short, I need to store the position of each string in the in-disk representation. For example, a very naive solution would be an index array as follows:
uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };
Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
The positions are always non-negative 64 bits integers in sequential order. Although the numbers could vary by any difference, in practice I expect the typical difference to be inside the range from 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume uniform distribution).
I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.
Any hints? A c library would be ideal, but a c++ one would also allow me to run some initial benchmarks.
A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top the library cmph (http://cmph.sf.net). In short, it is for a large disk based read only associative map with a small index in memory.
Since it is a library, I don't have control over input, but the typical use case that I want to optimize have millions of hundreds of values, average value size in the few kilobytes ranges and maximum value at 2^31.
For the record, if I don't find a library ready to use I intend to implement delta encoding in blocks of 64 integers with the initial bytes specifying the block offset so far. The blocks themselves would be indexed with a tree, giving me O(log (n/64)) access time. There are way too many other options and I would prefer to not discuss them. I am really looking forward ready to use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
I appreciate your help and let me know if you have any doubts.
I use fastbit (Kesheng Wu LBL.GOV), it seems you need something good, fast and NOW, so fastbit is a highly competient improvement on Oracle's BBC (byte aligned bitmap code, berkeleydb). It's easy to setup and very good gernally.
However, given more time, you may want to look at a gray code solution, it seems optimal for your purposes.
Daniel Lemire has a number of libraries for C/++/Java released on code.google, I've read over some of his papers and they are quite nice, several advancements on fastbit and alternative approaches for column re-ordering with permutated grey codes's.
Almost forgot, I also came across Tokyo Cabinet, though I do not think it will be well suited for my current project, I may of considered it more if I had known about it before ;), it has a large degree of interoperability,
Tokyo Cabinet is written in the C
language, and provided as API of C,
Perl, Ruby, Java, and Lua. Tokyo
Cabinet is available on platforms
which have API conforming to C99 and
POSIX.
As you referred to CDB, the TC benchmark has a TC mode (TC support's several operational constraint's for varying perf) where it surpassed CDB by 10 times for read performance and 2 times for write.
With respect to your delta encoding requirement, I am quite confident in bsdiff and it's ability to out-perform any file.exe content patching system, it may also have some fundimental interfaces for your general needs.
Google's new binary compression application, courgette may be worth checking out, in case you missed the press release, 10x smaller diff's than bsdiff in the one test case I have seen published.
You have two conflicting requirements:
You want to compress very small items (8 bytes each).
You need efficient random access for each item.
The second requirement is very likely to impose a fixed length for each item.
What exactly are you trying to compress? If you are thinking about the total space of index, is it really worth the effort to save the space?
If so one thing you could try is to chop the space into half and store it into two tables. First stores (upper uint, start index, length, pointer to second table) and the second would store (index, lower uint).
For fast searching, indices would be implemented using something like B+ Tree.
I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.
Since it's in C++, the code may is not going to be useful to you as is, but can be a good starting point for writing compressions routines.
Please excuse the hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)
IndexCompressor.h
//
// index compressor class
//
#pragma once
#include "File.h"
const int IC_BUFFER_SIZE = 8192;
//
// index compressor
//
class IndexCompressor
{
private :
File *m_pFile;
WA_DWORD m_dwRecNo;
WA_DWORD m_dwWordNo;
WA_DWORD m_dwRecordCount;
WA_DWORD m_dwHitCount;
WA_BYTE m_byBuffer[IC_BUFFER_SIZE];
WA_DWORD m_dwBytes;
bool m_bDebugDump;
void FlushBuffer(void);
public :
IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
~IndexCompressor(void) {}
void Attach(File& File) { m_pFile = &File; }
void Begin(void);
void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
void End(void);
WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
WA_DWORD GetHitCount(void) { return m_dwHitCount; }
void DebugDump(void) { m_bDebugDump = true; }
};
IndexCompressor.cpp
//
// index compressor class
//
#include "stdafx.h"
#include "IndexCompressor.h"
void IndexCompressor::FlushBuffer(void)
{
ASSERT(m_pFile != 0);
if (m_dwBytes > 0)
{
m_pFile->Write(m_byBuffer, m_dwBytes);
m_dwBytes = 0;
}
}
void IndexCompressor::Begin(void)
{
ASSERT(m_pFile != 0);
m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
m_dwBytes = 0;
}
void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
ASSERT(m_pFile != 0);
WA_BYTE buffer[16];
int nbytes = 1;
ASSERT(dwRecNo >= m_dwRecNo);
if (dwRecNo != m_dwRecNo)
m_dwWordNo = 0;
if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
++m_dwRecordCount;
++m_dwHitCount;
WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;
if (m_bDebugDump)
{
TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
}
// 1WWWWWWW
if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
{
buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
}
// 01WWWWWW WWWWWWWW
else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
{
buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
nbytes += sizeof(WA_BYTE);
}
// 001RRRRR WWWWWWWW WWWWWWWW
else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
{
buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
WA_WORD *p = (WA_WORD *) (buffer+1);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
// 0001rrww
buffer[0] = 0x10;
// encode recno
if (dwRecNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwRecNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwRecNoDelta < 65536)
{
buffer[0] |= 0x04;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwRecNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x08;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwRecNoDelta;
nbytes += sizeof(WA_DWORD);
}
// encode wordno
if (dwWordNoDelta < 256)
{
buffer[nbytes] = WA_BYTE(dwWordNoDelta);
nbytes += sizeof(WA_BYTE);
}
else if (dwWordNoDelta < 65536)
{
buffer[0] |= 0x01;
WA_WORD *p = (WA_WORD *) (buffer+nbytes);
*p = WA_WORD(dwWordNoDelta);
nbytes += sizeof(WA_WORD);
}
else
{
buffer[0] |= 0x02;
WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
*p = dwWordNoDelta;
nbytes += sizeof(WA_DWORD);
}
}
// update current setting
m_dwRecNo = dwRecNo;
m_dwWordNo = dwWordNo;
// add compressed data to buffer
ASSERT(buffer[0] != 0);
ASSERT(nbytes > 0 && nbytes < 10);
if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
FlushBuffer();
CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
m_dwBytes += nbytes;
if (m_bDebugDump)
{
for (int i = 0; i < nbytes; ++i)
TRACE("%02X ", buffer[i]);
TRACE("\n");
}
}
void IndexCompressor::End(void)
{
FlushBuffer();
m_pFile->Write(WA_BYTE(0));
}
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64% incurs at most 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
Number of strings
Average length of a string
Maximum length of a string
Median length of strings
Degree to which the strings file compresses with gzip
Whether you are allowed to change the order of strings to improve compression
EDIT
The comment and revised question makes the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size–I think you will probably want to determine that empirically.
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try—it can code unsigned integers up to 32 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
Are you running on Windows? If so, I recommend creating the mmap file using naive solution your originally proposed, and then compressing the file using NTLM compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.

Resources