How to decompress from Huffman's compression in C

I am developing a program to decompress a file that is passed as a parameter and was previously compressed with the Huffman algorithm, but my decompression function does not work. Can you help me?
Here is the format of the compressed file:
110;;1100;o1101000; 1101001;f1101010;r1101011;
1101100;�1101101;�1101110;{1101111;�1110000;�1110001;�1110010;�1110011;#1110100;61110101;m11101100;h11101101;l11101110;e11101111;01111;
���;o�;�����E}�j�U����͛wo�Ǘ>�#6
I have a function to read the file, and another to parse the huffman code (at the top of the file)
My decompression function :
unsigned char *read_bits_from_compressed(unsigned char *str, list_t *code) {
int str_len = strlen(str);
unsigned char padding = str[str_len - 1];
int padding_bits = padding >> 4;
int bits_read = 0;
int curr_byte = 0;
int curr_bit = 7;
int bit = 0;
int i = 0;
int j = 0;
node_t *node = code->head;
cipher_t *cipher = NULL;
unsigned char *result = malloc(str_len);
memset(result, 0, str_len);
for (i = str_len - 2; i >= 0; i--) {
curr_byte = str[i];
for (j = 7; j >= 0; j--) {
bit = (curr_byte >> j) & 1;
while (node != NULL && bits_read < padding_bits) {
node = node->next;
bits_read++;
}
while (node != NULL) {
cipher = (cipher_t *) node->data;
if (cipher->code[curr_bit] == bit) {
curr_bit--;
if (cipher->code[curr_bit + 1] == -1) {
result[str_len - padding - 1 - i] = cipher->c;
node = code->head;
curr_bit = 7;
break;
}
} else {
node = node->next;
curr_bit = 7;
}
}
}
}
return result;
}
The function must do the following:
Invert the string and read it from the end.
The first character of the inverted string holds the padding; retrieve it.
Start reading bit by bit, ignoring the padding.
Insert the entire bit representation into an array.
Read the array and retrieve the corresponding character.
Write the corresponding character to the output file.
Repeat until the end of the compressed data (detect and skip the Huffman code).
Here are the structures of the chained list :
The list :
typedef struct list {
node_t *head;
node_t *tail;
size_t size;
} list_t;
The nodes :
typedef struct node {
struct node *prev;
struct node *next;
void *data;
} node_t;
The data contained in the nodes :
typedef struct cipher {
unsigned char c;
int *code;
} cipher_t;
c corresponds to char and code to Huffman code (composed of 1 and 0, terminated by -1).
My function currently returns an empty string.

You cannot compute the length of the compressed array with int str_len = strlen(str);. str points to binary data that may contain meaningful embedded null bytes, so strlen() may stop too early. You should pass the length as an extra argument to read_bits_from_compressed.
As a matter of fact, the compiler should have complained that you pass an unsigned char * to strlen(), which expects a char * (or rather a const char *). Do not ignore compiler warnings.
Furthermore, you allocate the decompressed string with unsigned char *result = malloc(str_len);. There is no guarantee that the length of the decompressed string is the same as that of the compressed buffer. It may be more or less, depending on the Huffman tree and the uncompressed values. Note also that you must allocate one extra byte for the null terminator if you intend to produce a C string.
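To illustrate passing the length explicitly, here is a minimal sketch of a bit-by-bit decoder. It is not the asker's file format: the code_entry type, the huff_decode name, and the MSB-first bit order are assumptions made for the example, and the code table is stored as strings of '0' and '1' rather than as the int arrays used in the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one symbol and its code as a string of '0'/'1'. */
typedef struct {
    unsigned char c;
    const char *bits;
} code_entry;

/* Decode `nbits` bits from `in` (most significant bit first) using a
   prefix-free code table. The input length is passed explicitly, so
   zero bytes inside the compressed data are handled like any other byte.
   Returns the number of symbols written to `out`, or (size_t)-1 on error. */
size_t huff_decode(const unsigned char *in, size_t in_len, size_t nbits,
                   const code_entry *table, size_t ntable,
                   unsigned char *out, size_t out_cap)
{
    char cur[64];               /* bits accumulated for the current symbol */
    size_t cur_len = 0, out_len = 0;

    if (nbits > in_len * 8)
        return (size_t)-1;      /* caller asked for more bits than exist */
    for (size_t k = 0; k < nbits; k++) {
        unsigned bit = (in[k / 8] >> (7 - k % 8)) & 1u;
        if (cur_len + 1 >= sizeof cur)
            return (size_t)-1;  /* code longer than this sketch supports */
        cur[cur_len++] = bit ? '1' : '0';
        cur[cur_len] = '\0';
        for (size_t t = 0; t < ntable; t++) {
            if (strcmp(cur, table[t].bits) == 0) {
                if (out_len >= out_cap)
                    return (size_t)-1;  /* output buffer too small */
                out[out_len++] = table[t].c;
                cur_len = 0;
                break;
            }
        }
    }
    /* Leftover bits mean the input did not end on a code boundary. */
    return cur_len == 0 ? out_len : (size_t)-1;
}
```

The caller sizes the output buffer independently of the input length and passes the number of meaningful bits (total bits minus padding), which is the information the padding byte in the question is meant to provide.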

Related

Segmentation fault when using fscanf in c

I am really trying to learn if someone wouldn't mind to educate me in the principles I may be missing out on here. I thought I had everything covered but it seems I am doing something incorrectly.
The following code gives me a segmentation fault, and I cannot figure out why? I am adding the & in front of the arguments name being passed in to fscanf.
int word_size = 0;
#define HASH_SIZE 65536
#define LENGTH = 45
node* global_hash[HASH_SIZE] = {NULL};
typedef struct node {
char word[LENGTH + 1];
struct node* next;
} node;
int hash_func(char* hash_val){
int h = 0;
for (int i = 0, j = strlen(hash_val); i < j; i++){
h = (h << 2) ^ hash_val[i];
}
return h % HASH_SIZE;
}
bool load(const char *dictionary)
{
char* string;
FILE* dic = fopen(dictionary, "r");
if(dic == NULL){
fprintf(stdout, "Error: File is NULL.");
return false;
}
while(fscanf(dic, "%ms", &string) != EOF){
node* new_node = malloc(sizeof(node));
if(new_node == NULL){
return false;
}
strcpy(new_node->word, string);
new_node->next = NULL;
int hash_indx = hash_func(new_node->word);
node* first = global_hash[hash_indx];
if(first == NULL){
global_hash[hash_indx] = new_node;
} else {
new_node->next = global_hash[hash_indx];
global_hash[hash_indx] = new_node;
}
word_size++;
free(new_node);
}
fclose(dic);
return true;
}
dictionary.c:25:16: runtime error: left shift of 2127912344 by 2 places cannot be represented in type 'int'
dictionary.c:71:23: runtime error: index -10167 out of bounds for type 'node *[65536]'
dictionary.c:73:13: runtime error: index -10167 out of bounds for type 'node *[65536]'
dictionary.c:75:30: runtime error: index -22161 out of bounds for type 'node *[65536]'
dictionary.c:76:13: runtime error: index -22161 out of bounds for type 'node *[65536]'
Segmentation fault

Update after OP posted more code
The problem is that your hash_func works with signed integers and that it overflows. Therefore you get a negative return value (or rather undefined behavior).
That is also what these lines tell you:
dictionary.c:25:16: runtime error: left shift of 2127912344 by 2 places cannot be represented in type 'int'
Here it tells you that you have a signed integer overflow
dictionary.c:71:23: runtime error: index -10167 out of bounds for type 'node *[65536]'
Here it tells you that you use a negative index into an array (i.e. global_hash)
Try using unsigned integer instead
unsigned int hash_func(char* hash_val){
unsigned int h = 0;
for (int i = 0, j = strlen(hash_val); i < j; i++){
h = (h << 2) ^ hash_val[i];
}
return h % HASH_SIZE;
}
and call it like:
unsigned int hash_indx = hash_func(new_node->word);
Original answer
I'm not sure this is the root cause of all problems but it seems you have some problems with memory allocation.
Each time you call fscanf you get new dynamic memory allocated for string due to %ms. However, you never free that memory, so you have a leak.
Further, this looks like a major problem:
global_hash[hash_indx] = new_node; // Here you save new_node
} else {
new_node->next = global_hash[hash_indx];
global_hash[hash_indx] = new_node; // Here you save new_node
}
word_size++;
free(new_node); // But here you free the memory
So it seems your table holds pointers to memory that have been free'd already.
That is a major problem that may cause seg faults when you use the pointers.
Maybe change this
free(new_node);
to
free(string);
In general I'll suggest that you avoid %ms and also avoid fscanf. Use char string[LENGTH + 1] and fgets instead.
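The fgets suggestion could look roughly like the sketch below. The read_word name is hypothetical, and the sketch assumes the dictionary holds one word per line; a line longer than the buffer would be returned in several pieces.

```c
#include <stdio.h>
#include <string.h>

#define LENGTH 45

/* Read one line (one dictionary word) into buf and strip the line ending.
   Returns 1 on success, 0 at end of file or on a read error. */
int read_word(FILE *fp, char *buf, size_t size)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';   /* cut at the first CR or LF, if any */
    return 1;
}
```

With this helper the loop in load() becomes while (read_word(dic, string, sizeof string)) with char string[LENGTH + 1];, and no memory has to be freed per word.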

There are multiple issues in the code posted. Here are the major ones:
You should use unsigned arithmetic for the hash code computation to ensure that the hash value is never negative. The current implementation has undefined behavior: words longer than 15 letters cause an arithmetic overflow, which may produce a negative value and make the modulo negative as well, indexing outside the bounds of global_hash.
You free the newly allocated node with free(new_node);. It has been stored into the global_hash array: later dereferencing it for another word with the same hash value will cause undefined behavior. You probably meant to free the parsed word instead with free(string);.
Here are the other issues:
You should check the length of the string before copying it to the node structure array with strcpy(new_node->word, string);
fscanf(dic, "%ms", &string) is not portable. The m modifier causes fscanf to allocate memory for the word, but it is not part of ISO C; it is a POSIX.1-2008 feature supported by glibc that may not be available in other environments. You might want to write a simple function for better portability.
The main loop should test for successful conversion with while(fscanf(dic, "%ms", &string) == 1) instead of just end of file with EOF. It may not cause a problem in this specific case, but it is a common cause of undefined behavior for other conversion specifiers.
The definition #define HASH_SIZE 65536; has an extra ; which may cause unexpected behavior if HASH_SIZE is used in expressions.
The definition #define LENGTH = 45; is incorrect: the code does not compile as posted.
Here is a modified version:
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define HASH_SIZE 65536
#define LENGTH 45
typedef struct node {
char word[LENGTH + 1];
struct node *next;
} node;
int word_size = 0;
node *global_hash[HASH_SIZE];
unsigned hash_func(const char *hash_val) {
unsigned h = 0;
for (size_t i = 0, j = strlen(hash_val); i < j; i++) {
h = ((h << 2) | (h >> 30)) ^ (unsigned char)hash_val[i];
}
return h % HASH_SIZE;
}
/* read a word from fp, skipping initial whitespace.
return the length of the word read or EOF at end of file
store the word into the destination array, truncating it as needed
*/
int get_word(char *buf, size_t size, FILE *fp) {
int c;
size_t i;
while (isspace(c = getc(fp)))
continue;
if (c == EOF)
return EOF;
for (i = 0;; i++) {
if (i < size)
buf[i] = c;
c = getc(fp);
if (c == EOF)
break;
if (isspace(c)) {
ungetc(c, fp);
break;
}
}
if (i < size)
buf[i] = '\0';
else if (size > 0)
buf[size - 1] = '\0';
return i;
}
bool load(const char *dictionary) {
char buf[LENGTH + 1];
FILE *dic = fopen(dictionary, "r");
if (dic == NULL) {
fprintf(stderr, "Error: cannot open dictionary file %s\n", dictionary);
return false;
}
while (get_word(buf, sizeof buf, dic) != EOF) {
node *new_node = malloc(sizeof(node));
if (new_node == NULL) {
fprintf(stderr, "Error: out of memory\n");
fclose(dic);
return false;
}
unsigned hash_indx = hash_func(buf);
strcpy(new_node->word, buf);
new_node->next = global_hash[hash_indx];
global_hash[hash_indx] = new_node;
word_size++;
}
fclose(dic);
return true;
}

The following proposed code:
compiles cleanly
still has a major problem in hash_func(): the signed int arithmetic can overflow and produce a negative index
separates the definition of the struct from the typedef for that struct, for clarity and flexibility
properly formats the #define statements
properly handles errors from fopen() and malloc()
properly limits the length of the string read from the 'dictionary' file
assumes that no word in the 'dictionary' file is longer than 45 bytes
And now, the proposed code:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
//prototypes
bool load(const char *dictionary);
int hash_func(char* hash_val);
#define HASH_SIZE 65536
#define LENGTH 45
struct node
{
char word[LENGTH + 1];
struct node* next;
};
typedef struct node node;
node* global_hash[HASH_SIZE] = {NULL};
int word_size = 0;
int hash_func(char* hash_val)
{
int h = 0;
for ( size_t i = 0, j = strlen(hash_val); i < j; i++)
{
h = (h << 2) ^ hash_val[i];
}
return h % HASH_SIZE;
}
bool load(const char *dictionary)
{
char string[ LENGTH+1 ];
FILE* dic = fopen(dictionary, "r");
if(dic == NULL)
{
perror( "fopen failed" );
//fprintf(stdout, "Error: File is NULL.");
return false;
}
while( fscanf( dic, "%45s", string) == 1 )
{
node* new_node = malloc(sizeof(node));
if(new_node == NULL)
{
perror( "malloc failed" );
return false;
}
strcpy(new_node->word, string);
new_node->next = NULL;
int hash_indx = hash_func(new_node->word);
// following statement for debug:
printf( "index returned from hash_func(): %d\n", hash_indx );
if( !global_hash[hash_indx] )
{
global_hash[hash_indx] = new_node;
}
else
{
new_node->next = global_hash[hash_indx];
global_hash[hash_indx] = new_node;
}
word_size++;
}
fclose(dic);
return true;
}

Pointer is changing after function call, C

So I've written this program to represent a car park as a bitset, each space in the car park being one bit. I have a checkSpace function to check if a space is occupied or not and for some reason the pointer to my car park bitset changes or the data changes after I pass it into the function. To test it I set up the car park, I checked a space, then checked it again immediately after and for some reason the return value is changing when it shouldn't be. Any help would be appreciated!
struct carPark{
int spaces, levels;
unsigned char * park;
};
struct carPark * emptyCarPark(int levels, int spaces){
int chars = (spaces*levels)/8;
if((spaces*levels)%8 != 0){
chars++;
}
unsigned char park[chars];
for (int i = 0; i < chars; ++i){
park[i] = 0;
}
unsigned char * ptr = &park[0];
struct carPark * myPark = malloc(sizeof(struct carPark));
myPark->park = ptr;
myPark->spaces = spaces;
myPark->levels = levels;
return myPark;
}
int checkSpace(int level, int spaceNum, struct carPark * carpark){
int charPosition = ((level*carpark->spaces) + spaceNum)/8;
int bitPosition = ((level*carpark->spaces) + spaceNum)%8;
if(carpark->park[charPosition]&&(1<<bitPosition) != 0){
return 1;
}
return 0;
}
int main(int argc, char const *argv[]){
struct carPark * myPark = emptyCarPark(5,20);
printf("1st check: %d\n",checkSpace(1,1,myPark));
printf("Second check: %d\n",checkSpace(1,1,myPark));
return 0;
}
So when I run the program I get:
1st check: 0
Second check: 1

In emptyCarPark() you allocate the park array on the stack and then return a pointer to it. As soon as the function returns, the array no longer exists and you are left with a dangling pointer (for more information, see "Cause of dangling pointers" on Wikipedia). This is the relevant code:
unsigned char park[chars];
for (int i = 0; i < chars; ++i){
park[i] = 0;
}
// This is the pointer to an object on the stack.
unsigned char * ptr = &park[0];
struct carPark * myPark = malloc(sizeof(struct carPark));
myPark->park = ptr;
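One possible fix, sketched under a few assumptions, is to allocate the bit array on the heap so that it outlives emptyCarPark(). The occupySpace and freeCarPark helpers are hypothetical additions for the example. The sketch also tests the bit with the bitwise & operator; the && in the question's checkSpace is a logical AND, which only checks that both operands are non-zero.

```c
#include <stdlib.h>

struct carPark {
    int spaces, levels;
    unsigned char *park;
};

/* Allocate the struct and a zeroed bit array on the heap. */
struct carPark *emptyCarPark(int levels, int spaces)
{
    size_t bits = (size_t)levels * (size_t)spaces;
    size_t chars = (bits + 7) / 8;          /* round up to whole bytes */
    struct carPark *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->park = calloc(chars ? chars : 1, 1); /* calloc zero-fills the bits */
    if (p->park == NULL) {
        free(p);
        return NULL;
    }
    p->spaces = spaces;
    p->levels = levels;
    return p;
}

/* Return 1 if the space is occupied, 0 otherwise. */
int checkSpace(int level, int spaceNum, const struct carPark *cp)
{
    int pos = level * cp->spaces + spaceNum;
    return (cp->park[pos / 8] & (1u << (pos % 8))) != 0;
}

/* Hypothetical helper: mark a space as occupied. */
void occupySpace(int level, int spaceNum, struct carPark *cp)
{
    int pos = level * cp->spaces + spaceNum;
    cp->park[pos / 8] |= (unsigned char)(1u << (pos % 8));
}

/* Hypothetical helper: release both allocations. */
void freeCarPark(struct carPark *cp)
{
    if (cp != NULL) {
        free(cp->park);
        free(cp);
    }
}
```

With this version, two consecutive calls to checkSpace(1, 1, myPark) on a fresh car park read the same heap memory and return the same value.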

Struct pointer array segmentation fault

I'm building an autocomplete program that takes a few characters of input and gives back suggested words to complete the characters. I have an AutoComplete_AddWord function that adds words for suggestion. However, whenever I try to access my structs completions array(holds up to 10 suggested words for given host table's letters) a segmentation fault is thrown. Not sure where I'm going wrong. Thanks for any help.
struct table {
struct table *nextLevel[26];
char *completions[10]; /* 10 word completions */
int lastIndex;
};
static struct table Root = { {NULL}, {NULL}, 0 }; //global representing the root table containing all subsequent tables
void AutoComplete_AddWord(const char *word){
int i; //iterator
char *w = (char*)malloc(100*(sizeof(char));
for(i = 0; w[i]; i++){ // make lowercase version of word
w[i] = tolower(word[i]);
}
char a = 'a';
if(w[0] < 97 || w[0] > 122)
w++;
int index = w[0] - a; // assume word is all lower case
if(Root.nextLevel[index] == NULL){
Root.nextLevel[index] = (struct table*) malloc(sizeof(struct table));
TotalMemory += sizeof(table);
*Root.nextLevel[index] = (struct table){{NULL},{NULL},0};
}
else
// otherwise, table is already allocated
struct table *pointer = Root.nextLevel[index];
pointer->completions[0] = strdup(word); //Here is where seg fault keeps happening
}

OK, so there are a lot of errors with this, and you obviously didn't test it and compile it. But I was curious so I took a closer look, and the problem stems from here:
for(i = 0; w[i]; i++){ // make lowercase version of word
w[i] = tolower(word[i]);
}
You are diving right into a loop checking w[0], a fresh, uninitialized block of memory.
Changing it to this:
for(i = 0; word[i]; i++){ // make lowercase version of word
w[i] = tolower(word[i]);
}
Will solve that problem. Fixing the other miscellaneous problems mentioned above, a non-segfaulting version of the code looks like this:
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
struct table {
struct table *nextLevel[26];
char *completions[10]; /* 10 word completions */
int lastIndex;
};
int TotalMemory = 0;
static struct table Root = { {NULL}, {NULL}, 0 }; //global representing the root table containing all subsequent tables
void AutoComplete_AddWord(const char *word){
int i; //iterator
char *w = (char*)malloc(100*(sizeof(char)));
for(i = 0; word[i]; i++){ // make lowercase version of word
w[i] = tolower(word[i]);
}
char a = 'a';
if(w[0] < 97 || w[0] > 122) w++;
int index = w[0] - a; // assume word is all lower case
if(Root.nextLevel[index] == NULL){
Root.nextLevel[index] = (struct table*) malloc(sizeof(struct table));
TotalMemory += sizeof(struct table);
*Root.nextLevel[index] = (struct table){{NULL},{NULL},0};
}
struct table *pointer = Root.nextLevel[index];
pointer->completions[0] = strdup(word); // no longer segfaults for the test word below
}
int main(int argc, char **argv)
{
AutoComplete_AddWord("testing");
return 0;
}
I can't speak for what happens next with this program, but at least this gets you past the segfault.
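As a follow-up to the loop fix, the lowercase copy can be made a little more robust by driving the loop from the source string, bounding it by the destination size, and writing the terminator explicitly. This is only a sketch; lower_copy is a hypothetical helper name, not part of the original program.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy src into dst in lowercase, writing at most size - 1 characters
   followed by a null terminator. Returns the number of characters copied. */
size_t lower_copy(char *dst, size_t size, const char *src)
{
    size_t i = 0;

    if (size == 0)
        return 0;
    /* The loop condition reads src, never the uninitialized dst. */
    for (; src[i] != '\0' && i + 1 < size; i++)
        dst[i] = (char)tolower((unsigned char)src[i]);
    dst[i] = '\0';
    return i;
}
```

Calling it with a local buffer, for example char w[100]; lower_copy(w, sizeof w, word);, also avoids the malloc whose result is never freed.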

Code for parsing a character buffer

I want to parse a character buffer and store it in a data structure.
The 1st 4 bytes of the buffer specifies the name, the 2nd four bytes specifies the length (n) of the value and the next n bytes specifies the value.
eg: char *buff = "aaaa0006francebbbb0005swisscccc0013unitedkingdom"
I want to extract the name and the value from the buffer and store it a data structure.
eg: char *name = "aaaa"
char *value = "france"
char *name = "bbbb"
char *value = "swiss"
After storing, I should be able to access the value from the data structure by using the name.
What data structure should I use?
EDIT (from comment):
I tried the following:
struct sample {
char string[4];
int length[4];
char *value; };
struct sample s[100];
while ( *buf ) {
memcpy(s[i].string, buf, 4);
memcpy(s[i].length, buf+4, 4);
memcpy(s[i].value, buf+8, s.length);
buf += (8+s.length);
}
Should I call memcpy thrice? Is there a way to do it by calling memcpy only once?

How about not using memcpy at all?
typedef struct sample {
char name[4];
union
{
char length_data[4];
unsigned int length;
};
char value[];
} sample_t;
const char * sample_data = "aaaa\6\0\0\0francebbbb\5\0\0\0swisscccc\15\0\0\0unitedkingdom";
int main(void)
{
sample_t * s[10];
const char * current = sample_data;
int i = 0;
while (*current && i < 10)
{
s[i] = (sample_t *) current;
current += (s[i])->length + 8;
i++;
}
// Here, s[0], s[1] and s[2] should be set properly
return 0;
}
Now, you never specify clearly whether the 4 bytes representing the length contain a string representation or the actual binary data. If it is four characters that need to run through atoi() or similar, then you need to do some post-processing like
s[i]->length = atoi(s[i]->length_data);
before the struct is usable, which in turn means that the source data must be writable and probably copied locally. But even then you should be able to copy the whole input buffer at once instead of chopping it up.
Also, please note that this relies on everything that uses this struct honoring the length field rather than treating the value field as a null-terminated string.
Finally, using binary integer data like this is obviously architecture-dependent, with all the implications that follow.

To expand on your newly provided info, this will work better:
struct sample {
char string[4];
int length;
char *value; };
struct sample s[100];
while ( *buf && i < 100) {
memcpy(s[i].string, buf, 4);
s[i].length = atoi(buf+4);
s[i].value = malloc(s[i].length);
if (s[i].value)
{
memcpy(s[i].value, buf+8, s[i].length);
}
buf += (8+s[i].length);
i++;
}
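For the text format shown in the question (a 4-character name followed by a 4-digit decimal length), a slightly more defensive sketch could copy the length digits into a small terminated buffer before converting them, so that a value starting with a digit is not swallowed into the length. The parse_records and find_value names and the 5-byte name field are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

struct sample {
    char name[5];    /* 4-byte name plus terminator, so strcmp() works */
    size_t length;
    char *value;     /* malloc'd, null-terminated copy of the value */
};

/* Parse up to max records from buf. Returns the number of records stored;
   parsing stops early on malformed input or allocation failure. */
size_t parse_records(const char *buf, struct sample *out, size_t max)
{
    size_t n = 0;
    size_t remaining = strlen(buf);

    while (n < max && remaining >= 8) {
        char digits[5];
        char *end;

        memcpy(out[n].name, buf, 4);
        out[n].name[4] = '\0';
        memcpy(digits, buf + 4, 4);      /* exactly the 4 length characters */
        digits[4] = '\0';
        unsigned long len = strtoul(digits, &end, 10);
        if (*end != '\0' || len > remaining - 8)
            break;                       /* not a number, or value truncated */
        out[n].value = malloc(len + 1);
        if (out[n].value == NULL)
            break;
        memcpy(out[n].value, buf + 8, len);
        out[n].value[len] = '\0';
        out[n].length = len;
        buf += 8 + len;
        remaining -= 8 + len;
        n++;
    }
    return n;
}

/* Linear lookup by name; returns NULL when the name is not present. */
const char *find_value(const struct sample *s, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(s[i].name, name) == 0)
            return s[i].value;
    return NULL;
}
```

A linear search is enough for a handful of records; for many names a hash table keyed on the 4-byte name would be the usual next step. Each value should be released with free() when it is no longer needed.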

I would do something like this:
I would define a variable-length structure, like this:
typedef struct {
char string[4];
uint32_t length;
char value[]; } sample;
Now, while parsing, read the string and length into temporary variables.
Then, allocate enough memory for the structure.
uint32_t string = * ( ( uint32_t * ) buffer );
uint32_t length = * ( ( uint32_t * ) ( buffer + 4 ) ); // cast after adding the byte offset
sample * s = malloc(sizeof(sample) + length);
// Check here for malloc errors...
* ( (uint32_t *) s->string) = string;
s->length = length;
memcpy(s->value, ( buffer + 8 ), length);
This approach keeps the entire context of the buffer in one contiguous memory structure.
I use it all the time.

C - split/store string of X length into an array of structs

I'm trying to split a string every X amount of characters, and then store each line in an array of structs. However, I'm wondering what would be a short and efficient way of doing it. I thought that maybe I could use sscanf, but not very sure how to. Any help will be appreciated. So far I have:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct st {char *str;};
int main ()
{
struct st **mystruct;
char tmp[] = "For configuration options (arch/xxx/config.in, and all the Config.in files),somewhat different indentation is used.";
size_t max = 20, j = 0; // max length of string
size_t alloc = strlen(tmp)/max + 1;
mystruct = malloc(alloc * sizeof *mystruct);
for (j = 0; j < alloc; j++)
mystruct[j] = malloc(sizeof *mystruct[j]);
const char *ptr = tmp;
char field [ max ];
int n;
while (*ptr != '\0') {
int line = sscanf(ptr, "%s", field, &n); // not sure how to use max in here
mystruct[j]->str = field;
field[0]='\0';
if (line == 1)
ptr += n;
if ( n != max )
break;
++ptr;
++j;
}
return 0;
}
So when I iterate over my struct, I can get something like:
For configuration op
tions (arch/xxx/conf
ig.in, and all the C
onfig.in files),some
what different inden
tation is used.

You could use strncpy.
FYI:
char field [ max ];
while (...) {
mystruct[j]->str = field;
Two problems with this: (1) every struct in your array is going to end up pointing at the same string, which will have the value of the last thing you scanned, (2) they are pointing to a variable on the stack, so when this function returns they will be trashed. That doesn't manifest itself visibly here (e.g. your program doesn't explode) because the function happens to be 'main', but if you moved this to a separate routine and called it to parse a string, you'd get back garbage.
mystruct doesn't need to be pointer to pointer. For a 1D array, just allocate a block N * sizeof *myarray for N elements.
A common C idiom when dealing with structs is to use typedef so you don't have to type struct foo all the time. For instance:
typedef struct {
int x, y;
} point;
Now instead of typing struct point pt you can just say point pt.

If your string is not going to change after you split it up, I'd recommend using a struct like this:
struct st {
char *begin;
char *end;
};
or the alternative:
struct st {
char *s;
size_t len;
};
Then instead of creating all those new strings, just mark where each one begins and ends in your struct. Keep the original string in memory.

One option is to do it character-by-character.
Calculate the number of lines as you are currently doing.
Allocate memory = (strlen(tmp) + number_of_lines) * sizeof(char)
Walk through your input string, copying characters from the input to the newly allocated memory. Every 20th character, insert a null byte to delimit that string. Save a pointer to the beginning of each line in your array of structs.
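The steps above can be sketched as follows. The split_fixed name and its signature are assumptions for the example, and memcpy copies each chunk in one call instead of moving characters one at a time, which has the same effect.

```c
#include <stdlib.h>
#include <string.h>

struct st { char *str; };

/* Split src into pieces of at most max characters. All pieces live in one
   malloc'd block (its start is lines[0].str), each followed by a null byte.
   Returns an array of *count structs, or NULL on error or empty input. */
struct st *split_fixed(const char *src, size_t max, size_t *count)
{
    *count = 0;
    if (max == 0)
        return NULL;

    size_t len = strlen(src);
    size_t n = (len + max - 1) / max;       /* number of lines, rounded up */
    if (n == 0)
        return NULL;

    struct st *lines = malloc(n * sizeof *lines);
    char *block = malloc(len + n);          /* characters plus one '\0' per line */
    if (lines == NULL || block == NULL) {
        free(lines);
        free(block);
        return NULL;
    }

    char *dst = block;
    for (size_t i = 0; i < n; i++) {
        size_t left = len - i * max;
        size_t chunk = left < max ? left : max;
        lines[i].str = dst;                 /* remember where this line starts */
        memcpy(dst, src + i * max, chunk);
        dst[chunk] = '\0';
        dst += chunk + 1;
    }
    *count = n;
    return lines;
}
```

Because every piece shares one block, cleanup is two calls: free(lines[0].str) followed by free(lines).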

It's easy enough?
#include <stdio.h>
#include <string.h>
#define SMAX 20
typedef struct {char str[SMAX+1];} ST;
int main()
{
ST st[SMAX]={0};
char *tmp = "For configuration options (arch/xxx/config.in, and all the Config.in files),somewhat different indentation is used.";
int i=0,j;
for( ; (st[i++]=*(ST*)tmp).str[SMAX]=0 , strlen(tmp)>=SMAX; tmp+=SMAX );
for( j=0;j<i;++j )
puts(st[j].str);
return 0;
}

You may use the strndup() function (not part of older ISO C standards, but available in POSIX.1-2008 and in GNU libc).
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
struct st {char *str;};
int main ()
{
struct st *mystruct; /* i wonder if there's need for double indirection... */
char tmp[] = "For configuration options (arch/xxx/config.in, and all the Config.in files),somewhat different indentation is used.";
size_t max = 20, j = 0; // max length of string
size_t alloc = (strlen(tmp) + max - 1)/max; /* correct round up */
mystruct = malloc(alloc * sizeof *mystruct);
if(!mystruct) return 1; /* never forget testing if allocation failed! */
for(j = 0; j<alloc; j++)
{
mystruct[j].str = strndup(tmp+j*max, max); /* copy at most max chars of chunk j */
}
return 0;
}
