Trouble with compression in LZW - c

I'm having trouble implementing an LZW compressor. The compressor seems to work fine, but for some input streams it doesn't write the end-of-stream code (defined with the value 256), so the decompressor loops forever.
The code of the compressor is the following:
int compress1(FILE* input, BIT_FILE* output) {
CODE next_code; // next node
CODE current_code; // current node
CODE index; // node of the found character
int character;
int ret;
next_code = FIRST_CODE;
dictionary_init();
if ((current_code = getc(input)) == EOF)
current_code = EOS;
while ((character = getc(input)) != EOF) {
index = dictionary_lookup(current_code, (SYMBOL)character);
if (dictionary[index].code != UNUSED) {
current_code = dictionary[index].code;
}
else {
if (next_code <= MAX_CODE-1) {
dictionary[index].code = next_code++;
dictionary[index].parent = current_code;
dictionary[index].symbol = (SYMBOL)character;
}
else {
// handling full dictionary
dictionary_init();
next_code = FIRST_CODE;
}
ret = bit_write(output, (uint64_t) current_code, BITS);
if( ret != 0)
return -1;
current_code = (CODE)character;
}
}
ret = bit_write(output, (uint64_t) current_code, BITS);
if (ret != 0)
return -1;
ret = bit_write(output, (uint64_t) EOS, BITS);
if (ret != 0)
return -1;
if (bit_close(output) == -1) {
printf("Ops: error during closing\n");
return -1;
}
return 0;
}
CODE and SYMBOL are typedefs of uint32_t and uint16_t respectively; FIRST_CODE is defined as 257. The function dictionary_init() simply initializes the dictionary; dictionary_lookup() returns the index of the child of the parent node "current_code" that has symbol "character" (if it exists).
The writing of the binary file is defined as:
int bit_write(BIT_FILE* bf, uint64_t data, int len)
{
int space, result, offset, wbits, udata;
uint64_t* p;
uint64_t tmp;
udata = (int)data;
if (bf == NULL || len < 1 || len > (8* sizeof(data)))
return -1;
if (bf->reading == true)
return -1;
while (len > 0) {
space = bf->end - bf->next;
if (space < 0) {
return -1;
}
// if buffer is full, flush data to file and reinit BIT_IO struct
if (space == 0) {
result = bit_flush(bf);
if (result < 0)
return -1;
}
p = bf->buf + (bf->next/64);
offset = bf->next % 64;
wbits = 64 - offset;
if (len < wbits)
wbits = len;
tmp = le64toh(*p);
tmp |= (data << offset);
*p = htole64(tmp);
bf->next += wbits;
len -= wbits;
data >>= wbits;
}
return 0;
}
I have already opened the file using another function, so bit_write takes the pointer to the bf structure as input.
Can someone help me find the error?
An example of when this problem arises is the following:
If the input string is "Nel mezzo del cammi" everything works fine and I have the following compressed file (in Hexadecimal, using 12 Bits for encoding symbols):
4E 50 06 6C 00 02 6D 50 06 7A A0 07 6F 00 02 64
20 10 20 30 06 61 D0 06 6D 90 06 0D A0 00 00 01
If I add another character to the string, in particular "Nel mezzo del cammin", I have the following result:
4E 50 06 6C 00 02 6D 50 06 7A A0 07 6F 00 02 64
20 10 20 30 06 61 D0 06 6D 90 06 6E D0 00 0A 00
10
In the second case it doesn't write the End of Stream correctly.
SOLUTION: check that there is enough space in the buffer for the whole coded symbol I am going to write. Just change:
if (space == 0)
to:
if (space < len)
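The byte layout in the dumps above can be reproduced with a small in-memory writer. This is only a minimal sketch, not the poster's BIT_FILE code: bit_sink, sink_write and sink_flush are hypothetical names, the buffer is byte-based instead of 64-bit-word-based, and the "file" is a plain array. Each step writes no more bits than fit in the current byte, so a code that straddles the end of the buffer is split around the flush rather than overrunning it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny on purpose: 32 bits is not a multiple of the 12-bit code size. */
enum { BUF_BYTES = 4, BUF_BITS = BUF_BYTES * 8 };

typedef struct {
    unsigned char buf[BUF_BYTES];  /* pending bits, least significant bit first */
    int next;                      /* number of bits already used in buf */
    unsigned char out[64];         /* stand-in for the output file */
    size_t out_len;
} bit_sink;                        /* hypothetical type; zero it before use */

/* Move the used part of buf (rounded up to whole bytes) to out, then reset buf. */
static int sink_flush(bit_sink *s)
{
    size_t nbytes = ((size_t)s->next + 7) / 8;
    if (s->out_len + nbytes > sizeof s->out)
        return -1;
    memcpy(s->out + s->out_len, s->buf, nbytes);
    s->out_len += nbytes;
    memset(s->buf, 0, sizeof s->buf);
    s->next = 0;
    return 0;
}

/* Append the low len bits of data, flushing whenever the buffer is full. */
static int sink_write(bit_sink *s, uint64_t data, int len)
{
    assert(s != NULL);
    if (len < 1 || len > 64)
        return -1;
    while (len > 0) {
        if (s->next == BUF_BITS && sink_flush(s) != 0)
            return -1;
        int offset = s->next % 8;
        int wbits = 8 - offset;            /* room left in the current byte */
        if (wbits > len)
            wbits = len;
        uint64_t mask = (UINT64_C(1) << wbits) - 1;
        s->buf[s->next / 8] |= (unsigned char)((data & mask) << offset);
        s->next += wbits;
        len -= wbits;
        data >>= wbits;
    }
    return 0;
}
```

Writing the 12-bit codes for "Nel " (0x04E, 0x065, 0x06C, 0x020) and then calling sink_flush() yields 4E 50 06 6C 00 02, the first six bytes of both dumps, even though the third code is split by a flush.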

Related

How can I read and obtain separated data from a file using 'fread' in C?

I've written in a file (using 'fwrite()') the following:
TUS�ABQ���������������(A����������(A��B������(A��B���A��(A��B���A������B���A������0����A������0�ABQ�������0�ABQ�����LAS����������������A�����������A��&B�������A��&B��B���A��&B��B������&B��
B����153���B����153�LAS�����153�LAS�����LAX���������������:A����������:AUUB������:AUUB��B��:
AUUB��B����UUB��B����������B��������LAX���������LAX�����MDW���������������A����������A��(�������A��(����A��A��(����A������(����A����A�89���A����A�89MDW�����A�89MDW�����OAK���������
����������������������#�����������#�����������#�����������#�������������������������OAK���������OAK�����SAN���������������LA����������LA��P#������LA��P#��#A��LA��P#��#A������P#��#A����������#A��������SAN���������SAN�����TPA�ABQ����������������B�����������B��#�����...(continues)
which is translated to this:
TUSLWD2.103.47.775.1904.06.40.03AMBRFD4.63.228.935.0043.09.113.0ASDGHU5.226.47.78.3.26...(The same structure)
and the hexdump of that would be:
00000000 54 55 53 00 41 42 51 00 00 00 00 00 00 00 00 00 |TUS.ABQ.........|
00000010 00 00 00 00 00 00 28 41 00 00 0e 42 00 00 f8 41 |......(A...B...A|
00000020 00 00 00 00 4c 41 53 00 00 00 00 00 00 00 00 00 |....LAS.........|
00000030 00 00 00 00 00 00 88 41 00 00 26 42 9a 99 11 42 |.......A..&B...B|
(Continues...)
The structure is always two words of 3 characters each (e.g. TUS and LWD) followed by 7 floats, and this repeats until the end of the file.
The key thing is: I just want to read every field separated like 'TUS', 'LWD', '2.10', '3.4', '7.77'...
And I can only use 'fread()' to achieve that! For now, I'm trying this:
aux2 = 0;
fseek(fp, SEEK_SET, 0);
fileSize = 0;
while (!feof(fp) && aux<=2) {
fread(buffer, sizeof(char)*4, 1, fp);
printf("%s", buffer);
fread(buffer, sizeof(char)*4, 1, fp);
printf("%s", buffer);
for(i=0; i<7; i++){
fread(&delay, sizeof(float), 1, fp);
printf("%f", delay);
}
printf("\n");
aux++;
fseek(fp,sizeof(char)*7+sizeof(float)*7,SEEK_SET);
aux2+=36;
}
And I get this result:
TUSABQ0.0000000.0000000.00000010.5000000.0000000.00000010.500000
AB0.0000000.000000-10384675421112248092159136000638976.0000000.0000000.000000-10384675421112248092159136000638976.0000000.000000
AB0.0000000.000000-10384675421112248092159136000638976.0000000.0000000.000000-10384675421112248092159136000638976.0000000.000000
But it does not work correctly...
*Note: forget the arguments of the last 'fseek()', cos I've been trying too many meaningless things!
To write the words (i.e. TUS) into the file, I use this:
fwrite(x->data->key, 4, sizeof(char), fp);
and to write the floats, this:
for (i = 0; i < 7; i++) {
fwrite(&current->data->retrasos[i], sizeof(float), sizeof(float), fp);
}
I'd recommend using a structure to hold each data unit:
typedef struct {
float value[7];
char word1[5]; /* 4 + '\0' */
char word2[5]; /* 4 + '\0' */
} unit;
To make the file format portable, you need a function that packs and unpacks the above structure to/from a 36-byte array. On Intel and AMD architectures, float corresponds to IEEE-754-2008 binary32 format in little-endian byte order. For example,
#include <errno.h>
#include <string.h>
#define STORAGE_UNIT (4+4+7*4)
#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
size_t unit_pack(char *target, const size_t target_len, const unit *source)
{
size_t i;
if (!target || target_len < STORAGE_UNIT || !source) {
errno = EINVAL;
return 0;
}
memcpy(target + 0, source->word1, 4);
memcpy(target + 4, source->word2, 4);
for (i = 0; i < 7; i++)
memcpy(target + 8 + 4*i, &(source->value[i]), 4);
return STORAGE_UNIT;
}
size_t unit_unpack(unit *target, const char *source, const size_t source_len)
{
size_t i;
if (!target || !source || source_len < STORAGE_UNIT) {
errno = EINVAL;
return 0;
}
memcpy(target->word1, source, 4);
target->word1[4] = '\0';
memcpy(target->word2, source + 4, 4);
target->word2[4] = '\0';
for (i = 0; i < 7; i++)
memcpy(&(target->value[i]), source + 8 + i*4, 4);
return STORAGE_UNIT;
}
#else
#error Unsupported architecture!
#endif
The above only works on Intel and AMD machines, but it is certainly easy to extend to other architectures if necessary. (Almost all machines currently use IEEE 754-2008 binary32 for float, only the byte order varies. Those that do not, typically have C extensions that do the conversion to/from their internal formats.)
Using the above, you can -- should! must! -- document your file format, for example as follows:
Words are 4 bytes encoded in UTF-8
Floats are IEEE 754-2008 binary32 values in little-endian byte order
A file contains one or more units. Each unit consists of
Name    Description
word1   First word
word2   Second word
value0  First float
value1  Second float
value2  Third float
value3  Fourth float
value4  Fifth float
value5  Sixth float
value6  Seventh float
There is no padding.
To write a unit, use a char array of size STORAGE_UNIT as a cache, and write that. So, if you have unit *one, you can write it to FILE *out using
char buffer[STORAGE_UNIT];
if (!unit_pack(buffer, sizeof buffer, one)) { /* unit_pack() returns 0 on error */
/* Error! Abort program! */
}
if (fwrite(buffer, STORAGE_UNIT, 1, out) != 1) {
/* Write error! Abort program! */
}
Correspondingly, reading from FILE *in would be
char buffer[STORAGE_UNIT];
if (fread(buffer, STORAGE_UNIT, 1, in) != 1) {
/* End of file, or read error.
Check feof(in) or/and ferror(in). */
}
if (!unit_unpack(one, buffer, STORAGE_UNIT)) { /* unit_unpack() returns 0 on error */
/* Error! Abort program! */
}
If one is an array of units, and you are writing or reading one[k], use &(one[k]) (or equivalently one + k) instead of one.
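Putting the pieces together, a read loop over the whole file could look roughly like the following. This is a minimal sketch under the same assumptions as above (4-byte IEEE 754 binary32 floats, little-endian host); pack_record, unpack_record, write_units and read_units are hypothetical helper names, not part of any library. The loop condition tests the return value of fread() instead of feof(), so it stops cleanly at end of file or on a truncated record.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define STORAGE_UNIT (4 + 4 + 7 * 4)   /* 36 bytes per record, no padding */

typedef struct {
    float value[7];
    char  word1[5];   /* 4 bytes + '\0' */
    char  word2[5];   /* 4 bytes + '\0' */
} unit;

/* Assumption: float is a 4-byte IEEE 754 binary32 and the host is little-endian. */
static void pack_record(unsigned char *dst, const unit *u)
{
    assert(sizeof(float) == 4);
    memcpy(dst, u->word1, 4);
    memcpy(dst + 4, u->word2, 4);
    memcpy(dst + 8, u->value, 7 * 4);
}

static void unpack_record(unit *u, const unsigned char *src)
{
    assert(sizeof(float) == 4);
    memcpy(u->word1, src, 4);
    u->word1[4] = '\0';
    memcpy(u->word2, src + 4, 4);
    u->word2[4] = '\0';
    memcpy(u->value, src + 8, 7 * 4);
}

/* Write count records; returns how many were written completely. */
static size_t write_units(FILE *out, const unit *src, size_t count)
{
    unsigned char rec[STORAGE_UNIT];
    size_t n;
    for (n = 0; n < count; n++) {
        pack_record(rec, &src[n]);
        if (fwrite(rec, STORAGE_UNIT, 1, out) != 1)
            break;
    }
    return n;
}

/* Read up to max records; stops at end of file, a short record, or a read error. */
static size_t read_units(FILE *in, unit *dst, size_t max)
{
    unsigned char rec[STORAGE_UNIT];
    size_t n = 0;
    while (n < max && fread(rec, STORAGE_UNIT, 1, in) == 1)
        unpack_record(&dst[n++], rec);
    return n;
}
```

After read_units() returns, ferror(in) tells a read error apart from a normal end of file. Each field can then be printed separately, for example with printf("%s %s %f\n", u.word1, u.word2, u.value[0]).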

How to parse hex dump

I have a flash memory dump file that spits out addresses and data.
I want to parse the data so that it will tell me the valid tags.
The first column (for example '002F0900') holds the starting address of each line.
An example of a valid tag is "DC 08 00 06 00 00 07 26 01 25 05 09" where "DC 08" = tag number, "00 06" = tag data length, "00 00" = tag version. Tag data starts after the version and in this case would be "07 26 01 25 05 09" and the next tag would start "DC 33".
I'm able to print out the first tag up to the data length but I'm not sure how to print the data because I have to consider if the data will go onto the next line so I'd have to skip the address somehow. Each line contains 58 columns. Each address is 8 characters long plus a colon and 2 spaces until the next hex value starts.
I also will eventually have to consider when "DC" shows up in the address column.
I'd appreciate any advice, because I know the way I'm doing this isn't the best way to do it. I'm just trying to get it to work first.
The text file is thousands of lines that look like this:
002F0900: 09 FF DC 08 00 06 00 00 07 26 01 25 05 09 DC 33
002F0910: 00 07 00 00 1F A0 26 01 25 05 09 FF 9C 3E 00 08
002F0920: 00 01 07 DD 0A 0D 00 29 35 AD 9C 41 00 0A 00 01
002F0930: 07 DD 0A 0D 00 29 36 1C 1D 01 9C 40 00 02 00 01
002F0940: 01 00 9C 42 00 0A 00 01 07 DD 0A 0D 00 29 36 21
002F0950: 1D AD 9C 15 00 20 00 00 01 00 00 00 00 04 AD AE
002F0960: C8 0B C0 8A 5B 52 01 00 00 00 00 00 FF 84 36 BA
002F0970: 4E 92 E4 16 28 86 75 C0 DC 10 00 05 00 00 00 00
002F0980: 00 00 01 FF DC 30 00 04 00 01 00 00 00 01 9C 41
Example output would be:
Tag Number: DC 08
Address: 002E0000
Data Length: 06
Tag Data: 07 26 01 25 05 09
Source Code:
#include<stdio.h>
FILE *fp;
main()
{
int i=0;
char ch;
char address[1024];
char tag_number[5];
char tag_length[4];
int number_of_addresses = 0;
long int length;
fp = fopen(FILE_NAME,"rb");
if(fp == NULL) {
printf("error opening file");
}
else {
printf("File opened\n");
while(1){
if((address[i]=fgetc(fp)) ==':')
break;
number_of_addresses++;
i++;
}
printf("\nAddress:");
for (i = 0; i < number_of_addresses;i++)
printf("%c",address[i]);
while((ch = fgetc(fp)) != 'D'){ //Search for valid tag
}
tag_number[0] = ch;
if((ch = fgetc(fp)) == 'C') //We have a valid TAG
{
tag_number[1] = ch;
tag_number[2] = fgetc(fp);
tag_number[3] = fgetc(fp);
tag_number[4] = fgetc(fp);
}
printf("\nNumber:");
for(i=0;i<5;i++)
printf("%c",tag_number[i]);
fgetc(fp); //For space
tag_length[0] = fgetc(fp);
tag_length[1] = fgetc(fp);
fgetc(fp); //For space
tag_length[2] = fgetc(fp);
tag_length[3] = fgetc(fp);
printf("\nLength:");
for(i=0;i<4;i++)
printf("%c",tag_length[i]);
length = strtol(tag_length,&tag_length[4], 16);
printf("\nThe decimal equilvant is: %ld",length);
for (i = 0;i<165;i++)
printf("\n%d:%c",i,fgetc(fp));
}
fclose(fp);
}
Update @ooga: The tags are written arbitrarily. If we also consider invalid tags in the logic then I should be able to figure out the rest if I spend some time. Thanks
This is just an idea to get you started since I'm not entirely sure what you need. The basic idea is that read_byte returns the next two-digit hex value as a byte and also returns its address.
#include <stdio.h>
#include <stdlib.h>
#define FILE_NAME "UA201_dump.txt"
void err(char *msg) {
fprintf(stderr, "Error: %s\n", msg);
exit(EXIT_FAILURE);
}
// read_byte
// Reads a single two-digit "byte" from the hex dump, also
// reads the address (if necessary).
// Returns the byte and current address through pointers.
// Returns 1 if it was able to read a byte, 0 otherwise.
int read_byte(FILE *fp, unsigned *byte, unsigned *addr_ret) {
// Save current column and address between calls.
static int column = 0;
static unsigned addr;
// If it's the beginning of a line...
if (column == 0)
// ... read the address.
if (fscanf(fp, "%x:", &addr) != 1)
// Return 0 if no address could be read.
return 0;
// Read the next two-digit hex value into *byte.
if (fscanf(fp, "%x", byte) != 1)
// Return 0 if no byte could be read.
return 0;
// Set return address to current address.
*addr_ret = addr;
// Increment current address for next time.
++addr;
// Increment column, wrapping back to 0 when it reaches 16.
column = (column + 1) % 16;
// Return 1 on success.
return 1;
}
int main() {
unsigned byte, addr, afterdc, length, version, i;
FILE *fp = fopen(FILE_NAME,"r");
if (!fp) {
fprintf(stderr, "Can't open %s\n", FILE_NAME);
exit(EXIT_FAILURE);
}
while (read_byte(fp, &byte, &addr)) {
if (byte == 0xDC) {
// Read additional bytes like this:
if (!read_byte(fp, &afterdc, &addr)) err("EOF 1");
if (!read_byte(fp, &length, &addr)) err("EOF 2");
if (!read_byte(fp, &byte, &addr)) err("EOF 3");
length = (length << 8) | byte;
if (!read_byte(fp, &version, &addr)) err("EOF 4");
if (!read_byte(fp, &byte, &addr)) err("EOF 5");
version = (version << 8) | byte;
printf("DC: %02X, %u, %u\n ", afterdc, length, version);
for (i = 0; i < length; ++i) {
if (!read_byte(fp, &byte, &addr)) err("EOF 6");
printf("%02X ", byte);
}
putchar('\n');
}
}
fclose(fp);
return 0;
}
Some explanation:
Every time read_byte is called, it reads the next printed byte (the two-digit hex values) from the hex dump. It returns that byte and also the address of that byte.
There are 16 two-digit hex values on each line. The column number (0 to 15) is retained in a static variable between calls. The column is incremented after reading each byte and reset to 0 every time the column reaches 16.
Any time the column number is 0, it reads the printed address, retaining it between calls in a static variable. It also increments the static addr variable so it can tell you the address of a byte anywhere in the line (when the column number is not zero).
As an example, you could use read_byte like this, which prints each byte value and its address on a separate line:
// after opening file as fp
while (read_byte(fp, &byte, &addr))
printf("%08X- %02X\n", addr, byte);
(Not that it would be useful to do that, but to test it you could run it with the snippet you provided in your question.)
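If more than one dump has to be read in the same program, the static variables get in the way. A possible variation, sketched here with the hypothetical names dump_reader and dump_next, keeps the column and address in a struct that the caller owns. It is otherwise the same idea as read_byte, except that "%2x" is used so a value is never read as more than two hex digits.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    FILE *fp;
    int column;       /* 0..15: position within the current dump line */
    unsigned addr;    /* address of the next byte to be returned */
} dump_reader;        /* hypothetical type; initialise as { fp, 0, 0 } */

/* Reads the next two-digit hex value and its address.
   Returns 1 on success, 0 at end of input or on malformed data. */
static int dump_next(dump_reader *r, unsigned *byte, unsigned *addr)
{
    assert(r != NULL && r->fp != NULL);
    /* At the start of a line, consume the "002F0900:" address field. */
    if (r->column == 0 && fscanf(r->fp, "%x:", &r->addr) != 1)
        return 0;
    /* Read one byte value; leading whitespace is skipped by fscanf. */
    if (fscanf(r->fp, "%2x", byte) != 1)
        return 0;
    *addr = r->addr++;
    r->column = (r->column + 1) % 16;
    return 1;
}
```

Because the address field is consumed by its own fscanf call, a "DC" that happens to appear inside an address is never mistaken for a tag byte.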

Not able to pack hex bytes into unsigned char array or pointer properly

I tried to imitate a struct with an unsigned char array or pointer, but I am not able to obtain the same hex values.
Printing .input with print() gives the correct output.
I am trying to get the same output from stringBytes_Data or data_hexStrFormatted with print().
Can anyone advise?
Given
struct _vector {
char *input;
unsigned char len;
};
static struct _vector tv2 = {
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
And function to view the data:
static void print(char *intro_message, unsigned char *text_addr,
unsigned int size) {
unsigned int i;
for (i = 0; i < size; i++) {
printf("%2x ", text_addr[i]);
if ((i & 0xf) == 0xf)
printf("\n");
}
printf("\n");
}
How may I get the same effect with:
char* stringBytes_Data = "6bc1bee22e409f96e93d7e117393172aae2d8a571e03ac9c9eb76fac45af8e5130c81c46a35ce411e5fbc1191a0a52eff69f2445df4f9b17ad2b417be66c3710";
I tried, but the result is wrong :
unsigned char* data_hexStrFormatted;
int lengthOfStr = strlen(stringBytes_Data);
int charCounterForNewStr = 0;
int formattedLength = (2*lengthOfStr)+1;
data_hexStrFormatted = (unsigned char*) malloc((formattedLength)*sizeof(unsigned char)); // x2 as we add \x to XX, and 1 for NULL end char
for(i=0; i<lengthOfStr; i=i+2) {
// prepend \x
data_hexStrFormatted[charCounterForNewStr++] = '\\';
data_hexStrFormatted[charCounterForNewStr++] = 'x';
data_hexStrFormatted[charCounterForNewStr++] = stringBytes_Data[i];
data_hexStrFormatted[charCounterForNewStr++] = stringBytes_Data[i+1];
}
data_hexStrFormatted[formattedLength-1] = '\0';
printf("%s\n", data_hexStrFormatted);
printf("%d byte length \n", strlen(data_hexStrFormatted)/4);
print("data_hexStrFormatted",
(unsigned char *)
data_hexStrFormatted,
(formattedLength)/4);
You seem to be asking:
Given a string containing pairs of hex digits, convert the hex digits to byte values?
If so, then code similar to the following can be used:
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
This function works correctly for valid hex digits; it will produce nonsense for invalid inputs. If you decide you need to detect erroneous input, you'll need to improve it. There are other ways to write this function (lots of them, in fact). One that can be effective is an array of 256 bytes statically initialized with the correct values, so you simply write return hex_array[c];.
char* stringBytes_Data = "6bc1bee22e409f96e93d7e117393172aae2d8a571e03ac9c9eb76fac45af8e5130c81c46a35ce411e5fbc1191a0a52eff69f2445df4f9b17ad2b417be66c3710";
size_t len = strlen(stringBytes_Data);
char buffer[len / 2];
assert(len % 2 == 0);
for (size_t i = 0; i < len; i += 2)
buffer[i / 2] = hexit(stringBytes_Data[i]) << 4 | hexit(stringBytes_Data[i+1]);
printf("%.*s\n", (int)len/2, buffer);
This code sets the array buffer to contain the converted code. It won't work correctly if there's an odd number of characters in the array (that's what the assertion states).
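If erroneous input does need to be detected, one option is to have the digit lookup report failure and let the conversion return an error. The following is a minimal sketch; hex_value and hex_decode are hypothetical names, and the range checks assume an ASCII-compatible character set where 'a'..'f' and 'A'..'F' are contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 0..15 for a hex digit, or -1 for anything else. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Converts pairs of hex digits in hex to bytes in out.
   Returns the number of bytes written, or -1 if the string has odd
   length, contains a non-hex character, or does not fit in out. */
static long hex_decode(const char *hex, unsigned char *out, size_t out_size)
{
    assert(hex != NULL && out != NULL);
    size_t len = strlen(hex);
    if (len % 2 != 0 || len / 2 > out_size)
        return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = hex_value((unsigned char)hex[i]);
        int lo = hex_value((unsigned char)hex[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (unsigned char)(hi << 4 | lo);
    }
    return (long)(len / 2);
}
```

On failure the output buffer may already hold the bytes decoded before the bad digit, so callers should ignore its contents when -1 is returned.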
Working code - #2
Using the print() function from the question with the intro_message argument removed since it is unused:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
struct Vector
{
char *input;
unsigned char len;
};
static struct Vector tv2 =
{
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
static void print(unsigned char *text_addr, unsigned int size)
{
unsigned int i;
for (i = 0; i < size; i++)
{
printf("%2x ", text_addr[i]);
if ((i & 0xf) == 0xf)
printf("\n");
}
printf("\n");
}
static void print2(const char *tag, const unsigned char *data, size_t size)
{
printf("%s:\n", tag);
for (size_t i = 0; i < size; i++)
{
printf("%2x ", data[i]);
if ((i & 0x0F) == 0x0F)
printf("\n");
}
printf("\n");
}
static void print_text(const char *tag, const char *data, size_t datalen)
{
char buffer[datalen / 2];
assert(datalen % 2 == 0);
for (size_t i = 0; i < datalen; i += 2)
buffer[i / 2] = hexit(data[i]) << 4 | hexit(data[i + 1]);
//printf("%s: [[%.*s]]\n", tag, (int)datalen / 2, buffer);
assert(memcmp(buffer, tv2.input, tv2.len) == 0);
print((unsigned char *)buffer, datalen / 2);
print2(tag, (unsigned char *)buffer, datalen / 2);
}
int main(void)
{
char *stringBytes_Data =
"6bc1bee22e409f96e93d7e117393172a"
"ae2d8a571e03ac9c9eb76fac45af8e51"
"30c81c46a35ce411e5fbc1191a0a52ef"
"f69f2445df4f9b17ad2b417be66c3710"
;
print_text("buffer", stringBytes_Data, strlen(stringBytes_Data));
return 0;
}
Sample output:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
Working code - #1
Redone — previous versions had various 'off by a factor of two' errors which were partially concealed by the system zeroing a buffer.
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
struct Vector
{
char *input;
unsigned char len;
};
static struct Vector tv2 =
{
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
static void print(const char *tag, const unsigned char *data, size_t size)
{
printf("%s:\n", tag);
for (size_t i = 0; i < size; i++)
{
printf("%2x ", data[i]);
if ((i & 0x0F) == 0x0F)
printf("\n");
}
printf("\n");
}
static void print_text(const char *tag, const char *data, size_t datalen)
{
char buffer[datalen / 2];
assert(datalen % 2 == 0);
for (size_t i = 0; i < datalen; i += 2)
buffer[i / 2] = hexit(data[i]) << 4 | hexit(data[i + 1]);
printf("%s: [[%.*s]]\n", tag, (int)datalen / 2, buffer);
assert(memcmp(buffer, tv2.input, tv2.len) == 0);
print(tag, (unsigned char *)buffer, datalen / 2);
}
int main(void)
{
char *stringBytes_Data =
"6bc1bee22e409f96e93d7e117393172a"
"ae2d8a571e03ac9c9eb76fac45af8e51"
"30c81c46a35ce411e5fbc1191a0a52ef"
"f69f2445df4f9b17ad2b417be66c3710"
;
print_text("buffer", stringBytes_Data, strlen(stringBytes_Data));
return 0;
}
Raw output on a UTF-8 terminal (it isn't valid UTF-8 data, hence the question marks):
buffer: [[k???.#???=~s?*?-?W????o?E??Q0?F?\????
R???$E?O??+A{?l7]]
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
Raw output converted into UTF-8 as if it was ISO 8859-15 (or 8859-1):
buffer: [[kÁŸâ.#é=~s*®-W¬·o¬E¯Q0ÈF£\äåûÁ
Rïö$EßO­+A{æl7]]
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
The data doesn't seem to have any particular meaning, but beauty is in the eye of the beholder.

(C) how to fix this algorithm for z827 ASCII compression?

noob warning.
I'm trying to create a compression program. It takes a .txt with ASCII characters as an argument, and cuts off the leading 0 of the binary representation of each character.
It does this by using the last 2 bytes of two different integers. A character with a leading zero is put into the 4th byte of the integer 'writer', and the next character is put into the 3rd byte of the integer 'temp'. The 'temp' int is then shifted to the right once, and then OR'd with 'writer', so that the leading zero slot has been filled with data we need. This repeats, with the shift counter increasing after every character. The first case is a bit odd. The algorithm isn't very complex if written out on paper.
I feel like I've tried everything. I've been over the algorithm so many times. I'm pretty sure the problem is when shift_counter gets to 8.. but it should work fine. It just doesn't. I can show you why here (the code is further down):
This is the hex dump of my output:
0000000 3f 00 00 00 41 10 68 9e 6e c3 d9 65 10 88 5e c6
0000020 d3 41 e6 74 9a 5d 06 d1 df a0 7a 7d 5e 06 a5 dd
0000040 20 3a bd 3c a7 a7 dd 67 10 e8 5d a7 83 e8 e8 72
0000060 19 a4 c7 c9 6e a0 f1 f8 dd 86 cb cb f3 f9 3c
0000077
And the correct output:
0000000 3f 00 00 00 41 d0 3c dd 86 b3 cb 20 7a 19 4f 07
0000020 99 d3 ec 32 88 fe 06 d5 e7 65 50 da 0d a2 97 e7
0000040 f4 b4 fb 0c 7a d7 e9 20 3a ba 0c d2 e3 64 37 d0
0000060 f8 dd 86 cb cb f3 79 fa ed 76 29 00 0a 0a
0000076
code:
int compress(char *filename_ptr){
int in_fd;
in_fd = open(filename_ptr, O_RDONLY);
//set pointer to the end of the file, find file size, then reset position
//by closing/opening
unsigned int file_bytes = lseek(in_fd, 0, SEEK_END);
close(in_fd);
in_fd = open(filename_ptr, O_RDONLY);
//store file contents in buffer
unsigned char read_buffer[file_bytes];
read(in_fd, read_buffer, file_bytes);
//file where the output will be stored
int out_fd;
creat("output.txt", 0644);
out_fd = open("output.txt", O_WRONLY);
//sets file size in header (needed for decompression, this is the size of the
//file before compression. everything after this we write this 4-byte int
//is a 1 byte char
write(out_fd, &file_bytes, 4);
unsigned int writer;
unsigned int temp;
unsigned char out_char;
int i;
int shift_count = 8;
for(i = 0; i < file_bytes; i++){
if(shift_count == 8){
writer = read_buffer[i];
temp = temp & 0x00000000;
temp = read_buffer[i+1] << 8;
shift_count = 1;
}else{
//moves the next char's bits to the left, for the purpose of filling the
//8 bit buffer (writer) via OR operation
temp = read_buffer[i] << 8;
}
temp = temp >> shift_count;
writer = writer | temp;
//output right byte of writer
unsigned int right_byte = writer & 0x000000ff;
//output right_byte as a char
out_char = (char) right_byte;
//write_buffer[i] = out_char;
write(out_fd, &out_char, 1);
//clear right side of writer
writer = writer & 0x0000ff00;
//shift left side of writer to the right by 8
writer = writer >> 8;
shift_count++;
}
return 0;
}
It seems to me that input and output are too strongly coupled.
At some point, the program should be reading (roughly) the 80th octet from the input and writing (roughly) the 70th octet to the output, because you want to (on average) write 7 bits out for every 8 bits you read in, right?
What the loop
for(i = 0; i < file_bytes; i++){
...
... = read_buffer[i];
...
write(out_fd, &out_char, 1);
...
}
actually seems to be doing is:
On the 70th pass through the loop -- when 70==i --
it's reading the 70th octet from the input and writing the 70th octet to the output.
On the 80th pass through the loop -- when 80==i --
it's reading the 80th octet from the input and writing the 80th octet to the output.
You must decide:
Do you want "i" to represent the number of input characters processed, or the number of output chars processed?
Because it's not possible to do both -- it's not possible to have 70 equal 80.
Perhaps something like this is closer to what you wanted:
/* test.c
http://stackoverflow.com/questions/15080239/c-how-to-fix-this-algorithm-for-z827-ascii-compression
WARNING: untested code.
*/
#include <fcntl.h>  /* open, creat, O_RDONLY, O_WRONLY */
#include <unistd.h> /* lseek, read, write, close */
int compress(char *filename_ptr){
int in_fd;
in_fd = open(filename_ptr, O_RDONLY);
//set pointer to the end of the file, find file size, then reset position
//by closing/opening
unsigned int file_bytes = lseek(in_fd, 0, SEEK_END);
close(in_fd);
in_fd = open(filename_ptr, O_RDONLY);
//store file contents in buffer
unsigned char read_buffer[file_bytes];
read(in_fd, read_buffer, file_bytes);
//file where the output will be stored
int out_fd;
creat("output.txt", 0644);
out_fd = open("output.txt", O_WRONLY);
//sets file size in header (needed for decompression, this is the size of the
//file before compression. everything after this we write this 4-byte int
//is a 1 byte char
write(out_fd, &file_bytes, 4);
unsigned int writer = 0; // bit buffer must start empty
unsigned int temp;
unsigned char out_char;
int i;
int writer_bits = 0; // 0 bits of data in writer so far
for(i = 0; i < file_bytes; i++){
// i is the number of (7 bit ASCII) characters
// read from the input so far.
// add 7 more bits to the writer
temp = read_buffer[i];
//moves the next char's bits to the left, for the purpose of filling the
//8 bit buffer (writer) via OR operation
//(avoid overwriting the "writer_bits" of good bits
//already in the buffer).
temp = read_buffer[i] << writer_bits;
writer = writer | temp;
writer_bits = writer_bits + 7;
//output right byte of writer
unsigned int right_byte = writer & 0x000000ff;
//output right_byte as a char
out_char = (unsigned char) right_byte;
// output 8 bits of data whenever
// we have *at least* 8 bits of data in the writer buffer.
if(writer_bits >= 8){
//write_buffer[i] = out_char;
write(out_fd, &out_char, 1);
//shift left side of writer to the right by 8
writer = writer >> 8;
writer_bits = writer_bits - 8;
}else{
// 7 or fewer bits in writer --
// skip writing until next time.
}
}
// are there any leftover bits still in writer?
if(writer_bits > 0){
//out_char may be stale here, so take the remaining bits from writer
out_char = (unsigned char) (writer & 0x000000ff);
write(out_fd, &out_char, 1);
}
return 0;
}
(Currently the program reads the entire input file into RAM, then writes the entire output file. Some programmers prefer to read a little at a time, then write a little at a time. Both approaches have advantages and disadvantages).
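The packing step can also be checked separately from the file handling. Here is a minimal sketch of it as a function on memory buffers; pack7 is a hypothetical name, and the layout (least significant bits first) is inferred from the first bytes of the "correct output" dump in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Packs n 7-bit characters from in[] into out[], least significant bits first.
   out must have room for (7 * n + 7) / 8 bytes. Returns the bytes written. */
static size_t pack7(const unsigned char *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    unsigned int acc = 0;   /* pending bits, never more than 14 at a time */
    int bits = 0;           /* number of valid bits in acc */
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        /* Drop the leading zero bit and append 7 bits above the pending ones. */
        acc |= (unsigned int)(in[i] & 0x7F) << bits;
        bits += 7;
        if (bits >= 8) {
            out[written++] = (unsigned char)(acc & 0xFF);
            acc >>= 8;
            bits -= 8;
        }
    }
    /* Flush the final partial byte, padded with zero bits. */
    if (bits > 0)
        out[written++] = (unsigned char)(acc & 0xFF);
    return written;
}
```

Packing the four characters "A si" gives 41 D0 3C 0D; the first three bytes match the start of the expected output after the 4-byte header, which suggests the file begins with that text.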

Removing bytes in a dump or utf-8 in c

I have a C program in my fire station that captures incoming packets to the station printer. The program then scans the packet and sends an audible alert for which apparatus is due on the call. The county recently started using UTF-8 packets, and the C program cannot deal with all the extra "00" bytes in the data flow. I need to either ignore the 00 bytes or set the program up to handle UTF-8. I have looked for days and found nothing concrete on how to handle UTF-8 that a novice such as myself can follow. Below is the interpret part of the program.
72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 later in packet
43 4f 44 45 53 45 54 3d 55 54 46 38 0a 40 50 4a beginning of packet
void compressUtf16 (char *buff, size_t count) {
int i;
for (i = 0; i < count; i++)
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
}
{ u_int i=0;
char *searcher = 0;
char c;
int j;
int locflag;
static int locationtripped = 0;
static char currentline[256];
static int currentlinepos = 0;
static char lastdispatched[256];
static char dispatchstring[256];
char betastring[256];
static int a = 0;
static int e = 0;
static int pe = 0;
static int md = 0;
static int pulse = 0;
static char location[128];
static char type[16];
static char station[16];
static FILE *fp;
static int printoutscanning = 0;
static char printoutID[20];
static char printoutfileID[32];
static FILE *dbg;
if(pulse) {
if(pulse == 80) {
sprintf(betastring, "beta a a a");
printf("betastring: \"%s\"\n", betastring);
system(betastring);
pulse = 0;
} else
pulse++;
}
if(header->len > 96) {
for(i=55; (i < header->caplen + 1 ) ; i++) {
c = pkt_data[i-1];
if(c == 13 || c == 10) {
currentline[currentlinepos] = 0;
currentlinepos = 0;
j = strlen(currentline);
if(j && (j > 1)) {
if(strlen(printoutfileID) && printoutscanning) {
dbg = fopen(printoutfileID, "a");
fprintf(dbg, "%s\n", currentline);
fclose(dbg);
}
if(!printoutscanning) {
searcher = 0;
searcher = strstr(currentline, "INCIDENT HISTORY DETAIL:");
if(searcher) {
searcher = searcher + 26;
strncpy(printoutID, searcher, 9);
printoutID[9] = 0;
printoutscanning = 1;
a = 0;
e = 0;
pe = 0;
md = 0;
for(j = 0; j < 128; j++)
location[j] = 0;
for(j = 0; j < 16; j++) {
type[j] = 0;
station[j] = 0;
}
sprintf(printoutfileID, "calls/%s %.6d.txt", printoutID, header-> ts.tv_usec);
dbg = fopen(printoutfileID, "a");
fprintf(dbg, "%s\n", currentline);
fclose(dbg);
}
UTF-8, except for the zero code point itself, will not have any zero bytes in it. The first byte of all multi-byte encodings (non-ASCII code points) will always start with the 11 bit pattern, with subsequent bytes always starting with the 10 bit pattern.
As you can see from the following table, U+0000 is the only code point that can give you a zero byte in UTF-8.
+----------------+----------+----------+----------+----------+
| Unicode | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
+----------------+----------+----------+----------+----------+
| U+0000-007F | 0xxxxxxx | | | |
| U+0080-07FF | 110yyyxx | 10xxxxxx | | |
| U+0800-FFFF | 1110yyyy | 10yyyyxx | 10xxxxxx | |
| U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
+----------------+----------+----------+----------+----------+
UTF-16 will intersperse zero bytes between your otherwise ASCII bytes but it's then a simple matter of throwing away every second byte. Whether that's 0, 2, 4, ... or 1, 3, 5, ... depends on whether your UTF-16 encoding is big-endian or little-endian.
I see from your sample that your data stream does indicate UTF-8 (43 4f 44 45 53 45 54 3d 55 54 46 38 translates to the text CODESET=UTF8) but I'll guarantee you it's lying :-)
The segment 72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 is UTF-16 for reathing, presumably a word segment since I'm not familiar with that word (in English, anyway).
I would suggest you clarify with whoever is generating that data since it's clearly erroneous. As to how you process the UTF-16, I've covered that above. Provided it's ASCII data in there (the alternate bytes are always zero), you can just throw away those alternates with something like:
// Process a UTF16 buffer containing ASCII-only characters.
// buff is the buffer, count is the quantity of UTF-16 chars.
// Will change buffer.
void compressUtf16 (char *buff, size_t count) {
size_t i; // same type as count, avoids a signed/unsigned comparison
for (i = 0; i < count; i++)
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
}
And, if you're using the other endian UTF-16, simply change:
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
into:
buff[i] = buff[i*2+1]; // for 00 xx 00 xx 00 xx ...
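Since the shortcut only works while every character is plain ASCII, it may be worth checking that assumption while converting. A minimal sketch for the little-endian case follows; utf16le_to_ascii is a hypothetical name, and the function works on a byte count rather than a character count.

```c
#include <assert.h>
#include <stddef.h>

/* Narrows UTF-16LE text to ASCII in place.
   nbytes is the size of the UTF-16 data in bytes; a trailing odd byte is ignored.
   Returns the number of characters written, or -1 at the first code unit
   that is not plain ASCII (the buffer is then only partly converted). */
static long utf16le_to_ascii(unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL);
    size_t units = nbytes / 2;
    for (size_t i = 0; i < units; i++) {
        unsigned char lo = buf[2 * i];      /* xx in "xx 00" */
        unsigned char hi = buf[2 * i + 1];  /* must be 00 for ASCII */
        if (hi != 0 || lo > 0x7F)
            return -1;
        buf[i] = lo;   /* safe in place: index i never passes 2 * i */
    }
    return (long)units;
}
```

Applied to the 16 bytes 72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 from the packet, this returns 8 and leaves "reathing" in the first 8 bytes. The result is not NUL-terminated, so the caller has to add the terminator before using string functions such as strstr().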

Resources