Removing bytes in a dump or UTF-8 in C

I have a C program in my fire station that captures incoming packets to the station printer. The program then scans each packet and sends an audible alert for which apparatus is due on the call. The county recently started using UTF-8 packets, and the C program cannot deal with all the extra "00" bytes in the data flow. I need to either ignore the 00 bytes or set the program up to handle UTF-8. I have looked for days and found nothing concrete on handling UTF-8 that a novice such as myself can follow. Below is the interpreting part of the program.
72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 later in packet
43 4f 44 45 53 45 54 3d 55 54 46 38 0a 40 50 4a beginning of packet
void compressUtf16 (char *buff, size_t count) {
int i;
for (i = 0; i < count; i++)
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
}
{ u_int i=0;
char *searcher = 0;
char c;
int j;
int locflag;
static int locationtripped = 0;
static char currentline[256];
static int currentlinepos = 0;
static char lastdispatched[256];
static char dispatchstring[256];
char betastring[256];
static int a = 0;
static int e = 0;
static int pe = 0;
static int md = 0;
static int pulse = 0;
static char location[128];
static char type[16];
static char station[16];
static FILE *fp;
static int printoutscanning = 0;
static char printoutID[20];
static char printoutfileID[32];
static FILE *dbg;
if(pulse) {
if(pulse == 80) {
sprintf(betastring, "beta a a a");
printf("betastring: \"%s\"\n", betastring);
system(betastring);
pulse = 0;
} else
pulse++;
}
if(header->len > 96) {
for(i=55; (i < header->caplen + 1 ) ; i++) {
c = pkt_data[i-1];
if(c == 13 || c == 10) {
currentline[currentlinepos] = 0;
currentlinepos = 0;
j = strlen(currentline);
if(j && (j > 1)) {
if(strlen(printoutfileID) && printoutscanning) {
dbg = fopen(printoutfileID, "a");
fprintf(dbg, "%s\n", currentline);
fclose(dbg);
}
if(!printoutscanning) {
searcher = 0;
searcher = strstr(currentline, "INCIDENT HISTORY DETAIL:");
if(searcher) {
searcher = searcher + 26;
strncpy(printoutID, searcher, 9);
printoutID[9] = 0;
printoutscanning = 1;
a = 0;
e = 0;
pe = 0;
md = 0;
for(j = 0; j < 128; j++)
location[j] = 0;
for(j = 0; j < 16; j++) {
type[j] = 0;
station[j] = 0;
}
sprintf(printoutfileID, "calls/%s %.6d.txt", printoutID, header-> ts.tv_usec);
dbg = fopen(printoutfileID, "a");
fprintf(dbg, "%s\n", currentline);
fclose(dbg);
}

UTF-8 contains no zero bytes except as the encoding of the zero code point (U+0000) itself. The first byte of every multi-byte encoding (a non-ASCII code point) always starts with the bit pattern 11, and each subsequent byte always starts with the bit pattern 10.
As you can see from the following table, U+0000 is the only code point that can give you a zero byte in UTF-8.
+----------------+----------+----------+----------+----------+
| Unicode | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
+----------------+----------+----------+----------+----------+
| U+0000-007F | 0xxxxxxx | | | |
| U+0080-07FF | 110yyyxx | 10xxxxxx | | |
| U+0800-FFFF | 1110yyyy | 10yyyyxx | 10xxxxxx | |
| U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
+----------------+----------+----------+----------+----------+
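The table can be turned into a small encoder. This is only a sketch: the name utf8_encode is made up for illustration, and it does not reject surrogate code points. Every byte it emits for a nonzero code point has at least one high or payload bit set, which is why a 00 byte cannot appear inside genuine UTF-8 text.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF.
   Hypothetical helper for illustration; surrogates are not rejected. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For plain ASCII such as the letter r (U+0072) the result is the single byte 72, with no 00 after it.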
UTF-16 will intersperse zero bytes between your otherwise ASCII bytes but it's then a simple matter of throwing away every second byte. Whether that's 0, 2, 4, ... or 1, 3, 5, ... depends on whether your UTF-16 encoding is big-endian or little-endian.
I see from your sample that your data stream does indicate UTF-8 (43 4f 44 45 53 45 54 3d 55 54 46 38 translates to the text CODESET=UTF8) but I'll guarantee you it's lying :-)
The segment 72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 is UTF-16 (little-endian, since the zero byte follows each character) for reathing, presumably a fragment of a word since I'm not familiar with that word (in English, anyway).
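If you want to confirm that diagnosis on a captured buffer, a rough check is to see whether every second byte is zero. The function below is a sketch with a made-up name (looks_like_utf16le_ascii), and it only recognises the xx 00 xx 00 pattern from your sample, not UTF-16 in general.

```c
#include <stddef.h>

/* Returns 1 if buf looks like little-endian UTF-16 carrying ASCII text:
   even length, every even-indexed byte a nonzero 7-bit value,
   every odd-indexed byte zero. Returns 0 otherwise.
   Hypothetical helper; a heuristic, not a full UTF-16 validator. */
int looks_like_utf16le_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len == 0 || len % 2 != 0)
        return 0;
    for (i = 0; i < len; i += 2) {
        if (buf[i] == 0 || buf[i] > 0x7F || buf[i + 1] != 0)
            return 0;
    }
    return 1;
}
```

Run on the two samples from the question, it accepts the later part of the packet and rejects the CODESET=UTF8 header, which is plain single-byte text.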
I would suggest you clarify with whoever is generating that data since it's clearly erroneous. As to how you process the UTF-16, I've covered that above. Provided it's ASCII data in there (the alternate bytes are always zero), you can just throw away those alternates with something like:
// Process a UTF16 buffer containing ASCII-only characters.
// buff is the buffer, count is the quantity of UTF-16 chars.
// Will change buffer.
void compressUtf16 (char *buff, size_t count) {
size_t i; // same type as count, avoids a signed/unsigned comparison
for (i = 0; i < count; i++)
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
}
And, if you're using the other endian UTF-16, simply change:
buff[i] = buff[i*2]; // for xx 00 xx 00 xx 00 ...
into:
buff[i] = buff[i*2+1]; // for 00 xx 00 xx 00 xx ...
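If you would rather not modify the buffer in place, or want to be told when a character is not plain ASCII, a variant along these lines may help. It is a sketch under assumptions: the name utf16_to_ascii and the choice of '?' as a replacement character are mine, and it treats each 2-byte unit on its own (no surrogate handling).

```c
#include <stddef.h>

/* Copy the ASCII characters out of a UTF-16 byte stream.
   in         raw bytes
   in_len     number of bytes (a trailing odd byte is ignored)
   big_endian nonzero for 00 xx order, zero for xx 00 order
   out        must have room for in_len / 2 + 1 bytes
   Units that are not nonzero 7-bit ASCII are replaced with '?'.
   The result is NUL-terminated; returns the number of characters written.
   Hypothetical helper for illustration. */
size_t utf16_to_ascii(const unsigned char *in, size_t in_len,
                      int big_endian, char *out)
{
    size_t i, n = 0;

    for (i = 0; i + 1 < in_len; i += 2) {
        unsigned char hi = big_endian ? in[i] : in[i + 1];
        unsigned char lo = big_endian ? in[i + 1] : in[i];

        out[n++] = (hi == 0 && lo != 0 && lo < 0x80) ? (char)lo : '?';
    }
    out[n] = '\0';
    return n;
}
```

With the sample bytes from the question and big_endian set to 0, out ends up holding the 8 characters of reathing.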

Related

How do I convert a byte stream to unsigned int 8 in C [closed]

I am streaming a video in python via WebSocket, which is a raw bytes stream and appears like this:
b'\x00\x00\x00\x01A\x9a \x02\x04\xe1{=z\xf8FMS\xe6\\\x9eMubH\xa7R.1\xd7]F\xea3}\xa9b\x9f\x14n\x12| ....'
Now, I am passing these bytes to a C function (via ctypes) where I am trying to convert this to a uint8_t [] array (This is needed in order to decode it using a FFmpeg library). Here's my code so far:
This is how I am passing bytes to C:
import ctypes
dll = ctypes.CDLL("decode_video.so")
data = bytearray(b'\x00\x00\x00\x01A\x9a \x02\x04\xe1-{=z\xf8FM....')
b_array = ctypes.c_char * len(data)
dll.conversion_test(b_array.from_buffer(data), len(data))
decode_video.c
void conversion_test(unsigned char* buf, int bufSize) {
char temp[3];
uint8_t vals[bufSize];
// Iterate over the values
for (int i = 0; i < bufSize; i++) {
// Copy two characters into the temporary string
temp[0] = buf[i * 2];
temp[1] = buf[i * 2 + 1];
temp[2] = 0;
vals[i] = strtol(temp, NULL, 16);
}
for(int i=0; i<bufSize; i++){
printf("%02x ", vals[i] & 0xff);
}
}
Aside from this, I am simultaneously dumping the stream to a file. In C, I have another function that reads from this file and stores in a uint8_t buffer.
Streaming code in python:
f = open("video.h264", "wb")
def on_message(ws, message):
# The first 14 characters have irrelevant info to decode the video
f.write(message[14:])
ws = websocket.WebSocketApp("wss://somewebsite.com/archives",
on_open=on_open,
on_message=on_message,
on_error=on_error,
on_close=on_close,
header=[protocol_str]
)
ws.binaryType = 'arraybuffer'
ws.run_forever(dispatcher=rel)
Reading from the raw file in C:
#include "libavcodec/avcodec.h"
#define INBUF_SIZE 4096
uint8_t *data;
uint8_t inbuf[INBUF_SIZE + AV_INPUT_BUFFER_PADDING_SIZE];
f = fopen(input_name, "rb");
data_size = fread(inbuf, 1, INBUF_SIZE, f);
// Printing bytes in hex to debug
for(size_t i=0; i<data_size; i++){
printf("%02x ", inbuf[i] & 0xff);
}
However, the contents of inbuf and the output of the vals buffer are not the same. Basically, I am unsure of my method of passing bytes to C and the corresponding conversion to uint8_t.
Update:
I tried printing the hex values of vals and here's what it looks like:
00 00 0a 00 00 00 00 00 00 00 00 00 00 00 01 00 00 ...
While the output of inbuf looks like this:
00 00 00 01 67 42 00 1e e2 90 14 07 b6 02 dc ...
Fix:
As suggested by @n. m. and @dreamlax, a simple identity mapping works between unsigned char and uint8_t. Now vals and inbuf output the same values!
void conversion_test(unsigned char* buf, int bufSize) {
uint8_t vals[bufSize];
for(int i = 0; i < bufSize; i++)
vals[i] = (uint8_t)buf[i];
}
(OR) A simple type cast:
uint8_t *vals = (uint8_t *)buf;
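One thing to watch with the loop version: vals is a local array, so it disappears when conversion_test returns. If the bytes need to outlive the call, a heap copy is one option. The sketch below uses a made-up helper name (copy_with_padding) and takes the padding size as a parameter, on the assumption that the decoder wants the input followed by some zeroed bytes, as the AV_INPUT_BUFFER_PADDING_SIZE in the file-reading code suggests.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes from buf into a newly allocated uint8_t array that is
   followed by `padding` zero bytes. The caller must free() the result.
   Returns NULL if allocation fails. Assumes len + padding > 0.
   Hypothetical helper for illustration. */
uint8_t *copy_with_padding(const unsigned char *buf, size_t len, size_t padding)
{
    uint8_t *vals = calloc(len + padding, 1);   /* calloc zero-fills */

    if (vals == NULL)
        return NULL;
    memcpy(vals, buf, len);   /* unsigned char and uint8_t hold the same values */
    return vals;
}
```

No per-byte conversion is needed at any point: the bytes Python passes in are already the values you want.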

How can I read and obtain separated data from a file using 'fread' in C?

I've written in a file (using 'fwrite()') the following:
TUS�ABQ���������������(A����������(A��B������(A��B���A��(A��B���A������B���A������0����A������0�ABQ�������0�ABQ�����LAS����������������A�����������A��&B�������A��&B��B���A��&B��B������&B��
B����153���B����153�LAS�����153�LAS�����LAX���������������:A����������:AUUB������:AUUB��B��:
AUUB��B����UUB��B����������B��������LAX���������LAX�����MDW���������������A����������A��(�������A��(����A��A��(����A������(����A����A�89���A����A�89MDW�����A�89MDW�����OAK���������
����������������������#�����������#�����������#�����������#�������������������������OAK���������OAK�����SAN���������������LA����������LA��P#������LA��P#��#A��LA��P#��#A������P#��#A����������#A��������SAN���������SAN�����TPA�ABQ����������������B�����������B��#�����...(continues)
which is translated to this:
TUSLWD2.103.47.775.1904.06.40.03AMBRFD4.63.228.935.0043.09.113.0ASDGHU5.226.47.78.3.26...(The same structure)
and the hexdump of that would be:
00000000 54 55 53 00 41 42 51 00 00 00 00 00 00 00 00 00 |TUS.ABQ.........|
00000010 00 00 00 00 00 00 28 41 00 00 0e 42 00 00 f8 41 |......(A...B...A|
00000020 00 00 00 00 4c 41 53 00 00 00 00 00 00 00 00 00 |....LAS.........|
00000030 00 00 00 00 00 00 88 41 00 00 26 42 9a 99 11 42 |.......A..&B...B|
(Continues...)
The structure is always 2 words of 3 characters each (e.g. TUS and LWD) followed by 7 floats, and then it repeats on and on until the end of the file.
The key thing is: I just want to read every field separated like 'TUS', 'LWD', '2.10', '3.4', '7.77'...
And I can only use 'fread()' to achieve that! For now, I'm trying this:
aux2 = 0;
fseek(fp, SEEK_SET, 0);
fileSize = 0;
while (!feof(fp) && aux<=2) {
fread(buffer, sizeof(char)*4, 1, fp);
printf("%s", buffer);
fread(buffer, sizeof(char)*4, 1, fp);
printf("%s", buffer);
for(i=0; i<7; i++){
fread(&delay, sizeof(float), 1, fp);
printf("%f", delay);
}
printf("\n");
aux++;
fseek(fp,sizeof(char)*7+sizeof(float)*7,SEEK_SET);
aux2+=36;
}
And I get this result:
TUSABQ0.0000000.0000000.00000010.5000000.0000000.00000010.500000
AB0.0000000.000000-10384675421112248092159136000638976.0000000.0000000.000000-10384675421112248092159136000638976.0000000.000000
AB0.0000000.000000-10384675421112248092159136000638976.0000000.0000000.000000-10384675421112248092159136000638976.0000000.000000
But it does not work correctly...
*Note: ignore the arguments of the last 'fseek()', because I've been trying too many meaningless things!
To write the words (i.e. TUS) into the file, I use this:
fwrite(x->data->key, 4, sizeof(char), fp);
and to write the floats, this:
for (i = 0; i < 7; i++) {
fwrite(&current->data->retrasos[i], sizeof(float), sizeof(float), fp);
}
I'd recommend using a structure to hold each data unit:
typedef struct {
float value[7];
char word1[5]; /* 4 + '\0' */
char word2[5]; /* 4 + '\0' */
} unit;
To make the file format portable, you need a function that packs and unpacks the above structure to/from a 36-byte array. On Intel and AMD architectures, float corresponds to IEEE-754-2008 binary32 format in little-endian byte order. For example,
#define STORAGE_UNIT (4+4+7*4)
#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
size_t unit_pack(char *target, const size_t target_len, const unit *source)
{
size_t i;
if (!target || target_len < STORAGE_UNIT || !source) {
errno = EINVAL;
return 0;
}
memcpy(target + 0, source->word1, 4);
memcpy(target + 4, source->word2, 4);
for (i = 0; i < 7; i++)
memcpy(target + 8 + 4*i, &(source->value[i]), 4);
return STORAGE_UNIT;
}
size_t unit_unpack(unit *target, const char *source, const size_t source_len)
{
size_t i;
if (!target || !source || source_len < STORAGE_UNIT) {
errno = EINVAL;
return 0;
}
memcpy(target->word1, source, 4);
target->word1[4] = '\0';
memcpy(target->word2, source + 4, 4);
target->word2[4] = '\0';
for (i = 0; i < 7; i++)
memcpy(&(target->value[i]), source + 8 + i*4, 4);
return STORAGE_UNIT;
}
#else
#error Unsupported architecture!
#endif
The above only works on Intel and AMD machines, but it is certainly easy to extend to other architectures if necessary. (Almost all machines currently use IEEE 754-2008 binary32 for float, only the byte order varies. Those that do not, typically have C extensions that do the conversion to/from their internal formats.)
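As a rough idea of what such an extension could look like, the float can be moved through a 32-bit integer and split into bytes with shifts, so the byte order in the file no longer depends on the host. This is a sketch, not part of the code above: the names pack_float_le and unpack_float_le are made up, and it assumes float is a 32-bit IEEE 754 binary32 value whose byte order in memory matches that of uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Store f as 4 bytes, least significant byte first, whatever the host
   byte order is. Assumes float is 32-bit IEEE 754 binary32. */
void pack_float_le(unsigned char *target, float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);      /* reinterpret without aliasing issues */
    target[0] = (unsigned char)(bits & 0xFF);
    target[1] = (unsigned char)((bits >> 8) & 0xFF);
    target[2] = (unsigned char)((bits >> 16) & 0xFF);
    target[3] = (unsigned char)((bits >> 24) & 0xFF);
}

/* Inverse of pack_float_le(): rebuild a float from 4 little-endian bytes. */
float unpack_float_le(const unsigned char *source)
{
    uint32_t bits = (uint32_t)source[0]
                  | (uint32_t)source[1] << 8
                  | (uint32_t)source[2] << 16
                  | (uint32_t)source[3] << 24;
    float f;

    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Packing 10.5 this way gives the bytes 00 00 28 41, the same sequence that appears in the hexdump in the question.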
Using the above, you can -- should! must! -- document your file format, for example as follows:
Words are 4 bytes encoded in UTF-8
Floats are IEEE 754-2008 binary32 values in little-endian byte order
A file contains one or more units. Each unit consists of
Name    Description
word1   First word
word2   Second word
value0  First float
value1  Second float
value2  Third float
value3  Fourth float
value4  Fifth float
value5  Sixth float
value6  Seventh float
There is no padding.
To write a unit, use a char array of size STORAGE_UNIT as a buffer, and write that. So, if you have unit *one, you can write it to FILE *out using
char buffer[STORAGE_UNIT];
if (!unit_pack(buffer, sizeof buffer, one)) {
/* Error (unit_pack returns 0 on failure)! Abort program! */
}
if (fwrite(buffer, STORAGE_UNIT, 1, out) != 1) {
/* Write error! Abort program! */
}
Correspondingly, reading from FILE *in would be
char buffer[STORAGE_UNIT];
if (fread(buffer, STORAGE_UNIT, 1, in) != 1) {
/* End of file, or read error.
Check feof(in) or/and ferror(in). */
}
if (!unit_unpack(one, buffer, STORAGE_UNIT)) {
/* Error (unit_unpack returns 0 on failure)! Abort program! */
}
If one is an array of units, and you are writing or reading one[k], use &(one[k]) (or equivalently one + k) instead of one.
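To read a whole file, loop on the return value of fread rather than on feof, which only becomes true after a read has already failed. The following is a self-contained sketch of that loop, so it does not reuse the functions above: read_record, print_records and RECORD_SIZE are made-up names, and the floats are copied in host byte order, so it carries the same Intel/AMD assumption as the rest of this answer.

```c
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE (4 + 4 + 7 * 4)   /* two 4-byte words + seven 4-byte floats */

/* Read one 36-byte record from in.
   Returns 1 on success, 0 on end of file, short record or read error.
   Assumes sizeof(float) == 4 and that the file uses the host byte order. */
int read_record(FILE *in, char word1[5], char word2[5], float value[7])
{
    unsigned char buffer[RECORD_SIZE];

    if (fread(buffer, RECORD_SIZE, 1, in) != 1)
        return 0;
    memcpy(word1, buffer, 4);
    word1[4] = '\0';
    memcpy(word2, buffer + 4, 4);
    word2[4] = '\0';
    memcpy(value, buffer + 8, 7 * sizeof(float));
    return 1;
}

/* Print every remaining record, one per line; returns how many were read. */
size_t print_records(FILE *in)
{
    char word1[5], word2[5];
    float value[7];
    size_t count = 0;
    int i;

    while (read_record(in, word1, word2, value)) {
        printf("%s %s", word1, word2);
        for (i = 0; i < 7; i++)
            printf(" %g", value[i]);
        printf("\n");
        count++;
    }
    return count;
}
```

After the loop ends, feof(in) and ferror(in) tell you whether it stopped at the end of the file or because of an error.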

Trouble with compression in LZW

I'm having trouble implementing the LZW compressor. The compressor seems to work fine, but while processing some streams it doesn't write the end-of-stream code (defined with the value 256), so the decompressor loops forever.
The code of the compressor is the following:
int compress1(FILE* input, BIT_FILE* output) {
CODE next_code; // next node
CODE current_code; // current node
CODE index; // node of the found character
int character;
int ret;
next_code = FIRST_CODE;
dictionary_init();
if ((current_code = getc(input)) == EOF)
current_code = EOS;
while ((character = getc(input)) != EOF) {
index = dictionary_lookup(current_code, (SYMBOL)character);
if (dictionary[index].code != UNUSED) {
current_code = dictionary[index].code;
}
else {
if (next_code <= MAX_CODE-1) {
dictionary[index].code = next_code++;
dictionary[index].parent = current_code;
dictionary[index].symbol = (SYMBOL)character;
}
else {
// handling full dictionary
dictionary_init();
next_code = FIRST_CODE;
}
ret = bit_write(output, (uint64_t) current_code, BITS);
if( ret != 0)
return -1;
current_code = (CODE)character;
}
}
ret = bit_write(output, (uint64_t) current_code, BITS);
if (ret != 0)
return -1;
ret = bit_write(output, (uint64_t) EOS, BITS);
if (ret != 0)
return -1;
if (bit_close(output) == -1) {
printf("Ops: error during closing\n");
return -1;
}
return 0;
}
CODE and SYMBOL are typedefs of uint32_t and uint16_t respectively, and FIRST_CODE is defined as 257. The function dictionary_init() simply initializes the dictionary; dictionary_lookup() returns the index of the child of parent node "current_code" that has the symbol "character" (if it exists).
The writing of the binary file is defined as:
int bit_write(BIT_FILE* bf, uint64_t data, int len)
{
int space, result, offset, wbits, udata;
uint64_t* p;
uint64_t tmp;
udata = (int)data;
if (bf == NULL || len < 1 || len > (8* sizeof(data)))
return -1;
if (bf->reading == true)
return -1;
while (len > 0) {
space = bf->end - bf->next;
if (space < 0) {
return -1;
}
// if buffer is full, flush data to file and reinit BIT_IO struct
if (space == 0) {
result = bit_flush(bf);
if (result < 0)
return -1;
}
p = bf->buf + (bf->next/64);
offset = bf->next % 64;
wbits = 64 - offset;
if (len < wbits)
wbits = len;
tmp = le64toh(*p);
tmp |= (data << offset);
*p = htole64(tmp);
bf->next += wbits;
len -= wbits;
data >>= wbits;
}
return 0;
}
I already opened the file using another function, so bit_write takes the pointer to the bf structure as input.
Can someone help me find the error?
An example of when this problem arises is the following:
If the input string is "Nel mezzo del cammi" everything works fine and I have the following compressed file (in Hexadecimal, using 12 Bits for encoding symbols):
4E 50 06 6C 00 02 6D 50 06 7A A0 07 6F 00 02 64
20 10 20 30 06 61 D0 06 6D 90 06 0D A0 00 00 01
If I add another character to the string, in particular "Nel mezzo del cammin", I have the following result:
4E 50 06 6C 00 02 6D 50 06 7A A0 07 6F 00 02 64
20 10 20 30 06 61 D0 06 6D 90 06 6E D0 00 0A 00
10
In the second case it doesn't write the End of Stream correctly.
SOLUTION: check that there is enough space in the buffer for the whole coded symbol I am going to write. Just change:
if (space == 0)
to:
if (space == 0 || space < len)
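For comparison, here is a stripped-down bit writer that makes the same kind of check before touching the buffer. It is a sketch and not the bit_write from the question: the names bit_writer and bw_put are made up, it works on bytes instead of 64-bit words, and it reports an error where the real code would flush.

```c
#include <stddef.h>
#include <stdint.h>

struct bit_writer {
    unsigned char *buf;   /* output bytes, must start out zeroed */
    size_t cap_bits;      /* capacity of buf in bits */
    size_t next;          /* index of the next free bit */
};

/* Append the low `len` bits of data, least significant bit first.
   Returns 0 on success, -1 if len is out of range or the whole value
   does not fit (in which case nothing is written). */
int bw_put(struct bit_writer *bw, uint64_t data, int len)
{
    if (bw == NULL || len < 1 || len > 64)
        return -1;
    if (bw->cap_bits - bw->next < (size_t)len)   /* room for the whole symbol? */
        return -1;

    while (len > 0) {
        size_t byte = bw->next / 8;
        int offset = (int)(bw->next % 8);
        int wbits = 8 - offset;                  /* free bits in this byte */

        if (len < wbits)
            wbits = len;
        bw->buf[byte] |= (unsigned char)((data & ((1u << wbits) - 1u)) << offset);
        bw->next += (size_t)wbits;
        len -= wbits;
        data >>= wbits;
    }
    return 0;
}
```

Writing the 12-bit codes for N (0x04E) and e (0x065) with it produces the bytes 4E 50 06, which is how both dumps above begin.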

Not able to pack hex bytes into unsigned char array or pointer properly

I tried to imitate a struct with an unsigned char array or pointer, but I am not able to obtain the same hex values.
The .input member prints correctly with print().
I am trying to get the same output from stringBytes_Data or data_hexStrFormatted with print().
Can anyone advise?
Given
struct _vector {
char *input;
unsigned char len;
};
static struct _vector tv2 = {
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
And function to view the data:
static void print(char *intro_message, unsigned char *text_addr,
unsigned int size) {
unsigned int i;
for (i = 0; i < size; i++) {
printf("%2x ", text_addr[i]);
if ((i & 0xf) == 0xf)
printf("\n");
}
printf("\n");
}
How may I get the same effect with:
char* stringBytes_Data = "6bc1bee22e409f96e93d7e117393172aae2d8a571e03ac9c9eb76fac45af8e5130c81c46a35ce411e5fbc1191a0a52eff69f2445df4f9b17ad2b417be66c3710";
I tried, but the result is wrong :
unsigned char* data_hexStrFormatted;
int lengthOfStr = strlen(stringBytes_Data);
int charCounterForNewStr = 0;
int formattedLength = (2*lengthOfStr)+1;
data_hexStrFormatted = (unsigned char*) malloc((formattedLength)*sizeof(unsigned char)); // x2 as we add \x to XX, and 1 for NULL end char
for(i=0; i<lengthOfStr; i=i+2) {
// prepend \x
data_hexStrFormatted[charCounterForNewStr++] = '\\';
data_hexStrFormatted[charCounterForNewStr++] = 'x';
data_hexStrFormatted[charCounterForNewStr++] = stringBytes_Data[i];
data_hexStrFormatted[charCounterForNewStr++] = stringBytes_Data[i+1];
}
data_hexStrFormatted[formattedLength-1] = '\0';
printf("%s\n", data_hexStrFormatted);
printf("%d byte length \n", strlen(data_hexStrFormatted)/4);
print("data_hexStrFormatted",
(unsigned char *)
data_hexStrFormatted,
(formattedLength)/4);
You seem to be asking:
Given a string containing pairs of hex digits, convert the hex digits to byte values?
If so, then code similar to the following can be used:
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
This function works correctly for valid hex digits; it will produce nonsense for invalid inputs. If you decide you need to detect erroneous input, you'll need to improve it. There are other ways to write this function (lots of them, in fact). One that can be effective is an array of 256 bytes statically initialized with the correct values, so you simply write return hex_array[c];.
char* stringBytes_Data = "6bc1bee22e409f96e93d7e117393172aae2d8a571e03ac9c9eb76fac45af8e5130c81c46a35ce411e5fbc1191a0a52eff69f2445df4f9b17ad2b417be66c3710";
size_t len = strlen(stringBytes_Data);
char buffer[len / 2];
assert(len % 2 == 0);
for (size_t i = 0; i < len; i += 2)
buffer[i / 2] = hexit(stringBytes_Data[i]) << 4 | hexit(stringBytes_Data[i+1]);
printf("%.*s\n", (int)len/2, buffer);
This code sets the array buffer to contain the converted code. It won't work correctly if there's an odd number of characters in the array (that's what the assertion states).
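If you do need to detect erroneous input, one possible shape is a digit function that returns -1 for anything that is not a hex digit, plus a decoder that checks every pair. This is a sketch with made-up names (hex_value, hex_decode), and it assumes a character set such as ASCII where the letters a to f are contiguous.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit.
   Assumes 'a'..'f' and 'A'..'F' are contiguous, as in ASCII. */
int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Decode a NUL-terminated string of hex digit pairs into out.
   Returns the number of bytes written, or -1 if the string has an odd
   length, contains a non-hex character, or needs more than cap bytes. */
long hex_decode(const char *hex, unsigned char *out, size_t cap)
{
    size_t n = 0;

    while (hex[0] != '\0') {
        int hi, lo;

        if (hex[1] == '\0')                 /* odd number of digits */
            return -1;
        hi = hex_value((unsigned char)hex[0]);
        lo = hex_value((unsigned char)hex[1]);
        if (hi < 0 || lo < 0 || n >= cap)
            return -1;
        out[n++] = (unsigned char)(hi << 4 | lo);
        hex += 2;
    }
    return (long)n;
}
```

Unlike the assertion, the error return also works in builds compiled with NDEBUG.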
Working code - #2
Using the print() function from the question with the intro_message argument removed since it is unused:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
struct Vector
{
char *input;
unsigned char len;
};
static struct Vector tv2 =
{
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
static void print(unsigned char *text_addr, unsigned int size)
{
unsigned int i;
for (i = 0; i < size; i++)
{
printf("%2x ", text_addr[i]);
if ((i & 0xf) == 0xf)
printf("\n");
}
printf("\n");
}
static void print2(const char *tag, const unsigned char *data, size_t size)
{
printf("%s:\n", tag);
for (size_t i = 0; i < size; i++)
{
printf("%2x ", data[i]);
if ((i & 0x0F) == 0x0F)
printf("\n");
}
printf("\n");
}
static void print_text(const char *tag, const char *data, size_t datalen)
{
char buffer[datalen / 2];
assert(datalen % 2 == 0);
for (size_t i = 0; i < datalen; i += 2)
buffer[i / 2] = hexit(data[i]) << 4 | hexit(data[i + 1]);
//printf("%s: [[%.*s]]\n", tag, (int)datalen / 2, buffer);
assert(memcmp(buffer, tv2.input, tv2.len) == 0);
print((unsigned char *)buffer, datalen / 2);
print2(tag, (unsigned char *)buffer, datalen / 2);
}
int main(void)
{
char *stringBytes_Data =
"6bc1bee22e409f96e93d7e117393172a"
"ae2d8a571e03ac9c9eb76fac45af8e51"
"30c81c46a35ce411e5fbc1191a0a52ef"
"f69f2445df4f9b17ad2b417be66c3710"
;
print_text("buffer", stringBytes_Data, strlen(stringBytes_Data));
return 0;
}
Sample output:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
Working code - #1
Redone — previous versions had various 'off by a factor of two' errors which were partially concealed by the system zeroing a buffer.
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
struct Vector
{
char *input;
unsigned char len;
};
static struct Vector tv2 =
{
.input = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
"\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
"\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
"\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
"\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
"\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
"\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
"\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
.len = 64,
};
static inline int hexit(const unsigned char c)
{
static const char hex_digits[] = "0123456789ABCDEF";
return strchr(hex_digits, toupper(c)) - hex_digits;
}
static void print(const char *tag, const unsigned char *data, size_t size)
{
printf("%s:\n", tag);
for (size_t i = 0; i < size; i++)
{
printf("%2x ", data[i]);
if ((i & 0x0F) == 0x0F)
printf("\n");
}
printf("\n");
}
static void print_text(const char *tag, const char *data, size_t datalen)
{
char buffer[datalen / 2];
assert(datalen % 2 == 0);
for (size_t i = 0; i < datalen; i += 2)
buffer[i / 2] = hexit(data[i]) << 4 | hexit(data[i + 1]);
printf("%s: [[%.*s]]\n", tag, (int)datalen / 2, buffer);
assert(memcmp(buffer, tv2.input, tv2.len) == 0);
print(tag, (unsigned char *)buffer, datalen / 2);
}
int main(void)
{
char *stringBytes_Data =
"6bc1bee22e409f96e93d7e117393172a"
"ae2d8a571e03ac9c9eb76fac45af8e51"
"30c81c46a35ce411e5fbc1191a0a52ef"
"f69f2445df4f9b17ad2b417be66c3710"
;
print_text("buffer", stringBytes_Data, strlen(stringBytes_Data));
return 0;
}
Raw output on a UTF-8 terminal (it isn't valid UTF-8 data, hence the question marks):
buffer: [[k???.#???=~s?*?-?W????o?E??Q0?F?\????
R???$E?O??+A{?l7]]
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
Raw output converted into UTF-8 as if it was ISO 8859-15 (or 8859-1):
buffer: [[kÁŸâ.#é=~s*®-W¬·o¬E¯Q0ÈF£\äåûÁ
Rïö$EßO­+A{æl7]]
buffer:
6b c1 be e2 2e 40 9f 96 e9 3d 7e 11 73 93 17 2a
ae 2d 8a 57 1e 3 ac 9c 9e b7 6f ac 45 af 8e 51
30 c8 1c 46 a3 5c e4 11 e5 fb c1 19 1a a 52 ef
f6 9f 24 45 df 4f 9b 17 ad 2b 41 7b e6 6c 37 10
The data doesn't seem to have any particular meaning, but beauty is in the eye of the beholder.

(C) how to fix this algorithm for z827 ASCII compression?

noob warning.
I'm trying to create a compression program. It takes a .txt with ASCII characters as an argument, and cuts off the leading 0 of the binary representation of each character.
It does this by using the last 2 bytes of two different integers. A character with a leading zero is put into the 4th byte of the integer 'write', and the next character is put into the 3rd byte of the integer 'temp'. The 'temp' int is then shifted to the right once, and then OR'd with 'write', so that the leading zero slot has been filled with data we need. This repeats, with the shift counter increasing after every character. The first case is a bit odd. The algorithm isn't very complex if written out on paper.
I feel like I've tried everything. I've been over the algorithm so many times. I'm pretty sure the problem is when shift_counter gets to 8.. but it should work fine. It just doesn't. I can show you why here (the code is further down):
This is the hex dump of my output:
0000000 3f 00 00 00 41 10 68 9e 6e c3 d9 65 10 88 5e c6
0000020 d3 41 e6 74 9a 5d 06 d1 df a0 7a 7d 5e 06 a5 dd
0000040 20 3a bd 3c a7 a7 dd 67 10 e8 5d a7 83 e8 e8 72
0000060 19 a4 c7 c9 6e a0 f1 f8 dd 86 cb cb f3 f9 3c
0000077
And the correct output:
0000000 3f 00 00 00 41 d0 3c dd 86 b3 cb 20 7a 19 4f 07
0000020 99 d3 ec 32 88 fe 06 d5 e7 65 50 da 0d a2 97 e7
0000040 f4 b4 fb 0c 7a d7 e9 20 3a ba 0c d2 e3 64 37 d0
0000060 f8 dd 86 cb cb f3 79 fa ed 76 29 00 0a 0a
0000076
code:
int compress(char *filename_ptr){
int in_fd;
in_fd = open(filename_ptr, O_RDONLY);
//set pointer to the end of the file, find file size, then reset position
//by closing/opening
unsigned int file_bytes = lseek(in_fd, 0, SEEK_END);
close(in_fd);
in_fd = open(filename_ptr, O_RDONLY);
//store file contents in buffer
unsigned char read_buffer[file_bytes];
read(in_fd, read_buffer, file_bytes);
//file where the output will be stored
int out_fd;
creat("output.txt", 0644);
out_fd = open("output.txt", O_WRONLY);
//sets file size in header (needed for decompression, this is the size of the
//file before compression. everything after this we write this 4-byte int
//is a 1 byte char
write(out_fd, &file_bytes, 4);
unsigned int writer;
unsigned int temp;
unsigned char out_char;
int i;
int shift_count = 8;
for(i = 0; i < file_bytes; i++){
if(shift_count == 8){
writer = read_buffer[i];
temp = temp & 0x00000000;
temp = read_buffer[i+1] << 8;
shift_count = 1;
}else{
//moves the next char's bits to the left, for the purpose of filling the
//8 bit buffer (writer) via OR operation
temp = read_buffer[i] << 8;
}
temp = temp >> shift_count;
writer = writer | temp;
//output right byte of writer
unsigned int right_byte = writer & 0x000000ff;
//output right_byte as a char
out_char = (char) right_byte;
//write_buffer[i] = out_char;
write(out_fd, &out_char, 1);
//clear right side of writer
writer = writer & 0x0000ff00;
//shift left side of writer to the right by 8
writer = writer >> 8;
shift_count++;
}
return 0;
}
It seems to me that input and output are too strongly coupled.
At some point, the program should be reading (roughly) the 80th octet from the input and writing (roughly) the 70th octet to the output, because you want to (on average) write 7 bits out for every 8 bits you read in, right?
What the loop
for(i = 0; i < file_bytes; i++){
...
... = read_buffer[i];
...
write(out_fd, &out_char, 1);
...
}
actually seems to be doing is:
On the 70th pass through the loop -- when 70==i --
it's reading the 70th octet from the input and writing the 70th octet to the output.
On the 80th pass through the loop -- when 80==i --
it's reading the 80th octet from the input and writing the 80th octet to the output.
You must decide:
Do you want "i" to represent the number of input characters processed, or the number of output chars processed?
Because it's not possible to do both -- it's not possible to have 70 equal 80.
Perhaps something like this is closer to what you wanted:
/* test.c
http://stackoverflow.com/questions/15080239/c-how-to-fix-this-algorithm-for-z827-ascii-compression
WARNING: untested code.
*/
int compress(char *filename_ptr){
int in_fd;
in_fd = open(filename_ptr, O_RDONLY);
//set pointer to the end of the file, find file size, then reset position
//by closing/opening
unsigned int file_bytes = lseek(in_fd, 0, SEEK_END);
close(in_fd);
in_fd = open(filename_ptr, O_RDONLY);
//store file contents in buffer
unsigned char read_buffer[file_bytes];
read(in_fd, read_buffer, file_bytes);
//file where the output will be stored
int out_fd;
creat("output.txt", 0644);
out_fd = open("output.txt", O_WRONLY);
//sets file size in header (needed for decompression, this is the size of the
//file before compression. everything after this we write this 4-byte int
//is a 1 byte char
write(out_fd, &file_bytes, 4);
unsigned int writer = 0; // bit accumulator, must start empty
unsigned int temp;
unsigned char out_char;
int i;
int writer_bits = 0; // 0 bits of data in writer so far
for(i = 0; i < file_bytes; i++){
// i is the number of (7 bit ASCII) characters
// read from the input so far.
// add 7 more bits to the writer
temp = read_buffer[i];
//moves the next char's bits to the left, for the purpose of filling the
//8 bit buffer (writer) via OR operation
//(avoid overwriting the "writer_bits" of good bits
//already in the buffer).
temp = read_buffer[i] << writer_bits;
writer = writer | temp;
writer_bits = writer_bits + 7;
//output right byte of writer
unsigned int right_byte = writer & 0x000000ff;
//output right_byte as a char
out_char = (unsigned char) right_byte;
// output 8 bits of data whenever
// we have *at least* 8 bits of data in the writer buffer.
if(writer_bits >= 8){
//write_buffer[i] = out_char;
write(out_fd, &out_char, 1);
//shift left side of writer to the right by 8
writer = writer >> 8;
writer_bits = writer_bits - 8;
}else{
// 7 or fewer bits in writer --
// skip writing until next time.
}
}
// are there any leftover bits still in writer?
if(writer_bits > 0){
//refresh out_char first: it may still hold a byte that was already written
out_char = (unsigned char) (writer & 0x000000ff);
write(out_fd, &out_char, 1);
}
return 0;
}
(Currently the program reads the entire input file into RAM, then writes the entire output file. Some programmers prefer to read a little at a time, then write a little at a time. Both approaches have advantages and disadvantages).
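The packing step itself can also be separated from the file handling, which makes it easier to test. Below is a sketch of the same accumulate-7-bits, emit-8-bits idea working on memory buffers, together with the reverse operation. The names pack7 and unpack7 are made up, the bit order (first character in the low bits) is the one used in the loop above, and whether it matches the z827 format you were given is something you would need to check.

```c
#include <stddef.h>

/* Pack n 7-bit characters from in into out, low bits first.
   out needs room for (n * 7 + 7) / 8 bytes. Returns the bytes written. */
size_t pack7(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int acc_bits = 0;       /* number of valid bits in acc */
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        acc |= (unsigned int)(in[i] & 0x7F) << acc_bits;
        acc_bits += 7;
        if (acc_bits >= 8) {              /* a full output byte is ready */
            out[o++] = (unsigned char)(acc & 0xFF);
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if (acc_bits > 0)                     /* flush the leftover bits */
        out[o++] = (unsigned char)(acc & 0xFF);
    return o;
}

/* Reverse of pack7(): recover n_chars characters from in into out.
   Returns the number of input bytes consumed. */
size_t unpack7(const unsigned char *in, size_t n_chars, unsigned char *out)
{
    unsigned int acc = 0;
    int acc_bits = 0;
    size_t i = 0, o;

    for (o = 0; o < n_chars; o++) {
        if (acc_bits < 7) {               /* need more input bits */
            acc |= (unsigned int)in[i++] << acc_bits;
            acc_bits += 8;
        }
        out[o] = (unsigned char)(acc & 0x7F);
        acc >>= 7;
        acc_bits -= 7;
    }
    return i;
}
```

Here the input index and the output index advance independently, so 8 input characters come out as 7 bytes.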
