Compress a struct into a binary file? [C]

This is part of my homework that I'm having difficulty solving.
I have a simple structure:
typedef struct Client {
    char* lname;
    unsigned int id;
    unsigned int car_id;
} Client;
And the exercise is:
1. Create a text file named with the company name followed by the branch number, with a .txt extension. The file contains all the clients' details.
2. The file created in exercise 1 will be compressed; as a result, a binary file with a .cmpr extension will be created.
I don't really have an idea how to implement 2.
I remember from the lectures that the professor said we have to use "all" of the variable, with the bitwise operators (<<, >>, |, &, ~), but I don't know how to use them.
I'm using Ubuntu, under GCC and Eclipse. I'm using C.
I'd be glad to get some help. Thanks!

Let's say the file from step 1 looks like:
user1798362
2324
462345
where the three fields were simply printed on three lines. Note that the above is the text/readable (i.e. ASCII) representation of that file.
Looking at the contents of this file in hex(adecimal) representation we get (with the ASCII character printed below each byte value):
75 73 65 72 31 37 39 38 33 36 32 0a 32 33 32 34 0a 34 36 32 33 34 35 0a
u s e r 1 7 9 8 3 6 2 nl 2 3 2 4 nl 4 6 2 3 4 5 nl
here nl is of course the newline character. You can count that there are 24 bytes.
In step 2 you have to invent another format that saves as many bits as possible. The simplest way to do this is to compress each of the three fields individually.
Similar to where the text format uses a nl to mark the end of a field, you also need a way to define where a binary field begins and ends. A common way is to put a length in front of the binary field data. As a first step we could replace the nl's with a length and get:
58 75 73 65 72 31 37 39 38 33 36 32 20 32 33 32 34 30 34 36 32 33 34 35
-- u s e r 1 7 9 8 3 6 2 -- 2 3 2 4 -- 4 6 2 3 4 5
For now we simply take a whole byte for the length in bits. Note that 58 is the hex representation of 88 (i.e. 11 characters * 8 bits), the bit length of lname; 20 hex equals 4 * 8 = 32, and 30 is 6 * 8 = 48. This does not compress anything, as it's still 24 bytes in total. But we already have a binary format, because 58, 20 and 30 got a special meaning.
The next step would be to compress each field. This is where it gets tricky. The lname field consists of ASCII characters. In ASCII only 7 of the 8 bits are needed/used (here's a nice table). For example, the letter u in binary is 01110101. We can safely chop off the leftmost bit, which is always 0. This yields 1110101. The same can be done for all the characters, so you'll end up with 11 7-bit values -> 77 bits.
These 77 bits now must be fit in 8-bit bytes. Here are the first 4 bytes user in binary representation, before chopping the leftmost bit off:
01110101 01110011 01100101 01110010
Chopping off a bit in C is done by shifting the byte (i.e. unsigned char) to the left with:
unsigned char byte = lname[0];
byte = byte << 1;
When you do this for all characters you get:
1110101- 1110011- 1100101- 1110010-
Here I use - to indicate the bits in these bytes that are now available to be filled; they became available by shifting all bits one place to the left. You now use one or more bits from the next byte to fill up these - gaps. When doing this for these four bytes you'll get:
11101011 11001111 00101111 0010----
So now there's a gap of 4 bits that should be filled with the bits from the next character, 1, etc.
Filling up these gaps is done by using the binary operators in C which you mention. We already use the shift left <<. To combine 1110101- and 1110011- for example we do:
unsigned char* name; // name MUST be unsigned to avoid problems with binary operators.
<allocate memory for name and read it from the text file>
unsigned char bytes[10]; // 10 is just a random size that gives us enough space.
name[0] = name[0] << 1; // We shift to the left in-place here, so `name` is overwritten.
name[1] = name[1] << 1; // idem.
bytes[0] = name[0] | (name[1] >> 7);
bytes[1] = name[1] << 1;
With name[1] >> 7 we have 1110011- >> 7, which gives 00000001: the first bit of the second character, now in the rightmost position. With the bitwise OR operator | we then 'add' this bit to 1110101-, resulting in 11101011.
You have to do things like this in a loop to get all the bits in the correct bytes.
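To illustrate, here is a minimal sketch of such a loop (BitWriter and put_bits are names I made up, not something given by the exercise). It appends the lowest nbits bits of a value to a buffer, most significant bit first, and is used for both the length byte and the 7-bit characters:
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned char buf[256];  /* packed output */
    unsigned int bitpos;     /* number of bits written so far */
} BitWriter;

static void put_bits(BitWriter *w, unsigned int value, int nbits)
{
    for (int i = nbits - 1; i >= 0; i--) {
        if ((value >> i) & 1u)
            w->buf[w->bitpos / 8] |= (unsigned char)(0x80 >> (w->bitpos % 8));
        w->bitpos++;
    }
}

int main(void)
{
    BitWriter w;
    memset(&w, 0, sizeof w);

    const char *lname = "user1798362";
    int len = (int)strlen(lname);

    put_bits(&w, (unsigned int)(len * 7), 8);      /* bit length of the name field */
    for (int i = 0; i < len; i++)
        put_bits(&w, (unsigned char)lname[i], 7);  /* 7 significant bits per ASCII char */

    printf("packed %u bits into %u bytes\n", w.bitpos, (w.bitpos + 7) / 8);
    return 0;
}
The same helper can later be reused for the lengths and significant bits of the two numbers.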
The new length of this name field is 11 * 7 = 77 bits, so compared to the original 88 we've shaved off a massive 11 bits :-) Note that with a one-byte bit length, we assume that the lname field will never be more than 255 / 7 = 36 characters long.
As with the bytes above, you can then coalesce the second length against the final bits of the lname field.
To compress the numbers you first read them in with fscanf(file, "%u", ...) into an unsigned int. There will be many 0s at the left side of this 4-byte unsigned int. The first number, for example, is (shown in chunks of 4 bits only for readability):
0000 0000 0000 0000 0000 1001 0001 0100
which has 20 unused bits at the left.
You need to get rid of these. Take 32 minus the number of zeros at the left, and you get the bit length of this number. Add this length to the bytes array by coalescing its bits against those of the previous field. Then add only the significant bits of the number to the bytes. This would be:
1001 0001 0100
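A small self-contained sketch of computing that bit length (bit_length is a name I made up):
#include <stdio.h>

static int bit_length(unsigned int v)
{
    int n = 0;
    while (v) {          /* count how many bits remain until the value is zero */
        n++;
        v >>= 1;
    }
    return n ? n : 1;    /* store at least one bit, even for the value 0 */
}

int main(void)
{
    unsigned int id = 2324;                            /* 1001 0001 0100 in binary */
    printf("%u needs %d bits\n", id, bit_length(id));  /* prints: 2324 needs 12 bits */
    return 0;
}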
In C, when working with the bits of an 'int' (but also 'short', 'long', ... any variable/number larger than 1 byte), you must take byte-order or endianness into account.
When you do the above step twice, for both numbers, you're done. You then have a bytes array you can write to a file. Of course you must have kept track of where you were writing in bytes during the steps above, so you know the number of bytes. Note that in most cases there will be a few bits in the last byte that are not filled with data. But that doesn't hurt; it's simply unavoidable waste due to the fact that files are stored in chunks of 8 bits = 1 byte minimally.
When reading the binary file you'll apply the reverse process. You'll read in an unsigned char bytes array. You then know that the first byte (i.e. bytes[0]) contains the bit length of the name field. You then fill in the bytes of lname byte by byte by shifting and masking, etc.
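As a rough sketch of that reverse direction (get_bits is again my own name), assuming the MSB-first packing used in the sketch above:
/* Read `nbits` bits, most significant first, starting at bit offset *pos. */
static unsigned int get_bits(const unsigned char *buf, unsigned int *pos, int nbits)
{
    unsigned int value = 0;
    for (int i = 0; i < nbits; i++) {
        unsigned int bit = (buf[*pos / 8] >> (7 - *pos % 8)) & 1u;
        value = (value << 1) | bit;
        (*pos)++;
    }
    return value;
}
Reading the name back would then be one get_bits(buf, &pos, 8) call for the bit length, followed by bitlen / 7 calls of get_bits(buf, &pos, 7), each giving one ASCII character.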
Good luck!

Related

How to convert a hexadecimal char into a 4-bit binary representation?

I wish to compare a SHA-256 hash which is stored in u8[32] (after being calculated in kernel space) with a 64-character string that the user passes in.
For example: the user passes the SHA-256 hash "49454bcda10e0dd4543cfa39da9615a19950570129420f956352a58780550839" as a char*, which takes 64 bytes. But this has to be compared with a hash inside kernel space which is represented as u8 hash[32].
The hash inside the kernel gets properly printed in ASCII by the following code:
int i;
u8 hash[32];
for(i=0; i<32; i++)
printk(KERN_CONT "%hhx ", hash[i]);
Output :
"49 45 4b cd a1 0e 0d d4 54 3c fa 39 da 96 15 a1 99 50 57 01 29 42 0f 95 63 52 a5 87 80 55 08 39 "
As the complete hash is stored in 32 bytes and printed as 64 chars, in groups of 2 chars per u8, I assume that one u8 currently stores information worth 2 chars, i.e. 00101111 prints as 2f.
Is there a way to store the 64-character string in 32 bytes so that it can be compared?
Here is how to use sscanf to do the conversion:
#include <stdio.h>
#include <inttypes.h>   /* SCNx8, and (via <stdint.h>) uint8_t */

const char *shaStr = "49454bcda10e0dd4543cfa39da9615a19950570129420f956352a58780550839";
uint8_t sha[32];
for (int i = 0; i != 32; i++) {
    sscanf(shaStr + 2*i, "%2" SCNx8, &sha[i]);
    printf("%02x ", sha[i]);
}
The approach here is to call sscanf repeatedly with the "%2" SCNx8 format specifier, which means "two hex characters converted to uint8_t". The position is determined by the index of the loop iteration, i.e. shaStr+2*i
Demo.
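Once the conversion is done, the comparison itself is just a byte-for-byte memcmp against the kernel-side u8 hash[32]. A sketch, with buffer names of my own choosing:
#include <stdint.h>
#include <string.h>

/* returns 1 when the converted user hash equals the kernel-side digest */
int hashes_match(const uint8_t user_sha[32], const uint8_t kernel_sha[32])
{
    return memcmp(user_sha, kernel_sha, 32) == 0;
}
(Inside the kernel you would of course use the kernel's own memcmp from <linux/string.h> rather than the libc one.)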
Characters are often stored in ASCII, so start by having a look at an ASCII chart. This will show you the relationship between a character like 'a' and the number 97.
You will note all of the numbers are right next to each other. This is why you often see people do c-'0' or c-48 since it will convert the ASCII-encoded digits into numbers you can use.
However you will note that the letters and the numbers are far away from each other, which is slightly less convenient. If you arrange them by bits, you may notice a pattern: Bit 6 (&64) is set for letters, but unset for digits. Observing that, converting hex-ASCII into numbers is straightforward:
int h2i(char c) { return (9 * !!(c & 64)) + (c & 15); }
Once you have converted a single character, converting a string is also straightforward:
void hs(char *d, char *s) { while (*s) { *d = (h2i(*s) * 16) + h2i(s[1]); s += 2; ++d; } }
Adding support for non-hex characters embedded (like whitespace) is a useful exercise you can do to convince yourself you understand what is going on.
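For instance, a hypothetical driver program for these two helpers (the buffer name out is mine) could look like this:
#include <stdio.h>

/* h2i() and hs() as defined above */

int main(void)
{
    char out[32];   /* 64 hex chars -> 32 bytes */
    hs(out, "49454bcda10e0dd4543cfa39da9615a19950570129420f956352a58780550839");
    for (int i = 0; i < 32; i++)
        printf("%02x ", (unsigned char)out[i]);
    printf("\n");
    return 0;
}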

What does it mean "bytes numbered from 0 (LSB) to 3 (MSB)"?

I should extract byte n from word x.
Example: getByte(0x12345678,1) = 0x56.
And it is written that bytes are numbered from 0 (LSB) to 3 (MSB), the meaning of which I can't understand.
Thank you.
Consider your 32 bit word (0x12345678) as 4 bytes:
Word : 12 34 56 78 (hex)
Byte #: 3 2 1 0
MSB<-----LSB
MSB = Most Significant Byte
LSB = Least Significant Byte
It means that you are supposed to consider x as composed of bytes, x = b0 + b1·256 + b2·256² + b3·256³ (i.e. the sum of bn × 256^n for n in [0,4)), and given x you are supposed to compute bn. That is, b0 is the least significant byte and b3 is the most significant byte.
MSB and LSB mean Most Significant Byte and Least Significant Byte, respectively. A byte is an 8-bit number that can be directly represented by 2 hexadecimal digits. So the number 0x12345678 is a word containing 4 bytes: 12 34 56 78. The rightmost is the LSB and the leftmost is the MSB. So you are taking byte 1, which is the SECOND byte from right to left.
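A minimal sketch of such a getByte (the signature is an assumption of mine; the exercise may restrict which operators you are allowed to use): shift the wanted byte down to the low 8 bits, then mask everything else away.
unsigned getByte(unsigned x, int n)
{
    return (x >> (n * 8)) & 0xff;   /* n = 0 is the LSB, n = 3 the MSB */
}
/* getByte(0x12345678, 1) == 0x56 */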

Why are decimal numbers used in bitmasks?

This is a pretty basic question, and I'm sure that there's an easy answer to it, but I don't know the search term I should be using to look for an answer. Here it goes:
I'm trying to understand how bitmasks work. On Linux systems there's:
struct stat
that has a st_mode member that's used to determine whether the file being inspected is a regular file, a directory, a symbolic link, and others. So, it's possible to write a simple function that you can pass a name to and get whether or not the name represents a directory:
int isadir( char *name )
/*
 * calls stat, then masks the st_mode word to obtain the
 * filetype portion and sees if that bit pattern is the
 * pattern for a directory
 */
{
    struct stat info;

    return ( stat(name, &info) != -1 && (info.st_mode & S_IFMT) == S_IFDIR );
}
When I look at the bitmask, I see it's represented as follows:
/* Encoding of the file mode. */
#define __S_IFMT 0170000 /* These bits determine file type. */
I thought bitmasks could only have 0s and 1s. Why is there a 7 in the mask?
Numbers starting with a leading 0 are octal numbers — this is standard C syntax.
And these can be useful for bitmasks, especially to represent Unix permissions.
A byte is 8 bits, and can be expressed in decimal (0 to 255), octal (000 to 377), hexadecimal (00 to FF) or binary (00000000 to 11111111). Let's number the bits, from bit 0 to bit 7:
76543210
Actually a number may be expressed in any base, but octal and hexadecimal are the convenient ones when you want to break the number down into bits; expressing a byte in octal is easy:
z   y   x
76 543 210
where x is bits 0 to 2, y is bits 3 to 5 and z is bits 6 and 7.
Thus, in your example, the octal number 017 is
 0   1   7
00 001 111
Numbers expressed in octal (base 8) are easy to convert to binary (in hex that would be 0F).
In C (...), octal literal numbers start with a leading zero (0...), and hexadecimal literals start with a leading 0x (0x...). As it is easier to visualize the bits of numbers expressed in octal,
022 & 017
gives in binary
"00 010 010" &
"00 001 111"
result can be found out easily
"00 000 010"
In decimal, that would be 18 & 15.
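As a quick check (a sketch of my own, not from the original answer), the same AND can be printed from C using octal literals and the %o conversion:
#include <stdio.h>

int main(void)
{
    printf("%o\n", 022 & 017);   /* octal output: 2 (binary 00 000 010) */
    printf("%d\n", 18 & 15);     /* the same values written in decimal: 2 */
    return 0;
}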

Unicode stored in C char

I'm learning the C language on Linux now, and I've come across a slightly weird situation.
As far as I know, the standard C char data type is ASCII, 1 byte (8 bits). That should mean it can hold only ASCII characters.
In my program I use char input[], which is filled by the getchar function like this pseudocode:
char input[20];
int z, i;
for (i = 0; i < 20; i++)
{
    z = getchar();
    input[i] = z;
}
The weird thing is that it works not only for ASCII characters, but for any character I imagine, such as #&#{čřžŧ¶'`[łĐŧđж←^€~[←^ø{&}čž on the input.
My question is - how is it possible? It seems to be one of many beautiful exceptions in C, but I would really appreciate an explanation. Is it a matter of the OS, the compiler, or some hidden additional super-feature of the language?
Thanks.
There is no magic here - the C language gives you access to the raw bytes, as they are stored in the computer memory.
If your terminal is using UTF-8 (which is likely), non-ASCII characters take more than one byte in memory. When you display them again, it is the terminal code which converts these sequences back into a single displayed character.
Just change your code to print the strlen of the strings, and you will see what I mean.
To properly handle UTF-8 non-ASCII characters in C you have to use some library to handle them for you, like glib, Qt, or many others.
ASCII is a 7-bit character set, in C normally represented by an 8-bit char. If the highest bit in an 8-bit byte is set, it is not an ASCII character.
Also notice that you are not guaranteed ASCII as the base character set, though many ignore other scenarios. In other words, if you want to check whether a "primitive" byte is an alphabetic character, you cannot, taking heed of all systems, say:
is_alpha = (c > 0x40 && c < 0x5b) || (c > 0x60 && c < 0x7b);
Instead you'll have to use ctype.h and say:
isalpha(c);
The only exception, AFAIK, is the digits; on most tables at least, they have contiguous values.
Thus this works:
char ninec = '9';
char eightc = '8';
int nine = ninec - '0';
int eight = eightc - '0';
printf("%d\n", nine);
printf("%d\n", eight);
But this is not guaranteed to be 'a':
alpha_a = 0x61;
Systems not based on ASCII, e.g. systems using EBCDIC, run C just fine, but there they (mostly) use 8 bits instead of 7, and e.g. 'A' can be coded as decimal 193 instead of 65 as in ASCII.
In ASCII, however, bytes with decimal values 128 - 255 (all 8 bits in use) are extended and not part of the ASCII set. E.g. ISO-8859 uses this range.
What is also often done is to combine two or more bytes into one character. So if you print two bytes after each other that are defined as, say, UTF-8 0xc3 0x98 == Ø, then you'll get that character.
This again depends on which environment you are in. On many systems/environments printing ASCII values gives the same result across character sets, systems, etc. But printing bytes > 127 or multi-byte characters gives a different result depending on the local configuration.
I.e.:
Mr. A running the program gets
Jasŋ€
While Mr. B gets
Jasπß
This is perhaps especially relevant to the ISO-8859 series and Windows-1252, single-byte representations of extended characters, etc.
ASCII_printable_characters, notice they are 7, not 8 bits.
ISO_8859-1 and ISO_8859-15, widely used sets, with ASCII as core.
Windows-1252, legacy of Windows.
UTF-8#Codepage_layout, in UTF-8 you have ASCII, and beyond that you have special sequences of bytes.
Each sequence starts with a byte > 127 (i.e. beyond the last ASCII byte),
followed by a given number of bytes which all start with the bits 10.
In other words, you will never find an ASCII byte in a multi-byte UTF-8 representation.
That is, the first byte in UTF-8, if not ASCII, tells how many bytes this character has. You could also say ASCII characters say that no more bytes follow, because their highest bit is 0.
I.e. if the file is interpreted as UTF-8:
c = fgetc(file);
if c < 128, 0x80, then ASCII
if c == 194, 0xC2, then one more byte follows, interpret to symbol
if c == 226, 0xE2, then two more bytes follow, interpret to symbol
...
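As a small sketch of that dispatch (utf8_seq_len is a name I made up), the byte count of a sequence can be derived from the leading byte alone:
int utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx: plain ASCII */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return -1;                          /* 10xxxxxx continuation byte, or invalid */
}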
As an example, let's look at one of the characters you mention. In a UTF-8 terminal:
$ echo -n "č" | xxd
Should yield:
0000000: c48d ..
In other words "č" is represented by the two bytes 0xc4 and 0x8d. Add -b to the xxd command and we get the binary representation of the bytes. We dissect them as follows:
___ byte 1 ___ ___ byte 2 ___
| | | |
0xc4 : 1100 0100 0x8d : 1000 1101
| |
| +-- all "follow" bytes starts with 10, rest: 00 1101
|
+ 11 -> 2 bits set = two byte symbol, the "bits set" sequence
end with 0. (here 3 bits are used 110) : rest 0 0100
Rest bits combined: xxx0 0100 xx00 1101 => 00100001101
\____/ \_____/
| |
| +--- From last byte
+------------ From first byte
This gives us: 00100001101 (binary) = 269 (decimal) = 0x10D => Unicode codepoint U+010D == "č".
This number can also be used in HTML as &#269; == č
Common to this and lots of other code systems is that an 8-bit byte is the base.
Often it is also a question of context. As an example take GSM SMS, with ETSI GSM 03.38/03.40 (3GPP TS 23.038, 3GPP 23038). There we also find a 7-bit character table, the GSM 7-bit default alphabet, but instead of storing the characters as 8 bits they are stored as 7 bits¹. This way you can pack more characters into a given number of bytes: a standard SMS of 160 characters becomes 1280 bits or 160 bytes as ASCII, but 1120 bits or 140 bytes as packed SMS.
¹ Not without exception (there is more to the story).
I.e. a simple example of bytes saved as septets (7bit) C8329BFD06 in SMS UDP format to ASCII:
_________
7 bit UDP represented | +--- Alphas has same bits as ASCII
as 8 bit hex '0.......'
C8329BFDBEBEE56C32 1100100 d * Prev last 6 bits + pp 1
| | | | | | | | +- 00 110010 -> 1101100 l * Prev last 7 bits
| | | | | | | +--- 0 1101100 -> 1110010 r * Prev 7 + 0 bits
| | | | | | +----- 1110010 1 -> 1101111 o * Last 1 + prev 6
| | | | | +------- 101111 10 -> 1010111 W * Last 2 + prev 5
| | | | +--------- 10111 110 -> 1101111 o * Last 3 + prev 4
| | | +----------- 1111 1101 -> 1101100 l * Last 4 + prev 3
| | +------------- 100 11011 -> 1101100 l * Last 5 + prev 2
| +--------------- 00 110010 -> 1100101 e * Last 6 + prev 1
+----------------- 1 1001000 -> 1001000 H * Last 7 bits
'------'
|
+----- GSM Table as binary
And 9 bytes "unpacked" becomes 10 characters.
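For completeness, here is a rough sketch of that unpacking in C (the function and variable names are mine; the mapping from GSM alphabet values to characters is omitted, since for plain letters the values coincide with ASCII, as the diagram notes):
#include <stddef.h>

/* Unpack GSM 7-bit septets from octets, as in the diagram above.
   `out` must have room for in_len * 8 / 7 characters; returns how many were written. */
size_t gsm7_unpack(const unsigned char *in, size_t in_len, char *out)
{
    size_t n = 0;
    unsigned int carry = 0;   /* bits left over from the previous octet */
    int carry_bits = 0;
    for (size_t i = 0; i < in_len; i++) {
        /* the low (7 - carry_bits) bits of this octet complete the next septet */
        out[n++] = (char)(((in[i] << carry_bits) | carry) & 0x7f);
        carry = in[i] >> (7 - carry_bits);
        carry_bits++;
        if (carry_bits == 7) {      /* seven leftover bits form a whole septet of their own */
            out[n++] = (char)(carry & 0x7f);
            carry = 0;
            carry_bits = 0;
        }
    }
    return n;
}
Feeding it the 9 octets from the diagram should yield the 10 characters shown there.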
ASCII is 7 bits, not 8 bits. A char[] holds bytes, which can be in any encoding - ISO-8859-1, UTF-8, whatever you want. C doesn't care.
This is the magic of UTF-8: you don't even have to worry about how it works. The only problem is that the C data type is named char (for character), while what it actually holds is a byte. There is no 1:1 correspondence between characters and the bytes that encode them.
What happens in your code is that, from the program's point of view, you input a sequence of bytes, it stores the bytes in memory, and if you print the text it prints bytes. This code doesn't care how these bytes encode the characters; it's only the terminal which needs to worry about encoding them on input and correctly interpreting them on output.
There are of course many libraries that do the job, but to quickly decode any UTF-8 sequence, this little function is handy:
typedef unsigned char utf8_t;
#define isunicode(c) (((c)&0xc0)==0xc0)

/* Decodes the UTF-8 sequence starting at str and returns its code point.
   If i is non-NULL, the length of the sequence in bytes is added to *i. */
int utf8_decode(const char *str, int *i) {
    const utf8_t *s = (const utf8_t *)str; // Use unsigned chars
    int u = *s, l = 1;
    if(isunicode(u)) {
        // Count the leading 1-bits of the first byte to get the sequence length (2..6)
        int a = (u&0x20)? ((u&0x10)? ((u&0x08)? ((u&0x04)? 6 : 5) : 4) : 3) : 2;
        if(a < 6 || !(u&0x02)) {
            int b;
            // Strip the length-marker bits from the leading byte...
            u = ((u<<(a+1))&0xff)>>(a+1);
            // ...then append the 6 payload bits of each continuation byte.
            for(b = 1; b < a; ++b)
                u = (u<<6)|(s[l++]&0x3f);
        }
    }
    if(i) *i += l;
    return u;
}
Considering your code, you can iterate over the string and read the Unicode values:
int l;
for(i = 0; i < 20 && input[i] != '\0'; ) {
    if(!isunicode(input[i])) i++;
    else {
        l = 0;
        z = utf8_decode(&input[i], &l);
        printf("Unicode value at %d is U+%04X and it's %d bytes.\n", i, z, l);
        i += l;
    }
}
There is a data type wint_t (#include <wchar.h>) for non-ASCII characters. You can use the function getwchar() to read them.

Decomposition of an IP header

I have to write a sniffer as an assignment for the security course. I am using C and the pcap library. I got everything working well (since I found code on the internet and adapted it), but I have some questions about the code.
u_int ip_len = (ih->ver_ihl & 0xf) * 4;
ih is of type ip_header, and it is currently pointing to the IP header in the packet.
ver_ihl gives the version of the IP.
I can't figure out what this part means: & 0xf) * 4;
& is the bitwise AND operator; in this case you're ANDing ver_ihl with 0xf, which has the effect of clearing all the bits other than the least significant 4:
0xff & 0x0f = 0x0f
ver_ihl is defined as: first 4 bits = version, second 4 bits = Internet Header Length. The AND operation removes the version data, leaving the length data by itself. The length is recorded as a count of 32-bit words, so the * 4 turns ip_len into the count of bytes in the header.
In response to your comment:
Bitwise AND ands the corresponding bits of the operands. When you AND anything with 0 it becomes 0, and anything ANDed with 1 stays the same.
0xf = 0x0f = binary 0000 1111
So when you AND 0x0f with anything, the upper 4 bits are set to 0 (as you are ANDing them against 0) and the lower 4 bits remain as in the other operand (as you are ANDing them against 1). This is a common technique called bit masking.
http://en.wikipedia.org/wiki/Bitwise_operation#AND
Reading from RFC 791 that defines IPv4:
A summary of the contents of the internet header follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| IHL |Type of Service| Total Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The first 8 bits of the IP header are a combination of the version, and the IHL field.
IHL: 4 bits
Internet Header Length is the length of the internet header in 32
bit words, and thus points to the beginning of the data. Note that
the minimum value for a correct header is 5.
What the code you have is doing, is taking the first 8 bits there, and chopping out the IHL portion, then converting it to the number of bytes. The bitwise AND by 0xF will isolate the IHL field, and the multiply by 4 is there because there are 4 bytes in a 32-bit word.
The ver_ihl field contains two 4-bit integers, packed as the low and high nybble. The length is specified as a number of 32-bit words. So, if you have a Version 4 IP frame, with 20 bytes of header, you'd have a ver_ihl field value of 69. Then you'd have this:
01000101
& 00001111
--------
00000101
So, yes, the "&0xf" masks out the low 4 bits.
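To make the arithmetic concrete, a tiny sketch (variable names are mine, not from the sniffer code):
#include <stdio.h>

int main(void)
{
    unsigned char ver_ihl = 0x45;                /* version 4, IHL 5 (decimal 69) */
    unsigned int version = ver_ihl >> 4;         /* high nybble */
    unsigned int ip_len  = (ver_ihl & 0xf) * 4;  /* low nybble, 32-bit words -> bytes */
    printf("version %u, header length %u bytes\n", version, ip_len);  /* 4, 20 */
    return 0;
}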
