How to write a bitstream - C

I'm thinking about writing some data into a bit stream using C. Two approaches come to mind. One is to concatenate variable-bit-length symbols into a contiguous bit sequence, but that way my decoder will probably have a hard time separating the symbols from the continuous bit stream. The other is to allocate the same number of bits for each symbol, so the decoder can easily recover the original data, but this may waste bits: since the symbols have different values, many bits in the stream would just be padding zeros (wasted bits, I guess).
Any hint what I should do?
I'm new to programming. Any help will be appreciated.

Sounds like you're trying to do something similar to a Huffman compression scheme? I would just go byte by byte (char) and keep track of the offset within the byte where I read off the last symbol.
Assuming none of your symbols is bigger than a char, it would look something like this:
typedef struct bitstream {
    char *data;
    int data_size;           // size of 'data' array
    int last_bit_offset;     // last bit in the stream
    int current_data_offset; // position in 'data', i.e. data[current_data_offset] is the current read/write byte
    int current_bit_offset;  // which bit we are currently reading/writing
} bitstream;

char decodeNextSymbol(bitstream *bs) {
}

int encodeNextSymbol(bitstream *bs, char symbol) {
}
The matching code for decodeNextSymbol and encodeNextSymbol would have to use C's bitwise operations, for example '&' (bitwise AND) and '|' (bitwise OR). I would then come up with a list of all my symbols, starting with the shortest, and use a loop that matches the shortest symbol first. For example, if one of your symbols is '101' and the stream is '1011101', it would match the leading '101' and would continue to match the rest of the stream, '1101'. You would also have to handle the case where a symbol's bits overflow from one byte to the next (the sketch below sidesteps that by moving one bit at a time).
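As a rough illustration (not a complete implementation), one-bit read/write helpers for the struct above might look like this; it assumes MSB-first packing within each byte and a zero-initialized buffer, and the helper names are mine:

/* Mirrors the struct above; only the fields used here are repeated. */
typedef struct bitstream {
    char *data;              /* zero-initialized buffer */
    int data_size;           /* size of 'data' array */
    int current_data_offset; /* current byte */
    int current_bit_offset;  /* current bit within that byte */
} bitstream;

/* Append one bit, packing MSB-first within each byte. */
void write_bit(bitstream *bs, int bit) {
    if (bit)
        bs->data[bs->current_data_offset] |= 1 << (7 - bs->current_bit_offset);
    if (++bs->current_bit_offset == 8) { /* byte is full: advance */
        bs->current_bit_offset = 0;
        bs->current_data_offset++;
    }
}

/* Read the next bit from the stream. */
int read_bit(bitstream *bs) {
    int bit = (bs->data[bs->current_data_offset] >> (7 - bs->current_bit_offset)) & 1;
    if (++bs->current_bit_offset == 8) {
        bs->current_bit_offset = 0;
        bs->current_data_offset++;
    }
    return bit;
}

Writing one bit at a time makes the byte-boundary problem disappear: encoding a multi-bit symbol is just a loop of write_bit() calls.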

Related

Array to use for appending unknown number of bytes into single large array in SystemVerilog

I am trying to append an unknown number of bytes into a single large array. Which array type should I use? I am trying to do this:
len = temp_i.len();
for (i = 0; i < len; i++) begin
    bit [7:0] temp_ascii;
    temp_ascii = temp_i.getc(i);
    arr = {arr, temp_ascii};
end
where temp_i is an input string. My final aim is to convert the input string into the binary representation of its ASCII values and concatenate them together into a single large array.
I'm having a hard time choosing what kind of array to use: dynamic or associative, or whether I can use a queue.
Any help will be highly appreciated.
You use associative arrays when the index values are not consecutive, or the ordering is meaningless. Not applicable here.
You use queues when adding or removing one element at a time to or from an array. If arr were declared as a queue, you could write:
string temp_i;
bit [7:0] arr[$];
int len;

len = temp_i.len();
for (int i = 0; i < len; i++)
    arr.push_back(temp_i.getc(i));
If your strings are small, or you plan to concatenate many strings together, a queue is your best option. But if you only plan to convert one string to an array, then using a bit-stream cast to a dynamic array will be the most efficient.
string temp_i;
typedef bit [7:0] uint8_da_t[]; // typedef required for cast to target
uint8_da_t arr; // using typedef not required here, but A VERY GOOD IDEA
arr = uint8_da_t'(temp_i);
Is it supposed to be synthesizable code or a testbench?
None of the above is synthesizable.
You would do it differently in those two worlds.

Why do we need unsigned char for Huffman tree code

I am trying to create a Huffman tree. The question I read is very strange to me; it is as follows:
Given the following data structure:
struct huffman
{
    unsigned char sym;            /* symbol */
    struct huffman *left, *right; /* left and right subtrees */
};
write a program that takes the name of a binary file as sole argument,
builds the Huffman tree of that file assuming that atoms (elementary
symbols) are 8-bit unsigned characters, and prints the tree as well as
the dictionary.
allocations must be done using nothing else than
malloc(), and sorting can be done using qsort().
Here the thing that confuses me is that to write a program to create a Huffman tree we just need to do the following:
Take a frequency array (that could be Farray[]={.......})
Sort it and join the two smallest nodes into a new node, repeating until only one final node (the root) is left.
Now the question is: why and where do we need that unsigned char data? (What kind of unsigned char data does this question want? I think frequency alone is enough to display a Huffman tree.)
If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever you need to know what original symbol each node represents.
Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:
( )
/ \ A = 1
( ) (A) B = 00
/ \ C = 010
(B) ( ) D = 011
/ \
(C) (D)
If you don't store sym, it looks like this:
( )
/ \ A = ?
( ) ( ) B = ?
/ \ C = ?
( ) ( ) D = ?
/ \
( ) ( )
Not very useful, that, is it?
Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:
FILE *input = fopen("inputfile", "rb");
int freq[256] = {0};
int c;

while ((c = fgetc(input)) != EOF)
    freq[c]++;
fclose(input);

/* do Huffman algorithm */
...
Now, that still needs improving since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)
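To give an idea of the /* do Huffman algorithm */ part, here is a rough sketch of the combine-the-two-smallest loop described in the question, using only malloc() and qsort() as the assignment requires. The freq field and all function names are mine, added on top of the struct from the question:

#include <stdlib.h>

struct huffman {
    unsigned char sym;            /* symbol (meaningful at leaves) */
    unsigned long freq;           /* my extra field, used while building */
    struct huffman *left, *right;
};

/* qsort comparator: sort descending by frequency, so the two
   smallest nodes end up at the tail of the array */
static int cmp_desc(const void *a, const void *b) {
    const struct huffman *x = *(const struct huffman * const *)a;
    const struct huffman *y = *(const struct huffman * const *)b;
    return (y->freq > x->freq) - (y->freq < x->freq);
}

/* nodes[] initially holds one malloc'd leaf per symbol with freq > 0 */
struct huffman *build_tree(struct huffman **nodes, int n) {
    while (n > 1) {
        qsort(nodes, n, sizeof *nodes, cmp_desc);
        struct huffman *parent = malloc(sizeof *parent);
        parent->left  = nodes[n - 2];  /* the two smallest frequencies */
        parent->right = nodes[n - 1];
        parent->freq  = parent->left->freq + parent->right->freq;
        parent->sym   = 0;             /* internal node: no symbol */
        nodes[n - 2] = parent;         /* parent replaces the pair */
        n--;
    }
    return nodes[0];                   /* the root */
}

Re-sorting the whole array every iteration is wasteful (a real implementation would use a priority queue), but it keeps the sketch within the qsort()/malloc() constraint.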
It's been a while since I did this, but I think the generated "dictionary" is required to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.
While decoding, you traverse the tree (left or right, according to successive input bits), and when you reach a leaf (a node whose child pointers are null), the 'sym' in that node is the output value, as in the sketch below.
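In code, that traversal could look something like this sketch (the struct is the one from the question; the next_bit callback is an assumption of mine):

/* struct huffman as given in the question */
struct huffman {
    unsigned char sym;
    struct huffman *left, *right;
};

/* Walk from the root, going left on a 0 bit and right on a 1 bit,
   until a leaf is reached; its sym is the decoded symbol. */
unsigned char decode_symbol(const struct huffman *root, int (*next_bit)(void)) {
    const struct huffman *node = root;
    while (node->left != NULL)         /* internal nodes have two children */
        node = next_bit() ? node->right : node->left;
    return node->sym;
}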
Usually data compression is divided into two big steps; given a stream of data:
evaluate the probability that a given symbol will appear in the stream, in other words, evaluate how frequently each symbol appears in the dataset;
once you have studied the occurrences and built your table of symbols with their associated probabilities, you need to encode the symbols according to their probability. To achieve this magic you create a dictionary where the original symbol is often simply replaced with another symbol that is much smaller in size, especially for symbols that are frequently used in the dataset. The dictionary keeps track of these substitutions for both the encoding and the decoding phase. Huffman gives you an algorithm to automate this process and get a fairly good result.
In practice it's a little bit more complicated than that, because trees are involved, but the main purpose is always to build the dictionary, as in the sketch below.
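For instance, deriving the dictionary from a finished tree is just a walk that records the path to each leaf. A rough sketch, reusing the struct from the question (the function name and the '0'=left / '1'=right convention are mine):

#include <stdio.h>

struct huffman {
    unsigned char sym;
    struct huffman *left, *right;
};

/* Append '0' when going left and '1' when going right;
   print the accumulated code when a leaf is reached. */
void print_codes(const struct huffman *node, char *code, int depth) {
    if (node->left == NULL && node->right == NULL) { /* leaf */
        code[depth] = '\0';
        printf("%c = %s\n", node->sym, code);
        return;
    }
    code[depth] = '0';
    print_codes(node->left, code, depth + 1);
    code[depth] = '1';
    print_codes(node->right, code, depth + 1);
}

Called as print_codes(root, buffer, 0) with a buffer at least as deep as the tree, this prints exactly the kind of dictionary shown in the A/B/C/D example above.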
There is a complete tutorial here.

How to read complex numbers from a file in C?

I have a text file which contains the values of a 64x64 matrix, and they are complex numbers. I want to read them from the file, but I'm having difficulties. Either using the complex library of C or creating a new data type for complex numbers is fine with me; I just need them read correctly.
What I mean is, whether using:
#include <complex.h>
int complex matrix[64][64];
or creating a data type for it:
typedef struct {
int real, imag;
} Complex;
Complex matrix[64][64];
is okay for me as long as they are read correctly.
Below you can find 2x3 matrix, just to demonstrate how the numbers are in my file:
{{-32767, 12532 + 5341I, -3415 - 51331I}
{32767I, 32609 + 3211I, 32137 + 6392I}}
So as you can see, some entries have both a real and an imaginary part, some just the imaginary part, and some just the real part, and all the imaginary parts have an upper-case 'I' at the end. If you could help me with that, I would be glad.
Two common design patterns apply:
1) Recursive descent parser that accepts a 'language'
2) State Machine
Both cases can benefit from a read_char() function that exits with an error if it encounters anything other than '{', '}', 0-9, 'I', '+', '-', and which skips all whitespace (ch <= 32).
A state machine can be a bit more versatile: if at any point there is a '+' or '-', one can just add or subtract the next value (whether it ends with 'I' or not) to or from the currently accumulated value. (The state machine is then also able to evaluate things like 1+2-1+1I as it goes.)
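A rough, untested sketch of that accumulating state machine (all names are mine; it assumes the format from the question, with an upper-case 'I' suffix on imaginary terms):

#include <stdio.h>
#include <ctype.h>

typedef struct { int real, imag; } Complex;

/* Read one matrix entry. Digits accumulate into val; a trailing 'I' adds the
   finished term to imag, anything else ends a real term. Braces and
   whitespace are skipped. Returns 1 when an entry was read, 0 at end of input. */
int read_complex(FILE *f, Complex *out) {
    int c, sign = 1, val = 0, in_num = 0, got = 0;
    out->real = out->imag = 0;
    while ((c = fgetc(f)) != EOF) {
        if (isdigit(c)) {
            val = val * 10 + (c - '0');
            in_num = 1;
        } else if (c == 'I') {               /* imaginary term is complete */
            out->imag += sign * val;
            val = 0; sign = 1; in_num = 0; got = 1;
        } else {
            if (in_num) {                    /* real term is complete */
                out->real += sign * val;
                val = 0; sign = 1; in_num = 0; got = 1;
            }
            if (c == '-') sign = -1;
            else if (c == '+') sign = 1;
            else if ((c == ',' || c == '}') && got) return 1;
            /* '{' and whitespace are simply skipped */
        }
    }
    if (in_num) { out->real += sign * val; got = 1; }
    return got;
}

Calling read_complex() 64*64 times in a row-major loop would fill the matrix; because it accumulates term by term, it also handles the sum-like entries mentioned above.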
It doesn't matter how the numbers are organized: if they're contiguous in memory (and the file stores the raw bytes of the struct rather than text), you can read them all at once:
Complex matrix[64][64];
fread(matrix,sizeof(Complex),64*64, your_file_pointer);
Same for writing:
fwrite(matrix, sizeof(Complex), 64*64, your_file_pointer);

custom data type in C

I am working with cryptography and need to use some really large numbers. I am also using the new Intel instruction for carryless multiplication, which requires the __m128i data type; that is loaded with a function taking floating-point data as its arguments.
I need to store a 2^1223 integer and then square it and store that value as well.
I know I can use the GMP library, but I think it would be faster to create two data types that store values like 2^1224 and 2^2448. It will have less overhead. I am going to use Karatsuba to multiply the numbers, so the only operation I need to perform on the data type is addition, as I will be breaking the numbers down to fit __m128i.
Can someone direct me towards material that can help me create integers of the size I need?
If you need your own datatypes (regardless of whether it's for math, etc), you'll need to fall back to structures and functions. For example:
struct bignum_s {
    char bignum_data[1024];
};
(obviously you want to get the sizing right, this is just an example)
Most people end up typedefing it as well:
typedef struct bignum_s bignum;
And then create functions that take two (or whatever) pointers to the numbers to do what you want:
/* takes two bignums and ORs them together, putting the result back into a */
void
bignum_or(bignum *a, bignum *b) {
    int i;
    for (i = 0; i < sizeof(a->bignum_data); i++) {
        a->bignum_data[i] |= b->bignum_data[i];
    }
}
You really want to end up defining nearly every function you might need, and this frequently includes memory allocation functions (bignum_new), memory freeing functions (bignum_free), and init routines (bignum_init). Even if you don't need them now, doing it in advance will set you up for when the code needs to grow and develop later. For the addition the question singles out, see the sketch below.
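Since the question says addition is the only arithmetic the type itself needs (Karatsuba supplies the multiplication strategy), here is a hedged sketch of what that one routine could look like with 32-bit limbs; the limb count and all names are mine:

#include <stdint.h>
#include <stddef.h>

#define BIGNUM_LIMBS 40                /* 40 * 32 = 1280 bits, enough for 2^1224 */

typedef struct bignum_s {
    uint32_t limb[BIGNUM_LIMBS];       /* least significant limb first */
} bignum;

/* a += b: schoolbook addition with carry propagation */
void bignum_add(bignum *a, const bignum *b) {
    uint64_t carry = 0;
    size_t i;
    for (i = 0; i < BIGNUM_LIMBS; i++) {
        uint64_t sum = (uint64_t)a->limb[i] + b->limb[i] + carry;
        a->limb[i] = (uint32_t)sum;    /* keep the low 32 bits */
        carry = sum >> 32;             /* the high bit becomes the next carry */
    }
    /* a final nonzero carry means overflow modulo 2^(32*BIGNUM_LIMBS) */
}

The 2^2448 results would get a second, wider type (roughly double the limbs); the same loop works there with a different limb count.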

Need to make a very large array [2^16+1][2^16+1] - size of array is too large :(

Greetings
I need to calculate the first-order entropy (Markov source, as described on Wikipedia: http://en.wikipedia.org/wiki/Entropy_(information_theory)) of a signal that consists of 16-bit words.
This means I must calculate how frequently each combination a->b (symbol b appearing after symbol a) occurs in the data stream.
When I was doing it for just the 4 least significant or 4 most significant bits, I used a two-dimensional array, where the first dimension was the first symbol and the second dimension was the second symbol.
My algorithm looked like this:
Read current symbol
Array[prev_symbol][curr_symbol]++
prev_symbol=curr_symbol
Move forward 1 symbol
Then, Array[a][b] would hold how many times symbol b occurred after symbol a in the stream.
Now, I understand that a 2D array in C is laid out contiguously and indexed with pointer arithmetic: to get element [3][4] of array[10][10], the pointer to array[0][0] is advanced by (3*10+4)*(size of the variable stored in the array). I understand that the problem must be that 2^32 elements of type unsigned long simply take too much memory.
But still, is there a way to deal with it?
Or maybe there is another way to accomplish this?
A two-dimensional array of 4-byte integers with 65,536 by 65,536 elements occupies about 16 GByte of RAM. Does your machine have that much memory?
Anyhow, out of the more than 4 billion array elements, only very few will have a count different from zero. So it's probably better to go with some sort of sparse storage.
One solution would be to use a dictionary where the tuple (a, b) is the key and the count of occurrences is the value; a C sketch of that idea follows.
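In C, that dictionary could be a small open-addressing hash table. A rough sketch (the table size, hash constant, and all names are mine; it assumes the number of distinct pairs stays well below TABLE_SIZE):

#include <stdlib.h>
#include <stdint.h>

#define TABLE_SIZE (1u << 22)     /* power of two; 4M slots of 8 bytes = 32 MB */

typedef struct {
    uint32_t key;                 /* (prev << 16) | curr */
    uint32_t count;               /* 0 marks an empty slot */
} pair_slot;

static pair_slot *table;          /* table = calloc(TABLE_SIZE, sizeof(pair_slot)); */

void count_pair(uint16_t prev, uint16_t curr) {
    uint32_t key = ((uint32_t)prev << 16) | curr;
    uint32_t h = (key * 2654435761u) & (TABLE_SIZE - 1); /* multiplicative hash */
    while (table[h].count != 0 && table[h].key != key)
        h = (h + 1) & (TABLE_SIZE - 1);                  /* linear probing */
    table[h].key = key;
    table[h].count++;
}

At 32 MB this is a tiny fraction of the 16 GB full matrix, and a stream of N words can produce at most N-1 distinct pairs anyway.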
Perhaps you could do multiple passes over the data. The entropy contribution from pairs beginning with symbol X is essentially independent of pairs beginning with any other symbol (aside from the total number of them, of course), so you can calculate the entropy for all such pairs and then throw away the distribution data. At the end, combine 2^16 partial entropy values to get the total. You don't necessarily have to do 2^16 passes over the data, you can be "interested" in as many initial characters in a single pass as you have space for.
Alternatively, if your data is smaller than 2^32 samples, then you know for sure that you won't see all possible pairs, so you don't actually need to allocate a count for each one. If the sample is small enough, or the entropy is low enough, then some kind of sparse array would use less memory than your full 16GB matrix.
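To make the multiple-passes idea concrete, the contribution of all pairs that begin with one particular symbol could be computed like this (all names are mine; summing this over every initial symbol and dividing by the total number of pairs gives the first-order entropy in bits per symbol):

#include <math.h>
#include <stddef.h>

/* counts[i] = how often the chosen initial symbol was followed by symbol i;
   row_total = sum of counts[]. Returns the row's entropy contribution,
   weighted by occurrences. */
double partial_entropy(const unsigned long *counts, size_t n,
                       unsigned long row_total) {
    double h = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (counts[i] == 0)
            continue;                              /* 0 * log(0) counts as 0 */
        double p = (double)counts[i] / row_total;  /* p(b | a) */
        h -= (double)counts[i] * log2(p);
    }
    return h;
}

Once a row is processed its counts can be thrown away, which is exactly what lets the passes fit in memory.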
Did a quick test on Ubuntu 10.10 x64
gt@thinkpad-T61p:~/test$ uname -a
Linux thinkpad-T61p 2.6.35-25-generic #44-Ubuntu SMP Fri Jan 21 17:40:44 UTC 2011 x86_64 GNU/Linux
gt@thinkpad-T61p:~/test$ cat mtest.c
#include <stdio.h>
#include <stdlib.h>

short *big_array;

int main(void)
{
    if ((big_array = (short *)malloc(4UL*1024*1024*1024 * sizeof (short))) == NULL) {
        perror("malloc");
        return 1;
    }
    big_array[0]++;
    big_array[100]++;
    big_array[1UL*1024*1024*1024]++;
    big_array[2UL*1024*1024*1024]++;
    big_array[3UL*1024*1024*1024]++;
    printf("array[100] = %d\narray[3G] = %d\n", big_array[100], big_array[3UL*1024*1024*1024]);
    return 0;
}
gt@thinkpad-T61p:~/test$ gcc -Wall mtest.c -o mtest
gt@thinkpad-T61p:~/test$ ./mtest
array[100] = 1
array[3G] = 1
gt@thinkpad-T61p:~/test$
It looks like the virtual memory system on Linux is up to the job, as long as you have enough memory and/or swap.
Have fun!
