Constructing a sequential Huffman tree from scratch - C

Given a text file, I need to read each alphanumeric character and encode it using Huffman's algorithm.
Reading characters, storing probabilities and creating nodes are solved, as is building the Huffman trie using pointers.
However, I need to create and initialize the Huffman tree using a sequential representation of a binary tree, without any pointers.
This could be done by building a regular pointer-based tree and then reading it into the array, but I aim to populate the array with nodes directly.
I considered creating smaller trees and merging them together, but opted for a matrix representation: I would take the elements with the smallest probabilities from a binary heap and store them in the rows of a matrix, where each row represents the level at which the node should sit in the binary tree, in reverse order.
E.g., given characters and their probabilities as char[int] pairs:
a[1], b[1], c[2], d[1], e[3], f[11], g[2]
I aim to create a matrix that looks like this:
____________________________________
a | b | d | g |
____________________________________
ab | c | dg | e |
____________________________________
abc | deg | | |
____________________________________
abcdeg | f | | |
____________________________________
abcdefg | | | |
____________________________________
where the rows of the matrix correspond to the levels of a, b, c, d, e, f and g.
Currently, I'm stuck on two things: how to recursively increment the levels of elements when their "parent" moves (if I'm combining two nodes from different levels, such as 'ab' and 'c', I can simply set c's level equal to ab's and the problem is solved, but what if, say, 'c' and 'd' were both in the second row?), and how to create a full binary tree (if a node has a left son, it must have a right one) from only the levels of the terminal nodes.
I understand that the question is not very specific, and I would appreciate hearing about other approaches to this problem instead of just fixes to the one described.

Is this a contrived problem for homework? I ask because representations of trees that don't use links require O(2^h) space to store a tree of height h. This is because they assume the tree is complete, allowing index calculations to replace pointers. Since Huffman trees can have height h = m - 1 for an alphabet of size m, the worst-case array could be enormous (for a 256-symbol alphabet, more than 2^255 entries). Most of it would be unused.
But if you give up the idea that a link must be a pointer and allow it to be an array index, then you're fine. A long time ago, before dynamic memory allocators became common, this was standard. This problem is particularly well suited to the method because you always know the number of nodes in the tree in advance: one less than twice the alphabet size. In C you might do something like this:
typedef struct {
    char ch;
    int f;
    int left, right; // Indices of children. If both -1, this is a leaf for char ch.
} NODE;

#define ALPHABET_SIZE 7

NODE nodes[2 * ALPHABET_SIZE - 1] = {
    { 'a', 1, -1, -1 },
    { 'b', 1, -1, -1 },
    { 'c', 2, -1, -1 },
    { 'd', 1, -1, -1 },
    { 'e', 3, -1, -1 },
    { 'f', 11, -1, -1 },
    { 'g', 2, -1, -1 },
    // Rest of the array is reserved for internal nodes.
};
int n_nodes = ALPHABET_SIZE;

int add_internal_node(int f, int left, int right) {
    // Allocate a new node in the array and fill in its values.
    int i = n_nodes++;
    nodes[i] = (NODE) { .f = f, .left = left, .right = right };
    return i;
}
Now you'd use the standard tree-building algorithm like this:
int build_huffman_tree(void) {
    // Add the indices of the leaf nodes to the priority queue.
    for (int i = 0; i < ALPHABET_SIZE; ++i)
        add_to_frequency_priority_queue(i);
    while (priority_queue_size() > 1) {
        int a = remove_min_frequency(); // Removes index of lowest-frequency node from the queue.
        int b = remove_min_frequency();
        int p = add_internal_node(nodes[a].f + nodes[b].f, a, b);
        add_to_frequency_priority_queue(p);
    }
    // The last remaining node is the Huffman tree root.
    return remove_min_frequency();
}
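The priority-queue functions are left to the reader; in case it helps, here is one minimal way they might look. This is my sketch, not part of the original answer: an unsorted array with linear extract-min, which is fine for small alphabets (a binary heap would give O(log m) operations instead).
// A minimal priority queue over node indices, ordered by nodes[i].f.
static int pq[2 * ALPHABET_SIZE - 1];
static int pq_size = 0;

void add_to_frequency_priority_queue(int node_index) {
    pq[pq_size++] = node_index;
}

int priority_queue_size(void) {
    return pq_size;
}

int remove_min_frequency(void) {
    int min = 0;
    for (int i = 1; i < pq_size; ++i)       // linear scan for the minimum
        if (nodes[pq[i]].f < nodes[pq[min]].f)
            min = i;
    int result = pq[min];
    pq[min] = pq[--pq_size];                // fill the hole with the last element
    return result;
}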
The decoding algorithm will use the index of the root like this:
char decode(BIT bits[], int huffman_tree_root_index) {
    int i = 0, p = huffman_tree_root_index;
    while (nodes[p].left != -1 || nodes[p].right != -1) // while not a leaf
        p = bits[i++] ? nodes[p].right : nodes[p].left;
    return nodes[p].ch;
}
Of course this doesn't return how many bits were consumed, which a real decoder needs to do. A real decoder is also not getting its bits in an array. Finally, for encoding you want parent indices in addition to the children (one way to do that is sketched below). Working out these matters ought to be fun. Good luck with it.
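One way to do that (my assumption; the answer doesn't spell it out) is to add an int parent field to NODE, have add_internal_node set it on both children, and emit a character's code by walking from its leaf up to the root:
// Sketch: encode by walking leaf-to-root via parent links, then reversing.
// Assumes NODE gained an `int parent` field (-1 for the root) and that
// add_internal_node sets nodes[left].parent = nodes[right].parent = i.
int encode(int leaf_index, int code_bits[]) {
    int n = 0;
    // Record, bottom-up, whether each step was taken as a right child.
    for (int p = leaf_index; nodes[p].parent != -1; p = nodes[p].parent)
        code_bits[n++] = (nodes[nodes[p].parent].right == p);
    // The bits are leaf-to-root; reverse them into root-to-leaf order.
    for (int i = 0, j = n - 1; i < j; ++i, --j) {
        int t = code_bits[i];
        code_bits[i] = code_bits[j];
        code_bits[j] = t;
    }
    return n; // code length in bits
}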

Related

Picking random indexes into a sorted array

Let's say I have a sorted array of values:
int n = 4; // always lower than or equal to the number of unique values in v
int i[256] = {0};
int v[] = {1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13};
//            ^                    ^     ^                ^
// EX 1: one possible selection: the last 1, the last 5, the last 7, and the 13
// (EX 2 and EX 3 would mark other such selections)
I would like to generate n random index values i[0] ... i[n-1], so that:
v[i[0]] ... v[i[n-1]] point to unique numbers (i.e., they must not point to 5 twice);
each index must point to the rightmost of its kind (i.e., it must point to the last 5);
an index to the final number (13 in this case) is always included.
What I've tried so far:
Getting the indexes to the last of the unique values
Shuffling the indexes
Pick out the n first indexes
I'm implementing this in C, so the more standard C functions I can rely on and the shorter code, the better. (For example, shuffle is not a standard C function, but if I must, I must.)
Create an array of the last index values
int last[] = { 1, 2, 3, 8, 10, 12, 14 };
Fisher-Yates shuffle the array.
Take the first n-1 elements from the shuffled array.
Add the index to the final number.
Sort the resulting array, if desired. (A sketch of this approach in C follows.)
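A minimal, untested C sketch of these steps, under the assumption that rand() is good enough here (rand() % x is slightly biased, which rarely matters for this kind of use):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Compare ints for qsort.
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int last[] = { 1, 2, 3, 8, 10, 12, 14 }; // last indices, excluding the final 13
    int m = sizeof last / sizeof last[0];
    int n = 4;
    int sample[4];

    srand((unsigned)time(NULL));

    // Partial Fisher-Yates: only the first n-1 positions need shuffling.
    for (int k = 0; k < n - 1; ++k) {
        int r = k + rand() % (m - k);
        int t = last[k]; last[k] = last[r]; last[r] = t;
        sample[k] = last[k];
    }
    sample[n - 1] = 15; // index of the final number (13) is always included

    qsort(sample, n, sizeof sample[0], cmp_int);
    for (int k = 0; k < n; ++k)
        printf("%d ", sample[k]);
    printf("\n");
    return 0;
}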
A different approach is reservoir sampling, which can be used whenever you know how big a sample you need but not how many elements you're sampling from. (The name comes from the idea that you always maintain a reservoir of the correct number of samples. When a new value comes in, you mix it into the reservoir, remove a random element, and continue.)
Create the return value array sample of size n.
Start scanning the input array. Each time you find a new value, add its index to the end of sample, until you have n sampled elements.
Continue scanning the array, but now when you find a new value:
a. Choose a random number r in the range [0, i) where i is the number of unique values seen so far.
b. If r is less than n, overwrite element r with the new element.
When you get to the end, sort sample, assuming you need it to be sorted.
To make sure you always have the last element in the sample, run the above algorithm to select a sample of size n-1 and then append the last index. (An element only becomes a candidate once a bigger one has been found after it, since only then is it known to be the last of its kind.)
The algorithm is linear in the size of v (plus an n log n term for the sort in the last step.) If you already have the list of last indices of each value, there are faster algorithms (but then you would know the size of the universe before you started sampling; reservoir sampling is primarily useful if you don't know that.)
In fact, it is not conceptually different from collecting all the indices and then finding the prefix of a Fisher-Yates shuffle. But it uses O(n) temporary memory instead of enough to store the entire index list, which may be considered a plus.
Here's an untested sample C implementation (which requires you to write the function randrange()):
/* Produces (in `out`) a uniformly distributed sample of maximum size
* `outlen` of the indices of the last occurrences of each unique
* element in `in` with the requirement that the last element must
* be in the sample.
* Requires: `in` must be sorted.
* Returns: the size of the generated sample, which will be `outlen`
* unless there were not enough unique elements.
* Note: `out` is not sorted, except that the last element in the
* generated sample is the last valid index in `in`
*/
size_t sample(int* in, size_t inlen, size_t* out, size_t outlen) {
    size_t found = 0; // number of candidate indices seen so far
    if (inlen && outlen) {
        // The last output is fixed, so we need at most outlen-1 random indices.
        --outlen;
        int prev = in[0];
        for (size_t curr = 1; curr < inlen; ++curr) {
            if (in[curr] == prev) continue;
            // in[curr-1] was the last occurrence of its value: a new candidate.
            if (found < outlen) {
                out[found] = curr - 1;             // reservoir not yet full
            } else {
                size_t r = randrange(0, found + 1);
                if (r < outlen) out[r] = curr - 1; // replace a random slot
            }
            ++found;
            prev = in[curr];
        }
        // Add the last index to the output.
        if (found > outlen) found = outlen;
        out[found++] = inlen - 1;
    }
    return found;
}
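randrange() is left to the reader; a serviceable definition for illustration might be the following (my sketch: rand()'s modulo bias makes it unsuitable for serious statistical work):
#include <stdlib.h>

// Return a pseudo-random size_t in [lo, hi). Assumes hi > lo.
size_t randrange(size_t lo, size_t hi) {
    return lo + (size_t)rand() % (hi - lo);
}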

How to do a set union of two double floating point arrays but allowing a tolerance of error of 1 microsecond

I am trying to calculate the union of two arrays containing double floating point values (they are timestamps in milliseconds), but I need to allow a tolerance of +/- one microsecond.
For example:
consider the two values from the two different lists (or arrays) below:
[ref 0 : 1114974059.841] [dut 0 : 1114974059.840]
there is a small delta between the above two numbers: 0.001 milliseconds, i.e. one microsecond. So when I make my new union list, they shouldn't both appear as unique; they should be counted as ONE item, and the union should keep only the item from the first list (in this example, the ref one, ending in 059.841).
More examples of the above type:
[ref 21 : 1114974794.562] [dut 18 : 1114974794.560]
[ref 22 : 1114974827.840] [dut 19 : 1114974827.840]
[ref 23 : 1114974861.121] [dut 20 : 1114974861.120]
Each of the above pairs should be considered ONE item, so the union list would have all three values from the ref array and NONE from the dut array.
Now consider the example :
[ref 8 : 1114974328.641] [dut 8 : 1114974361.921]
Here, the delta between the two values is large (about 33 milliseconds), far beyond the one-microsecond tolerance, and hence they should be considered TWO unique items in the new union list.
Here is another example like the above one :
[ref 13 : 1114974495.041] [dut 12 : 1114974528.321]
[ref 26 : 1114974960.960] [dut 23 : 1114975027.520]
[ref 27 : 1114974994.240] [dut 23 : 1114975027.780]
They all should be considered unique in the new union list.
Can you help me?
I made a subroutine that tests whether two values are within tolerance:
unsigned int AlmostEqualRelative(double A, double B, double maxRelDiff) {
    double diff = fabs(A - B); // absolute difference, despite the function's name
    if (diff < maxRelDiff) {
        //printf("\n page hit [ref: %10.3f] [dut: %10.3f]", A, B);
        return 1;
    }
    //printf("\n page miss [ref: %10.3f] [dut: %10.3f]", A, B);
    return 0;
}
I give maxRelDiff as .02.
Assuming that the two lists are of the same size:
unsigned int                     // size of output array
union_list(const double *list1,  // first list
           const double *list2,  // second list
           double *ulist,        // output list
           unsigned int size)    // size of each input list
{
    unsigned int i = 0u, j = 0u;
    unsigned int result;
    for (; i < size; ++i, ++j)
    {
        result = is_near(list1[i], list2[i]);
        if (result == 1)
        {
            ulist[j] = list1[i];
        }
        else
        {
            ulist[j] = list1[i];
            j += 1;
            ulist[j] = list2[i];
        }
    }
    return j;
}
Now ulist is the output array. The maximum number of elements it can hold is size*2 and the minimum is size, so allocate the maximum. The returned value is the number of elements written.
int is_near(double a, double b)
{
    int result = 1;                     // 1 means "near"
    if (fabs(a - b) >= relative_error)  // relative_error is a global tolerance
        result = 2;                     // 2 means "distinct"
    return result;
}
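For example, a caller following the sizing rule above might look like this (hypothetical names and sizes):
#define SIZE 28
double ref[SIZE];        // filled with ref timestamps
double dut[SIZE];        // filled with dut timestamps
double ulist[2 * SIZE];  // worst case: no pairs are near
unsigned int n = union_list(ref, dut, ulist, SIZE);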
You can do it like this:
First, make a copy foo of the ref array. foo must be large enough to hold the elements of both the ref and dut arrays, as they may be mutually exclusive.
len = 0;
for (i = 0; i < length_ref; i++)
    foo[len++] = ref[i];
Then, add to foo only those elements of dut which are not in ref.
for (i = 0; i < length_dut; i++)
{
    flag = 1;
    for (j = 0; j < length_ref; j++)
        if (AlmostEqualRelative(dut[i], ref[j], MAXRELDIFF))
        {
            flag = 0;
            break;
        }
    if (flag)
        foo[len++] = dut[i];
}
One way to approach this is to rewrite your AlmostEqualRelative function so that it conforms to the prototype int (*compar)(const void *, const void *) (i.e., return negative, zero or positive depending on the relative ordering of the arguments). That makes it suitable for passing to the qsort(3) function.
That done, you could concatenate your two starting arrays, qsort the concatenation, and then work through the resulting array, copying to the final result only a single exemplar of each run of elements that mutually compar to zero.
That would require a little bit of bookkeeping, and a bit of extra space, but it uses existing code (i.e., qsort), and unless your arrays are so big that the extra space is a problem, it's probably the most straightforward route to a solution.
That is, this is the moral equivalent of getting the set-union of two files by doing cat f1 f2 | sort | uniq, but with some fuzziness in the uniq comparison.
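A sketch of that route (my code, not the answerer's). Note that a tolerance-based comparator is not a true total order, so this relies on values being either genuinely close or well separated, and it keeps an arbitrary exemplar from each fuzzy-equal run rather than preferring the ref value; fixing that is part of the bookkeeping mentioned above.
#include <math.h>
#include <stdlib.h>
#include <string.h>

#define TOLERANCE 0.001 /* one microsecond, for values in milliseconds */

// qsort comparator treating values within TOLERANCE as equal.
static int compar(const void *pa, const void *pb) {
    double a = *(const double *)pa, b = *(const double *)pb;
    if (fabs(a - b) < TOLERANCE) return 0;
    return (a < b) ? -1 : 1;
}

// Concatenate ref and dut, sort, and keep one exemplar per fuzzy-equal run.
// The caller must provide out with room for nref + ndut elements.
size_t fuzzy_union(const double *ref, size_t nref,
                   const double *dut, size_t ndut, double *out) {
    size_t n = nref + ndut, len = 0;
    double *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return 0;
    memcpy(tmp, ref, nref * sizeof *tmp);
    memcpy(tmp + nref, dut, ndut * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, compar);
    for (size_t i = 0; i < n; ++i)
        if (len == 0 || compar(&out[len - 1], &tmp[i]) != 0)
            out[len++] = tmp[i];
    free(tmp);
    return len;
}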

Sort an increasing array

The pseudocode:
S = {};
Loop 10000 times:
    u = unsorted_fixed_size_array_producer();
    S = sort(S + u);
I need an efficient implementation of sort, which takes a sorted array and an unsorted one and sorts them together. We know that after a few iterations size(S) will be much bigger than size(u); that is known a priori.
Update: there's another prior: the size of u is known, say 10 or 20, and the number of loop iterations is also known.
Update: I implemented the algorithm that @Dukelnig advised in C https://gist.github.com/blackball/bd7e5619a1e83bd985a3 which fits my needs. Thanks!
Sort u, then merge S and u.
Merging simply involves iterating through two sorted arrays at the same time, and picking the smaller element and incrementing that iterator at each step.
The running time is O(|u| log |u| + |S|).
This is very similar to what merge sort does, so the fact that it produces a sorted array can be derived from there.
Some Java code for merge, derived from Wikipedia: (the C code wouldn't look all that different)
static void merge(int S[], int u[], int newS[])
{
    int iS = 0, iu = 0;
    for (int j = 0; j < S.length + u.length; j++)
        if (iS < S.length && (iu >= u.length || S[iS] <= u[iu]))
            newS[j] = S[iS++]; // Increment iS after using it as an index
        else
            newS[j] = u[iu++]; // Increment iu after using it as an index
}
This can also be done in-place (in S, assuming it has enough additional space) by going from the back.
Here's some working Java code that does this:
static void mergeInPlace(int S[], int SLength, int u[])
{
    int iS = SLength - 1, iu = u.length - 1;
    for (int j = SLength + u.length - 1; j >= 0; j--)
        if (iS >= 0 && (iu < 0 || S[iS] >= u[iu]))
            S[j] = S[iS--];
        else
            S[j] = u[iu--];
}

public static void main(String[] args)
{
    int[] S = {1, 5, 9, 13, 22, 0, 0, 0, 0}; // 4 additional spots reserved here
    int[] u = {0, 10, 11, 15};
    mergeInPlace(S, 5, u);
    // prints [0, 1, 5, 9, 10, 11, 13, 15, 22]
    System.out.println(Arrays.toString(S));
}
To reduce the number of comparisons, we can also use binary search (although the time complexity would remain the same - this can be useful when comparisons are expensive).
// Returns the index of the first element in S[0..SLength) greater than
// value, or SLength if no such element exists.
static int binarySearch(int S[], int SLength, int value)
{
    // One possible implementation:
    int lo = 0, hi = SLength;
    while (lo < hi)
    {
        int mid = (lo + hi) >>> 1;
        if (S[mid] <= value)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
static void mergeInPlaceBinarySearch(int S[], int SLength, int u[])
{
    int iS = SLength - 1;
    int iNew = SLength + u.length - 1;
    for (int iu = u.length - 1; iu >= 0; iu--)
    {
        if (iS >= 0)
        {
            int index = binarySearch(S, iS + 1, u[iu]);
            for ( ; iS >= index; iS--)
                S[iNew--] = S[iS];
        }
        S[iNew--] = u[iu];
    }
    // At this point iS == iNew, so the remaining elements are already in place.
    for ( ; iS >= 0; iS--)
        S[iNew--] = S[iS];
}
If S doesn't have to be an array
The above assumes that S has to be an array. If it doesn't, something like a binary search tree might be better, depending on how large u and S are.
The running time would be O(|u| log |S|) - substitute some values to see which is better. For example, with |S| = 10^6 and |u| = 20, merging costs on the order of 10^6 operations, while tree insertion costs only about 20 * log2(10^6), roughly 400.
If you really really have to use a literal array for S at all times, then the best approach would be to individually insert the new elements into the already sorted S, i.e. basically use the classic insertion sort technique for each element in each new batch. This will be expensive in the sense that insertion into an array is expensive (you have to move the elements), but that's the price of having to use an array for S.
So if the size of S is much more than the size of u, isn't what you want simply an efficient sort for a mostly sorted array? Traditionally this would be insertion sort. But you will only know the real answer by experimentation and measurement - try different algorithms and pick the best one. Without actually running your code (and perhaps more importantly, with your data), you cannot reliably predict performance, even with something as well studied as sorting algorithms.
Say we have a big sorted list of size n and a little sorted list of size k.
Binary search, starting from the end (position n-1, n-2, n-4, etc.), for the insertion point of the largest element of the smaller list. Shift the tail end of the larger list k elements to the right, insert the largest element of the smaller list, then repeat. (A C sketch follows the trace below.)
So if we have the lists [1,2,4,5,6,8,9] and [3,7], we will do:
[1,2,4,5,6, , ,8,9]
[1,2,4,5,6, ,7,8,9]
[1,2, ,4,5,6,7,8,9]
[1,2,3,4,5,6,7,8,9]
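A compact, untested C sketch of this idea (mine; it uses a plain binary search rather than the galloping search from the end described above):
#include <stddef.h>

// Index of the first element in big[0..len) greater than value.
static size_t insertion_point(const int *big, size_t len, int value) {
    size_t lo = 0, hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (big[mid] <= value) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Merge small[0..k) into big, which holds n sorted values and has room
// for n + k. Works from the back, as in the trace above.
void merge_into(int *big, size_t n, const int *small, size_t k) {
    while (k > 0) {
        size_t pos = insertion_point(big, n, small[k - 1]);
        // Shift the tail big[pos..n) right by k slots to make room.
        for (size_t i = n; i > pos; --i)
            big[i + k - 1] = big[i - 1];
        --k;
        big[pos + k] = small[k]; // drop in the largest remaining small element
        n = pos;                 // everything from pos onward is now settled
    }
}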
But I would advise you to benchmark just concatenating the lists and sorting the whole thing before resorting to interesting merge procedures.

Any ideas on how to solve this matrix / 2x2D array computation?

I have two 3x3 matrices, each represented as a 2D array.
The first matrix holds elements. (I actually store PIDs, so the range of values can be in the millions; I am simplifying them to letters here. In my application they are integers, e.g. A could be 200 and B could be 200000.)
E.g., the element matrix:
{ A B C
  B D C
  C F B }
The second holds the weight of each location.
E.g., the weight matrix:
{ 9 7 5
  8 6 1
  7 5 4 }
So in the above example B is the heaviest element, because its weight is 7 + 8 + 4 = 19, followed by C, etc.
How do I find the top 3 heaviest elements?
One solution is to store the elements in a separate 9x2 array of (element, weight) pairs: loop over the element matrix, then loop over the weight matrix, filling in the weight corresponding to each element.
[Iterate to create a 9x2 key-value matrix, iterate to sort it, iterate to remove duplicates (since weights need to be consolidated). Is there a better way?]
Any other efficient way? [Hint: I need only the top 3, so I shouldn't need the full 9x2 array.]
Let's assume the elements can only be the capital letters A-Z.
char elems[3][3] = {
{ 'A', 'B', 'C' },
{ 'B', 'D', 'C' },
{ 'C', 'F', 'B' }
};
And you have similarly set up your weights...
You can keep track of counts like this:
int counts[26] = {0};
for( int i = 0; i < 3; i++ ) {
for( int j = 0; j < 3; j++ ) {
counts[elems[i][j] - 'A'] += weights[i][j];
}
}
Then it's just a case of finding the indices of the three largest counts, which I'm sure you can do easily; one possible sketch follows.
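For completeness, here is one hedged way that last step might look (my code, not the answerer's): a single pass that keeps the three best indices seen so far.
// Find the indices of the three largest counts in one pass.
int top[3] = { -1, -1, -1 };
for (int c = 0; c < 26; c++) {
    if (top[0] == -1 || counts[c] > counts[top[0]]) {
        top[2] = top[1]; top[1] = top[0]; top[0] = c;
    } else if (top[1] == -1 || counts[c] > counts[top[1]]) {
        top[2] = top[1]; top[1] = c;
    } else if (top[2] == -1 || counts[c] > counts[top[2]]) {
        top[2] = c;
    }
}
// The heaviest elements are 'A' + top[0], 'A' + top[1] and 'A' + top[2].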
Forget that they're 2D arrays, and merge the two data sources into a single array (of pairs). For your example, you would get {{'A', 9}, {'B', 7}, {'C', 5}, {'B', 8}, ...}. Sort these (for example, with qsort), and then scan through the list, summing as you go -- and maintaining the top 3 scored keys you find.
[This solution always works, but only makes sense if the arrays are large, which on re-reading the question they're not].

Interleave array in constant space

Suppose we have an array
a1, a2,... , an, b1, b2, ..., bn.
The goal is to change this array to
a1, b1, a2, b2, ..., an, bn in O(n) time and in O(1) space.
In other words, we need a linear-time algorithm to modify the array in place, with no more than a constant amount of extra storage.
How can this be done?
This is the sequence and notes I worked out with pen and paper. I think it, or a variation, will hold for any larger n.
Each line below represents a step: () marks what is being moved in this step, and [] marks what was moved in the last step. The array itself is used as storage, and two pointers (one for L and one for N) are required to determine what to move next. L means "letter line" and N means "number line" (what is moved).
A B C D 1 2 3 4
L A B C (D) 1 2 3 4 First is L, no need to move last N
N A B C (3) 1 2 [D] 4
L A B (C) 2 1 [3] D 4
N A B 1 (2) [C] 3 D 4
L A (B) 1 [2] C 3 D 4
N A (1) [B] 2 C 3 D 4
A [1] B 2 C 3 D 4 Done, no need to move A
Note the varying "pointer jumps": the L pointer always decrements by 1 (as it cannot be eaten into faster than that), but the N pointer jumps according to whether it "replaced itself" (in spot: jump down two) or swapped something in (no jump, so the next something gets its turn).
This problem isn't as easy as it seems, but after some thought the algorithm to accomplish it isn't too bad. You'll notice the first and last elements are already in place, so we don't need to worry about them. We keep a left index variable for the first item in the first half of the array that still needs changing, and a right index variable for the first item in the second half that needs changing. Now we swap the item at the right index down, one position at a time, until it reaches the left index item. Increment the left index by 2 and the right index by 1, and repeat until the indexes overlap or the left goes past the right (the right index will always end on the last index of the array). We increment the left index by two every time because the item at left + 1 has already naturally fallen into place.
Pseudocode
1. Set the left index to 1.
2. Set the right index to the middle (array length / 2).
3. Swap the item at the right index with the item directly preceding it until it replaces the item at the left index.
4. Increment the left index by 2.
5. Increment the right index by 1.
6. Repeat steps 3 through 5 until the left index becomes greater than or equal to the right index.
Interleaving algorithm in C(#)
protected void Interleave(int[] arr)
{
int left = 1;
int right = arr.Length / 2;
int temp;
while (left < right)
{
for (int i = right; i > left; i--)
{
temp = arr[i];
arr[i] = arr[i - 1];
arr[i - 1] = temp;
}
left += 2;
right += 1;
}
}
This algorithm uses O(1) storage (the temp variable, which could itself be eliminated using the addition/subtraction swap technique). It is not linear time, though: the gap between the right and left indexes shrinks by one per outer iteration, so the total number of swaps is about n^2/8, i.e. O(n^2) in the worst case. It does meet the O(1) space requirement.
First, the theory: Rearrange the elements in 'permutation cycles'. Take an element and place it at its new position, displacing the element that is currently there. Then you take that displaced element and put it in its new position. This displaces yet another element, so rinse and repeat. If the element displaced belongs to the position of the element you first started with, you have completed one cycle.
Actually, yours is a special case of the question I asked here, which was: How do you rearrange an array to any given order in O(N) time and O(1) space? In my question, the rearranged positions are described by an array of numbers, where the number at the nth position specifies the index of the element in the original array.
However, you don't have this additional array in your problem, and allocating it would take O(N) space. Fortunately, we can calculate the value of any element in this array on the fly, like this:
int rearrange_pos(int x) {
    if (x % 2 == 0) return x / 2;
    else return (x - 1) / 2 + n; // where n is half the size of the total array
}
I won't duplicate the rearranging algorithm itself here; it can be found in the accepted answer for my question.
Edit: As Jason has pointed out, the answer I linked to still needs to allocate an array of bools, making it O(N) space. This is because a permutation can be made up of multiple cycles. I've been trying to eliminate the need for this array for your special case, but without success; there doesn't seem to be any usable pattern. Maybe someone else can help you here.
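For reference, a sketch of the cycle-following idea (mine, and it keeps the O(N) visited array that the edit above concedes is needed for the general case):
#include <stdbool.h>
#include <stdlib.h>

// Rearrange arr so that arr_new[x] = arr_old[rearrange_pos(x)], following
// permutation cycles; total is the full length, half is total / 2.
// The visited array is the O(N) extra space discussed above.
void rearrange_cycles(int arr[], int total, int half) {
    bool *visited = calloc(total, sizeof *visited);
    if (!visited) return;
    for (int start = 0; start < total; start++) {
        if (visited[start]) continue;
        int pos = start;
        int saved = arr[start]; // the value the cycle wraps back around to
        do {
            visited[pos] = true;
            int src = (pos % 2 == 0) ? pos / 2 : (pos - 1) / 2 + half;
            arr[pos] = (src == start) ? saved : arr[src];
            pos = src;
        } while (pos != start);
    }
    free(visited);
}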
This is known as the in-place in-shuffle problem. Here is an implementation in C++, based on the approach described here.
void in_place_in_shuffle(int arr[], int length)
{
assert(arr && length>0 && !(length&1));
// shuffle to {5, 0, 6, 1, 7, 2, 8, 3, 9, 4}
int i,startPos=0;
while(startPos<length)
{
i=_LookUp(length-startPos);
_ShiftN(&arr[startPos+(i-1)/2],(length-startPos)/2,(i-1)/2);
_PerfectShuffle(&arr[startPos],i-1);
startPos+=(i-1);
}
// local swap to {0, 5, 1, 6, 2, 7, 3, 8, 4, 9}
for (int i=0; i<length; i+=2)
swap(arr[i], arr[i+1]);
}
// follow one permutation cycle of the perfect shuffle, starting at index Start
void _Cycle(int Data[],int Lenth,int Start)
{
int Cur_index,Temp1,Temp2;
Cur_index=(Start*2)%(Lenth+1);
Temp1=Data[Cur_index-1];
Data[Cur_index-1]=Data[Start-1];
while(Cur_index!=Start)
{
Temp2=Data[(Cur_index*2)%(Lenth+1)-1];
Data[(Cur_index*2)%(Lenth+1)-1]=Temp1;
Temp1=Temp2;
Cur_index=(Cur_index*2)%(Lenth+1);
}
}
// reverse Data[0..Len)
void _Reverse(int Data[],int Len)
{
int i,Temp;
for(i=0;i<Len/2;i++)
{
Temp=Data[i];
Data[i]=Data[Len-i-1];
Data[Len-i-1]=Temp;
}
}
// rotate: move the last N elements of Data[0..Len) to the front (three reversals)
void _ShiftN(int Data[],int Len,int N)
{
_Reverse(Data,Len-N);
_Reverse(&Data[Len-N],N);
_Reverse(Data,Len);
}
// perfect shuffle of a block whose length satisfies Lenth == 3^k - 1
void _PerfectShuffle(int Data[],int Lenth)
{
int i=1;
if(Lenth==2)
{
i=Data[Lenth-1];
Data[Lenth-1]=Data[Lenth-2];
Data[Lenth-2]=i;
return;
}
while(i<Lenth)
{
_Cycle(Data,Lenth,i);
i=i*3;
}
}
// find the largest power of 3 that does not exceed N+1
int _LookUp(int N)
{
int i=3;
while(i<=N+1) i*=3;
if(i>3) i=i/3;
return i;
}
Test:
int arr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int length = sizeof(arr)/sizeof(int);
in_place_in_shuffle(arr, length);
After this, arr[] will be {0, 5, 1, 6, 2, 7, 3, 8, 4, 9}.
If you can transform the array into a linked-list first, the problem becomes trivial.
