I wrote this function that performs a slightly modified variation of run-length encoding on text files in C.
I'm trying to generalize it to binary files but I have no experience working with them. I understand that, while I can compare bytes of binary data much the same way I can compare chars from a text file, I am not sure how to go about printing the number of occurrences of a byte to the compressed version like I do in the code below.
A note on the type of RLE I'm using: bytes that occur more than once in a row are duplicated to signal the next-to-come number is in fact the number of occurrences vs just a number following the character in the file. For occurrences longer than one digit, they are broken down into runs that are 9 occurrences long.
For example, aaaaaaaaaaabccccc becomes aa9aa2bcc5.
Here's my code:
char* encode(char* str)
{
    char* ret = calloc(2 * strlen(str) + 1, 1);
    size_t retIdx = 0, inIdx = 0;
    while (str[inIdx]) {
        size_t count = 1;
        size_t contIdx = inIdx;
        while (str[inIdx] == str[++contIdx]) {
            count++;
        }
        size_t tmpCount = count;
        // break down counts with 2 or more digits into counts ≤ 9
        while (tmpCount > 9) {
            tmpCount -= 9;
            ret[retIdx++] = str[inIdx];
            ret[retIdx++] = str[inIdx];
            ret[retIdx++] = '9';
        }
        char tmp[2];
        ret[retIdx++] = str[inIdx];
        if (tmpCount > 1) {
            // repeat character (this tells the decompressor that the next digit
            // is in fact the # of consecutive occurrences of this char)
            ret[retIdx++] = str[inIdx];
            // convert single-digit count to string
            snprintf(tmp, 2, "%ld", tmpCount);
            ret[retIdx++] = tmp[0];
        }
        inIdx += count;
    }
    return ret;
}
What changes are in order to adapt this to a binary stream? The first problem I see is with the snprintf call, since it operates on a text format. Something else that rings a bell is the way I'm handling the multiple-digit occurrence runs. We're not working in base 10 anymore, so that has to change; I'm just unsure how, having almost never worked with binary data.
A few ideas that can be useful to you:
One simple method to generalize RLE to binary data is to use a bit-based compression. For example, the bit sequence 00000000011111100111 can be translated to the sequence 0 9623. Since the binary alphabet is composed of only two symbols, you only need to store the first bit value (this can be as simple as storing it in the very first bit) and then the lengths of the contiguous runs of equal values. Arbitrarily large integers can be stored in a binary format using Elias gamma coding. Extra padding can be added to fit the entire sequence nicely into a whole number of bytes. So using this method, the above sequence can be encoded like this:
00000000011111100111 -> 0 0001001 00110 010 011
                        ^ ^       ^     ^   ^
                first bit 9       6     2   3
If you want to keep it byte based, one idea is to treat all the even-indexed bytes as frequencies (interpreted as unsigned char) and all the odd-indexed bytes as the values. If a byte occurs more than 255 times in a row, then you can just repeat the pair. This can be very inefficient, but it is definitely simple to implement, and it might be good enough if you can make some assumptions about the input.
Also, you can consider moving out from RLE and implement Huffman's coding or other sophisticated algorithms (e.g. LZW).
Implementation-wise, I think tucuxi already gave you some hints.
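As a rough illustration of the Elias gamma codes used in the bit-based example above, here is a minimal sketch. The function name elias_gamma and the choice to emit the code as a string of '0' and '1' characters (rather than packed bits) are assumptions made for readability, not part of the original answer.

```c
#include <stddef.h>

/* Sketch (hypothetical helper): writes the Elias gamma code of n into buf
   as '0'/'1' characters plus a terminating '\0'.
   Returns the code length, or 0 if n is 0 or the buffer is too small. */
size_t elias_gamma(unsigned long n, char *buf, size_t bufsz)
{
    if (n == 0)
        return 0;                       /* gamma codes start at 1 */

    size_t bits = 0;                    /* floor(log2(n)) */
    for (unsigned long t = n; t > 1; t >>= 1)
        bits++;

    size_t len = 2 * bits + 1;          /* 'bits' zeros, then bits+1 binary digits */
    if (len + 1 > bufsz)
        return 0;

    size_t k = 0;
    for (size_t i = 0; i < bits; i++)   /* unary prefix */
        buf[k++] = '0';
    for (size_t i = bits + 1; i-- > 0; )/* n itself, most significant bit first */
        buf[k++] = ((n >> i) & 1UL) ? '1' : '0';
    buf[k] = '\0';
    return len;
}
```

Calling it with 9, 6, 2 and 3 yields 0001001, 00110, 010 and 011, matching the run lengths in the diagram.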
You only have to address 2 problems:
you cannot use any str-related functions, because C strings do not deal well with '\0'. So for example, strlen will return the index of the 1st 0x0 byte in a string. The length of the input must be passed in as an additional parameter: char *encode(char *start, size_t length)
your output cannot have an implicit length of strlen(ret), because there may be extra 0-bytes sprinkled about in the output. You again need an extra parameter: size_t encode(char *start, size_t length, char *output) (this version would require the output buffer to be reserved externally, with a size of at least length*2, and return the length of the encoded string)
The rest of the code, assuming it was working before, should continue to work correctly now. If you want to go beyond base-10, and for instance use base-256 for greater compression, you would only need to change the constant in the break-things-up loop (from 9 to 255, including the '9' character that the loop writes, which becomes the byte value 255), and replace the snprintf as follows:
// before
snprintf(tmp, 2, "%ld", tmpCount);
ret[retIdx++] = tmp[0];
// after: much easier
ret[retIdx++] = tmpCount;
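Putting those changes together, a binary-safe version might look roughly like the sketch below. The name rle_encode and the use of unsigned char buffers are assumptions; the scheme is the same as in the question (a repeated byte is written twice, followed by a count), only with one raw count byte in the range 2..255 instead of a decimal digit.

```c
#include <stddef.h>

/* Sketch (hypothetical name): encodes len bytes from in into out and
   returns the number of bytes written. The caller must provide an output
   buffer of at least 2 * len bytes, as suggested above. */
size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0, i = 0;
    while (i < len) {
        /* measure the run starting at i, never reading past the input */
        size_t count = 1;
        while (i + count < len && in[i + count] == in[i])
            count++;

        /* break long runs into chunks of 255, the largest one-byte count */
        size_t left = count;
        while (left > 255) {
            out[o++] = in[i];
            out[o++] = in[i];
            out[o++] = 255;
            left -= 255;
        }

        out[o++] = in[i];
        if (left > 1) {
            out[o++] = in[i];               /* doubled byte: a count follows */
            out[o++] = (unsigned char)left; /* raw count, no text formatting */
        }
        i += count;
    }
    return o;
}
```

For example, the four input bytes 00 00 00 07 encode to 00 00 03 07. The returned length replaces strlen on the output, since the encoded data may itself contain zero bytes.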
Can someone explain to me how the calculation works?
What I don't understand is:
1. The getch() function: what does it do?
2. How does the int decimal_binary(int n) function operate mathematically?
#include<stdio.h>
int decimal_binary (int n);
void main()
{
    int n;
    printf("Enter decimal number: ");
    scanf("%d", &n);
    printf("\n%d", decimal_binary(n));
    getch();
}
int decimal_binary(int n)
{
    int rem, i = 1, binary = 0;
    while(n!=0)
    {
        rem = n % 2;
        n = n/2;
        binary = binary + rem*i;
        i = i*10;
    }
    return binary;
}
For example, suppose n = 10; this is how I calculated it.
I'm not going to explain the code in the question, because I fundamentally (and rather vehemently) disagree with its implementation.
When we say something like "convert a number to base 2", it's useful to understand that we are not really changing the number. All we're doing is changing the representation. An int variable in a computer program is just a number (although deep down inside it's already in binary). The base matters when we print the number out as a string of digit characters, and also when we read it in as a string of digit characters. So any sensible "convert to base 2" function should have as its output a string, not an int.
Now, when you want to convert a number to base 2, and in fact when you want to convert to base b, for any base "b", the basic idea is to repeatedly divide by b.
For example, if we wanted to determine the base-10 digits of a number, it's easy. Consider the number 12345. If we divide it by 10, we get 1234, with a remainder of 5. That remainder 5 is precisely the last digit of the number 12345. And the remaining digits are 1234. And then we can repeat the procedure, dividing 1234 by 10 to get 123 remainder 4, etc.
Before we go any further, I want you to study this base-10 example carefully. Make sure you understand that when we split 12345 up into 1234 and 5 by dividing it by 10, we did not just look at it with our eyes and pick off the last digit. The mathematical operation of "divide by 10, with remainder" really did do the splitting up for us, perfectly.
So if we want to determine the digits of a number using a base other than 10, all we have to do is repeatedly divide by that other base. Suppose we're trying to come up with the binary representation of eleven. If we divide eleven by 2, we get five, with a remainder of 1. So the last bit is going to be 1.
Next we have to work on five. If we divide five by 2, we get two, with a remainder of 1. So the next-to-last bit is going to be 1.
Next we have to work on two. If we divide two by 2, we get one, with a remainder of 0. So the next bit is going to be 0.
Next we have to work on one. If we divide one by 2, we get zero, with a remainder of 1. So the next bit is going to be 1.
And now we have nothing left to work with -- the last division has resulted in 0. The binary bits we've picked off were, in order, 1, 1, 0, and 1. But we picked off the last bit first. So rearranging into conventional left-to-right order, we have 1011, which is the correct binary representation of the number eleven.
So with the theory under our belt, let's look at some actual C code to do this. It's perfectly straightforward, except for one complication. Since the algorithm we're using always gives us the rightmost bit of the result first, we're going to have to do something special in order to end up with the bits in conventional left-to-right order in the final result.
I'm going to write the new code as a function, sort of like your decimal_binary. This function will accept an integer, and return the binary representation of that integer as a string. Because strings are represented as arrays of characters in C, and because memory allocation for arrays can be an issue, I'm going to also have the function accept an empty array (passed by the caller) to build the return string in. And I'm also going to have the function accept a second integer giving the size of the array. That's important so that the function can make sure not to overflow the array.
If it's not clear from the explanation so far, here's what a call to the new function is going to look like:
#include <stdio.h>
char *integer_binary(int n, char *str, int sz);
int main()
{
    int n;
    char result[40];
    printf("Enter decimal number: ");
    scanf("%d", &n);
    char *str = integer_binary(n, result, 40);
    printf("%s\n", str);
}
As I said, the new function, integer_binary, is going to create its result as a string, so we have to declare an array, result, to hold that string. We're declaring it as size 40, which should be plenty to hold any 32-bit integer, with some left over.
The new function returns a string, so we're printing its return value using %s.
And here's the implementation of the integer_binary function. It's going to look a little scary at first, but bear with me. At its core, it's using the same algorithm as the original decimal_binary function in the question did, repeatedly dividing by 2 to pick off the bits of the binary number being generated. The differences have to do with constructing the result in a string instead of an int. (Also, it's not taking care of quite everything yet; we'll get to one or two more improvements later.)
char *integer_binary(int n, char *binary, int sz)
{
    int rem;
    int j = sz - 2;
    do {
        if(j < 0) return NULL;
        rem = n % 2;
        n = n / 2;
        binary[j] = '0' + rem;
        j--;
    } while(n != 0);
    binary[sz-1] = '\0';
    return &binary[j+1];
}
You can try that, and it will probably work for you right out of the box, but let's explain the possibly-confusing parts.
The new variable j keeps track of where in the array result we're going to place the next bit value we compute. And since the algorithm generates bits in right-to-left order, we're going to move j backwards through the array, so that we stuff new bits in starting at the end, and move to the left. That way, when we take the final string and print it out, we'll get the bits in the correct, left-to-right order.
But why does j start out as sz - 2? Partly because arrays in C are 0-based, partly to leave room for the null character '\0' that terminates strings in C. Here's a picture that should make things clearer. This will be the situation after we've completely converted the number eleven:
          0   1   2          31  32  33  34  35  36  37  38  39
        +---+---+---+-- ~ --+---+---+---+---+---+---+---+---+---+
result: |   |   |   |  ...  |   |   |   |   | 1 | 0 | 1 | 1 |\0 |
        +---+---+---+-- ~ --+---+---+---+---+---+---+---+---+---+
          ^                               ^   ^           ^
          |                               |   |           |
        binary                        final   return   initial
                                          j   value       j
The result array in the caller is declared as char result[40];, so it has 40 elements, from 0 to 39. And sz is passed in as 40. But if we want j to start out "at the right edge" of the array, we can't initialize j to sz, because the rightmost element is 39, not 40. And we can't initialize j as sz - 1, either, because we have to leave room for the terminating '\0'. That's why we initialize j to sz - 2, or 38.
The next possibly-confusing aspect of the integer_binary function is the line
binary[j] = '0' + rem;
Here, rem is either 0 or 1, the next bit of our binary conversion we've converted. But since we're creating a string representation of the binary number, we want to fill the binary result in with one of the characters '0' or '1'. But characters in C are represented by tiny integers, and you can do arithmetic on them. The constant '0' is the value of the character 0 in the machine's character set (typically 48 in ASCII). And the bottom line is that '0' + 1 turns into the character '1'. So '0' + rem turns into '0' if rem is 0, or '1' if rem is 1.
Next to talk about is the loop I used. The original decimal_binary function used while(n != 0) {...}, but I'm using do { ... } while(n != 0). What's the difference? It's precisely that the do/while loop always runs once, even if the controlling expression is false. And that's what we want here, so that the number 0 will be converted to the string "0", not the empty string "". (That wasn't an issue for decimal_binary, because it returned the integer 0 in that case, but that was a side effect of its otherwise-poor choice of int as its return value.)
Next we have the line
binary[sz-1] = '\0';
We've touched on this already: it simply fills in the necessary null character which terminates the string.
Finally, there's the last line,
return &binary[j+1];
What's going on there? The integer_binary function is supposed to return a string, or in this case, a pointer to the first character of a null-terminated array of characters. Here we're returning a pointer (generated by the & operator) to the element binary[j+1] in the result array. We have to add one to j because we always subtract 1 from it in the loop, so it always indicates the next cell in the array where we'd store the next character. But we exited the loop because there was no next character to generate, so the last character we did generate was at j's previous value, which is j+1.
(This integer_binary function is therefore mildly unusual in one respect. The caller passes in an empty array, and the function builds its result string in the empty array, but the pointer it returns, which points to the constructed string, does not usually point to the beginning of the passed-in array. It will work fine as long as the caller uses the returned pointer, as expected. But it's unusual, and the caller would get confused if accidentally using its own original result array as if it would contain the result.)
One more thing: that line if(j < 0) return NULL; at the top of the loop is a double check that the caller gave us a big enough array for the result we're generating. If we run out of room for the digits we're generating, we can't generate a correct result, so we return a null pointer instead. (That's likely to cause problems in the caller unless explicitly checked for, but that's a story for another day.)
So integer_binary as discussed so far will work, although I'd like to make three improvements to address some remaining deficiencies:
The integer_binary function as shown won't handle negative numbers correctly.
The way the integer_binary function uses the j variable is a bit clumsy. (Evidence of the clumsiness is the fact that I had to expend so many words explaining the j = sz-2 and return &binary[j+1] parts.)
The integer_binary function as shown only handles, obviously, binary, but what I really want (although you didn't ask for it) is a function that can convert to any base.
So here's an improved version. Based on the integer_binary function we've already seen, there are just a few small steps to achieve the desired improvements. I'm calling the new function integer_base, because it converts to any base (well, any base up to 10, anyway). Here it is:
char *integer_base(int n, int base, char *result, int sz)
{
    int rem;
    int j = sz - 1;
    int negflag = 0;
    if(n < 0) {
        n = -n;
        negflag = 1;
    }
    result[j] = '\0';
    do {
        j--;
        if(j < 0) return NULL;
        rem = n % base;
        n = n / base;
        result[j] = '0' + rem;
    } while(n != 0);
    if(negflag) {
        j--;
        result[j] = '-';
    }
    return &result[j];
}
As mentioned, this is just like integer_binary, except:
I've changed the way j is used. Before, it was always the index of the next element of the result array we were about to fill in. Now, it's always one to the right of the next element we're going to fill in. This is a less obvious choice, but it ends up being more convenient. Now, we initialize j to sz-1, not sz-2. Now, we do the decrement j-- before we fill in the next character of the result, not after. And now, we can return &result[j], without having to remember to add 1 at that spot.
I've moved the insertion of the terminating null character '\0' up to the top. Since we're building the whole string right-to-left, it makes sense to put the terminator in first.
I've handled negative numbers, in a kind of brute-force but expedient way. If we receive a negative number, we turn it into a positive number (n = -n) and use our regular algorithm on it, but we set a flag negflag to remind us that we've done so and, when we're all done, we tack a '-' character onto the beginning of the string.
Finally, and this is the biggie, the new function works in any base. It can create representations in base 2, or base 3, or base 5, or base 7, or any base up to 10. And what's really neat is how few modifications were required in order to achieve this. In fact, there were just two: In two places where I had been dividing by 2, now I'm dividing by base. That's it! This is the realization of something I said back at the very beginning of this too-long answer: "The basic idea is to repeatedly divide by b."
(Actually, I lied: There was a fifth change, in that I renamed the result parameter from "binary" to "result".)
Although you might be thinking that this integer_base function looks pretty good, I have to admit that it still has at least three problems:
It won't work for bases greater than 10.
It can occasionally overflow its result buffer.
It has an obscure problem when trying to convert the largest negative number.
The reason it only works for bases up to 10 is the line
result[j] = '0' + rem;
This line only knows how to create ordinary digits in the result. For (say) base 16, it would also have to be able to create hexadecimal digits A - F. One quick but obfuscated way to achieve this is to replace that line with
result[j] = "0123456789ABCDEF"[rem];
This answer is too long already, so I'm not going to get into a side discussion on how this trick works.
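For readers who do want a hint about that trick: a string literal is an array of char, so it can be subscripted like any other array. A minimal sketch that makes this explicit follows; the helper name hex_digit is made up for illustration.

```c
/* Sketch (hypothetical helper): maps a remainder in 0..15 to its digit
   character. Naming the lookup table shows that the one-liner in the
   answer is ordinary array indexing applied directly to a string literal. */
char hex_digit(int rem)
{
    static const char digits[] = "0123456789ABCDEF";
    return digits[rem];   /* same as "0123456789ABCDEF"[rem] */
}
```

Here hex_digit(10) gives 'A' and hex_digit(15) gives 'F'; passing a value outside 0..15 would read outside the table, so the caller has to keep base at 16 or below.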
The second problem is hiding in the lines I added to handle negative numbers:
if(negflag) {
    j--;
    result[j] = '-';
}
There's no check here that there's enough room in the result array for the minus sign. If the array was just barely big enough for the converted number without the minus sign, we'll hit this part of the code with j being 0, and we'll subtract 1 from it, and fill the minus sign in to result[-1], which of course doesn't exist.
Finally, on a two's complement machine, if you pass the most negative integer, INT_MIN, in to this function, it won't work. On a 16-bit 2's complement machine, the problem number is -32768. On a 32-bit machine, it's -2147483648. The problem is that +32768 can't be represented as a signed integer on a 16-bit machine, nor will +2147483648 fit in 32 signed bits. So a rewrite of some kind will be necessary in order to achieve a perfectly general function that can also handle INT_MIN.
To convert a decimal number to a binary number, there is a simple iterative algorithm to apply to that number (iterative = a step that is repeated until some condition is met):
take that number and divide it by 2
take the remainder
then repeat, using as the current number the original number divided by 2 (keep in mind that this is an integer division, so 2.5 becomes 2), until that number becomes 0
take all the remainders and read them from the last to the first, and that's the binary form of that number
What that function does is exactly this:
take the number and divide it by 2
take the remainder and add it to the variable binary, multiplied by i, which is itself multiplied by 10 on each pass, so that the first remainder ends up as the least significant digit and the last one as the most significant digit; this is the same as taking all the remainders and reading them from the last to the first
save n/2 as the new n
and then repeat while the current number n is different from 0
Also, getch() is sometimes used on Windows to hold the command prompt open, but it is not really recommended.
getch() pauses your program in the console until a key is pressed. The maths behind the function looks like this:
n=7:
7%2=1; //rem=1
7/2=3; //n=3
binary=1;
next loop
n=3:
3%2=1;
3/2=1; //n=1;
binary=11 //1 + 1* 10
final loop
n=1:
1%2=1;
1/2=0; //n=0;
binary=111 //11+1*100
So, I wrote a function that converts a decimal number into a hexadecimal number using recursion, but I can't seem to figure out how to add the prefix "0x" and leading zeros to my converted hexadecimal number. Let's say I pass the number 18 to my function. The equivalent hexadecimal number should be 0x00000012. However, I only end up getting 12 as my hexadecimal number. The same applies when I pass in the hexadecimal number 0xFEEDDAD. I end up getting only FEEDDAD, without the prefix, as my answer. Can someone please help me figure this out? I've listed my code below. Also, I'm only allowed to use fputc to display my output.
const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
void hexout (unsigned long number, FILE * stream)
{
    long quotient;
    long remainder;
    quotient = number / 16;
    remainder = number % 16;
    if(quotient != 0)
        hexout(quotient,stream);
    fputc(digits[remainder],stream);
}
void hexout (unsigned long number, FILE * stream)
{
    fprintf(stream, "0x%08lX", number);
}
If you cannot use fprintf (or sprintf), you can use this kind of code (no recursion, but an 8-char array on the stack):
const char digits[] = "0123456789ABCDEF";
void hexout(unsigned long number, FILE * stream)
{
    unsigned long int input = number;
    unsigned long int quotient;
    unsigned long int remainder;
    unsigned short ndigit = 0;
    char result[8] = {0};
    // Compute digits
    do
    {
        quotient = input / 16;
        remainder = input % 16;
        result[7-ndigit] = digits[remainder];
        input = quotient;
        ndigit++;
    }
    while (ndigit < 8);
    // Display result
    fputc('0', stream);
    fputc('x', stream);
    for (ndigit = 0; ndigit < 8; ndigit++)
    {
        fputc(result[ndigit], stream);
    }
}
Of course, this can be improved a lot...
Add digits to a string, and print out string with zero-padding using fprintf. Or just use fprintf to begin with.
Your own hexout fails for obvious reasons. You cannot 'continue' to output a number of zeroes when the value reaches 0, because you don't know how many digits you have already emitted. Also, you don't know when to prepend "0x" -- it should be before you start to emit hex digits, but how can you know you are at the start?
The logical way¹ to do this is to not use recursion, but a simple loop instead. Then again -- unsaid, but a fair bet this is a homework assignment, and in that case any number of silly constraints are possible ("write a C program without using the character '{'" comes to mind). In your case it's "you must use recursion".
You must add a counter to your recursive function; when it reaches 0, you know you have to output 0x, and if it's not 0 you need to output a hex digit, irrespective of whether your value is 0 or not. There are a couple of ways of adding a counter to a recursive function: a global variable (which would be the easiest and utterly ugliest way, so please don't stop reading here), a static variable -- only semantically better than a global --, or a pass-by-reference argument (which some say is a myth in C, but then again the end result is the same).
Which method is best for you depends on how well you can defend why you used that method.
¹ So is printf("0x%08X") an "illogical" solution? Yes. It solves the problem but without any further insights. The purpose of this assignment is not to find out the existence of printf and its parameters, it's to learn how (and why) to use recursion.
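To make the counter idea concrete, here is one possible sketch. It passes the counter as an extra by-value parameter to a helper, which is a fourth option beyond the three listed above; the helper name hexout_digits and the fixed width of 8 digits (taken from the 0x00000012 example in the question) are assumptions.

```c
#include <stdio.h>

/* Sketch (hypothetical helper): 'remaining' counts how many digits are
   still to be produced. The recursion always goes 8 levels deep, so the
   prefix is written at the deepest level, before any digit. */
static void hexout_digits(unsigned long number, int remaining, FILE *stream)
{
    if (remaining == 0) {
        fputc('0', stream);
        fputc('x', stream);
        return;
    }
    /* emit the more significant digits first, then this one */
    hexout_digits(number / 16, remaining - 1, stream);
    fputc("0123456789ABCDEF"[number % 16], stream);
}

void hexout(unsigned long number, FILE *stream)
{
    hexout_digits(number, 8, stream);
}
```

With this version, hexout(18, stdout) writes 0x00000012 and hexout(0xFEEDDAD, stdout) writes 0x0FEEDDAD, using only fputc. Values that need more than 8 hex digits are truncated to their low 8 digits.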
I really don't know how to implement this function:
The function should take a pointer to an integer, a pointer to an array of strings, and a string to process. It should write to the array every variation of the string obtained by replacing occurrences of 'ch' with the '#' symbol, and set the integer to the size of this array. Here is an example of processing:
choker => {"choker","#oker"}
chocho => {"chocho","#ocho","cho#o","#o#o"}
chachacha => {"chachacha","#achacha","cha#acha","chacha#a","#a#acha","cha#a#a","#acha#a","#a#a#a"}
I am writing this in C standard 99. So this is sketch:
int n;
char **arr;
char *string = "chacha";
func(&n,&arr,string);
And function sketch:
int func(int *n,char ***arr, char *string) {
}
So I think I need to create another function, which counts the number of 'ch' combinations and allocates memory for this one. I'll be glad to hear any ideas about this algorithm.
You can count the number of combinations pretty easily:
char * tmp = string;
int i;
for(i = 0; *tmp != '\0'; i++){
    if(!(tmp = strstr(tmp, "ch")))
        break;
    tmp += 2; // Skip past the 2 characters "ch"
}
// i contains the number of times ch appears in the string.
int num_combinations = 1 << i;
// num_combinations contains the number of combinations, which is 2 to the power of the number of occurrences of "ch"
First, I'd create a helper function, e.g. countChs that would just iterate over the string and return the number of 'ch'-s. That should be easy, as no string overlapping is involved.
When you have the number of occurrences, you need to allocate space for 2^count strings. Each string is at most strlen(original) characters long, so strlen(original) + 1 bytes per string (including the terminating '\0') is always enough. You also set your n variable to that 2^count.
After you have your space allocated, just iterate over all indices in your new table and fill them with copies of the original string (strcpy() or strncpy() to copy), then replace 'ch' with '#' in them (there are loads of ready snippets online, just look for "C string replace").
Finally make your arr pointer point to the new table. Be careful though - if it pointed to some other data before, you should think about freeing it or you'll end up having memory leaks.
If you would like to have all variations of the replaced string, the array will have 2^n elements, where n is the number of "ch" substrings. So, calculating n looks like this:
int i = 0;
int n = 0;
while(string[i] != '\0')
{
    if(string[i] == 'c' && string[i + 1] == 'h')
        n++;
    i++;
}
Then we can use the binary representation of a number. Note that when incrementing an integer i from 0 to 2^n - 1, the binary representation of the i-th number tells us which "ch" occurrences to change. So:
for(unsigned long long i = 0; i < (1ULL << n); i++)
{
    unsigned long long number = i;
    int k = 0;
    while(number > 0)
    {
        if(number % 2 == 1)
        {
            // Replace k-th occurrence of "ch"
        }
        // Braces above keep the next statement out of the if body
        number /= 2;
        k++;
    }
    // Add replaced string to array
}
This code checks every bit in the binary representation of number and changes the k-th occurrence if the k-th bit is 1. Changing the k-th "ch" is pretty easy, and I leave it to you.
This code is useful only for fewer than 64 occurrences, because unsigned long long int typically has 64 bits, so the 2^n combinations can only be counted for n below 64.
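For completeness, here is one way the bitmask idea could be filled in as a whole function. It follows the func sketch from the question, but several details are assumptions: the const qualifier on the input, the return convention (0 on success, -1 on failure), and an arbitrary cap of fewer than 20 occurrences so that 2^count stays small. Instead of copying and then replacing, it builds each variation directly while scanning the input.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: bit k of mask decides whether the k-th "ch" becomes '#'.
   On success the caller owns the array: free each (*arr)[i], then *arr. */
int func(int *n, char ***arr, const char *string)
{
    size_t len = strlen(string);
    int count = 0;
    for (const char *p = string; (p = strstr(p, "ch")) != NULL; p += 2)
        count++;
    if (count >= 20)                 /* arbitrary cap (assumption) */
        return -1;

    int total = 1 << count;
    char **out = malloc((size_t)total * sizeof *out);
    if (out == NULL)
        return -1;

    for (int mask = 0; mask < total; mask++) {
        char *s = malloc(len + 1);   /* a variation is never longer than the input */
        if (s == NULL) {
            while (mask-- > 0)
                free(out[mask]);
            free(out);
            return -1;
        }
        size_t w = 0;
        int k = 0;
        for (size_t r = 0; r < len; ) {
            if (string[r] == 'c' && string[r + 1] == 'h') {
                if (mask & (1 << k)) {
                    s[w++] = '#';
                } else {
                    s[w++] = 'c';
                    s[w++] = 'h';
                }
                k++;
                r += 2;
            } else {
                s[w++] = string[r++];
            }
        }
        s[w] = '\0';
        out[mask] = s;
    }
    *n = total;
    *arr = out;
    return 0;
}
```

For "chocho" this produces "chocho", "#ocho", "cho#o" and "#o#o", in that order. For inputs with three or more occurrences the order follows the binary counting and may differ from the order shown in the question.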
There are two sub-problems that you need to solve for your original problem:
allocating space for the array of variations
calculating the variations
For the first problem, you need to find the mathematical function f that takes the number of "ch" occurrences in the input string and returns the number of total variations.
Based on your examples: f(1) = 2, f(2) = 4 and f(3) = 8. This should give you a good idea of where to start, but it is important to prove that your function is correct. Induction is a good way to make that proof.
Since your replace process ensures that each result has either the same or a shorter length than the original, you can allocate space for each individual result equal to the length of the original (plus one byte for the terminating '\0').
As for the second problem, the simplest way is to use recursion, like in the example provided by nightlytrails.
You'll need another function which takes the array you allocated for the results, a count of results, the current state of the string and an index in the current string.
When called, if there are no further occurrences of "ch" beyond the index then you save the result in the array at position count and increment count (so the next time you don't overwrite the previous result).
If there are any "ch" beyond index then call this function twice (the recurrence part). One of the calls uses a copy of the current string and only increments the index to just beyond the "ch". The other call uses a copy of the current string with the "ch" replaced by "#" and increments the index to beyond the "#".
Make sure there are no memory leaks. No malloc without a matching free.
After you make this solution work you might notice that it plays loose with memory. It is using more than it should. Improving the algorithm is an exercise for the reader.
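A rough sketch of the recursive approach follows. It departs from the description above in one respect, which is labeled here as a variation: rather than copying the current string for each call, both branches share a single scratch buffer and only the finished results are allocated. The names expand and func_recursive, the return convention and the cap on occurrences are all assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch (hypothetical helper): buf[0..w) holds the part built so far,
   rest points at the unprocessed tail of the input. */
static int expand(const char *rest, char *buf, size_t w, char **out, int *count)
{
    const char *hit = strstr(rest, "ch");
    if (hit == NULL) {                       /* no more "ch": store a result */
        size_t tail = strlen(rest);
        char *s = malloc(w + tail + 1);
        if (s == NULL)
            return -1;
        memcpy(s, buf, w);
        memcpy(s + w, rest, tail + 1);       /* includes the '\0' */
        out[(*count)++] = s;
        return 0;
    }
    size_t skip = (size_t)(hit - rest);
    memcpy(buf + w, rest, skip);             /* text before the "ch" */
    w += skip;

    buf[w] = 'c';                            /* branch 1: keep "ch" */
    buf[w + 1] = 'h';
    if (expand(hit + 2, buf, w + 2, out, count) != 0)
        return -1;

    buf[w] = '#';                            /* branch 2: replace with '#' */
    return expand(hit + 2, buf, w + 1, out, count);
}

/* Returns 0 on success, -1 on failure; caller frees each string and the array. */
int func_recursive(int *n, char ***arr, const char *string)
{
    int occurrences = 0;
    for (const char *p = string; (p = strstr(p, "ch")) != NULL; p += 2)
        occurrences++;
    if (occurrences >= 20)                   /* arbitrary cap (assumption) */
        return -1;

    char **out = malloc(((size_t)1 << occurrences) * sizeof *out);
    char *buf = malloc(strlen(string) + 1);
    int count = 0;
    if (out == NULL || buf == NULL || expand(string, buf, 0, out, &count) != 0) {
        while (count-- > 0)
            free(out[count]);
        free(out);
        free(buf);
        return -1;
    }
    free(buf);
    *n = count;
    *arr = out;
    return 0;
}
```

For "choker" this yields "choker" followed by "#oker". Because the keep branch is explored first, longer inputs come out in a different order than in the question's examples.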