Why is fwrite writing each character so many times? - c

I'm trying to write a program that uncompresses a file which was already compressed using run length encoding. For some reason each character is getting printed to the file many times. For example if the input file contains 1l1i1n... my output file is showing llllllllll...
I've tried printing the reps variable from the for loop to the terminal to make sure it is set to the correct number of repetitions, and even tried using fprintf, but I get the same results. I'm not sure what it is, but there must be something here I'm not understanding?
By the way, the compressed file is opened in binary mode as well from the main function.
int uncompress_file(FILE *fd_compressed, const char *fname_out)
{
FILE *fd_out;
if (fd_compressed == NULL) {
fprintf(stderr, ...);
return -1;
}
if ((fd_out = fopen(fname_out, "wb")) == NULL) {
fprintf(stderr, ...);
return -1;
}
unsigned char cur, reps;
int i = 0;
while (fread(&cur, sizeof(unsigned char), 1, fd_compressed) > 0) {
if (i % 2 == 0) {
reps = cur;
}
else {
for (int j = 0; j < reps; j++)
fwrite(&cur, sizeof(unsigned char), 1, fd_out);
}
i++;
}
fclose(fd_out);
return 0;
}

The problem is your line reps = cur. If your file is 1l1i1n, then on the first pass through the loop reps is assigned the character '1', not the number 1. In ASCII the character '1' has the value 49, so you get 49 l's. To convert an ASCII digit to its numeric value, subtract '0' (which is 48) from it, i.e. reps = cur - '0'.
Note that this, like your code, only works if the largest run length is 9 (no multi-digit counts).

Related

Access Binary form of Text saved in memory using an Array

The storage representation of the string or equivalently text from a file, is the ASCII code for each character of the string or text from a file, I have been told that I/O functions like fread and fgets will read a string from disk into memory without conversion. The C compiler always works with the storage representation, so when we "retrieve" a string in C, it's always in binary form.
I need to access this binary form to use in my code (without saving this as a binary file, also not asking to print in binary format).
For example, the text string "AA" is saved in memory as "0100000101000001", I need to access directly, without any conversion (like we do when we print, integer using %s, %d) this binary form "0100000101000001" of "AA" using an integer array, say, D[16] which has elements 0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1. So if I use an index int i, I will get 0 from D[4] for i=0.
Array-index operations like buffer[i] (for example, in the sample code in the below) will extract one character from a string:
FILE *fp = fopen("a.txt", "r");
if (fp == NULL)
return 1;
char buffer[100];
int r = fread(buffer, 1, sizeof(buffer), fp);
if (r <= 0)
return 1;
printf("As string: %.*s", r, buffer);
printf("As integers:");
for (int i = 0; i < r; i++)
printf(" %d", buffer[i]);
But I would like to have the complete text as an array of 0s and 1s. Here, buffer[i] contains 8 bits, and I cannot access each bit individually. How can I do that?
I have been told that I/O functions like fread and fgets will read a string from disk into memory without conversion.
This is true if the file has been opened as binary, i.e. with "rb". Such streams do not undergo any translation when read into memory, and all stream functions will read the contents as they are stored on disk, getc() included. If your system is Unix based, there is no difference with "r", but on legacy systems there can be substantial differences: text mode, which is the default, may imply end-of-line conversion, code page translation, end-of-file mitigation... If you want the actual file contents, always use binary mode ("rb").
You should also avoid the char type when dealing with binary representation, because char is signed by default on many architectures, hence inappropriate for byte values which are usually considered positive. Use unsigned char to prevent this issue.(*)
The most common way to display binary contents is using hexadecimal representation, where each byte is output as exactly 2 hex digits.
If you want to output binary representation, there is no standard printf conversion to output base-2 numbers, but you can write a loop to convert the byte to its bit values.
(*) among other historical issues such as non two's complement signed value representations
Here is a modified version:
#include <stdio.h>
int main() {
FILE *fp = fopen("a.txt", "rb");
if (fp == NULL) {
perror("a.txt");
return 1;
}
unsigned char buffer[100];
unsigned char bits[100 * 8];
int r = fread(buffer, 1, sizeof(buffer), fp);
if (r <= 0) {
fprintf(stderr, "empty file\n");
fclose(fp);
return 1;
}
printf("As a string: %.*s\n\n", r, (char *)buffer);
int pos;
pos = printf("As 8-bit integers:");
for (int i = 0; i < r; i++) {
if (pos > 72) {
printf("\n");
pos = 0;
}
pos += printf(" %d", buffer[i]);
}
printf("\n\n");
pos = printf("As hex bytes:");
for (int i = 0; i < r; i++) {
if (pos > 72) {
printf("\n");
pos = 0;
}
pos += printf(" %02X", buffer[i]);
}
printf("\n\n");
pos = printf("Converting to a bit array:");
for (int i = 0; i < r; i++) {
for (int j = 8; j-- > 0;) {
bits[i * 8 + 7 - j] = (buffer[i] >> j) & 1;
}
}
/* output the bit array */
for (int i = 0; i < r * 8; i++) {
if (pos > 72) {
printf("\n ");
pos = 4;
}
pos += printf("%d", bits[i]);
}
printf("\n");
fclose(fp);
return 0;
}
Use bit masking to check the value of individual bits. Check out a brief description here: https://www.learn-c.org/en/Bitmasks
Then you can write the result to your array for the corresponding bit.

c program to read file and count words specified in array

I'm trying to read a file containing a paragraph, count the number of times specific words occur (words that I have specified and stored in an array) and then print that result to another file that would look something like,
systems, 2
computer, 3
programming, 6
and so on. Currently, all this code does is spit out every word in the paragraph and their respective counts. Any help would be much appreciated.
#include <stdio.h>
#include <string.h>
int main()
{
FILE* in;
FILE* out;
char arr1[13][100] = { "systems", "programming", "computer", "applications", "language", "machine"};
int arr2[180] = {0};
int count = 0;
char temp[150];
in = fopen("out2.dat", "r");
out = fopen("out3.dat", "w");
while (fscanf(in, "%s", temp) != EOF)
{
int i, check = 8;
for (i = 0;i < count;i++)
{
if (strcmp(temp, arr1[i]) == 0)
{
arr2[i]++;
check = 1;
break;
}
}
if (check == 1) continue;
strcpy(arr1[count], temp);
arr2[count++] = 1;
}
int i;
for (i = 0; i < count; i++)
fprintf(out, "%s, %d\n", arr1[i], arr2[i]);
return 0;
}
The use of count does not make much sense throughout this program.
It is declared as int count = 0;, and then used as the upper bound in this loop
for (i = 0; i < count; i++)
limiting which search words will be used. This also means that this loop will not be entered on the first iteration of the surrounding while loop.
As such, check != 1, so after this count is used as the index in arr1 at which the currently read "word" will be copied into
strcpy(arr1[count], temp);
which makes absolutely no sense. Why overwrite data you are searching for?
Then count is incremented to 1 after being used to set the first element of arr2 to 1.
On the second iteration of the while loop, the for loop will run for exactly one iteration, comparing the newly read "word" (temp) against the first element of arr1 (which is now the last "word" read).
If this matches: the first element in arr2 is incremented from 1 to 2, the string copy is skipped, and count is not incremented.
If this does not match, the new "word" is copied into the second element of arr1, the second element of arr2 is set to 1, and count is incremented to 2.
This spirals out of control from here.
Given the input shown above, this accesses arr1 out-of-bounds when count reaches 13.
With files that have a small selection of data (<= 13 unique "words", lengths < 100), this may accidentally "work" by populating arr1 with the words from the file. This will have the end effect of showing you the counts of each "word" in the input file.
Eventually, you will invoke Undefined Behavior when one of the following occurs:
fscanf(in, "%s", temp) reads a string that overflows the temp buffer.
count exceeds the bounds of arr1 or arr2.
strcpy(arr1[count], temp); copies a string that overflows a buffer in arr1.
Either fopen call fails.
In addition to being unsafe, fscanf(in, "%s", temp) will consider anything other than whitespace as being part of a valid string. This includes trailing punctuation, which may or may not be an issue depending on which tokens you want to match (systems. vs. systems). You may need more robust parsing.
In any case, either create an array of structures composed of search words and frequencies, or, create two arrays of the same length to represent this data:
const char *words[6] = { "systems", "programming", "computer", "applications", "language", "machine"};
unsigned freq[6] = { 0 };
There is no need to copy anything. Remember to check whether fopen fails, and to limit %s when reading so as not to overflow the input buffer.
The rest of the program looks similar: test each input "word" against all search words; increment the corresponding frequency if a match.
An example using an array of structures:
#include <stdio.h>
#include <string.h>
int main(void) {
struct {
const char *word;
unsigned freq;
} search_words[] = {
{ "systems", 0 },
{ "programming", 0 },
{ "computer", 0 },
{ "applications", 0 },
{ "language", 0 },
{ "machine", 0 }
};
size_t length = sizeof search_words / sizeof *search_words;
FILE *input_file = fopen("out2.dat", "r");
FILE *output_file = fopen("out3.dat", "w");
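if (!input_file || !output_file) {
/* fclose(NULL) is undefined behavior, so close only what was opened */
if (input_file)
fclose(input_file);
if (output_file)
fclose(output_file);
fprintf(stderr, "Could not access files.\n");
return 1;
}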
char word[256];
while (1 == fscanf(input_file, "%255s", word))
for (size_t i = 0; i < length; i++)
if (0 == strcmp(word, search_words[i].word))
search_words[i].freq++;
fclose(input_file);
for (size_t i = 0; i < length; i++)
fprintf(output_file, "%s, %u\n",
search_words[i].word,
search_words[i].freq);
fclose(output_file);
}
cat out3.dat:
systems, 1
programming, 1
computer, 2
applications, 2
language, 1
machine, 1

When using fprintf I get mirrored array of chars in the output

I'm converting numbers from the base-10 counting system (c.s.) to another one and then printing the result to a file.
void Fileoutkey(char *res1, char *res2, int sys1, int sys2) //KANON
{
FILE *fp;
if(fp = fopen("task_out.cpp", "w"))
{
fprintf(fp, "%d: %s\n", sys1, res1);
fprintf(fp, "%d: %s\n", sys2, res2);
fclose(fp);
}
else
{
printf("No such file in directory.\n");
exit(1);
}
}
The conversion code (which I believe is OK):
int numSystem1 = 12;
char digits1 [13] = "0123456789AB";
char result1 [18] = "";
int digCount1 = 0;
while (num)
{
int rem1 = num % numSystem1;
result1 [digCount1] = digits1[rem1];
num /= numSystem1;
digCount1++;
for (int i = digCount1; i >= 0; i--)
{
cout << result1[i]; //here i get 10
}
}
When converting from the 10 to the 12 c.s., for example the number 12, I get 01 instead of 10.
Output in the console is right.
Conversion code definitely doesn't work. Taking your example of converting 12 from base 10 to base 12, the loop does the following:
First time around num is 12. 12 % 12 is 0 - so that is what is stored as the first character of your string. num is then divided by 12 to become 1.
Second time around 1 % 12 is 1 and that is added as the second character. This means your string now contains "01". Which is what you're seeing in your output - your code is adding the digits in reverse order.
You could either work out how many digits your number has and then count down from that to add the characters in the other direction, or reverse the string after the loop.
And also after the loop you need to add the NUL terminator character like this:
result1 [digCount1] = '\0';

Transfer results to txt file C

So I'm completely new to programming (I've been learning for 3 days) and I find myself in front of a problem I simply don't know how to resolve.
I want this program to give me every single combination from 0 to a specific number in base 36. That is easy enough when the number is only about 50000 or so. But my goal from this is to extract actual words(with numbers too) and if i try to get words with 5 characters, the terminal will start overwriting the previous words(not helpful, i want ALL of them).
So i thought i should look for a way to transfer everything into a txt file and there resides my problem: I don't know how... Sorry for the long text but i wanted to explain precisely what i'm trying to get. Thanks for the help.
int main() {
int dec, j, i, q, r, k;
char val[80];
printf("Enter a decimal number: ");
scanf("%d", &dec);
for (k = 0; k <= dec; k++) { /*repeat for all possible combinations*/
q = k;
for (i = 1; q != 0; i++) { /*convert decimal number to value for base 36*/
r = q % 36;
if (r < 10)
r = r + 48;
else
r = r + 55;
val[i] = r;
q = q / 36;
}
for (j = i - 1; j > 0; j--) { /*print every single value*/
printf("%c", val[j]);
}
printf(" "); /*add spaces because why not*/
}
return (0);
}
A few observations that might help:
First is type related:
In your declarations you create the following:
int dec, j, i, q, r, k;
char val[80];
Then later you make the assignment:
val[i] = r;//assigning an int to a char, dangerous
While r is type int with a range (typically) of –2,147,483,648 to 2,147,483,647,
val[i] is of type char with a range (typically) of only –128 to 127.
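Because of this, a value outside the range of char would be silently converted, leading to unexpected results. (In this particular program r only ever holds character codes between 48 and 90, so the assignment happens to be safe.)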
The most immediate solution is use the same type for both variables. Pick either int or char, but not both.
The other has already been addressed correctly by @Nasim: use the file version of printf(), which is fprintf(). The prototype for fprintf() is:
int fprintf( FILE *stream, const char *format [, argument ]...);
Usage example:
FILE *fp = fopen(".\\somefile.txt", "w");//create a pointer to a FILE (note the escaped backslash)
if(fp)//if the FILE was successfully created, write to it...
{
// some of your previous code...
for (j = i - 1; j > 0; j--)
{ /*print every single value*/
fprintf(fp, "%c", val[j]);//if val is typed as char
//OR
fprintf(fp, "%d", val[j]);//if val is typed as int
}
fclose(fp);
}
Lastly, there are a wide range of methods to perform base conversion. Some more complicated than others.
Create a file and then use fprintf() instead of printf(). The only difference between the two is that fprintf() takes the file as its first argument.
FILE *myFile = fopen("file.txt", "w"); // "w" erases previous content, "a" appends
if (myFile == NULL) { printf("Error opening file\n"); exit(1); }
fprintf(myFile, "some integer : %d\n", myInteger); // same as printf, but with the file pointer as the first argument
fclose(myFile); // don't forget to close the file

How to read a 10 GB txt file consisting of tab-separated double data line by line in C

I have a txt file consisting of tab-separated data of type double. The data file is over 10 GB, so I just wish to read the data line by line and then do some processing. In particular, the data is laid out as a matrix with, say, 1001 columns and millions of rows. Below is just a fake sample to show the layout.
10.2 30.4 42.9 ... 3232.000 23232.45
...
...
7.234 824.23232 ... 4009.23 230.01
...
For each line I'd like to store the first 1000 values in an array, and the last value in a separate variable. I am new to C, so it would be nice if you could kindly point out major steps.
Update:
Thanks for all valuable suggestions and solutions. I just figured out one simple example where I just read a 3-by-4 matrix row by row from a txt file. For each row, the first 3 elements are stored in x, and the last element is stored in vector y. So x is a n-by-p matrix with n=p=3, y is a 1-by-3 vector.
Below is my data file and my code.
Data file:
1.112272 -0.345324 0.608056 0.641006
-0.358203 0.300349 -1.113812 -0.321359
0.155588 2.081781 0.038588 -0.562489
My code:
#include<math.h>
#include <stdlib.h>
#include<stdio.h>
#include <string.h>
#define n 3
#define p 3
void main() {
FILE *fpt;
fpt = fopen("./data_temp.txt", "r");
char line[n*(p+1)*sizeof(double)];
char *token;
double *x;
x = malloc(n*p*sizeof(double));
double y[n];
int index = 0;
int xind = 0;
int yind = 0;
while(fgets(line, sizeof(line), fpt)) {
//printf("%d\n", sizeof(line));
//printf("%s\n", line);
token = strtok(line, "\t");
while(token != NULL) {
printf("%s\n", token);
if((index+1) % (p+1) == 0) { // the last element in each line;
yind = (index + 1) / (p+1) - 1; // get index for y vector;
sscanf(token, "%lf", &(y[yind]));
} else {
sscanf(token, "%lf", &(x[xind]));
xind++;
}
//sscanf(token, "%lf", &(x[index]));
index++;
token = strtok(NULL, "\t");
}
}
int i = 0;
int j = 0;
puts("Print x matrix:");
for(i = 0; i < n*p; i++) {
printf("%f\n", x[i]);
}
printf("\n");
puts("Print y vector:");
for(j = 0; j < n; j++) {
printf("%f\t", y[j]);
}
printf("\n");
free(x);
fclose(fpt);
}
With above, hopefully things will work if I replace data_temp.txt with my raw 10 GB data file (of course change values of n,p, and some other code wherever necessary.)
I have some additional questions that I hope you can help me with.
I first declared line as char line[(p+1)*sizeof(double)] (note: not multiplied by n), but then a line cannot be read completely. How can I allocate memory for JUST one single line? What's the length? I assumed it's (p+1)*sizeof(double) since there are (p+1) doubles in each line. Should I also allocate memory for \t and \n? If so, how?
Does the code look reasonable to you? How could I make it more efficient since this code will be executed over millions of rows?
If I don't know the number of columns or rows in the raw 10 GB file, how could I quickly count rows and columns?
Again I am new to C, any comments are very appreciated. Thanks a lot!
1st way
Read file in chunks into preallocated buffer using fread.
2nd way
Map the file into your process memory space using mmap, move the pointer then over the file.
3rd way
Since your file is delimited by lines, open the file with fopen, use setvbuf or similar to set a buffer size greater than about 10 lines or so, then read the file line-by-line using fgets.
To potentially read the file even faster, use open with O_DIRECT (assuming Linux), then use fdopen to get a FILE * for the open file, then use setvbuf to set a page-aligned buffer. Doing that will allow you to bypass the kernel page cache - if your system's implementation works successfully using direct IO that way. (There can be many restrictions to direct IO)
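Something to get you started: reading one line
#include <stdio.h>
#define COLUMN (1000+1)
/* Reads one line of COLUMN tab-separated doubles from in_stream into data[].
   Returns COLUMN on success, 0 at end of file, -1 on malformed input. */
int read_line(FILE *in_stream, double data[COLUMN]) {
    for (int i = 0; i < COLUMN; i++) {
        char delim = '\n'; // default used when the last value has no trailing newline
        int cnt = fscanf(in_stream, "%lf%c", &data[i], &delim);
        if (cnt < 1) {
            if (cnt == EOF && i == 0) return 0; // None read, OK as end of file
            puts("Missing or bad data");
            return -1; // problem
        }
        if (delim != '\t') {
            // If tab not found, should be at end of line
            if (delim == '\n' && i == COLUMN-1) {
                return COLUMN; // Success
            }
            puts("Bad delimiter");
            return -1;
        }
    }
    // A tab followed the last expected value, so the line is too long
    puts("Extra data");
    return -1;
}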
