RLE compression algorithm in C

I have to write an RLE algorithm in C with an escape character (Q).
For example, if I have an input like: AAAAAAABBBCCCDDDDDDEFG
the output has to be: QA7BBBCCCQD6EFG
This is the code that I wrote:
#include <stdio.h>
#include <stdlib.h>
void main()
{
FILE *source = fopen("Test.txt", "r");
FILE *destination = fopen("Dest.txt", "w");
char carCorrente; //in english: currentChar
char carSucc; // in english: nextChar
int count = 1;
while(fread(&carCorrente, sizeof(char),1, source) != 0) {
if (fread(&carCorrente, sizeof(char),1, source) == 0){
if(count<=3){
for(int i=0;i<count;i++){
fprintf(destination,"%c",carCorrente);
}
}
else {
fwrite("Q",sizeof(char),1,destination);
fprintf(destination,"%c",carCorrente);
fprintf(destination,"%d",count);
}
break;
}
else fseek(source,-1*sizeof(char), SEEK_CUR);
while (fread(&carSucc, sizeof(char), 1, source) != 0) {
if (carCorrente == carSucc) {
count++;
}
else {
if(count<=3){
for(int i=0;i<count;i++){
fprintf(destination,"%c",carCorrente);
}
}
else {
fwrite("Q",sizeof(char),1,destination);
fprintf(destination,"%c",carCorrente);
fprintf(destination,"%d",count);
}
count = 1;
goto OUT;
}
}
OUT:fseek(source,-1*sizeof(char), SEEK_CUR); //exit 2° while
}
}
The problem appears when I have an input like this: ABBBCCCDDDDDEFGD
In this case the output is: QB4CCCQD5FFDD
and I don't know why :(

There is no need to use fseek to rewind as you have done. Here is code that I have written without it, using a simple counter and the character of the current sequence.
C implementation:
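#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    FILE *source = fopen("Test.txt", "r");
    FILE *destination = fopen("Dest.txt", "w");
    if (source == NULL || destination == NULL)
        return EXIT_FAILURE;
    char currentChar = 0;
    char seqChar = 0;  /* character of the run being counted */
    int count = 0;     /* length of that run; 0 until the first read */
    while (1) {
        /* flag is nonzero once the end of the input has been reached */
        int flag = (fread(&currentChar, sizeof(char), 1, source) == 0);
        if (flag || seqChar != currentChar) {
            if (count > 3) {
                /* long run: escape character, run character, decimal count */
                char ch = 'Q';
                char str[100];
                int digits = sprintf(str, "%d", count);
                fwrite(&ch, sizeof(ch), 1, destination);
                fwrite(&seqChar, sizeof(ch), 1, destination);
                fwrite(str, sizeof(char), digits, destination);
            }
            else {
                /* short run: copy the characters unchanged */
                for (int i = 0; i < count; i++)
                    fwrite(&seqChar, sizeof(char), 1, destination);
            }
            seqChar = currentChar;
            count = 1;
        }
        else count++;
        if (flag)
            break;
    }
    fclose(source);
    fclose(destination);
    return 0;
}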

Your code has various problems. First, I'm not sure whether you should read straight from the file. In your case, it might be better to read the source string into a text buffer first with fgets and then do the encoding. (I think in your assignment, you should only encode letters. If the source is a regular text file, it will have at least one newline.)
But let's assume that you need to read straight from the disk: You don't have to go backwards. You already have two variables for the current and the next char. Read the next char from disk once before the loop. Then, before reading each further "next char", assign the old next char to the current char:
int carSucc, carCorr; // should be ints for getc
carSucc = getc(source); // read next character once before loop
while (carSucc != EOF) { // test for end of input stream
carCorr = carSucc; // this turn's char is last turn's "next"
carSucc = getc(source);
// ... encode ...
}
Going forward and backward makes the loop complicated. Besides, what happens if the second read reads zero characters, i.e. has reached the end of the file? Then you step back once and go into the second loop. That doesn't look as if it was intended.
Try to go only forward, and use the loop above as base for your encoding.

I think the major problem in your approach is that it's way too complicated with multiple different places where you read input and seek around in the input. RLE can be done in one pass, there should not be a need to seek to the previous characters. One way to solve this is to change the logic into looking at the previous characters and how many times they have been repeated, instead of trying to look ahead at future characters. For instance:
int repeatCount = 0;
int previousChar = EOF;
int currentChar; // type changed to 'int' for fgetc input
while ((currentChar = fgetc(source)) != EOF) {
if (currentChar != previousChar) {
// print out the previous run of repeated characters
outputRLE(previousChar, repeatCount, destination);
// start a new run with the current character
previousChar = currentChar;
repeatCount = 1;
} else {
// same character repeated
++repeatCount;
}
}
// output the final run of characters at end of input
outputRLE(previousChar, repeatCount, destination);
Then you can just implement outputRLE to do the output to print out a run of the character c repeated count times (note that count can be 0); here's the function declaration:
void outputRLE(const int c, const int count, FILE * const destination)
You can do it pretty much the same way as in your current code, although it can be simplified greatly by combining the fwrite and two fprintfs to a single fprintf. Also, you might want to think what happens if the escape character 'Q' appears in the input, or if there is a run of 10 or more repeated characters. Deal with those cases in outputRLE.
An unrelated problem in your code is that the return type of main should be int, not void.
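As a rough sketch of what outputRLE could look like, here is one possible version. It keeps the question's format (Q, the character, the decimal count) and makes two assumptions that are not in the original posts: runs of 4 or more are encoded (mirroring the count > 3 test), and a literal 'Q' is always written in escaped form so it cannot be mistaken for the escape character. The names ESCAPE, MIN_RUN and encodeRLE are illustrative only. Digits in the input would still make the output ambiguous; this sketch does not address that.

```c
#include <stdio.h>

#define ESCAPE 'Q'   /* escape character from the question */
#define MIN_RUN 4    /* assumption: runs of 4 or more are encoded */

/* Writes one run of `count` copies of character `c` to `destination`. */
void outputRLE(const int c, const int count, FILE * const destination)
{
    if (c == EOF || count <= 0)
        return;                               /* no run yet (first call) */
    if (count >= MIN_RUN || c == ESCAPE)
        fprintf(destination, "%c%c%d", ESCAPE, c, count);  /* e.g. QA7 */
    else
        for (int i = 0; i < count; i++)
            fputc(c, destination);            /* short run copied as-is */
}

/* Hypothetical driver: the one-pass loop, wrapped in a function. */
void encodeRLE(FILE *source, FILE *destination)
{
    int repeatCount = 0;
    int previousChar = EOF;
    int currentChar;
    while ((currentChar = fgetc(source)) != EOF) {
        if (currentChar != previousChar) {
            /* flush the previous run, then start a new one */
            outputRLE(previousChar, repeatCount, destination);
            previousChar = currentChar;
            repeatCount = 1;
        } else {
            ++repeatCount;
        }
    }
    /* flush the final run at end of input */
    outputRLE(previousChar, repeatCount, destination);
}
```

Called with the two files from the question, encodeRLE(source, destination) replaces the nested loops and the fseek calls.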

Thank you so much, I fixed my algorithm.
The problem was a variable in the first if after the while.
Before:
if (fread(&carCorrente, sizeof(char),1, source) == 0)
Now:
if (fread(&carSucc, sizeof(char),1, source) == 0){
My whole algorithm is certainly clumsy, though. I mean, it is far too slow!
I ran a test with my version and with Vikram Bhat's version and saw how much time my algorithm loses.
With getc() I could certainly save more time.
Now I'm thinking about the decoding (decompression) and I can see a small problem.
Example:
If I have an input like: QA7QQBQ33TQQ10QQQ
how can I recognize which Q is the escape character?
Thanks
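One way to make the escape character recognizable, offered only as a sketch: have the encoder never write a bare Q. Under that assumed convention (not stated anywhere in the thread), every Q in the compressed data is an escape, it is always followed by the run character and then a decimal count, and a literal Q is stored as QQ1. The decoder below also assumes the original text contains no digits, since a digit right after a count could not be told apart from the count. The name decodeRLE is illustrative. Note that the example string above does not follow this convention, so this decoder would reject it.

```c
#include <stdio.h>
#include <ctype.h>

#define ESCAPE 'Q'

/* Decodes source into destination.
 * Returns 0 on success, -1 if an escape sequence is incomplete. */
int decodeRLE(FILE *source, FILE *destination)
{
    int c;
    while ((c = fgetc(source)) != EOF) {
        if (c != ESCAPE) {
            fputc(c, destination);      /* ordinary character: copy it */
            continue;
        }
        int runChar = fgetc(source);    /* character that is repeated */
        if (runChar == EOF)
            return -1;
        int count = 0, digits = 0, d;
        /* read the decimal count, which may have several digits */
        while ((d = fgetc(source)) != EOF && isdigit(d)) {
            count = count * 10 + (d - '0');
            digits++;
        }
        if (d != EOF)
            ungetc(d, source);          /* first non-digit belongs to the next item */
        if (digits == 0)
            return -1;                  /* escape without a count */
        for (int i = 0; i < count; i++)
            fputc(runChar, destination);
    }
    return 0;
}
```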

Related

Append data from an array to a char variable in C

I have a program that I'm writing, and I need to read in a configuration file. If you can't tell by the way it's written, it is a placeholder for another program: it opens the second program in its memory space. I have the readLine function all set up, but my "main" operation will only support a variable for arguments (unless I'm incorrect), like this: "arg1 arg2 arg3...". I have seen things on the net like strcat and others, but since I'm not so well versed in C, these seem to only add a single character. My needed solution would be:
char args[ 10 ];
FILE *fp=fopen("file.cfg","r");
void readLine(FILE* file, char* line, int limit)
{
int i;
int read;
read = fread(line, sizeof(char), limit, file);
line[read] = '\0';
for(i = 0; i <= read;i++)
{
if('\0' == line[i] || '\n' == line[i] || '\r' == line[i])
{
line[i] = '\0';
break;
}
}
if(i != read)
{
fseek(file, i - read + 1, SEEK_CUR);
}
}
int main(void)
{
_spawnl( P_OVERLAY, "prog1.exe", "prog1.exe", args, NULL );
return 0;
}
The 'args' variable in 'int main(void)' would need to hold the contents of line,
unless a complete rewrite would be necessary.
Also, before you flame me: I don't want a loop in main, because this program just calls another and then dies. A loop might make it call an infinite number of the same program, and... well, that would be bad. Thanks in advance!
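A minimal sketch of one way to get the first line of the configuration file into a buffer without a loop in main, assuming the arguments fit on a single line. The helper name read_first_line is hypothetical. It relies on fgets, which stops at the first newline, so no fseek is needed, and on strcspn to cut off a trailing "\r\n" or "\n".

```c
#include <stdio.h>
#include <string.h>

/* Reads the first line of `file` into `line` (at most size - 1 characters)
 * and strips the line ending. Returns 1 on success, 0 if nothing was read. */
int read_first_line(FILE *file, char *line, size_t size)
{
    if (size == 0 || fgets(line, (int)size, file) == NULL)
        return 0;
    line[strcspn(line, "\r\n")] = '\0';  /* cut at the first CR or LF */
    return 1;
}
```

In main, the buffer would need to be larger than the 10 bytes declared in the question to hold something like "arg1 arg2 arg3"; after a successful call it can be passed wherever args is passed now.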

How to store the even lines of a file to one array and the odd lines to another

I am given a file of DNA sequences and asked to compare all of the sequences with each other and delete the sequences that are not unique. The file I am working with is in FASTA format, so the odd lines are the headers and the even lines are the sequences that I want to compare. So I am trying to store the even lines in one array and the odd lines in another. I am very new to C, so I'm not sure where to begin. I figured out how to store the whole file in one array like this:
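#include <stdio.h>
#include <string.h>
int main(){
    int total_seq = 50;
    char seq[100];
    char line[total_seq][100];
    int i = 0; // index of the next free row in line
    FILE *dna_file;
    dna_file = fopen("inabc.fasta", "r");
    if (dna_file==NULL){
        printf("Error");
        return 1;
    }
    // stop when the array is full or the file ends
    while(i < total_seq && fgets(seq, sizeof seq, dna_file)){
        strcpy(line[i], seq);
        printf("%s", seq);
        i++;
    }
    fclose(dna_file);
    return 0;
}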
I was thinking I would have to incorporate some sort of code that looked like this:
for (i = 0; i < rows; i++){
if (i % 2 == 0) header[i/2] = getline();
else seq[i/2] = getline();
but I'm not sure how to implement it.
Any help would be greatly appreciated!
To store the even lines of a file to one array and the odd lines to another,
read each char and swap output files when '\n' encountered.
void Split(FILE *even, FILE* odd, FILE *source) {
int evenflag = 1;
int ch;
while ((ch = fgetc(source)) != EOF) {
if (evenflag) {
fputc(ch, even);
} else {
fputc(ch, odd);
}
if (ch == '\n') {
evenflag = !evenflag;
}
}
}
It is not clear if this post also requires code to do the unique filtering step.
Could you please give me an example of the data in the file?
Am I right in thinking it'd be something like:
Header
Sequence
Header
Sequence
And so on
Perhaps you could do something like this:
int main(){
int total_seq = 50;
char seq[100];
char line[total_seq][100];
FILE *dna_file;
dna_file = fopen("inabc.fasta", "r");
if (dna_file==NULL){
printf("Error");
}
// Put this in an else statement
int counter = 1;
while(fgets(seq, sizeof seq, dna_file)){
// If counter is odd
// Place next line read in headers array
// If counter is even
// Place next line read in sequence array
// Increment counter
}
// Now you have all the sequences & headers. Remove any duplicates
// Foreach number of elements in 'sequence' array - referenced by, e.g. 'j' where 'j' starts at 0
// Foreach number of elements in 'sequence' array - referenced by 'k' - Where 'k' Starts at 'j + 1'
// IF (sequence[j] != '~') So if its not our chosen escape character
// IF (sequence[j] == sequence[k]) (I think you'd have to use strcmp for this?)
// SET sequence[k] = '~';
// SET header[k] = '~';
// END IF
// END IF
// END FOR
// END FOR
// You'd then need an algorithm to run through the arrays. If a '~' is found, move the following non-tilde sequence down to its position, and so on.
// EDIT: In fact, it would probably be easier, when writing back to file, to just ignore/not write an entry if sequence[x] == '~' (where 'x' iterates through all)
// Finally write back to file
fclose(dna_file);
return 0;
}
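The outline above could be filled in roughly as follows. This is only a sketch under stated assumptions: every line is shorter than MAX_LINE - 1 characters, the file strictly alternates header and sequence lines, and when a sequence occurs more than once the first occurrence is kept. Instead of marking duplicates with '~', it simply skips them when writing. The names read_fasta, write_unique, MAX_RECORDS and MAX_LINE are made up for this example.

```c
#include <stdio.h>
#include <string.h>

#define MAX_RECORDS 50
#define MAX_LINE 100

/* Reads alternating header/sequence lines into two arrays.
 * Returns the number of complete header+sequence records read. */
int read_fasta(FILE *in, char header[][MAX_LINE],
               char sequence[][MAX_LINE], int max_records)
{
    char line[MAX_LINE];
    int lineno = 0, n = 0;
    while (n < max_records && fgets(line, sizeof line, in)) {
        line[strcspn(line, "\n")] = '\0';   /* drop the newline */
        if (lineno % 2 == 0)
            strcpy(header[n], line);        /* 1st, 3rd, ... line: header */
        else
            strcpy(sequence[n++], line);    /* 2nd, 4th, ... line: sequence */
        lineno++;
    }
    return n;
}

/* Writes each record whose sequence did not appear in an earlier record.
 * Returns the number of records written. */
int write_unique(FILE *out, char header[][MAX_LINE],
                 char sequence[][MAX_LINE], int n)
{
    int written = 0;
    for (int j = 0; j < n; j++) {
        int duplicate = 0;
        for (int k = 0; k < j; k++) {
            if (strcmp(sequence[j], sequence[k]) == 0) {
                duplicate = 1;              /* same sequence seen before */
                break;
            }
        }
        if (!duplicate) {
            fprintf(out, "%s\n%s\n", header[j], sequence[j]);
            written++;
        }
    }
    return written;
}
```

The pairwise comparison is quadratic in the number of records, which is fine for 50 sequences but would need a different approach (sorting or hashing) for large files.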
First: write a function that counts the number of newline (\n) characters in the file.
Then write a function that searches for the n-th newline
Last, write a function to go through and read from one '\n' to the next.
Alternately, you could just go online and read about string parsing.

Reading a text file in C, stopping at multiple points, breaking it into sections

I have a program that has a text file that is variable in length. It must be capable of being printed in the terminal. My problem is that if the code is too large, part of it becomes inaccessible due to the limited scroll of terminal. I was thinking of having a command executed by a character to continue the lines after a certain point, allowing the user to see what they needed, and scroll if they needed. However the closest I have come is what you see here, which prints the text file one line at a time as you press enter. This is extremely slow and cumbersome. Is there another solution?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
int main()
{
FILE *audit;
audit = fopen("checkout_audit.txt", "r");
char length_of_code[60000];
int ch;
while ((ch = fgetc(audit)) != EOF)
{
fgets(length_of_code, sizeof length_of_code, audit);
fprintf(stdout, length_of_code, audit);
getch();
if (ferror(audit))
{
printf("This is an error message!");
return 13;
}
}
fclose(audit);
return 0;
}
The libraries are included as I tried various methods. Perhaps there is something obvious I am missing, however after looking around I found nothing that suited my needs in C.
You can keep a count, something like num_of_lines, keep incrementing it, and when it reaches some number (say 20 lines) do a getch() instead of doing it for each line.
Make sure you don't use feof() as the loop condition, as already suggested. The snippet below uses it only to show how the counting can be done.
int num_of_lines = 0;
while(!feof(fp))
{
// fgets();
num_of_lines++;
if(num_of_lines == 20)
{
num_of_lines = 0;
getch();
}
}
Putting the same thing in your code:
int main()
{
FILE *audit;
audit = fopen("checkout_audit.txt", "r");
char length_of_code[60000];
int num_of_lines = 0;
int ch;
while (fgets(length_of_code, sizeof length_of_code, audit) != NULL)
{
fprintf(stdout, "%s", length_of_code);
if (ferror(audit))
{
printf("This is an error message!");
return 13;
}
num_of_lines++;
if(num_of_lines == 20)
{
num_of_lines = 0;
getch();
}
}
fclose(audit);
return 0;
}
From the man page of fgets()
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s.
Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte is stored after the last character in the buffer.
So char length_of_code[60000]; is much larger than needed, because fgets reads at most one line per call.
Try to set the size of the array to a more suitable value, which in most cases is 80.
Also, as fgets fetches line by line, you will have to output line by line until EOF.
EDIT:
1. 2nd argument to fprintf should be the format specifier and not length
2. 3rd arg should be a string and not the file pointer
fprintf(stdout, "%s", length_of_code);
Code Snippet:
while (fgets(length_of_code, sizeof(length_of_code), audit))
{
fprintf(stdout, "%s", length_of_code);
getch();
if (ferror(audit))
{
printf("This is an error message!");
return 13;
}
}
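Putting the suggestions from this thread together (pause every N lines, print with fputs or a "%s" format, loop on the result of fgets), a portable sketch might look like this. The function name page_file and its parameters are hypothetical. It waits for Enter by reading from a `keys` stream (normally stdin) rather than calling the non-standard getch(), and it assumes lines may be longer than the buffer, so a line is only counted once its newline has been seen.

```c
#include <stdio.h>
#include <string.h>

/* Copies `in` to `out`, pausing after every `lines_per_page` lines until
 * a full line (e.g. just Enter) has been read from `keys`.
 * Returns the number of complete lines copied. */
long page_file(FILE *in, FILE *out, FILE *keys, int lines_per_page)
{
    char line[256];
    long total = 0;
    int on_page = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        fputs(line, out);                /* never use the data as a format */
        if (strchr(line, '\n') == NULL)
            continue;                    /* long line: rest comes next call */
        total++;
        if (++on_page == lines_per_page) {
            on_page = 0;
            fflush(out);
            int k;
            while ((k = fgetc(keys)) != EOF && k != '\n')
                ;                        /* wait for Enter */
        }
    }
    return total;
}
```

A call such as page_file(audit, stdout, stdin, 20) would show the file 20 lines at a time.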

Comparing 2 substrings in C

I am having trouble reading a string of characters from a file and then comparing them, for the first part of my homework on Ubuntu using C.
The program compiles fine, but it seems I get stuck in an infinite loop when it reaches the while loop under the string-compare portion of the code. Thanks.
Also, can I get some advice on how to take multiple inputs from the terminal, so that I can compare the string from the 'bar' file against each of the substrings given after it on the command line? My output should look like:
% echo "aaab" > bar
% ./p05 bar aa B
2
1
%
This is what I have so far:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(void /*int argc, char *argv[]*/)
{
/******* Open, Read, Close file**********/
FILE *ReadFile;
ReadFile = fopen(/*argv[1]*/"bar", "r");
if(NULL == ReadFile)
{
printf("\n file did not open \n");
return 1;
}
fseek(ReadFile, 0 , SEEK_END);
int size = ftell(ReadFile);
rewind(ReadFile);
char *content = calloc( size +1, 1);
fread(content,1,size,ReadFile);
/*fclose(ReadFile); */
printf("you made it past opening and reading file\n");
printf("your file size is %i\n",size);
/*********************************/
/******String compare and print*****/
int count =0;
const char *tmp = "Helololll";
while (content = strstr(content,"a"))
{
count++;
tmp++;
}
printf("Your count is:%i\n",count);
/***********************************/
return 0;
}
The following loop is infinite if the character 'a' occurs in content.
while (content = strstr(content, "a"))
{
count ++;
tmp ++;
}
It resets content to point to the location of the first occurrence of 'a' on the first iteration. Future iterations will not change the value of content. IOW, content points to "aaab" so the call to strstr will find the first 'a' every time. If you replace tmp++ with content++ inside of your loop, then it will be closer to what you want. I would probably write this with a for loop to make it a little more clear that you are iterating.
char const * const needle = "a";
for (char *haystack=content; haystack=strstr(haystack, needle); haystack++) {
count++;
}
The haystack is incremented so that it always decreases in size. Eventually, you will not find the needle in the haystack and the loop will terminate.
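The same loop can be wrapped in a small function so it can be called once per command-line argument. This is a sketch, not the poster's code: the name count_occurrences is made up, it counts overlapping matches (so "aa" is found twice in "aaab"), and it is case-sensitive, which means it would report 0 for "B" rather than the 1 shown in the sample output of the question.

```c
#include <string.h>

/* Counts occurrences of `needle` in `haystack`, including overlapping ones. */
size_t count_occurrences(const char *haystack, const char *needle)
{
    size_t count = 0;
    if (*needle == '\0')
        return 0;   /* an empty needle would match at every position */
    /* after each match, restart the search one character further on */
    for (const char *p = haystack; (p = strstr(p, needle)) != NULL; p++)
        count++;
    return count;
}
```

With argc and argv enabled in main, the call could be made as count_occurrences(content, argv[i]) for each i from 2 to argc - 1.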

Reading a file in C

I have an input file I need to extract words from. The words can only contain letters and numbers so anything else will be treated as a delimiter. I tried fscanf,fgets+sscanf and strtok but nothing seems to work.
while(!feof(file))
{
fscanf(file,"%s",string);
printf("%s\n",string);
}
Above one clearly doesn't work because it doesn't use any delimiters so I replaced the line with this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
So I used fgets to read the first line and use sscanf:
sscanf(line,"%[A-z]%n,word,len);
line+=len;
This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set delimiters
while(p != NULL) {
printf("%s\n", p);
p = strtok(NULL, " ");
This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.
Am I missing something here? Extracting words from a file seemed a simple concept at first, but nothing I try really works.
Consider building a minimal lexer. When in state word it would remain in it as long as it sees letters and numbers. It would switch to state delimiter when encountering something else. Then it could do an exact opposite in the state delimiter.
Here's an example of a simple state machine which might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number in a separate line. It's not a lexer but the idea of switching between states is quite similar.
#include <stdio.h>
#include <string.h>
int main() {
static const int WORD = 1, DELIM = 2, BUFLEN = 1024;
int state = WORD, ptr = 0, c; // c is an int so that it can hold EOF
char buffer[BUFLEN], *digits = "1234567890";
while ((c = getchar()) != EOF) {
if (strchr(digits, c)) {
if (WORD == state) {
buffer[ptr++] = c;
} else {
buffer[0] = c;
ptr = 1;
}
state = WORD;
} else {
if (WORD == state) {
buffer[ptr] = '\0';
printf("%s\n", buffer);
}
state = DELIM;
}
}
return 0;
}
If the number of states increases you can consider replacing if statements checking the current state with switch blocks. The performance can be increased by replacing getchar with reading a whole block of the input to a temporary buffer and iterating through it.
If you have to deal with a more complex input file format, you can use lexical analyser generators such as flex. They can do the job of defining state transitions and other parts of lexer generation for you.
Several points:
First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
Second, you mentioned this:
fscanf(file,"%[A-z]",string);
It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.
That's not quite what's happening; if the next character in the stream doesn't match the format specifier, scanf returns without having read anything, and string is unmodified.
Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.
#include <stdio.h>
#include <ctype.h>
int get_next_word(FILE *file, char *word, size_t wordSize)
{
size_t i = 0;
int c;
/**
* Skip over any non-alphanumeric characters
*/
while ((c = fgetc(file)) != EOF && !isalnum(c))
; // empty loop
if (c != EOF)
word[i++] = c;
/**
* Read up to the next non-alphanumeric character and
* store it to word
*/
while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c))
{
word[i++] = c;
}
word[i] = 0;
return i > 0; // succeed whenever a word was stored, even if it ended at EOF
}
int main(void)
{
char word[SIZE]; // where SIZE is large enough to handle expected inputs
FILE *file;
...
while (get_next_word(file, word, sizeof word))
// do something with word
...
}
I would use:
FILE *file;
char string[200];
while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
/* do something with string... */
}
This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that if you have any 'words' that are longer than 199 letters they'll be split up into multiple words, but you need the limit to avoid a buffer overflow...
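The same scanset idea also works on a line that has already been read with fgets, which is what the sscanf attempt in the question was aiming for. The sketch below is one possible way to do it, with digits added to the scanset because the question counts them as word characters. The names split_words and WORD_MAX are invented for the example. The key point is that %n stores the number of characters consumed, and its argument must be a pointer to int.

```c
#include <stdio.h>

#define WORD_MAX 64

/* Splits `line` into runs of letters and digits.
 * Stores up to `max_words` words and returns how many were stored. */
size_t split_words(const char *line, char words[][WORD_MAX], size_t max_words)
{
    size_t n = 0;
    while (*line != '\0' && n < max_words) {
        int len = 0;
        /* try to read a word; %n records how many characters it used */
        if (sscanf(line, "%63[A-Za-z0-9]%n", words[n], &len) == 1) {
            n++;
        } else {
            /* no word here: skip the run of delimiter characters instead */
            sscanf(line, "%*[^A-Za-z0-9]%n", &len);
        }
        line += len;    /* advance past whatever was consumed */
    }
    return n;
}
```

As with the fscanf version, a word longer than 63 characters is split into several pieces rather than overflowing the buffer.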
What are your delimiters? The second argument to strtok should be a string containing your delimiters, and the first should be a pointer to your string the first time round then NULL afterwards:
char * p = strtok(line, ","); // assuming a , delimiter
while(p)
{
printf("%s\n", p);
p = strtok(NULL, ",");
}

Resources