Sorting duplicate lines in C

I am trying to write a C program that filters lines: when there are consecutive duplicate lines, it should print only one of them. I have to use arrays of chars to compare the lines. The size of the arrays is inconsequential (set at 79 chars for the project). I have initialized the arrays as such:
char newArray [MAXCHARS];
char oldArray [MAXCHARS];
and have filled the array by using this for loop, to check for newlines and the end of file:
for(i = 0; i<MAXCHARS;i++){
if((newChar = getc(ifp)) != EOF){
if(newChar != '/n'){
oldArray[i] = newChar;
oldCount++;
}
else if(newChar == '/n'){
oldArray[i] = newChar;
oldCount++;
break;
}
}
else{
endOf = true;
break;
}
}
To cycle through the next line(s) and search for duplicates, I am using a while loop that is initially set to true. It fills the next array up to the newline and tests for EOF as well. Then, I use two for loops to test the arrays. If they are the same at each position in the arrays, duplicate remains unchanged and nothing is printed. If they are not the same, duplicate is set to false and a function (testArrays) is called to print the contents of each array.
while(duplicate){
newCount = 0;
/* fill second array, test for newlines and EOF*/
for(i =0; i< MAXCHARS; i++){
if((newChar = getc(ifp)) != EOF){
if(newChar != '/n'){
newArray[i] = newChar;
newCount++;
}
else if(newChar == '/n'){
newArray[i] = newChar;
newCount++;
break;
}
}
else{
endOf = true;
break;
}
}
/* test arrays against each other to spot duplicate lines*
if they are duplicates, continue the while loop getting new
arrays of characters in newArray until these tests fail*/
for(i =0; i< oldCount; i++){
if(oldArray[i] == newArray[i]){
continue;
}
else{
duplicate = false;
break;
}
}
for(i =0; i <newCount; i++){
if(oldArray[i] == newArray[i]){
continue;
}
else{
duplicate = false;
break;
}
}
if(endOf && duplicate){
testArray(oldArray);
break;
}
}
if((endOf && !duplicate) || (!endOf && !duplicate)){
testArray(oldArray);
testArray(newArray);
}
I find that this does not work: consecutive identical lines are still being printed. I cannot figure out how this could be happening. I know this is a lot of code to wade through, but it is pretty straightforward, and I think another set of eyes will spot the problem easily. Thanks for the help.

Is there a reason why you read a character at a time instead of calling fgets() to read a line?
char instr[MAXCHARS];
for( iline = 0; fgets( instr, sizeof instr, ifp ) != NULL; iline++ ) {
/* strcmp() the current line against the previous line here */
}
EDIT:
You might want to declare two character arrays and three char pointers: one pointing to the current line, one pointing to the previous line, and a third used as a temporary to swap the other two.
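The two-buffer, three-pointer idea can be sketched roughly as follows. This is only an illustration: the function name dedupe_swap, the LINE_BUF size, and the FILE * parameters are assumptions, not part of the original post. Lines longer than the buffer are read in pieces, which this sketch does not treat specially.

```c
#include <stdio.h>
#include <string.h>

#define LINE_BUF 256  /* assumed buffer size, not from the original post */

/* Copies `in` to `out`, writing only the first line of each run of
   identical consecutive lines. Returns the number of lines written. */
static unsigned dedupe_swap(FILE *in, FILE *out)
{
    char buf_a[LINE_BUF], buf_b[LINE_BUF];
    char *cur = buf_a;    /* line being read now */
    char *prev = buf_b;   /* last line that was written */
    char *tmp;            /* third pointer, used only for the swap */
    unsigned written = 0;

    prev[0] = '\0';       /* no previous line yet */
    while (fgets(cur, LINE_BUF, in) != NULL) {
        if (written == 0 || strcmp(cur, prev) != 0) {
            fputs(cur, out);
            written++;
            /* swap the pointers instead of copying the text */
            tmp = prev;
            prev = cur;
            cur = tmp;
        }
    }
    return written;
}
```

Calling dedupe_swap(ifp, stdout) would filter the already opened input file to standard output. Swapping pointers avoids copying the line on every iteration.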

You need to use a function to read lines — either fgets() or one you write (or POSIX getline() if you are familiar with dynamic memory allocation).
You then need to use an algorithm equivalent to:
1. Read first line into old.
2. If there is no line (EOF), stop.
3. Print the first line.
4. For every extra line, read into new.
5. If there is no line (EOF), stop.
6. If new is the same as old, go to step 4.
7. Print new.
8. Copy new to old.
9. Go to step 4.
Those 'go to' steps would be part of normal loop controls, not actual goto statements.
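As a rough sketch, the algorithm above maps onto an ordinary while loop with fgets(), strcmp() and strcpy(). The function name print_unique_runs and the buffer size are assumptions made for illustration; the variable is called new_line rather than new only for readability.

```c
#include <stdio.h>
#include <string.h>

#define MAXCHARS 81  /* assumed: 79 characters plus '\n' and '\0' */

/* Writes `in` to `out`, dropping lines that repeat the line before them. */
static void print_unique_runs(FILE *in, FILE *out)
{
    char old[MAXCHARS], new_line[MAXCHARS];

    /* Read the first line into old; stop if there is none. */
    if (fgets(old, sizeof old, in) == NULL)
        return;
    fputs(old, out);                       /* print the first line */

    /* The loop condition handles "read a line, stop at EOF". */
    while (fgets(new_line, sizeof new_line, in) != NULL) {
        if (strcmp(new_line, old) == 0)
            continue;                      /* duplicate: read the next line */
        fputs(new_line, out);              /* print new */
        strcpy(old, new_line);             /* copy new to old */
    }
}
```

The continue statement plays the role of the "go to" for a duplicate line, and falling off the end of the loop body plays the role of the final one.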

I would do it by strings instead of char by char. I would use fgets(str, MAX_CHARS, stdin) to get the full input line and strcmp() it against the previous string. (Avoid gets(): it cannot limit how much it reads and was removed from the language in C11.) strcmp() assumes your strings are nul-terminated, which fgets() guarantees, and fgets() returns NULL at EOF, which ends the loop. Something like what's below should work:
#include <stdio.h>
#include <string.h>
#define MAX_CHARS 81 /* 79 chars plus '\n' and the terminating nul */
int main(void){
char newStr[MAX_CHARS] = {0}; //string for new input
char oldStr[MAX_CHARS] = {0}; //previous line; starts out empty
// Loop over input as long as there is something to read
while(fgets(newStr, sizeof newStr, stdin) != NULL){
if(strcmp(newStr,oldStr) != 0){
printf("%s", newStr); //fgets keeps the '\n', so none is added here
}
else{
//This is the case when you have duplicate strings. Don't print
}
memset(oldStr, 0, sizeof(oldStr)); //clear out old string in case it was longer
strcpy(oldStr, newStr); //copy new string into old string for future compare
}
return 0;
}

In the part where you test for duplicates, maybe you could test whether oldCount == newCount first? My reasoning is that if it is a duplicate line, oldCount will be equal to newCount. Only if that is true do you need to go on and compare the two arrays.
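A minimal sketch of that length-first check, assuming the two counts hold the number of characters stored for each line. The function name same_line is hypothetical, and memcmp() stands in for the hand-written comparison loops.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when both lines have the same length and the same bytes. */
static bool same_line(const char *old_line, size_t old_count,
                      const char *new_line, size_t new_count)
{
    if (old_count != new_count)
        return false;                 /* different lengths: not a duplicate */
    return memcmp(old_line, new_line, old_count) == 0;
}
```

With this, a call such as same_line(oldArray, oldCount, newArray, newCount) could replace both comparison loops, and a line that is merely a prefix of the other no longer counts as a match.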

Related

Removing neighboring duplicate lines from a file using C

Empty lines should also be removed if they are duplicates. A line that contains escape sequences (like \t) is different from an empty line. The code below deletes too many lines, or sometimes leaves duplicates. How can I fix this?
#include <stdio.h>
#include <stdlib.h>
int main()
{
char a[6000];
char b[6000];
int test = 0;
fgets(a, 6000, stdin);
while (fgets(b, 6000, stdin) != NULL) {
for (int i = 0; i < 6000; i++) {
if (a[i] != b[i]) {
test = 1;
}
}
if (test == 0) {
fgets(b, 6000, stdin);
} else {
printf("%s", a);
}
int j = 0;
while (j < 6000) {
a[j] = b[j];
j++;
}
test = 0;
}
return 0;
}

Your logic is mostly sound. You are on the right track with your train of thought:
1. Read a line into previous (a).
2. Read another line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print previous.
5. Move current to previous.
6. Go to step 2.
This still has some problems, however.
Unnecessary line-read
To start, consider this bit of code:
while(fgets(b,6000,stdin)!=NULL) {
...
if(test==0) {
fgets(b,6000,stdin);
}
else {
printf("%s",a);
}
...
}
If a and b have the same contents (test==0), you call fgets a second time without checking its result, and then the loop condition fgets(b,6000,stdin)!=NULL reads yet another line. The line read by that extra call is never compared against the previous line, so you end up moving an unexamined line from b to a. Since the loop already reads another line and checks for failure appropriately, just let the loop read the line, and invert the if statement's equality test to print a if test!=0.
Where's the last line?
Your logic also will not print the last line. Consider a file with 1 line. You read it, then fgets in the loop condition attempts to read another line, which fails because you're at the end of the file. There is no print statement outside the loop, so you never print the line.
Now what about a file with 2 lines that differ? You read the first line, then the last line, see they're different, and print the first line. Then you overwrite the first line's buffer with the last line. You fail to read another line because there aren't any more, and the last line is, again, not printed.
You can fix this by replacing the first (unchecked) fgets with a[0] = 0. That makes the first byte of a a null byte, which means the end of the string. It won't compare equal to a line you read, so test==1, meaning a will be printed. Since there is no string in a to print, nothing is printed. Things then continue as normal, with the contents of b being moved into a and another line being read.
Unique last line problem
This leaves one problem: the last line won't be printed if it's not a duplicate. To fix this, just print b instead of a.
The final recipe
1. Assign 0 to the first byte of previous (a[0]).
2. Read a line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print current.
5. Move current to previous.
6. Go to step 2.
As you can see, it's not much different from your existing logic; only steps 1 and 4 differ. It also ensures that all fgets calls are checked. If there are no lines in a file, nothing is printed. If there is only 1 line in a file, it is printed. If 2 lines differ, both are printed. If 2 lines are the same, the first is printed.
Optional: optimizations
Instead of checking all 6000 bytes, you only need to check up to the first null byte in either string, since fgets automatically adds one to mark the end of the string.
Faster still would be to add a break statement inside the if statement of your for loop. If a single byte doesn't match, the entire line is not a duplicate, so you can stop comparing early. That is a lot faster if only byte 10 differs in two 1000-byte lines!
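Both optimizations can be combined in a small helper, sketched here under the assumption that both buffers hold nul-terminated strings as produced by fgets. The name lines_equal is hypothetical; the standard strcmp() performs an equivalent early-exit comparison.

```c
#include <stddef.h>

/* Returns 1 if the two nul-terminated lines are identical, 0 otherwise. */
static int lines_equal(const char *a, const char *b)
{
    size_t i;

    /* Stop at the first null byte in either string. */
    for (i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        if (a[i] != b[i])
            return 0;          /* first mismatch: stop comparing early */
    }
    return a[i] == b[i];       /* equal only if both ended at the same place */
}
```

In the loop, test = !lines_equal(a, b); would then replace the 6000-iteration for loop.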

#include <stdio.h>
#include <string.h>
int main(void)
{
char buff[2][6000];
unsigned count=0;
char *prev=NULL
, *this= buff[count%2]
;
while( fgets(this, sizeof buff[0] , stdin)) {
if(!prev || strcmp(prev, this) ) { // first or different
fputs(this, stdout);
prev=this;
count++;
this=buff[count%2];
}
}
fprintf(stderr, "Number of lines written: %u\n", count);
return 0;
}

There are a few problems in your code, such as:
for(int i=0; i<6000; i++) {
if(a[i]!=b[i]) {
test=1;
}
}
In this loop, the whole buffer is compared character by character every time, even after a[i]!=b[i] has been found for some value of i. You should probably break out of the loop after setting test=1.
Your logic will also not work for a file with just 1 line, as you are not printing the line outside the loop.
Another problem is the fixed-length buffer of 6000 chars.
Maybe you can use getline to solve your problem. You could do:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char * line = NULL;
char * comparewith = NULL;
int notduplicate;
size_t len = 0;
ssize_t read;
while ((read = getline(&line, &len, stdin)) != -1) {
notduplicate = (comparewith == NULL) || (strcmp(line, comparewith) != 0);
if (notduplicate) {
printf ("%s", line); /* getline() keeps the trailing newline */
if (comparewith != NULL)
free(comparewith);
comparewith = line;
line = NULL;
}
}
if (line)
free (line);
if (comparewith)
free (comparewith);
return 0;
}
An important point to note:
getline() is not in the C standard library. It was originally a GNU extension and was standardized in POSIX.1-2008, so this code may not be portable. To make it portable, you'll need to roll your own getline().
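One possible shape for a hand-rolled replacement, using only standard C, is sketched below. The name read_line_alloc and its interface are assumptions for illustration and deliberately simpler than the real getline(): it returns a freshly allocated line each time instead of reusing a caller-supplied buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line, including the trailing '\n' if
   present, into a malloc'd buffer of whatever size is needed.
   Returns NULL at EOF with nothing read, or if allocation fails.
   The caller must free() the result. */
static char *read_line_alloc(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;
    while ((c = fgetc(fp)) != EOF) {
        if (len + 2 > cap) {               /* need room for c and '\0' */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        buf[len++] = (char)c;
        if (c == '\n')
            break;                         /* end of this line */
    }
    if (len == 0) {                        /* EOF before any character */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Because each call returns its own buffer, the previous line can be kept by holding on to its pointer and freeing it once a different line replaces it.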

Here is a much simpler solution that has no limitation on line length:
#include <stdio.h>
int main(void) {
int c, last1 = 0, last2 = 0;
while ((c = getchar()) != EOF) {
if (c != '\n' || last1 != '\n' || last2 != '\n')
putchar(c);
last2 = last1;
last1 = c;
}
return 0;
}
The code drops every newline after the second in a run of consecutive newline characters, so it collapses repeated blank lines into one. Note that it does not remove duplicate non-blank lines.

How to store the even lines of a file to one array and the odd lines to another

I am given a file of DNA sequences and asked to compare all of the sequences with each other and delete the sequences that are not unique. The file I am working with is in fasta format, so the odd lines are the headers and the even lines are the sequences that I want to compare. So I am trying to store the even lines in one array and the odd lines in another. I am very new to C, so I'm not sure where to begin. I figured out how to store the whole file in one array like this:
#include <stdio.h>
#include <string.h>
int main(){
int total_seq = 50;
int i = 0;
char seq[100];
char line[total_seq][100];
FILE *dna_file;
dna_file = fopen("inabc.fasta", "r");
if (dna_file==NULL){
printf("Error");
return 1;
}
while(i < total_seq && fgets(seq, sizeof seq, dna_file)){
strcpy(line[i], seq);
printf("%s", seq);
i++;
}
fclose(dna_file);
return 0;
}
I was thinking I would have to incorporate some sort of code that looked like this:
for (i = 0; i < rows; i++){
if (i % 2 == 0) header[i/2] = getline();
else seq[i/2] = getline();
but I'm not sure how to implement it.
Any help would be greatly appreciated!

To send the even lines of a file to one place and the odd lines to another,
read each char and switch output files when '\n' is encountered.
void Split(FILE *even, FILE* odd, FILE *source) {
int evenflag = 1;
int ch;
while ((ch = fgetc(source)) != EOF) {
if (evenflag) {
fputc(ch, even);
} else {
fputc(ch, odd);
}
if (ch == '\n') {
evenflag = !evenflag;
}
}
}
It is not clear if this post also requires code to do the unique filtering step.
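If the lines should end up in two arrays rather than two files, the same alternating idea can be sketched with fgets(). The function name load_pairs and the array sizes are assumptions that mirror the 50 x 100 limits in the question. Lines longer than the buffer would be split across entries, which this sketch does not handle.

```c
#include <stdio.h>
#include <string.h>

#define MAX_RECORDS 50   /* assumed limits, mirroring the question */
#define MAX_LEN 100

/* Stores alternating lines of `fp`: the 1st, 3rd, ... line go to header[],
   the 2nd, 4th, ... line go to seq[]. Returns the number of complete
   header/sequence pairs stored. */
static int load_pairs(FILE *fp, char header[][MAX_LEN],
                      char seq[][MAX_LEN], int max_pairs)
{
    char line[MAX_LEN];
    int n = 0;                               /* zero-based line counter */

    while (n / 2 < max_pairs && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';    /* drop the trailing newline */
        if (n % 2 == 0)
            strcpy(header[n / 2], line);
        else
            strcpy(seq[n / 2], line);
        n++;
    }
    return n / 2;
}
```

A caller would declare char header[MAX_RECORDS][MAX_LEN] and char seq[MAX_RECORDS][MAX_LEN], then pass them together with the opened file and MAX_RECORDS.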

Could you please give me an example of the data in the file?
Am I right in thinking it'd be something like:
Header
Sequence
Header
Sequence
And so on
Perhaps you could do something like this:
#include <stdio.h>
#include <string.h>
int main(){
int total_seq = 50;
char seq[100];
char line[total_seq][100];
FILE *dna_file;
dna_file = fopen("inabc.fasta", "r");
if (dna_file==NULL){
printf("Error");
return 1;
}
int counter = 1;
while(fgets(seq, sizeof seq, dna_file)){
// If counter is odd
// Place the line just read in headers array
// If counter is even
// Place the line just read in sequence array
// Increment counter
}
// Now you have all the sequences & headers. Remove any duplicates
// FOR each element in 'sequence' array, referenced by e.g. 'j', where 'j' starts at 0
// FOR each element in 'sequence' array, referenced by 'k', where 'k' starts at 'j + 1'
// IF (sequence[j] != '~') So if it's not our chosen marker character
// IF (sequence[j] == sequence[k]) (use strcmp for this; == would compare addresses, not contents)
// SET sequence[k] = '~';
// SET header[k] = '~';
// END IF
// END IF
// END FOR
// END FOR
// You'd then need an algorithm to run through the arrays. If a '~' is found, move the following non-tilde sequence down to its position, and so on.
// EDIT: In fact, it would probably be easier, when writing back to file, to just skip the entry if sequence[x] == '~' (where 'x' iterates through all)
// Finally write back to file
fclose(dna_file);
return 0;
}
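The duplicate-marking loops from the pseudocode could look roughly like this in C. It is a sketch only: the function name mark_duplicates and the row width are assumptions, and it keeps the first occurrence of each sequence, which is one reading of "delete the sequences that are not unique".

```c
#include <string.h>

#define MAX_LEN 100   /* assumed row width, as in the question */

/* Marks every later copy of a sequence, and its header, with "~".
   Returns the number of records marked. */
static int mark_duplicates(char header[][MAX_LEN], char seq[][MAX_LEN],
                           int count)
{
    int marked = 0;

    for (int j = 0; j < count; j++) {
        if (seq[j][0] == '~')
            continue;                     /* already marked as a duplicate */
        for (int k = j + 1; k < count; k++) {
            if (strcmp(seq[j], seq[k]) == 0) {
                strcpy(seq[k], "~");      /* the chosen marker string */
                strcpy(header[k], "~");
                marked++;
            }
        }
    }
    return marked;
}
```

When writing the result back to a file, entries whose sequence starts with '~' would simply be skipped.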

First: write a function that counts the number of newline (\n) characters in the file.
Then write a function that searches for the n-th newline
Last, write a function to go through and read from one '\n' to the next.
Alternately, you could just go online and read about string parsing.

Need to reverse file in place but it only works for one line files

So I think I'm closer here, but I'm still getting funny results when printing the reversed string in place. I'll try to be detailed.
Here is the input:
Writing code in c
is fun
Here is what i want:
c in code Writing
fun is
Here is the actual output:
C
in code Writing
fun
is
Here is my code:
char str[1000]; /*making array large. I picked 1000 beacuse it'll never be written over. A line will never hole 1000 characters so fgets won't write into memory where it doesnt belong*/
int reverse(int pos)
{
int strl = strlen(str)-1,i;
int substrstart = 0,substrend = 0;
char temp;
for(;;)
{
if( pos <= strl/2){ /*This will allow the for loop to iterate to the middle of the string. Once the middle is reached you no longer need to swap*/
temp = str[pos]; /*Classic swap algorithm where you move the value of the first into a temp variable*/
str[pos]= str[strl-pos]; /*Move the value of last index into the first*/
str[strl-pos] = temp; /*move the value of the first into the last*/
}
else
break;
pos++; /*Increment your position so that you are now swaping the next two indicies inside the last two*/
} /* If you just swapped index 5 with 0 now you're swapping index 4 with 1*/
for(;substrend-1 <= strl;)
{
if(str[substrend] == ' ' || str[substrend] == '\0' ) /*in this second part of reverse we take the now completely reversed*/
{
for(i = 0; i <= ((substrend-1) - substrstart)/2; i++) /*Once we find a word delimiter we go into the word and apply the same swap algorthim*/
{
temp = str[substrstart+i]; /*This time we are only swapping the characters in the word so it looks as if the string was reversed in place*/
str[substrstart+i] = str[(substrend-1)-i];
str[(substrend-1)-i] = temp;
}
if(str[substrend] == '\t' || str[substrend] == '\n')
{
str[substrend] = ' ';
for(i = 0; i <= ((substrend-1) - substrstart)/2; i++) /*Once we find a word delimiter we go into the word and apply the same swap algorthim*/
{
temp = str[substrstart+i]; /*This time we are only swapping the characters in the word so it looks as if the string was reversed in place*/
str[substrstart+i] = str[(substrend-1)-i];
str[(substrend-1)-i] = temp;
}
}
if(str[substrend] == '\0')
{
break;
}
substrstart=substrend+1;
}
substrend++; /*Keep increasing the substrend until we hit a word delimiter*/
}
printf("%s\n", str); /*Print the reversed line and then jump down a line*/
return 0;
}
int main(int argc, char *argv[])
{
char *filename; /*creating a pointer to a filename*/
FILE *file20; /*creating FIlE pointer to a file to open*/
int n;
int i;
if (argc==1) /*If there is no line parameter*/
{
printf("Please use line parameter!\n");
return(5); /*a return of 5 should mean that now line parameter was given*/
}
if(argc>1){
for(i=1; i < argc; i++)
{
filename = argv[i]; //get first line parameter
file20 = fopen(filename, "r"); //read text file, use rb for binary
if (file20 == NULL){
printf("Cannot open empty file!\n");
}
while(fgets(str, 1000, file20) != NULL) {
reverse(0);
}
fclose(file20);
}
return(0); /*return a value of 0 if all the line parameters were opened reveresed and closed successfully*/
}
}
Can anyone point me to an error in the logic of my reverse function?

What you've written reads out the whole file into a single buffer and runs your reverse function over the whole file at once.
If you want the first line reversed then the next line reversed, etc, you'll need to read the lines one at a time using something like fgets. Run reverse over each line, one at a time and you should get what you want.
http://www.cplusplus.com/reference/cstdio/fgets/

Assuming you want to continue reading in the whole file into a single buffer and then doing the line-by-line reverse on the buffer all at once (instead of reading in one line, reversing it, reading in the next line, reversing it, and so on), you'll need to re-write your reverse() algorithm.
What you have in place seems to work already; I think you can get what you need by adding another loop around your existing logic, with a few modifications to your existing logic. Start with a pointer to the beginning of str[], let's call it char* cp1 = str. At the top of this new loop, create another pointer, char* cp2, and set it equal to cp1. Using cp2, scan to the end of the current line looking for a newline or '\0'. Now you have a pointer to the start of the current line (cp1) and a pointer to the end of the current line (cp2). Now modify your existing logic to use those pointers instead of str[] directly. You can compute the length of the current line by simply lineLen = cp2 - cp1; (you wouldn't want to use strlen() because the line might not have a terminating '\0'). After that, it will loop back up to the top of your new loop and continue with the next line (if *cp2 doesn't point to '\0')... just set cp1 = cp2+1 and continue with the next line.
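A sketch of that outer loop, using cp1 and cp2 as described, might look like the following. The helper names reverse_span and reverse_words_per_line are hypothetical, and the sketch assumes words are separated by single spaces. The newline is excluded from each reversed range so it stays at the end of its line.

```c
#include <stddef.h>

/* Reverses the characters in [first, last] in place. */
static void reverse_span(char *first, char *last)
{
    while (first < last) {
        char tmp = *first;
        *first++ = *last;
        *last-- = tmp;
    }
}

/* Reverses the word order within each '\n'-separated line of buf. */
static void reverse_words_per_line(char *buf)
{
    char *cp1 = buf;                       /* start of the current line */

    while (*cp1 != '\0') {
        char *cp2 = cp1;                   /* scans to the end of the line */
        while (*cp2 != '\0' && *cp2 != '\n')
            cp2++;
        if (cp2 > cp1) {
            char *word = cp1;
            reverse_span(cp1, cp2 - 1);    /* whole line, newline excluded */
            for (char *p = cp1; p <= cp2; p++) {
                if (p == cp2 || *p == ' ') {
                    if (p > word)
                        reverse_span(word, p - 1);  /* un-reverse each word */
                    word = p + 1;
                }
            }
        }
        if (*cp2 == '\0')
            break;                         /* no more lines */
        cp1 = cp2 + 1;                     /* continue after the newline */
    }
}
```

The same function also works on a single line read with fgets, because the trailing '\n' is never part of the range being reversed.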

Same Array in Different Procedures

I'm really new to C, and currently I'm trying to read in from a file which contains a list of names, and import that into an array. The current array is of type char[][] since it will have more information than just the name, but essentially I want team[0][0] to be the first name I read in, team[1][0] to be the second, etc. I'm pretty sure the actual importing of the names is correct, but I'm having problems storing these arrays.
FILE *teamfile;
teamfile = fopen(file, "r");
char line[MAXLENGTH+1];
int i = 0;
while( fgets(line, sizeof line, teamfile) != NULL )
{
trim_line(line);
strcpy(&team[i][NAME],line);
i++;
}
fclose(teamfile);
Which is called from the main function as teams = teamlist(argv[1], team);
But when I try to refer to the array from elsewhere in my program eg printf(&team[0][0]) it outputs what seems to be all names in one block...
What am I doing wrong?
edit:
static void trim_line(char line[])
{
int i = 0;
// LOOP UNTIL WE REACH THE END OF line
while(line[i] != '\0')
{
// CHECK FOR CARRIAGE-RETURN OR NEWLINE
if( line[i] == '\r' || line[i] == '\n' )
{
line[i] = '\0'; // overwrite with nul-byte
break; // leave the loop early
}
i = i+1; // iterate through character array
}
}
thanks for the help so far! :D

If team is declared as char team[NUM_OF_TEAMS][LENGTH_OF_NAME],
then it should always be strcpy(team[i], line);
Hint: in C it is a char array, not a "string object".
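Under that assumed declaration (one row of chars per name), the reading loop could be sketched as below. The names load_names, MAX_TEAMS and NAME_LEN are hypothetical, and strcspn() stands in for the question's trim_line().

```c
#include <stdio.h>
#include <string.h>

#define MAX_TEAMS 16   /* assumed sizes, not from the original post */
#define NAME_LEN 32

/* Reads one name per line into team[0], team[1], ...
   Returns the number of names stored. */
static int load_names(FILE *fp, char team[][NAME_LEN], int max_teams)
{
    char line[NAME_LEN];
    int i = 0;

    while (i < max_teams && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';  /* cut at the first CR or LF */
        strcpy(team[i], line);               /* team[i] is row i, a char array */
        i++;
    }
    return i;
}
```

Each row is then its own nul-terminated string, so printf("%s\n", team[0]) prints only the first name.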

Scanning in more than one word in C

I am trying to make a program which needs to scan in more than one word, and I do not know how to do this with an unspecified length.
My first port of call was scanf; however, this only scans in one word (I know you can do scanf("%d %s",temp,temporary);, but I do not know how many words it needs), so I looked around and found fgets. One issue with this is that I cannot find how to make it move on to the next code, e.g.
scanf("%99s",temp);
printf("\n%s",temp);
if (strcmp(temp,"edit") == 0) {
editloader();
}
would run editloader(), while:
fgets(temp,99,stdin);
while(fgets(temporary,sizeof(temporary),stdin))
{
sprintf(temp,"%s\n%s",temp,temporary);
}
if (strcmp(temp,"Hi There")==0) {
editloader();
}
will not move onto the strcmp() code, and will stick on the original loop. What should I do instead?

I would scan in each loop a word with scanf() and then copy it with strcpy() in the "main" string.
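A sketch of that word-by-word loop is shown below, with two stated assumptions: it uses fscanf() on a FILE * so it can read from any stream, and it appends with strcat() rather than strcpy() so earlier words are kept. The function name read_words is hypothetical. Note that the loop only ends at end of input, so on an interactive terminal it keeps waiting until EOF is signalled.

```c
#include <stdio.h>
#include <string.h>

/* Reads whitespace-separated words from `in` and joins them with single
   spaces in `out`. Returns the number of words stored. */
static int read_words(FILE *in, char *out, size_t out_size)
{
    char word[100];
    int count = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    while (fscanf(in, "%99s", word) == 1) {
        /* space needed: current text, separator, new word, terminator */
        size_t need = strlen(out) + strlen(word) + (count > 0 ? 1 : 0) + 1;
        if (need > out_size)
            break;                      /* the main string is full */
        if (count > 0)
            strcat(out, " ");
        strcat(out, word);
        count++;
    }
    return count;
}
```

After read_words(stdin, temp, sizeof temp), a comparison such as strcmp(temp, "Hi There") == 0 sees the words joined by exactly one space, whatever whitespace separated them in the input.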

Maybe you can use a getline function. I have used it in VC++, but if it exists in the standard C library too then you are good to go.
Check here: http://www.daniweb.com/software-development/c/threads/253585
http://www.cplusplus.com/reference/iostream/istream/getline/
Hope you find what you are looking for.

I use this to read from stdin and get the same format that you would get by passing as arguments... so that you can have spaces in words and quoted words within a string. If you want to read from a specific file, just fopen it and change the fgets line.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void getargcargvfromstdin(){
char s[255], **av = (char **)malloc(255 * sizeof(char *));
unsigned char i, pos, ac;
for(i = 0; i < 255; i++)
av[i] = (char *)malloc(255 * sizeof(char));
enum quotes_t{QUOTED=0,UNQUOTED}quotes=UNQUOTED;
while (fgets(s,255,stdin)){
i=0;pos=0;ac=0;
while (i<strlen(s)) {
/* '!'=33, 'ÿ'=-1, '¡'=-95 outside of these are non-printables */
if ( quotes && ((s[i] < 33) && (s[i] > -1) || (s[i] < -95))){
av[ac][pos] = '\0';
if (av[ac][0] != '\0') ac++;
pos = 0;
}else{
if (s[i]=='"'){ /* support quoted strings */
if (pos==0){
quotes=QUOTED;
}else{ /* support \" within strings */
if (s[i-1]=='\\'){
av[ac][pos-1] = '"';
}else{ /* end of quoted string */
quotes=UNQUOTED;
}
}
}else{ /* printable ascii characters */
av[ac][pos] = s[i];
pos++;
}
}
i++;
}
//your code here ac is the number of words and av is the array of words
}
}

If the input exceeds the buffer size, you simply can't read it in one call.
You will have to read it in multiple passes.
The maximum size you can scan with scanf() is set by the buffer you pass in, so give %s a matching field width:
char name[100];
scanf("%99s",name);
Read this:
http://sekrit.de/webdocs/c/beginners-guide-away-from-scanf.html
