Why do these two methods of counting words differ significantly? - c

I wrote a program that allows a user to find the number of instances of a word or collection of words in any text file. The user can enter something like this in the command line:
$ ./wordCount Mars TripToMars.txt
to search for the number of instances of the word "Mars" in the book Trip To Mars, or
$ ./wordCount -f collectionOfSearchWords.txt TripToMars.txt
to search for the number of instances of several words on individual lines in collectionOfSearchWords.txt.
To ensure that the program was correct, I used the grep commands:
$ grep -o 'Mars' TripToMars.txt | wc -w
and
$ grep -o -w 'Mars' TripToMars.txt | wc -w
The first command finds the number of instances of the word anywhere, which would include terms like "Marsa", "Marseen", "Marses", etc., while the second command finds only instances of "Mars" as a standalone word, which would include trailing punctuation such as "Mars.", "Mars!", "Mars?", etc.
Both grep commands return 49 as the number of instances of "Mars" in the book.
When I use the code in the while loop below (for simplicity, I'm only including the relevant code), the program returns 49. Awesome!
FILE *textToSearch;
char *readMode = "r";
int count;
char nextWord[100];
char d;
textToSearch = fopen(argVector[argCount-1], readMode);
if (textToSearch == NULL) {
fprintf(stderr, "Cannot open %s to be searched\n", argVector[argCount-1]);
return 1;
} else {
while (fscanf(textToSearch, "%*[^a-zA-Z]"), fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) {
// increment the counter if the word is a match
if (strcmp(nextWord, argVector[word]) == 0) {
count++;
}
}
}
But when I substitute this while loop for the previous one, the program returns 17.
while(1) {
d = fscanf(textToSearch, "%s", nextWord);
if (d == EOF) break;
// increment the counter if the word is a match
if (strcmp(nextWord, argVector[word]) == 0) {
count++;
}
}
So, what's the big difference between
while (fscanf(textToSearch, "%*[^a-zA-Z]"), fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) {}
and
while(1) {
d = fscanf(textToSearch, "%s", nextWord);
if (d == EOF) break;
}
?
EDIT:
I added this code:
if (strcmp(nextWordDict, nextWord) == 0 ||
strcmp(nextWordDict, strcat(nextWord, ".")) == 0 ||
strcmp(nextWordDict, strcat(nextWord, "?")) == 0 ||
strcmp(nextWordDict, strcat(nextWord, "!")) == 0 ||
strcmp(nextWordDict, strcat(nextWord, ",")) == 0) {
count++;
}
to the code producing 17 for Mars to try and account for cases with trailing punctuation, and there was no change. Still 17.
EDIT2:
As John Bollinger correctly points out below, this code does nothing because the string buffered into nextWord would already have the trailing punctuation, and the code would only be adding more. This is faulty thinking on my part.

You are incorrect when you say that the command ...
$ grep -o -w 'Mars' TripToMars.txt | wc -w
... "finds only instances of 'Mars' as a standalone word", or at least that statement is misleading in context. The command finds instances of "Mars" that are not part of a larger word, where a "word" is defined as a contiguous string of letters, digits, and/or underscores. In particular, it will match "Mars" where it is followed by a punctuation mark, which conflicts with what you seem to be claiming.
But what is the difference between your two scanning approaches? Well, this ...
while (fscanf(textToSearch, "%*[^a-zA-Z]"),
fscanf(textToSearch, "%80[a-zA-Z]", nextWord) > 0) { /* ... */ }
... scans zero or more characters that are not Latin letters, ignoring whether any are matched and whether any input error occurs, then scans a contiguous sequence of up to 80 Latin letters, recording that sequence in the nextWord buffer.
On the other hand, this ...
while(1) {
d = fscanf(textToSearch, "%s", nextWord);
if (d == EOF) break;
}
... ignores leading whitespace, then scans the next contiguous string of non-whitespace into nextWord.
The two differ significantly in what they do with characters that are neither Latin letters nor whitespace: the former ignores them, whereas the latter includes them in nextWord. When you then compare nextWord with the string "Mars", the latter misses on
Going to Mars.
and
The name "Mars"
and
Is there water on Mars?
because the adjacent punctuation is included in the comparison. Your text is quite likely to have many constructions similar to those, and your grep commands do not demonstrate otherwise.

Related

Creating a Rhombus with a framed message

Good evening,
I have been tasked with creating a dynamic Rhombus with a frame message inside.
I've been looking for a way to do this for a few hours now, but I'm honestly starting to lose my mind.
Note that my level is pretty basic, so my code doesn't include any complex materials.
I managed to create the upper half of the Rhombus, but I have no idea how to get the message exactly in the middle of it.
The user should be able to enter a string, a size and a character for the frame.
F.ex:
user input is:
5, $, Hello:
$
$ $
$ $
$ $
$ Hello $
$ $
$ $
$ $
$
My code so far (few unreferenced variables, ignore them):
void framedMessage()
{
int size = 6, i, sp, sym, a, b;
char rSymbol = '$', str[100] = { "Hello" };
//while ((strlen(str) % 2) == 0)
//{
// puts("Please Enter a string (even number of letters will be denied):");
// gets(str);
//}
//puts("Please enter a char for the rhombus(NOTE that the letters 'C','M','S','P','E' will return as small chars and not capital:");
//rSymbol = charInput();
//size = numericalInput(strlen(str) + 2, 999);
for (a = 0;a <= size;a++)
{
for (sp = 1;sp <= size;sp++)
{
if (sp == size && a == 0)
printf("%c", rSymbol);
if ((sp == size - a && a != 0))
{
printf("%c", rSymbol);
for (;sp < size + a;sp++)
{
printf(" ");
if (sp == size + a -2)
printf("%c", rSymbol);
}
}
printf(" ");
}
puts("");
}
}
The top half of the Rhombus wasn't so hard, but now I have no idea how to put the message in a way that will fit the space perfectly (have in mind that the string can't have an even amount of characters because it would ruin the rhombus).
I would appreciate any help with this. constructive posts about the code itself (what should I try to improve, avoid using etc..) would be greatly received.
Thanks a lot.
Since in this example you have used 6 as the size. The total no of spaces in next line between the two '$'s will be 11. ie double the value of variable size -1. So you get the length of the string and use the below logic to get the string in the middle.
len = strlen(str);
x = ( (size*2-1) - len)/2
in the case x will be 3
x No of spaces is required before and after str.

Justifying text in C and general array

I'm attempting to fully justify (left and right columns line-up) input from files and this is what I came up with. The input files have embedded commands so from my pseudo output below I start justifying at the company's line and end at telephone As you can see it randomly joins two of the lines read together. Can someone please tell me why it's doing this? My input files definitely have newline characters in them since I double checked they were entered.
Also how do I do the following: Check if my read line will fit into my output array (of 40 char)? If it doesn't I want to move the overflowed string(s) into the next line or char(s) if it's easier. This one isn't as necessary as my first question but I would really like to make the output as nice as possible and I don't know how to restrict and carry overflow from read lines into the next output array.
Since it began to escape from AT&T's Bell Laboratories in
the early 1970's, the success of the UNIX
operating system has led to many different
versions: recipients of the (at that time free) UNIX system
code all began developing their own different
versions in their own different ways for use and sale.
Universities, research
institutes, government bodies and computer
companies all began using the powerful
UNIX system to develop many of the
technologies which today are part of a
UNIX system. Computer aided design,
manufacturing control systems,laboratorysimulations,even the Internet itself,
all began life with and because of UNIX
Today, without UNIX systems, the Internewould come to a screeching halt.
Most telephone calls could not be made,
electronic commerce would grind to a halt and
there would have never been "Jurassic Park"!
Below is my justify function that's passed the read file line using fgets in another function. The printf lines are just for debugging.
void justify(char strin[]){
int i = 0; //strin iterator
int j = 0; //out iterator
int endSpaces = LINE + 1 - strlen(strin);
int voids = countwords(strin) - 1;
printf("Voids: %d\n", voids);
printf("Input: %s", strin);
//No words in line, exit
if (voids <= 0)
return;
//How many to add between words
int addEvenly = endSpaces/voids;
int addUnevenly = endSpaces % voids;
printf("space to distribute: %d evenly: %d unevenly: %d\n", endSpaces, addEvenly, addUnevenly);
//Copy space left of array to output
while (strin[i] == ' '){
outLine[j++] = ' ';
i++;
}
//One word at a time
while (endSpaces > 0 || addUnevenly > 0){
//Copy letters into out
while (strin[i] != ' '){
outLine[j] = strin[i];
i++;
j++;
}
//Add the necessary spaces between words
if (addEvenly > 0){
for (int k = 0; k < addEvenly; k++){
outLine[j++] = ' ';
}
}
//Distribute to the left
if (addUnevenly > 0){
outLine[j++] = ' ';
endSpaces--;
addUnevenly--;
}
printf("Output: %s\n\n", outLine);
endSpaces = endSpaces - addEvenly;
//Finish copying rest of input to output when no more spaces to add
if (endSpaces == 0 && addUnevenly == 0){
while (strin[i] != '\0')
outLine[j++] = strin[i++];
printf("Output 2: %s\n", outLine);
}
}
fprintf(out, "%s", outLine);
}
On sunday I created a function (justifyline()) able to justify and indent a line you give it as input. It outputs a buffer containing the justified (formatted) text and any eventual text-remainder; such a remainder may be used as input to the function justifyline().
After this step I've used the file below (text.txt) to test the behaviour of such a function. That test demonstrates me the need to use also word wrapping between lines. Then I've written the function formatLineByLine(). The function formatLineByLine() doesn't care of void lines.
Text file (text.txt): (I used the text in your question trying to correct it, but not all I've corrected, then the input file suffers of this fact!)
Since it began to escape from AT&T's
Bell Laboratories in the early 1970's,
the success of the UNIX operating system
has led to many different versions:
recipients of the (at that time free)
UNIX system code all began developing
their own different versions in their
own different ways for use and sale.
Universities, research institutes,
government bodies and computer companies
all began using the powerful UNIX system
to develop many of the technologies which
today are part of a UNIX system.
Computer aided design, manufacturing
control systems, laboratory simulations,
even the Internet itself, all began life
with and because of UNIX Today, without
UNIX systems, the Internet would come to a
screeching halt. Most telephone calls
could not be made, electronic commerce
would grind to a halt and there would
have never been "Jurassic Park"!
The output of the function formatLineByLine()
ABCDE12345678901234567890123456789012345
Since it began to escape from
AT&T's Bell Laboratories in the
early 1970's, the success of the
UNIX operating system has led to
many different versions: recipients
of the (at that time free) UNIX
system code all began developing
their own different versions in
their own different ways for use
and sale. Universities, research
institutes, government bodies and
computer companies all began using
the powerful UNIX system to develop
many of the technologies which
today are part of a UNIX system.
Computer aided design,
manufacturing control systems,
laboratory simulations, even the
Internet itself, all began life
with and because of UNIX Today,
without UNIX systems, the Internet
would come to a screeching halt.
Most telephone calls could not be
made, electronic commerce would
grind to a halt and there would
have never been "Jurassic Park"!
Another step is the idea to use a paragraph per paragraph justifycation. Then I've written the function justifyParagraph(). The function formatInParagraphs() reads the file text.txt and prints it justified using the function justifyParagraph().
The output of the function formatInParagraphs()
ABCDE12345678901234567890123456789012345
Since it began to escape from
AT&T's Bell Laboratories in the
early 1970's, the success of the
UNIX operating system has led to
many different versions: recipients
of the (at that time free) UNIX
system code all began developing
their own different versions in
their own different ways for use
and sale.
Universities, research
institutes, government bodies and
computer companies all began using
the powerful UNIX system to develop
many of the technologies which
today are part of a UNIX system.
Computer aided design,
manufacturing control systems,
laboratory simulations, even the
Internet itself, all began life
with and because of UNIX Today,
without UNIX systems, the Internet
would come to a screeching halt.
Most telephone calls could not be
made, electronic commerce would
grind to a halt and there would
have never been "Jurassic Park"!
The function justifyline() is able to create a justified buffer with indentation (parameter size_t indent) and to use also a single space between the words (parameter int nospacing sent as 1).
The function justifyParagraph() is able to create a justified buffer with line indentation (parameter: size_t indent) and 1st line indentation (parameter: size_t indentstart). The formatted output may be directly printed when a NULL output buffer is sent to the function (parameter char **outbuf sent as NULL). The last line the function generates may be justified or not (parameter: int notFrmtLast sent as 1).
Both justification functions, when the parameter char **outbuf points a NULL pointer ( *outbuf == NULL ), allocate memory using malloc() . In this case you have to free the buffer after its use. If this parameter is passed as NULL to the function justifyParagraph(), the function prints the elaborated output, if outbuf is passed as NULL to the function justifyline(), the function returns an error.
The code is below. An issue of this code is that, in some cases, the length of the string should be computed using a function different from strlen(). To avoid this problem you may use these functions with lines that have a single space between the words. Such a problem affects the functions justifyParagraph() and formatLineByLine().
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int justifyLine(char *inbuf, char **outbuf, size_t linelen, char ** endptr, size_t indent, int nospacing);
int justifyParagraph(char *inbuf,char **outbuf,size_t linelen,size_t indentstart,size_t indent,int notFmtLast);
int formatLineByLine(FILE *f, size_t linelen,size_t indent, int notFrmtLast);
int formatInParagraphs(FILE *f, size_t linelen,size_t indentstart,size_t indent, int notFrmtLast);
int justifyParagraph(char *inbuf,char **outbuf,size_t linelen,size_t indentstart,size_t indent,int notFmtLast)
{
char *optr=NULL,*endp=NULL;
size_t len,s;
int retval,nf;
for(;;) { //Error control loop
if (inbuf==NULL) {
retval=0x10;break;
}
if (indent+indentstart>linelen) {
retval=0x20;break;
}
if (outbuf!=NULL) {
if (*outbuf==NULL) {
if ( (*outbuf=malloc(linelen+1))==NULL ){
retval=0x30;break;
}
}
optr=*outbuf;
}
endp=inbuf;
indent+=indentstart;
len=linelen-indent;
s=indentstart;nf=0;
while( *endp!=0) {
if (notFmtLast && strlen(endp)<linelen-indent)
nf=1;
if ( (retval=justifyLine(endp,&optr,linelen,&endp,
indent,nf)) ) {
retval|=0x40;break;
}
if (outbuf!=NULL) {
optr+=strlen(optr);
*optr++='\n';
*optr=0;
} else {
puts(optr);
}
indent-=s;
len+=s;
s=0;
}
break; //Close error ctrl loop!
}
if (outbuf==NULL && optr!=NULL)
free(optr);
return retval;
}
int justifyLine(char *inbuf,char **outbuf,size_t linelen, char ** endptr,size_t indent,int nospacing)
{
size_t textlen,tmp;
size_t spctoadd,spcodd,spcin;
size_t timetoodd;
size_t ibidx,obidx,k,wc;
char * endp;
char * outb=NULL;
int retval=0;
for(;;) { //Error control loop
endp=inbuf;
if (inbuf==NULL) {
retval=1;break;
}
if (indent>linelen) {
retval=2;break;
}
if (outbuf==NULL) {
retval=3;break;
}
if (*outbuf==NULL) {
if ( (*outbuf=malloc(linelen+1))==NULL ){
retval=4;break;
}
}
outb=*outbuf;
//Leave right spaces
while(*inbuf==' ')
inbuf++;
if (*inbuf==0) {
endp=inbuf;
*outb=0;
break; //exit from error loop without error!
}
linelen-=indent;
//Count words and the minimum number of characters
ibidx=0;
wc=0;textlen=0;k=1;endp=NULL;
while ( *(inbuf+ibidx)!=0 ) {
if (*(inbuf+ibidx)==' ') {
ibidx++;continue;
}
//There's a char!
k=ibidx; //last word start
tmp=textlen;
wc++;textlen++; //add the space after the words
//textlen<linelen because textlen contains also the space after the word
// while(textlen<=linelen && *(inbuf+ibidx)!=' ' && *(inbuf+ibidx) ) {
while(*(inbuf+ibidx)!=' ' && *(inbuf+ibidx) ) {
textlen++;ibidx++;
}
if (textlen>linelen+1) {
endp=inbuf+k;
textlen=tmp;
wc--;
break;
}
}
textlen=textlen-wc;
if (endp==NULL) {
endp=inbuf+ibidx;
}
if (textlen<2) {
*outb=0;
break; //exit from error loop without error!
}
//Prepare outbuf
memset(outb,' ',linelen+indent);
*(outb+linelen+indent)=0;
ibidx=0;
obidx=indent;
if (wc>1) {
if (!nospacing) {
//The odds are max in number == wc-2
spctoadd=linelen-textlen;
} else {
spctoadd=wc-1;
}
spcin=spctoadd/(wc-1);
spcodd=spctoadd % (wc-1);
if (spcodd)
timetoodd=(wc-1)/spcodd;
k=timetoodd;
while(spctoadd) {
while(*(inbuf+ibidx)!=' ') {
*(outb+obidx++)=*(inbuf+ibidx++);
}
obidx+=spcin;spctoadd-=spcin;
if (spcodd && !(--k)) {
k=timetoodd;
spcodd--;
spctoadd--;
obidx++;
}
while(*(inbuf+ ++ibidx)==' ');
}
}
while(*(outb+obidx) && *(inbuf+ibidx) && *(inbuf+ibidx)!=' ')
*(outb+obidx++)=*(inbuf+ibidx++);
//There're words longer then the line!!!
if (*(inbuf+ibidx) && *(inbuf+ibidx)!=' ')
endp=inbuf+ibidx;
break; //Terminate error ctrl loop.
}
if (endptr!=NULL)
*endptr=endp;
return retval;
}
int formatLineByLine(FILE *f, size_t linelen,size_t indent, int notFrmtLast)
{
char text[250],*app;
//justifyLine allocates memory for the line if the outbuf (optr) value is NULL
char * optr=NULL;
size_t j,k;
//print a ruler
for(j=0;j<indent;j++)
printf("%c",'A'+(char)j);
for(j=1;j<=linelen-indent;j++)
printf("%c",'0'+(char)(j%10));
printf("\n");
//starts printing
fseek(f,0,SEEK_SET);
j=0;
while(fgets(text+j,sizeof(text)-j,f)) {
if ( (app=strrchr(text+j,'\n')) ) {
*app=0;
}
k=strlen(text);
if (strlen(text)<linelen-indent) {
if (!*(text+k) && *(text+k-1)!=' ') {
*(text+k++)=' ';
*(text+k)=0;
}
j=k;
continue;
}
app=text;
do {
//justifyLine allocates memory for the line if the outbuf (optr) value is NULL
if ( justifyLine(app,&optr,linelen,&app,indent,0) ) {
if (optr!=NULL)
free(optr);
return 1;
}
printf("%s\n",optr);
j=(*app!=0)?strlen(app):0;
} while(j>linelen-indent);
if (j) {
strcpy(text,app);
*(text+j++)=' ';
*(text+j)=0;
}
}
if (*text!=0 && j) {
if ( justifyLine(text,&optr,linelen,NULL,indent,notFrmtLast) )
{
if (optr!=NULL)
free(optr);
return 2;
}
printf("%s\n",optr);
}
//justifyLine allocates memory for the line if the outbuf value is NULL
if (optr!=NULL)
free(optr);
return 0;
}
int formatInParagraphs(FILE *f, size_t linelen,size_t indentstart,size_t indent, int notFrmtLast)
{
char text[1024], *app;
//To uncomment when you use the commented justifyParagraph line.
//see below
//char *outbuf=NULL;
size_t j;
//print a ruler
for(j=0;j<indent;j++)
printf("%c",'A'+(char)j);
for(j=1;j<=linelen-indent;j++)
printf("%c",'0'+(char)(j%10));
printf("\n");
//starts printing
fseek(f,0,SEEK_SET);
j=0;
while(fgets(text+j,sizeof(text),f)) {
if ( (app=strrchr(text+j,'\n')) ) {
*app++=' ';*app=0;
}
if ( *(text+j)==' ' && !*(text+j+1) ) {
//The following commented line allocates memory creating a paragraph buffer!
//doesn't print the formatted line.
//justifyParagraph(text,&outbuf,linelen,indentstart,indent,notFrmtLast);
//This line directly print the buffer allocating and de-allocating
//only a line buffer. It prints the formatted line.
justifyParagraph(text,NULL,linelen,indentstart,indent,notFrmtLast);
j=0;
//To uncomment when you use the commented justifyParagraph line.
// printf("%s\n\n",outbuf);
puts("");
} else {
j+=strlen(text+j);
}
}
return 0;
}
int main(void)
{
FILE * file;
file=fopen("text.txt","r");
formatLineByLine(file,40,5,1);
puts("");
formatInParagraphs(file,40,5,5,1);
fclose(file);
return 0;
}
You were incredibly close – but you forgot one thing!
After copying a word into outLine, you insert the correct number of additional spaces, and continue with 'the next word'. However, at that point the input pointer i still is at the end of the previously copied word (so it points to the first space immediately after that). The test while (strin[i] != ' ') then immediately fails and you insert the additional spaces at that point again. This continues until you run out of spaces to add, and at the very end you add what was not processed, which is "the entire rest of the string".
The fix is simple: after copying your word into outLine, copy the original space(s) as well, so the i iterator gets updated to point to the next word.
//One word at a time
while (endSpaces > 0 || addUnevenly > 0)
{
//Copy letters into out
while (strin[i] != ' ')
{
outLine[j] = strin[i];
i++;
j++;
}
//Copy original spaces into out <-- FIX!
while (strin[i] == ' ')
{
outLine[j] = strin[i];
i++;
j++;
}
With this, your code works entirely as you intended. Output:
|Since it began to escape from AT&T's Bell Laboratories in|
|the early 1970's, the success of the UNIX|
|operating system has led to many different|
|versions: recipients of the (at that time free) UNIX system|
|code all began developing their own different|
|versions in their own different ways for use and sale.|
| Universities, research|
|institutes, government bodies and computer|
|companies all began using the powerful |
|UNIX system to develop many of the |
|technologies which today are part of a |
|UNIX system. Computer aided design, |
|manufacturing control systems,laboratorysimulations,even the Internet itself, |
|all began life with and because of UNIX |
|Today, without UNIX systems, the Internewould come to a screeching halt.|
|Most telephone calls could not be made,|
|electronic commerce would grind to a halt and|
|there would have never been "Jurassic Park"! |
Possible improvements
Justified lines should never begin with whitespace (your Copy space left of array to output part). Just increment the pointer there:
//Copy space left of array to output
while (strin[i] == ' ')
{
// outLine[j++] = ' ';
i++;
endSpaces++;
}
(and move the calculation for How many to add between words below this, because it changes endSpaces).
The same goes for spaces at the end. You can adjust endSpaces at the start
int l = strlen(strin);
while (l > 0 && strin[l-1] == ' ')
{
l--;
endSpaces++;
}
and suppress copying the trailing spaces into outLn at the bottom. (That needs some additional tinkering, I couldn't get it right first time.)
It is much neater to ignore multiple spaces inside the input string as well, but that takes a bit more code.
With these three implemented, you get a slightly neater output:
|Since it began to escape from AT&T's Bell Laboratories in|
|the early 1970's, the success of the UNIX|
|operating system has led to many different|
|versions: recipients of the (at that time free) UNIX system|
|code all began developing their own different|
|versions in their own different ways for use and sale.|
|Universities, research|
|institutes, government bodies and computer|
|companies all began using the powerful|
|UNIX system to develop many of the|
|technologies which today are part of a|
|UNIX system. Computer aided design,|
|manufacturing control systems,laboratorysimulations,even the Internet itself,|
|all began life with and because of UNIX|
|Today, without UNIX systems, the Internewould come to a screeching halt.|
|Most telephone calls could not be made,|
|electronic commerce would grind to a halt and|
|there would have never been "Jurassic Park"!|
A drawback of this one-line-at-a-time method is that it cannot easily be rewritten to gather input until a line overflows. To do so, you need:
a routine that skips all spaces and return a pointer to the next word.
a routine that reads words until a line is 'overfull' – that is, the number of words plus (the number of words - 1) for spaces is larger than your LINE value. This uses routine #1 and outputs exactly one justified line.
You need to pass on the location and number of strings from your main to both these routines, and in both check if you are at the end of either a single input line or the entire input array.
I've written this main that contains two simple methods to center a text in a line. The first method only prints the text variable without modifying it, the second method modifies the text variable and then prints it. (Here method is not intended as function, the code contains two examples which you may translate easily in simple functions)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char text[81],fmt[10];
int linelen=80,tlen;
int spacetocenter=0;
printf("Insert text to center [max %lu char]:\n",sizeof(text)-1);
if (scanf("%[^\n]",text)<1) {
perror("scanf");
return -1;
}
getchar(); //Leaves return from the buffer
tlen=strlen(text);
spacetocenter=(linelen-tlen)/2;
if (spacetocenter<0)
spacetocenter=0;
//Method one (this doesn't modify text)
//This method directly prints the contents of text centered.
//----------------------------------------------------------
snprintf(fmt,sizeof(fmt),"%%%+ds\n",spacetocenter+tlen);
//printf("%s\n",fmt); // prints the used format
printf(fmt,text);
//Method two (this modifies text)
//This method modifies the contents of the variable text
//----------------------------------------------------------
memmove(text+spacetocenter,text,tlen+1);
memset(text,' ',spacetocenter);
printf("%s\n",text);
return 0;
}
Note:
After the second method is applied tlen no longer contains the length of text!
The program consider the line of 80 chars, if you need shorter/longer lines you have to modify the value of the variable linelen.

Filter text from huge .csv files, in C

I have the raw and unfiltered records in a csv file (more than 1000000 records), and I am suppose to filter out those records from a list of files (each weighing more than 282MB; approx. more than 2000000 records). I tried using strstr in C. This is my code:
while (!feof(rawfh)) //loop to read records from raw file
{
j=0; //counter
while( (c = fgetc(rawfh))!='\n' && !feof(rawfh)) //read a line from raw file
{
line[j] = c; line[j+1] = '\0'; j++;
}
//function to extract the element in the specified column, in the CSV
extractcol(line, relcolraw, entry);
printf("\nWorking on : %s", entry);
found=0;
//read a set of 4000 bytes; this is the target file
while( fgets(buffer, 4000, dncfh)!=NULL && !found )
{
if( strstr(buffer, entry) !=NULL) //compare it
found++;
}
rewind(dncfh); //put the file pointer back to the start
// if the record was not found in the target list, write it into another file
if(!found)
{
fprintf(out, "%s,\n", entry); printf(" *** written to filtered ***");
}
else
{
found=0; printf(" *** Found ***");
}
//I hope this is the right way to null out a string
entry[0] = '\0'; line[0] ='\0';
//just to display a # on the screen, to let the user know that the program
//is still alive and running.
rawreccntr++;
if(rawreccntr>=10)
{
printf("#"); rawreccntr=0;
}
}
This program takes approximately 7 to 10 seconds, on an average, to search one entry in the target file (282 MB). So, 10*1000000 = 10000000 seconds :( God knows how much is that going to take if I decide to search in 25 files.
I was thinking of writing a program, and not going to spoon fed solutions (grep, sed etc.). OH, sorry, but I am using Windows 8 (64 bit, 4 GB RAM, AMD processor Radeon 2 core - 1000Mhz). I used DevC++ (gcc) to compile this.
Please enlighten me with your ideas.
Thanks in advance, and sorry if I sound stupid.
Update by Ali, the key information extracted from a comment:
I have a raw CSV file with details for customer's phone number and address. I have the target file(s) in CSV format; the Do Not Call list. I am suppose to write a program to filter out phone number that are not present in the Do No Call List. The phone numbers (for both files) are in the 2nd column. I, however, don't know of any other method. I searched for Boyer-Moore algorithm, however, could not implement that in C. Any suggestions about how should I go about searching for records?
EDITED
I would recommend you have a try with the readymade tools in any Unix/Linux system, grep and awk. You'll probably find they are just as fast and much more easily maintained. I haven't seen your data format, but you say the phone numbers are in the second column, so you can get the phone numbers on their own like this:
awk '{print $2}' DontCallFile.csv
If your phone numbers are in double quotes, you can remove those like this:
awk '{print $2}' DontCallFile.csv | tr -d '"'
Then you can use fgrep with the -f option, to search whether strings listed in one file are present in a second file, like this:
fgrep -f file1.csv file2.csv
or you can invert the search and search for strings NOT present in another file, by adding the -v switch to fgrep.
So, your final command would probably end up like this:
fgrep -v -f <(awk '{print $2}' DontCallFile.csv | tr -d '"') file2.csv
That says... search, in file2.csv for all strings not present (-v option) in column 2 of file "DontCallFile.csv". If you want to understand the bit in <() it is called process substitution and it basically makes a pseudo-file out of the result of running the command inside the brackets. And we need a pseudo-file because fgrep -f expects a file.
ORIGINAL ANSWER
Why are you using fgetc() anyway. Surely you would use getline() like this:
while(getline(myfile,line ))
{
...
}
Are you really reading the whole "target" file from the start for every single line in your main file? That will kill you! And why are you doing it in chunks of 4,000 bytes? And what if one of your strings straddles the 4,000 bytes you compare it with - i.e. the first 8 bytes are in one 4k chunk and the last however many bytes are in the nect 4k chunk?
I think you will get better help on here if you take the time to explain properly what you are trying to do - and maybe do it with awk or grep (at least figuratively) so we can see what you are actually trying to achieve. Your decription doesn't mention the "target" file you use in the code, for example.
You can do this with awk, like this:
awk -F, '
FNR==NR {gsub(/"/,"",$2);dcn[$2]++;next}
{gsub(/ /,"",$2);if(!dcn[$2])print}
' DontCallFile.csv x.csv
That says... the field separator is a comma (-F,). Now read the first file (DontCallFile.csv) and process according to the part in curly braces after FNR==NR. Remove the double quotes from around the phone number in field 2, using gsub (global substitution). Then increment the element in the associative array (i.e. hash) as indexed by unquoted field 2 and then move to next record. So basically, after file "DontCallFile.csv" is processed, the array dcn[] will hold a hash of all the numbers not to call (dcn=dontcallnumbers). Then, the code in the second set of curly braces is executed for each line of the second file ("x.csv"). That says... remove all spaces from around the phone number in field 2. Then, if that phone number is not present in the array dcn[] that we built earlier, print the line.
Here is one idea for improvement...
In the code below, what's the point in setting line[j+1] = '\0' at every iteration?
while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
{
line[j] = c; line[j+1] = '\0'; j++;
}
You might as well do it outside the loop:
while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
line[j++] = c;
line[j] = '\0';
My advice is the following.
Put all don't call phone numbers into an array.
Sort this array.
Use binary search to check if a given phone number is among the sorted
don't call numbers.
In the code below, I just hard-coded the numbers. In your application, you will have to replace that with the corresponding code.
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int compare(const void* a, const void* b) {
return (strcmp(*(char **)a, *(char **)b));
}
int binary_search(const char** first, const char** last, const char* val) {
ptrdiff_t len = last - first;
while (len > 0) {
ptrdiff_t half = len >> 1;
const char** middle = first;
middle += half;
if (compare(&*middle, &val)) {
first = middle;
++first;
len = len - half - 1;
}
else
len = half;
}
return first != last && !compare(&val,&*first);
}
int main(int argc, char** argv) {
size_t i;
/* Read _all_ of your don't call phone numbers into an array. */
/* For the sake of the example, I just hard-coded it. */
char* dont_call[] = { "908-444-555", "800-200-400", "987-654-321" };
/* in your program, change length to the number of dont_call numbers actually read. */
size_t length = sizeof dont_call / sizeof dont_call[0];
qsort(dont_call, length, sizeof(char *), compare);
printf("The don\'t call numbers sorted\n");
for (i=0; i<length; ++i)
printf("%lu %s\n", i, dont_call[i]);
/* For each phone number, check if it is in the sorted dont_call list. */
/* Use binary search to check it. */
char* numbers[] = { "999-000-111", "333-444-555", "987-654-321" };
size_t n = sizeof numbers / sizeof numbers[0];
printf("Now checking if we should call a given number\n");
for (i=0; i<n; ++i) {
int should_call = binary_search((const char **)dont_call, (const char **)dont_call+length, numbers[i]);
char* as_text = should_call ? "no" : "yes";
printf("Should we call %s? %s\n",numbers[i], as_text);
}
return 0;
}
This prints:
The don't call numbers sorted
0 800-200-400
1 908-444-555
2 987-654-321
Now checking if we should call a given number
Should we call 999-000-111? yes
Should we call 333-444-555? yes
Should we call 987-654-321? no
The code is definitely not perfect but it is sufficient to get you started.
The problem with your algorithm is complexity. You approach is O(n*m) where n is number of customers and m is number of do_not_call records (or size of file in your case). You need reduce this complexity. (And Boyer-Moore algorithm would not help there which suggested by Ali. It would not improve asymptotic complexity but only constant.) Even binary search as Ali suggest in his answer is not best. It would be O((n+m)*log m). We can do better. Nice solutions are using fgrep and awk as suggested by Mark Setchell in his answers. (I would chose one using fgrep which should perform better I guess but it is only guess.) I can provide one similar solution in Perl which will provide more robust CSV parsing and should handle your data sizes in easy on decent HW. This type of solutions has complexity O(n+m).
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use constant PHN_COL_DNC => 1;
use constant PHN_COL_CUSTOMERS => 1;
die "Usage: $0 dnc_file [customers]" unless #ARGV>0;
my $dncfile = shift #ARGV;
my $csv = Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1, binary=>1});
my %dnc;
open my $dnc, '<', $dncfile;
while(my $row = $csv->getline($dnc)){
$dnc{$row->[PHN_COL_DNC]} = undef;
}
close $dnc;
while(my $row = $csv->getline(*ARGV)){
$csv->print(*STDOUT, $row) unless exists $dnc{$row->[PHN_COL_CUSTOMERS]};
}
If it would not meet our performance expectation you can go down to C road but I would definitely recommend use some good csv parsing and hashmap libraries. I would try libcsv and khash.h

Advice on Segmentation Fault, using gdb effectively, C Programming (newbie)

I am having a problem with a segmentation fault working in C, and I cannot figure out why this is occurring. I think it has something to do with misuse of the fget(c) function.
while((ch = fgetc(fp))!= EOF) {
printf("Got inside first while: character is currently %c \n",ch); //**********DELETE
while(ch != '\n') {
char word[16]; //Clear out word before beginning
i = i+1; //Keeps track of the current run thru of the loop so we know what input we're looking at.
while(ch != ' ') {
printf("%c ",ch); //**********DELETE
//The following block builds up a character array from the current "word" (separated by spaces) in the input file.
int len = strlen(word);
word[len] = ch;
word[len+1] = '\0';
printf("%s",word);
ch = fgetc(fp);
}
//The following if-else block sets the variables TextA, TextB, and TextC to the appropriate Supply Types from the input.
//This part may be confusing to read mentally, but not to trace. All it does is logically set TextA, B, and C to the 3 different possible values SupplyType.
if(word!=TextB && word!=TextC && i==1 && TextB!="") {
strcpy(TextA,word);
}
else if(word!=TextA && word!=TextC && i==1 && TextC!="") {
strcpy(TextB,word);
}
else if(word!=TextB && word!=TextA && i==1) {
strcpy(TextC,word);
}
switch(i) {
case 1:
if(TextA == word) {
SubTypeOption = 1;
}
else if(TextB == word) {
SubTypeOption = 2;
}
else if(TextC == word) {
SubTypeOption = 3;
}
break;
case 2:
//We actually ultimately don't need to keep track of the product's name, so we do nothing for case i=2. Included for readibility.
break;
case 3:
WholesalePrice = atof(word);
break;
case 4:
WholesaleAmount = atoi(word);
break;
case 5:
RetailPrice = atof(word);
break;
case 6:
RetailAmount = atoi(word);
break;
}//End switch(i)
ch = fgetc(fp);
}//End while(ch != '\n')
//The following if-else block "tallys up" the total amounts of SubTypes bought and sold by the owner.
if(SubTypeOption == 1) {
SubType1OwnersCost = SubType1OwnersCost + (WholesalePrice*(float)WholesaleAmount);
SubType1ConsumersCost = SubType1ConsumersCost + (RetailPrice *(float)RetailAmount);
}
else if(SubTypeOption == 2) {
SubType2OwnersCost = SubType2OwnersCost + (WholesalePrice*(float)WholesaleAmount);
SubType2ConsumersCost = SubType2ConsumersCost + (RetailPrice *(float)RetailAmount);
}
else if(SubTypeOption == 3) {
SubType3OwnersCost = SubType3OwnersCost + (WholesalePrice*(float)WholesaleAmount);
SubType3ConsumersCost = SubType3ConsumersCost + (RetailPrice *(float)RetailAmount);
}
}//End while((ch = fgetc(fp))!= EOF)
Using gdb (just a simple run of the a.out) I found that the problem is related to getc, but it does not tell which line/which one. However, my program does output "Got in side the first while: character is currently S". This S is the first letter in my input file, so I know it is working somewhat how it should, but then causes a seg fault.
Does anyone have any advice on what could be going wrong, or how to debug this problem? I am relatively new to C and confused mostly on syntax. I have a feeling I've done some small syntactical thing wrong.
By the way, this snippet of the code is meant to get a word from a string. Example:
Help me with this program please
should give word equaling "Help"
Update: Now guys I am getting kind of a cool error (although cryptic). When I recompiled I got something like this:
word is now w S
word is now w Su
word is now w Sup
... etc except it goes on for a while, building a pyramid of word.
with my input file having only the string "SupplyTypeA 1.23 1 1.65 1" in it.
UPDATE: Segmentation fault was fixed (the issue was, I was going past the end of the file using fgetc() ). Thanks everyone.
If anyone still glances at this, could they help me figure out why my output file does not contain any of the correct numbers it should? I think I am probably misusing atof and atoi on the words I'm getting.
Make sure you compile the program with -g -O0 options
Next step through the program line by line in GDB, watch and understand what your program is doing. Look at the various variables. This is the essential debugging skill.
WHen it dies type the command 'k' this will give you a stack trace the last line of the trace will have the failing line number, but you know that anyway because you were on the line shen you did a step command
There is no "fget" in good old C, but maybe you're using a more modern version that has something named "fget". Most likely, you meant to use "fgetc". When a C I/O function starts with "f", it usually wants a FILE* handle as an argument, as "fgetc" does. Try using "fgetc" instead, after reading the documentation for it.

Shell program pipes C

I am trying to run a small shell program and the first step to make sure my code is running properly is to make sure I get the correct command and parameters:
//Split the command and store each string in parameter[]
cp = (strtok(command, hash)); //Get the initial string (the command)
parameter[0] = (char*) malloc(strlen(cp)+ 1); //Allocate some space to the first element in the array
strncpy(parameter[0], cp, strlen(cp)+ 1);
for(i = 1; i < MAX_ARG; i++)
{
cp = strtok(NULL, hash); //Check for each string in the array
parameter[i] = (char*) malloc(strlen(cp)+ 1);
strncpy(parameter[i], cp, strlen(cp)+ 1); //Store the result string in an indexed off array
if(parameter[i] == NULL)
{
break;
}
if(strcmp(parameter[i], "|") == 0)
{
cp = strtok(NULL, hash);
parameter2[0] = (char*) malloc(strlen(cp)+ 1);
strncpy(parameter2[0], cp, strlen(cp)+ 1);
//Find the second set of commands and parameters
for (j = 1; j < MAX_ARG; j++)
{
cp = strtok(NULL, hash);
if (strlen(cp) == NULL)
{
break;
}
parameter2[j] = (char*) malloc(strlen(cp)+ 1);
strncpy(parameter2[j], cp, strlen(cp)+ 1);
}
break;
}
I am having a problem when I compare cp and NULL, my program crashes. What I want is to exit the loop once the entries for the second set or parameters have finished (which is what I tried doing with the if(strlen(cp) == NULL)
I may have misunderstood the question, but your program won't ever see the pipe character, |.
The shell processes the entire command line, and your program will only be given it's share of the command line, so to speak.
Example:
cat file1 file2 | sed s/frog/bat/
In the above example, cat is invoked with only two arguments, file1, and file2. Also, sed is invoked with only a single argument: s/frog/bat/.
Let's look at your code:
parameter[0] = malloc(255);
Since strtok() carves up the original command array, you don't have to allocate extra space with malloc(); you could simply point the parameter[n] pointers to the relevant sections of the original command string. However, once you move beyond space-separated commands (in a real shell, the | symbol does not have to be surrounded by spaces, but it does in yours), then you will probably need to copy the parts of the command string around, so this is not completely wrong.
You should check for success of memory allocation.
cp = strtok(command, " "); //Get the initial string (the command)
strncpy(parameter[0], cp, 50);
You allocated 255 characters; you copy at most 49. It might be better to wait until you have the parameter isolated, and then duplicate it - allocating just the space that is needed. Note that if the (path leading to the) command name is 50 characters or more, you won't have a null-terminated string - the space allocated by malloc() is not zeroed and strncpy() does not write a trailing zero on an overlong string.
for (i = 1; i < MAX_ARG; i++)
It is not clear that you should have an upper limit on the number of arguments that is as simple as this. There is an upper limit, but it is normally on the total length of all the arguments.
{
parameter[i] = malloc(255);
Similar comments about memory allocation - and checking.
cp = strtok(NULL, " ");
parameter[i] = cp;
Oops! There goes the memory. Sorry about the leak.
if (strcmp(parameter[i], "|") == 0)
I think it might be better to do this comparison before copying... Also, you don't want the pipe in the argument list of either command; it is a notation to the shell, not part of the command's argument lists. You should also ensure that the first command's argument list is terminated with a NULL pointer, especially since i is set just below to MAX_ARG so you won't know how many arguments were specified.
{
i = MAX_ARG;
cp = strtok(NULL, " ");
parameter2[0] = malloc(255);
strncpy(parameter2[0], cp, 50);
This feels odd; you isolate the command and then process its arguments separately. Setting i = MAX_ARG seems funny too since your next action is to break the loop.
break;
}
if(parameter[i] == NULL)
{
break;
}
}
//Find the second set of commands and parameter
//strncpy(parameter2[0], cp, 50);
for (j = 1; j < MAX_ARG; j++)
{
parameter2[j] = malloc(255);
cp = strtok(NULL, " ");
parameter2[j] = cp;
}
You should probably only enter this loop if you found a pipe. Then this code leaks memory like the other one does (so you're consistent - and consistency is important; but so is correctness).
You need to review your code to ensure it handles 'no pipe symbol' properly, and 'pipe but no following command'. At some point, you should consider multi-stage pipelines (three, four, ... commands). Generalizing your code to handle that is possible.
When writing code for Bash or an equivalent shell, I frequently use notations such as this script, which I used a number of times today.
ct find /vobs/somevob \
-branch 'brtype(dev.branch)' \
-version 'created_since(2011-10-11T00:00-00:00)' \
-print |
grep -v '/0$' |
xargs ct des -fmt '%u %d %Vn %En\n' |
grep '^jleffler ' |
sort -k 4 |
awk '{ printf "%-8s %s %-25s %s\n", $1, $2, $3, $4; }'
It doesn't much matter what it does (but it finds all checkins I made since 11th October on a particular branch in ClearCase); it's the notation that I was using that is important. (Yes, it could probably be optimized - it wasn't worth doing so.) Equally, this is not necessarily what you need to deal with now - but it does give you an inkling of where you need to go.

Resources