getchar() and counting sentences and words in C

I'm creating a program which follows certain rules to result in a count of the words, syllables, and sentences in a given text file.
A sentence is a collection of words separated by whitespace that ends in a . or ! or ?
However, this is also a sentence:
Greetings, earthlings..
The way I've approached this program is to scan through the text file one character at a time using getchar(). I am prohibited from holding the entire text file in memory; it must be processed one character or word at a time.
Here is my dilemma: using getchar() I can find out what the current character is. I just keep calling getchar() in a loop until it returns EOF. But if a sentence has multiple periods at the end, it is still a single sentence. That means I need to know what the character before the one I'm analyzing was, and the one after it. My first thought was another getchar() call, but that creates problems when I go to scan the next character (it has now skipped a character).
Does anyone have a suggestion as to how I could determine that the above sentence is indeed a sentence?
Thanks, and if you need clarification or anything else, let me know.

You just need to implement a very simple state machine. Once you've found the end of a sentence, you remain in that state until you find the start of a new sentence (normally a non-whitespace character other than a terminator such as . ! or ?).

You need an extensible grammar. Look for example at regular expressions and try to build one.
Generally, human language is diverse and not easily parsed, especially if you have colloquial speech to analyze or multiple languages. In some languages it may not even be clear what the distinction between a word and a sentence is.

Related

Removing a substring from a char array without using any libraries in C

I am a first-year computer science student and our teachers gave us a binary pattern search task. We have to remove a substring from a string without using any libraries or built-ins (like memmove or strstr). Our only hint is that it's something to do with '\0'. I don't see how we are going to achieve this, because as far as I know the null character only ends a string, it doesn't remove anything. And given an unknown input it gets even harder to get around. I need help with the usage of the null character. EDIT: Oh, and also we are not allowed to create new arrays. EDIT2: The problem is much more complicated than it looks; if you are here for the solution, read the comments under this thread and the marked solution's comments as well.
C strings are "null terminated" which means they are considered to end wherever a null (written '\0' in C) appears.
If I start with the string "Stack Overflow" and I overwrite the space with '\0', I now have the string "Stack". The storage for "Overflow" still exists, but it is not part of the string according to C functions like strlen(), printf() etc. In fact, if I hold a pointer to the "O" part of the original string, it will be just as if there are two strings: "Stack" and "Overflow", and you can still use both of them.
It's like if I come to where you live and I build a huge wall across the road just before your house. The road is now shortened, and people on my side of it won't know you are there.

Is there a way to compare every line in one text file to one line in another text file in C?

For example, I have an index text file that has 400+ English words, and then I have another text file with decrypted text on each line.
I want to check each English word in my index file with each line of my decrypted text file (so checking 400+ English words for a match per line of decrypted text)
I was thinking of using strncmp(decryptedString, indexString, 10) because I know that strncmp terminates if the next character is NULL.
Each line of my decrypted text file is 352 characters long, and there's ~40 million lines of text stored in there (each line comes from a different output).
This is to decrypt a playfair cipher; I know that my decryption algorithm works because my professor gave us an example to test our program against and it worked fine.
I've been working on this project for six days straight and this is the only part I've been stuck on. I simply can't get it to work. I've tried using
while (getline(&line, &len, decryptedFile) != -1) {
    while (getline(&line2, &len2, indexFile) != -1) {
        if (strncmp(decryptedString, indexString, 10) == 0) {
            fprintf(potentialKey, "%s", key);
        }
    }
}
But I never get any matches. I've tried storing each string into arrays and testing them one character at a time, but that didn't work for me either, since it treated all the English words as if they were on one line. I'm simply lost, so any help or pointers in the right direction would be much appreciated. Thank you in advance.
EDIT: Based on advice from Clifford in the comments, here's an example of what I'm trying to do
Let's say indexFile contains:
HELLO
WORLD
PROGRAMMING
ENGLISH
And the decryptedFile contains
HEVWIABAKABWHWHVWC
HELLOHEGWVAHSBAKAP
DHVSHSBAJANAVSJSBF
WORLDHEEHHESBVWJWU
PROGRAMMINGENGLISH
I'm trying to compare each word from indexFile to decryptedFile, one at a time. So all four words from indexFile will be compared against line 1, then line 2, line 3, line 4, and line 5 in turn.
If what you are trying to do is check to see if an input line starts with a word, you should use:
strncmp(line, word, strlen(word));
If you know that line is longer than word, you can use
memcmp(line, word, strlen(word));
If you are doing that repeatedly with the same word(s), you'd be better off saving the length of the word in the same data structure as the word itself, to avoid recomputing it each time.
This is a common use case for strncmp. Note that your description of strncmp is slightly inaccurate. It will stop when it hits a NUL in either argument, but it only returns equal if both arguments have a NUL in the same place or if the count is exhausted without encountering a difference.
strncmp is safer than depending on the fact that line is longer than word, given that the speed difference between memcmp and strncmp is very small.
However, with that much data and that many words to check, you should try something which reduces the number of comparisons you need to do. You could put the words into a Trie, for example. Or, if that seems like too much work, you could at least categorize them by their first letter and only use the ones whose first letter matches the first letter of the line, if there are any.
If you are looking for an instance of the word(s) anywhere in the line, then you'll need a more sophisticated search strategy. There are lots of algorithms for this problem; Aho-Corasick is effective and simple, although there are faster ones.
If a line of decrypted text is 352 characters long and each word in the index is not 352 characters long, then a line of decrypted text will never match any word in the index.
From this I think you've misunderstood the requirements and asked a question based on the misunderstanding.
Specifically, I suspect that you want to compare each individual word in the decrypted line (and not the whole line) with each word in your index, to determine if all words in the decrypted line are acceptable. To do that, the first step would be to break the decrypted line of characters into individual words - e.g. finding the characters that separate words (spaces, tabs, commas?) within the decrypted text and replacing them with a zero terminator (so that you can use strcmp() and don't need to worry about "foobar" incorrectly matching "foo" just because the first letters match).
Note that there are probably potential optimisations. E.g. if you know that a word from the decrypted text is 8 characters (which you would've had to have known to place the zero terminator in the right spot) and if your index is split into "one list for each word length" (e.g. a list of index words with 3 characters, a list of index words with 4 characters, etc.) then you might be able to skip a lot of string comparisons (and only compare the word from the decrypted line with index words that have the same length). In this case (where you already know both words have the same length) you can also avoid modifying the original 352 characters (you won't need to insert the zero terminator after each word).

Why does this scanf() conversion actually work?

Ah, the age-old tale of a programmer incrementally writing some code that they don't expect to do much of anything, only for the code to unexpectedly do everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "%[a-zA-Z -[\n]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.

Convert escape sequences from user input into their real representation

I'm trying to write an interpreter for LOLCODE that reads escaped strings from a file in the form:
VISIBLE "HAI \" WORLD!"
For which I wish to show an output of:
HAI " WORLD!
I have tried to dynamically generate a format string for printf in order to do this, but it seems that the escaping is done at the stage of declaration of a string literal.
In essence, what I am looking for is exactly the opposite of this question:
Convert characters in a c string to their escape sequences
Is there any way to go about this?
It's a pretty standard scanning exercise. Depending on how close you intend to be to the LOLCODE specification (which I can't seem to reach right now, so this is from memory), you've got a few ways to go.
Write a lexer by hand
It's not as hard as it sounds. You just want to analyze your input one character at a time, while maintaining a bit of context information. In your case, the important context consists of two flags:
one to remember you're currently lexing a string. It'll be set when reading the opening " and cleared when reading the closing ".
one to remember the previous character was an escape. It'll be set when reading \ and cleared when reading the character after that, no matter what it is.
Then the general algorithm looks like: (pseudocode)
loop on: c ← read next character
    if not inString
        if c is '"' then clear buf; set inString
        else [out of scope here]
    else if inEscape then append c to buf; clear inEscape
    else if c is '"' then return buf as result; clear inString
    else if c is '\' then set inEscape
    else append c to buf
You might want to refine the inEscape case should you want to implement \r, \n and the like.
Use a lexer generator
The traditional tools here are lex and flex.
Get inspiration
You're not the first one to write a LOLCODE interpreter. There's nothing wrong with peeking at how the others did it. For example, here's the string parsing code from lci.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a whole run of non-$ characters at a time, but that run could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.