I'm trying to create a scanner with flex that acts somewhat like grep.
Basically, what I want to do is: given a word (regular text, not a regex), find any line in the input that contains a match for that text, then print the line that contains the word.
The problem I've been having is that I can't figure out how to best print the line. I can print everything after the searched word, but I don't know how to properly store the contents of the whole line.
I tried using yyseek(), but when I compile, I get back the message that yyseek is an undefined symbol.
Using yymore() to store text works well for anything after the matched word in the line.
Here is the code that I have so far:
%option yylineno
%option noyywrap
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *search_str = NULL;
char *curr_line = NULL;
%}
%x found
letter [a-zA-Z]
word {letter}+
line (.*)\n
%%
<INITIAL,found>{word} {
/* If a word matches the string that we are looking for, use the 'found'
* condition, which will cause the line to be dumped at the end.
*/
yymore();
if (strcmp(search_str, yytext) == 0) {
BEGIN(found);
}
}
<found>{line} {
yymore();
ECHO;
BEGIN(INITIAL);
}
. { }
\n {}
%%
int main(int argc, char *argv[])
{
if (argc > 1) {
size_t str_len = strlen(argv[1]); /* sizeof(argv[1]) would only give the size of a pointer */
search_str = malloc(str_len + 1);
strcpy(search_str, argv[1]);
yylex();
free(search_str);
return 0;
}
printf("usage: ./a.out [search word]\n");
return 1;
}
This is really not a good use case for flex, and it's not totally clear to me that it will do what you want, either. (I don't actually know what you want, so I could be wrong about that.) But note the following:
Target line            grep night    grep -w night    Your code
-------------------    ----------    -------------    ---------
a night to remember    Yes           Yes              Yes
a knight to forget     Yes           No               No
night23                Yes           No               Yes
Anyway, your instinct about using yymore was correct. You just have to start earlier, so that the entire line is retained in the token. The small complication is that when you need to check a word, you can't check from the beginning of yytext; it contains the whole line up to this point. You have to check the last strlen(search_str) characters. The following code computes that length only once, since doing so requires a complete scan of search_str. Also note that it makes sure it does not overrun the beginning of yytext.
In effect, the following code divides the text into three kinds of tokens: words, non-words, and newlines. Only newline fails to call yymore(), so when the newline rule triggers, yytext contains the entire line. As in your code, once a match is found in a line, the rest of the line is simply added to the match.
(Note: I rewrote this without macros, which are generally overused. I don't see any reason to think that {letter} is more readable than [[:alpha:]], and the latter has the advantage of being clear to anyone who knows flex, without having to search for your particular definition.)
%x FOUND
%%
  /* Indented lines before the first rule are put at the top of yylex */
  int match_length = strlen(search_str);
[^[:alpha:]\n]+   { yymore(); }
[[:alpha:]]+      { yymore();
                    if (yyleng >= match_length
                        && 0 == strcmp(yytext + yyleng - match_length,
                                       search_str))
                      BEGIN(FOUND);
                  }
<INITIAL,FOUND>\n BEGIN(INITIAL);
<FOUND>.*         printf("%s\n", yytext);
The oddity at the end is to deal with inputs which are not correctly terminated with a newline character. The last pattern will print the line with a newline character (even if there isn't one), and the newline character (if there is one) will restart the start condition.
For a slight speed gain, you could remember the previous value of yyleng every time you call yymore(), so that yyleng - prev_yyleng will be the length of "this part" of the token. (The flex scanner knows this value but doesn't provide any interface for you to find it out, which is slightly annoying. But it's not a big deal.) Then instead of checking whether the entire line up to this point is long enough to make a compare possible, you could check whether the last word matched was exactly the right length, which will be true less often, thereby requiring fewer calls to strcmp.
All in all, though, this is not a good strategy. You'll probably find that strstr is faster than flex, and it's only slightly optimised compared to what is possible for a repeated search for the same target. Better would be to implement or find one of the standard search algorithms:
Boyer-Moore
Knuth-Morris-Pratt
Rabin-Karp
etc.
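As a baseline for the strstr suggestion, here is a minimal sketch in plain C of a fixed-string line filter. The name grep_fixed and the 4096-byte line buffer are assumptions made for illustration. It behaves like the "grep night" column of the table above (substring match, no word boundaries), and it examines lines longer than the buffer in pieces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy every line of `in` that contains `needle` to `out`.
 * Returns the number of lines copied.
 * Simplification: a line longer than the buffer is examined in pieces. */
size_t grep_fixed(FILE *in, const char *needle, FILE *out)
{
    char line[4096];
    size_t hits = 0;

    assert(in != NULL && needle != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, needle) != NULL) {
            fputs(line, out);   /* the line still carries its own newline */
            hits++;
        }
    }
    return hits;
}
```

Calling grep_fixed(stdin, argv[1], stdout) from main would give roughly the behaviour asked for in the question, without a generated scanner.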
I am writing a compiler for the mini-L language. I am pretty much done with the scanning part. This is where you basically find tokens in the input. My problem is, when I type in "function" it matches it as an identifier. I can't really tell what's wrong.
ident [a-zA-Z][a-zA-Z0-9_]+[^_]
%{
#include <stdio.h>
#include <stdlib.h>
int col = 0;
int row = 0;
%}
%%
"function" {printf("FUNCTION\n"); col += yyleng;}
{ident} {printf("IDENT %s\n", yytext); col += yyleng;}
. {printf("Error at line %d, column %d: unrecognized symbol \"%s\"\n", row, col, yytext);}
%%
int main(int argc, char** argv){
if(argc > 1){
FILE *fp = fopen(argv[1], "r");
if(fp){
yyin = fp;
}
}
printf("Give me the input:\n");
yylex();
}
I tried rearranging the rules but that didn't work and that doesn't seem like the problem anyway. I think it's a problem with my regex. Any help is much appreciated.
So here's your ident pattern: [a-zA-Z][a-zA-Z0-9_]+[^_]. Let's break that down:
[a-zA-Z] A single letter.
[a-zA-Z0-9_]+ One or more (+) of a letter, a digit, or a _
[^_] Anything other than _. Anything. Such as a space.
One simple observation is that the pattern can't match fewer than three characters. On the other hand, the last one could be almost anything. It could be a letter. But it could also be a comma, or a space, or even a newline. Just not an underscore.
So that won't match x, which is usually considered a valid identifier. But it will match function . (That is, the word function followed by a space.) Since that is longer than function, the identifier pattern will win unless you happened to write function_.
So that's probably not what you wanted.
Note that if you want to avoid identifiers ending with an underscore, you have to deal with two issues.
The obvious pattern [[:alpha:]][[:alnum:]_]*[[:alnum:]] can't match single letters. So you need [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?.
If your identifier pattern doesn't match a trailing underscore, the underscore will be left over for the next token. Maybe you're OK with that. But it's probably better to match the underscore and issue an error message. Or reconsider the restriction.
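To make the optional-group pattern above concrete, here is a small hand-written C check that accepts the same strings as [[:alpha:]]([[:alnum:]_]*[[:alnum:]])? when applied to a whole string. The function name is_valid_ident is hypothetical, and this is only a sketch for experimenting with the rule outside flex, not a replacement for the scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the whole string s is a letter, optionally followed by
 * letters, digits or underscores, with a last character that is not an
 * underscore. Returns 0 otherwise (including for the empty string). */
int is_valid_ident(const char *s)
{
    size_t i;

    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;
    for (i = 1; s[i] != '\0'; i++) {
        if (!isalnum((unsigned char)s[i]) && s[i] != '_')
            return 0;
    }
    /* Here i is the length (at least 1), so s[i - 1] is the last character. */
    return isalnum((unsigned char)s[i - 1]) != 0;
}
```

With this check, x and a_b1 are accepted, while x_, 1a and the empty string are rejected.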
Note: The pattern syntax is documented in the Flex manual, including the built-in character sets I used above.
I have tried different functions including strtok(), strcmp() and strstr(), but I guess I'm missing something. Is there a way to match the exact substring in a string?
For example:
If I have a name: "Tan"
And I have 2 file names: "SomethingTan5346" and "nothingTangyrs634"
So how can I make sure that I match the first string and not both? Because the second file is for the person Tangyrs. Or is it impossible with this approach? Am I going at it the wrong way?
If, as seems to be the case, you just want to identify strings that have your text but are immediately followed by a digit, your best bet is probably to get yourself a good regular expression implementation and just search for Tan[0-9].
It could be done simply by using strstr() to find the string and then checking the character following it with isdigit(), but the actual code to do that would be:
not as easy as you think, since you may have to do multiple searches (e.g., TangoTangoTan42 would need three checks); and
inadvisable if there's a chance the searches may become more complex (such as Tan followed by 1-3 digits or exactly two # characters and an X).
A regular expression library will make this much easier, provided you're willing to invest a little effort into learning about it.
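For reference, a minimal sketch of the Tan[0-9] search using the POSIX <regex.h> interface (regcomp, regexec, regfree). The wrapper name regex_search is made up for this example, and it assumes a POSIX system. A real program would compile the pattern once and reuse it rather than recompiling on every call.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text contains a match for the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern does not
 * compile. */
int regex_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With the file names from the question, regex_search("Tan[0-9]", "SomethingTan5346") reports a match and regex_search("Tan[0-9]", "nothingTangyrs634") does not.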
If you don't want to invest the time in learning regular expressions, the following complete test program should be a good starting point to evaluate a string based on the requirements in the first paragraph:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int hasSubstrWithDigit(const char *lookFor, const char *searchString) {
// Cache length and set initial search position.
size_t lookLen = strlen(lookFor);
const char *foundPos = searchString;
// Keep looking for string until none left.
while ((foundPos = strstr(foundPos, lookFor)) != NULL) {
// If at end, no possibility of following digit.
if (strlen(foundPos) == lookLen) return 0;
// If followed by digit, return true (the cast keeps isdigit well-defined for negative char values).
if (isdigit((unsigned char)foundPos[lookLen])) return 1;
// Otherwise keep looking, from next character.
foundPos++;
}
// Not found, return false.
return 0;
}
int main(int argc, char *argv[]) {
if (argc < 3) {
printf("Usage: testprog <lookFor> <searchIn>...\n");
return 1;
}
for (int i = 2; i < argc; ++i) {
printf("Result of looking for '%s' in '%s' is %d\n", argv[1], argv[i], hasSubstrWithDigit(argv[1], argv[i]));
}
return 0;
}
Though, as you can see, it's not as elegant as a regex search, and is likely to become even less elegant if your requirements change :-)
Running that with:
./testprog Tan xyzzyTan xyzzyTan7 xyzzyTangy4 xyzzyTangyTan12
shows it in action:
Result of looking for 'Tan' in 'xyzzyTan' is 0
Result of looking for 'Tan' in 'xyzzyTan7' is 1
Result of looking for 'Tan' in 'xyzzyTangy4' is 0
Result of looking for 'Tan' in 'xyzzyTangyTan12' is 1
The solution depends on your definition of exact matching.
This might be useful for you:
Traverse all matches of the target substring.
C find all occurrences of substring
Finding all instances of a substring in a string
find the count of substring in string
https://cboard.cprogramming.com/c-programming/73365-how-use-strstr-find-all-occurrences-substring-string-not-only-first.html
etc.
Having the span of the match, verify that the previous and following characters match/do not match your criterion for "exact match".
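A sketch of that traverse-and-verify idea in C. The criterion used here is only an assumption, since the question does not define "exact": a match counts when the characters immediately before and after it are not letters or digits. The function name contains_whole_word is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if needle occurs in haystack with no letter or digit directly
 * before or after it, 0 otherwise. */
int contains_whole_word(const char *haystack, const char *needle)
{
    const char *p = haystack;
    size_t len;

    assert(haystack != NULL && needle != NULL);
    len = strlen(needle);
    if (len == 0)
        return 0;   /* an empty needle is treated as "no match" in this sketch */

    while ((p = strstr(p, needle)) != NULL) {
        int before_ok = (p == haystack) || !isalnum((unsigned char)p[-1]);
        int after_ok = !isalnum((unsigned char)p[len]);   /* p[len] may be '\0' */
        if (before_ok && after_ok)
            return 1;
        p++;   /* resume one character later so every occurrence is visited */
    }
    return 0;
}
```

Swapping the two isalnum tests for a different predicate (for example, requiring isdigit after the match) adapts the same loop to other definitions of an exact match.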
Or,
You could take advantage of regex in C++ (I know the tag is "C"), with #include <regex>, or POSIX #include <regex.h>.
You may want to use strstr(3) to search a substring in a string, strchr(3) to search a character in a string, or even regular expressions with regcomp(3).
You should read more about parsing techniques, notably about recursive descent parsers. In some cases, sscanf(3) with %n can also be handy. You should take care of the return count.
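As an illustration of sscanf(3) with %n and of checking the return count, here is a sketch that assumes (purely for the example) that a file name is a run of ASCII letters followed by a decimal number and nothing else. The name parse_name_number is hypothetical. Two caveats: the A-Z range syntax inside a scanset is implementation-defined, although common C libraries treat it as a range, and %d skips leading white space and accepts a sign.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and stores the trailing number in *number if s consists of
 * letters followed by a decimal number and nothing else; returns 0 otherwise.
 * *number may be modified even when 0 is returned. */
int parse_name_number(const char *s, int *number)
{
    int consumed = -1;   /* stays -1 if sscanf stops before reaching %n */

    assert(s != NULL && number != NULL);
    /* %*[...] is matched but not stored, and %n is not counted, so a full
     * match yields a return value of exactly 1 (for %d). */
    if (sscanf(s, "%*[A-Za-z]%d%n", number, &consumed) != 1)
        return 0;
    /* Accept only if the conversion consumed the entire string. */
    return consumed >= 0 && (size_t)consumed == strlen(s);
}
```

For SomethingTan5346 this yields 5346. Deciding which person the letters belong to still needs one of the other approaches.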
You could loop to read then parse every line, perhaps using getline(3), see this.
You need first to document your input file format (or your file name conventions, if SomethingTan5346 is some file path), perhaps using EBNF notation.
(you probably want to combine several approaches I am suggesting above)
BTW, I recommend limiting (for your convenience) file paths to a restricted set of characters. For example using * or ; or spaces or tabs in file paths is possible (see path_resolution(7)) but should be frowned upon.
I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
char title[80] = "mytitle";
char title2[80] = "mayataiatale";
char mystring[80];
/* hugh ? */
sscanf(title,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "mytitle" */
/* hugh ? */
sscanf(title2,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "m" */
return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char buffer[10+1] = "";
int rc;
while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
{
int c = getchar();
printf("rc = %d\n", rc);
if (rc >= 0)
printf("buffer = <<%s>>\n", buffer);
buffer[0] = '\0';
}
printf("rc = %d\n", rc);
return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n], contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads that character with getchar(), so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended up ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
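A small sketch of that kind of use: splitting a "key = value" line with two scansets, one positive and one negated. The line format, the function name split_setting and the buffer sizes are all assumptions made for the example. The field widths (31 and 63) leave room for the terminating null in each buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Splits a "key = value" line. key must have room for 32 bytes and value
 * for 64. Returns 1 if both parts were read, 0 otherwise. */
int split_setting(const char *line, char key[32], char value[64])
{
    assert(line != NULL && key != NULL && value != NULL);
    /* %31[A-Za-z0-9_] collects an identifier-like key;
     * %63[^\n] collects everything up to the end of the line.
     * White space in the format skips any amount of input white space. */
    return sscanf(line, " %31[A-Za-z0-9_] = %63[^\n]", key, value) == 2;
}
```

The negated scanset is what lets the value keep its embedded spaces, which %s would not allow.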
You can read about it in the ISO/IEC 9899 standard, which is available online.
Here is a paragraph I quote from the document about [ (Page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.
I am trying to write lex code which will take a string as input, and parse through a long dictionary file to find the longest word in that dictionary which is made up of only the letters in that string. Each letter in the string can be used zero or more times, meaning the word "in" would be valid for "input". Here is what I have so far:
%{
#include <stdio.h>
%}
%option noyywrap
%%
[input]+ {
printf("This is the longest I think: %s\n", yytext);
}
.|\n {}
%%
int main(void)
{
yylex();
return 0;
}
However, this really does not do what I expect it to do. This code goes through and prints the matching portions of every word in the dictionary, so I get output like "i", "iu", "inu", etc., and these obviously aren't valid words. Anyone know how to fix this?
You could use the beginning-of-line and end-of-line markers as part of your regular expression to require that the entire line is matched, not just a part of it. Try changing your regex from [input]+ to
^[input]+$
You will then need some separate logic to track the longest string you've found so far, but judging from the code you have above I think this more directly addresses your question at hand.
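The tracking logic itself is small. Here is a sketch of it in plain C rather than inside flex actions (a swapped-in approach, not the scanner from the question), using strspn to test whether a word consists only of the allowed letters. The function name longest_word_from and the in-memory word array are assumptions; in the scanner the same comparison would sit in the rule's action, with yytext and yyleng in place of words[i] and len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the longest word made up only of characters from letters, or NULL
 * if no word qualifies. On ties, the earliest word in the array wins. */
const char *longest_word_from(const char *const words[], size_t count,
                              const char *letters)
{
    const char *best = NULL;
    size_t best_len = 0;
    size_t i;

    assert(words != NULL && letters != NULL);
    for (i = 0; i < count; i++) {
        size_t len = strlen(words[i]);
        /* strspn gives the length of the leading run built only from
         * letters; the word qualifies when that run covers all of it. */
        if (len > best_len && strspn(words[i], letters) == len) {
            best = words[i];
            best_len = len;
        }
    }
    return best;
}
```

The result would be printed once, after the last word has been seen, instead of printing from inside the rule.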
Hope this helps!
Can anyone explain to me the purpose of ungetch?
This is from K&R chapter 4 where you create a Reverse Polish Calculator.
I've run the program without the call to ungetch and in my tests it still works the same.
int getch(void) /* get a (possibly pushed back) character */
{
if (bufp > 0)
{
return buf[--bufp];
}
else
{
return getchar();
}
}
void ungetch(int c) /* push character back on input */
{
if (bufp >= BUFSIZE)
{
printf("ungetch: too many characters\n");
}
else
{
buf[bufp++] = c;
}
}
(I've removed the ternary operator in getch to make it clearer.)
I don't know about the specific example you're referring to (it's probably 23 years since I read K&R, and that was the first edition), but often when parsing it's convenient to 'peek' at the next character to see if it is part of what you're currently parsing. For instance, if you're reading a number you want to keep reading digits until you come to a non-digit. ungetc lets the number reader look at the next character without consuming it so that someone else can read it. In Greg Hewgill's example of "2 3+", the number reader would read the 3 digit, then read the plus sign and know the number is finished, then ungetc the plus sign so that it can be read later.
Try running the program without spaces around operators. I don't recall precisely the format of that example and I don't have K&R handy, but instead of using "2 3 +" try "2 3+". The ungetch() is probably used when parsing numbers, as the number parser will read digits until it gets something that is a non-digit. If the non-digit is a space, then the next getch() will read the + and all is well. However, if the next non-digit is a +, then it will need to push that back onto the input stream so the main read loop can find it again.
Hope I'm remembering the example correctly.
It's used a lot for lexical scanners (the part of the compiler that breaks your text into chunks like variable names, constants, operators, etc.). The function isn't necessary for the scanner, it's just very convenient.
When you're reading a variable name, for example, you don't know when you're done until you read a character that can't be part of the variable name. But then you have to remember that character and find a way to communicate it to the next chunk of the lexer. You could create a global variable or something, or pass it to the caller--but then how do you return other things, like error codes? Instead, you ungetch() the character to put it back into the input stream, do whatever you need to with your variable name and return. Then when the lexer starts reading the next chunk, it doesn't have to look around for extra characters lying around.
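The same read-one-too-many-then-push-back pattern, sketched with the standard library's getc and ungetc instead of the K&R getch and ungetch pair. The function name read_number is hypothetical, and overflow of the accumulated value is not handled in this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Reads an unsigned decimal number from in. Returns 1 and stores the value
 * on success, or 0 if the next character is not a digit. The first
 * character that is not part of the number is pushed back onto the stream. */
int read_number(FILE *in, int *value)
{
    int c;
    int n = 0;
    int seen_digit = 0;

    assert(in != NULL && value != NULL);
    while ((c = getc(in)) != EOF && isdigit(c)) {
        n = n * 10 + (c - '0');
        seen_digit = 1;
    }
    if (c != EOF)
        ungetc(c, in);   /* not ours: leave it for whoever reads next */
    if (seen_digit)
        *value = n;
    return seen_digit;
}
```

On the input 23+ this reads 23, and the next getc on the same stream still returns the plus sign. Without the ungetc call the plus sign would be lost, which only shows up when no space separates the number from the operator.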
Take a look at this code, you'll understand:
#include <conio.h>
#include <stdio.h>
int main()
{
int y=0;
char t[10];
int u=0;
ungetch('a');
t[y++]=getch();
ungetch('m');
t[y++]=getch();
ungetch('a');
t[y++]=getch();
ungetch('z');
t[y++]=getch();
ungetch('z');
t[y++]=getch();
ungetch('a');
t[y++]=getch();
ungetch('l');
t[y++]=getch();
ungetch('\0');
t[y++]=getch();
ungetch('\0');
t[y++]=getch();
ungetch('\0');
t[y++]=getch();
printf("%s",t);
return 0;
}