How to tokenize this? - lexical-analysis

I am trying to hand-code a tokenizer. I keep reading characters that can be part of a token. For example, an integer can only contain digits. So in the text below I keep reading characters until I find a non-digit character, and I get 123 as the token. Next I get ( as a token, and then abc as an identifier. This is fine, since ( is a delimiter.
123(abc
However, in the text below I get 123 as an integer and then abc as an identifier. But this is actually not valid, since there is no delimiter between them.
123abc(
Should the tokenizer check for delimiters and report an error? If so, what tokens should be returned, and where should the tokenizer continue reading from after an invalid token is found?
Or should the tokenizer simply return 123 as integer and abc as identifier and let the parser detect the errors?

Usually, the tokenizer (or lexer) performs no checking of valid syntax.
The role of a lexer is to split the input into tokens, which can then be transformed into a syntax tree by the parser. Therefore, it'd usually be the job of the parser to perform such a check.

This is somewhat of a gray area, but most hand-coded lexers just do the tokenizing and let the parser decide whether the stream of tokens makes sense.

If "123abc" is an invalid token then you should handle it as soon as you spot it since it's directly related to the way tokens are defined, not how they interact with each other (which would be the lexer's job). It's an orthographic error rather than a grammar-related one.
There are multiple ways to go about it:
You could abort the parsing and just throw some exception, leaving the caller with no tokens or just the tokens you had successfully parsed until then. This will save you any "recovery" logic and might be enough for your use case. Although, if you're parsing stuff for syntax highlighting for instance, this would probably not be sufficient as you don't want all of the remaining code to look unparsed.
Example: A conforming XML parser could use this for fatal errors if there's no need to handle malformed markup, just spit out a basic error and quit.
Alternatively, you could insert an "error" token with proper metadata about the nature of the error and skip ahead to the next valid token.
You might need some heuristics in your lexer to handle the error token gracefully and to decide how to interpret further tokens when an error token is found inside a nested expression (for example, should you consider the expression has ended? look for a closing token? etc.).
Anyway, this approach will allow for error tokens to be used to display precise info about the location and nature of errors encountered (think inline error reporting in a GUI).

You might consider generating your tokenizer or lexer; tools like Flex or ANTLR should help. And you might also generate your parser with ANTLR or Bison.
If you insist on hand-coding your lexer (and your parser), having some look-ahead is extremely helpful in practice. For instance, you could read your input line by line and tokenize inside the current line (with the ability to inspect the next few characters).
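To make both suggestions concrete, here is a minimal hand-coded sketch in C of the error-token idea combined with one character of look-ahead (the Token type and next_token are invented names, not a prescribed API): after a run of digits, it peeks at the next character, and if the number runs straight into an identifier it consumes the whole run as a single error token, so scanning resumes at a clean boundary.
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_INT, TOK_IDENT, TOK_DELIM, TOK_ERROR, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the input text */
    size_t length;
} Token;

static Token next_token(const char **p) {
    const char *s = *p;
    Token t;
    while (isspace((unsigned char)*s)) s++;   /* skip whitespace */
    t.start = s;
    t.length = 0;
    if (*s == '\0') {
        t.kind = TOK_EOF;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s)) s++;
        /* Look-ahead: digits running straight into an identifier
         * character ("123abc") is malformed, so swallow the whole
         * run as one error token instead of returning TOK_INT. */
        if (isalpha((unsigned char)*s) || *s == '_') {
            while (isalnum((unsigned char)*s) || *s == '_') s++;
            t.kind = TOK_ERROR;
        } else {
            t.kind = TOK_INT;
        }
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        while (isalnum((unsigned char)*s) || *s == '_') s++;
        t.kind = TOK_IDENT;
    } else {
        s++;                      /* single-character delimiter such as '(' */
        t.kind = TOK_DELIM;
    }
    t.length = (size_t)(s - t.start);
    *p = s;                       /* resume scanning after the token */
    return t;
}
On "123abc(" this yields one TOK_ERROR spanning "123abc" and then TOK_DELIM for "(", which answers both questions: the lexer reports the error, and it continues reading from the first character after the malformed run.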

Related

Unable to form the required regex in C

I am trying to write a regex which can search a string and return true if it matches the regex and false otherwise.
The check should ensure the string is a wildcard domain name of a website.
Example:
*.cool.dude is valid
*.cool is not valid
abc.cool.dude is not valid
So I had written something like this:
\\*\\.[.*]\\.[.*]
However, this also allows a *.. string as valid, because * means 0 or more occurrences.
I am looking for something which ensures that at least one occurrence happens.
Example:
*.a.b -> valid but *.. -> invalid
How do I change the regex to support this?
I have already tried doing something like this:
\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesn't work
\\*\\.([.+])\\.(.+) -> doesn't work
^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesn't work
I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.
PS. Looking for a solution which works in C.
[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) *". […] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.
So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
But that's probably not really what you want. My guess is that you want something like:
^[*]([.][A-Za-z0-9_]+){2,}$
although you'll have to correct the character class to specify the precise set of characters you think are legitimate.
Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.
Some notes:
The {2,} requires at least two components after the *, so that *.cool doesn't match.
The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.
Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.
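As a sketch of how that might be wired up with POSIX <regex.h> (the wrapper name is_wildcard_domain is invented; the pattern and the REG_EXTENDED flag are as above):
#include <regex.h>
#include <stdio.h>

/* Returns 1 if s looks like "*.cool.dude", 0 otherwise. */
static int is_wildcard_domain(const char *s) {
    regex_t re;
    int ok;
    /* REG_EXTENDED is essential: without it, +, {2,} and () are not special. */
    if (regcomp(&re, "^[*]([.][A-Za-z0-9_]+){2,}$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                 /* compilation failed; treat as no match */
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    printf("%d\n", is_wildcard_domain("*.cool.dude"));  /* 1 */
    printf("%d\n", is_wildcard_domain("*.cool"));       /* 0 */
    printf("%d\n", is_wildcard_domain("*.."));          /* 0 */
    return 0;
}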

Parsing shell commands in c: string cutting with respect to its contents

I'm currently creating a Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls - works.
The thing is that even before I start making tokens, I split the whole command string into separate words. It's based on an array of separators, and it works surprisingly well. Except that I'm struggling with adding extra functionality to it, like escape sequences or quotes. I can't really live without that, since even people using basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator or double quotes found between those two, pass this as one string, don't include these quotation marks into resulting word,
" "- same as above, but ignore single quotes,
\\ - escape this into single backslash,
\(space) - escape this into space, do not parse resulting space as separator
\", \' - analogously to the above.
many other things that I haven't yet figured out I need,
and every single one of them seems like an exception of its own. Each must work in a diversity of possible positions in commands, may or may not be included in the result, and influences the rest of the parsing. It makes my code look like a big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (lexing and parsing) using a nontraditional structure for the lexer ("I split the whole command string into separate words..."). OK, then you will have non-traditional troubles with getting the lexer "right".
That doesn't mean that way is doomed to failure, and without seeing specific instances of your problem, (you list a set of constructs you want to handle, but don't say why these are hard to process), it is hard to provide any specific advice. It also doesn't mean that way will lead to success; splitting the line may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that it provides a proven approach to complex lexing problems, based generally on the idea that one can use regular expressions to describe the shape of individual lexemes. You get one regexp per lexeme type, an ocean of them in total, but each one is pretty easy to specify by itself.
I've done about 40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective. The types of surprises are well understood and manageable. A nonstandard approach always has the risk that it will surprise you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.
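That said, if you do stay with a hand-rolled splitter, the usual way to keep it from becoming a big ball of mud is to make the lexer states explicit rather than special-casing each construct. A minimal sketch in C (the state names and split_words are invented; overflow checks and unterminated-quote errors are omitted for brevity) covering the quoting rules from the question:
#include <ctype.h>
#include <stdio.h>

enum state { NORMAL, IN_SINGLE, IN_DOUBLE };

static void split_words(const char *cmd) {
    enum state st = NORMAL;
    char word[1024];              /* fixed size; a real shell would grow it */
    size_t n = 0;
    int have_word = 0;
    const char *p;

    for (p = cmd; *p != '\0'; p++) {
        char c = *p;
        switch (st) {
        case NORMAL:
            if (c == '\\' && p[1] != '\0') {
                word[n++] = *++p;               /* \x becomes literal x */
                have_word = 1;
            } else if (c == '\'') {
                st = IN_SINGLE; have_word = 1;  /* quotes not copied */
            } else if (c == '"') {
                st = IN_DOUBLE; have_word = 1;
            } else if (isspace((unsigned char)c)) {
                if (have_word) { word[n] = '\0'; puts(word); n = 0; have_word = 0; }
            } else {
                word[n++] = c; have_word = 1;
            }
            break;
        case IN_SINGLE:                         /* everything literal until ' */
            if (c == '\'') st = NORMAL; else word[n++] = c;
            break;
        case IN_DOUBLE:                         /* like above, but \" and \\ work */
            if (c == '\\' && (p[1] == '"' || p[1] == '\\')) word[n++] = *++p;
            else if (c == '"') st = NORMAL;
            else word[n++] = c;
            break;
        }
    }
    if (have_word) { word[n] = '\0'; puts(word); }
}
Calling split_words("grep \"hello world\" file.txt") prints grep, hello world and file.txt on separate lines; each new rule becomes one more case or branch instead of an exception scattered through the code.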

Why is buffering used in lexical analysis?

Why is buffering used in lexical analysis? And what is the best value for EOF?
EOF is typically defined as (-1).
In my time I have made quite a number of parsers using lex/yacc and flex/bison, and even a hand-written lexical analyser and an LL(1) parser. 'Buffering' is rather vague and could mean multiple things (input characters or output tokens), but I can imagine that the lexical analyzer has an input buffer where it can look ahead. When analyzing 'for (foo=0;foo<10;foo++)', the token for the keyword 'for' is produced once the space following it is seen. The token for the first identifier 'foo' is produced once it sees the character '='. It will want to pass the name of the identifier to the parser and therefore needs a buffer, so the word 'foo' is still somewhere in memory when the token is produced.
Speed of lexical analysis is a concern.
Also, the lexer needs to look several characters ahead in order to find a match.
The lexical analyzer scans the input string character by character, from left to right, and those input characters are read from hard disk or secondary storage. That can require a lot of system calls, depending on the size of the program, and can make the system slow. That's why we use an input buffering technique.
An input buffer is a location that holds the incoming information before it continues to the CPU for processing.
You can find more information here:
https://www.geeksforgeeks.org/input-buffering-in-compiler-design/
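To make both points concrete, here is a minimal sketch of a buffered reader in C (the Reader type and function names are invented): one fread call fetches thousands of characters at once, and rd_peek gives the lexer its look-ahead character without consuming it.
#include <stdio.h>

#define BUF_SIZE 4096

typedef struct {
    FILE *f;
    unsigned char buf[BUF_SIZE];
    size_t pos, len;
} Reader;

static void rd_init(Reader *r, FILE *f) {
    r->f = f;
    r->pos = r->len = 0;
}

/* Look at the next character without consuming it; -1 (EOF) at end. */
static int rd_peek(Reader *r) {
    if (r->pos == r->len) {                      /* buffer empty: refill */
        r->len = fread(r->buf, 1, BUF_SIZE, r->f);
        r->pos = 0;
        if (r->len == 0) return -1;              /* true end of input */
    }
    return r->buf[r->pos];
}

/* Consume and return the next character. */
static int rd_next(Reader *r) {
    int c = rd_peek(r);
    if (c != -1) r->pos++;
    return c;
}
While scanning 'for', the lexer calls rd_next three times and then rd_peek once: if the peeked character is a space it can emit the keyword token, and the space is still there for the next call.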

lex and yacc output

How can I modify my lex or yacc files to output the same input to a file? I read the statements from a file; I want to add an invariant for special statements, insert it into the output, and then continue with the remaining statements. For example, I read this file:
char mem(d);
int fun(a,b);
char a ;
The output should be like:
char mem(d);
int fun(a,b);
invariant(a>b) ;
char a;
I can't do this. I can only write the new statements to the output file.
It's useful to understand why this is a non-trivial question.
The goal is to
Copy the entire input to the output; and
Insert some extra information produced while parsing.
The problem is that the first of those needs to be done by the scanner (lexer), because the scanner doesn't usually pass every character through to the parser. It usually drops whitespace and comments, at least. And it may do other things, like converting numbers to their binary representation, losing the original textual representation.
But the second one obviously needs to be done by the parser. And here is the problem: the parser is (almost) always one token behind the scanner, because it needs the lookahead token to decide whether or not to reduce. Consequently, by the time a reduction action gets executed, the scanner will already have processed all the input up to the end of the next token. If the scanner is echoing input to output, the place where the parser wants to insert data has already been output.
Two approaches suggest themselves.
First, the scanner could pass all of the input to the parser, by attaching extra data to every token. (For example, it could attach all whitespace and comments to the following token.) That's often used for syntax coloring and reformatting applications, but it can be awkward to get the tokens output in the right order, since reduction actions are effectively executed in a post-order walk.
Second, the scanner could just remember where every token is in the input file, and the parser could attach notes (such as additional output) to token locations. Then the input file could be read again and merged with the notes. Unfortunately, that requires that the input be rewindable, which would preclude parsing from a pipe, for example; a more general solution would be to copy the input into a temporary file, or even just keep it in memory if you don't expect it to be too huge.
Since you can already output your own statements, your problem is how to write out the input as it is being read in. In lex, the value of each token being read is available in the variable yytext, so just write it out for every token you read. Depending on how your lexer is written, this could be used to echo whitespace as well.
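A sketch of that idea, in the same style as the lex rules further down this page (out is an assumed FILE* you open yourself, and IDENT an assumed token name; neither is part of the question):
[ \t\n]+                { fputs(yytext, out); /* echo whitespace verbatim */ }
[A-Za-z_][A-Za-z0-9_]*  { fputs(yytext, out); return IDENT; }
.                       { fputs(yytext, out); return *yytext; }
Every rule copies yytext to out before returning, so the output is a character-for-character copy of the input; the parser's actions can then fprintf extra lines such as the invariant, subject to the one-token-behind caveat described above.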

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example, if I have a string that contains ASCII codes escaped as '$' followed by some number of digits and a ';', and the data looked like this: "quick $33;brown $126;fox $a $12a", the string going to the other computer would be "quick !brown ~fox $a $12a" ($33; decodes to '!' and $126; to '~').
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about, it is called lexical analysis, or tokenization, or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize the input and perform different actions on it.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
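The wrap-around indexing itself is tiny; a sketch (buf_at is an invented name):
#define BUF_SIZE 4096                 /* a power of two, so & can replace % */

static char buf[BUF_SIZE];

/* Map an ever-increasing absolute position onto the circular buffer. */
static char buf_at(unsigned long pos) {
    return buf[pos % BUF_SIZE];       /* equivalently: buf[pos & (BUF_SIZE - 1)] */
}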
Now you just look at the characters until you see a $, output what you've looked at up to that point, and then, knowing that you are in a different state because you saw a $, look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data you had read in. How to handle the case where a $ is seen without a well-formed number followed by a ; wasn't entirely clear in your question; what to do if there are a million digits before you see the ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a whole string of non-$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits, followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a pattern (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
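If you'd rather not pull in lex, here is a hand-rolled sketch of the same state machine in C (filter is an invented name). Because the state lives outside the read loop, a control string that falls across a buffer boundary is handled with no extra code, which was the first problem in the question:
#include <stdio.h>
#include <stdlib.h>

enum { PLAIN, IN_ESCAPE };            /* copying text, or inside "$<digits>" */

static void filter(FILE *in, FILE *out) {
    char chunk[1000];
    char num[4];                      /* up to 3 digits plus NUL */
    size_t n, i, numlen = 0;
    int st = PLAIN;

    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        for (i = 0; i < n; i++) {
            char c = chunk[i];
            if (st == PLAIN) {
                if (c == '$') { st = IN_ESCAPE; numlen = 0; }
                else fputc(c, out);
            } else if (c >= '0' && c <= '9' && numlen < 3) {
                num[numlen++] = c;    /* still inside a possible escape */
            } else if (c == ';' && numlen > 0) {
                num[numlen] = '\0';
                fputc(atoi(num), out);            /* emit the decoded byte */
                st = PLAIN;
            } else {                  /* malformed: pass "$<digits>" through */
                fputc('$', out);
                fwrite(num, 1, numlen, out);
                st = PLAIN;
                if (c == '$') { st = IN_ESCAPE; numlen = 0; }
                else fputc(c, out);
            }
        }
    }
    if (st == IN_ESCAPE) {            /* input ended mid-escape */
        fputc('$', out);
        fwrite(num, 1, numlen, out);
    }
}
Calling filter(stdin, stdout) turns "quick $33;brown $126;fox $a $12a" into "quick !brown ~fox $a $12a", matching the example in the question.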
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
