Recognising multiple new_lines in PEGKit - parsekit

I am learning how to use PEGKit, but am running into problem with creating a grammar for a script that parses lines, even when they are separated by multiple line break characters. I have reduced the problem to this grammar:
expr
#before {
PKTokenizer *t = self.tokenizer;
self.silentlyConsumesWhitespace = NO;
t.whitespaceState.reportsWhitespaceTokens = YES;
self.assembly.preservesWhitespaceTokens = YES;
}
= Word nl*;
nl = nl_char nl_char*;
nl_char = '\n'! | '\r'!;
This simple grammar to me should allow one word per line, with as many line breaks as necessary. But it only allows one word with an optional line break. Does anybody know what's wrong here? Thank you.

Creator of PEGKit here.
Try the following grammar instead (make sure you are using HEAD of master):
#before {
PKTokenizer *t = self.tokenizer;
[t.whitespaceState setWhitespaceChars:NO from:'\\n' to:'\\n'];
[t.whitespaceState setWhitespaceChars:NO from:'\\r' to:'\\r'];
[t setTokenizerState:t.symbolState from:'\\n' to:'\\n'];
[t setTokenizerState:t.symbolState from:'\\r' to:'\\r'];
}
lines = line+;
line = ~eol* eol+; // note the `~` Not unary operator. this means "zero or more NON eol tokens, followed by one or more eol token"
eol = '\n'! | '\r'!;
Note that here, I am tweaking the tokenizer to recogognize newlines and carriage returns as Symbols rather than whitespace. That makes it easier to match and discard them (they are discarded by the ! operator).
For another approach to the same problem using the builtin S whitespace rule, see here.

Related

Markup matches and escape special characters outside the matches at the same time

I have a search functionality for a treeview that highlights all matches, incl. distinction between caseless and case-sensitive, as well as distinction between regular expression and literal. However, I have a problem when the current cell contains special characters that are not part of the matches. Consider the following text inside a treeview cell:
father & mother
Now I want to do for example a search on the whole treeview for the letter 'e'. For highlighting the matches only and not the whole cell, I need to use markup. To achieve this, I use g_regex_replace_eval and its callback function in the way as stated inside the GLib documentation. The resulting new marked up text for the cell would be like this:
fath<span background='yellow' foreground='black'>e</span>r &
moth<span background='yellow' foreground='black'>e</span>r
If there are special characters inside the matches, they are escaped before being added to the hashtable that is used by the eval function. So special characters inside matches are no problem.
But I have the '&' now outside the markup parts, and it has to be changed to &, otherwise the markup won't show up in the cell and a warning
Failed to set text from markup due to error parsing markup: Error on line x: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &
will be shown inside the terminal.
If I use g_markup_escape_text on the new cell text, it will obviously not only escape the '&', but also the '<' and '>' of the markup, so this is no solution.
Is there a reasonable way to put markup around the matches and escape special characters outside the markup at the same time or with a view steps? Everything I could imagine so far is much too complicated, if it would work at all.
Even though I had already considered Philip's suggestion in most of its parts before asking my question, I had not touched yet the subject of utf8, so he gave an important hint for the solution. The following is the core of a working implementation:
gchar *counter_char = original_cell_txt; // counter_char will move through all the characters of original_cell_txt.
gint counter;
gunichar unichar;
gchar utf8_char[6]; // Six bytes is the buffer size needed later by g_unichar_to_utf8 ().
gint utf8_length;
gchar *utf8_escaped;
enum { START_POS, END_POS };
GArray *positions[2];
positions[START_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
positions[END_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
gint start_position, end_position;
txt_with_markup = g_string_new ("");
g_regex_match (regex, original_cell_txt, 0, &match_info);
while (g_match_info_matches (match_info)) {
g_match_info_fetch_pos (match_info, 0, &start_position, &end_position);
g_array_append_val (positions[START_POS], start_position);
g_array_append_val (positions[END_POS], end_position);
g_match_info_next (match_info, NULL);
}
do {
unichar = g_utf8_get_char (counter_char);
counter = counter_char - original_cell_txt; // pointer arithmetic
if (counter == g_array_index (positions[END_POS], gint, 0)) {
txt_with_markup = g_string_append (txt_with_markup, "</span>");
// It's simpler to always access the first element instead of looping through the whole array.
g_array_remove_index (positions[END_POS], 0);
}
/*
No "else if" is used here, since if there is a search for a single character going on and
such a character appears double as 'm' in "command", between both m's a span tag has to be
closed and opened at the same position.
*/
if (counter == g_array_index (positions[START_POS], gint, 0)) {
txt_with_markup = g_string_append (txt_with_markup, "<span background='yellow' foreground='black'>");
// See the comment for the similar instruction above.
g_array_remove_index (positions[START_POS], 0);
}
utf8_length = g_unichar_to_utf8 (unichar, utf8_char);
/*
Instead of using a switch statement to check whether the current character needs to be escaped,
for simplicity the character is sent to the escape function regardless of whether there will be
any escaping done by it or not.
*/
utf8_escaped = g_markup_escape_text (utf8_char, utf8_length);
txt_with_markup = g_string_append (txt_with_markup, utf8_escaped);
// Cleanup
g_free (utf8_escaped);
counter_char = g_utf8_find_next_char (counter_char, NULL);
} while (*counter_char != '\0');
/*
There is a '</span>' to set at the end; because the end position is one position after the string size
this couldn't be done inside the preceding loop.
*/
if (positions[END_POS]->len) {
g_string_append (txt_with_markup, "</span>");
}
g_object_set (txt_renderer, "markup", txt_with_markup->str, NULL);
// Cleanup
g_regex_unref (regex);
g_match_info_free (match_info);
g_array_free (positions[START_POS], TRUE);
g_array_free (positions[END_POS], TRUE);
Probably the way to do this is to not use g_regex_replace_eval(), but rather to use g_regex_match_all() to get the list of matches for a string. Then you need to step through the string character-by-character (do this using the g_utf8_*() functions, since this has to be Unicode-aware). If you get to a character which needs to be escaped (<, >, &, ", '), output the escaped entity for it. When you get to a match position, output the correct markup for it.
I'd escape the whole text first using g_markup_escape_text, then escape the text to search and use it in g_regex_replace_eval. This way escaped text can be matched, and text not matched is already escaped.

How to disable parsing for a piece of text in a file?

Structure of my file is :
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass VHDL text as it is to the output file.I can do that by making a default rule at the end of lex file as:
Rule: . { append_to_buffer(*yytext); }
I also have list of other rules in my Lex file to deal with the tokens.
The problem i am having is how to deal with the situation in which VHDL text is also containing some of the tokens that can be recognized by the Lex rules?
In other words ,i want to disable detecting further valid token one i found the text i am interesting in and again start detection once it is over.
As rici points out indirectly you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters where we find that '{' and '}' are not used as delimiters in VHDL.
They are other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) requiring handling where graphic characters may appear.
graphic_character ::=
basic_graphic_character | lower_case_letter | other_special_character
These include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
There's a need to identify these lexical elements within the production otherwise identifying '}' as a delimiter for the rule.
Only one tool directive is currently defined (24.1 Protect tool directives) wherein the use of the two curly bracket characters would be contained in VHDL lexical elements. All other uses in lexical elements are directly delimited. (And you could disclaim tool directive support, in VHDL they basically also invoke separate lexical, syntactical and semantic analysis).
Essentially you need to operate a VHDL lexical analyzer for traversing 'VHDL text' where you're rule delimiter right curly bracket will stand out like a sore thumb (as an exception, serving as the closing delimiter for VHDL text).
And about now you'd get the idea life would be easier if you could deal with VHDL by reference instead if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.
When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*{ { brace_depth = 1;
BufferClear(vhdl_text);
BEGIN(VHDL); }
<VHDL>"{" { BufferAppend(vhdl_text, *yytext);
brace_depth++; }
<VHDL>"}" { if (--brace_depth == 0) {
BEGIN(INITIAL);
yylval.buf = BufferExtract(vhdl_text);
return VHDL_TEXT; }
BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]\" { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]\\ { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text.) You can do macro substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT} { if (Macro *mac = lookup_macro_by_name(yytext)) {
BufferAppendString(vhdl_text, mac->replacement);
} else {
BufferAppendString(vhdl_text, yytext); } }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.

Checking for a blank line in C - Regex

Goal:
Find if a string contains a blank line. Whether it be '\n\n',
'\r\n\r\n', '\r\n\n', '\n\r\n'
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex outside of simple use of * when removing files in command line.
Is it possible to check for all of these cases (listed above) in one regex? or do I have to do 4 seperate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
regex_t r;
compile_regex(&r, "*\n\n");
match_regex(&r, reader);
return 0;
}
void compile_regex(regex_t *r, char *matchText) {
int status;
regcomp(r, matchText, 0);
}
int match_regex(regex_t *r, char *reader) {
regmatch_t match[1];
int nomatch = regexec(r, reader, 1, match, 0);
if (nomatch) {
printf("No matches.\n");
} else {
printf("MATCH!\n");
}
return 0;
}
Notes:
I only need to worry about finding one blank line, that's why my regmatch_t match[1] is only one item long
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.
It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern, \s here instead of [ \t], because that would include carriage return and new-line.)
As others have already hinted at, the "simple use of * in the command line` is not a regular expression. This wildcard-matching is called file globbing and has different semantics.
Check what the * in a regex means. It's not like the wildcard "anything" in the command line. The * means that the previous component can appear any amount of times. The wildcard in regex is the .. So if you want to say match anything you can do .*, which would be anything, any amount of times.
So in your case you can do .*\n\n.* which would match anything that has \n\n.
Finally, you can use or in a regex and ( ) to group stuff. So you can do something like .*(\n\n|\r\n\r\n).* And that would match anything that has a \n\n or a \r\n\r\n.
Hope that helps.
Rather than looking for only \r or \n, look for not \r or \n?
Your regex would simply be
'[^\r\n]'
and a match result of false indicates a blank line to your specification.

Replacing C++ single-line comments with C89 comments

What's a good way to replace single-line // input number comments with multi-line /* input number */ comments?
I don't have any preference for the language used to accomplish the task; I was thinking of Perl or sed. The source language will be C (ANSI X3.159-1989).
Simple scripts like
while(<>) {
if (m#^(.*?)//#) {
print $1;
} else {
print $_;
}
}
would be fooled by strings containing // and are not OK. Similarly, // inside a multi-line comment should be left alone.
Edit: Code can assume that there are no trigraphs.
This is the opposite of replace C style comments by C++ style comments. It is similar to Replacing // comments with /* comments */ in PHP (though the accepted answer there cannot handle the special cases I mentioned and so is arguably wrong).
You can use boost::wave lexer's output to replace all the c++ style comments to C style comments. without getting bothered about the edge cases.
#include <iostream>
#include <fstream>
#include <boost/wave/cpplexer/cpp_lex_token.hpp>
#include <boost/wave/cpplexer/cpp_lex_iterator.hpp>
typedef boost::wave::cpplexer::lex_token<> token_type;
typedef boost::wave::cpplexer::lex_iterator<token_type> token_iterator;
typedef token_type::position_type position_type;
int main()
{
const char* infile = "infile.h";
const char* outfile = "outfile.h";
std::string instr;
std::stringstream outstrm;
std::string cmt_str;
std::ifstream instream(infile);
std::ofstream outstream(outfile);
if(!instream.is_open()) {
std::cerr << "Could not open file: "<< infile<<"\n";
}
if(!outstream.is_open()) {
std::cerr << "Could not open file: "<< outfile<<"\n";
}
instream.unsetf(std::ios::skipws);
instr = std::string(std::istreambuf_iterator<char>(instream.rdbuf()),
std::istreambuf_iterator<char>());
position_type pos(infile);
token_iterator it = token_iterator(instr.begin(), instr.end(), pos,
boost::wave::language_support(boost::wave::support_cpp|boost::wave::support_option_long_long));
token_iterator end = token_iterator();
boost::wave::token_id id = *it;
while(it!=end) {
//here you check the c++ style comments
if(id == boost::wave:: T_CPPCOMMENT) {
std::cout<<"Found CPP COMMENT";
cmt_str = it->get_value();
cmt_str[0] = '/';
cmt_str[1] = '*';
//since the last token is the new_line token so replace the new line
cmt_str[cmt_str.size()-1] = '*';
cmt_str.push_back('/');
//and then append the newline at the end of the string
cmt_str.push_back('\n');
outstrm<<cmt_str;
}
else {
outstrm<<it->get_value();
}
++it;
id = *it;
}
outstream<<outstrm;
return 0;
}
For further documentation please see:
http://www.boost.org/doc/libs/1_47_0/libs/wave/index.html
There are a lot of corner cases to consider. Stray //s can appear in string literals, character constants (yes, really), and within /* ... */ comments and // comments. Line-splicing with trailing \ characters can really mess things up -- and a \ can be represented as the trigraph ??/. I seriously doubt that I've thought of all of them.
If you need a 100% reliable replacement, you're going to have to reproduce (or steal!) part of the preprocessor of a C compiler.
If you don't need 100% reliability, you might consider just doing a naive replacement, then comparing the input to the output and manually cleaning up any problems. (For typically code, it's likely that there won't be any, but you'll need to check.) The practicality of this approach depends in part on how much code you need to translate.
Most of the corner cases will result in code that won't compile:
printf("Hello // world\n");
-->
print("Hello /* world\n"); */
You might also consider whether this is really necessary. Most C89/C90 compilers do support // comments, at least optionally.
This won't cover 100% of corner cases, but it covers the ones you mentioned in your request.
#!/usr/bin/env python
import re
from sys import stdin, stdout
for line in stdin.readlines():
line = line[:-1] # Trim the newline
stripped = re.sub(r'[\'"].*[\'"]', '', line) # Ignore strings
stripped = re.sub(r'/\*.*\*/', '', stripped) # Ignore multi-line comments
m = re.match(r'.*?//(.*)', stripped) # Only match actual C++-style comments
if m:
offset = len(m.group(1)) + 2
content = line[:offset*-1] # Get the original line sans comment
print '%s/* %s */' % (content, m.group(1)) # Combine the two with C-style comments
else:
print line

REPL for interpreter using Flex/Bison

I've written an interpreter for a C-like language, using Flex and Bison for the scanner/parser. It's working fine when executing full program files.
Now I'm trying implement a REPL in the interpreter for interactive use. I want it to work like the command line interpreters in Ruby or ML:
Show a prompt
Accept one or more statements on the line
If the expression is incomplete
display a continuation prompt
allow the user to continue entering lines
When the line ends with a complete expression
echo the result of evaluating the last expression
show the main prompt
My grammar starts with a top_level production, which represents a single statement in the language. The lexer is configured for interactive mode on stdin. I am using the same scanner and grammar in both full-file and REPL modes, because there's no semantic difference in the two interfaces.
My main evaluation loop is structured like this.
while (!interpreter.done) {
if (interpreter.repl)
printf(prompt);
int status = yyparse(interpreter);
if (status) {
if (interpreter.error)
report_error(interpreter);
}
else {
if (interpreter.repl)
puts(interpreter.result);
}
}
This works fine except for the prompt and echo logic. If the user enters multiple statements on a line, this loop prints out superfluous prompts and expressions. And if the expression continues on multiple lines, this code doesn't print out continuation prompts. These problems occur because the granularity of the prompt/echo logic is a top_level statement in the grammar, but the line-reading logic is deep in the lexer.
What's the best way to restructure the evaluation loop to handle the REPL prompting and echoing? That is:
how can I display one prompt per line
how can I display the continuation prompt at the right time
how can I tell when a complete expression is the last one on a line
(I'd rather not change the scanner language to pass newline tokens, since that will severely alter the grammar. Modifying YY_INPUT and adding a few actions to the Bison grammar would be fine. Also, I'm using the stock Flex 2.5.35 and Bison 2.3 that ship with Xcode.)
After looking at how languages like Python and SML/NJ handle their REPLs, I got a nice one working in my interpreter. Instead of having the prompt/echo logic in the outermost parser driver loop, I put it in the innermost lexer input routine. Actions in the parser and lexer set flags that control the prompting by input routine.
I'm using a reentrant scanner, so yyextra contains the state passed between the layers of the interpreter. It looks roughly like this:
typedef struct Interpreter {
char* ps1; // prompt to start statement
char* ps2; // prompt to continue statement
char* echo; // result of last statement to display
BOOL eof; // set by the EOF action in the parser
char* error; // set by the error action in the parser
BOOL completeLine // managed by yyread
BOOL atStart; // true before scanner sees printable chars on line
// ... and various other fields needed by the interpreter
} Interpreter;
The lexer input routine:
size_t yyread(FILE* file, char* buf, size_t max, Interpreter* interpreter)
{
// Interactive input is signaled by yyin==NULL.
if (file == NULL) {
if (interpreter->completeLine) {
if (interpreter->atStart && interpreter->echo != NULL) {
fputs(interpreter->echo, stdout);
fputs("\n", stdout);
free(interpreter->echo);
interpreter->echo = NULL;
}
fputs(interpreter->atStart ? interpreter->ps1 : interpreter->ps2, stdout);
fflush(stdout);
}
char ibuf[max+1]; // fgets needs an extra byte for \0
size_t len = 0;
if (fgets(ibuf, max+1, stdin)) {
len = strlen(ibuf);
memcpy(buf, ibuf, len);
// Show the prompt next time if we've read a full line.
interpreter->completeLine = (ibuf[len-1] == '\n');
}
else if (ferror(stdin)) {
// TODO: propagate error value
}
return len;
}
else { // not interactive
size_t len = fread(buf, 1, max, file);
if (len == 0 && ferror(file)) {
// TODO: propagate error value
}
return len;
}
}
The top level interpreter loop becomes:
while (!interpreter->eof) {
interpreter->atStart = YES;
int status = yyparse(interpreter);
if (status) {
if (interpreter->error)
report_error(interpreter);
}
else {
exec_statement(interpreter);
if (interactive)
interpreter->echo = result_string(interpreter);
}
}
The Flex file gets these new definitions:
%option extra-type="Interpreter*"
#define YY_INPUT(buf, result, max_size) result = yyread(yyin, buf, max_size, yyextra)
#define YY_USER_ACTION if (!isspace(*yytext)) { yyextra->atStart = NO; }
The YY_USER_ACTION handles the tricky interplay between tokens in the language grammar and lines of input. My language is like C and ML in that a special character (';') is required to end a statement. In the input stream, that character can either be followed by a newline character to signal end-of-line, or it can be followed by characters that are part of a new statement. The input routine needs to show the main prompt if the only characters scanned since the last end-of-statement are newlines or other whitespace; otherwise it should show the continuation prompt.
I too am working on such an interpreter, I haven't gotten to the point of making a REPL yet, so my discussion might be somewhat vague.
Is it acceptable if given a sequence of statements on a single line, only the result of the last expression is printed? Because you can re-factor your top level grammar rule like so:
top_level = top_level statement | statement ;
The output of your top_level then could be a linked list of statements, and interpreter.result would be the evaluation of the tail of this list.

Resources