How can I modify my lex or yacc files to echo the input to an output file? I read the statements from a file; for special statements I want to add an invariant, insert it into the output, and then continue with the remaining statements. For example, I read this file:
char mem(d);
int fun(a,b);
char a ;
The output should be like:
char mem(d);
int fun(a,b);
invariant(a>b) ;
char a;
I can't do this. I can only write the new statements to the output file.
It's useful to understand why this is a non-trivial question.
The goal is to
Copy the entire input to the output; and
Insert some extra information produced while parsing.
The problem is that the first of those needs to be done by the scanner (lexer), because the scanner doesn't usually pass every character through to the parser. It usually drops at least whitespace and comments, and it may do other things, like convert numbers to their binary representation, losing the original textual representation.
But the second one obviously needs to be done by the parser. And here is the problem: the parser is (almost) always one token behind the scanner, because it needs the lookahead token to decide whether or not to reduce. Consequently, by the time a reduction action gets executed, the scanner will already have processed all the input data up to the end of the next token. If the scanner is echoing input to output, the place where the parser wants to insert data has already been output.
Two approaches suggest themselves.
First, the scanner could pass all of the input to the parser, by attaching extra data to every token. (For example, it could attach all whitespace and comments to the following token.) That's often used for syntax coloring and reformatting applications, but it can be awkward to get the tokens output in the right order, since reduction actions are effectively executed in a post-order walk.
Second, the scanner could just remember where every token is in the input file, and the parser could attach notes (such as additional output) to token locations. Then the input file could be read again and merged with the notes. Unfortunately, that requires that the input be rewindable, which would preclude parsing from a pipe, for example; a more general solution would be to copy the input into a temporary file, or even just keep it in memory if you don't expect it to be too huge.
Since you can already output your own statements, your problem is how to write out the input as it is being read in. In lex, the value of each token being read is available in the variable yytext, so just write it out for every token you read. Depending on how your lexer is written, this could be used to echo whitespace as well.
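For instance, a minimal flex sketch of that idea might look like the following (the token names CHAR and IDENTIFIER are assumptions; adapt the rules to your own grammar):

```
%%
[ \t\n]+                { fputs(yytext, yyout); /* echo whitespace too */ }
"char"                  { fputs(yytext, yyout); return CHAR; }
[a-zA-Z_][a-zA-Z0-9_]*  { fputs(yytext, yyout); return IDENTIFIER; }
.                       { fputs(yytext, yyout); return *yytext; }
```

Every rule echoes yytext before returning, so the output receives an exact copy of the input; the parser's actions can then fprintf() their invariant lines to the same yyout stream.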
Related
Why is buffering used in lexical analysis? And what is the best value for EOF?
EOF is typically defined as (-1).
In my time I have made quite a number of parsers using lex/yacc, flex/bison, and even a hand-written lexical analyzer and an LL(1) parser. 'Buffering' is rather vague and could mean multiple things (input characters or output tokens), but I can imagine that the lexical analyzer has an input buffer where it can look ahead. When analyzing 'for (foo=0;foo<10;foo++)', the token for the keyword 'for' is produced once the space following it is seen. The token for the first identifier 'foo' is produced once the analyzer sees the character '='. It will want to pass the name of the identifier to the parser and therefore needs a buffer, so that the word 'foo' is still somewhere in memory when the token is produced.
Speed of lexical analysis is a concern.
Also, the lexer needs to check several characters ahead in order to find a match.
The lexical analyzer scans the input string character by character, from left to right, and those input characters are read from a hard disk or other secondary storage. Reading them one at a time can require a lot of system calls, depending on the size of the program, and can make the system slow. That's why we use an input buffering technique.
An input buffer is a memory area that holds the incoming characters before they are passed on for processing.
You can find more information here:
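As a small illustration of the idea, here is a sketch that pulls the input through a 4 KB buffer with fread() instead of issuing one fgetc() call per character (the function name and buffer size are arbitrary choices):

```c
#include <stdio.h>

/* Scan a stream through a fixed-size buffer: one fread() call
   refills thousands of characters at once, instead of one
   library/system call per character. */
long count_chars_buffered(FILE *fp)
{
    char buf[4096];
    size_t n;
    long total = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;   /* a real lexer would scan buf[0..n-1] here */
    return total;
}
```

The same loop structure is what a lexer's refill step looks like: process the buffer, then ask for the next chunk.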
https://www.geeksforgeeks.org/input-buffering-in-compiler-design/
I would like some suggestions on how to read an XML-like file in such a way that the program only reads/stores elements found in a node that meets some requirements. I was thinking about using two fgets calls in the following way:
while (fgets(file_buffer, line_buffer, fp) != NULL)
{
    if ((p_str = strstr(file_buffer, "<element of interest opening")) != NULL)
    {
        // new fgets loop that starts at fp and runs only until the end of the node
        {
            // read and process
        }
    }
}
Does this make sense or are there smarter ways of doing this?
Secondly (in my idea), will I have to define a new FILE* (like fr) and set fr to fp at the start of the second fgets, or can I somehow reuse the original file pointer for that?
Use an XML parser like libxml2: http://xmlsoft.org/xml.html
Your approach isn't bad for the job.
You could read the whole line from the file, then process it using sscanf(), strstr(), or whatever functions you like. This will save you time and unnecessary FILE I/O overhead.
As per your second idea, you can use fseek() (see man fseek) or rewind() (see man rewind) with the same file pointer fp. You do not need an extra file pointer.
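A small sketch of that: remember the position with ftell(), read ahead, and jump back with fseek() on the same fp (the helper name and file contents here are made up for illustration):

```c
#include <stdio.h>

/* Read ahead one line from the saved position, then rewind to it
   with fseek() -- no second FILE* needed. */
int peek_next_line(FILE *fp, char *out, size_t size)
{
    long pos = ftell(fp);           /* remember where we are */
    if (pos < 0 || fgets(out, (int)size, fp) == NULL)
        return -1;
    fseek(fp, pos, SEEK_SET);       /* go back for the real read */
    return 0;
}
```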
EDIT:
If you could change the tag format to adhere to XML structure you will be able to use libXML2 and such libraries properly.
If that's not possible, then you have to write your own parser.
A few pointers:
First, extract data from the file into a buffer. The size of the buffer, and whether it is dynamically or statically allocated, will depend on your specs.
Check whether the first non-whitespace character in the buffer is < (or whatever character your tags usually begin with). If not, show an error and exit.
Next comes the tag name, which runs until the first whitespace, / or > character. Store it. Process the =, strings and such as you wish.
If the next non-whitespace character is /, check that it is followed by > (or whatever pattern your specs use to end a tag). If so, you've finished parsing and can return your result. Otherwise, you've got a malformed tag and should exit with an error.
If the character is >, then you've found the end of the begin tag. The content follows.
Otherwise, what follows is an attribute. Parse that, store the result, and continue at step 4.
Read the content until you find a < character.
If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >. If yes, return the result; else, throw an error.
If you get here, you've found the beginning of a nested element. Parse it with this same algorithm and then continue at step 4 again.
Although this is quite a basic idea, I hope it helps you get started.
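A much-simplified sketch of the extraction part of those steps, assuming tags have no attributes and are not nested (the function name is made up):

```c
#include <stdio.h>
#include <string.h>

/* Extract the text between <tag> and </tag> into out.
   Simplified: assumes the tag has no attributes and is not
   nested.  Returns 0 on success, -1 if the element is absent. */
int extract_element(const char *buf, const char *tag,
                    char *out, size_t outsize)
{
    char open[64], close[64];
    snprintf(open, sizeof open, "<%s>", tag);
    snprintf(close, sizeof close, "</%s>", tag);

    const char *start = strstr(buf, open);
    if (!start) return -1;
    start += strlen(open);

    const char *end = strstr(start, close);
    if (!end) return -1;

    size_t len = (size_t)(end - start);
    if (len >= outsize) len = outsize - 1;  /* truncate if needed */
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

A real parser would recurse on nested tags and validate attributes, as the steps above describe.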
EDIT:
If you still want to reference the file through a pointer, consider using mmap().
If you add mmap() with a bit of shared-memory IPC and adequate memory locking, you could write a parallel processing program that will process most of your files faster.
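A minimal mmap() sketch (POSIX only; error handling trimmed, and the caller is responsible for munmap()):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file into memory so it can be scanned with plain
   pointer arithmetic instead of read()/fgets() calls.
   Returns NULL on failure; caller must munmap(p, *len). */
char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                   MAP_PRIVATE, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```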
I have taken up a project and I would like some help. Basically it is a program to check whether some pins are connected or not on a board.
(Well, that's the simplified version. The whole thing is a circuit with a microcontroller.)
The problem is that, when a pin is connected I get a numeric value, and when it's not connected, I get no value, as in it's a blank in my table.
How can I accept these values?
I need to accept even the blank, to know that it's not connected; plus, the table contains some other non-numeric values as well.
I tried reading the file using the fscanf() function but it didn't quite work. I'm aware of only fscanf(), fread(), fgets() and fgetc() functions to read from different kinds of files.
Also, is it possible to read data from an Excel file using C?
An example of the table is:
FROM TO
1 39
2
Here, the numbers 1 and 2 are under the column FROM, and each tells which pin the first end of the connector is connected to. The numbers under TO tell us which pin the other end of the connector is connected to; when the column is blank, the connector is not connected at that end.
Now what I'm trying to do is create a program that generates an assembly language program for the microcontroller, so I need to be able to read whether the connector is connected, and if it is, to which pin. Accordingly, I need to perform some operations (which I can manage by myself).
The difficulty I'm facing is reading from a specific line and reading the blank.
Read the lines using fgets() or a relative. Then use sscanf() on the line, checking to see whether there were one or two successful conversions (the return value). If there's one conversion, the second value was empty or missing; if two, then you have both numbers safely.
Note that fscanf() and relatives will read past newlines unless you're careful, so they do not provide the information you need.
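A sketch of that check on one line (the function name is made up; call it on each line you get from fgets()):

```c
#include <stdio.h>

/* sscanf's return value says how many conversions succeeded:
   2 = both pins present, 1 = the TO column was blank,
   0 = header or non-numeric line. */
int parse_row(const char *line, int *from, int *to)
{
    return sscanf(line, "%d %d", from, to);
}
```

Because sscanf() operates on the single line you hand it, a missing TO value cannot be silently filled in from the next line, which is exactly the fscanf() pitfall noted above.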
So your file is more like this:
Col1 col2\n
r1val1 r1val2\n
...
and so on. If this is the case, then use fscanf() (or better, fgets()) to read each line (up to \n) from the file. Then use the strtok() function to break the string into tokens; here is a tutorial on it:
http://www.gnu.org/s/hello/manual/libc/Finding-Tokens-in-a-String.html
Hope this helps.
One more humble suggestion: if you are a newbie, work on plain C programming first rather than going straight to microcontrollers, as there are many things you might understand the wrong way if you don't know some of the basic concepts.
This is a common problem in C. When line boundaries carry meaning in the grammar, it's difficult to directly read the file using only the scanf()-family functions.
Just read each line with fgets(3) and then run sscanf() on one line at a time. By doing this you won't incorrectly jump ahead to read the next line's first column.
Since there are two values on a line, you can parse the first, find the next whitespace, then parse the next, checking for its absence as well. I say parse rather than scanf() because when I really want control, or have a huge volume of numbers to scan, I use calls in the strtol() family.
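For example, a strtol()-based version (the function name is made up): the endptr argument tells you where each number ended, so a blank second column shows up as endptr not advancing.

```c
#include <stdlib.h>

/* Parse up to two integers from a line with strtol().
   Returns the number of values found (0, 1, or 2). */
int parse_pins(const char *line, long *from, long *to)
{
    char *end;

    *from = strtol(line, &end, 10);
    if (end == line) return 0;      /* no first number at all */

    line = end;
    *to = strtol(line, &end, 10);
    if (end == line) return 1;      /* TO column was blank */

    return 2;
}
```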
I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about, it is called lexical analysis, tokenization, or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formed number followed by a ; wasn't entirely clear in your question -- what to do if there are a million digits before you see ;, for instance.
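The state machine just described might look like this in C; a minimal sketch that works on one complete string and sidesteps the buffer-boundary problem (the function name is made up, and a $ not followed by 1-3 digits and ; is passed through verbatim):

```c
#include <stdlib.h>

/* Decode "$<1-3 digits>;" escapes in src into dst.
   A '$' not matching that pattern is copied through unchanged. */
void decode(const char *src, char *dst)
{
    while (*src) {
        if (*src == '$') {
            const char *p = src + 1;
            char num[4];
            int n = 0;
            while (n < 3 && p[n] >= '0' && p[n] <= '9') {
                num[n] = p[n];
                n++;
            }
            if (n > 0 && p[n] == ';') {   /* well-formed escape */
                num[n] = '\0';
                *dst++ = (char)atoi(num);
                src = p + n + 1;
                continue;
            }
        }
        *dst++ = *src++;                  /* ordinary character */
    }
    *dst = '\0';
}
```

With chunked reads, the refinement is to stop processing when a partial escape reaches the end of the buffer and carry those bytes over to the next refill.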
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in brackets because $ is special in many regular expression representations when it appears at the end of a pattern (which it would here) and means "match only if at the end of line"; the brackets make it literal.
Anyway, in this case it recognizes a dollar sign wherever it is not matched by the other, longer pattern.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
I have a scientific application for which I want to input initial values at run time. I have an option to get them from the command line, or to get them from an input file. Either of these options are input to a generic parser that uses strtod to return a linked list of initial values for each simulation run. I either use the command-line argument or getline() to read the values.
The question is, should I be rolling my own parser, or should I be using a parser-generator or some library? What is the standard method? This is the only data I will read at run time, and everything else is set at compile time (except for output files and a few other totally simple things).
Also check out strtof() for floats, strtod() for doubles.
sscanf is probably the standard way to parse them.
However, there are some problems with sscanf, especially if you are parsing user input.
And, of course, atof.
In general, I prefer to have data inputs come from a file (e.g. the initial conditions for the run, the total number of timesteps, etc), and flag inputs come from the command line (e.g. the input file name, the output file name, etc). This allows the files to be archived and used again, and allows comments to be embedded in the file to help explain the inputs.
If the input file has a regular format:
For parsing, read in a full line from the file, and use sscanf to "parse" the line into variables.
If the input file has an irregular format:
Fix the file format so that it is regular (if that is an option).
If not, then strtof and strtod are the best options.
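A minimal sketch of such a strtod()-based parser (storing into an array here rather than a linked list, for brevity; the function name is made up):

```c
#include <stdlib.h>

/* Walk a line of whitespace-separated numbers with strtod();
   endptr advances past each successful conversion, so the loop
   stops cleanly at the first thing that isn't a number. */
size_t parse_doubles(const char *s, double *out, size_t max)
{
    size_t n = 0;
    char *end;
    while (n < max) {
        double v = strtod(s, &end);
        if (end == s) break;    /* no more numbers */
        out[n++] = v;
        s = end;
    }
    return n;
}
```

Feed it each line from getline() or the command-line arguments, and append the results to your linked list.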