Semantics of comma-separated values with bison - c

I'm trying to give a semantic value to a list of comma-separated values. In fact, I have defined the reduction rules for bison using
commasv : exp
        | commasv "," exp
where exp is a number, a variable, a function pointer or, also, a commasv token with its respective syntax and semantic rules. The type of exp is double, so the type of commasv must be double.
The thing is that I want to store the list in order to use it, for example, in a function call. For instance
h = create_object()
compute_list(h,1,cos(3.14159))
will give the expected result of a certain compute_list function.
As a basis I've used the mfcalc example from the Bison manual, replacing its yylex function with one generated using Flex. By now I can do things like
pi = 3.14159
sin(pi)
ln(exp(5))
with the Flex-generated version of yylex, but I also want to use the comma-separated values for function calls, list creation and more.
Thanks for your answers.

Then create a list to store the results in. Instead of having the result of the commasv rule return an actual value, have it return the list head.
In general, as soon as you get a moderately advanced grammar (one that incorporates things like lists, for instance), you can no longer really use plain values to represent the result of parsing, but have to move to some sort of abstract syntax tree.
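A minimal sketch of that idea, grafted onto an mfcalc-style grammar (the node struct and the list union member are names invented here, not part of mfcalc; adapt them to your own %union):

%{
#include <stdlib.h>
/* One element of a comma-separated argument list. */
struct node {
    double       value;
    struct node *next;
};
%}

%union {
    double       val;    /* ordinary expressions        */
    struct node *list;   /* comma-separated value lists */
}

%type <val>  exp
%type <list> commasv

%%
commasv
  : exp               { $$ = malloc(sizeof(struct node));
                        $$->value = $1;
                        $$->next  = NULL; }
  | commasv "," exp   { struct node *n = malloc(sizeof(struct node));
                        struct node *t = $1;
                        n->value = $3;
                        n->next  = NULL;
                        while (t->next)          /* append at the tail,   */
                            t = t->next;         /* preserving list order */
                        t->next = n;
                        $$ = $1; }               /* the rule's value is the list head */
  ;

A function-call rule such as compute_list(...) can then walk the list to build its argument array and free the nodes once the call has been evaluated.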

Related

Why required "=" before ANY function with array as param, in postgres procedure?

I was answering a postgres question yesterday, and also came across a postgres thread (here) where they describe the following error:
ERROR: operator does not exist: text = text[]
HINT: No operator matches the given name and argument type(s). You
might need to add explicit type casts.
The error seems to appear whenever an ARRAY string type is fed to ANY without using = ANY. This seems completely strange since, based on language, logic, and SQL conventions, you usually have (e.g. with IN):
variable FUNCTION(set)
instead of
variable = FUNCTION(set), unless of course the operator is a summation/count operation returning one result :)
It would make more sense to have variable ANY(Set/Array) instead of variable = ANY(Set/Array). A similar example is the IN construct.
Can anyone explain what is going on here?
IN (...) is basically equivalent to = ANY (ARRAY[...])
Crucially, ANY is not a function. It's syntax defined by the SQL standard, and is no more a function than GROUP BY or the OVER clause in a window function.
The reason that = is required before ANY is that ANY can apply to other operators too. What it means is "Test the operator to the left against every element in the array on the right, and return true if the test is true for at least one element."
You can use > ANY (ARRAY[...]) or whatever. It's a general purpose operator that isn't restricted to =. Notably useful for LIKE ANY (albeit with somewhat bad performance).
There is ALL too, which does much the same thing but returns true only if all results are true.

splitting JSON string using regex

I want to split a JSON document which has a pattern like [[[1,2],[3,4][5,6]]] using regex. The pairs represent x and y. What I want to do is take this string and produce a list with {"1,2", "3,4", "5,6"}. Eventually I want to split the pairs. I was thinking I could make a list of {"1,2", "3,4", "5,6"} and use a for loop to split the pairs. Is this approach correct for getting the x and y separately?
JSON is not a regular language but a context-free language, and as such it cannot be matched by a regular expression. You need a full JSON parser like the ones referenced in the comments to your question.
... but, if you are going to have a fixed structure, with only three levels of square brackets and exactly the layout you posted in your question, then there's a regexp that can parse it (it would be a subset of the JSON grammar, not general enough to parse other JSON content):
You'll have numbers: ([+-]?[0-9]+)
Then you'll have brackets and separators: \[\[\[, ,, \],\[ and \]\]\]
and finally, put all this together:
\[\[\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\]\]\]
and if you want to permit spaces between symbols, then you need:
\s*\[\s*\[\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*\]\s*\]\s*
This regexp has six capturing groups that will match the corresponding integers in the matching string, as the following demo shows.
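The demo itself isn't reproduced here, but as a rough illustration of how those six groups could be consumed, here is a small C program driving the bracket pattern through POSIX regex.h (the C harness is an addition of mine, not part of the original answer):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* The fixed three-pair pattern from the answer, in POSIX ERE syntax. */
    const char *pattern =
        "\\[\\[\\[([+-]?[0-9]+),([+-]?[0-9]+)\\],"
        "\\[([+-]?[0-9]+),([+-]?[0-9]+)\\],"
        "\\[([+-]?[0-9]+),([+-]?[0-9]+)\\]\\]\\]";
    const char *input = "[[[1,2],[3,4],[5,6]]]";   /* note the comma between pairs */

    regex_t    re;
    regmatch_t m[7];                 /* m[0] is the whole match, m[1]..m[6] the groups */

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;

    if (regexec(&re, input, 7, m, 0) == 0) {
        for (int i = 1; i <= 5; i += 2)
            printf("x=%.*s y=%.*s\n",
                   (int)(m[i].rm_eo - m[i].rm_so),         input + m[i].rm_so,
                   (int)(m[i + 1].rm_eo - m[i + 1].rm_so), input + m[i + 1].rm_so);
    }
    regfree(&re);
    return 0;
}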
Clarification
Regular languages, defined by regular grammars or regular expressions, form a class of languages with many practical properties, for example:
You can parse them efficiently in one pass with what is called a finite automaton.
You can define the automaton that accepts a language's sentences simply by writing a regular expression.
You can combine regexps (or automata) to build acceptors for more complex languages (unions of language sets, intersections, symmetric differences, concatenations, etc.).
You can easily decide whether the language defined by one regular expression is a subset, a superset or neither of the language defined by another.
By contrast, this restricts the power of the languages that can be defined:
you cannot define languages that allow nesting of subexpressions (like the bracketing you allow in JSON expressions or the tag nesting allowed in XML documents)
you cannot define languages which collect context and use it in another place of the sentence (for example, sentences that identify a number and have to match that same number in another place of the sentence)
But the point of my answer is that, if you bound the nesting depth (say to three levels of brackets, as in the example you posted), you can make your language regular and then parse it with a regular expression. That is not easy to do, because it often leads to complex expressions (as you have seen in my answer), but it is not impossible, and you gain the ability to identify parts of the sentence as submatches of the regular subexpressions embedded in the global one.
If you want to allow arbitrary nesting, you need to switch to context-free languages, which are defined with context-free grammars and are accepted by a more complex, stack-based automaton. You then lose most of the operations you had:
You'll no longer be able, in general, to decide whether one such language overlaps or is included in another.
You'll no longer be able, in general, to construct acceptors for the intersection or difference of context-free languages (they remain closed under union and concatenation, but not under intersection or difference).
But you will be able to match unboundedly nested sentences. Normally, programming languages are defined with a context-free grammar plus a little extra work for context checking (for example, to check that an identifier being used was actually defined in the declaration section, or to match the opening and closing tag names at the corresponding levels of an XML document).
For context free languages, see this.
For regular languages, see this.
Second clarification
Since your question didn't say whether you wanted to match real (decimal) numbers, I have modified the demo to allow fixed-point numbers (not general floating point with exponential notation; working that out is left to you as an exercise). Just run some tests and modify the regexp to adapt it to your needs.
(well, if you want to see the solution, look at it)
Yeah, I tried using the regex in my code but it is not working, so I am trying a different approach now. I have an idea of how to approach it but it is not really working either. First, let me be clearer about the question. What I am trying to do is parse a JSON document, like the image below; the file has strings with a [[[1,2],[3,4][5,6]]] pattern. What I am trying to get out of this is each pair as a list entry, so the list holds x-y pairs.
the string structure
My approach: first replace the "[[" and "]]" at the beginning and at the end, so I have a string with the same pattern throughout, which gives me a string "[1,2],[3,4][5,6]". This is my code, but it is not working. How do I fix it? The other thing I thought could be an issue is that the strings are not all the same length, so how do I replace just the beginning and the ending?
my code
Then I can use a regex split method to get a list of the form {"1,2", "3,4", "5,6"}. I am not really sure how to do this though.
Then I take the x and the y and add those to the list, so I get a list of x-y pairs. I would appreciate it if you could show me how to do this.
This is the approach I am working on, but if there is a better way of doing it I will be glad to see it.
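A rough C sketch of the strip-then-split plan described above (the code from the screenshots isn't reproduced here, so every name below is illustrative; this sample input also includes a comma between pairs, so adjust the delimiters if your data really omits it):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "[[[1,2],[3,4],[5,6]]]";

    /* Step 1: strip the leading "[[[" and the trailing "]]]". */
    char *s = buf + 3;
    buf[strlen(buf) - 3] = '\0';        /* s now points at "1,2],[3,4],[5,6" */

    /* Step 2: split on "],[" to isolate one "x,y" chunk at a time. */
    while (s && *s) {
        char *next = strstr(s, "],[");
        if (next) {
            *next = '\0';               /* terminate the current pair */
            next += 3;                  /* skip over "],[" */
        }

        /* Step 3: split the pair itself on the comma. */
        int x, y;
        if (sscanf(s, "%d,%d", &x, &y) == 2)
            printf("x=%d y=%d\n", x, y);

        s = next;
    }
    return 0;
}

A real JSON parser remains the safer route once the structure stops being this fixed.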

how to declare variables during run time in c

I am about to start a project for something like a simple calculator, written completely in C, and I was wondering how I would allow the user to create variables during the run time of the program. The variable could be a number or a complex number or even a matrix or a
One approach was to store the type of the variable, its name, its size and its value in a temporary text file and retrieve it whenever needed; is there any better approach? I was hoping I could declare real variables during runtime in C.
OK, you are going to need three basic modules:
N.B. I haven't had the time to actually compile these examples, hopefully I didn't make too many mistakes.
Parser. This module takes user-entered input, tokenizes it and ensures that the entered expression conforms to the grammar.
Interpreter. This module takes the output of the parser and performs the computation.
Environment. This module manages the state of the computation.
Let's start with the environment; here is what we need (or at least what I would implement). First, here are the design considerations I would use for the environment:
We will only deal with variables
We will only allow 300 variables
All variables will have global scope
Rebinding a variable will overwrite the old binding
Now, let's define the following structure:
typedef struct st
{
    char *tokenName;
    int   type;
    union
    {
        int   iVal;
        float fVal;
    } val;
} tableEntry, *ptableEntry;

tableEntry symbolTable[300];
and now for some functions:
a. init(tableEntry*) -- this function initializes the environment, i.e. sets all entries in the symbol table to some predefined empty state.
b. addValue(tableEntry*, name, value) -- this function takes a pointer to the environment and adds a new entry to it.
c. int lookupValue(tableEntry*, name) -- this function takes a pointer to the environment and sees whether the token name has been defined in it. Already we see a problem: we are allowing both integers and floating point numbers but would like a single lookup function, so we probably need some sort of variant type, or have to figure out some way to return different types.
d. updateValue(tableEntry*, name, value) -- this function takes a pointer to the environment and updates an existing value. This raises an unaddressed bit of the specification: what should updateValue do if the token is not found? Personally I would just add the value, but it's up to you, as the designer of the calculator, what to do.
This should do for a start for the environment.
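A minimal sketch of those four functions under the constraints above (the T_* enum, the TABLE_SIZE constant and the choice to have the lookup return a pointer to the entry are my own additions, just to make the shape concrete):

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 300
enum { T_EMPTY = 0, T_INT, T_FLOAT };

/* Empty every slot so unused entries are recognizable. */
void init(tableEntry *env)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        env[i].tokenName = NULL;
        env[i].type      = T_EMPTY;
    }
}

/* Returns the entry bound to `name`, or NULL if there is none.
   Returning the whole entry sidesteps the int-vs-float return problem. */
tableEntry *lookupValue(tableEntry *env, const char *name)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        if (env[i].type != T_EMPTY && strcmp(env[i].tokenName, name) == 0)
            return &env[i];
    return NULL;
}

/* Binds `name` to an integer value in the first free slot; a float
   version (or a variant argument) would have the same shape. */
int addValue(tableEntry *env, const char *name, int value)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (env[i].type == T_EMPTY) {
            env[i].tokenName = strdup(name);
            env[i].type      = T_INT;
            env[i].val.iVal  = value;
            return 0;
        }
    }
    return -1;                          /* table full */
}

/* Overwrites an existing binding; per the discussion above, it simply
   falls back to addValue when the name has not been bound yet. */
int updateValue(tableEntry *env, const char *name, int value)
{
    tableEntry *e = lookupValue(env, name);
    if (e == NULL)
        return addValue(env, name, value);
    e->type     = T_INT;
    e->val.iVal = value;
    return 0;
}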
Now let's turn to the interpreter for a bit. For this, let's suppose the parser emits the abstract syntax tree in prefix form. For example:
the statement x=3 would be emitted as = x 3
the statement z = 4 + 5 would be emitted as = z + 4 5
OK, the trick here is we don't really emit 3, but rather a token that contains more information about what is being passed around.
A possible implementation of a token might be:
typedef struct tok
{
    int   tokType;
    char *tokVal;
} token, *ptoken;
Also, let's have the following enumeration:
enum {EMPTY=0, ID, VAL, EQ, PLUS, SUB, MULT, DIV, LPAREN, RPAREN};
So, with this, the simplified statement = x 3 would actually be the following structures:
{EQ, NULL} {ID, "x"} {VAL, "3"}
OK, so the interpreter in pseudo-code would look like (assuming that the above is presented to the interpreter as a list).
while list not empty
    token <-- head(list)    /* returns the first token and removes it from the list */
    switch (token.tokType)
    {
    ....
    case EQ:                /* handling assignment */
        token <-- head(list)
        name = token.tokVal
        token <-- head(list)
        val = atoi(token.tokVal)
        addValue(env, name, val);
        break;
    case ID:
        name = token.tokVal
        val = lookupValue(env, name)
        ....
    }
Please be advised that the actual form of the above code will in all probability need to be modified to deal with other constructs; it is just a notional example!
Now it's your turn -- take a stab and show us what you've come up with.
Later
T.
You cannot declare new variables at runtime in C as such.
For what you want you could make a list of structs for each type you want to support. Each struct then contains the name of the variable and its value. Declaring a variable will add a new struct to the list.
I'm assuming you were thinking about an interaction with your calculator that might look something like this:
> myCalc
mc> x=5
mc> 5
mc> 3*x
mc> 15
mc> quit
>
Where myCalc is the name of your program, and the mc> prompt shows interaction with the calculator: the user enters a statement and the calculator displays the result of that statement.
Now, consider the first statement x=5: we need to parse it and determine whether it is valid according to the grammar you are using. Assuming it is, you then need to evaluate the statement, which for the sake of discussion has the abstract syntax tree (AST) ASMT(x, VAL(5)). ASMT and VAL are notional operators.
Now, I would take this as adding a new binding for x into the current environment. Exactly how this environment looks depends on what you are willing to allow, so for now let's assume you are just allowing variable assignment. A simple associative array would work here, where the key is the variable name and the data is the value.
Now consider the next statement 3*x. After parsing, we can assume the AST for the expression is TIMES(3, ID(x)). On evaluation of this, our interpreter would first need to handle the ID(x) part, which means looking up the value of x in the environment, which is 5.
After the above, the AST would look like TIMES(3,5) which would be directly evaluated as 15.
N.B. I am being fairly loose with how the AST would be represented and how it would be evaluated by the interpreter. I'm trying to give a flavour of what to do, not full low-level implementation details.
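To give that flavour a slightly more concrete shape, a node type and evaluator for such an AST could look roughly like the following; every name here is invented for illustration, and the environment is cut down to a single variable just to keep the sketch self-contained:

#include <stdio.h>

/* One possible node shape for the notional AST above. */
typedef enum { N_VAL, N_ID, N_ASMT, N_TIMES } nodeKind;

typedef struct astNode
{
    nodeKind        kind;
    double          value;           /* N_VAL */
    const char     *name;            /* N_ID and the target of N_ASMT */
    struct astNode *left, *right;    /* operands */
} astNode;

/* A deliberately tiny environment: just enough for the x=5 example. */
static double xValue;
static double envLookup(const char *name) { (void)name; return xValue; }
static void   envBind(const char *name, double v) { (void)name; xValue = v; }

static double eval(const astNode *n)
{
    switch (n->kind) {
    case N_VAL:   return n->value;
    case N_ID:    return envLookup(n->name);             /* ID(x) -> 5 */
    case N_ASMT: {                                        /* ASMT(x, VAL(5)) */
        double v = eval(n->right);
        envBind(n->name, v);
        return v;
    }
    case N_TIMES: return eval(n->left) * eval(n->right);  /* TIMES(3, ID(x)) -> 15 */
    }
    return 0.0;
}

int main(void)
{
    astNode five  = { N_VAL,   5.0, NULL, NULL,   NULL  };
    astNode asmt  = { N_ASMT,  0,   "x",  NULL,   &five };
    astNode three = { N_VAL,   3.0, NULL, NULL,   NULL  };
    astNode idx   = { N_ID,    0,   "x",  NULL,   NULL  };
    astNode times = { N_TIMES, 0,   NULL, &three, &idx  };

    eval(&asmt);                     /* x = 5 */
    printf("%g\n", eval(&times));    /* prints 15 */
    return 0;
}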
hope this helps (a bit),
T

How do I tokenize this input line with variable types such that I can decide the size(s) of the struct holding my information?

This is a multiple part question. Firstly my input file will look like this:
category Shoes brand:char[50],cost:int
category Shirts brand:char[20],cost:int
My questions are:
a.) How do I break up the line at : only after the category name? Shoes or Shirts in these cases.
b.) How would I write my Bison parser such that I would be determining the variables (eg. char[30]) of the struct that would hold the information for each line?
If these questions seem too localized, I'd appreciate it if I could be guided to some resources that could help me do the same
There are far too many missing details. For instance, can "int" be used as a category name? How do you plan to store the data you parse?
Still, an initial sketch would be something like this for the parser:
%token CATEGORY "category"
       EQ       "="
       COLON    ":"
       COMMA    ","
       LBRA     "["
       RBRA     "]"
       INT      "int"
       CHAR     "char"
       ID
       NATURAL
;
%%
categories:
  category
| categories category
;
category:
  "category" ID fields
;
fields:
  field
| fields "," field
;
field:
  ID ":" type
;
type:
  "char"
| "int"
| type "[" NATURAL "]"
;
and this for the scanner:
%%
"category" return CATEGORY;
"=" return EQ;
":" return COLON;
"," return COMMA;
"[" return LBRA;
"]" return RBRA;
"int" return INT;
"char" return CHAR;
[a-zA-Z]+ return ID;
[0-9]+ return NATURAL;
[ \n\t]+ continue;
(b.) I'm not sure I understand your question correctly. I assume you want to parse each entry in the input file and store the info in a struct: category and brand would be strings and cost would be an int, and you want to know the length of the strings beforehand so you can declare a variable to hold the parsed value (such as char[20]).
Why not use C++ strings? Then you don't have to declare the length. Example here.
If you must use C, you can just have a char * and allocate the string on the heap using malloc.
This page has a reasonable, simple example that can be a guide to what you're trying to do. A complete general procedure for writing a flex/bison parser is:
Decide what the tokens are, write a regex and choose an identifier for each (as @akim has done in his flex/bison code).
Write a LALR(1) Context Free Grammar for the "legal" input you're trying to parse. @akim has given you a leg up on this one.
Design a target data structure to hold the parser result. This is the key thing missing in your post. If you are just trying to compute a single integer size, then you're done. If you need to pass more detail on for further processing, you'll want some kind of record/enumeration/list structure, which is normally called an abstract syntax tree or AST even though it's often not really a tree.
Implement the AST (if necessary) data structure including constructor functions like CATEGORY_NODE make_category(enum category_e cat);. These constructors will be called by the parser.
Implement the tokens in a flex scanner program file with no action code.
Implement the CFG in the bison parser program file with no action code.
Build a test frame and a test suite and test that the scanner and parser read legal inputs and reject bad ones. So far the program does nothing else. This is where @akim's code is now.
Decide what data must be associated with each token, if any. For example, an unsigned number's value would be its magnitude returned from the scanner as an unsigned int. An identifier's value will be a string. But a bracket [ has no value at all. Use these types to create the bison %union directive to accept values from the scanner. Add these as <union_field> tags to the respective %token directives. See the example for the %union used there, which has two token value types, and the small sketch after this list.
Fix the flex scanner by adding an #include "foo.tab.h" at the top and adding action code to return the correct token value for the respective tokens. At this point everything should compile and run again, but still do nothing. Again refer to the example.
Now implement data handling in the parser. You'll be using the dollar $n and $$ directives and adding <union_field> type tags to grammar nonterminals to move data among the grammar rules, and finally calls to the constructors for your AST structure to build the structure as the input is read (or to increment the integer if that's all it is). This is where experience and deeper knowledge of LALR parsers is helpful. The later examples in the Bison documentation are another reference for this part. If you hit roadblocks, come back with specific questions.
Run the tests again and print the resulting AST in a form you can verify. XML works well for this.
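A rough sketch of how that value plumbing could look for the category grammar above; the AST types and the make_field constructor are hypothetical names used only to show the shape, not part of the earlier answer:

/* In the bison file: value types for tokens and nonterminals. */
%union {
    char *str;                 /* ID: the identifier's spelling */
    int   num;                 /* NATURAL: the numeric value    */
    struct type_node  *ty;     /* hypothetical AST node for a type          */
    struct field_node *field;  /* hypothetical AST node for one "name:type" */
}

%token <str> ID
%token <num> NATURAL
%type  <ty>    type
%type  <field> field

%%
field:
  ID ":" type   { $$ = make_field($1, $3); }   /* make_field: your AST constructor */
;

/* In the flex file, after #include "foo.tab.h": fill yylval before returning. */
[a-zA-Z]+   { yylval.str = strdup(yytext); return ID; }
[0-9]+      { yylval.num = atoi(yytext);   return NATURAL; }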
Now I'm sure someone will observe that this fully general approach is overkill for what you've asked so far. It depends. Your current input is so simple that you could probably hand-write a little ad hoc parser more quickly, not using flex or bison at all.
However if the program you're writing is likely to be in use for a while and to change over time, having all the machinery of a general parser in place at the beginning can make life much easier. Thinking about program input as a real language rather than just raw data can lead you to create functionality that otherwise would never have occurred to you.

Postgres C Function - Passing & Returning Numerics

I am just beginning to test with Postgres external C functions. When I pass in a Numeric and return it, the function works fine. (Example)
Sample Function
PG_FUNCTION_INFO_V1(numericTesting);
Datum
numericTesting(PG_FUNCTION_ARGS)
{
    Numeric p = PG_GETARG_NUMERIC(0);
    PG_RETURN_NUMERIC(p);
}
However, when I try to do any math functions on the variable passed in, it will not compile. I get
error: invalid operands to binary *
Sample Function
PG_FUNCTION_INFO_V1(numericTesting);
Datum
numericTesting(PG_FUNCTION_ARGS)
{
    Numeric p = PG_GETARG_NUMERIC(0);
    PG_RETURN_NUMERIC(p * .5);
}
What is causing this? I'm guessing the Numeric datatype needs some function to allow math. I tried using PG_RETURN_NUMERIC(DatumGetNumeric(p * .5)) but that had the same result.
Numeric isn't a primitive type so you can't do arithmetic operations on it directly. C doesn't have operator overloading, so there's no way to add a multiply operator for Numeric. You'll have to use appropriate function calls to multiply numerics.
As with most things when writing Pg extension functions it can be helpful to read the source and see how it's done elsewhere.
In this case look at src/backend/utils/adt/numeric.c. Examine Datum numeric_mul(PG_FUNCTION_ARGS) where you'll see it use mul_var(...) to do the work.
Unfortunately mul_var is static so it can't be used outside numeric.c. Irritating and surprising. There must be a reasonable way to handle NUMERIC from C extension functions without using the spi/fmgr to do the work via SQL operator calls, as you've shown in your comment where you use DirectFunctionCall2 to invoke the numeric_mul operator.
It looks like the public stuff for Numeric that's callable directly from C is in src/include/utils/numeric.h so let's look there. Whoops, not much, just some macros for converting between Numeric and Datum and some helper GETARG and RETURN macros. Looks like usage via the SQL calls might be the only way.
If you do find yourself stuck using DirectFunctionCall2 via the SQL interfaces for Numeric, you can create a Numeric argument for the other side from a C integer using int4_numeric.
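As a rough illustration of that DirectFunctionCall2 / int4_numeric route (this sketch is mine, not from the original answer; it computes p / 2, which is the same as p * .5, and the header locations vary a little between PostgreSQL versions):

#include "postgres.h"
#include "fmgr.h"
#include "utils/numeric.h"
/* Prototypes for the SQL-callable numeric functions; recent PostgreSQL
   declares them in fmgrprotos.h, older releases in utils/builtins.h. */
#include "utils/fmgrprotos.h"

PG_FUNCTION_INFO_V1(numericHalf);

Datum
numericHalf(PG_FUNCTION_ARGS)
{
    Numeric p   = PG_GETARG_NUMERIC(0);

    /* Build the Numeric constant 2 from a plain C int. */
    Datum   two = DirectFunctionCall1(int4_numeric, Int32GetDatum(2));

    /* p / 2, done by calling numeric_div through the fmgr. */
    Datum   res = DirectFunctionCall2(numeric_div,
                                      NumericGetDatum(p), two);

    PG_RETURN_NUMERIC(DatumGetNumeric(res));
}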
If you can't find a solution, post on the pgsql-general mailing list, you'll get more people experienced with C extensions and the source code there. Link back to this post if you do so.
A way to sidestep this problem altogether is to use data type coercion. Declare your SQL function with the type you want to coerce a value to, e.g.
CREATE FUNCTION foo(float8) RETURNS float8 AS 'SELECT $1' LANGUAGE SQL;
Any value provided to that function will be coerced to that type:
SELECT foo(12);
Even explicitly specifying the type will work:
SELECT foo(12::numeric);
Your C code will receive a double. (Credit goes to Tom Lane, see this mailing list post.)
Both operands of * must have arithmetic type (integer or floating-point types). It's a pretty good bet that Numeric is a typedef for something that isn't a simple integer or floating-point type.
Unfortunately, I don't know enough about the Postgres API to be much more help. Hopefully there's a macro or a function that can either convert a Numeric to an arithmetic type, or apply an arithmetic operation to a Numeric type.
#include "utils/numeric.h"
// ....
Numeric p = PG_GETARG_NUMERIC(0);
Numeric b = int64_div_fast_to_numeric(5, 1); // 0.5
bool failed = false;
Numeric r = numeric_mul_opt_error(p, b, &failed);
if (failed) {
// handle failure here
}
PG_RETURN_NUMERIC(r);
