Chomsky Hierarchy: LR(k) grammars vs Deterministic CFGs? - theory

We are learning the Chomsky hierarchy in my introduction to computer science course. My professor has mentioned LR(k) grammars multiple times, but they aren't covered in the book. From my understanding, they are a subset of deterministic context-free grammars that generate unambiguous languages. But how are they different from deterministic CFGs?
Here is the Chomsky hierarchy we've gone over in class with the devices that recognize the associated grammar:
recursively enumerable - all Turing machines
recursive - deciders/TMs that halt on every input
context-sensitive - linear-bounded nondeterministic Turing machines
context-free - nondeterministic PDAs
deterministic context-free - deterministic PDAs
LR(k) grammars - deterministic PDAs
regular - DFAs/NFAs
On a separate note (please let me know in the comments if this should be a separate post) - how are linear-bounded nondeterministic Turing machines different from deciders?

What's tricky here is that there are two parallel hierarchies that are related but not exactly the same. There are the LR(k) grammars, which are classes of grammars with certain properties. We know that
LR(0) ⊊ LR(1) ⊊ LR(2) ⊊ ...
That is, as you increase k, larger and larger classes of grammars are included in the class LR(k).
Independently, there are the LR(k) languages, which are languages for which an LR(k) grammar exists for some choice of k. There's a cool theorem from Don Knuth that shows that a language has an LR(k) grammar for some k if and only if it has an LR(1) grammar. So in that sense, the LR(k) languages are "languages for which you can make an LR(1) grammar."
Then there's the deterministic context-free languages (DCFLs), which are languages for which you can build a deterministic PDA. It's known that the DCFLs are precisely the same as the LR(k) languages - that is, a language is a deterministic CFL if and only if there's an LR(1) grammar for it.
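For a concrete example (a standard one, not specific to your course), take the language { a^n b^n : n ≥ 0 }. The grammar

S → a S b
S → ε

is LR(1): after reading some a's, one symbol of lookahead is enough to decide what to do (another a means keep shifting, a b or the end of the input means reduce S → ε), so a deterministic PDA can handle it. By contrast, the even-length palindromes { ww^R } are context-free but not deterministic - no bounded lookahead can tell the parser where the middle of the string is - so they have a CFG but no LR(k) grammar for any k.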
So what does this mean for the hierarchy of languages? It looks something like this, from least powerful / most restrictive to most powerful / least restrictive:
Regular languages
Described by right-linear grammars, left-linear grammars, DFAs, NFAs, regular expressions, and prefix grammars
Deterministic CFLs
Described by deterministic PDAs and LR(k) grammars
(Nondeterministic) CFLs
Described by (non)deterministic PDAs and CFGs.
Context-sensitive languages
Described by linear-bounded automata and context-sensitive grammars
Recursive languages
Languages accepted by some decider; equivalently, languages where both the language and its complement are recursively enumerable
Recursively enumerable languages
Languages of unrestricted grammars; languages of Turing machines; languages that can be verified by Turing machines; languages of enumerators; etc.
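Some standard examples that separate the upper levels (classical facts, added here for reference): { a^n b^n c^n } is context-sensitive but not context-free, and the halting language { ⟨M, w⟩ : Turing machine M halts on input w } is recursively enumerable but not recursive.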

Related

Does C have 32 or 44 keywords?

ISO/IEC 9899:2017 C N2176 document link: https://files.lhmouse.com/standards/ISO%20C%20N2176.pdf
There are plenty of sources on the web which say that there are 32 keywords in the C language, but this document (I think it's a draft version, though there aren't many changes compared to the previous version, right?) has 44 words that are defined as keywords of the C language.
Please explain this.
============================================================
Link to the sources which say that there are 32 keywords in C language:
https://www.programiz.com/c-programming/list-all-keywords-c-language
https://tutorials.webencyclop.com/c-language/c-keyword/
https://www.educba.com/c-keywords/
https://www.javatpoint.com/keywords-in-c
https://beginnersbook.com/2014/01/c-keywords-reserved-words/
https://www.phptpoint.com/c-keywords/
https://www.guru99.com/c-tokens-keywords-identifier.html
https://fresh2refresh.com/c-programming/c-tokens-identifiers-keywords/
https://www.w3schools.in/c-tutorial/keywords/
Note: Some of these sites are useful for beginners to learn basic concepts and terminology of C.
The claims of there being "32" keywords in C refer to the original ANSI-specified version of C from 1989, aka C89.
Because this is the Internet, and because the real C specifications are behind ISO's ridiculous paywall, most people so inclined probably can't fact-check the claim.
And it's not a claim worth fact-checking: the number of keywords in a language is utter trivia of no consequence.
The ISO/IEC 9899 specification you linked to refers to C17 (the proposed updated C specification in 2017) which postdates C89 by 28 years.
It should come as no surprise that a future updated revision of a programming language introduces new keywords.
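For what it's worth, here is where the two counts come from (keyword lists as given in the respective standards; easy to check against the N2176 draft linked above):

C89/C90 (32): auto break case char const continue default do double else enum extern float for goto if int long register return short signed sizeof static struct switch typedef union unsigned void volatile while
C99 added (5): _Bool _Complex _Imaginary inline restrict
C11 added (7): _Alignas _Alignof _Atomic _Generic _Noreturn _Static_assert _Thread_local
C17 added none, so the draft you linked lists 32 + 5 + 7 = 44 keywords.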
Historically, when C was introduced, people were impressed by its minimalist syntax and by how much of the language's functionality was provided as library features rather than language features, which is what keeps C simple and helps mitigate every language designer's fear of feature creep.
In comparison, C's early contemporaries, like COBOL, opted to implement functionality as first-class language features, which is why COBOL has over 300 keywords. So I'll admit that the stark difference in keyword count does serve as a proxy for language complexity and, by extension, a way to roughly quantify good design. But using it as a basis for comparing languages today in 2021 is of limited utility: the most relevant programming languages today1 are already either inspired by C or derived from it somehow, and they all share C's decision to do things in the library instead of the language, so all those languages similarly have a low keyword count compared to COBOL, SQL, and others. That's why C's keyword count just isn't interesting anymore.
1: C-inspired or C-derived languages in use today: C++, Objective-C, Java, Groovy, Swift, C#, JavaScript, TypeScript, Go, PHP, Perl. Other popular languages that aren't modelled on C (like Haskell, OCaml, etc.) do share C's library-first philosophy, though I can't say whether C originated it. I do feel that library-first designs are inevitable: the cost of implementing a language feature is easily ten-fold that of implementing a library feature.

Compiler Construction for Formal Requirement Specification Language using JFlex and CUP

I am planning to build a compiler for a requirement specification language. My idea is to use JFlex as the lexical analyzer and CUP as the parser.
Can anyone let me know whether it is possible to use JFlex and CUP for a formal specification language? All the documentation and tutorials relate to programming languages only.
Is there any tutorial available for building a compiler for a formal language?
Lexer and parser generators do not care whether your language is a "conventional computer language", only that your language has a grammar specification they can handle.
Often the way you get such a grammar specification is to take a specification for your formal system as given, and bend it according to the constraints of your chosen parser generator. This bending process is at best inconvenient, at worst really hard, depending on the gap between the parser generator's capabilities and what your formal language specification says.
I suggest you inspect your "Requirement Specification Language" formal grammar, and decide which parser generator you want to use based on that, to minimize the amount of bending you have to do.

What parser-generators with code separation and language extensibility would you recommend?

I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility to add support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.
I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.
Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.
Which are the most well-known/maintained options?
Popular parser generators seem to mostly use a mixed grammar/code approach, which I really don't like. The comparison list on Wikipedia lists a few, but I'm a novice at this and can't tell which to try.
Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.
What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
Most parser generators won't handle arbitrary context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a truly general context-free parser generator you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't actually context-free; it requires rules to be ordered to determine which ones take precedence).
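To make the "hack your grammar" point concrete (a textbook transformation, not tied to any particular tool): an LL-family generator rejects the left-recursive rule

E → E "+" T | T

so you rewrite it as

E  → T E'
E' → "+" T E' | ε

The rewritten grammar accepts exactly the same strings, but the parse trees come out shaped differently, so whatever tree-building code you attach has to compensate.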
You seem to be worried about mixing syntax and parser-actions used to build the trees.
If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
Another way is to give each rule a name (or some unique identifier) and set the tree-building machinery off to the side, indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and once you have a few hundred names, that's inconvenient by itself and it's hard to make them mnemonic.
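To sketch what "off to the side, indexed by name" could look like mechanically, here is a hypothetical illustration in C; the rule names and builder functions are invented, and, as said above, no parser generator I know of ships exactly this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch: tree-building actions live in their own table,
     * keyed by rule name, instead of being written next to each grammar
     * production.  All names here are invented for illustration. */

    typedef struct Node {
        const char *kind;
        struct Node *kid[2];
    } Node;

    typedef Node *(*BuildFn)(Node *left, Node *right);

    static Node *mknode(const char *kind, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->kind = kind;
        n->kid[0] = l;
        n->kid[1] = r;
        return n;
    }

    static Node *build_add(Node *l, Node *r) { return mknode("add", l, r); }
    static Node *build_mul(Node *l, Node *r) { return mknode("mul", l, r); }

    /* The "off to the side" association: grammar rule name -> tree builder. */
    static const struct { const char *rule; BuildFn build; } actions[] = {
        { "expr_plus_term",    build_add },   /* expr : expr '+' term   */
        { "term_times_factor", build_mul },   /* term : term '*' factor */
    };

    static BuildFn action_for(const char *rule) {
        for (size_t i = 0; i < sizeof actions / sizeof actions[0]; i++)
            if (strcmp(actions[i].rule, rule) == 0)
                return actions[i].build;
        return NULL;   /* rules with no named action build nothing special */
    }

    int main(void) {
        Node *n = action_for("expr_plus_term")(mknode("num", NULL, NULL),
                                               mknode("num", NULL, NULL));
        printf("%s\n", n->kind);   /* prints "add" */
        return 0;
    }

The obvious cost is exactly the one described above: you have to invent and maintain all of those rule names.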
A third way is to make the tree a direct function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know of that does this (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]
One objection is that such trees are concrete, bloated and confusing; if done right, that's not true.
My SO answer to the question "What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?" discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:
By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.
While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.
DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule, associated-computation) pair has a valid rule in the base grammar, so it is relatively easy to track down what needs fixing when the base grammar changes.
This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continues, F90 continues, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5, and many other languages.
First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
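For anyone curious how that indentation support is usually done: it's typically handled in the lexer rather than the parser, by keeping a stack of open indentation widths and emitting synthetic INDENT/DEDENT tokens. A simplified, generic C sketch of just that bookkeeping (real lexers also have to cope with tabs, blank lines, and continuation lines):

    #include <stdio.h>

    /* Simplified sketch of the indentation bookkeeping a lexer does for
     * Python-style blocks: compare each line's leading width against a
     * stack of open indentation levels and emit INDENT/DEDENT "tokens"
     * (here just printed) before handing the line to the parser. */

    #define MAX_DEPTH 64

    static int stack[MAX_DEPTH] = {0};   /* stack[0] = column 0 is always open */
    static int top = 0;

    static void handle_line_indent(int width) {
        if (width > stack[top]) {                    /* deeper: open one new block */
            stack[++top] = width;
            puts("INDENT");
        } else {
            while (top > 0 && width < stack[top]) {  /* shallower: close blocks */
                top--;
                puts("DEDENT");
            }
            if (width != stack[top])
                puts("error: inconsistent dedent");
        }
    }

    int main(void) {
        /* indentation widths of five successive source lines */
        int widths[] = {0, 4, 4, 8, 0};
        for (int i = 0; i < 5; i++)
            handle_line_indent(widths[i]);   /* prints INDENT, INDENT, DEDENT, DEDENT */
        return 0;
    }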
Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.
Thirdly, you don't really say what it is about the mix of grammar and code that you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?
Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.
The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...
Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.

Need a way to parse algebraic expressions in C

I need to parse algebraic expressions for an application I'm working on and am hoping to garner a bit of collective wisdom before taking a crack at it and, possibly, heading down the wrong road.
What I need to do is pretty straightforward: given a textual algebraic expression (3*x - 4(y - sin(pi))), create an object representation of the equation. The custom objects already exist, so I need a parser that creates a tree I can walk to instantiate the objects I need.
The basic requirements would be:
Ability to express the algebra as a grammar so I have control and can customize/extend it as necessary.
The initial syntax will include integers, real numbers, constants, variables, arithmetic operators (+, -, *, /), powers (^), equations (=), parentheses, precedence, and simple functions (sin(pi)). I'm hoping to extend my app fairly quickly to support functions proper (f(x) = 3x + 2).
Must compile in C as it needs to be integrated into my code.
I DON'T need to evaluate the expression mathematically, so software that solves for a variable or performs the arithmetic is noise.
I've done my Google homework and it looks like the best approach is to use a BNF grammar and software to generate a compiler in C. So my questions:
Does a BNF grammar with a corresponding parser generator for algebraic expressions (or better yet, LaTeX) already exist? Someone has to have done this already. I REALLY want to avoid rolling my own, mainly because I don't want to test it. I'd be willing to pay a reasonable amount for a library (under $50).
If not, which parser generator for C do you think is the easiest to learn/use here? Lex? YACC? Flex, Bison, Python/SymPy, Others? I'm not familiar with any of these.
The standard Linux tools flex and bison would probably be most appropriate here. IIRC the sample parsers and lexers used in these tools do something close to what you want, so you might be able to just modify that code to get what you need.
These tools seem like they meet your specifications. You can customize the grammars, compile down to C, and use any operator you want.
I've had very good luck with ANTLR. It has runtimes for many different languages, including C, and has a very nice syntax for specifying grammars and building trees. I recently wrote a similar grammar (algebraic expressions) in 131 lines, which is definitely manageable.
I used the code (found on the net) from the following:
"Program Translation Fundamentals" by Peter Calingaert
I enhanced it to handle functions, which lets you implement things like "if(a, b, c)" (kind of like how Excel does things).
You can build a simple parser yourself or use any of the popular "compiler-compilers" (some of them were listed in other posts). Just decide whether your parser will be complicated enough to justify using (and learning) an external tool. In any case you'll need to define the grammar; that is usually the most brain-intensive task if you don't have prior experience. The formal way to define syntactic grammars is BNF or EBNF.
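As a starting point, an EBNF sketch for the syntax described in the question might look like this (one possible precedence layering; adjust to taste):

    equation := expr [ "=" expr ]
    expr     := term { ("+" | "-") term }
    term     := unary { ("*" | "/") unary }
    unary    := [ "-" ] power
    power    := atom [ "^" unary ]
    atom     := NUMBER | CONSTANT | VARIABLE
              | IDENT "(" expr { "," expr } ")"
              | "(" expr ")"

Note that implicit multiplication, as in 4(y - sin(pi)) from the question, doesn't fall out of this for free; it needs either an extra juxtaposition rule or a lexer trick that inserts the missing * token.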

Has the use of C to implement other languages constrained their designs in any way?

It seems that most new programming languages that have appeared in the last 20 years have been written in C. This makes complete sense as C can be seen as a sort of portable assembly language. But what I'm curious about is whether this has constrained the design of the languages in any way. What prompted my question was thinking about how the C stack is used directly in Python for calling functions. Obviously the programming language designer can do whatever they want in whatever language they want, but it seems to me that the language you choose to write your new language in puts you in a certain mindset and gives you certain shortcuts that are difficult to ignore. Are there other characteristics of these languages that come from being written in that language (good or bad)?
I tend to disagree.
I don't think it's so much that a language's compiler or interpreter is implemented in C — after all, you can implement a virtual machine with C that is completely unlike its host environment, meaning that you can get away from a C / near-assembly language mindset.
However, it's more difficult to claim that the C language itself didn't have any influence on the design of later languages. Take for example the usage of curly braces { } to group statements into blocks, the notion that whitespace and indentation are mostly unimportant, the native type names (int, char, etc.) and other keywords, or the way variables are defined (i.e. type declaration first, followed by the variable's name and optional initialization). Many of today's popular and widespread languages (C++, Java, C#, and I'm sure there are more) share these concepts with C. (These probably weren't completely new with C, but AFAIK C came up with that particular mix of language syntax.)
Even with a C implementation, you're surprisingly free in terms of implementation. For example, Chicken Scheme uses C as an intermediate language, but still manages to use the stack as a nursery generation in its garbage collector.
That said, there are some cases where there are constraints. Case in point: the GHC Haskell compiler has a Perl script called the Evil Mangler to alter the GCC-outputted assembly code to implement some important optimizations. They've been moving to internally generated assembly and LLVM partially for that reason. That said, this hasn't constrained the language design - only the compiler's choice of available optimizations.
No, in short. The reality is, look around at the languages that are written in C. Lua, for example, is about as far from C as you can get without becoming Perl. It has first-class functions, fully automated memory management, etc.
It's unusual for new languages to be affected by their implementation language, unless said language contains serious limitations. While I definitely disapprove of C, it's not a limited language, just very error-prone and slow to program in compared to more modern languages. Oh, except in the CRT. For example, Lua doesn't contain directory functionality, because it's not part of the CRT so they can't portably implement it in standard C. That is one way in which C is limited. But in terms of language features, it's not limited.
If you wanted to construct an argument saying that languages implemented in C have XYZ limitations or characteristics, you would have to show that doing things another way is impossible in C.
The C stack is just the system stack, and this concept predates C by quite a bit. If you study theory of computing you will see that using a stack is very powerful.
Using C to implement languages has probably had very little effect on those languages, though the familiarity with C (and other C like languages) of people who design and implement languages has probably influenced their design a great deal. It is very difficult to not be influenced by things you've seen before even when you aren't actively copying the best bits of another language.
Many languages do use C as the glue between them and other things, though. Part of this is that many OSes provide a C API, so to access that it's easy to use C. Additionally, C is just so common and simple that many other languages have some sort of way to interface with it. If you want to glue two modules together which are written in different languages then using C as the middle man is probably the easiest solution.
Where implementing a language in C has probably influenced other languages the most is in things like how escapes are done in strings, which isn't that limiting.
The only thing that has constrained language design is the imagination and technical skill of the language designers. As you said, C can be thought of as a "portable assembly language". If that is true, then asking if C has constrained a design is akin to asking if assembly has constrained language design. Since all code written in any language is eventually executed as assembly, every language would suffer the same constraints. Therefore, the C language itself imposes no constraints that would be overcome by using a different language.
That being said, there are some things that are easier to do in one language vs another. Many language designers take this into account. If the language is being designed to be, say, powerful at string processing but performance is not a concern, then using a language with better built-in string processing facilities (such as C++) might be more optimal.
Many developers choose C for several reasons. First, C is a very common language. Open source projects in particular like that it is relatively easier to find an experienced C-language developer than it is to find an equivalently-skilled developer in some other languages. Second, C typically lends itself to micro-optimization. When writing a parser for a scripted language, the efficiency of the parser has a big impact on the overall performance of scripts written in that language. For compiled languages, a more efficient compiler can reduce compile times. Many C compilers are very good at generating extremely optimized code (which is also part of the reason why many embedded systems are programmed in C), and performance-critical code can be written in inline assembly. Also, C is standardized and is generally a static target. Code can be written to the ANSI/C89 standard and not have to worry about it being incompatible with a future version of C. The revisions made in the C99 standard add functionality but don't break existing code. Finally, C is extremely portable. If at least one compiler exists for a given platform, it's most likely a C compiler. Using a highly-portable language like C makes it easier to maximize the number of platforms that can use the new language.
The one limitation that comes to mind is extensibility and compiler hosting. Consider the case of C#. The compiler is written in C/C++ and is entirely native code. This makes it very difficult to use in process with a C# application.
This has broad implications for the tooling chain of C#. Any code which wants to take advantage of the real C# parser or binding engine has to have at least one component which is written in native code. This eventually results in most of the tooling chain for the C# language being written in C++ which is a bit backwards for a language.
This doesn't limit the language per se, but it definitely has an effect on the experience around the language.
Garbage collection. Language implementations on top of Java or .NET use the VM's GC. Those on top of C tend to use reference counting.
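A minimal, generic sketch of what that reference counting looks like on the C side (not any particular runtime's actual API):

    #include <stdlib.h>

    /* Generic reference-counting discipline a C-hosted language runtime
     * often uses in place of a tracing collector. */
    typedef struct Obj {
        long refcount;
        void (*finalize)(struct Obj *);   /* release children, buffers, ... */
    } Obj;

    static void obj_incref(Obj *o) {
        o->refcount++;
    }

    static void obj_decref(Obj *o) {
        if (--o->refcount == 0) {
            if (o->finalize)
                o->finalize(o);
            free(o);
        }
    }

    int main(void) {
        Obj *o = malloc(sizeof *o);
        o->refcount = 1;
        o->finalize = NULL;
        obj_incref(o);   /* a second owner appears */
        obj_decref(o);   /* ...and releases its reference */
        obj_decref(o);   /* last owner gone: object is freed */
        return 0;
    }

The well-known cost is that plain reference counting can't reclaim cycles on its own, which is why CPython, for instance, layers a separate cycle detector on top of its refcounts.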
One thing I can think of is that functions are not necessarily first-class members in the language, and this can't be blamed on C alone (I am not talking about passing a function pointer, though it can be argued that C provides you with that feature).
If one were to write a DSL in Groovy (/Scheme/Lisp/Haskell/Lua/JavaScript/and some more that I am not sure of), functions can become first-class members. Making functions first-class members and allowing anonymous functions lets you write concise and more human-readable code (as demonstrated by LINQ).
Yes, eventually all of these are running under C (or assembly if you want to get to that level), but in terms of providing the user of the language the ability to express themselves better, these abstractions do a wonderful job.
Implementing a compiler/interpreter in C doesn't have any major limitations. On the other hand, implementing a language X to C compiler does. For example, according to the Wikipedia article on C--, when compiling a higher level language to C you can't do precise garbage collection, efficient exception handling, or tail recursion optimization. This is the kind of problem that C-- was intended to solve.
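To make the tail-call point concrete: a compiler that emits one C function call per source-level tail call can blow the C stack, because C doesn't guarantee tail-call elimination. One common workaround (a generic sketch, not what C-- or any particular compiler actually does) is a trampoline:

    #include <stdio.h>

    /* Trampoline sketch: instead of making tail calls directly (which may
     * grow the C stack), each step updates shared state and tells a driver
     * loop whether to keep going, so the stack stays flat. */

    typedef struct { long acc; long n; } State;
    typedef enum { STEP_LOOP, STEP_DONE } StepTag;

    static StepTag fact_step(State *s) {
        if (s->n <= 1)
            return STEP_DONE;
        s->acc *= s->n;
        s->n -= 1;
        return STEP_LOOP;   /* "tail call" back into fact_step */
    }

    static long trampoline(StepTag (*step)(State *), State s) {
        while (step(&s) == STEP_LOOP)
            ;               /* keep bouncing instead of recursing */
        return s.acc;
    }

    int main(void) {
        State s = { 1, 10 };
        printf("%ld\n", trampoline(fact_step, s));   /* 3628800, i.e. 10! */
        return 0;
    }

Each tail call becomes "update the state and return to the driver loop", trading a little indirection for constant stack usage; compilers that target C often use more elaborate variants of this idea (or big switch-based dispatch loops) for the same reason.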

Resources