Difficulty with my lexical analyzer - c

I'm trying to program a lexical analyzer for a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:
Keyword - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+/.?[0-9]*
String Constants - ""[EVERY_ASCII_CHARACTER]*""
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)
My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:
assuming there's a stream of characters, x1...xN
foreach i = 1; i <= n; i++
    if x1...xi is accepted by one or more of the 6 groups' DFAs
    {
        take the longest token
        add x1...xi to token-linked-list
        delete x1...xi from input
    }
However, this algorithm will treat every single letter as a complete identifier, since after one character of input the identifier DFA ([a-zA-Z][a-zA-Z0-9]*) is already in an accepting state.
Another possible problem is for the input "intx;", my algorithm will tokenize this stream into "int", "x", ";" which of course is an error.
I'm trying to think about a new algorithm, but I keep failing. Any suggestions?

Code your scanner so that it treats identifiers and keywords the same until the reading is finished.
When you have the complete token, look it up in the keyword table: designate it a keyword if you find it, and an identifier if you don't. This deals with the intx problem immediately; the scanner reads intx, and since that's not a keyword it must be an identifier.
I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.

Tokenizers generally FIRST split the input stream into tokens, based on rules which dictate what constitutes an END of token, and only later decide what kind of token it is (or report an error otherwise). Typical token terminators are things like whitespace (when not part of a string literal), operators, special delimiters, etc.

It seems you are missing the greediness aspect of competing DFAs. Greedy matching (leftmost-longest match) is usually the most useful because it solves the problem of how to choose between competing DFAs. Once you've matched int, you have another node in the IDENTIFIER DFA that advances to intx. A finite automaton doesn't exit until it reaches something it can't consume; if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA's match is taken.
Flex, for example, defaults to greedy matching.
In other words, your proposed problem of intx isn't a problem...
If you have 2 rules that compete for int
rule 1 is the token "int"
rule 2 is IDENTIFIER
When we reach
i n t
we don't immediately ACCEPT int because we see another rule (rule 2) where further input x progresses the automata to a NEXT state:
i n t x
If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is still not in an ACCEPT state, we must keep rule 1 around while we examine more input, to see if we could eventually reach an ACCEPT state in rule 2 with a match longer than rule 1's. If we receive some other character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, it is the match. If not, it is discarded and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.
In the case that 2 rules reach an ACCEPT or EXIT state simultaneously, then precedence is used (order of the rule in the grammar). Generally you put your keywords first so IDENTIFIER doesn't match first.

Related

How to fix shift/reduce problem between tab and space

I am trying to write a minipython compiler, and as we know, Python uses whitespace to delimit blocks. In my situation, I defined the block as exactly 4 spaces, but when I want to create a block with multiple lines it tells me that there is a shift/reduce conflict. I think I know where the problem is: it doesn't know whether to treat the second line as a space or a tab. I'm not sure, though, so here is my code:
lexical :
" " {column = column + strlen(yytext); return mc_tab;}
" " {column = column + strlen(yytext);}
syntaxic :
S : INSTRUCTION S
| {printf("\nWINNER WINNER CHICKEN DINNER\n");YYACCEPT;}
;
INSTRUCTION : DECL mc_jmp | LOOP | COND | mc_jmp
;
COND : IF ELSE
;
IF : mc_if mc_opnArc COMPARISION mc_clsArc mc_dblpnt mc_jmp BEGIN_BLOCK
;
ELSE : mc_else mc_dblpnt BEGIN_BLOCK |
BEGIN_BLOCK : mc_tab INSTRUCTION BEGIN_BLOCK |
;
Just so you know: when I deleted the recursion in BEGIN_BLOCK, the conflict went away.
EDIT :
And there is another problem; I guess if we solve it, the first will be solved too.
When I write a TAB in any line of code, it is treated as if the tab didn't exist; I want the code to treat a tab exactly as it treats the 4 spaces.
As I wrote in comments, shift / reduce conflicts are parser issues. They are being reported by Bison in your case, and they are a function of your grammar (only). If you ask it to do so via either or both of -v or -r all, Bison will produce an output file that shows you (among other things) exactly where such conflicts occur.
The grammar presented in the question is incomplete, but I made it into an input file that Bison would accept by adding section delimiters and adding a %token declaration for each symbol that is not otherwise defined. I also added the semicolon that appears to have been intended after the definition of the ELSE symbol, before the definition of BEGIN_BLOCK. Bison reported four shift / reduce conflicts for the result:
when the token on top of the stack is an IF and the next token is an mc_else, it is ambiguous whether to reduce zero tokens to an ELSE or to shift the mc_else in anticipation of performing a later reduction to an ELSE. This arises in part because the grammar accommodates nested conditionals, so it is a manifestation of the common issue of matching elses with the appropriate ifs.
when the tokens on top of the stack are mc_else mc_dblpnt and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of an INSTRUCTION to follow.
when a token sequence that can be reduced to a BEGIN_BLOCK is required and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of reducing to a BEGIN_BLOCK via the other production for that.
when the tokens on top of the stack are mc_if mc_opnArc COMPARISION mc_clsArc mc_dblpnt mc_jmp and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of an INSTRUCTION to follow.
A common theme emerges: empty rules are biting you hard. Such rules are by no means the only way that a shift / reduce conflict can emerge, but I presume that you will recognize that allowing the parser to create a token out of nothing is something to be handled with considerable care.
The name and usage of BEGIN_BLOCK in particular suggest a design problem, especially in conjunction with the fact that there is no corresponding END_BLOCK. Python's own parsing approach relies on the lexical analyzer to track indentation levels, and to emit synthetic indent and corresponding dedent tokens, as appropriate, when the indentation level changes. Sometimes a sequence of multiple dedents must be emitted to maintain correct indent / dedent pairing. And again, indents and dedents correspond to indentation level changes, not individual characters.
Making the lexer track indents and corresponding dedents allows for grammatic rules along these lines:
if_stmt: IF conditional_expr COLON block optional_else ;
optional_else: /* empty */
| ELSE COLON block
;
block: INDENT stmts DEDENT ;
stmts: stmt
| stmts stmt /* note _left_ recursion */
;
stmt: ...
conditional_expr: ...
Note that the block structure is completely unambiguous -- a block begins with an indent and ends with a matching dedent. Although it may not be immediately obvious, that takes care not only of ambiguities such as arise from your empty BEGIN_BLOCK productions, but also ambiguities such as arise from your rules providing for optional else clauses. The latter are addressed because now the grammar allows at most one if with which any given else can pair.
You could do something similar.

Restrictions to Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00D1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the string literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 1, prior to lexical analysis, and Phase 1 only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.

What will be number of tokens(compiler)?

What will be number of tokens in following ?
int a[2][3];
I think tokens are -> {'int', '[', ']', '[', ']', ';'}
Can someone explain what to consider and what not while compiler calculates tokens ?
Thanks
Expanding on my comment:
How the input is tokenized is a function of your tokenizer (scanner). In principle, the input you presented might be tokenized as "int", "a", "[2]", "[3]", ";", for example. In practice, the most likely choice of tokenization would be "int", "a", "[", "2", "]", "[", "3", "]", ";". I am uncertain why you seem to think that the variable name and dimension values would not be represented among the tokens -- they carry semantic information and therefore must not be left out.
Although separating compiling into a lexical analysis step and a semantic analysis step is common and widely considered useful, it is not inherently essential to make such a separation at all. Where it is made, the choice of tokenization is up to the compiler. One ordinarily chooses tokens so that each represents a semantically significant unit, but there is more than one way to do that. For instance, my alternative example corresponds to a token sequence that might be characterized as
IDENTIFIER, IDENTIFIER, DIMENSION, DIMENSION, TERMINATOR
The more likely approach might be characterized as
IDENTIFIER, IDENTIFIER, OPEN_BRACKET, INTEGER, CLOSE_BRACKET, OPEN_BRACKET,
INTEGER, CLOSE_BRACKET, TERMINATOR
The questions to consider include
What units of the source contain meaningful semantic information in their own right? For instance, it is not useful to make each character a separate token or to split up int into two tokens, because such tokens do not represent a complete semantic unit.
How much responsibility you can or should put on the lexical analyzer (for instance, to understand the context enough to present DIMENSION instead of OPEN_BRACKET, INTEGER, CLOSE_BRACKET)
Updated to add:
The C standard does define the post-preprocessing language in terms of a specific tokenization, which for the statement you gave would be the "most likely" alternative I specified (and that's one reason why it's the most likely). I have answered the question in a more general sense, however, in part because it is tagged [compiler-construction].

Regular expressions in flex has error

I am new in flex and I want to design a scanner using flex.
At this step, I want to make regular expression to match with id, but here are some conditions:
underline can exist in id
you can use _ wherever you want, but if underscores appear consecutively there can be at most 2 of them. For example:
a_b_c »»»» true
a___b »»»» false
123abv »»»» false
integers can't be at the beginning of an id
underline can't exist at the end of an id
The regular expression I have written for that is :
(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)
but now I have 2 questions:
Is the regular expression correct? I have tested it at rubular.com and I think it is, but I'm not sure.
The other, more important problem is that when I write this in my flex file, unfortunately no id is identified, and I can't see why it is not recognized.
Can anyone please help me?
The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for matching word boundaries. Other than that, your regular expression is sound. I was able to get your code working using this modified version of yours: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's, and some of the parentheses that bothered me).
Unfortunately, this causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12, assume that it found an IC, and then read _. Finding no match, it will print that to stdout, then read 345 as another IC.
In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:
Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+). If it is matched, give an error. This will cause Flex to return the entire token 12_345 as an error in the above example.
One other problem: The ID regular expression still won't match anything with underscores at the end of it. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use the \b symbol. Here is a reference on Flex's regular expression syntax so you can find other things to use/avoid.
I think your requirement is:
Identifiers can use only alphanumeric characters and _
Identifiers cannot start with a number
Identifiers cannot end with an _
Identifiers cannot include more than two consecutive _
(When I first read your question, I thought the last requirement was that identifiers cannot include more than two _, but looking at the proposed regex, I think the version above is more accurate.)
Based on the above, you should be able to use the following two Flex patterns:
([[:alpha:]]|__?[[:alnum:]])(_?_?[[:alnum:]])* { /* Handle an identifier */ }
[[:alpha:]_][[:alnum:]_]* { /* Error */ }
Breaking that down:
([[:alpha:]]|__?[[:alnum:]]) matches an alphabetic character or one or two _ followed by an alphanumeric character.
(_?_?[[:alnum:]])* matches a string of underscores and alphanumeric characters, with a maximum of two _ before each alphanumeric character.
The second pattern will match anything which starts with an alphabetic character or _, followed by any number of alphanumerics or _. This will match all valid identifiers as well as the sequences which contain too many consecutive _ or which end with _. If both patterns match (that is, a valid identifier), the first one will win, so it will be correctly recognized. The second pattern will consume the entire erroneous identifier, allowing for easier error recovery.
The pattern in the OP doesn't work because flex treats \b as a backspace character (as in C). Flex does not implement word boundary assertions, but in a lexer you almost never need these; the pattern above can be used if necessary.

What is the Pumping Lemma in Layman's terms?

I saw this question, and was curious as to what the pumping lemma was (Wikipedia didn't help much).
I understand that it's basically a theoretical proof that must be true in order for a language to be in a certain class, but beyond that I don't really get it.
Anyone care to try to explain it at a fairly granular level in a way understandable by non mathematicians/comp sci doctorates?
The pumping lemma is a simple proof to show that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n). This is the simple language which is just any number of as, followed by the same number of bs. So the strings
ab
aabb
aaabbb
aaaabbbb
etc. are in the language, but
aab
bab
aaabbbbbb
etc. are not.
It's simple enough to build a FSM for these examples:
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
all of the strings in our language will be accepted, but there is a problem. After the first four as, the machine loses count of how many as have been input because it stays in the same state. That means that after four, I can add as many as as I want to the string, without adding any bs, and still get the same return value. This means that the strings:
aaaa(a*)bbbb
with (a*) representing any number of as, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*) can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.
Remember that Regular Expressions and Finite State Machines are equivalent. Then replace a and b with opening and closing HTML tags, which can be nested within each other, and you can see why it is not possible to use regular expressions to parse HTML.
It's a device intended to prove that a given language cannot be of a certain class.
Let's consider the language of balanced parentheses (meaning symbols '(' and ')', and including all strings that are balanced in the usual meaning, and none that aren't). We can use the pumping lemma to show this isn't regular.
(A language is a set of possible strings. A parser is some sort of mechanism we can use to see if a string is in the language, so it has to be able to tell the difference between a string in the language or a string outside the language. A language is "regular" (or "context-free" or "context-sensitive" or whatever) if there is a regular (or whatever) parser that can recognize it, distinguishing between strings in the language and strings not in the language.)
LFSR Consulting has provided a good description. We can draw a parser for a regular language as a finite collection of boxes and arrows, with the arrows representing characters and the boxes connecting them (acting as "states"). (If it's more complicated than that, it isn't a regular language.) If we can get a string longer than the number of boxes, it means we went through one box more than once. That means we had a loop, and we can go through the loop as many times as we want.
Therefore, for a regular language, if we can create an arbitrarily long string, we can divide it into xyz, where x is the characters we need to get to the start of the loop, y is the actual loop, and z is whatever we need to make the string valid after the loop. The important thing is that the total lengths of x and y are limited. After all, if the length is greater than the number of boxes, we've obviously gone through another box while doing this, and so there's a loop.
So, in our balanced language, we can start by writing any number of left parentheses. In particular, for any given parser, we can write more left parens than there are boxes, and so the parser can't tell how many left parens there are. Therefore, x is some amount of left parens, and this is fixed. y is also some number of left parens, and this can increase indefinitely. We can say that z is some number of right parens.
This means that we might have a string of 43 left parens and 43 right parens recognized by our parser, but the parser can't tell that from a string of 44 left parens and 43 right parens, which isn't in our language, so the parser can't parse our language.
Since any possible regular parser has a fixed number of boxes, we can always write more left parens than that, and by the pumping lemma we can then add more left parens in a way that the parser can't tell. Therefore, the balanced parenthesis language can't be parsed by a regular parser, and therefore isn't a regular expression.
It's a difficult thing to put in layman's terms, but basically: every sufficiently long word in a regular language must contain a non-empty substring that can be repeated as many times as you wish while the entire new word remains valid for the language.
In practice, pumping lemmas are not sufficient to PROVE that a language belongs to a class; rather, they serve as a way to do a proof by contradiction and show that a language does not fit in a class of languages (regular or context-free), by showing that the pumping lemma does not hold for it.
Basically, you have a definition of a language (like XML), which is a way to tell whether a given string of characters (a "word") is a member of that language or not.
The pumping lemma establishes a method by which you can pick a "word" from the language, and then apply some changes to it. The theorem states that if the language is regular, these changes should yield a "word" that is still from the same language. If the word you come up with isn't in the language, then the language could not have been regular in the first place.
The simple pumping lemma is the one for regular languages, which are the sets of strings described by finite automata, among other things. The main characteristic of a finite automaton is that it only has a finite amount of memory, described by its states.
Now suppose you have a string which is recognized by a finite automaton, and which is long enough to "exceed" the memory of the automaton, i.e. in which states must repeat. Then there is a substring where the state of the automaton at the beginning of the substring is the same as the state at the end of the substring. Since reading the substring doesn't change the state, it may be removed or duplicated an arbitrary number of times, without the automaton being the wiser. So these modified strings must also be accepted.
There is also a somewhat more complicated pumping lemma for context-free languages, where you can remove/insert what may intuitively be viewed as matching parentheses at two places in the string.
By definition regular languages are those recognized by a finite state automaton. Think of it as a labyrinth : states are rooms, transitions are one-way corridors between rooms, there's an initial room, and an exit (final) room. As the name 'finite state automaton' says, there is a finite number of rooms. Each time you travel along a corridor, you jot down the letter written on its wall. A word can be recognized if you can find a path from the initial to the final room, going through corridors labelled with its letters, in the correct order.
The pumping lemma says that there is a maximum length (the pumping length) for which you can wander through the labyrinth without ever going back to a room through which you have gone before. The idea is that since there are only so many distinct rooms you can walk in, past a certain point, you have to either exit the labyrinth or cross over your tracks. If you manage to walk a longer path than this pumping length in the labyrinth, then you are taking a detour : you are inserting a(t least one) cycle in your path that could be removed (if you want your crossing of the labyrinth to recognize a smaller word) or repeated (pumped) indefinitely (allowing to recognize a super-long word).
There is a similar lemma for context-free languages. Those languages can be represented as the words accepted by pushdown automata, which are finite state automata that can make use of a stack to decide which transitions to perform. Nonetheless, since there is still a finite number of states, the intuition explained above carries over, even though the formal expression of the property may be slightly more complex.
In laymans terms, I think you have it almost right. It's a proof technique (two actually) for proving that a language is NOT in a certain class.
For example, consider a regular language (regexp, automata, etc.) with an infinite number of strings in it. At a certain point, as starblue said, you run out of memory because the string is too long for the automaton. This means that there has to be a chunk of the string that the automaton can't tell how many copies of it you have (you're in a loop). So, with any number of copies of that substring in the middle of the string, you are still in the language.
This means that if you have a language that does NOT have this property, ie, there is a sufficiently long string with NO substring that you can repeat any number of times and still be in the language, then the language isn't regular.
For example, take this language L = a^n b^n.
Now try to visualize finite automaton for the above language for some n's.
if n = 1, the string w = ab. Here we can make a finite automaton without looping
if n = 2, the string w = a^2 b^2. Here we can make a finite automaton without looping
if n = p, the string w = a^p b^p. Essentially a finite automaton can be assumed with 3 stages.
First stage, it takes a series of inputs and enter second stage. Similarly from stage 2 to stage 3. Let us call these stages as x, y and z.
There are some observations
Definitely x will contain 'a' and z will contain 'b'.
Now we have to be clear about y:
case a: y may contain 'a' only
case b: y may contain 'b' only
case c: y may contain a combination of 'a' and 'b'
So the finite automaton states for stage y should be able to take inputs 'a' and 'b', but it must not accept more a's and b's than it can count.
If stage y is taking only one 'a' and one 'b', then there are two states required
If it is taking two 'a' and one 'b', three states are required with out loops
and so on....
So the design of stage y requires infinitely many states. We can only make it finite by putting in some loops, and if we put in loops, the finite automaton can accept languages beyond L = a^n b^n. So for this language we can't construct a finite automaton; hence it is not regular.
This is not an explanation as such but it is simple.
For a^n b^n, our FSM would have to be built in such a way that when reading the b's it knows the number of a's already parsed, so that it accepts exactly the same number n of b's. An FSM simply cannot do that.
