parsing of mathematical expressions - c

(in c90) (linux)
input:
sqrt(2 - sin(3*A/B)^2.5) + 0.5*(C*~(D) + 3.11 +B)
a
b /*there are values for a,b,c,d */
c
d
input:
cos(2 - asin(3*A/B)^2.5) +cos(0.5*(C*~(D)) + 3.11 +B)
a
b /*there are values for a,b,c,d */
c
d
input:
sqrt(2 - sin(3*A/B)^2.5)/(0.5*(C*~(D)) + sin(3.11) +ln(B))
/*max length of formula is 250 characters*/
a
b /*there are values for a,b,c,d */
c /*each variable with set of floating numbers*/
d
As you can see, the infix formula in the input depends on the user.
My program takes a formula and n-tuples of values,
then calculates the result for each value of a, b, c and d.
In case you are wondering, the output of the program is a graph.
Sometimes I think I will take the input and store it in a string.
Then another idea arises: I should store the formula in a struct,
but I don't know how to build the code around such a structure.
I really don't know how to store the formula in the program so that
I can do my job.
Can you show me?
/* a,b,c,d are letters;
cos,sin,sqrt,ln are functions */

You need to write a lexical analyzer to tokenize the input (break it into its component parts--operators, punctuators, identifiers, etc.). Inevitably, you'll end up with some sequence of tokens.
After that, there are a number of ways to evaluate the input. One of the easiest ways to do this is to convert the expression to postfix using the shunting yard algorithm (evaluation of a postfix expression is Easy with a capital E).
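As a rough illustration of the shunting yard step, here is a minimal sketch in C. It assumes every token is a single character (a letter or digit operand, one of + - * / ^, or a parenthesis), which is much simpler than the formulas in the question. The names prec and to_postfix are made up for this example.

```c
#include <ctype.h>
#include <string.h>

/* Precedence of a binary operator; 0 means "not an operator". */
static int prec(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    case '^':           return 3;
    default:            return 0;
    }
}

/* Convert single-character infix tokens to postfix.
   out must hold at least strlen(in) + 1 bytes.
   Returns 0 on success, -1 on mismatched parentheses,
   an unknown character, or operator stack overflow. */
int to_postfix(const char *in, char *out)
{
    char stack[256];
    int top = 0;
    size_t n = 0;

    for (; *in != '\0'; in++) {
        char c = *in;
        if (isspace((unsigned char)c)) {
            continue;
        } else if (isalnum((unsigned char)c)) {
            out[n++] = c;                 /* operands go straight to the output */
        } else if (c == '(') {
            if (top == (int)sizeof stack)
                return -1;
            stack[top++] = c;
        } else if (c == ')') {
            while (top > 0 && stack[top - 1] != '(')
                out[n++] = stack[--top];
            if (top == 0)
                return -1;                /* no matching '(' */
            top--;                        /* discard the '(' */
        } else if (prec(c) > 0) {
            /* pop operators that bind tighter, or equally tight
               when the new operator is left-associative */
            while (top > 0 && stack[top - 1] != '(' &&
                   (prec(stack[top - 1]) > prec(c) ||
                    (prec(stack[top - 1]) == prec(c) && c != '^')))
                out[n++] = stack[--top];
            if (top == (int)sizeof stack)
                return -1;
            stack[top++] = c;
        } else {
            return -1;                    /* unknown character */
        }
    }
    while (top > 0) {
        if (stack[top - 1] == '(')
            return -1;                    /* unclosed '(' */
        out[n++] = stack[--top];
    }
    out[n] = '\0';
    return 0;
}
```

With this sketch, to_postfix("A+B*C", out) leaves "ABC*+" in out, and "(A+B)*C" becomes "AB+C*". Multi-character numbers and function names would need a real tokenizer in front of it.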

You should look up "abstract syntax trees" and "expression trees" as well as "lexical analysis", "syntax", "parse", and "compiler theory". Reading text input and getting meaning from it is quite difficult for most things (though we often try to make sure we have simple input).
The first step in generating a parser is to write down the grammar for your input language. In this case your input language is mathematical expressions, so you would do something like:
expr => <function_identifier> ( stmt )
      | ( stmt )
      | <variable_identifier>
      | <numerical_constant>
stmt => expr <operator> stmt
      | expr
(I haven't written a grammar like this {look up BNF and EBNF} in a few years so I've probably made some glaring errors that someone else will kindly point out)
This can get a lot more complicated depending on how you handle operator precedence (multiply and divide before add and subtract type stuff), but the point of the grammar in this case is to help you to write a parser.
There are tools that will help you do this (yacc, bison, antlr, and others) but you can do it by hand as well. There are many many ways to go about doing this, but they all have one thing in common -- a stack. Processing a language such as this requires something called a push down automaton, which is just a fancy way of saying something that can make decisions based on new input, a current state, and the top item of the stack. The decisions that it can make include pushing, popping, changing state, and combining (turning 2+3 into 5 is a form of combining). Combining is usually referred to as a production because it produces a result.
Of the various common types of parsers you will almost certainly start out with a recursive descent parser. They are usually written directly in a general purpose programming language, such as C. This type of parser is made up of several (often many) functions that call each other, and they end up using the system stack as the push down automaton stack.
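To make the idea of functions calling each other concrete, here is a minimal recursive descent sketch in C. It is not the grammar above: it assumes a smaller, precedence-aware grammar (given in the comments) with only numbers, + - * /, unary minus and parentheses, and it evaluates as it goes instead of building a tree. The names cur, parse_expr, parse_term, parse_factor and evaluate are invented for the example, and error handling is deliberately left out.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *cur;   /* current position in the input text */

static double parse_expr(void);

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cur))
        cur++;
}

/* factor := number | '(' expr ')' | '-' factor */
static double parse_factor(void)
{
    double v;
    char *end;

    skip_spaces();
    if (*cur == '(') {
        cur++;
        v = parse_expr();
        skip_spaces();
        if (*cur == ')')
            cur++;
        return v;
    }
    if (*cur == '-') {
        cur++;
        return -parse_factor();
    }
    v = strtod(cur, &end);   /* reads one numeric literal */
    cur = end;
    return v;
}

/* term := factor { ('*' | '/') factor } */
static double parse_term(void)
{
    double v = parse_factor();

    for (;;) {
        skip_spaces();
        if (*cur == '*') {
            cur++;
            v *= parse_factor();
        } else if (*cur == '/') {
            cur++;
            v /= parse_factor();
        } else {
            return v;
        }
    }
}

/* expr := term { ('+' | '-') term } */
static double parse_expr(void)
{
    double v = parse_term();

    for (;;) {
        skip_spaces();
        if (*cur == '+') {
            cur++;
            v += parse_term();
        } else if (*cur == '-') {
            cur++;
            v -= parse_term();
        } else {
            return v;
        }
    }
}

/* Evaluate a whole expression string. */
double evaluate(const char *text)
{
    cur = text;
    return parse_expr();
}
```

Each grammar rule becomes one function, and the C call stack plays the role of the push down automaton's stack. Here evaluate("2 + 3 * 4") gives 14 and evaluate("(2 + 3) * 4") gives 20.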
Another thing you will need to do is to write down the different types of words and operators that make up your language. These words and operators are called lexemes and represent the tokens of your language. I represented these tokens in the grammar <like_this>, except for the parentheses, which represent themselves.
You will most likely want to describe your lexemes with a set of regular expressions. You should be familiar with these if you use grep, sed, awk, or perl. They are a way of describing what is known as a regular language which can be processed by something known as a Finite State Automaton. That is just a fancy way of saying that it is a program that can make a decision about changing state by considering only its current state and the next input (the next character of input). For example part of your lexical description might be:
[A-Z] variable-identifier
sqrt function-identifier
log function-identifier
[0-9]+ unsigned-literal
+ operator
- operator
There are also tools which can generate code for this. lex, which is one of these, is highly integrated with the parser generating program yacc, but since you are trying to learn you can also write your own tokenizer/lexical analysis code in C.
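A hand-written tokenizer for a lexical description like the one above could look roughly like this. It is only a sketch under a few assumptions of mine: variables are single upper-case letters, function names are runs of lower-case letters (sqrt, sin, ln, ...), numbers may have a fractional part, and the token names (TOK_NUMBER and so on) and the function next_token are invented here.

```c
#include <ctype.h>
#include <string.h>

enum token_type {
    TOK_END, TOK_NUMBER, TOK_VARIABLE, TOK_FUNCTION, TOK_OPERATOR, TOK_ERROR
};

struct token {
    enum token_type type;
    char text[32];
};

/* Read one token starting at *p, advance *p past it, and return it. */
struct token next_token(const char **p)
{
    struct token t;
    const char *s = *p;
    size_t n = 0;

    while (isspace((unsigned char)*s))
        s++;

    t.type = TOK_ERROR;
    if (*s == '\0') {
        t.type = TOK_END;
    } else if (isdigit((unsigned char)*s)) {
        /* [0-9]+ with an optional fractional part, e.g. 3.11 */
        while ((isdigit((unsigned char)*s) || *s == '.') &&
               n < sizeof t.text - 1)
            t.text[n++] = *s++;
        t.type = TOK_NUMBER;
    } else if (isupper((unsigned char)*s)) {
        t.text[n++] = *s++;               /* [A-Z]: a variable */
        t.type = TOK_VARIABLE;
    } else if (islower((unsigned char)*s)) {
        while (islower((unsigned char)*s) && n < sizeof t.text - 1)
            t.text[n++] = *s++;           /* sqrt, sin, ln, ... */
        t.type = TOK_FUNCTION;
    } else if (strchr("+-*/^~()", *s) != NULL) {
        t.text[n++] = *s++;               /* operators and parentheses */
        t.type = TOK_OPERATOR;
    } else {
        t.text[n++] = *s++;               /* unknown character: TOK_ERROR */
    }
    t.text[n] = '\0';
    *p = s;
    return t;
}
```

Calling next_token repeatedly on "sqrt(2 - A)" yields the function sqrt, the operator (, the number 2, the operator -, the variable A, the operator ) and finally TOK_END. A fuller version would check that the function name is one it actually knows.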
After you have done all of this (it will probably take you quite a while) you will need to have your parser build a tree to represent the expressions and grammar of the input. In the simple case of expression evaluation (like writing a simple command line calculator program) you could have your parser evaluate the formula as it processed the input, but for your case, as I understand it, you will need to make a tree (or Reverse Polish representation, but trees are easier in my opinion).
Then after you have read the values for the variables you can traverse the tree and calculate an actual number.
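The tree and its traversal might be sketched as below. This is an assumed layout, not the only possible one: struct node, the make_* helpers, eval and free_tree are names I made up, variables are stored as an index into an array of values (0 for A, 1 for B, ...), and a function call holds a pointer to any double-to-double function, such as sqrt from <math.h> or the small negate helper defined here.

```c
#include <stdlib.h>

enum node_kind { NODE_CONST, NODE_VAR, NODE_BINARY, NODE_CALL };

struct node {
    enum node_kind kind;
    double value;                /* NODE_CONST */
    int var;                     /* NODE_VAR: 0 for A, 1 for B, ... */
    char op;                     /* NODE_BINARY: one of + - * / */
    double (*fn)(double);        /* NODE_CALL: function to apply */
    struct node *left, *right;   /* NODE_CALL uses only left */
};

static struct node *new_node(enum node_kind kind)
{
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL)
        abort();
    n->kind = kind;
    return n;
}

struct node *make_const(double v)
{
    struct node *n = new_node(NODE_CONST);
    n->value = v;
    return n;
}

struct node *make_var(int index)
{
    struct node *n = new_node(NODE_VAR);
    n->var = index;
    return n;
}

struct node *make_binary(char op, struct node *l, struct node *r)
{
    struct node *n = new_node(NODE_BINARY);
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}

struct node *make_call(double (*fn)(double), struct node *arg)
{
    struct node *n = new_node(NODE_CALL);
    n->fn = fn;
    n->left = arg;
    return n;
}

/* Example unary function that can be stored in a NODE_CALL. */
double negate(double x)
{
    return -x;
}

/* Walk the tree and compute its value for one tuple of variables. */
double eval(const struct node *n, const double *vars)
{
    double l, r;

    switch (n->kind) {
    case NODE_CONST: return n->value;
    case NODE_VAR:   return vars[n->var];
    case NODE_CALL:  return n->fn(eval(n->left, vars));
    case NODE_BINARY:
        l = eval(n->left, vars);
        r = eval(n->right, vars);
        switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        case '/': return l / r;
        }
    }
    return 0.0;   /* unknown node or operator */
}

void free_tree(struct node *n)
{
    if (n == NULL)
        return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

The point of the tree is that the formula is parsed once and then eval is called once per tuple of values. For example, the tree for 2*A + negate(B) evaluates to 2 when A is 3 and B is 4.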

Possibly the easiest thing to do is use an embedded language like Lua or Python, for both of which the interpreter is written in C. Unfortunately, if you go the Lua route you'll have to convert the binary operations to function calls, in which case it's likely easier to use Python. So I'll go down that path.
If you just want to output the result to the console, this is really easy and you won't even have to delve too deep into Python embedding, since you then only have to write a single-line program in Python to output the value.
Here is the Python code you could use:
exec "import math;A=<vala>;B=<valb>;C=<valc>;D=<vald>;print <formula>".replace("^", "**").replace("log","math.log").replace("ln", "math.log").replace("sin","math.sin").replace("sqrt", "math.sqrt").replace("cos","math.cos")
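Note the replaces are done in Python, since I'm quite sure it's easier to do this in Python and not C. Also note that if you want to use xor ('^') you'll have to remove .replace("^","**") and use ** for powering. Be aware that these are plain substring replacements, so a name such as asin would be mangled by the "sin" replacement.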
I don't know enough C to be able to tell you how to generate this string in C, but after you have, you can use the following program to run it:
#include <Python.h>
int main(int argc, char* argv[])
{
    char* progstr = "...";
    Py_Initialize();
    PyRun_SimpleString(progstr);
    Py_Finalize();
    return 0;
}
You can look up more information about embedding Python in C here: Python Extension and Embedding Documentation
If you need to use the result of the calculation in your program there are ways to read this value from Python, but you'll have to read up on them yourself.

Also, you should review your posts to SO and other posts regarding Binary Trees. Implement this using a tree structure. Traverse as infix to evaluate. There have been some excellent answers to tree questions.
If you need to store this (for persistence, as in a file), I suggest XML. Parsing XML should make you really appreciate how easy your assignment is.

Check out this post:
http://blog.barvinograd.com/2011/03/online-function-grapher-formula-parser-part-2/
It uses the ANTLR library to parse math expressions. This one specifically generates JavaScript, but ANTLR has many output targets such as Java, Ruby, C++ and C#, and you should be able to use the grammar in the post for any of them.

Related

Best way to identify system library commands in Lexer/Bison

I'm writing an interpreter for a new programming language. The language's syntax is very simple and the "system library" commands are treated as simple identifiers (even if is no special construct, but a function like everything else - only pre-defined internally). And no, this is not yet another one of the 1 million Lisps out there.
The question is:
Should I have the Lexer catch them, or should I do it in the AST-construction code?
What I've done so far:
I tried recognizing all of them in my Lexer script, and they are a lot already - over 200. I send the same token back to Bison (SYSTEM_CMD) only with a different value (basically a numeric index pointing to the array of system commands where they are all stored).
As an approach, I think this makes it much faster than having to look up every single one of them in a hash and see if it's a system command.
The thing is, the Lexer is getting quite huge (in terms of the resulting binary file size, I mean) rather fast. And I obviously don't like it.
Given that my focus is something both lightning-fast (I'm already quite good with that) and small enough to be embedded, what would be the most recommended approach?

Good way to create an assembler?

I have a homework assignment to do for my school. The goal is to create a really basic virtual machine as well as a simple assembler. I had no problem creating the virtual machine but I can't think of a 'nice' way to create the assembler.
The grammar of this assembler is really basic: an optional label followed by a colon, then a mnemonic followed by 1, 2 or 3 operands. If there is more than one operand they shall be separated by commas. Also, whitespace is ignored as long as it doesn't occur in the middle of a word.
I'm sure I can do this with strtok() and some black magic, but I'd prefer to do it in a 'clean' way. I've heard about Parse Trees/AST, but I don't know how to translate my assembly code into these kinds of structures.
I wrote an assembler like this when I was a teenager. You don't need a complicated parser at all.
All you need to do is five steps for each line:
1. Tokenize (i.e. split the line into tokens). This will give you an array of tokens and then you don't need to worry about the whitespace, because you will have removed it during tokenization.
2. Initialize some variables representing parts of the line to NULL.
3. A sequence of if statements to walk over the token array and check which parts of the line are present. If they are present put the token (or a processed version of it) in the corresponding variable, otherwise leave that variable as NULL (i.e. do nothing).
4. Report any syntax errors (i.e. combinations of types of tokens that are not allowed).
5. Code generation - I guess you know how to do this part!
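The first four steps could be sketched in C roughly as follows. This is only an illustration under my own simplifying assumptions: struct asm_line and parse_line are invented names, a label must be written with its colon attached and followed by whitespace, commas are treated like whitespace, and a mnemonic with no operands is not rejected.

```c
#include <string.h>

#define MAX_OPERANDS 3

struct asm_line {
    char *label;                    /* NULL when the line has no label */
    char *mnemonic;                 /* NULL for an empty or label-only line */
    char *operands[MAX_OPERANDS];
    int noperands;
};

/* Split one source line in place (the line is modified by strtok).
   Returns 0 on success, -1 on a syntax error. */
int parse_line(char *line, struct asm_line *out)
{
    static const char delims[] = " \t\r\n,";
    char *tok;
    size_t len;
    int i;

    /* step 2: everything starts out as "not present" */
    out->label = NULL;
    out->mnemonic = NULL;
    out->noperands = 0;
    for (i = 0; i < MAX_OPERANDS; i++)
        out->operands[i] = NULL;

    /* step 1: tokenize, which also throws the whitespace away */
    tok = strtok(line, delims);
    if (tok == NULL)
        return 0;                   /* blank line */

    /* step 3: decide what each token is */
    len = strlen(tok);
    if (tok[len - 1] == ':') {
        if (len == 1)
            return -1;              /* a colon with no label name */
        tok[len - 1] = '\0';
        out->label = tok;
        tok = strtok(NULL, delims);
        if (tok == NULL)
            return 0;               /* label on a line of its own */
    }
    out->mnemonic = tok;

    while ((tok = strtok(NULL, delims)) != NULL) {
        /* step 4: report syntax errors */
        if (out->noperands == MAX_OPERANDS)
            return -1;              /* too many operands */
        out->operands[out->noperands++] = tok;
    }
    return 0;
}
```

For the line "loop: add r1, r2, r3" this fills in the label loop, the mnemonic add and three operands. Code generation (step 5) would then look at the filled-in struct.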
What you're looking for is actually lexical analysis, parsing, and finally the generation of the compiled code. There are a lot of frameworks out there that help with creating or generating a parser, like Gold Parser or ANTLR. Creating a language definition (and learning how to, depending on the framework you use) is most often quite a lot of work.
I think you're best off implementing the shunting yard algorithm, which converts your source into a representation computers understand and so makes it easy for your virtual machine to process.
I also want to say that diving into parsers, abstract syntax trees, all the tools available on the web and reading a lot of papers about this subject is a really good learning experience!
You can take a look at some already-made assemblers, like PASMO, an assembler for the Z80 CPU, and get ideas from it. Here it is:
http://pasmo.speccy.org/
I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a simple grammar like the assembly language is, it's enough. Key pieces of my assemblers are:
A symbol table: just an array of structs, with the name of a symbol and its value.
typedef struct
{
    char nombre[256];   /* symbol name */
    u8 valor;           /* symbol value */
} TSymbol;
TSymbol tablasim[MAXTABLA];   /* the symbol table */
int maxsim = 0;               /* number of symbols stored so far */
A symbol is just a name with an associated value. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudoinstruction.
Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.
I perform two passes to the source code:
The first one is to identify symbols and store them in the symbol table, detecting whether they are followed by an EQU instruction or not. If so, the value next to EQU is parsed and assigned to the symbol. Otherwise, the value of the current position is assigned. To update the current position I have to detect whether there is a valid instruction (although I do not assemble it yet) and update it accordingly (this is easy for me because my CPU has a fixed instruction size).
Here is a sample of my code that is in charge of updating the symbol table with a value from EQU or the current position, and advancing the current position if needed.
case 1:
    if (es_equ (token))
    {
        token = strtok (NULL, "\n");
        tablasim[maxsim].valor = parse_numero (token, &err);
        if (err)
        {
            if (err==1)   /* "Syntax error on line %d" */
                fprintf (stderr, "Error de sintaxis en linea %d\n", nlinea);
            else if (err==2)   /* "Symbol [%s] not found on line %d" */
                fprintf (stderr, "Simbolo [%s] no encontrado en linea %d\n", token, nlinea);
            estado = 2;
        }
        else
        {
            maxsim++;
            token = NULL;
            estado = 0;
        }
    }
    else
    {
        tablasim[maxsim].valor = pcounter;
        maxsim++;
        if (es_instruccion (token))
            pcounter++;
        token = NULL;
        estado = 0;
    }
    break;
The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and using strncasecmp() to compare what I find with instruction mnemonics.
If the operands can be expressions, like "1 << (x + 5)", you will need to write a parser. If not, the parser is so simple that you do not need to think in those terms. For each line get the first string (skipping whitespace). Does the string end with a colon? Then it is a label, else it is the mnemonic, etc.
For an assembler there's little need to build an explicit parse tree. Some assemblers do have fancy linkers capable of resolving complicated expressions at link time, but for a basic assembler an ad-hoc lexer and parser should do fine.
In essence you write a little lexer which consumes the input file character-by-character and classifies everything into simple tokens, e.g. numbers, labels, opcodes and special characters.
I'd suggest writing a BNF grammar even if you're not using a code generator. This specification may then be translated into a recursive-descent parser almost by rote. The parser simply walks through the whole code and emits assembled binary code along the way.
A symbol table registering every label and its value is also needed, traditionally implemented as a hash table. Initially when encountering an unknown label (say for a forward branch) you may not yet know the value however. So it is simply filed away for future reference.
The trick is then to spit out dummy values for labels and expressions the first time around but compute the label addresses as the program counter is incremented, then take a second pass through the entire file to fill in the real values.
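A stripped-down sketch of the two passes, in C, might look like this. To keep it short it assumes the lines have already been split, every line holds exactly one instruction of size 1 (so an instruction's address is simply its line index), and the symbol table is a small linear array instead of a hash table. All names (struct line, resolve, lookup) are hypothetical.

```c
#include <string.h>

#define MAX_SYMBOLS 64

struct symbol {
    const char *name;
    int address;
};

static struct symbol symbols[MAX_SYMBOLS];
static int nsymbols;

/* Return the address of a label, or -1 if it is not in the table. */
static int lookup(const char *name)
{
    int i;

    for (i = 0; i < nsymbols; i++)
        if (strcmp(symbols[i].name, name) == 0)
            return symbols[i].address;
    return -1;
}

/* One pre-split source line: an optional label defined on it and an
   optional label that its instruction refers to (NULL when absent). */
struct line {
    const char *label;
    const char *target;
};

/* For each of the n lines, store in out[i] the address its instruction
   refers to, or -1 if it refers to nothing.
   Returns 0 on success, -1 on an undefined label or a full table. */
int resolve(const struct line *lines, int n, int *out)
{
    int i;

    /* pass 1: record the address of every label */
    nsymbols = 0;
    for (i = 0; i < n; i++) {
        if (lines[i].label != NULL) {
            if (nsymbols == MAX_SYMBOLS)
                return -1;
            symbols[nsymbols].name = lines[i].label;
            symbols[nsymbols].address = i;
            nsymbols++;
        }
    }

    /* pass 2: every label is known now, including forward references */
    for (i = 0; i < n; i++) {
        out[i] = -1;
        if (lines[i].target != NULL) {
            out[i] = lookup(lines[i].target);
            if (out[i] < 0)
                return -1;          /* undefined label */
        }
    }
    return 0;
}
```

Because the first pass has seen the whole file, a jump to a label that is defined further down resolves just as easily as a backward jump.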
For a simple assembler, e.g. no linker or macro facilities and a simple instruction set, you can get by with perhaps a thousand or so lines of code. Much of it is brainless, thought-free hand translation from syntax descriptions and opcode tables.
Oh, and I strongly recommend that you check out the dragon book from your local university library as soon as possible.
At least in my experience, normal lexer/parser generators (e.g., flex, bison/byacc) are all but useless for this task.
When I've done it, nearly the entire thing has been heavily table driven -- typically one table of mnemonics, and for each of those a set of indices into a table of instruction formats, specifying which formats are possible for that instruction. Depending on the situation, it can make sense to do that on a per-operand rather than a per-instruction basis (e.g., for mov instructions that have a fairly large set of possible formats for both the source and the destination).
In a typical case, you'll have to look at the format(s) of the operand(s) to determine the instruction format for a particular instruction. For a fairly typical example, a format of #x might indicate an immediate value, x a direct address, and @x an indirect address. Another common form for an indirect address is (x) or [x], but for your first assembler I'd try to stick to a format that specifies instruction format/addressing mode based only on the first character of the operand, if possible.
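A table-driven lookup of this kind could be sketched in C as below. Everything concrete in it is an assumption for illustration: the three mnemonics, their base opcodes, the encoding "base plus mode number", and the use of '#' for immediate and '@' for indirect operands are all made up, as are the names operand_mode and encode.

```c
#include <string.h>

enum addr_mode { MODE_IMMEDIATE, MODE_DIRECT, MODE_INDIRECT };

/* Classify an operand by its first character only. */
enum addr_mode operand_mode(const char *operand)
{
    switch (operand[0]) {
    case '#': return MODE_IMMEDIATE;
    case '@': return MODE_INDIRECT;
    default:  return MODE_DIRECT;
    }
}

struct opcode_entry {
    const char *mnemonic;
    unsigned char base;     /* hypothetical base opcode */
    unsigned char modes;    /* bit i set if addressing mode i is allowed */
};

static const struct opcode_entry opcode_table[] = {
    { "load",  0x10, (1u << MODE_IMMEDIATE) | (1u << MODE_DIRECT) | (1u << MODE_INDIRECT) },
    { "store", 0x20, (1u << MODE_DIRECT) | (1u << MODE_INDIRECT) },
    { "jump",  0x30, (1u << MODE_DIRECT) }
};

/* Return the encoded opcode (base + mode number), or -1 if the mnemonic
   is unknown or does not allow the operand's addressing mode. */
int encode(const char *mnemonic, const char *operand)
{
    size_t i;
    enum addr_mode mode = operand_mode(operand);

    for (i = 0; i < sizeof opcode_table / sizeof opcode_table[0]; i++) {
        if (strcmp(opcode_table[i].mnemonic, mnemonic) == 0) {
            if ((opcode_table[i].modes & (1u << mode)) == 0)
                return -1;          /* mode not allowed for this mnemonic */
            return opcode_table[i].base + (int)mode;
        }
    }
    return -1;                      /* unknown mnemonic */
}
```

Adding an instruction then means adding one row to the table rather than writing new parsing code; in this sketch encode("store", "#1") is rejected because the store row has no immediate form.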
Parsing labels is simpler, and (mostly) separate. Basically, each label is just a name with an address.
As an aside, if possible I'd probably follow the typical format of a label ending with a colon (":") instead of a semicolon (";"). Much more often, a semicolon will mark the beginning of a comment.

My Simpler Dead-code Remover

I am doing a simulation of a dead-code remover in a very simple manner.
My idea is as follows.
Step 1: Read the input C program line by line and store it in a doubly linked list or an array (since deletion and insertion will be easier than with file operations).
Doubt: Is my approach correct? If so, how can I minimize traversing the linked list each time?
Step 2: The strings read are analyzed in parallel, and tables are created to keep track of variable names and their details, functions and their calls, etc.
Step 3: Each entry in the variable table is searched for, and each variable is replaced by the value it holds at that point.
(E.g.)
i=0;
if(i==3) will be replaced by if(0==3).
But in a situation like:
get(a);
i=a;
if(i){}
Here, 'i' will not be replaced since it depends on another variable, and 'a' will not be replaced since it depends on user input.
Doubt: if the input is
if(5*5+6){print hello;} ,
the check is surely unnecessary. How can I evaluate this expression to simplify the code to
{
print hello;
}
Step 4: The strings are searched for if(0), while(0), etc., and the action block is removed using a stack. if(0){/*this will be removed*/}
Step 5: (E.g.) function foo(){/**/} ... if(0) foo(); ... Once all the dead code is removed, foo()'s entry in the function table is checked to get the number of times it is referred to in the code. If it is 0, that function has to be removed using the same stack method.
Step 6: In the remaining functions, the lines below the return statements (if any) are removed, except the '}'. This removal continues to the end of the function, which is identified using a stack.
Step 7: I will assume that my dead-code-free program is ready now. Store the linked list or array in an output file.
My questions are:
1. Is my idea meaningful, and is it implementable? How can I improve this algorithm?
2. While trying to implement this idea, I have to deal more with string manipulation than with removing dead code. Is there any way to reduce the string manipulation in this algorithm?
Do not do it this way. C is a free-form language, and trying to process it line-by-line will result in supporting a subset of C that is so ridiculously restricted that it doesn't deserve the name.
What you need to do is to write a proper parser. There is copious literature about that out there. Find out which textbook your school uses for its compiler-construction course, and work through that -- or just take the course! Only when you've got the parser down should you even begin to consider semantics. Then do your work on abstract syntax trees instead of strings. Alternatively, find an already written and tested parser for C that you can reuse (but you'll still need to learn quite a bit in order to integrate it with your own processing).
If you end up writing the parser yourself, and it's only for your own edification, consider using a simpler language than C as your subject. Even though C at its core is fairly compact as languages go, getting all details of the declaration syntax right is surprisingly tricky, and will probably distract you from what you're actually interested in. And the presence of the preprocessor is an issue in itself which can make it very difficult to design meaningful source-to-source transformations.
By the way, the transformations you sketch are known in the trade as "constant propagation", or (in a more ambitious variant that will clone functions and loop bodies when they have differing constant inputs) "partial evaluation". Googling those terms may be interesting.
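To show what working on a syntax tree instead of strings can look like, here is a small constant-folding sketch in C. It covers only the evaluation of constant subexpressions such as 5*5+6, not propagation of variable values, and it assumes a toy tree of my own design: struct expr, mk and fold are invented names, and only +, - and * on long integers are handled.

```c
#include <stdlib.h>

struct expr {
    char op;                  /* 'k' constant, 'v' variable, or + - * */
    long value;               /* valid when op == 'k' */
    struct expr *lhs, *rhs;   /* operands of + - *, otherwise NULL */
};

/* Allocate one tree node. */
struct expr *mk(char op, long value, struct expr *lhs, struct expr *rhs)
{
    struct expr *e = malloc(sizeof *e);
    if (e == NULL)
        abort();
    e->op = op;
    e->value = value;
    e->lhs = lhs;
    e->rhs = rhs;
    return e;
}

/* Replace every subtree whose operands are all constants by a single
   constant node, working bottom-up. Returns e for convenience. */
struct expr *fold(struct expr *e)
{
    long l, r;

    if (e->op == 'k' || e->op == 'v')
        return e;                         /* leaves cannot be folded further */
    fold(e->lhs);
    fold(e->rhs);
    if (e->lhs->op != 'k' || e->rhs->op != 'k')
        return e;                         /* still depends on a variable */

    l = e->lhs->value;
    r = e->rhs->value;
    e->value = (e->op == '+') ? l + r : (e->op == '-') ? l - r : l * r;
    e->op = 'k';
    free(e->lhs);
    free(e->rhs);
    e->lhs = e->rhs = NULL;
    return e;
}
```

Folding the tree for 5*5+6 leaves a single constant node holding 31, at which point an if on that condition can be replaced by its body. A tree such as i*2 + (3+4) keeps its variable part and only the right-hand side collapses to 7.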

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when a control string falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
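The wrap-around indexing can be sketched like this in C. The struct and function names (ring, ring_put, ring_get, ring_peek, ring_count) are hypothetical, and the sketch assumes a power-of-two size so that the bitwise & trick applies; it stores one byte at a time, whereas real code would refill with a block read.

```c
#include <stddef.h>

#define BUF_SIZE 4096u                /* must be a power of two */
#define BUF_MASK (BUF_SIZE - 1u)

struct ring {
    unsigned char data[BUF_SIZE];
    size_t head;                      /* total number of bytes ever written */
    size_t tail;                      /* total number of bytes ever read */
};

/* Number of bytes currently stored. */
size_t ring_count(const struct ring *r)
{
    return r->head - r->tail;
}

/* Store one byte; returns -1 if the buffer is full. */
int ring_put(struct ring *r, unsigned char c)
{
    if (ring_count(r) == BUF_SIZE)
        return -1;
    r->data[r->head & BUF_MASK] = c;  /* the mask wraps the index around */
    r->head++;
    return 0;
}

/* Remove and return the oldest byte, or -1 if the buffer is empty. */
int ring_get(struct ring *r)
{
    unsigned char c;

    if (ring_count(r) == 0)
        return -1;
    c = r->data[r->tail & BUF_MASK];
    r->tail++;
    return c;
}

/* Look at the byte `offset` positions past the read position without
   consuming it; returns -1 if that byte has not been written yet. */
int ring_peek(const struct ring *r, size_t offset)
{
    if (offset >= ring_count(r))
        return -1;
    return r->data[(r->tail + offset) & BUF_MASK];
}
```

A struct ring must start out zeroed (for example by declaring it static). ring_peek is what lets a scanner look ahead across the physical end of the array without any special case for the boundary.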
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by a ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
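One way to write that state machine by hand, so that a control string split across two reads still works, is to keep the partially seen "$digits" in a small struct between calls. The sketch below is in C and makes some assumptions the question leaves open: at most 3 digits are allowed, codes above 255 are passed through untouched, and anything that turns out not to be a control string is copied to the output unchanged. The names decoder, decode_chunk and decode_end are invented.

```c
#include <string.h>

struct decoder {
    char pending[8];      /* '$' plus the digits seen so far */
    size_t npending;      /* 0 means "not inside a control string" */
};

/* Copy the held-back bytes to out unchanged and forget them. */
static size_t flush_pending(struct decoder *d, char *out, size_t n)
{
    memcpy(out + n, d->pending, d->npending);
    n += d->npending;
    d->npending = 0;
    return n;
}

/* Decode len bytes from in. out needs room for len + 4 bytes, because up
   to 4 bytes may have been held back from the previous chunk.
   Returns the number of bytes written to out. */
size_t decode_chunk(struct decoder *d, const char *in, size_t len, char *out)
{
    size_t i, k, n = 0;

    for (i = 0; i < len; i++) {
        char c = in[i];

        if (d->npending > 0 && c >= '0' && c <= '9' && d->npending < 4) {
            d->pending[d->npending++] = c;        /* one more digit */
        } else if (d->npending > 1 && c == ';') {
            int code = 0;
            for (k = 1; k < d->npending; k++)
                code = code * 10 + (d->pending[k] - '0');
            if (code <= 255) {
                out[n++] = (char)(unsigned char)code;
                d->npending = 0;
            } else {
                n = flush_pending(d, out, n);     /* too large: keep as text */
                out[n++] = c;
            }
        } else {
            if (d->npending > 0)
                n = flush_pending(d, out, n);     /* not a control string */
            if (c == '$')
                d->pending[d->npending++] = c;    /* maybe a new one starts */
            else
                out[n++] = c;
        }
    }
    return n;
}

/* Call once after the last chunk to emit anything still held back. */
size_t decode_end(struct decoder *d, char *out)
{
    return flush_pending(d, out, 0);
}
```

Since the state lives in the struct rather than in the position within a buffer, it does not matter where the chunk boundaries fall: feeding "quick $3" and then "3;brown ..." produces the same output as feeding the whole string at once.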
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading $ */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

What is the Pumping Lemma in Layman's terms?

I saw this question, and was curious as to what the pumping lemma was (Wikipedia didn't help much).
I understand that it's basically a theoretical proof that must be true in order for a language to be in a certain class, but beyond that I don't really get it.
Anyone care to try to explain it at a fairly granular level in a way understandable by non mathematicians/comp sci doctorates?
The pumping lemma is a simple tool for proving that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n). This is the simple language which is just any number of as, followed by the same number of bs. So the strings
ab
aabb
aaabbb
aaaabbbb
etc. are in the language, but
aab
bab
aaabbbbbb
etc. are not.
It's simple enough to build a FSM for these examples:
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
all of the strings in our language will be accepted, but there is a problem. After the first four as, the machine loses count of how many as have been input because it stays in the same state. That means that after four, I can add as many as as I want to the string, without adding any bs, and still get the same return value. This means that the strings:
aaaa(a*)bbbb
with (a*) representing any number of as, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*) can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.
Remember that Regular Expressions and Finite State Machines are equivalent, then replace a and b with opening and closing HTML tags which can be embedded within each other, and you can see why it is not possible to use regular expressions to parse HTML.
It's a device intended to prove that a given language cannot be of a certain class.
Let's consider the language of balanced parentheses (meaning symbols '(' and ')', and including all strings that are balanced in the usual meaning, and none that aren't). We can use the pumping lemma to show this isn't regular.
(A language is a set of possible strings. A parser is some sort of mechanism we can use to see if a string is in the language, so it has to be able to tell the difference between a string in the language or a string outside the language. A language is "regular" (or "context-free" or "context-sensitive" or whatever) if there is a regular (or whatever) parser that can recognize it, distinguishing between strings in the language and strings not in the language.)
LFSR Consulting has provided a good description. We can draw a parser for a regular language as a finite collection of boxes and arrows, with the arrows representing characters and the boxes connecting them (acting as "states"). (If it's more complicated than that, it isn't a regular language.) If we can get a string longer than the number of boxes, it means we went through one box more than once. That means we had a loop, and we can go through the loop as many times as we want.
Therefore, for a regular language, if we can create an arbitrarily long string, we can divide it into xyz, where x is the characters we need to get to the start of the loop, y is the actual loop, and z is whatever we need to make the string valid after the loop. The important thing is that the total lengths of x and y are limited. After all, if the length is greater than the number of boxes, we've obviously gone through another box while doing this, and so there's a loop.
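The x, y, z split can be found mechanically by running the machine and watching for the first repeated box. The C sketch below does this for one small, made-up machine (4 states over the letters a and b, accepting strings of the form a, any number of b, a); the table delta and the names accepts, find_pump and pump are all assumptions for the example.

```c
#include <string.h>

#define NSTATES 4

/* Hypothetical machine accepting a b* a. State 0 is the start,
   state 2 accepts, state 3 is a dead end. Column 0 is the arrow
   for 'a', column 1 the arrow for any other letter (read as 'b'). */
static const int delta[NSTATES][2] = {
    { 1, 3 },
    { 2, 1 },
    { 3, 3 },
    { 3, 3 }
};

/* Return 1 if the machine accepts w, 0 otherwise. */
int accepts(const char *w)
{
    int state = 0;

    for (; *w != '\0'; w++)
        state = delta[state][*w == 'a' ? 0 : 1];
    return state == 2;
}

/* Find the first state visited twice while reading w. On success,
   w = x y z where x is the first *xlen letters and y (the loop) is
   the next *ylen letters. Returns 0 if found, -1 if no state repeats. */
int find_pump(const char *w, size_t *xlen, size_t *ylen)
{
    int first_seen[NSTATES];   /* position where each state was entered */
    int state = 0, i;
    size_t pos, len = strlen(w);

    for (i = 0; i < NSTATES; i++)
        first_seen[i] = -1;
    first_seen[0] = 0;

    for (pos = 0; pos < len; pos++) {
        state = delta[state][w[pos] == 'a' ? 0 : 1];
        if (first_seen[state] >= 0) {
            *xlen = (size_t)first_seen[state];
            *ylen = pos + 1 - *xlen;
            return 0;
        }
        first_seen[state] = (int)(pos + 1);
    }
    return -1;
}

/* Write x, then k copies of y, then z into out. */
void pump(const char *w, size_t xlen, size_t ylen, int k, char *out)
{
    size_t n = xlen;
    int i;

    memcpy(out, w, xlen);
    for (i = 0; i < k; i++) {
        memcpy(out + n, w + xlen, ylen);
        n += ylen;
    }
    strcpy(out + n, w + xlen + ylen);
}
```

For the word abba this finds x = a and y = b (the loop on state 1), and both the pumped-down word aba and the pumped-up word abbbba are still accepted, which is exactly what the lemma promises for a regular language. Any word at least as long as the number of states is guaranteed to contain such a repeat.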
So, in our balanced language, we can start by writing any number of left parentheses. In particular, for any given parser, we can write more left parens than there are boxes, and so the parser can't tell how many left parens there are. Therefore, x is some amount of left parens, and this is fixed. y is also some number of left parens, and this can increase indefinitely. We can say that z is some number of right parens.
This means that we might have a string of 43 left parens and 43 right parens recognized by our parser, but the parser can't tell that from a string of 44 left parens and 43 right parens, which isn't in our language, so the parser can't parse our language.
Since any possible regular parser has a fixed number of boxes, we can always write more left parens than that, and by the pumping lemma we can then add more left parens in a way that the parser can't tell. Therefore, the balanced parenthesis language can't be parsed by a regular parser, and therefore isn't a regular language.
It's a difficult thing to put in layman's terms, but basically every sufficiently long word in a regular language has a non-empty substring that can be repeated as many times as you wish while the entire new word remains valid for the language.
In practice, pumping lemmas are not sufficient to PROVE that a language belongs to a class; rather, they are used in a proof by contradiction to show that a language does not fit in a class of languages (Regular or Context-Free) by showing that the pumping lemma does not hold for it.
Basically, you have a definition of a language (like XML), which is a way to tell whether a given string of characters (a "word") is a member of that language or not.
The pumping lemma establishes a method by which you can pick a "word" from the language, and then apply some changes to it. The theorem states that if the language is regular, these changes should yield a "word" that is still from the same language. If the word you come up with isn't in the language, then the language could not have been regular in the first place.
The simple pumping lemma is the one for regular languages, which are the sets of strings described by finite automata, among other things. The main characteristic of a finite automaton is that it only has a finite amount of memory, described by its states.
Now suppose you have a string, which is recognized by a finite automaton, and which is long enough to "exceed" the memory of the automaton, i.e. in which states must repeat. Then there is a substring where the state of the automaton at the beginning of the substring is the same as the state at the end of the substring. Since reading the substring doesn't change the state it may be removed or duplicated an arbitrary number of times, without the automaton being the wiser. So these modified strings must also be accepted.
There is also a somewhat more complicated pumping lemma for context-free languages, where you can remove/insert what may intuitively be viewed as matching parentheses at two places in the string.
By definition regular languages are those recognized by a finite state automaton. Think of it as a labyrinth : states are rooms, transitions are one-way corridors between rooms, there's an initial room, and an exit (final) room. As the name 'finite state automaton' says, there is a finite number of rooms. Each time you travel along a corridor, you jot down the letter written on its wall. A word can be recognized if you can find a path from the initial to the final room, going through corridors labelled with its letters, in the correct order.
The pumping lemma says that there is a maximum length (the pumping length) for which you can wander through the labyrinth without ever going back to a room through which you have gone before. The idea is that since there are only so many distinct rooms you can walk in, past a certain point, you have to either exit the labyrinth or cross over your tracks. If you manage to walk a longer path than this pumping length in the labyrinth, then you are taking a detour : you are inserting a(t least one) cycle in your path that could be removed (if you want your crossing of the labyrinth to recognize a smaller word) or repeated (pumped) indefinitely (allowing to recognize a super-long word).
There is a similar lemma for context-free languages. Those languages can be represented as the words accepted by pushdown automata, which are finite state automata that can make use of a stack to decide which transitions to perform. Nonetheless, since there is still a finite number of states, the intuition explained above carries over, even though the formal expression of the property may be slightly more complex.
In layman's terms, I think you have it almost right. It's a proof technique (two actually) for proving that a language is NOT in a certain class.
For example, consider a regular language (regexp, automata, etc) with an infinite number of strings in it. At a certain point, as starblue said, you run out of memory because the string is too long for the automaton. This means that there has to be a chunk of the string for which the automaton can't tell how many copies of it you have (you're in a loop). So you can have any number of copies of that substring in the middle of the string and still be in the language.
This means that if you have a language that does NOT have this property, ie, there is a sufficiently long string with NO substring that you can repeat any number of times and still be in the language, then the language isn't regular.
For example, take this language L = a^n b^n.
Now try to visualize a finite automaton for the above language for some values of n.
if n = 1, the string w = ab. Here we can make a finite automaton without looping
if n = 2, the string w = a^2 b^2. Here we can make a finite automaton without looping
if n = p, the string w = a^p b^p. Essentially a finite automaton can be assumed with 3 stages.
First stage, it takes a series of inputs and enter second stage. Similarly from stage 2 to stage 3. Let us call these stages as x, y and z.
There are some observations
Definitely x will contain 'a' and z will contain 'b'.
Now we have to be clear about y:
case a: y may contain 'a' only
case b: y may contain 'b' only
case c: y may contain a combination of 'a' and 'b'
So the states for stage y should be able to take the inputs 'a' and 'b', but must not accept more a's or b's than can be counted.
If stage y is taking only one 'a' and one 'b', then there are two states required
If it is taking two 'a' and one 'b', three states are required with out loops
and so on....
So the design of stage y is purely infinite. We can only make it finite by adding some loops, and if we add loops, the finite automaton can accept strings beyond L = a^n b^n. So for this language we can't construct a finite automaton. Hence it is not regular.
This is not an explanation as such but it is simple.
For a^n b^n our FSM would have to be built in such a way that, when reading the b's, it knows the number of a's already parsed and accepts exactly the same number n of b's. An FSM simply cannot do that.
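For contrast, here is how small the recognizer becomes once a counter is allowed. This C sketch (the name is_anbn is made up, and it requires n to be at least 1) is not a finite state machine, precisely because the variable count can grow without bound and so acts as unbounded memory.

```c
/* Return 1 if s consists of n letters 'a' followed by exactly
   n letters 'b' with n >= 1, and 0 otherwise. */
int is_anbn(const char *s)
{
    unsigned long count = 0;   /* the memory a finite state machine lacks */

    while (*s == 'a') {        /* count the a's */
        count++;
        s++;
    }
    if (count == 0)
        return 0;
    while (*s == 'b' && count > 0) {   /* cancel one a per b */
        count--;
        s++;
    }
    return count == 0 && *s == '\0';
}
```

It accepts ab and aabb and rejects aab, bab and aaabbbbbb. Replace the counter with a fixed number of states and you are back to a machine that loses count once n gets large enough.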

Resources