Steps to creating an NFA from a regular expression - theory

I'm having issues 'describing each step' when creating an NFA from a regular expression. The question is as follows:
Convert the following regular expression to a non-deterministic finite-state automaton (NFA), clearly describing the steps of the algorithm that you use:
(b|a)*b(a|b)
I've made a simple 3-state machine but it's very much from intuition.
This is a question from a past exam written by my lecturer, who also wrote the following explanation of Thompson's algorithm: http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html
Can anyone clear up how to 'describe each step clearly'? It just seems like a set of basic rules rather than an algorithm with steps to follow.
Maybe there's an algorithm I've glossed over somewhere but so far I've just created them with an educated guess.

Short version for general approach.
There's an algo out there called the Thompson-McNaughton-Yamada Construction Algorithm or sometimes just "Thompson Construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene Star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).
Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA
Building the top level
Handle parentheses. Note: In actual implementation, it can make sense to handling parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.
Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a).
Intermediate result:
Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain (a|b). Intermediate result:
There's no alternation outside of parentheses, so we skip it.
Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.
Building P
Skip. No parens.
Skip. No Kleene stars.
Skip. No contatenation.
Build the alternation machine for b|a. Intermediate result:
Integrating P
Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
Building Q
Skip. No parens.
Skip. No Kleene stars.
Skip. No contatenation.
Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)
Integrating Q
We do what we did with P above, except replacing the Q edge with the intermedtae b|a machine we constructed. This is the result:
Tada! Er, I mean, QED.
Want to know more?
All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.
The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it some detail in a popular article about regular expression matching.

In the GitHub repository below, you can find a Java implementation of Thompson's construction where first an NFA is being created from the regex and then an input string is being matched against that NFA:
https://github.com/meghdadFar/regex

https://github.com/White-White/RegSwift
No more tedious words. Check out this repo, it translates your regular expression to an NFA and visually shows you the state transitions of an NFA.

Related

Creating New Matching Logic in Informatica (Ratcliffe - Obershelp)

I am conducting a matching project in Informatica 10.2.1 wherein I need to identify matching strings within product descriptions. Ratcliffe-Obershelp is the matching strategy I need to implement.
I've heard Ratcliffe-Obershelp yields greater results than Jaro - Winkler but I am not sure how to code this into a transformation in Informatica since it is not built in.
No code to show as I don't even know where to start.
I'd expect this to be a transformation/group of transformations that would reproduce the matching score that Ratcliffe-Obershelp creates on a per-line basis.
If I understand correctly, the matching logic performs operations in a loop iterating over the input strings. It is not possible to implement such "loop over string" in Expression Transformation using built-in functions. I see two options:
create DECODE function with multiple conditions for each possible length. - This will be ugly. And can be possible assuming only that we start at the begining of each string - implementing full substring comparison will be... so ugly I can't imagine :)
use Java Transformation - as much as I have putting Java into mappings, there are some cases where it's justified. This look like one of the few. Here's some JS reference

Backjumping, CSP, AIMA book

Context: backjumping is an optimization to vanilla backtracking. It reduces the search tree's branching factor by intelligently jumping back to a node which is the cause of the failure (instead of backtracking to the chronological parent).
Chapter 5 of Artificial Intelligence, a Modern Approach, 3rd ed, p149-150 gives a brief example of how to create the conflict set during backjumping.
The example is about coloring Australia's map.
Quote from the problematic part:
The “terminal” failure of a branch of the search always occurs because
a variable’s domain becomes empty; that variable has a standard
conflict set. In our example, SA fails, and its conflict set is (say)
{WA,NT,Q}. We backjump to Q, and Q absorbs the conflict set from SA
(minus Q itself, of course) into its own direct conflict set, which is
{NT,NSW}; the new conflict set is {WA,NT,NSW}. That is, there
is no solution from Q onward, given the preceding assignment to {WA,NT,NSW}. Therefore, we backtrack to NT, the most recent of these.
NT absorbs {WA,NT,NSW} − {NT} into its own direct conflict set {WA},
giving {WA,NSW} (as stated in the previous paragraph). Now the
algorithm backjumps to NSW, as we would hope.
I'm struggling to understand the emphasized bits:
Backtracking to NT. In what way / why is NT the most recent?
Backjumping to NSW. Why?
The answer to Q1 is in the previous paragraph, the key is the decision order:
Consider again the partial assignment {WA = red, NSW = red} (which, from our earlier discussion, is inconsistent).
Suppose we try T = red next and then assign NT, Q, V , SA.
We know that no assignment can work for these last four variables, so eventually we run out of values to try at NT.
NT is the most recent in the sense that it is the top of the stack at the point where this example is being discussed.
We backjump to NSW as this is the last decision which intersects with the conflict set.

What is the lib/algorithm for matching multiple regular expression against a single target string in C?

I want to match multiple regular expressions against a single string & stop when the first regular expression matches.
I am exploring few solutions from here
http://sljit.sourceforge.net/regex_perf.html
but none of them seem to take into consideration match of multiple regular expression against single string.
Is there any solution to speed this up?
You could just use alternation. That is, if you're looking for the expressions \a+b\ and \[a-z0-9]+xyz\, you could write a single regular expression with grouping: \(a+b)|([a-z0-9]+xyz)\. The regex engine will return the first match it finds.
The Unix fgrep tool does what you're looking for. If you give it a list of expressions to find, it will find all occurrences in a single scan of the file. Dr. Dobb's Journal published an article about it, with C source, sometime back in the late '80s. A quick search reveals that the article was called Parallel pattern matching and fgrep, by Ian Ashdown. I didn't find the source, but I didn't look all that hard. Given a little time, you might have more luck.

How can I quickly do a string matching on many regex keys?

I have an array of elements whose key is a regex.
I would like to come up with a fast algorithm that given a string (not a regex) will return in less than O(N) time what are the matching array values based on execution of the key regex.
Currently I do a linear scan of the array, for each element I execute the respective regex with posix regexec API, but this means that to find the matching elements I have to search across the whole array.
I understand if the aray was composed by only simple strings as key, I could have kept it orderer and use a bsearch style API, but with regex looks like is not so easy.
Am I missing something here?
Example follows
// this is mainly to be considered
// as pseudocode
typedef struct {
regex_t r;
... some other data
} value;
const char *key = "some/key";
value my_array[1024];
bool my_matches[1024];
for(int i =0; i < 1024; ++i) {
if(!regexec(&my_array[i].r, key, 0, 0, REG_EXTENDED))
my_matches[i] = 1;
else
my_matches[i] = 0;
}
But the above, as you can see, is linear.
Thanks
Addendum:
I've put together a simple executable which executed above algorithm and something proposed in below answer, where form a large regex it build a binary tree of sub-regex and navigates it to find all the matches.
Source code is here (GPLv3): http://qpsnr.youlink.org/data/regex_search.cpp
Compile with: g++ -O3 -o regex_search ./regex_search.cpp -lrt
And run with: ./regex_search "a/b" (or use --help flag for options)
Interestingly (and I would say as expected) when searching in the tree, it takes less number of regex to execute, but these are far more complex to run for each comparison, so eventually the time it takes balances out with the linear scan of vectors. Results are printed on std::cerr so you can see that those are the same.
When running with long strings and/or many token, watch out for memory usage; be ready to hit Ctrl-C to stop it from preventing your system to crash.
This is possible but I think you would need to write your own regex library to achieve it.
Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.
Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.
I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M1...Mn} which share a single alphabet Σ. So we have:
Mi = (Qi, Σ, δi, qi0, Fi) where Qi = {qij} for 0 ≤ j < |Qi|, and Qi &subset; Fi.
We construct the union-DFA M&Union; = (Q&Union;, Σ, δ&Union;, q&Union;0) (yes, no F; I'll get to that) as follows:
q&Union;0 = <q10,...,qn0>
δ&Union;(<q1j1,...,qnjn>, α) = <δ1(q1j1, α),... , δn(qnjn, α)> for each α &in; Σ
Q&Union; consists of all states reachable through δ&Union; starting from q&Union;0.
We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δi transition functions.
Now to do a union match on a string α1...αm, we run the union DFA in the usual fashion, starting with its start symbol and applying its transition function to each α in turn. Once we've read the last symbol in the string, the DFA will be in some state <q1j1,...,qnjn>. From that state, we can extract the set of Mi which would have matched the string as: {Mi | qiji &in; Fi}.
In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.
I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.
Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.
So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.
Hope that all helps.

PROLOG / All paths in a directed graph with loops

I've got the following diagram given:
Diagram here
The first gateway/connector is an OR-gateway/connector (it has a circle in it). The gateway/connector with a 'x' in it is a XOR-gateway/connector.
An OR-gateway specifies that one or more of the available paths will be taken.
An XOR-gateway represents a decision to take exactly one path in the flow.
I need to transform this diagram to PROLOG in order to get all possible paths from node 1 to node 8 but I have problems to code the OR-gateway and to find all possible paths.
How can I transform this diagram easily to Prolog and how can i find all possible paths respecting the gateways between two nodes?
Thank you for answers in advance.
As you should know, a Prolog program is basically a set of rules. From your graph, each node could begin a rule where each directed edge gives an explicit rule. By encoding your graph as a set of rules, a query on what satisfies say, (1, X, 8), would give you every possible path, even infinitely.
Encoding the rules should be easy (basic Prolog). Maybe I'm not understanding the special functions behind the OR and XOR. Please explain more if this isn't as trivial as it seems.

Resources