How can I quickly do string matching on many regex keys?

I have an array of elements whose keys are regexes.
I would like to come up with a fast algorithm that, given a string (not a regex), returns in less than O(N) time the matching array values, based on executing each key regex.
Currently I do a linear scan of the array: for each element I execute the respective regex with the POSIX regexec API, which means that to find the matching elements I have to search across the whole array.
I understand that if the array were keyed by plain strings, I could have kept it ordered and used a bsearch-style API, but with regexes it looks like it is not so easy.
Am I missing something here?
Example follows:
// this is mainly to be considered as pseudocode
#include <regex.h>
#include <stdbool.h>

typedef struct {
    regex_t r;            /* compiled with regcomp(..., REG_EXTENDED) */
    /* ... some other data */
} value;

const char *key = "some/key";
value my_array[1024];
bool my_matches[1024];

for (int i = 0; i < 1024; ++i) {
    /* REG_EXTENDED is a regcomp() flag, not a regexec() flag, so pass 0 here */
    my_matches[i] = (regexec(&my_array[i].r, key, 0, NULL, 0) == 0);
}
But the above, as you can see, is linear.
Thanks
Addendum:
I've put together a simple executable which runs the above algorithm and the approach proposed in the answer below, where from a large regex it builds a binary tree of sub-regexes and navigates it to find all the matches.
Source code is here (GPLv3): http://qpsnr.youlink.org/data/regex_search.cpp
Compile with: g++ -O3 -o regex_search ./regex_search.cpp -lrt
And run with: ./regex_search "a/b" (or use --help flag for options)
Interestingly (and, I would say, as expected), when searching in the tree, fewer regexes have to be executed, but each one is far more complex to run, so eventually the time it takes balances out with the linear scan of the vectors. Results are printed on std::cerr so you can see that they are the same.
When running with long strings and/or many tokens, watch out for memory usage; be ready to hit Ctrl-C to stop it before it takes your system down.

This is possible but I think you would need to write your own regex library to achieve it.
Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.
Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.
I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M1...Mn} which share a single alphabet Σ. So we have:
Mi = (Qi, Σ, δi, qi0, Fi) where Qi = {qij} for 0 ≤ j < |Qi|, and Fi ⊆ Qi.
We construct the union DFA M∪ = (Q∪, Σ, δ∪, q∪0) (yes, no F; I'll get to that) as follows:
q∪0 = <q10, ..., qn0>
δ∪(<q1j1, ..., qnjn>, α) = <δ1(q1j1, α), ..., δn(qnjn, α)> for each α ∈ Σ
Q∪ consists of all states reachable through δ∪ starting from q∪0.
We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δi transition functions.
Now to do a union match on a string α1...αm, we run the union DFA in the usual fashion, starting in its start state and applying its transition function to each α in turn. Once we've read the last symbol in the string, the DFA will be in some state <q1j1, ..., qnjn>. From that state, we can extract the set of Mi which would have matched the string as: {Mi | qiji ∈ Fi}.
In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.
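By way of illustration, here is a minimal C++ sketch of the closure construction described above. It assumes the individual DFAs are already built, complete, and run over a byte alphabet; Dfa, ProductDfa and build_union are names I made up, not part of any library.

#include <array>
#include <cstddef>
#include <map>
#include <vector>

struct Dfa {
    std::vector<std::array<int, 256>> trans;  // trans[state][byte] -> state
    std::vector<bool> accepting;              // accepting[state]
    int start = 0;
};

struct ProductDfa {
    std::vector<std::array<int, 256>> trans;
    std::vector<std::vector<bool>> matched;   // matched[state][i]: Mi accepts here
    int start = 0;
};

// Standard closure algorithm: enumerate only the state tuples actually
// reachable from the start tuple, giving each one a dense integer id.
ProductDfa build_union(const std::vector<Dfa>& dfas) {
    ProductDfa p;
    std::map<std::vector<int>, int> id;       // state tuple -> product state id
    std::vector<std::vector<int>> todo;
    std::vector<int> start(dfas.size());
    for (std::size_t i = 0; i < dfas.size(); ++i) start[i] = dfas[i].start;
    id[start] = 0;
    todo.push_back(start);
    for (std::size_t n = 0; n < todo.size(); ++n) {
        std::vector<int> cur = todo[n];       // copy: todo may grow below
        std::vector<bool> acc(dfas.size());
        for (std::size_t i = 0; i < dfas.size(); ++i)
            acc[i] = dfas[i].accepting[cur[i]];
        p.matched.push_back(acc);
        p.trans.emplace_back();
        for (int c = 0; c < 256; ++c) {
            std::vector<int> nxt(dfas.size());
            for (std::size_t i = 0; i < dfas.size(); ++i)
                nxt[i] = dfas[i].trans[cur[i]][c];
            auto ins = id.emplace(nxt, (int)id.size());
            if (ins.second) todo.push_back(nxt);  // first time we see this tuple
            p.trans[n][c] = ins.first->second;
        }
    }
    return p;
}

Matching a key is then a single linear pass (s = p.trans[s][c] for each byte c, starting from p.start), after which p.matched[s] tells you which of the original regexes matched; the per-lookup cost is independent of the number of regexes.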
I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.
Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.
So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.
Hope that all helps.

Related

What does worst case Big-Omega(n) mean?

If Big-Omega is the lower bound, then what does it mean to have a worst-case time complexity of Big-Omega(n)?
From the book "Data Structures and Algorithms in Python" by Michael T. Goodrich:
consider a dynamic array that doubles its size when the number of elements reaches its capacity.
this is from the book:
"we fully explored the append method. In the worst case, it requires
Ω(n) time because the underlying array is resized, but it uses O(1)time in the amortized sense"
The parameterized version, pop(k), removes the element that is at index k < n of a list, shifting all subsequent elements leftward to fill the gap that results from the removal. The efficiency of this operation is O(n−k), as the amount of shifting depends upon the choice of index k. Note well that this implies that pop(0) is the most expensive call, using Ω(n) time.
how is "Ω(n)" describes the most expensive time?
The number inside the parentheses is the number of operations you must do to actually carry out the operation, always expressed as a function of the number of items you are dealing with. You never worry about just how hard those operations are, only the total number of them.
If the array is full and has to be resized you need to copy all the elements into the new array. One operation per item in the array, thus an O(n) runtime. However, most of the time you just do one operation for an O(1) runtime.
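For concreteness, here is a minimal sketch of that doubling strategy in C++ (illustrative only, not the book's code; cleanup omitted):

#include <cstddef>

struct DynArray {
    int*        data = nullptr;
    std::size_t size = 0;
    std::size_t cap  = 0;

    void append(int v) {
        if (size == cap) {                       // full: the Ω(n) worst case
            std::size_t ncap = cap ? 2 * cap : 1;
            int* ndata = new int[ncap];
            for (std::size_t i = 0; i < size; ++i)
                ndata[i] = data[i];              // copy every existing element
            delete[] data;
            data = ndata;
            cap = ncap;
        }
        data[size++] = v;                        // the common O(1) case
    }
};

Over n appends the resize copies sum to fewer than 2n element moves in total, which is where the amortized O(1) bound comes from.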
Common values are:
O(1): One operation only, such as adding it to the list when the list isn't full.
O(log n): This typically occurs when you have a binary search or the like to find your target. Note that the base of the log isn't specified as the difference is just a constant and you always ignore constants.
O(n): One operation per item in your dataset. For example, unsorted search.
O(n log n): Commonly seen in good sort routines where you have to process every item but can divide and conquer as you go.
O(n^2): Usually encountered when you must consider every interaction of two items in your dataset and have no way to organize it. For example, a routine I wrote long ago to find near-duplicate pictures. (Exact duplicates would be handled by making a dictionary of hashes and testing whether the hash existed, and thus be O(n)--the two passes are a constant factor and get discarded; you wouldn't say O(2n).)
O(n^3): By the time you're getting this high you consider it very carefully. Now you're looking at three-way interactions of items in your dataset.
Higher orders can exist, but you need to consider carefully what it's going to do. I have shipped production code that was O(n^8), but with very heavy pruning of paths, and even then it took 12 hours to run. Had the nature of the data not been conducive to such pruning I wouldn't have written it at all--the code would still be running.
You will occasionally encounter even nastier stuff which needs careful consideration of whether it's going to be tolerable or not. For large datasets they're impossible:
O(2^n): Real-world example: attempting to prune paths so as to retain a minimum spanning tree--I computed all possible trees and kept the cheapest. Several experiments showed n never going above 10, so I thought I was OK--until a different seed produced n = 22. I rewrote the routine to give a not-always-perfect answer in O(n^2) instead.
O(n!): I don't know of any examples offhand. It blows up horribly fast.

When to use which base of log for tf-idf?

I'm working on a simple search engine where I use the TF-IDF formula to score how important a search word is. I see people using different bases for the formula, but I see no explanation for when to use which. Does it matter at all, and do you have any recommendations?
My current implementation uses the regular log() function from the math.h library.
TF-IDF literature usually uses base 2, although a common implementation, sklearn, uses the natural logarithm, for example. Just take into account that the lower the base, the bigger the score, which can affect truncation of the search result set by score.
Note that from a mathematical point of view, the base can always be changed later. It's easy to convert from one base to another, because the following equality holds:
log_a(x)/log_a(y) = log_b(x)/log_b(y)
You can always convert from one base to another. It's actually very easy. Just use this formula:
log_b(x) = log_a(x)/log_a(b)
Often bases like 2 and 10 are preferred among engineers: 2 is natural for things that halve or double, and 10 is our number system. Math people prefer the natural logarithm, because it makes calculus a lot easier. The derivative of the function b^x, where b is a constant, is k*b^x (specifically, k = ln b). But if b is equal to e, the base of the natural logarithm, then k is 1.
So let's say that you want to compute the base-2 logarithm of 5.63 using log(). Just use log(5.63)/log(2).
If you have the need for it, just use this function for arbitrary base:
#include <math.h>

/* note: math.h already declares a one-argument logb(), so use another name */
double log_base(double x, double b) {
    return log(x) / log(b);
}

Matlab: Return input value with the highest output from a custom function

I have a vector of numbers like this:
myVec= [ 1 2 3 4 5 6 7 8 ...]
and I have a custom function which takes the input of one number, performs an algorithm and returns another number.
cust(1)= 55, cust(2)= 497, cust(3)= 14, etc.
I want to be able to return the number in the first vector which yielded the highest outcome.
My current thought is to generate a second vector, outcomeVec, which contains the output from the custom function, and then find the index of that vector that has max(outcomeVec), then match that index to myVec. I am wondering, is there a more efficient way of doing this?
What you described is a good way to do it.
outcomeVec = myfunc(myVec);
[~,ndx] = max(outcomeVec);
myVec(ndx) % input that produces max output
Another option is to do it with a loop. This saves a little memory, but may be slower.
maxOutputValue = -Inf;
maxOutputNdx = NaN;
for ndx = 1:length(myVec)
    output = myfunc(myVec(ndx));
    if output > maxOutputValue
        maxOutputValue = output;
        maxOutputNdx = ndx;
    end
end
myVec(maxOutputNdx) % input that produces max output
Those are pretty much your only options.
You could make it fancy by writing a general purpose function that takes in a function handle and an input array. That method would implement one of the techniques above and return the input value that produces the largest output.
Depending on the size of the range of discrete numbers you are searching over, you may find that a golden-section algorithm works more efficiently. I tried, for instance, to minimize the following:
bf = -21;
f = @(x) round(x - bf).^2;
within the range [-100 100] with a routine based on a script from the Mathworks file exchange. This specific file exchange script does not appear to implement the golden section correctly as it makes two function calls per iteration. After fixing this the number of calls required is reduced to 12, which certainly beats evaluating the function 200 times prior to a "dumb" call to min. The gains can quickly become dramatic. For instance, if the search region is [-100000 100000], golden finds the minimum in 25 function calls as opposed to 200000 - the dependence of the number of calls in golden section on the range is logarithmic, not linear.
So if the range is sufficiently large, other methods can definitely beat min by requiring fewer function calls. Minimization search routines sometimes incorporate such a search in early steps. However, you will have a problem with convergence (termination) criteria, which you will have to modify so that the routine knows when to stop. The best option is probably to narrow the search region for application of min by starting out with a few iterations of golden section.
An important caveat is that golden section is guaranteed to work only with unimodal regions, that is, displaying a single minimum. In a region containing multiple minima it's likely to get stuck in one and may miss the global minimum. In that sense min is a sure bet.
Note also that the function in the example here rounds input x, whereas your function takes an integer input. This means you would have to place a wrapper around your function which rounds the input passed by the calling golden routine.
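For illustration, here is a minimal golden-section sketch in C++ that makes only one new function evaluation per iteration (the fix described above). The objective mirrors the Matlab example; none of this is the file exchange script, and all names are mine.

#include <cmath>
#include <cstdio>

static int calls = 0;

double f(double x) {                  // the example objective: round(x - bf)^2
    const double bf = -21;
    ++calls;
    double t = std::round(x - bf);
    return t * t;
}

double golden_min(double a, double b, double tol) {
    const double invphi = (std::sqrt(5.0) - 1.0) / 2.0;  // ~0.618
    double c = b - invphi * (b - a);
    double d = a + invphi * (b - a);
    double fc = f(c), fd = f(d);
    while (b - a > tol) {
        if (fc < fd) {                // minimum lies in [a, d]
            b = d; d = c; fd = fc;    // reuse the old evaluation at c
            c = b - invphi * (b - a);
            fc = f(c);                // the only new call this iteration
        } else {                      // minimum lies in [c, b]
            a = c; c = d; fc = fd;    // reuse the old evaluation at d
            d = a + invphi * (b - a);
            fd = f(d);                // the only new call this iteration
        }
    }
    return (a + b) / 2;
}

int main() {
    double x = golden_min(-100, 100, 1.0);
    std::printf("minimum near %g after %d calls\n", x, calls);
}

Once the bracket is down to a few integers, a final min over the remaining candidates guards against off-by-one effects on the discrete grid.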
Others appear to have used genetic algorithms to perform such a search, although I did not research this.

Steps to creating an NFA from a regular expression

I'm having issues 'describing each step' when creating an NFA from a regular expression. The question is as follows:
Convert the following regular expression to a non-deterministic finite-state automaton (NFA), clearly describing the steps of the algorithm that you use:
(b|a)*b(a|b)
I've made a simple 3-state machine but it's very much from intuition.
This is a question from a past exam written by my lecturer, who also wrote the following explanation of Thompson's algorithm: http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html
Can anyone clear up how to 'describe each step clearly'? It just seems like a set of basic rules rather than an algorithm with steps to follow.
Maybe there's an algorithm I've glossed over somewhere but so far I've just created them with an educated guess.
Short version for general approach.
There's an algo out there called the Thompson-McNaughton-Yamada Construction Algorithm or sometimes just "Thompson Construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene Star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).
Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA
Building the top level
Handle parentheses. Note: in an actual implementation, it can make sense to handle parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.
Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a).
Intermediate result: [diagram omitted]
Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain a|b). Intermediate result: [diagram omitted]
There's no alternation outside of parentheses, so we skip it.
Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.
Building P
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for b|a. Intermediate result: [diagram omitted]
Integrating P
Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this: [diagram omitted]
Building Q
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)
Integrating Q
We do what we did with P above, except replacing the Q edge with the intermediate b|a machine we constructed. This is the result: [diagram omitted]
Tada! Er, I mean, QED.
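If it helps to see the construction as code, here is a minimal C++ sketch of Thompson-style fragments, each with a single start and a single accept state. All names are illustrative, and this is not the tool that generated the diagrams above.

#include <utility>
#include <vector>

struct Nfa {
    // edges[s] = list of (symbol, target); symbol -1 means an epsilon edge
    std::vector<std::vector<std::pair<int, int>>> edges;
    int add_state() { edges.emplace_back(); return (int)edges.size() - 1; }
    void add_edge(int from, int sym, int to) { edges[from].push_back({sym, to}); }
};

struct Frag { int start, accept; };

// literal c:  start --c--> accept
Frag lit(Nfa& n, int c) {
    Frag f{n.add_state(), n.add_state()};
    n.add_edge(f.start, c, f.accept);
    return f;
}

// concatenation ab: link a's accept to b's start with an epsilon edge
Frag cat(Nfa& n, Frag a, Frag b) {
    n.add_edge(a.accept, -1, b.start);
    return {a.start, b.accept};
}

// alternation a|b: new start/accept with epsilon edges into and out of both
Frag alt(Nfa& n, Frag a, Frag b) {
    Frag f{n.add_state(), n.add_state()};
    n.add_edge(f.start, -1, a.start);
    n.add_edge(f.start, -1, b.start);
    n.add_edge(a.accept, -1, f.accept);
    n.add_edge(b.accept, -1, f.accept);
    return f;
}

// Kleene star a*: epsilon edges allow zero or more passes through a
Frag star(Nfa& n, Frag a) {
    Frag f{n.add_state(), n.add_state()};
    n.add_edge(f.start, -1, a.start);
    n.add_edge(f.start, -1, f.accept);   // zero repetitions
    n.add_edge(a.accept, -1, a.start);   // loop back for more
    n.add_edge(a.accept, -1, f.accept);
    return f;
}

// Building (b|a)*b(a|b) from the question, step by step:
//   Nfa n;
//   Frag p = star(n, alt(n, lit(n, 'b'), lit(n, 'a')));   // (b|a)*
//   Frag m = cat(n, cat(n, p, lit(n, 'b')),
//                   alt(n, lit(n, 'a'), lit(n, 'b')));    // ...b(a|b)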
Want to know more?
All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.
The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it in some detail in a popular article about regular expression matching.
In the GitHub repository below, you can find a Java implementation of Thompson's construction where first an NFA is being created from the regex and then an input string is being matched against that NFA:
https://github.com/meghdadFar/regex
https://github.com/White-White/RegSwift
No more tedious words. Check out this repo, it translates your regular expression to an NFA and visually shows you the state transitions of an NFA.

number combination algorithm

Write a function that given a string of digits and a target value, prints where to put +'s and *'s between the digits so they combine exactly to the target value. Note there may be more than one answer, it doesn't matter which one you print.
Examples:
"1231231234",11353 -> "12*3+1+23*123*4"
"3456237490",1185 -> "3*4*56+2+3*7+490"
"3456237490",9191 -> "no solution"
If you have an N digit value, there are N-1 possible slots for the + or * operators. So brute force, there are 3^(N-1) possibilities. Testing all of these is inefficient.
BUT
Your examples are all 10 digits. 3^9 = 19683, so brute force is FINE! No need to get any fancier.
So all you need to do is iterate through all 19683 cases, each time building a string for that case, and evaluating the expression. Evaluating the expression is a straightforward task. Iterating is straightforward too: just use an incrementing value; you can read the state of the first slot as (i%3), which gives you "no operator", "+", or "*"; the state of the second slot is (i/3)%3, the state of the third slot is (i/9)%3, and so on.
Even with crude parsing code, CPUs are fast.
The brute force option starts becoming ugly after about 20 digits, and you'd have to switch to be more clever.
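Here is a minimal sketch of that enumeration in C++, using exactly the base-3 slot encoding described above. The input and target are hard-coded from the first example; sums and products are kept in 64-bit integers, which is fine at this size but would overflow on much longer digit strings.

#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::string digits = "1231231234";
    std::int64_t target = 11353;
    int slots = (int)digits.size() - 1;

    std::int64_t cases = 1;
    for (int i = 0; i < slots; ++i) cases *= 3;   // 3^(N-1) operator patterns

    for (std::int64_t mask = 0; mask < cases; ++mask) {
        std::string expr(1, digits[0]);
        std::int64_t m = mask;
        for (int i = 0; i < slots; ++i) {
            int op = (int)(m % 3); m /= 3;        // slot i: 0 none, 1 '+', 2 '*'
            if (op == 1) expr += '+';
            else if (op == 2) expr += '*';
            expr += digits[i + 1];
        }
        // evaluate expr as a sum of products (there are no parentheses)
        std::int64_t total = 0, term = 1, num = 0;
        for (char c : expr) {
            if (c == '+')      { total += term * num; term = 1; num = 0; }
            else if (c == '*') { term *= num; num = 0; }
            else               num = num * 10 + (c - '0');
        }
        total += term * num;
        if (total == target) { std::cout << expr << "\n"; return 0; }
    }
    std::cout << "no solution\n";
}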
If this is for the gaming programmer position, do not use the brute force approach. I did that but failed this a couple of years ago. I later heard from someone inside that the dynamic programming approach is the one that gets the job.
This can be solved either by backtracking or by dynamic programming.
The "cleverer" approach (using dynamic programming) is basically this:
For each substring of the original string, figure out all possible values it can create. (e.g. in your first example "12" can become either 1+2=3 or 1*2=2)
There may be a lot of different combinations, but many of them will be duplicates. (Also, you should ignore all combinations that are greater than the target).
Thus, when you add a "+" or a "*", you can envision it as combining two substrings of the string. (and since you have the possible values for each substring, you can see if such a combination is possible)
These values can be generated similarly: try splitting the substring in all possible ways, and combine the different values in each half of the substring.
The total number of "states", then, is something like |S|^2 * target - for your example case, it's worse than the brute-force method. But if you had a string of length 1000 and a target of say 5000, then the problem would be solvable with dynamic programming.
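For comparison, here is a compact backtracking sketch in C++ (the first approach mentioned above). It respects * precedence by carrying the value of the last multiplicative term separately, which is the same bookkeeping a substring DP needs in order to combine values correctly. All names are mine.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// eval: value of the expression built so far; last: its final product term,
// kept separately so a '*' can be applied with the right precedence.
bool solve(const std::string& d, std::size_t pos, std::int64_t eval,
           std::int64_t last, std::int64_t target, std::string& expr) {
    if (pos == d.size()) return eval == target;
    std::int64_t num = 0;
    std::string chunk;
    for (std::size_t i = pos; i < d.size(); ++i) {
        num = num * 10 + (d[i] - '0');            // extend the current operand
        chunk += d[i];
        if (pos == 0) {                           // first operand: no operator
            expr = chunk;
            if (solve(d, i + 1, num, num, target, expr)) return true;
        } else {
            std::size_t mark = expr.size();
            expr += '+'; expr += chunk;
            if (solve(d, i + 1, eval + num, num, target, expr)) return true;
            expr.resize(mark);
            expr += '*'; expr += chunk;
            if (solve(d, i + 1, eval - last + last * num, last * num,
                      target, expr)) return true;
            expr.resize(mark);
        }
    }
    return false;
}

int main() {
    std::string d = "3456237490", expr;
    if (solve(d, 0, 0, 0, 1185, expr)) std::cout << expr << "\n";
    else std::cout << "no solution\n";
}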
Google Code Jam had an extended version of this problem last year (in Round 1C), called Ugly Numbers. You can visit that link and click "Contest Analysis" for some approaches to that problem, when extended to large numbers of digits.
