I have been trying out some regular expressions lately. Now, I have 3 symbols a, b and c.
I first looked at a case where I don't want 2 consecutive a's. The regex would be something like:
((b|c) + a(b|c))*(a + epsilon)
Now I'm wondering if there's a way to generalize this problem to say something like:
A regular expression with no two consecutive a's and no two consecutive b's. I tried stuff like:
(a(b|c) + b(a|c) + c)* (a + b + epsilon)
But this accepts inputs such as "abba" or "baab", which have 2 consecutive a's (or b's), which is not what I want. Can anyone suggest a way out?
If you can't do a negative match then perhaps you can use negative lookahead to exclude strings matching aa and bb? Something like the following (see Regex 101 for more information):
(?!.*(aa|bb).*)^.*$
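As a quick sketch of that idea in Python (my own illustration; I've tightened the original `.*` to `[abc]*` on the assumption the alphabet is just a, b and c):

```python
import re

# Negative lookahead: fail immediately if "aa" or "bb" occurs anywhere.
pattern = re.compile(r"(?!.*(aa|bb))^[abc]*$")

print(bool(pattern.match("abcab")))  # True: no doubled a or b
print(bool(pattern.match("abba")))   # False: contains "bb"
print(bool(pattern.match("baab")))   # False: contains "aa"
```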
I (think I) solved this by hand-drawing a finite state machine, then generating a regex using FSM2Regex. The state machine is written below (with the syntax from the site):
#states
s0
s1
s2
s3
#initial
s0
#accepting
s1
s2
s3
#alphabet
a
b
c
#transitions
s0:a>s1
s0:b>s2
s0:c>s3
s1:b>s2
s1:c>s3
s2:a>s1
s2:c>s3
s3:c>s3
s3:a>s1
s3:b>s2
If you look at the transitions, you'll notice it's fairly straightforward: I have states that correspond to a "sink" for each letter of the alphabet, and I only allow transitions out of that state for the other letters (not the "sink" letter). For example, s1 is the "sink" for a. From all other states, you can get to s1 with an a. Once you're in s1, though, you can only get out of it with a b or a c, which have their own "sinks" s2 and s3 respectively. Because we can repeat c, s3 has a transition to itself on the character c. Paste the text block into the site, and it'll draw all this out for you and generate the regex.
The regex it generated for me is:
c+cc*(c+$+b+a)+(b+cc*b)(cc*b)*(c+cc*(c+$+b+a)+$+a)+(a+cc*a+(b+cc*b)(cc*b)*(a+cc*a))(cc*a+(b+cc*b)(cc*b)*(a+cc*a))*(c+cc*(c+$+b+a)+(b+cc*b)(cc*b)*(c+cc*(c+$+b+a)+$+a)+b+$)+b+a
Which, I'm pretty sure, is not optimal :)
EDIT: The generated regex uses + as the choice operator (usually known to us coders as |), which means it's probably not suitable for pasting into code. However, I'm too scared to change it and risk ruining my regex :)
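To sanity-check the machine, here's a small Python sketch (my own, not from the site) that simulates the transitions above directly:

```python
# Transition table copied from the state machine above; missing entries
# (e.g. an 'a' while in the 'a' sink s1) mean the string is rejected.
TRANSITIONS = {
    ('s0', 'a'): 's1', ('s0', 'b'): 's2', ('s0', 'c'): 's3',
    ('s1', 'b'): 's2', ('s1', 'c'): 's3',
    ('s2', 'a'): 's1', ('s2', 'c'): 's3',
    ('s3', 'a'): 's1', ('s3', 'b'): 's2', ('s3', 'c'): 's3',
}
ACCEPTING = {'s1', 's2', 's3'}

def accepts(s):
    state = 's0'
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING

print(accepts("abcab"))  # True
print(accepts("abba"))   # False: contains "bb"
```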
You can use back references to match the previous character:
string input = "acbbaacbba";
string pattern = @"([ab])\1";
var matchList = Regex.Matches(input, pattern);
This pattern will match: bb, aa and bb. If there is no match in your input string, it does not contain a repeated a or b.
Explanation:
([ab]): define a group, you can extend your symbols here
\1: back referencing the group, so for example, when 'a' is matched, \1 would be 'a'
check this page: http://www.regular-expressions.info/backref.html
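For reference, the same backreference idea in Python (a hypothetical port of the C# snippet above):

```python
import re

# ([ab]) captures an 'a' or 'b'; \1 requires the same character again.
doubled = re.findall(r"([ab])\1", "acbbaacbba")
print(doubled)  # one capture per doubled pair: ['b', 'a', 'b']
```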
Related
Statement:
Given an array of strings S. You can make a string by combining elements from the array S (you can use an element more than once)
In some situations, there are many ways to make a certain string from the array S.
Example:
S = {a, ab, ba}
Then there are 2 ways to make the string "aba":
"a" + "ba"
"ab" + "a"
Question:
Given an array of strings S, find the shortest string such that there is more than one way to make that string from S. If there is none, print out -1.
P.S.: I have been thinking for many days, but this is the best I've got so far:
Generate all permutations of the array
For each permutation, make a string from the array S
Check if that string is made before, if yes, print out the string, if not, save that string.
But this algorithm clearly won't pass all the test cases. I can't think of any better algorithm.
Imagine that you have found your string, and you are matching it two different ways with strings from S. You start with two different strings that match the prefix, and then you repeatedly add a string from S to the shorter one until you end up with a matching length. From your example, that's
"ab"
"a"
"ab"
"aba"
"aba"
"aba"
At every step, you have 2 different strings from S that overlap at the end.
Imagine a directed graph where every vertex is a tuple (i,j,t), where i and j are the indexes of the overlapping strings at the end, and t is the number of characters left over at the end of the longer one after the overlapping section. Make it a rule that t >= 0 and that string i is always the one that ends first.
The edges of the graph indicate which vertexes you can get to by adding a new string to the shorter one, with a cost equal to the length of the added string. Of course you can only add a string if it overlaps with the t characters left over on the longer side.
Your task is then to use Dijkstra's algorithm to find the shortest path in this graph, from an initial selection of 2 distinct strings to a pair with t=0. Initially sorting the array of strings will let you use a binary search to find the strings that overlap the required suffix (the longer ones will all be together), which is an effective optimization.
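A rough Python sketch of this idea (my own illustration; `shortest_ambiguous` is a made-up name, the state is encoded simply as the overhang string rather than the (i,j,t) tuple, and it returns only the length of the answer, or -1):

```python
import heapq

def shortest_ambiguous(S):
    # State: the "overhang" left at the end of the longer decomposition.
    # Distance: total length of the longer decomposition built so far.
    INF = float('inf')
    dist, pq = {}, []
    for i, a in enumerate(S):
        for j, b in enumerate(S):
            # Two different starting strings where one is a prefix of the other.
            if i != j and b.startswith(a):
                over, d = b[len(a):], len(b)
                if d < dist.get(over, INF):
                    dist[over] = d
                    heapq.heappush(pq, (d, over))
    while pq:
        d, over = heapq.heappop(pq)
        if d > dist.get(over, INF):
            continue
        if over == "":
            return d  # both decompositions now have equal length
        for s in S:
            if s.startswith(over):    # shorter side overtakes the longer one
                nxt, nd = s[len(over):], d + len(s) - len(over)
            elif over.startswith(s):  # shorter side still behind
                nxt, nd = over[len(s):], d
            else:
                continue
            if nd < dist.get(nxt, INF):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return -1

print(shortest_ambiguous(["a", "ab", "ba"]))  # 3 ("aba")
print(shortest_ambiguous(["ab", "cd"]))       # -1
```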
Here's an O(N³) algorithm, where N is the total length of the strings in S:
1. For every element Si in S:
   1.1. Construct an NFA for the regular expression Si(S1|...|Sn)*
   1.2. Construct an NFA for the regular expression (S1|...|Si-1|Si+1|...|Sn)(S1|...|Sn)*
   1.3. Construct an NFA that is the intersection of the NFAs from steps 1.1 and 1.2
   1.4. Find the shortest string accepted by the NFA from step 1.3
2. Return the shortest string among the strings from step 1.4
The above algorithm can be improved to O(N² log N):
1. Construct an NFA for the regular expression (S1|...|Sn)*
2. Construct the cross-product of two copies of the NFA from step 1.
3. For each state in the NFA from step 2, find the shortest string accepted by the NFA starting from that state.
4. Let 𝒮 = {S}.
5. While 𝒮 is not empty:
   5.1. Take an element T from 𝒮.
   5.2. Partition T into P and Q somewhat evenly.
   5.3. Construct an NFA for the regular expression (P1|...|Pp)(S1|...|Sn)*, reusing the NFA from step 1.
   5.4. Construct an NFA for the regular expression (Q1|...|Qq)(S1|...|Sn)*, reusing the NFA from step 1.
   5.5. Construct the cross-product of the NFAs from steps 5.3 and 5.4, reusing the NFA from step 2.
   5.6. Find the shortest string accepted by the NFA from step 5.5, reusing the result of step 3.
   5.7. If P has more than 1 element, put P into 𝒮.
   5.8. If Q has more than 1 element, put Q into 𝒮.
6. Return the shortest string among the strings from step 5.6
Edit: The following is an O(N²) algorithm, improved from the top answer:
1. Let T be the set of every suffix of every string in S.
2. Build a trie out of T.
3. Let G be a weighted directed graph whose vertices are the elements of T. The edges are: for each string Si in S and Tj in T, if Si = Tj + D or Tj = Si + D (using the trie from step 2 to find all such pairs), add an edge from D to Tj with weight equal to the length of Si.
4. Find the distance from the empty string to every vertex in G.
5. For each pair of strings Si, Sj in S, if Si != Sj and Si = Sj + D (using the trie from step 2 to find all such pairs), find the distance from the empty string to D (using step 4).
6. The length of the answer is half of the shortest distance among all distances in step 5. (It's trivial to find the actual answer, but I'm too lazy to describe it :p)
I have a file in the following format:
Name Number Position
A 1
B 2
C 3
D 4
Now in cell A3, I applied =IF(B2=1,"Goal Keeper",OR(IF(B2=2,"Defender",OR(IF(B2=3,"MidField","Striker"))))), but it is giving me the error #VALUE!.
I looked it up on Google, and my formula seems correct.
What I basically want is:
1 - Goalkeeper, 2 - Defender, 3 - Midfield, 4 - Striker
Yes, the other way is to just filter the numbers and copy-paste the text,
but I want to do it using a formula, and I want to know where I went wrong.
Your immediate problem lies with the expression (for example):
OR(IF(B2=3,"MidField","Striker"))
   |  \__/ \________/ \_______/|
   |  bool   string    string  |
    \__________________________/
               string
The OR function expects a series of boolean values (true or false) and you're giving it a string value from the inner IF.
You don't actually need the OR bits in this specific case; the IF is a full if-else. So you can just use:
=IF(B2=1,"Goal Keeper",IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")))
This means that B2=1 will result in "Goal Keeper"; otherwise it will evaluate IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")).
Then that means that, if B2=2, it will result in "Defender"; otherwise it will evaluate IF(B2=3,"MidField","Striker").
Finally, that means that B2=3 will result in "MidField"; anything else will give "Striker".
The only situation I can envisage where OR would come in handy here would be when two different numbers were to generate the same string. Let's say both 1 and 4 should give "Goalie"; then you could use:
=IF(OR(B2=1,B2=4),"Goalie",IF(B2=2,"Defender","MidField"))
Keep in mind that a more general solution would be better implemented with the Excel lookup functions, ones that would search a table (on the spreadsheet somewhere) which mapped the integers to strings. Then, if the mapping needed to change, you would just update the table rather than going back and changing the formula in every single row.
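For example (assuming the mapping table is placed in E1:F4, with the numbers 1-4 in column E and the position names in column F), such a lookup formula might look like:

```
=VLOOKUP(B2, $E$1:$F$4, 2, FALSE)
```

The FALSE argument forces an exact match on the number, so an unmapped value gives #N/A instead of a wrong position.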
If you are actually tasked with solving the problem by using the IF and OR functions within the same formula, this is the only way I can see how:
=IF(OR(B2=1, B2=2, B2=3, B2=4), IF(B2=1, "Goal Keeper", IF(B2=2, "Defender", IF(B2=3, "MidField", "Striker"))))
If B2 does not equal 1-4, the OR function will return FALSE and completely bypass all of the nested IF statements.
I have an array of unsigned integers, each corresponding to a string with 12 characters, that can contain 4 different characters, namely 'A','B','C','D'. Thus the array will contain 4^12 = 16777216 elements. The ordering of the elements in the array is arbitrary; I can choose which one corresponds to each string. So far, I have implemented this as simply as that:
unsigned int my_array[16777216];
char my_string[12];
int index = string_to_index(my_string);
my_array[index] = ...;
string_to_index() simply assigns 2 bits per character like this:
A --> 00, B --> 01, C --> 10, D --> 11
For example, ABCDABCDABCD corresponds to the index 000110110001101100011011 in binary, which is 1776411 in decimal.
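That mapping can be sketched in a few lines of Python (a rough equivalent of the C snippet's string_to_index, for illustration):

```python
def string_to_index(s):
    # Two bits per character, first character most significant:
    # A -> 00, B -> 01, C -> 10, D -> 11
    idx = 0
    for ch in s:
        idx = (idx << 2) | "ABCD".index(ch)
    return idx

print(string_to_index("ABCDABCDABCD"))  # 1776411
```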
However, I know for a fact that each string that is used to access the array is the previous string shifted once to the left with a new last character. For example after I access with ABCDABCDABCD, the next access will use BCDABCDABCDA, or BCDABCDABCDB, BCDABCDABCDC, BCDABCDABCDD.
So my question is:
Is there a better way to implement the string_to_index function that takes this last fact into consideration, so that elements that are consecutively accessed are closer in the array? I am hoping to improve my caching performance by doing so.
edit: Maybe I was not very clear: I am looking for a completely different string-to-index correspondence scheme, so that the indexes of ABCDABCDABCD and BCDABCDABCDA are closer.
If the following assumptions hold for your problem, then the solution you implemented is the best one:
The right-most character of the next string is randomly selected, with equal probability for each valid character.
The start of the sequence is not always the same (it is random).
Reason:
When I first read your question, I came up with the following tree (I reduced your problem to strings of length three and only 2 possible characters, A and B, for simplicity). Note that the left-most child of the root node (AAA in this case) is always the same as the root node (AAA), hence I am not building that branch further.
         AAA
            \
            AAB
           /   \
        ABA     ABB
       /   \   /   \
    BAA   BAB BBA   BBB
In this tree, each node has its possible next sequences as child nodes. To improve cache behaviour, you need to traverse this tree breadth-first and store the nodes in the array in that order. For the above tree we get the following string-index combinations:
AAA 0
AAB 1
ABA 2
ABB 3
BAA 4
BAB 5
BBA 6
BBB 7
Assuming value(A) = 0 and value(B) = 1, the index can be calculated as
index = 2^0 * value(string[2]) + 2^1 * value(string[1]) + 2^2 * value(string[0])
This is the same solution as you are already using.
I have written a Python script to check this for other combinations too (like strings of length 4 with A, B, C as possible characters). Script link
So unless the 2 assumptions made at the beginning are false, your solution already takes care of cache optimisation.
I think we could define "closer" first.
For example, we could define a function F which takes a method of calculating the indices of strings. Then F checks every string's index and returns a value based on the distance between neighboring strings' indices.
Then we can compare various ways of calculating the index and find the best one.
Of course we could examine shorter strings first.
Given the following regular expressions:
- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*
The string alice@myprovider.com will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately there seems to be no way to do this. We are using PCRE, and I did not find a way to do this; a search on the Internet was also not fruitful.
A possible way would be to keep the regular expressions sorted by descending specificity and then simply take the first match. Of course, the next question would be how to sort the array of regular expressions. It is not an option to give the responsibility to the end-user to ensure that the array is sorted.
So I hope you guys could help me out here...
Thanks !!
Paul
The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.
Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.
Setup
import re

class RegexElement:
    def __init__(self, string, index=None):
        self.string = string
        self.supersets = set()
        self.subsets = set()
        self.disjoints = set()
        self.intersects = set()
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string, re.IGNORECASE)
        self.index = index

# Relation labels as strings so process() can dispatch via getattr below.
SUPERSET = 'SUPERSET'
SUBSET = 'SUBSET'
INTERSECT = 'INTERSECT'
DISJOINT = 'DISJOINT'
EQUAL = 'EQUAL'
The Tests
Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.
SUPERSET means a is a superset of b. All matches of b will match a.
SUBSET means b is a superset of a.
INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.
DISJOINT means no matches of a will match b.
EQUAL means all matches of a will match b and all matches of b will match a.
def equal_test(a, b):
    if a == b: return EQUAL
The graph
class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)
        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)
        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)
Usage
graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}")
graph.match("00:de:ad:be:ef:00")
A complete usable version is here.
My gut instinct says that not only is this a hard problem, both in terms of computational cost and implementation difficulty, but it may be unsolvable in any realistic fashion. Consider the two following regular expressions to accept the string alice@myprovider.com:
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
Which one of these is more specific?
This is a bit of a hack, but it could provide a practical solution to this question asked nearly 10 years ago.
As pointed out by @torak, there are difficulties in defining what it means for one regular expression to be more specific than another.
My suggestion is to look at how stable the regular expression is with respect to a string that matches it. The usual way to investigate stability is to make minor changes to the inputs, and see if you still get the same result.
For example, the string alice@myprovider.com matches the regex /alice@myprovider\.com/, but if you make any change to the string, it will not match. So this regex is very unstable. But the regex /.*/ is very stable, because you can make any change to the string, and it still matches.
So, in looking for the most specific regex, we are looking for the least stable one with respect to a string that matches it.
In order to implement this test for stability, we need to define how we choose a minor change to the string that matches the regex. This is another can of worms. We could for example, choose to change each character of the string to something random and test that against the regex, or any number of other possible choices. For simplicity, I suggest deleting one character at a time from the string, and testing that.
So, if the string that matches is N characters long, we have N tests to make. Let's look at deleting one character at a time from the string alice@foo.com, which matches all of the regular expressions in the table below. It's 13 characters long, so there are 13 tests. In the table below,
0 means the regex does not match (unstable),
1 means it matches (stable)
              /alice@[a-z]+\.[a-z]+/   /[a-z]+@[a-z]+\.[a-z]+/   /.*/
lice@foo.com            0                         1                1
aice@foo.com            0                         1                1
alce@foo.com            0                         1                1
alie@foo.com            0                         1                1
alic@foo.com            0                         1                1
alicefoo.com            0                         0                1
alice@oo.com            1                         1                1
alice@fo.com            1                         1                1
alice@fo.com            1                         1                1
alice@foocom            0                         0                1
alice@foo.om            1                         1                1
alice@foo.cm            1                         1                1
alice@foo.co            1                         1                1
                      ---                       ---              ---
total score:            6                        11               13
The regex with the lowest score is the most specific. Of course, in general, there may be more than one regex with the same score, which reflects the fact there are regular expressions which by any reasonable way of measuring specificity are as specific as one another. Although it may also yield the same score for regular expressions that one can easily argue are not as specific as each other (if you can think of an example, please comment).
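A minimal Python sketch of this scoring scheme (my own illustration; `stability_score` is a made-up name, and full-string matching is assumed):

```python
import re

def stability_score(pattern, s):
    # Count the one-character deletions of s that still fully match the pattern.
    return sum(bool(re.fullmatch(pattern, s[:i] + s[i + 1:]))
               for i in range(len(s)))

s = "alice@foo.com"
patterns = [r"alice@[a-z]+\.[a-z]+", r"[a-z]+@[a-z]+\.[a-z]+", r".*"]
for p in patterns:
    print(p, stability_score(p, s))
# The lowest score marks the most specific regex.
```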
But coming back to the question asked by @torak, which of these is more specific:
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
We could argue that the second is more specific because it constrains more characters, and the above test will agree with that view.
As I said, the way we choose to make minor changes to the string that matches more than one regex is a can of worms, and the answer that the above method yields may depend on that choice. But as I said, this is an easily implementable hack; it is not rigorous.
And, of course, the method breaks if the string that matches is empty. The usefulness of the test will increase as the length of the string increases. With very short strings, it is more likely to produce equal scores for regular expressions that are clearly different in their specificity.
I'm thinking of a similar problem for a PHP project's route parser. After reading the other answers and comments here, and also thinking about the cost involved, I might go in another direction altogether.
A solution, however, would be to simply sort the regular expression list in order of its string length.
It's not perfect, but simply by removing the []-groups it would be much closer. On the first example in the question, it would turn this list:
- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*
Into this, after removing the content of any []-group:
- alice@+\.+
- +@+\.+
- .*
The same thing goes for the second example in another answer; with the []-groups completely removed and sorted by length, this:
alice@[a-z]+\.[a-z]+
[a-z]+@myprovider.com
Would become sorted as:
+@myprovider.com
alice@+\.+
This is a good enough solution, at least for me, if I choose to use it. The downside would be the overhead of removing all []-groups before sorting and then applying the sort to the unmodified list of regexes, but hey, you can't have everything.
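A rough Python sketch of this heuristic (my own; the helper name `specificity_key` is made up):

```python
import re

def specificity_key(pattern):
    # Strip character classes like [a-z] entirely, then compare lengths:
    # the longer the remaining literal skeleton, the more specific the regex.
    return -len(re.sub(r"\[[^\]]*\]", "", pattern))

patterns = [r".*", r"[a-z]+@[a-z]+\.[a-z]+", r"alice@[a-z]+\.[a-z]+"]
print(sorted(patterns, key=specificity_key))  # most specific first
```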
I have always been interested in algorithms: sorting, crypto, binary trees, data compression, memory operations, etc.
I read Mark Nelson's article about permutations in C++ with the STL function next_permutation(), very interesting and useful. After that, I wrote a class method to get the next permutation in Delphi, since that is the tool I presently use most. This function works in lexicographic order; I got the idea from an answer in another topic here on Stack Overflow. But now I have a big problem: I'm working with permutations of a vector with repeated elements, and there are a lot of permutations that I don't need. For example, I have this first permutation of 7 elements in lexicographic order:
6667778 (6 = 3 times consecutively, 7 = 3 times consecutively)
For my work I consider valid perm only those with at most 2 elements repeated consecutively, like this:
6676778 (6 = 2 times consecutively, 7 = 2 times consecutively)
In short, I need a function that returns only permutations that have at most N consecutive repetitions, according to the parameter received.
Does anyone know if there is some algorithm that already does this?
Sorry for any mistakes in the text, I still don't speak English very well.
Thank you so much,
Carlos
My approach is a recursive generator that doesn't follow branches that contain illegal sequences.
Here's the Python 3 code:
def perm_maxlen(elements, prefix = "", maxlen = 2):
    if not elements:
        yield prefix + elements
        return
    used = set()
    for i in range(len(elements)):
        element = elements[i]
        if element in used:
            # already searched this path
            continue
        used.add(element)
        suffix = prefix[-maxlen:] + element
        if len(suffix) > maxlen and len(set(suffix)) == 1:
            # would exceed maximum run length
            continue
        sub_elements = elements[:i] + elements[i+1:]
        for perm in perm_maxlen(sub_elements, prefix + element, maxlen):
            yield perm

for perm in perm_maxlen("6667778"):
    print(perm)
The implementation is written for readability, not speed, but the algorithm should be much faster than naively filtering all permutations.
print(sum(1 for _ in perm_maxlen("a"*100 + "b"*100, "", 1)))
For example, it runs this in milliseconds, where the naive filtering solution would take millennia or something.
So, in the homework-assistance kind of way, I can think of two approaches.
Work out all permutations that contain 3 or more consecutive repetitions (which you can do by treating the three-in-a-row as just one pseudo-digit and feeding it to a normal permutation generation algorithm). Make a lookup table of all of these. Now generate all permutations of your original string, and look them up in the lookup table before adding them to the result.
Use a recursive permutation-generating algorithm (select each possibility for the first digit in turn, then recurse to generate permutations of the remaining digits), but in each recursion pass along the last two digits generated so far. Then, in the recursively called function, if the two values passed in are the same, don't allow the first digit to be the same as those.
Why not just make a wrapper around the normal permutation function that skips values that have more than N consecutive repetitions? Something like:
(pseudocode)
function custom_perm(int max_rep)
    do
        p := next_perm()
    while count_max_reps(p) > max_rep
    return p
Krusty, I'm already doing that at the end of the function, but it doesn't solve the problem, because it still needs to generate all permutations and check each one.
consecutive := 1;
IsValid := True;
for n := 0 to len - 2 do
begin
  if anyVector[n] = anyVector[n + 1] then
    consecutive := consecutive + 1
  else
    consecutive := 1;
  if consecutive > MaxConsecutiveRepeats then
  begin
    IsValid := False;
    Break;
  end;
end;
Since I start with the first permutation in lexicographic order, this way ends up generating a lot of unnecessary perms.
This is easy to make, but rather hard to make efficient.
If you need to build a single piece of code that only considers valid outputs, and thus doesn't bother walking over the entire combination space, then you're going to have some thinking to do.
On the other hand, if you can live with the code internally producing all combinations, valid or not, then it should be simple.
Make a new enumerator, one which you can call that next_perm method on, and have this internally use the other enumerator, the one that produces every combination.
Then simply make the outer enumerator run in a while loop asking the inner one for more permutations until you find one that is valid, then produce that.
Pseudo-code for this:
generator1:
when called, yield the next combination
generator2:
internally keep a generator1 object
when called, keep asking generator1 for a new combination
check the combination
if valid, then yield it
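The two-generator sketch above might look like this in Python (my own illustration; it is the naive filtering version, so it does walk the entire combination space):

```python
from itertools import groupby, permutations

def max_run(s):
    # length of the longest run of equal consecutive characters
    return max(len(list(g)) for _, g in groupby(s))

def valid_perms(elements, max_rep=2):
    # generator2: keep asking the exhaustive inner generator (generator1,
    # here itertools.permutations) for combinations, yield only valid ones.
    seen = set()
    for p in permutations(elements):
        if p not in seen:
            seen.add(p)
            s = "".join(p)
            if max_run(s) <= max_rep:
                yield s

for s in valid_perms("6676", 2):
    print(s)
```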