Is there a C function for regex using a deterministic automaton?

The POSIX regex functions compile regular expressions into non-deterministic finite automata (NFAs). One problem with that is that there is no way to tell at compile time whether those automata will use excessive stack space or take excessive CPU time. That makes them (in some sense) unsuitable for use in a real-time system.
An equivalent deterministic finite automaton (DFA) executes in linear time. Its disadvantage is that it may use an excessive number of states, which translates into a large amount of program memory. On the plus side, though, you know the number of states at the time you compile the regular expression.
That means you can know, at regular-expression compile time, whether it is suitable for your application. Which brings me to my question: Is there a regular expression library for C that compiles to a DFA? A related question whose answer might be just as useful: Is there a regular expression library for C that gives useful information on memory and CPU utilization?
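To illustrate what "compiles to a DFA" buys you, here is a hypothetical, hand-compiled sketch (not an existing library) of a table/switch-driven DFA for the expression a*b*. The state count and the per-character cost are fixed once the expression is compiled, which is exactly the kind of guarantee a real-time system needs:

```c
#include <assert.h>

/* Hypothetical sketch: a hand-compiled DFA for the expression a*b*.
   Three states, one transition per input character; both the memory
   footprint and the linear run time are known before matching starts. */
enum { A_RUN = 0, B_RUN = 1, DEAD = 2 };

static int next_state(int state, char c)
{
    switch (state) {
    case A_RUN: return c == 'a' ? A_RUN : (c == 'b' ? B_RUN : DEAD);
    case B_RUN: return c == 'b' ? B_RUN : DEAD;
    default:    return DEAD;
    }
}

/* Returns 1 iff the whole string matches a*b*. */
static int dfa_match(const char *s)
{
    int state = A_RUN;
    for (; *s; ++s)
        state = next_state(state, *s);
    return state != DEAD;   /* A_RUN and B_RUN are both accepting */
}
```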
Ken

1. Yes. 2. It's a matter of simple algebra. 3. Here:
https://github.com/RockBrentwood/RegEx
(originally in the comp.compilers archive.)
Here is an early description on comp.compilers, from which this ultimately descended:
https://compilers.iecc.com/comparch/article/93-05-083
and a later description:
https://compilers.iecc.com/comparch/article/93-10-022
The older version of the RegEx C programs on GitHub may be found in the AI repository at Carnegie Mellon University here
https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/regex
I will try to retcon the 1993-2021 evolution-stream of it into the current GitHub snapshot so that you can have the whole history, rather than just the snapshot of the latest versions. (It would be nice if GitHub supported retconning and retrofitting history-streams, by the way.)
An automaton is little more than the graphic display of a finite right linear system of inequations. Every rational expression is the least fixed point solution to such a system, which can be unfolded from the expression purely by algebraic means.
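A small worked example (mine, for illustration): over $X = {a, b}$ the expression $(ab)^*$ is the least solution for $X_0$ of the two-variable right linear system

$X_0 ≥ 1 + a X_1$, $X_1 ≥ b X_0$

and the automaton is just the picture of this system: one state per variable $X_0, X_1$, one labelled transition per coefficient, with $X_0$ accepting on account of the $1$ term.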
This is a general result of Kleene algebra, so it goes well beyond just regular expressions; e.g. rational subsets of any monoid; a special case being rational subsets of product monoids, which includes rational transductions as a special case. And the algebraic method used in the C routines is mostly (but not entirely) generic to Kleene algebras.
I'm trying to adapt the calculation in {nfa,dfa}.c to handle both inputs and outputs. There are a few places where it makes specific assumptions that the Kleene algebra is the free Kleene algebra (= the regular expression algebra). Those have to be modified to allow it to be generalized to non-free Kleene algebras, like the rational transductions.
Regular expressions over an alphabet $X$ comprise the Kleene algebra $ℜX^*$ of the rational subsets of the free monoid $X^*$ generated by $X$. Correspondingly, $ℜX^*$ is the free Kleene algebra generated by $X$.
The underlying theory (the one with respect to which "free" is meant) can be 1st-order or 2nd-order.
The 1st-order theory (notwithstanding Conway's "no finite axiomatization" result, mis-stated and mis-applied as a folklore theorem) is a finitely axiomatized algebra consisting of (a) the axioms for a semiring, with an idempotent sum $x + x = x$ (usually denoted $x | x$), i.e. a "dioid"; (b) the corresponding partial ordering relation defined by $x ≥ y ⇔ (∃z) x = z + y ⇔ x = x + y$; (c) the Kleene star operator $x ↦ x^*$, which (d) provides least fixed point solutions $b^* a c^* = μx (x ≥ a + bx + xc)$. (A set of axioms embodying (d) is $x^* = 1 + x x^*$ and $x ≥ a + bx + xc ⇒ x ≥ b^* a c^*$.) This axiomatization dates from the mid-1990s and is due to Kozen.
The algebra presented by the 1st order theory is not closed under congruence relations (because, in fact, all computations can be represented by a Kleene algebra taken over a suitably defined non-trivial monoid; so the word problem isn't solvable either). The 2nd order formulation - which predates the 1st order formulation - is the closure of the 1st order formulation under congruence. It has (a) the axioms of a dioid and (b) the least fixed points of all rational subsets and (c) distributivity with respect to the rational least fixed point. The last two axioms can be narrowed and combined into a single axiom for the least fixed point: $μ_{n≥0}(ab^nc) = ab^*c$.
Using the terminology in LNCS 4988 (https://link.springer.com/chapter/10.1007%2F978-3-540-78913-0_14), this comprises the category of "rational dioids" = rationally closed idempotent semirings. It has a tensor product ⊗, which was part of the suite of additional infrastructure and expanded terminology laid out in LNCS11194 (pp. 21-36, 37-52) https://dblp.org/db/conf/RelMiCS/ramics2018.html.
The software requires and uses only the 1st order formulation.
Rational transductions over an input alphabet $X$ and output alphabet $Y$, similarly, comprise the rational subsets $ℜ(X^* × Y^*)$ of the product monoid $X^* × Y^*$; and in the rational-dioid category, the rational transduction algebra is the tensor product $ℜ(X^* × Y^*) = ℜX^* ⊗ ℜY^*$ of the respective regular expression algebras.
In turn, that algebra is effectively just the algebra of regular expressions over the disjoint union of $X$ and $Y$, modulo the commutativity rule $xy = yx$ for $x ∈ X, y ∈ Y$; so the process can be adapted and generalized to:
(a) "transducers" - where both X and Y are present,
(b) "generators", where only $Y$ is present and $X = {1}$,
(c) "recognizers", where only $X$ is present and $Y = {1}$ and even
(d) generalizations of these where multiple input and/or output channels are allowed.
Example: the Kleene algebra $ℜX_0^* ⊗ ℜX_1^* ⊗ ℜY_0^* ⊗ ℜY_1^*$ would be for transducers with two input channels (one with alphabet $X_0$ the other with alphabet $X_1$) and two output channels (with respective alphabets $Y_0$ and $Y_1$).


Order of transformations confusion in muPDF Library API documentation

From the basics of the C language we know that in the following code:
y = fn3(fn2(fn1(x)));
...fn1() is executed first, fn2() second, and fn3() last.
What order of matrix transformations is built by the following C code?
ctm = fz_pre_translate(fz_pre_rotate(fz_scale(sx, sy), r), tx, ty);
Case A or Case B?
The documentation of the muPDF Library API is available at this link and it states the following on Page 61:
Alternatively, operations can be specifically applied to existing
matrices. Because of the non-commutative nature of matrix operations,
it matters whether the new operation is applied before or after the
existing matrix. For example, if you have a matrix that performs a
rotation, and you wish to combine that with a translation, you must
decide whether you want the translation to occur before the rotation
(‘pre’) or afterwards (‘post’).
MuPDF has various API functions for such operations:
To me the statement above suggests that the order of transformations, being built by these functions, is not the same as the order of nested function evaluations in C (and their invocations) ...but I just can't be sure.
In mathematical terms, case A, which is a translation, followed by a rotation, followed by scaling could be expressed as
x' = S · (R · (T · x)) = S · R · T · x = ((S · R) · T) · x
So we want the translation to be applied to the point before the other transformations, and the scaling only after them.
M = S · R · T = M_(S,R) · T = S · M_(R,T), where M_(S,R) = S · R and M_(R,T) = R · T
I'd say that a richer API, like the one exposed by this library, lets the user choose between different ways of expressing their intent, but it doesn't (of course) violate the rules of C nested function calls.
There could be cases in which particular algorithms may benefit from one approach or the other.
ctm = fz_pre_translate(fz_pre_rotate(fz_scale(sx, sy), r), tx, ty);
Note that, despite the order in which the nested functions are called, this line can be read exactly as the first statement of this answer (first translation, then rotation, then scale), while the mathematical notation is basically backwards.
ctm = fz_post_scale(fz_post_rotate(fz_translate(tx, ty), r), sx, sy);
Whether this generates confusion or not in the reader, I'm afraid is a matter of opinion and personal backgrounds.
To my knowledge, though, having as small a public API as possible is considered less error-prone and easier to maintain.

When to use which base of log for tf-idf?

I'm working on a simple search engine where I use the TF-IDF formula to score how important a search word is. I see people using different bases for the formula, but I see no explanation for when to use which. Does it matter at all, and do you have any recommendations?
My current implementation uses the regular log() function from math.h, which is the natural logarithm.
TF-IDF literature usually uses base 2, although common implementations, such as sklearn, use the natural logarithm. Just keep in mind that the lower the base, the bigger the score, which can affect truncation of the search result set by score.
Note that from a mathematical point of view, the base can always be changed later. It's easy to convert from one base to another, because the following equality holds:
log_a(x)/log_a(y) = log_b(x)/log_b(y)
You can always convert from one base to another. It's actually very easy. Just use this formula:
log_b(x) = log_a(x)/log_a(b)
Bases like 2 and 10 are often preferred by engineers: 2 is good for half-lives and doubling times, and 10 matches our number system. Mathematicians prefer the natural logarithm, because it makes calculus a lot easier: the derivative of the function b^x, where b is a constant, is k*b^x for some constant k. But if b is equal to e (the base of the natural logarithm), then k is 1.
So let's say you want to compute the base-2 logarithm of 5.63 using log(). Just use log(5.63)/log(2).
If you need an arbitrary base, just wrap it in a function. (Note: don't call it logb — math.h already declares a standard function with that name.)
double log_base(double x, double b) {
    return log(x) / log(b);
}

Higher order versions of basic gates Q#

Is there a higher order H-gate in Q# language? For example, if I want to apply Hadamard gate to an array(combined state) of 3 qubits. Is there a way to generate a tensor product version of H-gate or other gates?
One way to think of it is to think of the unitary operator H = |+⟩⟨0| + |−⟩⟨1| and the quantum operation H separately.
Taking this view, the unitary H is how we simulate the effect of applying the operation H on an ideal quantum processor.
The quantum operation ApplyToEach(H, _) is then represented by the unitary operator H ⊗ H ⊗ ⋯ ⊗ H in precisely the same way that H is represented by H.
One consequence of this mental model is that the tensor product is defined between unitary operators and not between quantum operations. Rather, the ideal action of quantum operations acting on distinct qubits is represented by the tensor product of the unitary representations of each individual operation.
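As a plain linear-algebra check of that last point (ordinary C, not Q#): the Kronecker product of two copies of the 2×2 Hadamard matrix is the 4×4 unitary representing H applied to each of two qubits, and its action on |00⟩ (the first column) is the uniform superposition.

```c
#include <assert.h>
#include <math.h>

/* Kronecker (tensor) product of two real 2x2 matrices into a 4x4
   matrix: out[i][j] = a[i/2][j/2] * b[i%2][j%2]. */
static void kron(const double a[2][2], const double b[2][2],
                 double out[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            out[i][j] = a[i / 2][j / 2] * b[i % 2][j % 2];
}
```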
Q# does not allow you to pass more qubits to a gate than its signature allows, so you have to run each of the qubits through the H() gate manually, like this:
let n = Length(qs);
for (index in 0 .. n - 1) {
    H(qs[index]);
}
Or use the convenient standard library operation
ApplyToEach(H, qs);
The basic reason why you can't apply a higher-order H gate is that it would change the gate's signature to take more qubits, which creates complications. Also, you may want to pass only some of the qubits of an array to the gate; in that case you can't pass the entire array either, and have to do it qubit by qubit.

Maximum minimum and mean in Datalog

I cannot figure out how to calculate a mean, a maximum, and a minimum using Datalog, the declarative logic programming language.
E.g., consider this simple schema:
Flows(Stream, River)
Rivers(River, Length)
If I want
a) the mean length of the rivers,
b) the longest river,
c) and the river with the fewest streams,
what are the right Datalog queries?
I have read the Datalog theory, but cannot figure out how these queries, simple in other languages, could be expressed in Datalog, and I haven't found any similar examples.
NOTE
The Datalog that I use has basic arithmetic functions like y is z+1, y is z-1, y is z/1 or y is z*1, and you can use X<Y or X>Y comparisons, and negation; so theoretically it should be possible to express this kind of query in some way, since it has enough expressive power.
Is negation supported or not? If so, we can do max (or min) as follows:
shorterRivers(R1, L1) :- Rivers(R1, L1), Rivers(R2, L2), L1 < L2.
longestRivers(R1, L1) :- Rivers(R1, L1), !shorterRivers(R1,L1).
"mean" will be harder to do, as it require "SUM" and "COUNT" aggregations.
Standard Datalog supports first-order logic only, which does not include aggregate functions. However, some Datalog implementations, such as pyDatalog, support aggregate functions as an extension.

Sorting and making "genes" in output bitstrings from a genetic algorithm

I was wondering if anybody had suggestions as to how I could analyze the output bitstrings being permuted by a genetic algorithm. In particular, it would be nice if I could identify patterns of bits (I'm calling them genes here) that seem to yield a desirable CV score. The difficulty is in examining these datasets, because there are a lot of them (I probably already have something like 30 million bitstrings that are 140 bits long, and I'll probably hit over 100 million pretty quickly), so even after I sort out the desirable data there are still a LOT of potential datasets, and doing similarity comparisons by eye is out of the question. My questions are:
How should I compare for similarity between these bitstrings?
How can I identify "genes" in these bitstrings in an algorithmic (aka programmable) way?
As you want to extract common gene patterns, what about looking at the positions where two strings agree? The bitwise operation for "same bit" is XNOR, i.e. the complement of XOR. So if you have
set1 = 11011101110011...
set2 = 11001100000110...
# apply bitwise '==' (XNOR)
~(set1 ^ set2) = 11101110001010...
The result now shows which genes are the same, and could be used in further analysis.
For the similarity part you need an exclusive-or (XOR). The result of this bitwise operation gives you the difference between two bit strings, and it is probably the most efficient and easiest way of doing it (for pairwise comparison). As an example:
>>> from bitarray import bitarray
>>> a = bitarray('0001100111')
>>> b = bitarray('0100110110')
>>> a ^ b
bitarray('0101010001')
Then you can either count the differences, inspect quickly where the differences lie, etc.
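Counting those differences is the Hamming distance, and on packed 64-bit words (a 140-bit string fits in three of them) it is just XOR plus popcount. A sketch in C:

```c
#include <assert.h>
#include <stdint.h>

/* Hamming distance between two 64-bit chunks of a bitstring:
   XOR the words, then count the set bits.  For 140-bit strings,
   sum this over each 64-bit word of the packed representation. */
static int hamming64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}
```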
For the second part, it depends on the representation, of course, and on the programming language (PL) chosen for the implementation. Most PL libraries have a search function that retrieves all, or at least the first, of the indexes at which some pattern is found in a string (or bitstring, or bitstream...). You just have to refer to the documentation of your chosen PL to learn more about the performance if you have more than one option for the task.
