Maximum, minimum, and mean in Datalog - database

I cannot figure out how to calculate a mean, maximum, and minimum using the Datalog declarative logic programming language.
E.g., consider this simple schema:
Flows(Stream, River)
Rivers(River, Length)
If I want
a) the mean length of the rivers,
b) the longest river,
c) and the river with the fewest streams,
what are the right Datalog queries?
I have read the Datalog theory, but I cannot figure out how these queries, which are simple in other languages, could be written in Datalog, and I have not found any similar example.
NOTE
The Datalog that I use has basic arithmetic functions like y is z+1, y is z-1, y is z/1 or y is z*1, you can use X<Y or Y>X comparisons, and negation is available, so it should theoretically be possible to express this kind of query in some way, since it has enough expressive power.

Is negation supported or not? If so, we can do max (or min) as follows:
shorterRivers(R1, L1) :- Rivers(R1, L1), Rivers(R2, L2), L1 < L2.
longestRivers(R1, L1) :- Rivers(R1, L1), !shorterRivers(R1,L1).
"mean" will be harder to do, as it require "SUM" and "COUNT" aggregations.

Standard Datalog supports first-order logic only, which does not include aggregate functions. However, some Datalog implementations, such as pyDatalog, support aggregate functions as an extension.

Related

is there a C function for regex using a deterministic automaton?

The POSIX regex functions compile regular expressions into non-deterministic finite automata (NFAs). One problem with that is that there is no way to tell at compile time whether those automata will use excessive stack space or take excessive CPU time. That makes them (in some sense) unsuitable for use in a real-time system.
An equivalent deterministic finite automaton (DFA) executes in linear time. Its disadvantage is that it may use an excessive number of states, which translates to a large amount of program memory. On the plus side, though, you know the number of states at the time you compile the regular expression.
That means you can know, at regular expression compile time, whether it is suitable for your application. That brings me to my question: is there a regular expression library for C that compiles to a DFA? An answer to a related question might be just as useful: is there a regular expression library for C that gives useful information on memory and CPU utilization?
Ken
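To make the premise concrete, here is a minimal, hand-written, table-driven DFA matcher in C for the toy expression (ab)*. The automaton, state names, and alphabet are my own illustration, not taken from any existing library; it just shows the two properties mentioned above: the transition table's size is known as soon as the expression is compiled, and matching does exactly one table lookup per input character.

#include <stdio.h>

/* Toy DFA for the regular expression (ab)* over the alphabet {a, b}.
   States: 0 = start/accepting, 1 = just saw 'a', 2 = dead state.
   The table size (3 states x 2 symbols) is fixed at "compile" time;
   matching is a single linear scan over the input. */
enum { N_STATES = 3, N_SYMBOLS = 2, DEAD = 2 };

static const int delta[N_STATES][N_SYMBOLS] = {
    /*         'a'    'b'  */
    /* 0 */ {   1,   DEAD },
    /* 1 */ { DEAD,    0  },
    /* 2 */ { DEAD,  DEAD }
};
static const int accepting[N_STATES] = { 1, 0, 0 };

static int matches(const char *s)
{
    int state = 0;
    for (; *s; ++s) {                    /* one table lookup per character */
        int sym = (*s == 'a') ? 0 : (*s == 'b') ? 1 : -1;
        if (sym < 0) return 0;           /* symbol outside the alphabet */
        state = delta[state][sym];
    }
    return accepting[state];
}

int main(void)
{
    printf("%d %d %d\n", matches("abab"), matches("aba"), matches(""));  /* 1 0 1 */
    return 0;
}

A library that compiles down to this representation can report the table size (here sizeof(delta)) as soon as the expression is compiled, which is exactly the memory-predictability property asked about.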
1. Yes. 2. It's a matter of simple algebra. 3. Here:
https://github.com/RockBrentwood/RegEx
(originally in the comp.compilers archive.)
Here is an early description on comp.compilers, from which this ultimately descended:
https://compilers.iecc.com/comparch/article/93-05-083
and another, later description:
https://compilers.iecc.com/comparch/article/93-10-022
The older version of the RegEx C programs on GitHub may be found in the AI repository at Carnegie Mellon University here
https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/regex
I will try to retcon the 1993-2021 evolution-stream of it into the current GitHub snapshot so that you can have the whole history, rather than just the snapshot of the latest versions. (It would be nice if GitHub supported retconning and retrofitting history-streams, by the way.)
An automaton is little more than the graphic display of a finite right linear system of inequations. Every rational expression is the least fixed point solution to such a system, which can be unfolded from the expression purely by algebraic means.
This is a general result of Kleene algebra, so it goes well beyond just regular expressions; e.g. rational subsets of any monoid; a special case being rational subsets of product monoids, which includes rational transductions as a special case. And the algebraic method used in the C routines is mostly (but not entirely) generic to Kleene algebras.
I'm trying to adapt the calculation in {nfa,dfa}.c to handle both inputs and outputs. There are a few places where it makes the specific assumption that the Kleene algebra is the free Kleene algebra (= the regular expression algebra), and that has to be modified to allow it to be generalized to non-free Kleene algebras, like the rational transductions.
Regular expressions over an alphabet $X$ comprise the Kleene algebra $ℜX^*$ of the rational subsets of the free monoid $X^*$ generated by $X$. Correspondingly, $ℜX^*$ is the free Kleene algebra generated by $X$.
The underlying theory (the one with respect to which "free" is meant) can be first-order or second-order.
The first-order theory (notwithstanding Conway's "no finite axiomatization" result, often mis-stated and mis-applied as a folklore theorem) is a finitely axiomatized algebra consisting of (a) the axioms for a semiring with an idempotent sum $x + x = x$ (usually denoted $x | x$), i.e. a "dioid"; (b) the corresponding partial ordering relation defined by $x ≥ y ⇔ (∃z) x = z + y ⇔ x = x + y$; (c) the Kleene star operator $x ↦ x^*$, which (d) provides least fixed point solutions $b^* a c^* = μx (x ≥ a + bx + xc)$. (A set of axioms to embody (d) is $x^* = 1 + x x^*$ and $x ≥ a + bx + xc ⇒ x ≥ b^* a c^*$.) This formulation dates from the mid-1990s and is due to Kozen.
The algebra presented by the 1st order theory is not closed under congruence relations (because, in fact, all computations can be represented by a Kleene algebra taken over a suitably defined non-trivial monoid; so the word problem isn't solvable either). The 2nd order formulation - which predates the 1st order formulation - is the closure of the 1st order formulation under congruence. It has (a) the axioms of a dioid and (b) the least fixed points of all rational subsets and (c) distributivity with respect to the rational least fixed point. The last two axioms can be narrowed and combined into a single axiom for the least fixed point: $μ_{n≥0}(ab^nc) = ab^*c$.
Using the terminology in LNCS 4988 (https://link.springer.com/chapter/10.1007%2F978-3-540-78913-0_14), this comprises the category of "rational dioids" = rationally closed idempotent semirings. It has a tensor product ⊗, which was part of the suite of additional infrastructure and expanded terminology laid out in LNCS11194 (pp. 21-36, 37-52) https://dblp.org/db/conf/RelMiCS/ramics2018.html.
The software requires and uses only the 1st order formulation.
Rational transductions over an input alphabet $X$ and output alphabet $Y$, similarly, comprise the rational subsets $ℜ(X^* × Y^*)$ of the product monoid $X^* × Y^*$; and in the rational-dioid category, the rational transduction algebra is the tensor product $ℜ(X^* × Y^*) = ℜX^* ⊗ ℜY^*$ of the respective regular expression algebras.
In turn, that algebra is effectively just the algebra of regular expressions over the disjoint union of $X$ and $Y$, modulo the commutativity rule $xy = yx$ for $x ∈ X, y ∈ Y$, so the process can be adapted and generalized to:
(a) "transducers", where both $X$ and $Y$ are present,
(b) "generators", where only $Y$ is present and $X = \{1\}$,
(c) "recognizers", where only $X$ is present and $Y = \{1\}$, and even
(d) generalizations of these where multiple input and/or output channels are allowed.
Example: the Kleene algebra $ℜX_0^* ⊗ ℜX_1^* ⊗ ℜY_0^* ⊗ ℜY_1^*$ would be for transducers with two input channels (one with alphabet $X_0$ the other with alphabet $X_1$) and two output channels (with respective alphabets $Y_0$ and $Y_1$).

When to use which base of log for tf-idf?

I'm working on a simple search engine where I use the TF-IDF formula to score how important a search word is. I see people using different bases for the formula, but I see no explanation for when to use which. Does it matter at all, and do you have any recommendations?
My current implementation uses the regular log() function of the math.h library
TF-IDF literature usually uses base 2, although a common implementation, sklearn, uses natural logarithms, for example. Just take into account that the lower the base, the bigger the score, which can affect truncation of the search result set by score.
Note that from a mathematical point of view, the base can always be changed later. It's easy to convert from one base to another, because the following equality holds:
log_a(x)/log_a(y) = log_b(x)/log_b(y)
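For completeness, here is a one-line derivation of that equality (written in LaTeX notation), starting from $x = b^{\log_b(x)}$:

$$\log_a(x) = \log_a\!\left(b^{\log_b(x)}\right) = \log_b(x)\,\log_a(b)
\;\Longrightarrow\;
\frac{\log_a(x)}{\log_a(y)} = \frac{\log_b(x)\,\log_a(b)}{\log_b(y)\,\log_a(b)} = \frac{\log_b(x)}{\log_b(y)}.$$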
You can always convert from one base to another. It's actually very easy. Just use this formula:
log_b(x) = log_a(x)/log_a(b)
Often bases like 2 and 10 are preferred among engineers: 2 is good for half-lives and doublings, and 10 is our number system. Math people prefer the natural logarithm, because it makes calculus a lot easier: the derivative of the function b^x, where b is a constant, is k*b^x, and if b is equal to e (the base of the natural logarithm), then k is 1.
So let's say you want the base-2 logarithm of 5.63 using log(). Just use log(5.63)/log(2).
If you have the need for it, just use this function for an arbitrary base (renamed log_base here, since <math.h> already declares a standard function called logb()):
#include <math.h>

/* Logarithm of x in base b, via the change-of-base formula. */
double log_base(double x, double b) {
    return log(x) / log(b);
}
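As a hedged illustration of how the base choice plays out in practice, here is a small sketch using that helper; num_docs, doc_freq and tf are made-up names and values, not taken from any particular library:

#include <math.h>
#include <stdio.h>

/* Same helper as above. */
double log_base(double x, double b) { return log(x) / log(b); }

int main(void)
{
    double num_docs = 1000.0;   /* illustrative corpus size */
    double doc_freq = 25.0;     /* illustrative: documents containing the term */
    double tf = 3.0;            /* illustrative raw term count in one document */

    double idf_base2 = log_base(num_docs / doc_freq, 2.0);
    double idf_natural = log(num_docs / doc_freq);

    /* The two scores differ only by the constant factor 1/ln(2) ~ 1.4427,
       so the ranking of terms is unchanged; only the magnitudes shift. */
    printf("tf-idf (base 2): %f\n", tf * idf_base2);
    printf("tf-idf (base e): %f\n", tf * idf_natural);
    return 0;
}

Since changing the base only rescales every score by the same constant, rankings are unaffected; only absolute thresholds on the score need adjusting.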

Updating values in an array with logical indexing with a non-constant value

A common problem I encounter when I want to write concise/readable code:
I want to update all the values of a vector matching a logical expression with a value that depends on the previous value.
For example, double all even entries:
weights = [10 7 4 8 3];
weights(mod(weights,2)==0) = weights(mod(weights,2)==0) * 2;
% weights = [20 7 8 16 3]
Is it possible to write the second line in a more concise fashion, i.e. avoiding the double use of the logical expression (something like i += 3 for i = i + 3 in other languages)? When I use this kind of vector operation often, in different contexts/variables and with long conditionals, I feel that my code is less concise and readable than it could be.
Thanks!
How about
ind = mod(weights,2)==0;
weights(ind) = weights(ind)*2;
This way you avoid calculating the indices twice and it's easy to read.
Regarding your other comment to Wauzl: such powerful operation capabilities are the Fortran side of things. This is purely MATLAB's design, and it is quickly becoming obsolete. Taking this horribleness further:
for i = 1:length(weights), if mod(weights(i),2) == 0, weights(i) = weights(i)*2; end, end
It is even slightly faster than your two-liner, because there you are still doing the logical indexing twice, once on each side of the assignment. In general, consider switching to Python 3.
Well, after more searching around, I found this link that deals with this issue (I did search before posting, I swear!), and there is interesting further discussion regarding this topic in the links in that thread. So apparently there are issues with ambiguity when introducing such an operator.
Looks like that is the price we have to pay in terms of syntactic limitations for having such powerful matrix operation capabilities.
Thanks a lot anyway, Wauzl!

Graphs for Big O notation

I wonder if there's any tool/website where I can plot some run times as graphs, because asymptotic notation is often not what I want, i.e. I don't want to ignore constants.
For example, suppose I have two running-time expressions, like:
1) n * log n
2) (n * log n * log n) / 5
It's obvious that the first one is asymptotically better, but what I want to see is how they actually perform and at which point one starts to become better than the other.
A graphing tool where I can enter different equations and plot them to see how they vary would be greatly useful for this purpose. In my search I found this site where they have some plots. I am looking for something similar, but I also want to input my own equations to plot, to analyse the performance for various n values.
As soon as you stop "ignoring constants", you're no longer graphing "Big O" notation, but just performing a standard XY plot. As such, any graphing program, even an online graphing calculator, will let you display this; just replace "n" with "X" and you'll get the proper graph.
Would this or this help?
If you use a 3d grapher, you can use the other dimension (say y) as a constant replacement.
This way you would be able to interpret results as:
when y is greater than 5, n*log(n)*log(n)/y is better than n*log(n) starting from n = (actual value)
Also, you can ignore the 3rd dimension. Or use it if you have a complexity depending on 2 variables.
Just input the difference between the complexities. In this case, ignoring the 3rd dimension and taking log(x) = ln(x), the equation is:
z = x*ln(x) - x*ln(x)*ln(x)/5
And you can interpret that as: x*ln(x) is the more efficient of the two when z is negative.
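If you would rather compute the crossover point than read it off a plot, here is a small C sketch; the scan range, the step, and the names f1/f2 are arbitrary choices of mine. It tabulates both expressions and reports where the sign of the difference z flips:

#include <math.h>
#include <stdio.h>

/* f1(n) = n*ln(n), f2(n) = n*ln(n)*ln(n)/5, z = f1 - f2 as in the answer above. */
static double f1(double n) { return n * log(n); }
static double f2(double n) { return n * log(n) * log(n) / 5.0; }

int main(void)
{
    double prev_z = f1(2.0) - f2(2.0);
    for (double n = 3.0; n <= 1000.0; n += 1.0) {    /* arbitrary scan range */
        double z = f1(n) - f2(n);
        if ((prev_z < 0.0) != (z < 0.0))             /* sign change => crossover */
            printf("crossover near n = %.0f  (f1 = %.2f, f2 = %.2f)\n", n, f1(n), f2(n));
        prev_z = z;
    }
    return 0;
}

With natural logarithms the sign flips where ln(n) = 5, i.e. around n = 148, which is what this scan reports.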
If you want to see how they actually perform, then you have to implement the algorithms and execute them on various inputs. Modern processors, with memory locality and cache misses, make it really hard to come up with an equation that gives you a reasonable estimate.
I can guarantee you that you won't measure what you would expect.

Efficient algorithm for searching 3D coordinate in an array

I have a large array (>10^5 entries) of 3D coordinates r = (x, y, z), where x, y and z are floats. What is the most efficient way to search for a given coordinate r' in the array and return the array index? Note that r' may not be given with the same accuracy as r; say, if the array has stored the coordinate (1.5, 0.5, 0.0) and r' is given as (1.49999, 0.49999, 0.0), the algorithm should still pick that coordinate. I am developing the code in C.
How can one use the O(1) search capability of a hash table for this purpose? Converting the coordinate into a string is out of the question due to the accuracy issue. Is there any particular data structure that would help achieve an O(1) algorithm?
Thanks
OnRoadCoder
Check R-trees; they are already implemented in some RDBMSs, like SQLite and (I think) Postgres.
In order to have "fuzzy" searching as you're describing (so you can support slight inaccuracies), you will have to sacrifice on O(1) algorithms.
That being said, there are some very good algorithms for this. Space partitioning (such as using an Octree or KD-Tree) is a common, popular option.
If the range of values is limited, pick the precision you want. Now, the key (1,2,3) will point to a linked list (or a fancier data structure) of all points that are within Manhattan distance 3*d (d = 0.5? - depends on the details) from (1,2,3). You know your application best, so you can do a better job of choosing d. The optimization approach would depend on how the data is distributed.
EDIT:
The weakness here is that if you have many points concentrated within a single cube, then there is little a hash table can do about guaranteeing O(1) ... it becomes more like O(n) :)
Some sort of tree-based data structure can guarantee O(log n).
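Below is a minimal C sketch of the quantize-and-hash idea from this answer, assuming a fixed cell size chosen to match the expected inaccuracy. The constants, the key-mixing hash, and the neighbour probing are illustrative choices of mine, and the bucket storage and collision handling are left out:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define CELL_SIZE 0.001   /* illustrative: pick this to match the expected inaccuracy */
#define TOL       1e-4    /* illustrative match tolerance (compare squared distances to TOL*TOL) */

/* Quantize one coordinate to an integer cell index. */
static int64_t cell(double v) { return (int64_t)floor(v / CELL_SIZE); }

/* Illustrative 3D cell hash (a common multiply-xor mixing scheme). */
static uint64_t cell_hash(int64_t cx, int64_t cy, int64_t cz)
{
    return (uint64_t)cx * 73856093ULL ^ (uint64_t)cy * 19349663ULL ^ (uint64_t)cz * 83492791ULL;
}

/* On lookup, probe the query point's cell and its 26 neighbours, so that
   points sitting just across a cell boundary are still found. */
static void probe_neighbours(double x, double y, double z)
{
    int64_t cx = cell(x), cy = cell(y), cz = cell(z);
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                uint64_t h = cell_hash(cx + dx, cy + dy, cz + dz);
                /* A real implementation would look up bucket h in a table built
                   once over the point array, and compare each candidate against
                   (x, y, z) with a squared-distance test against TOL*TOL. */
                (void)h;
            }
}

int main(void)
{
    /* (1.49999, 0.49999, 0.0) and (1.5, 0.5, 0.0) fall into the same or
       adjacent cells, so the stored point is among the probed candidates. */
    probe_neighbours(1.49999, 0.49999, 0.0);
    printf("probed %d candidate cells\n", 27);
    return 0;
}

Probing the neighbouring cells is what keeps a query such as (1.49999, 0.49999, 0.0) from missing a stored point like (1.5, 0.5, 0.0) that happens to sit just across a cell boundary.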
What you are asking for sounds like nearest neighbour search. One approach might be to code a kd-tree (or any space-partition-based technique) and use that to find the nearest point to your query. But you can also go with a hash-based approach, which basically does what Ipthnc's answer describes, but tries to avoid bad performance for degenerate cases.
