Safe Datalog rules for the query - database

Consider the relations: edge(X,Y), Red(X,Y), Blue(X,Y). These relations are representing a graph whose edges can be colored red or blue(or no color).
provide safe datalog rules(with negation if necessary) for the following queries.
Q1. find the pairs of nodes X and Y where there is path (a sequence of linked edges) from X to Y ?
my attempt:- reachable(X,Y) :- edge(X,Y)
reachable(X,Y) :- edge(X,Z), reachable(Z,Y)
Q2. Find the pairs of nodes X and Y where there is path (a sequence of linked edges) of even length from X to Y with alternating red and blue colors?
my attempt:
I have made odd even datalog program and a red/blue program, but don't know how to combine the both to get even length alternate red/blue node
ODD(x,y):- edge(x,y)
ODD(x,y):- edge(x,z), EVEN(z,y)
EVEN(x,y):- edge(x,z), ODD(z,y).
path(X,Y) :- Red(X,Y)
path(X,Y) :- path(X,Z), Blue(Z,W), path(W,Y)

From your questions I get the impression that you're trying to answer these assignments without testing your potential solutions. That's a very difficult way to work. I would strongly suggest to make little examples to test whether your solution is correct and run it in a Datalog engine (easiest is something online like http://www.iris-reasoner.org/demo or https://developer.logicblox.com/playground/ ).
In this particular case, it is easy to see that if you have an edge("a", b"), edge("b", c") and edge("c", "d"), that your solution will not find a path from "a" to "d".
This query can only be solved with recursion. This path query is pretty basic recursion, search for ancestor or transitive closure.

Related

How do I find vertices without particular edges from a particular vertex in Gremlin?

I am using python to run my gremlin queries, for which I am still getting used to / learning. Let's say I have vertex labels A and B, and there can be an edge from A to B called abEdge. I can find B vertices that have such edges from a particular A vertex with:
g.V()\
.hasLabel("A")\
.has("uid", uid)\ # uid is from a variable
.outE("abEdge")\
.inV()\
.toList()
But how would I go about finding B vertices that have no such edges from a particular A vertex?
My instinct is to try the following:
# Find B edges without an incoming edge from a particular A vertex
gremlin.V()\
.hasLabel("B")\
.where(__.not_(inE("abEdge").from_(
V()\
.hasLabel("A")\
.has("uid", uid)
)))\
.next()
And that results in a bad query (Not sure where yet), I'm still reading up on Gremlin, but I am also under time contraints so here I am asking. If anyone can help but just using groovy syntax that is fine too as it's not too bad to convert between the two.
Any explanations would be helpful too as I am still learning this technology.
Update
I guess if this were a real-life question, and related to this website. We could ask (A=Person, B=Programming-Language, and abEdge=knows):
Which programming languages does particular person A not know?
It looks like you almost had the answer:
g.V().hasLabel("B"). \
filter(__.not_(__.in_('abEdge').has("A","uid",uid))). \
toList()
I prefer filter() when using just a single Traversal argument, but I think you could replace with where() if you liked.

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gama, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Frankling Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).
Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.
Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory and there literally is nothing in comparison on speed with these (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases supoort Bitmap Indexing and there are several possible optimizations for it as well(by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance
Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from this paper by Ron Rivest from 19761. This2 is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A,B,C,D,E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find {A,B,C] I find the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of BC, since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. Problem is here that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).

PROLOG / All paths in a directed graph with loops

I've got the following diagram given:
Diagram here
The first gateway/connector is an OR-gateway/connector (it has a circle in it). The gateway/connector with a 'x' in it is a XOR-gateway/connector.
An OR-gateway specifies that one or more of the available paths will be taken.
An XOR-gateway represents a decision to take exactly one path in the flow.
I need to transform this diagram to PROLOG in order to get all possible paths from node 1 to node 8 but I have problems to code the OR-gateway and to find all possible paths.
How can I transform this diagram easily to Prolog and how can i find all possible paths respecting the gateways between two nodes?
Thank you for answers in advance.
As you should know, a Prolog program is basically a set of rules. From your graph, each node could begin a rule where each directed edge gives an explicit rule. By encoding your graph as a set of rules, a query on what satisfies say, (1, X, 8), would give you every possible path, even infinitely.
Encoding the rules should be easy (basic Prolog). Maybe I'm not understanding the special functions behind the OR and XOR. Please explain more if this isn't as trivial as it seems.

Solving the problem of finding parts which work well with each other

I have a database of items. They are for cars and similar parts (eg cam/pistons) work better than others in different combinations (eg one product will work well with another, while another combination of 2 parts may not).
There are so many possible permutations, what solutions apply to this problem?
So far, I feel that these are possible approaches (Where I have question marks, something tells me these are solutions but I am not 100% confident they are).
Neural networks (?)
Collection-based approach (selection of parts in a collection for cam, and likewise for pistons in another collection, all work well with each other)
Business rules engine (?)
What are good ways to tackle this sort of problem?
Thanks
The answer largely depends on how do you calculate 'works better'?
1) Independent values
Assuming that 'works better' function f of x combination of items x=(a,b,c,d,...) and(!) that there are no regularities that can be used to decide if f(x') is bigger or smaller then f(x) knowing only x, f(x) and x' (which could allow to find the xmax faster) you will have to calculate f for all combinations at least once.
Once you calculate it for all combinations you can sort. If you will need to look up data in a partitioned way, using SQL/RDBMS might be a good approach (for example, finding top 5 best solutions but without such and such part).
For extra points after calculating all of the results and storing them you could analyze them statistically and try to establish patterns
2) Dependent values
If you can establish some regularities (and maybe you can) regarding the values the search for the max value can be simplified and speeded up.
For example if you know that function that you try to maximize is linear combination of all the parameters then you could look into linear programming
If it is not...

KD-Trees and missing values (vector comparison)

I have a system that stores vectors and allows a user to find the n most similar vectors to the user's query vector. That is, a user submits a vector (I call it a query vector) and my system spits out "here are the n most similar vectors." I generate the similar vectors using a KD-Tree and everything works well, but I want to do more. I want to present a list of the n most similar vectors even if the user doesn't submit a complete vector (a vector with missing values). That is, if a user submits a vector with three dimensions, I still want to find the n nearest vectors (stored vectors are of 11 dimensions) I have stored.
I have a couple of obvious solutions, but I'm not sure either one seem very good:
Create multiple KD-Trees each built using the most popular subset of dimensions a user will search for. That is, if a user submits a query vector of thee dimensions, x, y, z, I match that query to my already built KD-Tree which only contains vectors of three dimensions, x, y, z.
Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.
This has to be a common problem, any suggestions? Thanks for the help.
Your first solution might be fastest for queries (since the tree-building doesn't consider splits in directions that you don't care about), but it would definitely use a lot of memory. And if you have to rebuild the trees repeatedly, it could get slow.
The second option looks very slow unless you only have a few points. And if that's the case, you probably didn't need a kd-tree in the first place :)
I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent.
EDIT
Sorry, didn't realize it would be so obvious in the source code. You should use this version of the neighbor function:
nearest(double [] key, int n, Checker<T> checker)
And implement your own Checker class; see their EuclideanDistance.java to see the Euclidean version. You may also need to comment out any KeySizeException that the query code throws, since you know that you can handle differently sized keys.
Your second option looks like a reasonable solution for what you want.
You could also populate the missing dimensions with the most important( or average or whatever you think it should be) values if there are any.
You could try using the existing KD tree -- by taking both branches when the split is for a dimension that is not supplied by the source vector. This should take less time than doing a brute force search, and might be less trouble than trying to maintain a bunch of specialized trees for dimension subsets.
You would need to adapt your N-closest algorithm (without more info I can't advise you on that...), and for distance you would use the sum of the squares of only those elements supplied by the source vector.
Here's what I ended up doing: When a user didn't specify a value (when their query vector lacked a dimension), I I simply adjusted my matching range (in the API) to something huge so that I match any value.

Resources