which graphDB will perform best for this specific type of graph-related query? neo4J? flockdb? - database

I have a large, non-cyclic directed graph.
Every node has some key/value pairs. Some of the keys can be searched by range.
Let's say all nodes have these keys:
color - red, blue, green, etc.
price - an integer
size - an integer
I want to select a list of nodes from my graph such that:
all nodes have color = red
all nodes have price >= 10 and <= 20
the list is ordered by increasing size
any node that meets the criteria for color and price and has no outlinks (no edges go from that node) is in the list
no two nodes in the list share an edge
Ideally, the list would have the maximum number of nodes possibly to satisfy all those constraints.
I need to be able to search this graph very quickly.
What kind of data store (graph or otherwise) is best suited for this problem? Any hints on how to implement schema and query to get the best performance?

Do you have any size estimations on this? That would give some more hints on how you can do the computations - in memory or index or lazy graph loading. /peter neubauer

Related

OrientDB SQL - how to traverse path based on weight?

Let's say we have a simple graph stored in OrientDB, with an edge type called weightedEdge that has a property weight. I'd like to be able to traverse from a starting node to an arbitrary depth, but only traverse to nodes that have the maximum value of weight on their incoming edge compared to all other edges at the same depth. Is this possible using OrientDB SQL?
So in the given example above, I would only want to hop along the red arrows.
Thank you!
May be you can use a let block and then property selection via [ ]:
traverse out()[weight = $maxW] ...
LET $maxW = SELECT max(weight) FROM ...

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gama, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Frankling Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).
Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.
Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory and there literally is nothing in comparison on speed with these (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases supoort Bitmap Indexing and there are several possible optimizations for it as well(by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance
Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from this paper by Ron Rivest from 19761. This2 is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A,B,C,D,E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find {A,B,C] I find the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of BC, since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. Problem is here that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).

Performing aggregrate queries using Gremlin / TitanDB

I have a Titan graph database witha set of vertices connected by an edge with a property named "property1".
Is it possible to write a Gremlin (or anything else Titan would support) query to:
Find all edges that have a value for "property1" that is seen 5 or less times.
In SQL I would use "Group By", in MongoDB I would use one of the aggregate functions.
I am thinking this may be a job for Furnace/Faunus?
You can do this by iterating all edges and using groupBy. Here's an example with the toy graph using weight in place of property1:
gremlin> g = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> g.E.groupBy{it.weight}{it}.cap.next()
==>0.5=[e[7][1-knows->2]]
==>1.0=[e[8][1-knows->4], e[10][4-created->5]]
==>0.4=[e[11][4-created->3], e[9][1-created->3]]
==>0.2=[e[12][6-created->3]]
So that groups all edges by their weight. From there you can drop down to standard groovy functions like findAll to filter out what you don't want (here i filter out weights that have >1 edge in them...in your case it would be <5).
gremlin> g.E.groupBy{it.weight}{it}.cap.next().findAll{k,v->v.size()>1}
==>1.0=[e[8][1-knows->4], e[10][4-created->5]]
==>0.4=[e[11][4-created->3], e[9][1-created->3]]
Obviously this is a bit of an expensive operation on a really large graph as you have a lot of iteration to do over edges and you have to build up a Map in memory which could be big depending on the diversity of the values in property1. If you can find ways to limit edge iteration with other filters, that might be helpful.
This would be a good job for Faunus if you had a really large graph. I'll go with the easy answer here and simply say that you don't necessarily want the specific edges with a property1 value occurring less than 5 times and that you just want to know how many times different property1 values occur. With Faunus you could get a distribution like that with:
g.E.property1.groupCount()

Need help solving a problem using graphs in C

i'm coding a c project for an algorithm class and i really need some help!
Here's the problem:
I have a set of names like this one N = (James,John,Robert,Mary,Patricia,Linda Barbara) wich are stored in an RB tree.
Starting from this set of names a series of couple like those ones are formed:
(James,Mary)
(James,Patricia)
(John,Linda)
(John,Barbara)
(Robert,Linda)
(Robert,Barbara)
Now i need to merge the elements in a way that i can form n subgroups with the constraint that each pairing is respected and the group has the smallest possible cardinality.
With the couples in the example they will form two groups
(James,Mary,Patricia) and (John,Robert,Barbara,Linda).
The task is to return the maximum number of groups formed and the number of males and females in the group with the maximum cardinality.
In this case it would be 2 2 2
I was thinking about building a graph where every name is represented by a vertex and two vertex are in an edge only if they are paired.
I can then use an algorithm (like Kruskal) to find the Minimum spanning tree.Is that right?
The problem is that the graph would not be completely connected.
I also need to find a way to map the names to the edges of the Graph and vice-versa.
Can the edges be indexed by a string?
Every help is really appreciated :)
Thanks in advice!
You don't need to find the minimum spanning tree. That is really for finding the "best" edges in a graph that will still keep the graph connected. In other words, you don't care how John and Robert are connected, just that they are.
You say that the problem is that the graph would not be completely connected, but I think that is actually the point. If you represent graph edges by using the couples as you suggest, then the vertices that are connected form the groups that you are looking for.
In your example, James is connected to Mary and also James is connected to Patricia. No other person connects to any of those three vertices (if they did, you would have another couple that included them), which is why they form a single group of (James, Mary, Patricia). Similarly all of John, Robert, Barbara, and Linda are connected to each other.
Your task is really to form the graph and find all of the connected subgraphs that are disjoint from each other.
While not a full algorithm, I hope that helps get you started.
I think that you can easily solve this with a dfs and connected components. Because every person(node) has a relation with an other one (edge). So you have an outer loop and run an explore function for every node which is unvisited and add the same number for every node explored by the explore function.
e.g
dfs() {
int group 0;
for(int i=0;i<num_nodes;i++) {
if(nodes[i].visited==false){
explore(nodes[i],group);
group++;
}
}
then you simple have to sort the node by the group and then you are ready. if you want to track the path you can use a pre number which indicates which node was explored first, second..etc
(sorry for my bad english)!
The sets of names and pairs of names already form a graph. A data structure with nodes and pointers to other nodes is just another representation, one that you don't necessarily need. Disjoint sets are easier to implement IMO, and their purpose in life is exactly to keep track of sameness as pairs of things are joined together.

Datamodel for Property Graph over HBase/Cassandra

I'm willing to store a Property Graph into HBase. A Property Graph is a graph nodes and edges have properties and multiple edges can link the same tuple of nodes as long as the edges belong to different types.
My query pattern will be either asking for properties and neighborhood or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which will give me all the female people that claudio likes.
I know that a graph database does just this, but they usually don't scale on multiple nodes in case of a huge dataset. So I'm willing to implement this on a NoSQL ColumnStore (HBase, Cassandra...)
My datamodel follows.
Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...
This table allows me to fetch quickly the properties of a vertex and
its adjacency list. I can't use the vertexid as the other endpoint
because multiple edges (with different types) can connect the same two
vertices.
Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (i.e. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...
This table allows me to fetch quickly the properties of an edge.
Edges Types:
key: composite(<source vertexid>, "out|in", <edge typename>) (i.e.
vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null,...
This table allows me to search/scan for edges that are either incoming
or outgoing from a vertex and belong to specific type and would be the
core of the traversing ability of the API (so i want it to be as fast as
possible both in terms of network I/O (RPCs), disk I/O (seek)). It
should also "scale" on the size of the graph, meaning that with the
growth of the graph the cost of this type of operation should depend on
the number of edges outgoing from the vertex and not on the total number
of vertices and edges.
The example above i'd be considering vertexid1 the source vertex with
property name:claudio i'd scan vertexid1_out_knows and receive the list of
vertices connected. After that i can scan on the column
"Properties:gender" on these vertices and look for those having the
"female" value.
Questions:
1) General: do you see a better data model for my operations?
2) Can i fit everything in one table where for certain keys some
families would be empty (i.e. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because as you can see all the keys
are composed by the vertexid uuid prefix, so they would be very compact
and fit mostly on the same regionserver.
3) I guess that for the scanning I'd make extensive use of Filters. I
guess regexp Filter will be my friend. Do you have concerns about
performance of filters applied to this data model?
This type of model looks like a sensible starting point for Cassandra (don't know much about HBase) - but for any distributed store you will run up against problems when traversing, because traversals will cross multiple nodes.
This is why dedicated graph databases such as Neo4J use a single-node design, and try to keep all data in RAM.
Looking up properties of particular nodes or edges should work well and scale horizontally - Twitter's FlockDB (now apparently abandoned) was a notable example of this.
You also need to consider whether you need lookups other than by ID (i.e. do you need any indexes)?

Resources