Methods to avoid a cross-product APOC query (using a hashmap?)

I currently have a Neo4j database with a simple data structure comprising about 400 million (:Node {id: String, refs: List[String]}) nodes, each with two properties: an id, which is a string, and refs, which is a list of strings.
I need to search all of these nodes to identify relationships between them. A directed relationship exists if a node's id appears in the refs list of another node. A simple query that accomplishes what I want (but is far too slow):
MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b) AND a.id IN b.refs
CREATE (b)-[:CITES]->(a)
I can use apoc.periodic.iterate, but the query is still much too slow:
CALL apoc.periodic.iterate(
"MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b)
AND a.id IN b.refs RETURN a, b",
"CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:false,iterateList:true})
Any suggestions as to how I can build this database and its relationships efficiently? I have vague thoughts about creating a hash table as I first add the nodes to the database, but I'm not sure how to implement this, especially in Neo4j.
Thank you.

If you first create an index on :Node(id), like this:
CREATE INDEX ON :Node(id);
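(On Neo4j 4.x and later the equivalent index is created with the newer syntax below; the index name node_id_index is just an illustrative choice.)
CREATE INDEX node_id_index FOR (n:Node) ON (n.id);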
then this query should be able to take advantage of the index to quickly find each a node:
MATCH (b:Node)
UNWIND b.refs AS ref
MATCH (a:Node)
WHERE a.id = ref
CREATE (b)-[:CITES]->(a);
Currently, the Cypher execution planner does not support using the index when directly comparing the values of 2 properties. In the above query, the WHERE clause is comparing a property with a variable, so the index can be used.
The ID(a) < ID(b) test was omitted, since your question did not state that ordering the native node IDs in such a way was required.
[UPDATE 1]
If you want to run the creation step in parallel, try this usage of the APOC procedure apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH (b:Node) UNWIND b.refs AS ref RETURN b, ref",
"MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
The first Cypher statement passed to the procedure just returns each b/ref pair. The second statement (which is run in parallel) uses the index to find the a node and creates the relationship. This division of effort puts the more expensive processing in the statement running in a parallel thread. The iterateList: true option is omitted, since we (probably) want the second statement to run in parallel for each b/ref pair.
[UPDATE 2]
You can encounter deadlock errors if parallel executions try to add relationships to the same nodes (since each parallel transaction will attempt to write-lock every new relationship's end nodes). To avoid deadlocks involving just the b nodes, you can do something like this to ensure that a b node is not processed in parallel:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
However, this approach is still vulnerable to deadlocks if parallel executions can try to write-lock the same a nodes (or if any b nodes can also be used as a nodes). But at least hopefully this addendum will help you to understand the problem.
[UPDATE 3]
Since these deadlocks are race conditions that depend on multiple parallel executions trying to lock the same nodes at the same time, you might be able to work around this issue by retrying the "inner statement" whenever it fails. And you could also try making the batch size smaller, to reduce the probability that multiple parallel retries will overlap in time. Something like this:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize: 1000, parallel: true, retries: 100})

Related

SQL Recursive CTE: stop recursion on all branches when one branch fulfills some condition?

I have a table of parent-child relationships, and for a list of nodes (let's call them "roots"), I want to see which ones have one (or more) parents with some characteristic (let's call it "bad").
I can write a recursive CTE to find all parent nodes, and then compare them against a list of bad nodes. However, this is slow, as a node can have very many parents, and I am not really interested in finding all of them. I only need to know if some parent is bad. I would like to stop the recursion for a particular root when a bad parent is found.
Here is a mock-up of what the query could look like:
with recursive_table (root, level, path, current_node) as (
select root = node, level = 0, path = 'root', current_node = node
from seed_nodes
UNION ALL
select o.root, o.level + 1, CONCAT(e.parent, '/', o.path), e.parent
from structure e
inner join recursive_table o
on o.current_node = e.child
-- it is possible to stop searching this particular branch if we pass a bad node
and o.current_node not in (
select b.node
from bad_node_list b
)
-- it is NOT possible to stop investigating a particular root if any bad node is found
and o.root not in (
select oo.root
from recursive_table oo
join bad_node_list b
on oo.current_node = b.node
where o.root = oo.root
)
)
select *
from seed_nodes a
outer apply (
select top 1 current_node
from recursive_table b
join bad_node_list c
on b.current_node = c.node
where a.node = b.root
) b
As the comments in the code say, it is possible to stop searching a particular branch when a bad node is found. But it is not possible to stop searching a root altogether when the first bad node is found: SQL Server gives the error "Recursive member of a common table expression 'recursive_table' has multiple recursive references."
Is there any way around this? (I know that SQL server tries to disallow some other functions, like aggregate functions, but that it's possible to get around that and still use them.)
Is there any way to access the other branches in a recursive function? No.
Recursion in SQL Server is implemented so that this is not possible. Additionally, recursion actually happens depth-first, and not breadth-first (as the syntax of SQL recursion might have you believe).
It is possible to rewrite this as an actual breadth-first search by using a loop, and to include the condition that stops the search for a root as soon as the first bad node is found.
For my particular set of data, this did not improve performance, and was instead around 50% more time-consuming (15 minutes instead of 10 minutes).
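A rough sketch of that loop-based rewrite is below. It reuses the table and column names from the mock-up above, assumes integer node ids, and (like the mock-up) does not guard against cycles; treat it as a starting point rather than a drop-in solution.
-- Breadth-first search with an explicit loop: each pass expands the frontier one level
-- and drops any root that has already been found to have a bad parent.
create table #frontier      (root int not null, current_node int not null);
create table #next_frontier (root int not null, current_node int not null);
create table #bad_roots     (root int not null);

insert into #frontier (root, current_node)
select node, node from seed_nodes;

while exists (select 1 from #frontier)
begin
    -- record roots whose current frontier contains a bad node
    insert into #bad_roots (root)
    select distinct f.root
    from #frontier f
    join bad_node_list b on f.current_node = b.node;

    -- expand one level upwards, skipping roots that are already decided
    truncate table #next_frontier;
    insert into #next_frontier (root, current_node)
    select f.root, e.parent
    from #frontier f
    join structure e on e.child = f.current_node
    where not exists (select 1 from #bad_roots br where br.root = f.root);

    truncate table #frontier;
    insert into #frontier (root, current_node)
    select root, current_node from #next_frontier;
end

-- roots that have at least one bad parent
select distinct root from #bad_roots;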

MongoDB grab last versions from specified version

I have a set of test results in my mongodb database. Each document in the database contains version information, test data, date, test run information etc...
The version is broken up in the document and stored as individual values. For example: { VER_MAJOR : "0", VER_MINOR : "2", VER_REVISION : "3", VER_PATCH : "20" }
My application wants the ability to specify a specific version and grab the document as well as the previous N documents based on the version.
For example:
If version = 0.2.3.20 and n = 5 then the result would return documents with version 0.2.3.20, 0.2.3.19, 0.2.3.18, 0.2.3.17, 0.2.3.16, 0.2.3.15
The solutions that come to my mind are:
1. Create a new database that contains documents with version information and is kept sorted. This can be used to obtain the previous N versions, which in turn can be used to obtain the corresponding N documents in the test results database.
2. Perform the sorting in the test results database itself, as in option 1. Though if the test results database is large, this will take a very long time. There is also the cost of inserting in order every time.
Creating another database as in option 1 doesn't seem like the right way, but sorting the test results database seems like it will involve a lot of overhead. Am I mistaken to be worried about option 2 producing a lot of overhead? I have the impression I'd have to query the entire database and then sort it on the application side, and querying the entire database seems like overkill...
db.collection_name.find().sort([Parameters for sorting])
You are quite correct that querying and sorting the entire data set would be very excessive. I probably went overboard on this, but I tried to break everything down in detail below.
Terminology
First things first, a couple of terminology nitpicks. I think you're using the term Database when you mean to use the word Collection. Differentiating between these two concepts will help with navigating the documentation and allow for a better understanding of MongoDB.
Collections and Sorting
Second, it is important to understand that documents in a Collection have no inherent ordering. The order in which documents are returned to your app is only applied when retrieving documents from the Collection, such as when specifying .sort() on a query. This means we won't need to copy all of the documents to some other collection; we just need to query the data so that only the desired data is returned in the order we want.
Query
Now to the fun part. The query will look like the following:
db.test_results.find({
"VER_MAJOR" : "0",
"VER_MINOR" : "2",
"VER_REVISION" : "3",
"VER_PATCH" : { "$lte" : 20 }
}).sort({
"VER_PATCH" : -1
}).limit(N)
Our query has a direct match on the three leading version fields to limit results to only those values, i.e. the specific version "0.2.3". A range $lte filter is applied on VER_PATCH since we will want more than a single patch revision. (Note that this assumes VER_PATCH is stored as a number; a range filter against the string "20" shown in the question would compare lexicographically and not order patch versions correctly.)
We then sort results by VER_PATCH to return results descending by the patch version. Finally, the limit operator is used to restrict the number of documents being returned.
Index
We're not done yet! Remember how you said that querying the entire collection and sorting it on the app side felt like overkill? Well, the database would be doing exactly that if an index did not exist for this query.
You should follow the equality-sort-range rule when determining the order of fields in an index. In this case, that gives us the index:
{ "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
Creating this index will allow the query to complete by scanning only the results it would return, while avoiding an in-memory sort. More information can be found here.

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong, or is this perhaps a bug?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondly, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things on the first iteration, and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the traversal inside local() to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
           __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
OK, success: that's what we wanted, no lambdas and local counts. But it still left me feeling like: "Do we really need the match step?" That's when Mr. Kuppitz closed in on the final answer, which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.

Why is this query increasing count by more than expected? (cypher/neo4j)?

I have term nodes connected to content nodes, and a query that is meant to update a content node's connections to these term nodes.
First I decrease the count of connected content nodes for each term node originally attached, then delete the relationships.
After that I create a new relationship to all the specified term nodes, attempting to increase the count of connected content nodes for each newly connected term node by one.
The problem is, after the query runs, the count of connected content nodes is not increased by one, but rather increased by what looks like the total number of new term nodes being connected.
It seems I'm still having trouble grasping exactly how the data is being handled behind the query. I suspect the answer may have to do with taking a count of the connected nodes, as has been the case previously when I've gotten stuck.
Here is the query:
var query = [
"MATCH (contentNode:content {UUID: {contentID} })-[r:TAGGED_WITH]->(oldTermNode:term) ",
"SET oldTermNode.contentConnections = oldTermNode.contentConnections - 1 ",
"DELETE r ",
"WITH contentNode ",
"MATCH (newTermNode:term) ",
"WHERE newTermNode.UUID IN {termIDs} ",
"CREATE UNIQUE contentNode-[:TAGGED_WITH]->newTermNode ",
"SET newTermNode.contentConnections = newTermNode.contentConnections + 1 ",
].join('\n');
As a side question, when updating the terms, often many of the new terms are the same as the old terms (the user only adds/removes one or two terms, leaving the rest the same). Would it make more sense/have faster performance if only the relationships that wouldn't be reconnected were deleted and then only the new terms added?
Thanks a lot.
This does not answer your question directly, but I wonder if your terms actually need to have a contentConnections property at all. If not, then your original question becomes moot.
Based just on the info from your question, it looks like the term.contentConnections value is just a count of the number of times that the term is pointed to by a :TAGGED_WITH relationship. If that is the case, then you should be able to get an equivalent count with something like the following:
MATCH ()-[:TAGGED_WITH]->(t:term {UUID:{termId}}) RETURN count(t);
This query would be really fast if you create an index (or, probably even better, a uniqueness constraint) for the UUID property of term nodes. If this works for you, then you can simplify and speed up your other queries, since there would be no need to maintain the contentConnections value.
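The uniqueness constraint can be created like this (using the older constraint syntax that matches the Cypher in the question):
CREATE CONSTRAINT ON (t:term) ASSERT t.UUID IS UNIQUE;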
For example, your original query could be simplified to:
var query = [
"MATCH (contentNode:content {UUID: {contentID} })-[r:TAGGED_WITH]->(oldTermNode:term) ",
"DELETE r ",
"WITH contentNode ",
"MATCH (newTermNode:term) ",
"WHERE newTermNode.UUID IN {termIDs} ",
"CREATE UNIQUE contentNode-[:TAGGED_WITH]->newTermNode ",
].join('\n');
I've revised your query to work as you've described it should. What I've done is collect your terms into distinct collections and iterate through each node to increment and decrement their connection counts; otherwise the SET clauses run once per result row, so a term that appears on several rows gets adjusted more than once. This should work in theory, but I would advise taking other precautions to maintain the consistency of the relationship counts on your term nodes.
I'm assuming, though, that each term could have an unbounded number of connections, and that it would be computationally expensive to poll through each of your terms, count the connections, and then set that as a weight on the node.
MATCH (contentNode:content {UUID: "1234" })-[r:TAGGED_WITH]->(oldTermNode:term)
WITH contentNode, collect(r) as oldRels, collect(DISTINCT oldTermNode) as oldTermNodes
FOREACH (oldTermNode in oldTermNodes |
SET oldTermNode.contentConnections = oldTermNode.contentConnections - 1)
FOREACH (r in oldRels | DELETE r)
WITH contentNode
MATCH (newTermNode:term)
WHERE newTermNode.UUID IN ["1112", "1113"]
CREATE UNIQUE (contentNode)-[:TAGGED_WITH]->(newTermNode)
WITH collect(DISTINCT newTermNode) as newTermNodes
FOREACH (newTermNode in newTermNodes |
SET newTermNode.contentConnections = newTermNode.contentConnections + 1)
You'll need to reinsert your parameters; I constructed this code example as an actual test to make sure it worked.
As a side question, when updating the terms, often many of the new terms are the same as the old terms (the user only adds/removes one or two terms, leaving the rest the same). Would it make more sense/have faster performance if only the relationships that wouldn't be reconnected were deleted and then only the new terms added?
You could revise the query by specifying that you only want oldTermNodes that are not in the newTermNode collection. So yes, to answer your question, this would prevent unnecessary writes, which would improve performance. You'll just need to make sure that you remove from your newTermNodes collection any of the redundant terms, so that contentConnections is not incremented for those terms in the last line of the script.
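Here is a rough sketch of that revision, reusing the parameter placeholders from the question. It has not been tested against your data model, so treat it as a starting point; the FOREACH/CASE construct is just a common trick for conditionally updating optionally matched rows.
// Drop (and decrement) only the terms that are no longer wanted.
MATCH (contentNode:content {UUID: {contentID} })
OPTIONAL MATCH (contentNode)-[r:TAGGED_WITH]->(oldTermNode:term)
WHERE NOT oldTermNode.UUID IN {termIDs}
FOREACH (rel IN CASE WHEN r IS NULL THEN [] ELSE [r] END | DELETE rel)
FOREACH (t IN CASE WHEN oldTermNode IS NULL THEN [] ELSE [oldTermNode] END |
  SET t.contentConnections = t.contentConnections - 1)
WITH DISTINCT contentNode
// Add (and increment) only the terms that are not already connected.
MATCH (newTermNode:term)
WHERE newTermNode.UUID IN {termIDs}
  AND NOT (contentNode)-[:TAGGED_WITH]->(newTermNode)
CREATE (contentNode)-[:TAGGED_WITH]->(newTermNode)
SET newTermNode.contentConnections = newTermNode.contentConnections + 1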

lua and lsqlite3: speeding up select statement

I'm using the lsqlite3 lua wrapper and I'm making queries into a database. My DB has ~5million rows and the code I'm using to retrieve rows is akin to:
db = lsqlite3.open('mydb')
local temp = {}
local sql = "SELECT A,B FROM tab where FOO=BAR ORDER BY A DESC LIMIT N"
for row in db:nrows(sql) do temp[row['key']] = row['col1'] end
As you can see, I'm trying to get the top N rows sorted in descending order (I want to sort first and then apply the LIMIT, not the other way around). I indexed column A, but it doesn't seem to make much of a difference. How can I make this faster?
You need to index the column on which you filter (i.e. the one in the WHERE clause). The reason is that ORDER BY comes into play after filtering, not the other way around.
So you probably should create an index on FOO.
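For example (assuming the table name tab from your snippet):
-- Index the filter column so the WHERE clause can use it.
CREATE INDEX IF NOT EXISTS idx_tab_foo ON tab(FOO);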
Can you post your table schema?
UPDATE
Also you can increase the sqlite cache, e.g.:
PRAGMA cache_size=100000
You can adjust this depending on the memory available and the size of your database.
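With lsqlite3 the pragma can be issued on the open handle, e.g. (a sketch reusing the db variable from your snippet; the value is illustrative):
-- Raise the page cache before running the heavy query; tune the value to the
-- memory you can spare.
db:exec("PRAGMA cache_size = 100000")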
UPDATE 2
If you want to have a better understanding of how your query is handled by sqlite, you can ask it to provide you with the query plan:
http://www.sqlite.org/eqp.html
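For example, prefixing the query from your snippet (placeholders left as-is):
-- Shows which indexes (if any) SQLite will use and whether a temporary
-- b-tree is needed for the ORDER BY.
EXPLAIN QUERY PLAN SELECT A, B FROM tab WHERE FOO = BAR ORDER BY A DESC LIMIT N;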
UPDATE 3
I did not understand your context properly in my initial answer. If you ORDER BY over a large data set, you probably want that (ORDER BY) index to be used, not the one on the filter column, so you can tell sqlite not to use the index on the filter column like this:
SELECT a, b FROM foo WHERE +a > 30 ORDER BY b
