Cypher returns exponential count result on 'join' - database

I am learning Cypher on Neo4j and I am having trouble understanding how to write an efficient 'join' equivalent in Cypher.
I am using the standard Matrix character example, and I have added some 'Gun' nodes to the mix, connected by a ':GIVEN_TO' relationship. You can see the console with my query result here:
http://console.neo4j.org/r/rog2hv
The query I am using is:
MATCH (Neo:Crew { name: 'Neo' })-[:KNOWS*..]->(other:Crew),
      (other)<-[:GIVEN_TO]-(g:Gun),
      (Neo)<-[:GIVEN_TO]-(g2:Gun)
RETURN count(g2);
I have given Neo 4 guns, but when I run the above I get a count of '12'. This seems to happen because there are 3 'others' and 3*4 = 12, so I am getting a multiplied (cross-product) result rather than the count I expect.
What should my query look like to get the correct count ('4') from the example?
Edit:
The reason I am not querying the guns directly, as suggested by @ceej, is that in my real use case I have to do the traversal described above. Adding DISTINCT does not change my result.

The reason you get 12 guns instead of 4 is that your query produces a cartesian product: you have asked for several patterns in the same MATCH statement without joining them. As @ceej rightly pointed out, if you only want to find Neo's guns you would do as he suggested in his first query.
If you wanted to get a list of the crew members and their guns then you could do something like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN crew.name, collect(g.name)
This finds all of the crew members with guns and returns their names and the guns that they were given.
If you wanted to invert it and get a list of the guns and the respective crew members they were given to, you could do the following...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN g.name, collect(crew.name)
If you wanted to find all of the crew who know Neo, multiple levels deep, and were given a gun, you could write the query like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
WITH crew, g
MATCH (neo:Crew {name: 'Neo'})-[:KNOWS*0..]->(crew)
RETURN crew.name, collect(g.name)
That finds all the crew that were given guns and then determines which of them have a :KNOWS path to Neo.
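If the multi-hop traversal really does have to stay in the query, as your edit says, another option is to run the traversal first, collapse it with WITH, and only then count Neo's guns, so the count is no longer multiplied by the number of matched crew members and guns. A rough sketch, reusing the labels and relationship types from above:
MATCH (neo:Crew {name: 'Neo'})-[:KNOWS*]->(other:Crew)<-[:GIVEN_TO]-(:Gun)
WITH DISTINCT neo                 // collapse the traversal down to a single row
MATCH (neo)<-[:GIVEN_TO]-(g2:Gun)
RETURN count(g2)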

Forgive me, but I am unclear why you have the initial MATCH in your query. From your explanation it would appear that you are trying to count the :Gun nodes linked to Neo by the :GIVEN_TO relationship, in which case all you need is the latter part of your query, which would give you something like
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count(g)
Furthermore, to make sure that you are only counting distinct :Gun nodes you can add DISTINCT to the RETURN statement.
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count( DISTINCT g )
This is possibly unnecessary in your case but can be helpful when the pattern that you are matching on can arrive at the same node by different traversals.
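Applied to your original multi-pattern query, counting distinct gun nodes should collapse the 12 rows back down to Neo's 4 guns, assuming all 12 rows refer to the same four :Gun nodes. Roughly:
MATCH (neo:Crew {name: 'Neo'})-[:KNOWS*]->(other:Crew),
      (other)<-[:GIVEN_TO]-(g:Gun),
      (neo)<-[:GIVEN_TO]-(g2:Gun)
RETURN count(DISTINCT g2)   // each gun node is counted once, however many rows it appears on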
Have I misunderstood your requirement?

Related

Why does a Cypher query return duplicated results in Neo4j?

I am implementing the example Movies database in Neo4j. I have already searched for information about duplicated rows, but I still have doubts.
I am using XOR in my query:
MATCH (m:Movie)<-[r]-(p:Person)
WHERE m.title STARTS WITH 'The'
XOR (m.released = 1999 OR m.released = 2003)
RETURN m.title, m.released
So, my result contains duplicated rows. I don't understand why this happens, or what determines how many duplicates there are.
I know that DISTINCT removes duplicates, but I am interested in understanding why the query duplicates the results and what the number of duplicates corresponds to.
This is because you are matching
MATCH (m:Movie)<-[r]-(p:Person)
So the movie title is returned once for each person connected to the movie: if there are 4 people in the movie, you get the movie title back 4 times. You can remove the duplicates by matching only the movie
MATCH (m:Movie)
As Tomaz said, it is returning a row for every :Person that has a relationship to :Movie. If you concluded your query with just RETURN m and viewed the results, you probably would only see non-duplicated nodes appear. Otherwise, you can conclude the query with RETURN DISTINCT m to ensure that non-duplicated results are returned.
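Keeping the original projection, that suggestion would look something like the following sketch, where DISTINCT applies to the whole returned row (title plus release year):
MATCH (m:Movie)<-[r]-(p:Person)
WHERE m.title STARTS WITH 'The'
  XOR (m.released = 1999 OR m.released = 2003)
RETURN DISTINCT m.title, m.released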

Cypher query in neo4j to find specific node with most paths matching pattern

I have a Neo4j database with statistical information on water and waste. In this database, data points are linked to the facts that are relevant to them, including mappings to internal definitions. The attached screenshot shows an example of a data point and its related metadata. The node in the center is the value, and the immediate nodes linked by "HAS_DIMENSION" are the dimensions that came from the data provider. These are not fixed and change depending on the provider. Each dimension of interest is mapped to an internal definition. Currently this is my query:
MATCH (o:Observation {uq_id:'e__ABS_AGR_AQ__FSW__MIO_M3__BG__1970____9f07c7a629625e5ae00e35838fcd4f824a3593dd'})-[:HAS_DIMENSION]->()
MATCH (o)-[:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(v:Variable)<-[:HAS_UNIT]-(u:Unit)
MATCH (o)-[vl0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(l:Location)
MATCH (o)-[vc0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(c:Country)
MATCH (o)-[vy0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(y:Year)
MATCH (o)-[:HAS_DIMENSION]->(unk0)
MATCH (o)-[sr0:CAME_FROM_FILE]->(ds0)-[sr1:BELONGS_TO]->(s0)
OPTIONAL MATCH (o)-[dtr0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(d:DataType)
RETURN *
The issue I have is exemplified by the pink circles. I want only one pink circle (a node with the label Variable) in the result; in particular, I want the variable matched as follows
MATCH (v:Variable)<-[:MAPS_TO]-()<-[:HAS_DIMENSION]-(o:Observation)
By this I want to force it to observe a pattern where it identifies the single variable that matches the pattern above through the greatest number of intermediate nodes. So the "Fresh surface water abstracted" variable would match, since it has two paths that satisfy the pattern, but "Fresh groundwater abstracted" would not, since it only has one. How could I accomplish this?
It sounds like you want to return the Variable node with the greatest number of paths leading to it. Would something like this roughly return the results you are after? You will need to adapt it to your own matching statements.
MATCH p=(o:Observation {uq_id:'<your_id>'})-[:HAS_DIMENSION]->()<-[:MAPS_TO]-(v:Variable)
RETURN v.name, COUNT(p) AS pathCount ORDER BY pathCount DESC LIMIT 1

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine unless multiple user vertices are returned, for example because several users have the word "developer" in their description. In that case, the count in relationships is the sum of the relationships between all returned users and the user with id 123, and not, as desired, the individual count for each returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondarily, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the traversal inside local() to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
gremlin> __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But it still left me feeling like: "Do we really need the match step?" That's when Mr. Kuppitz closed in on the final answer, which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by applies to "user" and returns the vertex as-is, while the second by processes "relationships" with a "local" groupCount.

neo4j / cypher - why is starting node excluded?

I have a simple graph:
When I run this simple query in Neoclipse:
START me=node:node_auto_index(name="Me")
MATCH me-[:LIVES_IN]->()<-[:LIVES_IN]-(f)
RETURN f.name;
only my Girlfriend is returned!
Why am I excluded from the result?
Results
f.name Girlfriend
Because a path (what you specify in the match) will never contain the same relationship twice.
To find all the people living in the same location, including yourself, you need to split the query into two steps: one finding your city, and one collecting the people in that city, connected with the WITH statement:
start me=node:node_auto_index(name='Me')
match me-[:LIVES_IN]->homebase
with homebase
match homebase<-[:LIVES_IN]-people
return people
See http://console.neo4j.org/?id=t0wjhg
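For reference, with labels and the newer MATCH-only syntax (the START/auto-index form above comes from older Neo4j versions), the same two-step idea looks roughly like this, assuming your nodes carry a :Person label and a name property:
MATCH (me:Person {name: 'Me'})-[:LIVES_IN]->(homebase)
MATCH (homebase)<-[:LIVES_IN]-(people)   // a separate MATCH may reuse the same :LIVES_IN relationship, so 'Me' is included
RETURN people.name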

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns: one holds the ID and the other holds the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify which of the acronyms from table 2 appear in it.
Finally, the program should create a separate table whose first column contains the ID from the first column of table 1 and whose second column contains the acronyms found in the abstract associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time grows with documents × words per abstract × acronyms, because acronyms.index performs a linear search (over our largest list, no less). We can improve the algorithm by first building a hash index (a dict) of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))

joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc.; however, it did not work in Microsoft Access.
The second and third solutions are Python, I assume. Can I feed acronym and document as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
