I have a number of graphs in an AnzoGraph instance. Using rdflib, I want to access them as a union graph.
When I parse the graphs into memory instead of attaching the store, this works fine with a ConjunctiveGraph. However, when attaching to AnzoGraph as a store, I cannot get the union graph.
Here are the snippets:
import rdflib
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore("http://192.168.1.104:7070/sparql", "http://192.168.1.104:7070/update")
graph = rdflib.ConjunctiveGraph(store=store)
Then
for c, in graph.query("select (count(*) as ?c) { ?s ?p ?o }"):
print(c)
gives this output
0
While
for g, c in graph.query("select distinct ?g (count(*) as ?c) { graph ?g { ?s ?p ?o } } group by ?g"):
print(g, c)
gives
urn:2022-11-25.ttl 5760
urn:2022-11-27.ttl 9160
urn:2022-11-24.ttl 5565
urn:2022-11-26.ttl 7645
urn:2022-11-22.ttl 2820
urn:2022-11-23.ttl 7250
The rdflib manual states:
A ConjunctiveGraph is an (unnamed) aggregation of all the named graphs in a store. ... All queries are carried out against the union of all graphs.
Is there any setup to achieve a union graph with Anzo as the store?
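For reference, here is a minimal workaround sketch (not an Anzo-specific setting, just making the union explicit in the query text, which the per-graph query above shows already works against the store):

# Workaround sketch: ask for the union explicitly instead of relying on
# the store's default graph being the union of all named graphs.
union_count = """
SELECT (COUNT(*) AS ?c)
WHERE { GRAPH ?g { ?s ?p ?o } }
"""
for c, in graph.query(union_count):
    print(c)  # total triples across all named graphs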
I'm planning to use the SQL Server 2019 graph features for one of my projects. The data schema would look something like the picture below.
Given a user (Id: 2356, name: Mark), I want to retrieve all the Posts and Tweets made by the user's followers, ordered by when they were posted or tweeted, together with a limit/pagination on the overall result.
As of now, I don't know of a better way than running two separate queries and handling pagination manually, which is inefficient and also cumbersome if another edge type is added in the future in addition to Posted/Tweeted.
Are there better ways to address such use cases in SQL Server graph?
SELECT mainUser.*, followingUser.*, followerPost.*
FROM
    [User] mainUser, Follows userFollows, [User] followingUser, Posted followerPosted, Post followerPost
WHERE
    MATCH(mainUser-(userFollows)->followingUser-(followerPosted)->followerPost)
    AND mainUser.id = 2356
ORDER BY
    followerPosted.posted_on DESC
SELECT mainUser.*, followingUser.*, followerTweet.*
FROM
    [User] mainUser, Follows userFollows, [User] followingUser, Tweeted tweeted, Tweet followerTweet
WHERE
    MATCH(mainUser-(userFollows)->followingUser-(tweeted)->followerTweet)
    AND mainUser.id = 2356
ORDER BY
    tweeted.tweeted_on DESC
Use a heterogeneous edge or node view. See this answer: https://stackoverflow.com/a/70055567/3434168.
---- there may be column collisions due to UNION ALL, so fix them as you need
---- it doesn't matter which columns you select in your view
---- the MATCH algorithm (probably) uses the metadata of the VIEW
CREATE VIEW v_SecondTierEdges AS
SELECT *, 'Tweeted' AS Type FROM Tweeted
UNION ALL
SELECT *, 'Posted' AS Type FROM Posted
GO
CREATE VIEW v_SecondTierNodes AS
SELECT tweeted_on AS did_that_on, 'Tweet' AS Type FROM Tweet
UNION ALL
SELECT posted_on AS did_that_on, 'Post' AS Type FROM Post
GO
SELECT
    mainUser.*, followingUser.*, followerTweet_or_Post.*
FROM
    [User] mainUser, Follows userFollows, [User] followingUser, v_SecondTierEdges tweeted_or_posted, v_SecondTierNodes followerTweet_or_Post
WHERE
    MATCH(mainUser-(userFollows)->followingUser-(tweeted_or_posted)->followerTweet_or_Post)
    AND mainUser.id = 2356
ORDER BY
    followerTweet_or_Post.did_that_on DESC
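If you also need the limit/pagination asked for in the question, one possibility (a sketch, not part of the original answer; the offset of 0 and page size of 20 are arbitrary placeholders) is that, since everything now comes back from a single MATCH query, standard OFFSET ... FETCH can be appended to its ORDER BY:

SELECT mainUser.*, followingUser.*, followerTweet_or_Post.*
FROM
    [User] mainUser, Follows userFollows, [User] followingUser, v_SecondTierEdges tweeted_or_posted, v_SecondTierNodes followerTweet_or_Post
WHERE
    MATCH(mainUser-(userFollows)->followingUser-(tweeted_or_posted)->followerTweet_or_Post)
    AND mainUser.id = 2356
ORDER BY
    followerTweet_or_Post.did_that_on DESC
-- placeholder paging values: skip 0 rows, return the first page of 20
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY;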
I have a tree of nodes in my application, let's say it looks like this -
A
 -> B
 -> C
    -> D
    -> E
where A, B, C, ... are the node labels. I have the GUID of the root node, and I'd like to retrieve all possible nodes of the given types for this tree.
What I am doing is creating all possible paths in that tree, like {A -> B}, {A -> C}, {A -> C -> D}, ... and concatenating them into one big query using UNION ALL (sometimes it's about 100 UNION ALL statements), e.g.:
MATCH path = (:A {guid:'123456'})-[:REL]->(:B) RETURN path
UNION ALL
MATCH path = (:A {guid:'123456'})-[:REL]->(:C) RETURN path
UNION ALL
MATCH path = (:A {guid:'123456'})-[:REL]->(:C)-[:REL]->(:D) RETURN path
// ... and so on for every path in the tree
It works, but it takes seconds even on a small dataset. I noticed that it is slow only the first time; after that the query takes 10-20 ms. It looks like query planning consumes most of the time, but unfortunately my trees are dynamic and all those paths are unique each time, so Neo4j apparently can't cache the plans.
I've profiled a subquery of my UNION ALL query (one path in the tree), and even this subquery takes 90 ms for the first run -
MATCH path = (:PROVIDER {guid:'cafbf60e-612a-4c36-9337-50c26c941911'})<-[:REL]-(:ADDRESS)-[:REL]->(:ATTRIBUTE)-[:REL]->(:VALUE)-[:REL]->(:FIELD)<-[:REL]-(:TYPE)-[:REL]->(:CODE) RETURN path
Why is it so bad? Can this subquery be optimized, or can the whole UNION ALL query be redesigned somehow?
I have an index on :PROVIDER(guid), the dataset is about 800 nodes, and this particular query returns 0 results.
Query profile result -
Cypher version: CYPHER 3.2, planner: COST, runtime: INTERPRETED. 1 total db hits in 90 ms.
Neo4J version: 3.2.5
This is a bad case for UNION ALL; you should look for alternatives.
Have you tried using variable-length patterns yet? If you just want the entire tree, this should work, no need for UNION ALL at all:
MATCH path = (:PROVIDER {guid:'cafbf60e-612a-4c36-9337-50c26c941911'})-[:REL*]-()
RETURN path
You may also want to double-check your indexes (run a PROFILE of this query and make sure it's using a NodeIndexSeek), as the query plan you provided is using a NodeByLabelScan.
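For example (a sketch assuming the Neo4j 3.x schema-index syntax, given the 3.2.5 version mentioned above; run the two statements separately):

// create the schema index on :PROVIDER(guid) if it does not exist yet (Neo4j 3.x syntax)
CREATE INDEX ON :PROVIDER(guid);

// then profile the variable-length query and check that the plan starts with a NodeIndexSeek
PROFILE
MATCH path = (:PROVIDER {guid:'cafbf60e-612a-4c36-9337-50c26c941911'})-[:REL*]-()
RETURN path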
Your query
MATCH path = (:A {guid:'123456'})-[:REL]->(:B) RETURN path
UNION ALL
MATCH path = (:A {guid:'123456'})-[:REL]->(:C) RETURN path
UNION ALL
MATCH path = (:A {guid:'123456'})-[:REL]->(:C)-[:REL]->(:D) RETURN path
does not seem to need a UNION ALL. Why not:
MATCH path = (n:A {guid:'123456'})-[:REL*0..2]->(m) RETURN path
If you are looking for specific nodes, you could add that constraint:
MATCH path = (n:A {guid:'123456'})-[:REL*0..2]->(m) WHERE m.{property} IN [{list}] RETURN path
Also, as InverseFalcon notes, what do you really want in the results? That may require a somewhat different query.
After working with Neo4j for a while, and now coming to the point of considering writing my own entity manager (object manager) to work with the fetched data in the application, I wonder about Neo4j's output format.
When I run a query, the result is always returned as tabular data. Why is this?
Sure, tables have a big place in data processing, but it seems strange that a graph database can only output in this format.
Now, when I want to create an object graph in my application, I have to hydrate all the objects, which is not good for performance and doesn't leverage true graph performance.
Consider MATCH (A)-->(B) RETURN A, B: when there is one A and three Bs, it would return:
A B
1 1
1 2
1 3
That's the same A passed down three times over the database connection, while I only need it once and I know this before the data is fetched.
Something like this seems great: http://nigelsmall.com/geoff
load2neo is nice; a load-from-neo would also be nice, either in the Geoff format or any of the other formats out there: https://gephi.org/users/supported-graph-formats/
Each language could then implement its own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve its own post): what's a good way to model relations in an object graph? As objects, or as data/methods inside the node objects?
#Kikohs
Q: What do you mean by "Each language could then implement its own functions to create the objects directly"?
A: With a (partial) graph provided by the database (as the result of a query), a language such as PHP could provide a factory method (preferably in C) to construct the object graph (this is usually an expensive operation), but only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However, a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: Do you want to dump to disk the subgraph created from the result of a query?
A: No, existing interfaces like REST are preferred for getting the query result.
Q: Do you want to create the subgraph which comes from a query in memory and then request it in another language?
A: No, I want the result of the query in a format other than tabular (examples mentioned above).
Q: If you make a query which only returns the name of a node, would you like to get the full associated node or just the name? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) the node ID, its labels and its properties, and B) the relations to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question but rather "why is this not possible?" (or, if it is possible, I'd like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP".
#Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It is reasonable to say that an alternative output format would only complement the table format, not replace it.
I understand from your answer that the natural (raw-ish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). In that case I understand that it's left to another program (in the dev stack) to do the mapping. So my conclusion on Neo4j implementing something like this:
Pros: not having to do this in every implementation language (of the application)
Cons: 1) no application-specific mapping is possible; 2) no performance gain if the implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely. Are you saying that these formats are not able to hold deduplicated results (in certain cases)? So, in fact, there is no textual format in which a graph can be described without duplication?
"There is also the questions on what you want to include in your output?"
I was under the assumption that the Cypher language was powerful enough to specify this in the query, and so the output format would contain whatever the database can provide as a result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion; I'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all, as you said, tabular results from queries are really commonplace and are needed to integrate with other systems and databases.
Secondly, oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced or extracted information from your graph. So the relationships to the original graph data are already lost in most of the query results I see being used.
The only time people need or use the raw graph data is when they want to export subgraph data from the database as a query result.
The problem with doing that as a de-duplicated graph is that the database has to fetch all the result data into memory first to deduplicate it, extract the needed relationships, etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potentially duplicate nodes and relationships).
There is also the question of what you want to include in your output: just the nodes and relationships returned? Additionally, all the other relationships between the nodes that you return? Or all the relationships of the returned nodes (but then you also have to include the end nodes of those relationships)?
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. you can use collect to aggregate data per node (i.e. a kind of subgraph):
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality, it makes sense to write a server extension that uses cypher for query specification but doesn't allow RETURN statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages), and then render it in a suitable format (e.g. JSON or one of those listed above).
These could be a starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
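For illustration, a rough sketch of the general node/link shape that d3 force layouts consume (the exact field names used by the Neo4j console may differ; "labels" and "type" here are just illustrative):

// illustrative shape only, not the exact console output
{
  "nodes": [
    {"id": 0, "labels": ["A"]},
    {"id": 1, "labels": ["B"]}
  ],
  "links": [
    {"source": 0, "target": 1, "type": "REL"}
  ]
}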
I've been working with Neo4j for a while now and I can tell you that if you are concerned about memory and performance you should drop Cypher altogether, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as an administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
Secondly, there is no mention in the docs of an API method to retrieve the output as a graph-like structure; you will have to process the output of the query and build it yourself.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r
Initially I was trying to find out why it's so slow to do a spatial query with multiple SDO_RELATE calls in a single SELECT statement, like this one:
SELECT * FROM geom_table a
WHERE SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=inside')='TRUE' AND
SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=anyinteract')='TRUE';
Note that the two SDO_GEOMETRY arguments may not necessarily be the same, so this is a bit different from SDO_RELATE(a.geom_column, the_same_geometry, 'mask=inside+anyinteract')='TRUE'.
Then I found this paragraph in the Oracle documentation for SDO_RELATE:
Although multiple masks can be combined using the logical Boolean operator OR, for example, 'mask=touch+coveredby', better performance may result if the spatial query specifies each mask individually and uses the UNION ALL syntax to combine the results. This is due to internal optimizations that Spatial can apply under certain conditions when masks are specified singly rather than grouped within the same SDO_RELATE operator call. (There are two exceptions, inside+coveredby and contains+covers, where the combination performs better than the UNION ALL alternative.) For example, consider the following query using the logical Boolean operator OR to group multiple masks:
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=touch+coveredby') = 'TRUE';
The preceding query may result in better performance if it is expressed as follows, using UNION ALL to combine the results of multiple SDO_RELATE operator calls, each with a single mask:
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=touch') = 'TRUE'
UNION ALL
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
  AND SDO_RELATE(A.Geometry, B.Geometry, 'mask=coveredby') = 'TRUE';
This somewhat answers my question, but it still only says "due to internal optimizations that Spatial can apply under certain conditions". So I have two questions:
What is meant by "internal optimizations"? Is it something to do with the spatial index? (I'm not sure whether I'm asking too much here; maybe only developers at Oracle know.)
The Oracle documentation doesn't say anything about my original problem, i.e. SDO_RELATE(..., 'mask=inside') AND SDO_RELATE(..., 'mask=anyinteract') in a single SELECT. Why does it also have very bad performance? Does it work similarly to SDO_RELATE(..., 'mask=inside+anyinteract')?
I need to SUM the results of multiple queries.
The challenge I have is that each query has defined members (to calculate a date range).
I need to be able to combine/sum those members across multiple MDX queries:
WITH Member [M1] AS Sum(DateRange, Measure)
SELECT [M1]
FROM [Cube]
WHERE {[x].&[y]}
WITH Member [M1] AS Sum(Different DateRange, Measure)
SELECT [M1]
FROM [Cube]
WHERE {[z].&[q]}
Each query selects the same members based on different criteria.
The only way I can think of doing this is a UNION and then SUM([M1]), but I have no idea how that is possible in MDX.
UPDATE - in reply to the icCube question, here is why I need a separate WHERE clause for each query:
I need separate WHERE sections for each query because I need to aggregate the results of different slices, and my slices are defined by n dimensions. I emit the MDX query for each slice dynamically based on user configuration input (and construct my WHERE clause dynamically to filter by user preferences). Users are allowed to configure overlapping slices (these are the ones I need to sum up together). Then I need to combine these slice row counts into a report. The way I am doing this is by passing a string with the MDX query to a report, but since I can't think of a way to get multiple queries into one executable string (nor do I know how many queries there will be), this approach is no longer possible (unless there is some way to union/sum them).
The only way I can think of accomplishing this for now is with an additional batching step that iterates through all the queries, processes them (using ADOMD.NET) into a staging table, and then aggregates them into a report using SQL SUM(..). The biggest disadvantages of this approach are an additional system to maintain and a greater chance that the data in the report will be stale.
Not sure if this is what you're looking for:
WITH Member [M1] AS Sum(Different DateRange, ([z].&[q],Measure) ) +
Sum(DateRange, ([x].&[y],Measure))
SELECT [M1]
FROM [Cube]
or
WITH Member [M1] AS Sum(Different DateRange * {[z].&[q]}, Measure ) +
Sum(DateRange * {[x].&[y]}, Measure)
SELECT [M1]
FROM [Cube]
I don't know of any way of adding the results of two SELECTs in MDX...
I believe you need Aggregate(), not Sum().
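A sketch of what that might look like, reusing the crossjoin form from the answer above (the member, set and measure names are the same placeholders used in the question; this is schematic MDX, not a tested query):

WITH Member [M1] AS
    -- first slice
    Aggregate(DateRange * {[x].&[y]}, Measure) +
    -- second slice
    Aggregate(Different DateRange * {[z].&[q]}, Measure)
SELECT [M1]
FROM [Cube]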
You could implement the UNION behavior in MDX using subcubes in this way:
Select
{...} On Columns,
{...} On Rows
From (
Select
{
{Dimension1.Level.Members * Dimension2.&[1] * Dimension3.&[2]},
{Dimension1.&[X] * Dimension2.Members * Dimension3.&[5]}
} On Columns
From [Cube]
)