I have a node label CUSTOMER that has one key, CUSTOMER_ID.
Each customer ID is linked to other customer IDs; these bidirectional relationships are created from CSV files.
I want the result in the form below for all nodes:
CUSTOMER_ID, MIN(CUSTOMER_ID) over the set of related nodes
600,600
601,600
602,600
604,600
605,600
There will be many such groups of linked nodes (subgraphs) in the total data.
I was able to get it using the query below:
MATCH (a:Member_Matching_1)-[r:MATCHED*]->(b:Member_Matching_1)
WITH DISTINCT a, b
RETURN a.OPTUM_LAB_ID, min(b.OPTUM_LAB_ID)
ORDER BY toInt(min(b.OPTUM_LAB_ID)), toInt(a.OPTUM_LAB_ID)
but the issue is that the query traverses the graph along far too many unwanted paths.
Example:
wanted : 600 -> 601 -> 602 -> 604
Unwanted : 600 -> 601 -> 602 -> 603 -> 602 -> 604
As the data volume will be very high, I want to use the most efficient query possible.
After spending some time searching the web, I came across a solution:
MATCH p=(a:Member_Matching_1)-[:MATCHED*]->(b:Member_Matching_1)
WHERE NONE(n IN nodes(p)
           WHERE size(filter(x IN nodes(p) WHERE n = x)) > 1)
RETURN EXTRACT(n IN nodes(p) | n.OPTUM_LAB_ID);
But I am facing the error
Neo.DatabaseError.General.UnknownError
key not found: UNNAMED32
Please advise
Thanks in advance
As of today, Cypher is not really well-suited for this sort of query, as it only supports edge uniqueness, not vertex uniqueness. There is a proposal in the openCypher language to support configurable matching semantics, but it has only been accepted recently and has not been merged into Neo4j.
So currently, for this sort of traversal, you are probably better off using the APOC library's apoc.path.expandConfig procedure. It allows you to set uniqueness constraints such as NODE_PATH, which enforces that "For each returned node there's a unique path from the start node to it."
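A rough sketch of what that could look like against the label and property from your query (untested; check the APOC docs for your version for the exact config keys):

MATCH (a:Member_Matching_1)
CALL apoc.path.expandConfig(a, {
    relationshipFilter: 'MATCHED>',
    uniqueness: 'NODE_PATH',
    minLevel: 1
}) YIELD path
WITH a, last(nodes(path)) AS b
RETURN a.OPTUM_LAB_ID, min(b.OPTUM_LAB_ID);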
Also, when I faced a similar problem, I tried the following hack: set a fixed traversal depth and manually specify the uniqueness constraints. This did not work well for my use case, but it might be worth a try. Sketch code:
MATCH p=(n)-[*5]->(n)
WHERE nodes(p)[0] <> nodes(p)[2]
AND nodes(p)[0] <> nodes(p)[4]
AND nodes(p)[2] <> nodes(p)[4]
RETURN nodes(p)
LIMIT 1
The error you got (Neo.DatabaseError.General.UnknownError / key not found: UNNAMED32) is very strange indeed; it seems that your query overstressed the database, which resulted in this (quite unique) error message.
Note: I agree with the comment of @TomGeudens stating that you should not create the MATCHED edge twice: just use a single direction and account for the undirected nature of the edge in your queries, i.e. use (...)-[...]-(...) in Cypher.
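For illustration, with the edge stored only once, your first query would simply drop the arrowhead (same result, half as many relationships to store):

MATCH (a:Member_Matching_1)-[:MATCHED*]-(b:Member_Matching_1)
WITH DISTINCT a, b
RETURN a.OPTUM_LAB_ID, min(b.OPTUM_LAB_ID);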
I need to cluster some data in Snowflake using DBSCAN. I created a UDF, but the results don't match a local run, so I ran a UDF that just builds a list of the row numbers being processed, and the result is a list with repeated values whose maximum is much smaller than the number of rows in my table. (The expected result was unique values up to the number of rows.)
Can this be a parallelization issue?
If so, is there a way to cluster data using DBSCAN in Snowflake?
Thanks!
EDIT -> code example:
# Imports below are assumptions added so the snippet is self-contained
# (Snowpark for Python plus scikit-learn).
import pandas as pd
from sklearn.cluster import DBSCAN
from snowflake.snowpark import functions as funcs, types

@funcs.pandas_udf(name='DBSCAN_TEST', is_permanent=True,
                  stage_location='@UDF', replace=True,
                  packages=['scikit-learn==1.0.2', 'pandas', 'numpy'])
def DBSCAN_TEST(data_x: types.PandasDataFrame[float, float]) -> types.PandasSeries[float]:
    data_x.columns = features  # 'features' is assumed to be the list of column names, defined elsewhere
    DBSCAN_cluster = DBSCAN(eps=2.5, min_samples=4)
    DBSCAN_cluster.fit(data_x)
    return pd.Series(DBSCAN_cluster.labels_)  # the original 'return resul' referenced an undefined name
As input data I used this to test DBSCAN.
If I run that locally (using the exact same code as inside the UDF) I end up with 24 clusters, which is the expected result. But if I use the UDF it goes up to 71.
I've tried changing the input types to string, as a coworker suggested, but it didn't work.
The only clustering option in Snowflake is a clustering key, and the general rule of thumb is that this isn't something you typically need until you exceed a terabyte of data in a table and/or automatic clustering proves deficient in performance.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html
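For reference, defining a clustering key is a one-line DDL change (table and column names below are made up):

ALTER TABLE my_big_table CLUSTER BY (event_date, customer_id);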
I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or URLs, and the edges hold a weight attribute and the percentage of times that search led to someone clicking that URL:
Number of times the URL was clicked / Number of times the term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity, like a foreign key?
The following info might be helpful.
If you store the number of clicks instead of the percentage, there is no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
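A quick sketch of how the counters would then be maintained independently under that model (a search bumps nSearches, a click bumps nClicks; the ratio is never stored):

MATCH (term:Term {id: 1})
SET term.nSearches = term.nSearches + 1;

MATCH (term:Term {id: 1})-[r:HAS_URL]->(:Url {id: 2})
SET r.nClicks = r.nClicks + 1;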
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
ORDER BY toFloat(r.nClicks) / term.nSearches DESC // toFloat avoids integer division
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
Unfortunately no, Neo4j doesn't support this. You can still do it, with one of two methods. I'll tell you what they both are, then make a recommendation.
Relative to your relational database background, I don't think you're looking for a foreign key or "referential integrity"; I think what you're looking for is more like a trigger. A trigger is a function or procedure that executes when data changes. In your case, it'd probably be good to have trigger functions that recalculate all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has got you covered there with a description of how you can do triggers in Neo4j. Spoiling the surprise, there's a TransactionEventHandler in the Java API. When the right kind of transaction comes through, you can catch it and do extra stuff.
Option 2 - the server provides an extension/plugin mechanism so that you could write this on your own. This is a big hammer, it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd check the transaction data to see if it is something of interest (not all transactions will be). If the transaction updated a search term node or changed a searched count, you'd then do your recomputation, fix your weights, and you should be good.
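A rough sketch of the shape such a handler might take with the embedded Java API (the nSearches property name is an assumption borrowed from the other answer's data model; registration details and the actual recalculation are left out):

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.event.PropertyEntry;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

public class WeightRecalcHandler implements TransactionEventHandler<Object> {

    @Override
    public Object beforeCommit(TransactionData data) {
        return null; // nothing to hand over to afterCommit
    }

    @Override
    public void afterCommit(TransactionData data, Object state) {
        // React only to transactions that changed a search counter
        for (PropertyEntry<Node> entry : data.assignedNodeProperties()) {
            if ("nSearches".equals(entry.key())) {
                Node term = entry.entity();
                // open a new transaction and recompute the percentage
                // on every relationship incident to 'term'
            }
        }
    }

    @Override
    public void afterRollback(TransactionData data, Object state) {
        // nothing to clean up
    }
}

You'd register it once at startup with graphDb.registerTransactionEventHandler(new WeightRecalcHandler()).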
I'm in the process of trying to learn Cypher for use with graph databases.
Using Neo4j's test database (http://www.neo4j.org/console?id=shakespeare), how can I find all the performances of a particular play? I'm trying to establish how many times Julius Caesar was performed.
I have tried to use:
MATCH (title)<-[:PERFORMED]-(performance)
WHERE (title: "Julias Caesar")
RETURN title AS Productions;
I'm aware it's quite easy to recognise manually, but on a larger scale it wouldn't be possible.
Thank you
You would have to count the number of performance nodes. You can use count() to get the number of nodes.
MATCH (t)<-[:PERFORMED]-(performance)
WHERE t.title = "Julius Caesar"
RETURN DISTINCT t AS Productions, count(performance) AS count_of_performance
I primarily intended to ask this question: "Is ArangoDB a true graph database?"
But that question would sound quite offensive.
You people at triAGENS did a really great job creating a "multi-paradigm" database.
As a user of PostgreSQL, PostGIS, MongoDB and Neo4j/Titan, I really appreciate seeing an "all-in-one" solution :)
But the question remains: creating a graph in ArangoDB basically requires creating two separate collections, one for edges and one for vertices, which, as far as I understand, already means that vertices and their related edges are not "physically" neighbors.
Moreover, even after creating the appropriate indexes, I'm facing some serious performance issues when doing this kind of thing in Gremlin:
g.v('an_id').out('likes').in('likes').count()
Which returns a result after ~3 seconds (perceived time).
I assumed I had poorly understood how Gremlin and Blueprints/ArangoDB work, so I tried to rewrite the same query using AQL:
LET lst = (FOR e1 in NEIGHBORS(vertices, edges, "an_id", "outbound", [ { "$label": "likes" } ] )
FOR e2 in NEIGHBORS(vertices, edges, e1.edge._to, "inbound", [ { "$label": "likes" } ] )
RETURN 1
)
RETURN length(lst)
Which gives me a delay of the same order of magnitude.
If I run the same query on a Titan or Neo4j database (with the very same data), it returns almost immediately (perceived time: <200 ms).
So it seems to me that ArangoDB's graph features are a "smart graph layer" on top of a "traditional document database", but that ArangoDB is not a "native" graph database.
To confirm this feeling, I transformed the data, loaded it into PostgreSQL, and ran a query (with a multi-table JOIN, as you can imagine), and got execution delays similar to ArangoDB's.
Did I do something wrong (in the AQL query)?
Is there a way to optimize the database to get better traversal times?
In PostgreSQL, conceptually, I would mix edges and nodes and use a CLUSTER command to physically order the data; can something similar be done in ArangoDB? (I assume it would be hard, as it would involve "interlacing" edges and nodes; just an intuition.)
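For reference, on the PostgreSQL side I mean something like this (made-up table and index names):

CREATE INDEX adjacency_idx ON graph_rows (source_vertex_id);
CLUSTER graph_rows USING adjacency_idx;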
I am a core developer of ArangoDB. Could you give me a bit more information about the dimensions of the data you are using?
Number of vertices
Number of edges
Then we can create our own setup with equal dimensions and optimize it.
I have a question taken from page 16 of IBM's Nested Relational Database White Paper. I'm confused why, in the CREATE command below, they use MV/MS/MS rather than MV/MV/MS, when both ORDER_# and PART_# are one-to-many relationships. I don't understand what value vs. sub-value means in non-1NF database design. I'd also like to know more about the ASSOC () clause.
Page 16 of IBM's Nested Relational Database White Paper (slight whitespace modifications):
CREATE TABLE NESTED_TABLE (
    CUST# CHAR (9) DISP ("Customer #"),
    CUST_NAME CHAR (40) DISP ("Customer Name"),
    ORDER_# NUMBER (6) DISP ("Order #") SM ("MV") ASSOC ("ORDERS"),
    PART_# NUMBER (6) DISP ("Part #") SM ("MS") ASSOC ("ORDERS"),
    QTY NUMBER (3) DISP ("Qty.") SM ("MS") ASSOC ("ORDERS")
);
The IBM nested relational databases implement nested tables as repeating attributes and repeating groups of attributes that are associated. The SM clauses specify that the attribute is either repeating (multivalued--"MV") or a repeating group (multi-subvalued--"MS"). The ASSOC clause associates the attributes within a nested table. If desired, the IBM nested relational databases can support several nested tables within a base table. The following standard SQL statement would be required to process the 1NF tables of Figure 5 to produce the report shown in Figure 6:
SELECT CUSTOMER_TABLE.CUST#, CUST_NAME, ORDER_TABLE.ORDER_#, PART_#, QTY
FROM CUSTOMER_TABLE, ORDER_TABLE, ORDER_CUST
WHERE CUSTOMER_TABLE.CUST_# = ORDER_CUST.CUST_#
  AND ORDER_CUST.ORDER_# = ORDER_TABLE.ORDER_#;
Nested Table
Customer #   Customer Name   Order #   Part #   Qty.
AA2340987    Zedco, Inc.     93-1123   037617   81
                                       053135   36
                             93-1154   063364   32
                                       087905   39
GV1203948    Alphabravo      93-2321   006776   72
                                       055622   81
                                       067587   29
MT1238979    Trisoar         93-2342   005449   33
                                       036893   52
                                       06525    29
                             93-4596   090643   33
I'll go ahead and answer my own question. While perusing IBM's UniVerse SQL Administration for DBAs I came across code for CREATE TABLE on page 55:
ACT_NO INTEGER FORMAT '5R' PRIMARY KEY
BADGE_NO INTEGER FORMAT '5R' PRIMARY KEY
ANIMAL_ID INTEGER FORMAT '5L' PRIMARY KEY
(See the distracting side note below.) This amused me at first, but essentially I believe this to be a column directive that works the same as a table-level directive like PRIMARY ( ACT_NO, BADGE_NO, ANIMAL_ID ).
Later, on page 5-19, I saw this:
ALTER TABLE LIVESTOCK.T ADD ASSOC VAC_ASSOC (
VAC_TYPE KEY, VAC_DATE, VAC_NEXT, VAC_CERT
);
Which leads me to believe that tacking ASSOC ("VAC_ASSOC") onto each column would be the same, like this:
CREATE TABLE LIVESTOCK.T (
    VAC_TYPE ... ASSOC ("VAC_ASSOC")
    VAC_DATE ... ASSOC ("VAC_ASSOC")
    VAC_NEXT ... ASSOC ("VAC_ASSOC")
    VAC_CERT ... ASSOC ("VAC_ASSOC")
);
Anyway, I'm not 100% sure I'm right, but I'm guessing the order doesn't matter, and that rather than these being an intransitive association they're just an order-insensitive grouping.
Onward! As for the second part of the question, pertaining to MS and MV, I for the life of me cannot figure out where the hell IBM got this syntax from. I believe it to be imaginary. I don't have access to a dev machine I can play on to test this out, but I can't find the term MV in either the old 10.1 or the new UniVerse 10.3 SQL Reference.
Side note for those not used to UniVerse: 5R and 5L mean 5 characters, right- or left-justified. That's right, a display feature built into the table metadata... Google UniVerse FORMAT (or FMT) for more info.
Just so you know, Attribute, Multivalue and Sub-Multivalue come from the way they structure their data.
Essentially, all data is stored in a tree of sorts.
UniVerse is a MultiValue database. Generally, it does not work the same way as relational SQL databases.
Each record can have multiple attributes.
Each attribute can have multiple multivalues.
Each multivalue can have multiple sub-multivalues.
So, if I have a record called FRED,
then FRED<1,2,3> refers to the 1st attribute, 2nd multivalue position and 3rd subvalue position.
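For instance, a small UniVerse BASIC sketch of that extraction syntax (file and record IDs are made up; the angle-bracket <attribute, value, subvalue> extraction is the part that matters):

OPEN 'CUSTOMER.FILE' TO CF ELSE STOP 'cannot open file'
READ FRED FROM CF, 'AA2340987' ELSE STOP 'no such record'
CRT FRED<1>       ;* whole 1st attribute
CRT FRED<1,2>     ;* 1st attribute, 2nd multivalue
CRT FRED<1,2,3>   ;* 1st attribute, 2nd multivalue, 3rd sub-value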
To read more about it, you need to learn more about how UniVerse works. The SQL section is just a side part of it. I suggest you read the other manuals to understand what you are working with.
EDIT
Essentially, the code above is telling you that:
There may be multiple orders per client. These are stored at the MV level in the 'table'.
There may be multiple parts per order. These are stored at the MS level in the 'table'.
There may be multiple qtys per order. These are stored at the MS level in the 'table'. Since they are at the same level, although they are 1-n with regard to orders, they are 1-1 with regard to parts.
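To make that concrete, here is roughly how the AA2340987 record from the nested table above hangs together at those levels, using ] for a value mark and \ for a sub-value mark (an illustration of the levels, not literal UniVerse output):

CUST#      AA2340987
CUST_NAME  Zedco, Inc.
ORDER_#    93-1123 ] 93-1154
PART_#     037617 \ 053135 ] 063364 \ 087905
QTY        81 \ 36 ] 32 \ 39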