I am new to gremlin and trying out a query but as I observe store value is always ignored
I have tried store, aggregate and as as well but all are giving me false values.
g.V().has('__typeName','avro_schema').where(local(out('__avro_record.fields').as('xm').local(out('classifiedAs').has('__typeName', 'DataClassification').count().is(eq(1))).count().is(eq(select('xm').size()))))
this gives size of 'xm' as always 0
I expect the value of size 'xm' to be equal to the number of outgoing edges from each avro_schema with the edge label as '__avro_record.fields'
As pointed,changed the query to :
g.V().has('__typeName','avro_schema').where(local(out('__avro_record.fields').local(out('classifiedAs').has('__typeName', 'DataClassification').count().is(eq(1))).count().is(count(local))))
Now getting empty results.
Edit :
Also I have doubts related to printing of dynamic values as sideEffect
g.V().has('__typeName','avro_schema').where(local(out('__avro_record.fields').local(out('classifiedAs').has('__typeName', 'DataClassification').count().is(eq(1))).count().sideEffect{ println count(local) }.is(count(local))))
Output:
[CountLocalStep]
Where as I expect the actual value of count(local).What is the best practice to debug gremlin query?
There are a few things that won't work in your traversal. First, .size() is not a Gremlin step, you're probably looking for .count(local). Next, eq() does not take dynamic values, it only works with constant values. Read the docs for where() step to learn how to compare against dynamic values.
UPDATE
To compare the two count() values you would do something like this:
g.V().has('__typeName','avro_schema').filter(
out('__avro_record.fields').fold().as('x').
map(unfold().
out('classifiedAs').has('__typeName', 'DataClassification').fold()).as('y').
where('x', eq('y')).
by(count(local)))
Related
I'm attempting to find all values that match any item within a list of values within cypher. Similar to a SQL query with in and not in. I also want to find all values that are not in the list in a different query. The idea is I want to assign a property to each node that is binary and indicates whether the name of the node is within the predefined list.
I've tried the following code blocks:
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE NOT temp2.Name IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp
This block returns nothing, but should return a rather large amount of data.
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE temp2.Name NOT IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp
This code block returns an error in relation to the NOT's position. Does anyone know the correct syntax for this statement? I've looked around online and in the neo4j documentation, but there are a lot of conflicting ideas with version changes. Thanks in advance!
Neo4j is case sensitive so you need to check the data to ensure that the EMAIL_DOMAIN.Name is all upper case. If Name is mixed case, you can convert the name using toUpper(). If Name is all lower case, then you need to convert the values in your query.
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE NOT toUpper(temp2.Name) IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp
I have a set of test results in my mongodb database. Each document in the database contains version information, test data, date, test run information etc...
The version is broken up in the document and stored as individual values. For example: { VER_MAJOR : "0", VER_MINOR : "2", VER_REVISION : "3", VER_PATCH : "20}
My application wants the ability to specify a specific version and grab the document as well as the previous N documents based on the version.
For example:
If version = 0.2.3.20 and n = 5 then the result would return documents with version 0.2.3.20, 0.2.3.19, 0.2.3.18, 0.2.3.17, 0.2.3.16, 0.2.3.15
The solutions that come to my mind is:
Create a new database that contains documents with version information and is sorted. Which can be used to obtain the previous N version's which can be used to obtain the corresponding N documents in the test results database.
Perform the sorting in the test results database itself like in number 1. Though if the test results database is large, this will take a very long time. Also consider inserting in order every time.
Creating another database like in option 1 doesn't seem like the right way. But sorting the test results database seems like there will be lots of overhead, am I mistaken that I should be worried about option 2 producing lots of overhead? I have the impression I'd have to query the entire database then sort it on application side. Querying the entire database seems like overkill...
db.collection_name.find().sort([Paramaters for sorting])
You are quite correct that querying and sorting the entire data set would be very excessive. I probably went overboard on this, but I tried to break everything down in detail below.
Terminology
First thing first, a couple terminology nitpicks. I think you're using the term Database when you mean to use the word Collection. Differentiating between these two concepts will help with navigating documentation and allow for a better understanding of MongoDB.
Collections and Sorting
Second, it is important to understand that documents in a Collection have no inherent ordering. The order in which documents are returned to your app is only applied when retrieving documents from the Collection, such as when specifying .sort() on a query. This means we won't need to copy all of the documents to some other collection; we just need to query the data so that only the desired data is returned in the order we want.
Query
Now to the fun part. The query will look like the following:
db.test_results.find({
"VER_MAJOR" : "0",
"VER_MINOR" : "2",
"VER_REVISION" : "3",
"VER_PATCH" : { "$lte" : 20 }
}).sort({
"VER_PATCH" : -1
}).limit(N)
Our query has a direct match on the three leading version fields to limit results to only those values, i.e. the specific version "0.2.3". A range $lte filter is applied on VER_PATCH since we will want more than a single patch revision.
We then sort results by VER_PATCH to return results descending by the patch version. Finally, the limit operator is used to restrict the number of documents being returned.
Index
We're not done yet! Remember how you said that querying the entire collection and sorting it on the app side felt like overkill? Well, the database would doing exactly that if an index did not exist for this query.
You should follow the equality-sort-match rule when determining the order of fields in an index. In this case, this would give us the index:
{ "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
Creating this index will allow the query to complete by scanning only the results it would return, while avoiding an in-memory sort. More information can be found here.
I am learning Cypher on Neo4j and I am having trouble in understanding how to perform an efficient 'join' equivalent in Cypher.
I am using the standard Matrix character example and I have added some nodes to the mix called 'Gun' with a relation of ':GIVEN_TO'. You can see the console with my query result here:
http://console.neo4j.org/r/rog2hv
The query I am using is:
MATCH (Neo:Crew { name: 'Neo' })-[:KNOWS*..]->(other:Crew),(other)<-[:GIVEN_TO]-(g:Gun),(Neo)<-[:GIVEN_TO]-(g2:Gun)
RETURN count(g2);
I have given Neo 4 guns, but when I perform the above I get a count of '12'. This seems to be the case because there are 3 'others' and 3*4 = 12. So I get some exponential result.
What should my query look like to get the correct count ('4') from the example?
Edit:
The reason I am not querying through Guns directly as suggested by #ceej is because in my real use case I have to do this traversal as described above. Adding DISTINCT does not do anything for my result.
The reason you get 12 guns instead of 4 is because your query produces a cartesian product. This is because you have asked for items in the same match statement without joining them. #ceej rightly pointed out if you want to find Neo's guns you would do as he suggested in his first query.
If you wanted to get a list of the crew members and their guns then you could do something like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN crew.name, collect(g.name)
Which finds all of the crew members with guns and returns their name and the guns that they were given.
If you wanted to invert it and get a list of the guns and the respective crew members they were give to you could do the following...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN g.name, collect(crew.name)
If you wanted to find all of the crew that knew Neo multiple levels deep that were given a gun you could write the query like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
WITH crew, g
MATCH (neo:Crew {name: 'Neo'})-[:KNOWS*0..]->(crew)
RETURN crew.name, collect(g.name)
That finds all the crew that were given guns and then determines which of them have a :KNOWS path to Neo.
Forgive me, but I am am unclear why you have the initial MATCH in your query. From your explanation it would appear that you are trying to get the number of :Gun nodes linked to Neo by the :GIVEN_TO relationship. In which case all you need is the latter part of your query. Which would give you something like
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count(g)
Furthermore, to make sure that you are only counting distinct :Gun nodes you can add DISTINCT to the RETURN statement.
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count( DISTINCT g )
This is possibly unnecessary in your case but can be helpful when the pattern that you are matching on can arrive at the same node by different traversals.
Have I misunderstood your requirement?
After working with neo4j and now coming to the point of considering to make my own entity manager (object manager) to work with the fetched data in the application, i wonder about neo4j's output format.
When i run a query it's always returned as tabular data. Why is this??
Sure tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when i want to create an object graph in my application i would have to hydrate all the objects and this is not really good for performance and doesn't leverage true graph performace.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while i only need it once and i know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
a load2neo is nice, a load-from-neo would also be nice! either in the geoff format or any other formats out there https://gephi.org/users/supported-graph-formats/
Each language could then implement it's own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve it's own post), what's a good way to model relations in an object graph? As objects? or as data/method inside the node objects?
#Kikohs
Q: What do you mean by "Each language could then implement it's own functions to create the objects directly."?
A: With an (partial) graph provided by the database (as result of a query) a language as PHP could provide a factory method (in C preferably) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are prefered to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: no i want the result of the query in another format then tabular (examples mentioned)
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) The node ID, it's labels, it's properties and B) the relation to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, i like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
#Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It reasonable to say that an alternative output format to a table would only be complementary to the table format and not replacing it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). I that case i understand that it's now left to an alternative program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pro's - not having to do this in every implementation language (of the application)
Con's - 1) no application specific mapping is possible, 2) no performance gain if implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely, are you saying that these formats are no able to hold deduplicated results (in certain cases)?? So infact that there is no possible textual format with which a graph can be described without duplication??
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when to export subgraph-data from the database as a query result.
The problem of doing that as a de-duplicated graph is that the db has to fetch all the result data data in memory first to deduplicate, extract the needed relationships etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
There is also the questions on what you want to include in your output? Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships).
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be a starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performances you should drop cypher at all, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as a administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
In second place, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r
I executed some query like "Address:Jack*". It show numFound = 5214 and display 100 documents in results page(I changed default display results from 10 to 100).
How can I get all documents.
I remember myself doing &rows=2147483647
2,147,483,647 is integer's maximum value. I recall using a number bigger than that once and having a NumberFormatException because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
Small note:
Be careful if you are planning to do this in production. If you do a query like * : * and your index is big, you could transferring a couple of gigabytes in that query.
If you know you won't have many docs, go ahead and use integer's max value.
On the other hand, if you are doing a one-time script and just need to dump all results (for example document ID's) then this approach is valid, if you don't mind waiting 3-5 minutes for a query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE(2147483647) as value of rows in production. This will heavily slow down your query even if you have a small resultset, because solr preallocates a queue in this size. see https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest to use Exporting Result Sets
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Or I suggest to use Deep Paging.
Simple Pagination is a easy thing when you have few documents to read and all you have to do is play with start and rows parameters. But this is not a feasible way when you have many documents, I mean hundreds of thousands or even millions.
This is the kind of thing that could bring your Solr server to their knees.
For typical applications displaying search results to a human user,
this tends to not be much of an issue since most users don’t care
about drilling down past the first handful of pages of search results
— but for automated systems that want to crunch data about all of the
documents matching a query, it can be seriously prohibitive.
This means that if you have a website and are paging search results, a real user do not go so further but consider on the other hand what can happen if a spider or a scraper try to read all the website pages.
Now we are talking of Deep Paging.
I’ll suggest to read this amazing post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this document page:
https://solr.apache.org/guide/pagination-of-results.html
And here is an example that try to explain how to paginate using the cursors.
SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // Pay attention to this line
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrClient.query(solrQuery);
String nextCursorMark = rsp.getNextCursorMark();
for (SolrDocument d : rsp.getResults()) {
...
}
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
Returning all the results is never a good option as It would be very slow in performance.
Can you mention your use case ?
Also, Solr rows parameter helps you to tune the number of the results to be returned.
However, I don't think there is a way to tune rows to return all results. It doesn't take a -1 as value.
So you would need to set a high value for all the results to be returned.
What you should do is to first create a SolrQuery shown below and set the number of documents you want to fetch in a batch.
int lastResult=0; //this is for processing the future batch
String query = "id:[ lastResult TO *]"; // just considering id for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); //setRows will set the required batch, you can change this to whatever size you want.
SolrDocumentList results = solrClient.query(solrQuery).getResults(); //execute this statement
Here I am considering an example of search by id, you can replace it with any of your parameter to search upon.
The "lastResult" is the variable you can change after execution of the first 500 records(500 is the batch size) and set it to the last id got from the results.
This will help you execute the next batch starting with last result from previous batch.
Hope this helps. Shoot up a comment below if you need any clarification.
For selecting all documents in dismax/edismax via Solarium php client, the normal query syntax : does not work. To select all documents set the default query value in solarium query to empty string. This is required as the default query in Solarium is :. Also set the alternative query to :. Dismax/eDismax normal query syntax does not support :, but the alternative query syntax does.
For more details following book can be referred
http://www.packtpub.com/apache-solr-php-integration/book
As the other answers pointed out, you can configure the rows to be max integer to yield back all the results for a query.
I would recommend though to use Solr feature of pagination, and build a function that will return for you all the results using the cursorMark API. The gist of it is you set the cursorMark parameter to '*', you set the page size(rows parameter), and on each result you'll get a cursorMark for the next page, so you execute the same query only with the cursorMark given from the last result. This way you'll have more flexibility on how much of the results you want back, in a much more performant way.
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
if (response.getResults().getNumFound() > 50) {
solrQuery.setRows(response.getResults().getNumFound());
response = solrResponse(query);
}
It makes a call twice to Solr, but gets you all matching records....with the small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!