AWS Neptune performance / Comparing to Neo4j AuraDB

We use Neo4j AuraDB for our graph database, but we have been having issues with data upload there, so we decided to move to AWS Neptune using the migration tool.
We have 3.7M nodes and 11.2M relationships in our database. The DB instance is db.r5.large with 2 CPUs and 16 GiB RAM.
The same queries run as openCypher on AWS Neptune are much slower than the Cypher queries on AuraDB (about 7-10 times slower). We also tried rewriting the queries in Gremlin and testing performance, but they are still very slow. We have node and lookup indexes on AuraDB, but we can't create them on AWS Neptune since it manages indexes automatically.
Is there any way to get better performance on AWS Neptune?
UPDATE:
Example of Gremlin query:
g.V().hasLabel('Member').has('address', eq('${address}')).
  outE('HAS').as('member_has').inV().as('token').hasLabel('Token').
  inE('HAS').as('other_member_has').outV().as('other_member').hasLabel('Member').
  where(__.select('member_has').where(neq('other_member_has'))).
  select('other_member', 'token').
  group().
    by(__.select('other_member').local(__.properties().group().by(__.key()).by(__.map(__.value())))).
    by(__.fold().project('member', 'number_of_tokens').
      by(__.unfold().select('other_member').choose(neq('cypher.null'), __.local(__.properties().group().by(__.key()).by(__.map(__.value()))))).
      by(__.unfold().select('token').count())).
  unfold().select(values).
  order().by(__.select('number_of_tokens'), desc).
  limit(20)
Example of Cypher query:
MATCH (member:Member { address: '${address}' })-[:HAS]->(token:Token)<-[:HAS]-(other_member:Member)
RETURN PROPERTIES(other_member) AS member, COUNT(token) AS number_of_tokens
ORDER BY number_of_tokens DESC
LIMIT 20

As discussed in the comments, as of this moment the openCypher support in Neptune is a preview, not quite at GA level. The more recent engine versions do include some significant improvements, but more are yet to be delivered. As to the Gremlin query, tools that convert Cypher to Gremlin tend to build quite complex queries. I think the Gremlin equivalent of the Cypher query is going to look something like this:
g.V().has('Member','address', address).as('m').
out('HAS').hasLabel('Token').as('t').
in('HAS').hasLabel('Member').as('om').
where(neq('m')).
group().
by(select('om')).
by(select('t').count()).
order(local).
by(values,desc).
limit(local, 20)
and if you want all of the properties, just add a valueMap, as in:
g.V().has('Member','address', address).as('m').
out('HAS').hasLabel('Token').as('t').
in('HAS').hasLabel('Member').as('om').
where(neq('m')).
group().
by(select('om').valueMap(true)).
by(select('t').count()).
order(local).
by(values,desc).
limit(local, 20)
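For what it's worth, here is a minimal sketch (not part of the original answer, and untested against Neptune) of how the first traversal above could be submitted from Python with the TinkerPop gremlinpython driver, binding the address value instead of interpolating it into the query string. The endpoint and address are placeholders:
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P, Order, Scope, Column

# Placeholder endpoint - replace with your Neptune cluster endpoint.
conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

address = '0x123...'  # hypothetical value, bound as a parameter rather than interpolated

# Top 20 other members sharing tokens with the given member, keyed by vertex.
result = (g.V().has('Member', 'address', address).as_('m').
            out('HAS').hasLabel('Token').as_('t').
            in_('HAS').hasLabel('Member').as_('om').
            where(P.neq('m')).
            group().
              by(__.select('om')).
              by(__.select('t').count()).
            order(Scope.local).by(Column.values, Order.desc).
            limit(Scope.local, 20).
            next())

conn.close()
The valueMap variant works the same way, swapping by(__.select('om')) for by(__.select('om').valueMap(True)).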

Related

Nested traversal gremlin query for Titan db

I am wondering how it is possible to write a Gremlin query that returns results in a nested format. Suppose there is a property graph as follows:
USER and PAGE vertices with some properties such as AGE for USER vertex;
FOLLOW edge between USER and PAGE;
I am looking for a single efficient query that gives all users with age greater than 20 and all of the pages followed by those users. I could do that with a simple loop on the application side, using a simple traversal query per iteration. Unfortunately, such a solution is not efficient for me, since it generates lots of queries and the network latency could be huge in this case.
Not sure what your definition of "efficient" is, but keep in mind that this is a typical OLAP use-case and you shouldn't expect fast OLTP realtime responses.
That said, the query should be as simple as:
g.V().has("USER", "AGE", gt(20)).as("user").
map(out("FOLLOW").fold()).as("pages").
select("user", "pages")
A small example using the modern sample graph:
gremlin> g = TinkerFactory.createModern().traversal().withComputer()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], graphcomputer]
gremlin> g.V().has("person", "age", gt(30)).as("user").
map(out("created").fold()).as("projects").
select("user","projects")
==>[user:v[6], projects:[v[3]]]
==>[user:v[4], projects:[v[5], v[3]]]
This is very easy:
g.V().hasLabel('USER').has('AGE', gt(20))
.match(__.as('user').out('FOLLOW').as('page'))
.select('user','page')
Just be aware that when you run this query, Gremlin can give you a null pointer exception for users who follow no pages; in application code you can check whether 'page' exists before reading it (a variant that avoids this is sketched below).
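If you also want users who follow no pages to show up in the result (with an empty page list) rather than being dropped, a project()/fold() variant should work; this is a sketch and not part of either original answer:
g.V().has('USER', 'AGE', gt(20)).
  project('user', 'pages').
    by().
    by(out('FOLLOW').fold())
Here by() with no argument keeps the user vertex itself, and fold() collapses the followed pages into a list, which is simply empty for users with no FOLLOW edges.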

Quickly adding edge counts to a document in ArangoDB

Not too complicated: I want to count the edges of each document and save the number in the document. I've come up with two queries that work; unfortunately since I have millions of edges both are quite slow. Is there a faster way to update documents with a property storing their number of edges? (just a count at a point in time)
AQL queries that are functional but slow:
FOR doc IN Documents
  LET inEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'inbound', maxDepth: 1}))
  LET outEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'outbound', maxDepth: 1}))
  UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
or:
FOR e IN Edges
  COLLECT docId = e._to WITH COUNT INTO counter
  UPDATE SPLIT(docId, '/')[1] WITH {inEdgeCount: counter} IN Documents
(and then repeat for outbound edges)
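For reference, the outbound counterpart of that second query would follow the same pattern (a sketch along the same lines, not taken from the original post; it assumes the edge _from IDs point into the Documents collection):
FOR e IN Edges
  COLLECT docId = e._from WITH COUNT INTO counter
  UPDATE SPLIT(docId, '/')[1] WITH {outEdgeCount: counter} IN Documents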
As an aside, is there any way to view either query speed (e.g. FOR executions per second) or percentage completion? I've been trying to judge speed by using LIMITed queries to start with, but the time required doesn't seem to scale linearly.
With ArangoDB 2.8 you can use graph pattern matching traversals to execute this with better performance:
FOR doc IN Documents
  LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
  LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
  UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
Currently ArangoDB doesn't have a way to monitor the progress of long-running tasks. With ArangoDB 3.0 we're going to introduce a new monitoring framework that allows better inspection of what's actually going on in the server. However, with 3.0 it won't be able to gather live statistics; we may see this further down the 3.x road later this year. Judging percentage completion may become possible for easy tasks like creating indexes, but for queries it's rather going to be the number of documents read/written so far.
We did similar queries for validating whether a graph obeys a power law

How to optimize graph traversals in ArangoDB?

I primarily intended to ask this question: "Is ArangoDB a true graph database?"
But that question would sound quite offensive.
You people at triAGENS did a really great job in creating a "multi-paradigm" database.
As a user of PostgreSQL, PostGIS, MongoDB and Neo4j/Titan, I really appreciate seeing an "all-in-one" solution :)
But the question remains: creating a graph in ArangoDB basically requires creating two separate collections, one for edges and one for vertices, so, as far as I understand, vertices and their related edges are already not "physically" neighbors.
Moreover, even after creating the appropriate indexes, I'm facing some serious performance issues when doing this kind of thing in Gremlin:
g.v('an_id').out('likes').in('likes').count()
Which returns a result after ~3 seconds (perceived time).
I assumed I had poorly understood how Gremlin and Blueprints/ArangoDB worked, so I tried to rewrite the same query using AQL:
LET lst = (FOR e1 in NEIGHBORS(vertices, edges, "an_id", "outbound", [ { "$label": "likes" } ] )
FOR e2 in NEIGHBORS(vertices, edges, e1.edge._to, "inbound", [ { "$label": "likes" } ] )
RETURN 1
)
RETURN length(lst)
Which gives me a delay of the same order of magnitude.
If I run the same query on a Titan or Neo4j database (with the very same data), it returns almost immediately (perceived time: <200 ms).
So it seems to me that ArangoDB's graph features are a "smart graph layer" on top of a "traditional document database", but that ArangoDB is not a "native" graph database.
To confirm this feeling, I transformed the data to load it into PostgreSQL and ran a query (with a multi-table JOIN, as you can imagine) and got execution delays similar to ArangoDB's.
Did I do something wrong (in the AQL query)?
Is there a way to optimize the database to get better traversal times?
In PostgreSQL, conceptually, I would mix edges and nodes and use a CLUSTER clause to physically order the data; can something similar be done in ArangoDB? (I assume it would be hard, as it would involve "interlacing" edges and nodes; just an intuition.)
I am a core developer of ArangoDB. Could you give me a bit more information on the dimensions of the data you are using?
Number of vertices
Number of edges
Then we can create our own setup with equal dimensions and optimize it.

Optimizing XQuery projection

I'm getting some horrific performance from an XQuery projection in SQL Server.
What would be the best way to write the following transformation?
select DocumentData.query(
'<object type="dynamic">
<state>
<OrderTotal type="decimal">
{fn:sum(
for $A in /object[1]/state[1]/OrderDetails[1]/object/state[1]
return ($A/ItemPrice[1] * $A/Quantity[1]))}
</OrderTotal>
<CustomerId type="guid">
{xs:string(/object[1]/state[1]/CustomerId[1])}
</CustomerId>
<Details type="collection">
{/object[1]/state[1]/OrderDetails[1]/object}
</Details>
</state>
</object>') as DocumentData
from documents
(I know the code is a bit out of context)
If I check the execution plan for this code, there are 10+ joins going on.
Should I break this down to use a for $var for each level in the structure?
For more context, this is what I'm trying to accomplish:
http://rogeralsing.com/2011/03/02/linq-to-sqlxml-projections/
I'm writing a "Linq to XQuery translator" / NoSQL Document DB emulator, filtering works like a charm, projections suffer from perf problems.
This article is quite useful:
Performance Optimizations for the XML Data Type in SQL Server 2005
In particular it recommends that instead of writing paths of the form...
/object[1]/state[1]/CustomerId[1]
you should instead write...
(/object/state/CustomerId)[1]
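Applied to the projection above, the CustomerId element, for example, would become (a sketch of that rewrite, not tested against the original schema):
<CustomerId type="guid">
{xs:string((/object/state/CustomerId)[1])}
</CustomerId>
The same rewrite applies to the other [1]-suffixed paths in the query.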

What's your experience developing on Google App Engine?

Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!
My experience with Google App Engine has been great, and the 1000-result limit has been removed. Here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's right: with addition of Cursors and the culmination of many smaller Datastore stability and performance improvements over the last few months, we're now confident enough to remove the maximum result limit altogether. Whether you're doing a fetch, iterating, or using a Cursor, there's no limits on the number of results.
The most glaring and frustrating issue is the datastore API, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000-row limit across all query result sets, and you can't access counts or offsets beyond that. I've run into weirder issues, like not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, App Engine is a really great tool to have at one's disposal, and I really enjoy working with it. It's perfect for deploying micro web services (e.g. JSON APIs) to use in other apps.
GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count(), etc.). Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
Is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()
A major downside when working with AppEngine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned though is the fact that there is a built-in sortable order, with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen, key=None, batchSize=100):
    """Iterator that yields an entity in batches.

    Args:
      queryGen: should return a Query object
      key: used to .filter() for __key__
      batchSize: how many entities to retrieve in one datastore call

    Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
    """
    from google.appengine.ext import db

    # AppEngine will not fetch more than 1000 results
    batchSize = min(batchSize, 1000)
    query = None
    done = False
    count = 0
    if key:
        key = db.Key(key)
    while not done:
        print count
        query = queryGen()
        if key:
            query.filter("__key__ > ", key)
        results = query.fetch(batchSize)
        for result in results:
            count += 1
            yield result
        if batchSize > len(results):
            done = True
        else:
            key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
    return MyModel.all()

myModels = deepFetch(allMyModel)
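Since deepFetch is a generator, you consume the results by iterating over it; a minimal usage sketch (MyModel is the hypothetical model from above):
for entity in myModels:
    # entities are pulled lazily from the datastore, batchSize at a time
    print entity.key()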
At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)
Google App Engine doesn't use an actual database, and apparently uses some sort of distributed hash map. This will lend itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So for example getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.
