I'm evaluating the replacement of a relational database by a graph database. I'm trying to copy the data over to OrientDB (version 2.0) from the original database.
I create an embedded server via OServer server = OServerMain.create();, passing false to storage.useWAL.
Then I create a non-transactional graph:
OrientBaseGraph graph = new OrientGraphNoTx("plocal:"+db);
graph.declareIntent(new OIntentMassiveInsert());
graph.getRawGraph().declareIntent(new OIntentMassiveInsert());
graph.getRawGraph().setValidationEnabled(false);
I create vertex types for all my tables. I first create all vertices with their attributes in one call for each vertex:
OrientVertex node = graph.addVertex("class:"+lbl, properties);
I then create edges between these vertices. Some (but not all) of these edges have properties.
if (props!=null){
nfrom.addEdge(linkName, nto,null,null,props);
} else {
nfrom.addEdge(linkName, nto);
}
I've tried with and without edge classes and didn't notice any performance improvement.
All in all, I have 328822 vertices and 831293 edges. The total run time is at best around 25 minutes!! Most of that time (20 minutes at least) is spent inserting the edges, not the vertices.
On the same machine, reading the same data from the same relational database and writing it to Titan with the BerkeleyDB backend, I transfer the data in 2 minutes!
What makes OrientDB around 10 times slower than a competitor? What am I doing wrong?
Thanks!
When you have many edges, it's much better to use a transactional graph and commit every X items. Furthermore, disable the journal for the transaction. Example:
OrientBaseGraph graph = new OrientGraph("plocal:"+db);
try{
    graph.getRawGraph().getTransaction().setUsingLog(false);
    int saved = 0;
    while(){ // this is your loop
        ....
        saved++;
        if( saved % 5000 == 0 ){
            graph.commit();
            graph.getRawGraph().getTransaction().setUsingLog(false);
        }
    }
    graph.commit();
} finally {
    graph.close();
}
Not too complicated: I want to count the edges of each document and save the number in the document. I've come up with two queries that work; unfortunately since I have millions of edges both are quite slow. Is there a faster way to update documents with a property storing their number of edges? (just a count at a point in time)
AQL queries that are functional but slow:
FOR doc IN Documents
LET inEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'inbound', maxDepth: 1}))
LET outEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'outbound', maxDepth: 1}))
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
or:
FOR e IN Edges
COLLECT docId = e._to WITH COUNT INTO counter
UPDATE SPLIT(docId,'/')[1] WITH {inEdgeCount: counter} IN Documents
(and then repeat for outbound edges)
As an aside, is there any way to view either query speed (e.g. FOR executions per second) or percentage completion? I've been trying to judge speed by using LIMITed queries to start with, but the time required doesn't seem to scale linearly.
With ArangoDB 2.8 you can use graph pattern matching traversals to execute this with better performance:
FOR doc IN Documents
LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
Currently ArangoDB doesn't have a way to monitor the progress of long-running tasks. With ArangoDB 3.0 we're going to introduce a new monitoring framework that allows better inspection of what's actually going on in the server. However, 3.0 won't be able to gather live statistics; we may see this further down the 3.x road later this year. Judging percentage completion may become possible for simple tasks like creating indices, but for queries it is more likely to be the number of documents read/written so far.
We did similar queries for validating whether a graph obeys a power law
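The COLLECT-based query in the question boils down to a single pass over the edge collection that tallies how often each vertex id shows up as _from and as _to. Here is a minimal Java sketch of that idea done outside the database. It assumes the edges are already available as {fromId, toId} pairs; the class and method names are hypothetical and not part of any ArangoDB API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not an ArangoDB API): counts edges per vertex in one pass.
final class DegreeCounter {

    /**
     * Each edge is a two-element array {fromId, toId}.
     * Returns vertex id -> {inboundCount, outboundCount}.
     */
    static Map<String, int[]> countDegrees(List<String[]> edges) {
        Map<String, int[]> degrees = new HashMap<>();
        for (String[] edge : edges) {
            // The source vertex gains one outbound edge (index 1).
            degrees.computeIfAbsent(edge[0], k -> new int[2])[1]++;
            // The target vertex gains one inbound edge (index 0).
            degrees.computeIfAbsent(edge[1], k -> new int[2])[0]++;
        }
        return degrees;
    }
}
```

Vertices without any edge never appear in the resulting map, so they would need a default count of 0 when the numbers are written back.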
I am using Titan 0.4 + Cassandra.
My use case requires inserting multiple vertices at a time.
(The approximate batch size is 100 vertices at a time.)
e.g.:
v01 = g.addVertex(["UC":"B","i":2]); v02 = g.addVertex(["UC":"H","i":1])
v03 = g.addVertex(["LC":"a"]); v04 = g.addVertex(["LC":"a"]);
v05 = g.addVertex(["LC":"d"]); v06 = g.addVertex(["LC":"h"]);
v07 = g.addVertex(["LC":"i"]); v08 = g.addVertex(["LC":"p"]);
Is there any Gremlin command to add all eight vertices in a single request (something like g.addVertices())?
I'm using the C# SDK. What worked for me is just chaining the addV commands:
g.addV('item').property('id', '5aa3a51e-6434-4d53-aed4-5db3c90e3551').addV('item').property('id', '7f859920-2251-4553-8325-5dbb2f626d1c')
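For your example, note that in Titan 0.4 (TinkerPop 2) addVertex returns the new Vertex, so it cannot be chained the way addV can. A single script that creates all eight vertices achieves the same thing in one request:
[["UC":"B","i":2], ["UC":"H","i":1], ["LC":"a"], ["LC":"a"], ["LC":"d"], ["LC":"h"], ["LC":"i"], ["LC":"p"]].collect { g.addVertex(it) }
Hope this helps.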
Gremlin does not have an addVertices() wrapper - you'll need to call addVertex() multiple times.
I had the requirement to add several vertices at the same time too. Individual addV queries weren't practical for inserting thousands of records at a time while also retrieving their database-generated ids.
Here's what I came up with as a batch insertion command/query:
g.addV('One').values('id').as('one').addV('Two').values('id').as('two').select('one', 'two')
CosmosDB returns:
[{
"one": "372be552-7f63-4d7b-be81-a73d5d677afa",
"two": "a60d3773-5c29-454e-b079-dec734c4f431"
}]
This may be a trivial question, but I was just hoping to get some practical experience from people who may know more about this than I do.
I wanted to generate a database in GAE from a very large series of XML files -- as a form of validation, I am calculating statistics on the GAE datastore, and I know there should be ~16,000 entities, but when I perform a count, I'm getting more on the order of 12,000.
The way I'm doing counting is basically I perform a filter, fetch a page of 1000 entities, and then spin up task queues for each entity (using its key). Each task queue then adds "1" to a counter that I'm storing.
I think I may have juiced the datastore writes too much; I set the rate of my task queues to 50/s. I did get some write errors, but not nearly enough to justify the 4,000 difference. Could it be that I was rushing the counting calls so much that it led to inconsistency? Would slowing the rate at which I process task queues to something like 5/s solve the problem? Thanks.
You can count your entities very easily (no tasks and almost for free):
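import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
int total = 0;
// Keys-only queries are cheaper than fetching full entities.
Query q = new Query("entity_kind").setKeysOnly();
// set your filter on this query
QueryResultList<Entity> results;
Cursor cursor = null;
FetchOptions queryOptions = FetchOptions.Builder.withLimit(1000).chunkSize(1000);
do {
    if (cursor != null) {
        // Resume where the previous page ended.
        queryOptions.startCursor(cursor);
    }
    results = datastore.prepare(q).asQueryResultList(queryOptions);
    total += results.size();
    cursor = results.getCursor();
} while (results.size() == 1000); // a short page means we reached the end
System.out.println("Total entities: " + total);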
UPDATE:
If looping like I suggested takes too long, you can spin a task for every 100/500/1000 entities - it's definitely more efficient than creating a task for each entity. Even very complex calculations should take milliseconds in Java if done right.
For example, each task can retrieve a batch of entities, spin a new task (and pass a query cursor to this new task), and then proceed with your calculations.
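To make the chaining idea concrete, here is a rough, self-contained Java sketch. It does not use the App Engine API: the in-memory queue stands in for a real task queue, and an integer offset stands in for a query cursor. Both are assumptions for illustration only, as are the class and method names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of "one task per batch, each task enqueues its successor".
final class ChainedBatchTasks {

    /** Counts all items by running one simulated task per batch of batchSize items. */
    static int countInChainedTasks(List<String> items, int batchSize) {
        // Stand-in for a task queue; each payload is the cursor (offset) to resume from.
        Deque<Integer> taskQueue = new ArrayDeque<>();
        taskQueue.add(0); // the first task starts at the beginning
        int total = 0;
        while (!taskQueue.isEmpty()) {
            int cursor = taskQueue.poll();
            int end = Math.min(cursor + batchSize, items.size());
            if (end < items.size()) {
                // Spin up the next task first, passing it the cursor...
                taskQueue.add(end);
            }
            // ...then proceed with this batch's own work (here: just counting).
            total += end - cursor;
        }
        return total;
    }
}
```

With 16,000 entities and a batch size of 1000 this runs 16 tasks instead of 16,000, which is the main point of batching.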
I originally intended to ask this question: "Is ArangoDB a true graph database?"
But that question would sound rather offensive.
You folks at triAGENS did a really great job in creating a "multi-paradigm" database.
As a user of PostgreSQL, PostGIS, MongoDB and Neo4J/Titan, I really appreciate seeing an "all-in-one" solution :)
But the question remains: creating a graph in ArangoDB requires two separate collections, one for edges and one for vertices. As far as I understand, that already means that vertices and their related edges are not "physical" neighbors.
Moreover, even after creating the appropriate indexes, I'm facing some serious performance issues when doing this kind of thing in Gremlin:
g.v('an_id').out('likes').in('likes').count()
This returns a result after ~3 seconds (perceived time).
I assumed I had misunderstood how Gremlin and Blueprints/ArangoDB work, so I tried to rewrite the same query in AQL:
LET lst = (FOR e1 in NEIGHBORS(vertices, edges, "an_id", "outbound", [ { "$label": "likes" } ] )
FOR e2 in NEIGHBORS(vertices, edges, e1.edge._to, "inbound", [ { "$label": "likes" } ] )
RETURN 1
)
RETURN length(lst)
This gives me a delay of the same order of magnitude.
If I run the same query on a Titan or Neo4j database (with the very same data), it returns almost immediately (perceived time: <200ms).
So it seems to me that ArangoDB's graph features are a "smart graph layer" on top of a "traditional document database", and that ArangoDB is not a "native" graph database.
To confirm this feeling, I transformed the data, loaded it into PostgreSQL and ran an equivalent query (with a multiple-table JOIN, as you can imagine), and got execution times similar to ArangoDB's.
Did I do something wrong in the AQL query?
Is there a way to optimize the database to get better traversal times?
In PostgreSQL, conceptually, I would mix edges and nodes and use a CLUSTER clause to physically order the data. Can something similar be done in ArangoDB? (I assume it would be hard, as it would involve "interlacing" edges and nodes; just an intuition.)
I am a core developer of ArangoDB. Could you give me a bit more information on the dimensions of the data you are using?
Number of vertices
Number of edges
Then we can create our own setup with equal dimensions and optimize it.
Background:
I need to store the following data in a database:
osm nodes with tags;
osm edges with weights (that is an edge between two nodes extracted from 'way' from an .osm file).
Nodes that form edges within the same 'way' should carry the same tags as that way, i.e. every node in the node set of a 'way' that is a highway should have a 'highway' tag.
I need this structure to easily generate a graph based on various filters, e.g. a graph consisting only of nodes and edges which are highways, or a 'foot paths' graph, etc.
Problem:
I had not heard of spatial indexes before, so I just parsed an .osm file into a MySQL database:
all nodes to a 'nodes' table (with respective coordinate columns) - OK, about 9,000,000 rows in my case:
INSERT INTO nodes VALUES (node_id, lat, lon);  -- pseudocode
all ways to an 'edges' table (usually one way creates a few edges) - OK, about 9,000,000 rows as well:
INSERT INTO edges VALUES (edge_id, from_node_id, to_node_id);  -- pseudocode
add tags to nodes, calculate weights for edges - Problem:
Here is the problematic php script:
$query = mysql_query('SELECT * FROM edges');
$i = 0;
while ($res = mysql_fetch_object($query)) {
    $i++;
    echo "$i\n";
    // Look up both end nodes of the edge, one query each.
    $node1 = mysql_query('SELECT * FROM nodes WHERE id='.$res->from);
    $node1 = mysql_fetch_object($node1);
    $tag1 = $node1->tags;
    $node2 = mysql_query('SELECT * FROM nodes WHERE id='.$res->to);
    $node2 = mysql_fetch_object($node2);
    $tag2 = $node2->tags;
    // Append the edge's tags to the tags of each end node.
    mysql_query('UPDATE nodes SET tags="'.$tag1.$res->tags.'" WHERE nodes.id='.$res->from);
    mysql_query('UPDATE nodes SET tags="'.$tag2.$res->tags.'" WHERE nodes.id='.$res->to);
}
Nohup shows the output of 'echo "$i\n"' every 55-60 seconds (which means it could take more than 17 years to finish if the 'edges' table has more than 9,000,000 rows, as in my case).
Htop shows a /usr/bin/mysqld process which takes 40-60% of CPU.
The same problem exists for the script which tries to calculate the weight (the distance) of an edge (select all edges, take an edge, then select two nodes of this edge from 'nodes' table, then calculate the distance, then update the edges table).
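For reference, the per-edge weight described above could be computed along these lines. This is only a sketch: it assumes the weight is the great-circle (haversine) distance between the two end nodes and uses a mean Earth radius of 6,371 km, neither of which is stated in the question. The class and method names are made up for illustration.

```java
// Hypothetical helper: edge weight as the great-circle distance between two nodes.
final class EdgeWeight {

    // Mean Earth radius in metres (an approximation, assumed here).
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Haversine distance in metres between two points given in decimal degrees. */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```

Since the function only needs the four coordinates, the same formula can also be written directly in SQL, so the weights can be computed by one UPDATE with a join instead of two lookups per edge.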
Question:
How can I make these SQL updates faster? Should I tweak any of the MySQL config settings? Or should I use PostgreSQL with the PostGIS extension? Should I use another structure for my data? Or should I somehow utilize a spatial index?
If I understand you right, there are two things to discuss.
First, your idea of putting the highway tag on the start and stop nodes. A node can have more than one edge connected to it, so where do you put the tag from the second edge? Or the third or fourth, if it is a crossing? The reason the highway tag is put in the edges table in the first place is that, from a relational point of view, that is where it belongs.
Second, fetching the whole table and processing it outside the database is not the right way. Taking care of this whole process is exactly what a relational database is really good at.
I have not worked with MySQL, and I fully agree that you will probably have a lot more fun if you migrate to PostGIS, since from what I have heard PostGIS has much better spatial capabilities (even if you don't need any spatial capabilities for this particular task).
So if we ignore the first problem and, just to show the concept, say that only two edges are connected to each node and that each node has two tag fields, tag1 and tag2, then it could look something like this in PostGIS:
UPDATE nodes set tag1=edges.tags from edges where nodes.id=edges.from;
UPDATE nodes set tag2=edges.tags from edges where nodes.id=edges.to;
If you disable the indexes that should be very fast.
Again, if I have understood you right.
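The first point (a node may sit on several edges, so a fixed number of tag fields is not enough) can be illustrated with a small Java sketch that collects a set of tags per node in one pass over the edges. It assumes each edge carries a single tag string; the Edge class and all names here are hypothetical, with from/to/tag loosely following the fields used in the PHP script.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: gathers every tag of every edge touching a node.
final class NodeTagCollector {

    /** Minimal edge representation, assumed for this sketch only. */
    static final class Edge {
        final long from;
        final long to;
        final String tag;

        Edge(long from, long to, String tag) {
            this.from = from;
            this.to = to;
            this.tag = tag;
        }
    }

    /** Returns node id -> sorted set of tags from all edges that start or end at the node. */
    static Map<Long, Set<String>> collectTags(List<Edge> edges) {
        Map<Long, Set<String>> tagsByNode = new HashMap<>();
        for (Edge e : edges) {
            tagsByNode.computeIfAbsent(e.from, k -> new TreeSet<>()).add(e.tag);
            tagsByNode.computeIfAbsent(e.to, k -> new TreeSet<>()).add(e.tag);
        }
        return tagsByNode;
    }
}
```

A crossing node ends up with as many distinct tags as it has differently tagged edges, which is why a separate node-to-tag relation tends to fit better than tag1/tag2 columns.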
PostgreSQL
OpenStreetMap itself uses PostgreSQL, so I guess that's recommended.
See: http://wiki.openstreetmap.org/wiki/PostgreSQL
You can see OSM's database schema at: http://wiki.openstreetmap.org/wiki/Database_Schema
So you can use the same fields, fieldtypes and indexes that OSM uses for maximum compatibility.
MySQL
If you want to import .osm files into a MySQL database, have a look at:
http://wiki.openstreetmap.org/wiki/OsmDB.pm
Here you will find Perl code that will create MySQL tables, parse an OSM file and import it into your MySQL database.
Making it faster
If you are updating in bulk, you don't need to update the indexes after every update.
You can just disable the indexes, do all your updates and re-enable the index.
I'm guessing that should be a whole lot faster.
Good luck