Best way to mass-insert Edges into ArangoDB? - graph-databases

I'm writing a Python converter from Neo4j to ArangoDB and expect 10k+ nodes to be imported.
The converter for the nodes is fairly trivial, but the creator of that database uses a rather custom key scheme, so I can't export his keys from his Neo4j instance; I do, however, know the name of the PK field.
That gives me multiple approaches for creating the edges. Right now I look up the correct _key of the from/to nodes in the ArangoDB collection and insert a new edge (code below).
Theoretically I could write AQL statements that just insert these edges, but would that be more efficient?
Is there a better approach than my current one?
def getLinkN4jNodes(au, relationships, keyname, col, ecol):
    for relationship in relationships:
        startnode = relationship.start_node
        endnode = relationship.end_node
        sn_key = dict(startnode)[keyname]
        en_key = dict(endnode)[keyname]
        a_sn = au.getNodesFromDB(col, keyname, sn_key)  # f"FOR doc IN {col} FILTER doc.`{keyname}` == '{sn_key}' RETURN doc"
        a_en = au.getNodesFromDB(col, keyname, en_key)
        newedge = {
            "_key": a_sn["_key"] + '_' + a_en["_key"],
            "_from": a_sn["_key"],
            "_to": a_en["_key"]
        }
        ecol.insert(newedge)
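For reference, a batched variant might look roughly like this (an untested sketch: insert_many is python-arango's bulk insert, the per-node key lookups are unchanged, and the helper names are mine):

def getLinkN4jNodesBatched(au, relationships, keyname, col, ecol, batch_size=1000):
    # collect edges and write them in chunks instead of one insert per edge
    edges = []
    for relationship in relationships:
        sn_key = dict(relationship.start_node)[keyname]
        en_key = dict(relationship.end_node)[keyname]
        a_sn = au.getNodesFromDB(col, keyname, sn_key)
        a_en = au.getNodesFromDB(col, keyname, en_key)
        edges.append({
            "_key": a_sn["_key"] + '_' + a_en["_key"],
            "_from": a_sn["_key"],
            "_to": a_en["_key"]
        })
        if len(edges) >= batch_size:
            ecol.insert_many(edges)
            edges = []
    if edges:
        ecol.insert_many(edges)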

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood", but that doesn't say much about how I should use them together.) The syntax in those documents also looks sufficiently different from my TFX code that I'm not sure whether they're compatible.
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or to whatever else you need from there. It certainly helped us in ZenML with our BatchInferencePipeline.
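For illustration, that last step might look roughly like this (a minimal, untested sketch that assumes every feature in spec is a dense, fixed-length feature so the parsed values convert cleanly to numpy; the output path is a placeholder):

import pandas as pd

# collect the parsed examples and dump them to a CSV file
records = [dict(example) for example in dataset.as_numpy_iterator()]
pd.DataFrame(records).to_csv('/path/to/output.csv', index=False)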
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
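A call site then looks roughly like this (the artifact path below is a placeholder, not from my actual pipeline):

import os
import tensorflow as tf

# hypothetical BulkInferrer output location; substitute your pipeline's artifact uri
inference_uri = '/path/to/bulkinferrer/inference_result'
df = parse_prediction_logs(tf.io.gfile.glob(os.path.join(inference_uri, '*.gz')))
print(df.head())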

How to get data from an upstream node in maya?

I have a Maya node myNode, which creates a shapeNode; the shape node's inMesh attribute is connected to myNode.outMesh, and myNode has an attribute distance:
myNode.outMesh -> shapeNode.inMesh
myNode.distance = 10
Then I have a command which works on the shape node but requires the distance value, which it obtains by iterating over the inMesh connections:
MPlugArray meshConnections;
MPlug inMeshPlug = depNodeFn.findPlug("inMesh");
inMeshPlug.connectedTo(meshConnections, true, false); // in connections

MObject myNode;
bool node_found = false;
for(uint i = 0; i < meshConnections.length(); i++) {
    MPlug remotePlug = meshConnections[i];
    myNode = remotePlug.node();
    if(MFnDependencyNode(myNode).typeName() == "myNode") {
        node_found = true;
        break;
    }
}

MFnDependencyNode myDepNode(myNode);
MPlug distancePlug = myDepNode.findPlug("distance");
Now I run into a problem when another node (of another type) is applied to the shape, because the dependency graph then looks like this:
myNode.outMesh -> myOtherNode.inMesh
myOtherNode.outMesh -> shapeNode.inMesh
myNode.distance = 10
I tried removing the check for typeName() == "myNode", because I understood the documentation to say that the lookup recurses to the upstream node when the next node returns MStatus::kInvalidParameter for the unknown MPlug, but I cannot reach the distance plug without implementing further graph traversal.
What is the correct way to reliably find an attribute of an upstream node, even when other nodes have been added in between?
The command itself should use the distance plug either to connect to myNode or to some plug which gets the value recursively. If possible, I do not want to change myOtherNode to have a distance plug and corresponding connections for forwarding the data.
The usual Maya workflow would be to make the node operate in isolation -- it should not require any knowledge of the graph structure which surrounds it, it just reacts to changes in inputs and emits new data from its outputs. The node needs to work properly if a user manually unhooks the inputs and then manually reconnects them to other objects -- you can't know, for example, that some tool won't insert a deformer upstream of your node changing the graph layout that was there when the node was first created.
You also don't want to pass data around outside the dag graph -- if the data needs to be updated you'll want to pass it as a connection. Otherwise you won't be able to reproduce the scene from the graph alone. You want to make sure that the graph can only ever produce an unambiguous result.
When you do have to do DAG manipulations -- like setting up a network of connections -- put them into an MPxCommand or a mel/python script.
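For example, the wiring itself can live in a small Python snippet inside such a command or script, rather than inside the node (a sketch using the node names from the question):

import maya.cmds as cmds

# set up the connections and the attribute value from command/script code,
# not from inside the node itself
cmds.connectAttr("myNode.outMesh", "shapeNode.inMesh")
cmds.setAttr("myNode.distance", 10)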
I found the solution in an answer (Python code) to the question of how to get all nodes in the graph. My code to find the node in the MPxCommand now looks like this:
MPlugArray meshConnections;
MPlug inMeshPlug = depNodeFn.findPlug("inMesh");
MItDependencyGraph depGraphIt(inMeshPlug, MFn::kInvalid, MItDependencyGraph::Direction::kUpstream);

MObject myNode;
bool offset_mesh_node_found = false;
while(!depGraphIt.isDone()) {
    myNode = depGraphIt.currentItem();
    if(MFnDependencyNode(myNode).typeName() == "myNode") {
        offset_mesh_node_found = true;
        break;
    }
    depGraphIt.next();
}
The MItDependencyGraph iterator can traverse the graph in the upstream or downstream direction, starting either from an object or from a plug. Here I just search for the first instance of myNode, as I assume there will only be one in my use case. The command then connects the distance MPlug in the graph, which still works when more mesh transforms are inserted.
The MItDependencyGraph object allows filtering for node ids, but only for numeric node ids, not node names. I will probably add a filter later, once I have unique Maya ids assigned in all plugins.

Neo4j - Array inside an array inside a relationship

For a few days now, I've been designing a social network database structure, and I've been optimizing the data structures over and over again.
What I am trying to achieve in Neo4j:
I am trying to create a relationship between two nodes which has a property called "history" and one called "currentStatus". The problem is that both are (should be) arrays. Something like:
MATCH (u:User {username: 'john.snow#gmail.com'}), (uu:User {username: 'sansa.stark#gmail.com'})
MERGE u-[rel:FRIENDSHIP]->uu
ON CREATE SET rel.previousFriendshipUpdates = [], rel.currentFriendshipStatus = [sentTime: timestamp(), status: '0']
ON MATCH SET rel.previousFriendshipUpdates = [rel.previousFriendshipUpdates + rel.currentFriendshipStatus], rel.currentFriendshipStatus = [sentTime: timestamp(), status: '1']
I want to keep a history of whatever actions regarding their friendship take place (sender sent a friend request at time x, receiver rejected the friend request at time x, sender sent a friend request again at time x, receiver accepted at time x, receiver unfriended sender at time x, etc.).
Thank you in advance.
To add values to an array (collection) property arr on a relationship r you can do
SET r.arr = r.arr + 'newvalue'
or
SET r.arr = r.arr + ['onevalue', 'nothervalue']
(see How to push values to property array Cypher-Neo4j)
But arrays cannot contain values like sentTime: timestamp(). That looks like a property and an array can't have properties.
Nodes can have properties, however, and both the structure of your example query and the description of your model suggests that you represent the friendship as a node instead. Let each :Friendship node have [:MEMBER] relationships to two :User nodes. Then keep the friendship status as a property on that node. A good way to model relationship history could be to create a node for each update and keep these in a "linked list" that extends from the :Friendship node.
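A simplified sketch of that shape, written here with the official neo4j Python driver (the labels, property names, and connection details are just examples, and a true linked list would additionally chain the update nodes together):

from neo4j import GraphDatabase

# model the friendship as a node with :MEMBER relationships to both users,
# keep the current status on that node, and record every change as an update node
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

update_friendship = """
MATCH (a:User {username: $a}), (b:User {username: $b})
MERGE (a)-[:MEMBER]->(f:Friendship)<-[:MEMBER]-(b)
SET f.currentStatus = $status
CREATE (u:FriendshipUpdate {status: $status, sentTime: timestamp()})
CREATE (f)-[:HAS_UPDATE]->(u)
"""

with driver.session() as session:
    session.run(update_friendship,
                a="john.snow@gmail.com", b="sansa.stark@gmail.com", status="0")

driver.close()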

How can I bind the radius of my Geofire query in AngularFire?

I have a model in AngularJS which is bound to Firebase ($scope.items = $firebase(blah)), and I use ng-repeat to iterate through the items.
Every item in Firebase has a corresponding GeoFire location stored under the item's key.
How can I update my controller to only include items within a custom radius around the user? I don't want to filter by distance in Angular; I just want to ask Firebase to retrieve only the closer items (say, 0.3 km around a location). I looked into geoqueries, but they have a different purpose, and I don't know how to bind them to the model anyway. The user may change the radius and the items list should be updated accordingly, so they need to be bound somehow.
Any suggestion is welcome, but an example would be greatly appreciated as I don't have fluency in this trio of angular/firebase/geofire yet :P
It's difficult to figure out what you need to do without seeing your code. But in general you'll need to query a Firebase ref that contains the Geohash as either the name of the child or the priority.
A good example of such a data structure can be found here: https://publicdata-transit.firebaseio.com/_geofire/i
i
  9mgzcy8ewt:lametro:8637: true
  9mgzgvu3hf:lametro:11027: true
  9mgzuq55cc:lametro:11003: true
  9mue7smpb9:nctd:51117: true
  ...
l
  ...
  lametro:11027
    0: 33.737797
    1: -118.294708
  actransit:1006
  actransit:1011
  actransit:1012
  ...
The actual transit vehicles are under the l node. Each of them has an array containing the location of that vehicle as a latitude and longitude pair.
The i node is an index that maps each vehicle to a Geohash. You can see that the name of each node is built up as <geohash>:<metroarea>:<vehicleid>.
Since the Geohash is at the start of the name, we can filter on Geohash with a Query:
var ref = new Firebase("https://publicdata-transit.firebaseio.com/_geofire");
var query = ref.child('i').startAt(null, '9mgzgvu3ha').endAt(null, '9mgzgvu3hz');
query.once('child_added', function(snapshot) { console.log(snapshot.name()); });
With this query Firebase will give us all nodes whose name falls within the range. If all is well, this will output the name of one node:
9mgzgvu3hf:lametro:11027
Once you have that node, you can parse the name to extract the vehicleid and then look up the actual location of the vehicle under l.
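Stepping outside Firebase for a moment, the name format splits apart like this (Python used here just for the string handling):

# an index entry name encodes <geohash>:<metroarea>:<vehicleid>
name = "9mgzgvu3hf:lametro:11027"
geohash, metroarea, vehicle_id = name.split(":")
# the location then lives under l/<metroarea>:<vehicleid>, e.g. l/lametro:11027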
Calculating Geohashes based on a location and a range
In the snippet above, I hardcoded the geohash values to use. Normally you'll want to get all nodes in a certain range around a center. Instead of calculating these yourself, I recommend using the geohashQueries function from GeoFire for that:
var whitehouse = [38.8977, -77.0366];
var radiusInKm = 0.3;
var hashes = geohashQueries(whitehouse, radiusInKm * 1000);
console.log(JSON.stringify(hashes));
This outputs a number of Geohash ranges:
[["dqcjqch","dqcjqc~"],["dqcjr10","dqcjr1h"],["dqcjqbh","dqcjqb~"],["dqcjr00","dqcjr0h"]]
You can pass each of these Geohash ranges into a Firebase query:
hashes.forEach(function(hash) {
  var query = geoFireRef.child('i').startAt(null, hash[0]).endAt(null, hash[1]);
  query.once('child_added', function(snapshot) { log(snapshot.name()); });
});
I hope this helps you get things set up.
Here is a Fiddle that I created a while ago to experiment with this stuff: http://jsfiddle.net/aF9mN/.

Using a new keyspace in Titan Cassandra and persisting data

I have just started using Titan over Cassandra. I am new to Titan and also new to the graph database concept. I just followed the instructions on GitHub and the wiki.
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "cassandra");
conf.setProperty("storage.hostname", "127.0.0.1");
TitanGraph g = TitanFactory.open(conf);
This is how I opened the graph.
I understand that the default keyspace is titan. I created some nodes and relations in the default keyspace, indexed the vertices, queried the nodes, and was able to iterate through the results.
Now my questions -
1) How do I set a new keyspace?
I tried using the property
conf.setProperty("storage.keyspace ", "newkeyspace");
Unfortunately, when I checked the Cassandra keyspaces, I could only find titan. There was no keyspace named newkeyspace. What could be the reason?
2) How do I persist a graph that has been created?
for example,
g.createKeyIndex("name", Vertex.class);
Vertex juno = g.addVertex(null);
juno.setProperty("name", "juno");
juno.setProperty("type", "node");
Vertex juno1 = g.addVertex(null);
juno1.setProperty("name", "juno");
juno1.setProperty("type", "node1");
This is one sample graph. Once I issue a query of the form
Iterator<Vertex> junoIterator = g.getVertices("name", "juno").iterator();
while (junoIterator.hasNext()) {
    Vertex vertex = (Vertex) junoIterator.next();
    System.out.println(vertex.getProperty("type"));
}
My expected result is
node
node1
The same query should still work fine even after I comment out the following segment -
/*g.createKeyIndex("name", Vertex.class);
Vertex juno = g.addVertex(null);
juno.setProperty("name", "juno");
juno.setProperty("type", "node");
Vertex juno1 = g.addVertex(null);
juno1.setProperty("name", "juno");
juno1.setProperty("type", "node1");*/
My belief here is that the nodes, relations, and index have already been built and persisted in some datastore. Am I wrong about that?
Please advise.
I have figured out the issue. It was just a matter of committing the transaction.
A single line, g.commit();, solves it all.
To create a new Titan table (keyspace in Cassandra), use the property:
set("storage.cassandra.keyspace","TitanTest")
The whole command would look like:
graph = TitanFactory.build().
    set("storage.backend", "cassandra").
    set("storage.hostname", "127.0.0.1").
    set("index.search.backend", "elasticsearch").
    set("storage.cassandra.astyanax.cluster-name", "Test Cluster").
    set("storage.cassandra.keyspace", "TitanTest").
    set("cache.db-cache", "true").
    open();
