STRING database Neo4j import

I have a network from the STRING database that I'm working on.
I exported the network as a TSV so that I could import it into Neo4j, converted it to CSV, and ran the import.
Yet I can only see nodes, with no relationships between them.
How can I make Neo4j aware of the relationships and represent them?
Here is the STRING network: http://string-db.org/cgi/network.pl?taskId=sPvqsEhi3Tk6&sessionId=MKQKZSH3dCb3&bottom_page_content=table
Here is the tabular form: https://i.stack.imgur.com/IMMFk.png
And here is the Neo4j graph: https://i.stack.imgur.com/eKeYv.png

I think the following will do the trick:
// The variable in ASSERT must match the one bound in the pattern (n, not a).
CREATE CONSTRAINT ON (n:Node) ASSERT n.NodeID IS UNIQUE;

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///strings.csv' AS line
// Create each protein node once, then link the pair with the combined score.
MERGE (n1:Node {NodeID: line.node1_string_internal_id, NodeName: line.node1})
MERGE (n2:Node {NodeID: line.node2_string_internal_id, NodeName: line.node2})
MERGE (n1)-[:INTERACTS {Score: toFloat(line.combined_score)}]->(n2);
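If the import completes but the browser still shows only nodes, it is worth checking the relationship count directly. A minimal sketch using the official Neo4j Python driver (the bolt URL and credentials below are placeholders for your own setup, not something from the question):

from neo4j import GraphDatabase

# Placeholder connection details -- replace with your own server settings.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

with driver.session() as session:
    # Count the INTERACTS relationships created by the LOAD CSV above.
    record = session.run('MATCH ()-[r:INTERACTS]->() RETURN count(r) AS c').single()
    print('INTERACTS relationships:', record['c'])

driver.close()

If the count is zero, the LOAD CSV most likely matched no rows, which usually points at the file path or the header names.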
I hope this helps.
Regards,
Tom

Related

pyarrow parquet - encoding array into list of records

I am creating Parquet files using Pandas and pyarrow and then reading the schema of those files using Java (org.apache.parquet.avro.AvroParquetReader).
I found out that Parquet files created using pandas + pyarrow always encode arrays of primitive types as an array of records with a single field.
I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array&lt;string&gt; converts to a different datatype when loading into BigQuery
Here is the Python script that creates the Parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'organizationId': ['org1', 'org2', 'org3'],
        'entityType': ['customer', 'customer', 'customer'],
        'entityId': ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts': [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
    }
)

table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
When I try to read the Avro schema of that Parquet file, I see the following schema for the 'customerProducts' field:
{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}
but I would expect something like this:
{"type":"array","items":["null","string"]}
Does anyone know if there is a way to make sure that the created Parquet files use the simplest possible schema for arrays of primitive types?
Thanks
As far as I know, the Parquet data model follows the Dremel data model, which allows a column to have one of three repetition types:
required
optional
repeated
In order to represent a list, a nested type is needed to add an additional level of indirection, so that an empty list can be distinguished from a list containing only null values.
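To see that extra level directly, you can print the file's Parquet schema; a small sketch with pyarrow, assuming the output.parquet written by the script above:

import pyarrow.parquet as pq

# The Arrow-level schema shows the logical list type for customerProducts...
print(pq.read_schema('output.parquet'))

# ...while the low-level Parquet schema exposes the intermediate repeated
# group that the Avro reader surfaces as a record with a single field.
print(pq.ParquetFile('output.parquet').schema)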

How to Create a Relationship between two Different Columns in Neo4j

I am trying to create a relationship between two columns in Neo4j. My dataset is a CSV file whose two columns refer to co-authorship, and I want to construct a network from it. I have already loaded the data, returned it, and matched it.
Loading:
load csv from 'file:///conet1.csv' as rec
Creating the nodes:
create (:Guys {source: rec[0], target: rec[1]})
Now I need to construct the collaboration network by making a relationship between the source and target columns. What do you propose for this purpose?
I was able to make a relationship between the mentioned columns with the NetworkX graph library in Python like this:
import pandas as pd
import networkx as nx

# The file is a CSV, so read_csv (not read_excel) is the right reader;
# keep the weight column since it is passed to from_pandas_edgelist below.
df = pd.read_csv('Colab.csv', usecols=['source', 'target', 'weight'])
g = nx.from_pandas_edgelist(df, 'source', 'target', 'weight')
If I understand your use case, I do not believe you should be creating Guys nodes just to store relationship info. Instead, the graph-oriented approach would be to create an Author node for each author and a relationship (say, of type COLLABORATED_WITH) between the co-authors.
This might work for you, or at least give you a clue:
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
CREATE (source)-[:COLLABORATED_WITH]->(target)
If it is possible that the same relationship could be re-created, you should replace the CREATE with a more expensive MERGE. Also, a work can have any number of co-authors, so having a relationship between every pair may be sub-optimal depending on what you are trying to do; but that is a separate issue.
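For reference, a sketch of that MERGE variant run from Python with the official Neo4j driver (the connection details are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

# MERGE on the relationship keeps a single COLLABORATED_WITH edge per pair
# even if the same co-author pair appears on several CSV rows.
query = '''
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
MERGE (source)-[:COLLABORATED_WITH]->(target)
'''

with driver.session() as session:
    session.run(query)
driver.close()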

Creating a Neo4j Graph Database Using LOAD CSV

I have a CSV file containing the data that I want to convert into a graph database using Neo4j. The columns in the file have the following format:
Person1 | Person2 | Points
Now the IDs in Person1 and Person2 are redundant, so I am using a MERGE statement instead. But I am not getting the correct results.
For a sample dataset the output seems correct, but when I import my full dataset of 2M rows, it somehow doesn't create the relationships.
Here is the Cypher code I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (p1:Person {id:toInt(csvline.id1)})
MERGE (p2:Person {id:toInt(csvline.id2)})
CREATE (p1)-[:points{count:toInt(csvline.c)}]->(p2)
Some things you should check:
are you using an index? CREATE INDEX ON :Person(id) should be run before the import
depending on the Neo4j version you're using, the statement might be subject to the "eager pipe", which effectively disables the periodic commit; a two-pass workaround is sketched below. For more on the eager pipe, see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
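One common workaround for the eager problem is to split the import into node passes and a relationship pass, so each statement stays simple enough for the periodic commit to apply. A sketch of the passes, here driven from Python (the driver setup is a placeholder; the Cypher reuses the file and columns from the question):

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

# Passes 1 and 2: create the Person nodes, one column at a time.
node_pass_1 = '''
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (:Person {id: toInt(csvline.id1)})
'''

node_pass_2 = '''
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (:Person {id: toInt(csvline.id2)})
'''

# Pass 3: the nodes already exist, so MATCH them and only create relationships.
rel_pass = '''
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MATCH (p1:Person {id: toInt(csvline.id1)})
MATCH (p2:Person {id: toInt(csvline.id2)})
CREATE (p1)-[:points {count: toInt(csvline.c)}]->(p2)
'''

with driver.session() as session:
    for statement in (node_pass_1, node_pass_2, rel_pass):
        session.run(statement)
driver.close()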

HiveQL to HBase

I am using Hive 0.14 and HBase 0.98.8.
I would like to use HiveQL to access an HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that do not use the rowkey are fine, and filters in the HBase shell also work.
I find nothing in the logs, but I assume there could be an issue with complex composite keys and performance.
Did anybody face the same issue? Hints to solve it? Other ideas about what I could try?
Thank you
Update 16.07.15:
I changed the log4j properties to 'DEBUG' and found some interesting information. The log says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is: HiveQL over HBase does not push the predicate down into HBase for this key, but rather starts a full MapReduce scan.
Could there be a bug in the predicate pushdown?
I tried a similar setup using Hive 0.13 and it works fine; I got the result. What version of Hive are you working on?

How to use indexed properties of NodeModels in cypher queries of Neo4django?

I'm a newbie to Django as well as neo4j. I'm using Django 1.4.5, neo4j 1.9.2 and neo4django 0.1.8
I've created NodeModel for a person node and indexed it on 'owner' and 'name' properties. Here is my models.py:
from neo4django.db import models as models2

class person_conns(models2.NodeModel):
    owner = models2.StringProperty(max_length=30, indexed=True)
    name = models2.StringProperty(max_length=30, indexed=True)
    gender = models2.StringProperty(max_length=1)
    parent = models2.Relationship('self', rel_type='parent_of', related_name='parents')
    child = models2.Relationship('self', rel_type='child_of', related_name='children')

    def __unicode__(self):
        return self.name
Before connecting to the Neo4j server, I set auto-indexing to true and listed the indexable keys in the conf/neo4j.properties file as follows:
# Autoindexing
# Enable auto-indexing for nodes, default is false
node_auto_indexing=true
# The node property keys to be auto-indexed, if enabled
node_keys_indexable=owner,name
# Enable auto-indexing for relationships, default is false
relationship_auto_indexing=true
# The relationship property keys to be auto-indexed, if enabled
relationship_keys_indexable=child_of,parent_of
I followed "Neo4j: Step by Step to create an automatic index" to update the above file and to manually create node_auto_index on the Neo4j server.
Below are the indexes on the Neo4j server after running Django's syncdb against the Neo4j database and manually creating the auto indexes:
graph-person_conns lucene
{"to_lower_case":"true", "_blueprints:type":"MANUAL","type":"fulltext"}
node_auto_index lucene
{"_blueprints:type":"MANUAL", "type":"exact"}
As suggested in https://github.com/scholrly/neo4django/issues/123, I used connection.cypher(queries) to query the Neo4j database.
For example:
listpar = connection.cypher("START no=node(*) RETURN no.owner?, no.name?",raw=True)
The above returns the owner and name of all nodes correctly. But when I try to query on the indexed properties instead of a node number or '*', as in:
listpar = connection.cypher("START no=node:node_auto_index(name='s2') RETURN no.owner?, no.name?",raw=True)
Above gives 0 rows.
listpar = connection.cypher("START no=node:graph-person_conns(name='s2') RETURN no.owner?, no.name?",raw=True)
Above gives
Exception Value:
Error [400]: Bad Request. Bad request syntax or unsupported method.
Invalid data sent: (' expected but-' found after graph
I tried other strings like name and person_conns instead of graph-person_conns, but each time it gives an error that the particular index does not exist. Am I making a mistake while adding the indexes?
My project mainly depends on filtering the nodes based on properties, so this part is really essential. Any pointers or suggestions would be appreciated. Thank you.
This is my first post on Stack Overflow, so in case of any missing information or confusing statements, please be patient. Thank you.
UPDATE:
Thank you for the help. For the benefit of others, I would like to give an example of how to use Cypher queries to find the shortest path between two nodes.
from neo4django.db import connection
results = connection.cypher("START source=node:`graph-person_conns`(person_name='s2sp1'),dest=node:`graph-person_conns`(person_name='s2c1') MATCH p=ShortestPath(source-[*]->dest) RETURN extract(i in nodes(p) : i.person_name), extract(j in rels(p) : type(j))")
This finds the shortest path between the nodes named s2sp1 and s2c1 in the graph. Cypher queries are really handy for traversals, limiting the hops, the types of relationships, etc.
Can someone comment on the performance of this method? Also please suggest if there are any other efficient methods to access Neo4j from Django. Thank You :)
Hm, why are you using Cypher? neo4django QuerySets work just fine for the above if you set the properties to indexed=True (or not, it'll just be slower for those).
people = person_conns.objects.filter(name='n2')
The neo4django docs have some other querying examples, as do the Django docs. Neo4django executes those queries as Cypher on the backend- you really shouldn't need to drop down to writing the Cypher yourself unless you have a very particular traversal pattern or a performance issue.
Anyway, to tackle your question more directly: the last example you used needs backticks to escape the index name, like
listpar = connection.cypher("START no=node:`graph-person_conns`(name='s2') RETURN no.owner?, no.name?",raw=True)
The first example should work. One thought: did you flip autoindexing on before or after saving the nodes you're searching for? If after, note that you'll have to manually reindex the nodes, either by using the Java API or by re-setting properties on each node (since they won't have been autoindexed); a small sketch of the latter follows.
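On that last point, a blunt but simple way to force reindexing from Django is to touch an indexed property and save each node again. A sketch, assuming the person_conns model from the question lives in your app's models module (the myapp path is hypothetical):

from myapp.models import person_conns  # hypothetical app path

for person in person_conns.objects.all():
    # Re-setting an indexed property and saving pushes it into the autoindex.
    person.name = person.name
    person.save()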
HTH, and welcome to StackOverflow!
