How to use "WITH" clause with Memgraph’s GQLAlchemy? - graph-databases

I'm working with GQLAlchemy and was wondering how to properly use the WITH clause and how to import a CSV file with headers using Memgraph's GQLAlchemy.

For example:
match().node(variable="n").with_({"n": "node"})...
This is the with_ (with underscore) method in the query builder; its argument is a dictionary, which in the example above means:
WITH n AS node
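Putting the pieces together, a minimal query-builder sketch might look like the following. This assumes GQLAlchemy's match() builder exposes return_() and execute() (as in its documented examples) and that a local Memgraph instance is reachable with the default connection settings; check the method names against the GQLAlchemy docs for your version.
from gqlalchemy import match

# Roughly equivalent to: MATCH (n) WITH n AS node RETURN *
results = (
    match()
    .node(variable="n")
    .with_({"n": "node"})
    .return_()
    .execute()
)

for row in results:
    print(row)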
Importing CSV files with (or without) headers using GQLAlchemy can be done by executing a Cypher query. Check out how the CSV files with a header were imported in our workshop.
For example:
memgraph.execute(
    """
    LOAD CSV FROM "/movies.csv"
    WITH HEADER AS row
    MERGE (m:Movie {id: toInteger(row.movieId), title: row.title})
    WITH m, row
    UNWIND split(row.genres, '|') AS genre
    MERGE (m)-[:OF_GENRE]->(:Genre {name: genre});
    """
)

Related

Read a Struct JSON with AWS Glue that is on a single line

I have this JSON in a bucket that has been crawled with a classifier that splits arrays into records, using this JSON classifier: $[*].
I noticed that the JSON is on a single line - nothing wrong with the syntax - but this results in the created table having a single column of type array, containing a struct which holds the actual fields I need.
In Athena I wasn't able to access the data, and Glue was not able to read the columns as array.field, so I manually changed the structure of the table to a single struct type with the other fields inside. This I can query in Athena, and the Glue wizard recognises the individual columns as part of the struct.
When I create the job and map the fields accordingly, I test the output against a table in an S3 bucket. This is the automatically generated mapping (note the array.field notation):
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("array.col1", "long", "col1", "long"), ("array.col2", "string", "col2", "string"), ("array.col3", "string", "col3", "string")], transformation_ctx = "applymapping1")
The job does not fail at all, BUT it creates files in the bucket that are empty!
Another thing I've tried is to modify the source JSON and add line breaks:
this is before:
[{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}]
this is after:
[
{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},
{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}
]
Modifying the file as shown above lets me read and write the data correctly; this led me to believe that the problem is the single-line JSON at the source. Before asking for the source JSON to be changed, is there something I can implement in my Glue job (Spark 2.4, Python 3) to handle a JSON on a single line? I've searched everywhere but found nothing.
The end goal is to load the data into Redshift; we're working S3 to S3 to check why the data isn't being read.
Thanks in advance for your time and consideration.
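One avenue that may be worth checking before changing the source JSON is Spark's multiLine option. This is a minimal PySpark sketch outside the Glue DynamicFrame API, and the bucket path is a placeholder: with multiLine=True the reader parses the whole file as a single JSON document, so a top-level array written on one line becomes one row per element.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-line-json").getOrCreate()

# multiLine=True tells Spark to parse the whole file as one JSON document,
# so [{...},{...}] on a single line yields one row per array element.
df = (
    spark.read
    .option("multiLine", "true")
    .json("s3://my-bucket/input.json")  # placeholder path
)

df.printSchema()
df.show(truncate=False)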

pyarrow parquet - encoding array into list of records

I am creating parquet files using Pandas and pyarrow and then reading the schema of those files using Java (org.apache.parquet.avro.AvroParquetReader).
I found out that parquet files created using pandas + pyarrow always encode arrays of primitive types as an array of records with a single field.
I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery.
Here is the Python script that creates the parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(
    {
        'organizationId' : ['org1', 'org2', 'org3'],
        'entityType' : ['customer', 'customer', 'customer'],
        'entityId' : ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts' : [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
    }
)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
When I try to read the Avro schema of that parquet file, I see the following schema for the 'customerProducts' field:
{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}
but I would expect something like this:
{"type":"array","items":["null","string"]}
Does anyone know if there is a way to make sure that parquet files created with arrays of primitive types will have the simplest possible schema?
thanks
As far as I know, the Parquet data model follows the Dremel data model, which allows a column to have one of three repetition types:
required
optional
repeated
In order to represent a list, a nested type is needed to add an additional level of indirection, to distinguish between empty lists and lists containing only null values.
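To see that extra level of indirection for the file created above, the schema can be read back with pyarrow. A small sketch (the exact printed layout varies a little between pyarrow versions):
import pyarrow.parquet as pq

# Arrow-level view: customerProducts shows up as list<item: string>
print(pq.read_table('output.parquet').schema)

# Parquet-level view: shows the three-level list encoding, roughly
#   optional group customerProducts (List) {
#     repeated group list { optional binary item (String); }
#   }
print(pq.read_metadata('output.parquet').schema)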

How to Create Relationship between two Different Column in Neo4j

I am trying to initiate a relationship between two columns in Neo4j. My dataset is a CSV file with two columns referring to co-authorship, and I want to construct a network from it. I have already loaded the data, returned it, and matched it.
Loading
load csv from 'file:///conet1.csv' as rec
return the data
create (:Guys {source: rec[0], target: rec[1]})
Now I need to construct the collaboration network by creating a relationship between the source and target columns. What do you propose for this purpose?
I was able to create a relationship between the mentioned columns with the NetworkX graph library in Python like this:
import pandas as pd
import networkx as nx

# Read only the two edge columns from the CSV file (read_csv, not read_excel)
df = pd.read_csv('Colab.csv', usecols=['source', 'target'])
# Build an undirected graph with one edge per co-authorship row
g = nx.from_pandas_edgelist(df, 'source', 'target')
If I understand your use case, I do not believe you should be creating Guys nodes just to store relationship info. Instead, the graph-oriented approach would be to create an Author node for each author and a relationship (say, of type COLLABORATED_WITH) between the co-authors.
This might work for you, or at least give you a clue:
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
CREATE (source)-[:COLLABORATED_WITH]->(target)
If it is possible that the same relationship could be re-created, you should replace the CREATE with a more expensive MERGE. Also, a work can have any number of co-authors, so having a relationship between every pair may be sub-optimal depending on what you are trying to do; but that is a separate issue.
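For reference, here is a hedged sketch of the MERGE variant run through the official Neo4j Python driver. The connection details are placeholders, and the weight counter is an illustrative extra rather than part of the answer above.
from neo4j import GraphDatabase

# Placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
MERGE (source)-[r:COLLABORATED_WITH]->(target)
  ON CREATE SET r.weight = 1
  ON MATCH SET r.weight = r.weight + 1
"""

with driver.session() as session:
    session.run(query)
driver.close()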

CSV Template for Data Import

I need to import data into a database using a CSV file, and here is my table structure.
One vehicle can have multiple attachments (images, PDFs, docs, etc.).
This is my current CSV template.
"vehicleMake", "vehicleModel", "vehicleRego", "attachmentName"
"Nissan", "Qashqai", "ASY504", "abc.pdf"
"Hyundai", "Lantra", "OCR491", "xyz.png"
"HONDA", "CIVIC", "WXO839", "qwe.txt"
I got stuck when there is more than one attachment for a vehicle.
What is the best practice for adding multiple attachment names in the CSV?
Comma-separated attachment names like this, "123.pdf, xyz.jpeg, test.png", or is there another way to achieve this?
The problem you have is that you are trying to map a 1:n structure to a flat file format.
I usually get this kind of data as multiple rows, e.g.:
"HONDA", "CIVIC", "WXO839", "123.txt"
"HONDA", "CIVIC", "WXO839", "xyz.txt"
"HONDA", "CIVIC", "WXO839", "test.txt"
You'll then have to transform that upon import.
Compared to your approach, this one has no limit on the number of attachments that can be added. The disadvantage is that it takes more space through redundancy.
This is the opposite of normalization in an entity-relationship model.
Split the CSV into two files (import the vehicle and attachment tables separately) and add a VehicleID column in Vehicle_attachment to link back to the Vehicle table.
You may need to export the Vehicle table first, to use the IDs generated there.
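A small pandas sketch of that split; the file names are placeholders, and the generated VehicleID is only illustrative (a real import would typically use IDs assigned by the database):
import pandas as pd

# Flat CSV with one row per (vehicle, attachment) pair, using the header
# from the template above
df = pd.read_csv("vehicles_flat.csv", skipinitialspace=True)

# Vehicle table: one row per distinct vehicle, keyed by a generated VehicleID
vehicles = (
    df[["vehicleMake", "vehicleModel", "vehicleRego"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
vehicles["VehicleID"] = vehicles.index + 1

# Attachment table: one row per attachment, linked back via VehicleID
attachments = df.merge(
    vehicles, on=["vehicleMake", "vehicleModel", "vehicleRego"]
)[["VehicleID", "attachmentName"]]

vehicles.to_csv("vehicle.csv", index=False)
attachments.to_csv("vehicle_attachment.csv", index=False)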

Creating a Neo4j Graph Database Using LOAD CSV

I have a CSV file containing the data that I want to convert into a graph database using Neo4j. The columns in the file are in the following format:
Person1 | Person2 | Points
Now the IDs in Person1 and Person2 are redundant, so I am using a MERGE statement instead, but I am not getting the correct results.
For a sample dataset the output seems to be correct, but when I import my dataset consisting of 2M rows, it somehow doesn't create the relationships.
Here is the Cypher code that I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (p1:Person {id:toInt(csvline.id1)})
MERGE (p2:Person {id:toInt(csvline.id2)})
CREATE (p1)-[:points{count:toInt(csvline.c)}]->(p2)
Some things you should check:
Are you using an index? CREATE INDEX ON :Person(id) should be run before the import.
Depending on the Neo4j version you're using, the statement might be subject to the "eager pipe", which basically prevents the periodic commit. For more on the eager pipe, see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/ (a pass-per-MERGE workaround is sketched below).
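Following the linked post, a common workaround is to keep a single MERGE per LOAD CSV statement and create the relationships in a final pass. A hedged sketch using the official Neo4j Python driver (the connection details are placeholders; the Cypher reuses the path and properties from the question):
from neo4j import GraphDatabase

# Placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

csv_url = "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv"

# One MERGE per statement keeps each pass free of the eager operator,
# so USING PERIODIC COMMIT can actually batch the work.
passes = [
    "CREATE INDEX ON :Person(id)",
    f"""USING PERIODIC COMMIT 1000
        LOAD CSV WITH HEADERS FROM "{csv_url}" AS csvline
        MERGE (:Person {{id: toInt(csvline.id1)}})""",
    f"""USING PERIODIC COMMIT 1000
        LOAD CSV WITH HEADERS FROM "{csv_url}" AS csvline
        MERGE (:Person {{id: toInt(csvline.id2)}})""",
    f"""USING PERIODIC COMMIT 1000
        LOAD CSV WITH HEADERS FROM "{csv_url}" AS csvline
        MATCH (p1:Person {{id: toInt(csvline.id1)}})
        MATCH (p2:Person {{id: toInt(csvline.id2)}})
        CREATE (p1)-[:points {{count: toInt(csvline.c)}}]->(p2)""",
]

with driver.session() as session:
    for statement in passes:
        session.run(statement)
driver.close()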

Resources