I need to import data into a database using a CSV file, and here is my table structure.
One vehicle can have multiple attachments (images, PDFs, docs, etc.).
This is my current CSV template:
"vehicleMake", "vehicleModel", "vehicleRego", "attachmentName"
"Nissan", "Qashqai", "ASY504", "abc.pdf"
"Hyundai", "Lantra", "OCR491", "xyz.png"
"HONDA", "CIVIC", "WXO839", "qwe.txt"
I got stuck when there is more than one attachment for a vehicle.
What is the best practice for adding multiple attachment names in a CSV?
Comma-separated attachment names like "123.pdf, xyz.jpeg, test.png", or is there another way to achieve this?
The problem is that you are trying to map a 1:n structure onto a flat file format.
I usually get this kind of data as multiple rows, e.g.:
"HONDA", "CIVIC", "WXO839", "123.txt"
"HONDA", "CIVIC", "WXO839", "xyz.txt"
"HONDA", "CIVIC", "WXO839", "test.txt"
You'll then have to transform that upon import; a sketch follows below.
Compared to your approach, this one puts no limit on the number of attachments that can be added. The disadvantage is that it takes more space through redundancy.
This is the opposite of normalization in an entity-relationship model.
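A minimal sketch of that transform with pandas, assuming the column names from the template above (the file name vehicles.csv is hypothetical):

import pandas as pd

# Read the flat file in which each attachment repeats the vehicle columns.
# skipinitialspace drops the space after each comma in the template above.
df = pd.read_csv('vehicles.csv', skipinitialspace=True)

# Collapse the repeated rows back into one record per vehicle,
# collecting the attachment names into a list.
grouped = (
    df.groupby(['vehicleMake', 'vehicleModel', 'vehicleRego'])['attachmentName']
      .apply(list)
      .reset_index()
)
print(grouped)
# e.g. HONDA | CIVIC | WXO839 | ['123.txt', 'xyz.txt', 'test.txt']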
Split the CSV into two files (import the Vehicle and Vehicle_attachment tables separately) and add a VehicleID column in Vehicle_attachment to link back to the Vehicle table.
You may need to export the Vehicle table first, to use the IDs generated there (see the sketch below).
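A hedged sketch of that split, again with pandas; vehicle.csv and vehicle_attachment.csv are hypothetical file names, and the generated VehicleID merely stands in for whatever key the database assigns:

import pandas as pd

df = pd.read_csv('vehicles.csv', skipinitialspace=True)

# One row per distinct vehicle; the row index stands in for the database ID.
vehicles = (
    df[['vehicleMake', 'vehicleModel', 'vehicleRego']]
      .drop_duplicates()
      .reset_index(drop=True)
)
vehicles.insert(0, 'VehicleID', vehicles.index + 1)
vehicles.to_csv('vehicle.csv', index=False)

# One row per attachment, linked back via VehicleID.
attachments = df.merge(vehicles, on=['vehicleMake', 'vehicleModel', 'vehicleRego'])
attachments[['VehicleID', 'attachmentName']].to_csv('vehicle_attachment.csv', index=False)

If the database generates its own IDs on insert, export the Vehicle table after loading it and join on a natural key (such as the rego) to pick up the real IDs, as noted above.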
I have this JSON in a bucket that has been crawled with a classifier that splits arrays into records, using this JSON classifier: $[*].
I noticed that the JSON is on a single line (nothing wrong with the syntax), but this results in the table being created with a single column of type array containing a struct, which in turn contains the actual fields I need.
In Athena I wasn't able to access the data, and Glue was not able to read the columns as array.field; so I manually changed the structure of the table, making a single struct type with the other fields inside. This I'm able to query in Athena, and I can get the Glue wizard to recognise the single columns as part of the struct.
When I create the job and map the fields accordingly (this is what is automatically generated; note the array.field notation):
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("array.col1", "long", "col1", "long"), ("array.col2", "string", "col2", "string"), ("array.col3", "string", "col3", "string")], transformation_ctx = "applymapping1")
I then test the output on a table in an S3 bucket. The job does not fail at all, but it creates files in the bucket that are empty!
Another thing I've tried is to modify the source JSON and add line breaks:
this is before:
[{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}]
this is after:
[
{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},
{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}
]
Having the file modified as shown above lets me correctly read and write the data; this led me to believe that the problem is the single-line JSON at the source. Before asking for the source JSON to be changed, is there something I can implement in my Glue job (Spark 2.4, Python 3) to handle JSON on a single line? I've searched everywhere but found nothing.
The end goal is to load the data into Redshift; we're working S3 to S3 to check why the data isn't being read.
Thanks in advance for your time and consideration.
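One thing worth trying: Spark's own JSON reader can parse a whole-file JSON array (even when it sits on a single line) if you set the multiLine option, and the resulting DataFrame can be converted back to a DynamicFrame for the rest of the job. A minimal sketch, with the S3 path and frame name as placeholders:

# Sketch for a Glue job (Spark 2.4, Python 3): bypass the crawler's array
# wrapping by reading the raw JSON directly with Spark's multiLine option.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# multiLine=True makes Spark parse the file as one JSON document (here, an
# array of objects) instead of one object per line, so the single-line array
# is exploded into rows with col1..col5 as top-level columns.
df = spark.read.option("multiLine", "true").json("s3://my-bucket/input/data.json")
df.printSchema()

# Convert back to a DynamicFrame so ApplyMapping and the S3/Redshift sinks
# can be used unchanged.
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

With the array exploded into rows, the ApplyMapping call can refer to col1 directly instead of array.col1.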
I am very new to Snowflake and I am trying to work on a dataset. The column I am interested in has multiple feedbacks combined into one in JSON format, and I want to dig out only the relevant key. Here's a snapshot of, let's say, Column_X:
I'm looking for a way to parse this data such that I get a new column like "riskIndicator", with the values 27 and 74 as two new rows. I am attempting to parse it with the code below, but that's not working. I had a look at the JavaScript UDF approach, but it looks complicated for this piece.
,get_path(parse_json("riskIndicatorLNInstantID"),'riskCode') as riskIndicator
I will be thankful for any kind of help/suggestion here.
Thank you.
So if the problem you are having is breaking up the JSON, you will want to use FLATTEN:
with data as (
select parse_json('[{"description":"unable to paste json", "riskCode":"27","seq":1},{"description":"typing in json is painful", "riskCode":"74","seq":2}]') as json
)
select d.json
,f.value:riskCode as riskIndicator
from data d
,lateral flatten(input=>d.json) f;
gives:
JSON RISKINDICATOR
[{ "description": "unable to paste j... "27"
[{ "description": "unable to paste j... "74"
LATERAL FLATTEN can help extract the fields of a JSON object and is a very good alternative to extracting them one by one by name. However, sometimes the JSON object is nested, and extracting those nested objects normally requires knowing their names (FLATTEN's recursive => true argument can help there; see the reference below).
Docs Reference: https://community.snowflake.com/s/article/Dynamically-extract-multi-level-JSON-object-using-lateral-flatten
I am trying to create a relationship between two columns in Neo4j. My dataset is a CSV file in which two columns refer to co-authorship, and I want to construct a network from it. I have already loaded the data, returned it, and matched it.
Loading
load csv from 'file:///conet1.csv' as rec
Creating the nodes:
create (:Guys {source: rec[0], target: rec[1]})
Now I need to construct the collaboration network from the data by making a relationship between the source and target columns. What do you propose for this purpose?
I was able to create a relationship between the mentioned columns with the NetworkX graph library in Python, like this:
import pandas as pd
import networkx as nx

# read the source/target edge list from the CSV file
df = pd.read_csv('Colab.csv', usecols=['source', 'target'])
# build an undirected graph from the edge list
g = nx.from_pandas_edgelist(df, 'source', 'target')
If I understand your use case, I do not believe you should be creating Guys nodes just to store relationship info. Instead, the graph-oriented approach would be to create an Author node for each author and a relationship (say, of type COLLABORATED_WITH) between the co-authors.
This might work for you, or at least give you a clue:
LOAD CSV FROM 'file:///conet1.csv' AS rec
MERGE (source:Author {id: rec[0]})
MERGE (target:Author {id: rec[1]})
CREATE (source)-[:COLLABORATED_WITH]->(target)
If it is possible that the same relationship could be re-created, you should replace the CREATE with a more expensive MERGE. Also, a work can have any number of co-authors, so having a relationship between every pair may be sub-optimal depending on what you are trying to do; but that is a separate issue.
I have a CSV file containing the data that I want to convert into a graph database using Neo4j. The columns in the file are in the following format:
Person1 | Person2 | Points
Now the IDs in Person1 and Person2 are redundant, so I am using a MERGE statement instead, but I am not getting the correct results.
For a sample dataset the output seems to be correct, but when I import my full dataset of 2M rows, it somehow doesn't create the relationships.
I am putting the cypher code that I am using currently.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (p1:Person {id:toInt(csvline.id1)})
MERGE (p2:Person {id:toInt(csvline.id2)})
CREATE (p1)-[:points{count:toInt(csvline.c)}]->(p2)
Some things you should check:
Are you using an index? CREATE INDEX ON :Person(id) should be run before the import.
Depending on the Neo4j version you're using, the statement might be subject to the "eager pipe", which basically prevents the periodic commit. For more on the eager pipe, see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
I think nested data loading in ExtJS 4 is impossible in the real world. For example, I have models for trainings, for users, and a model that describes how trainings are assigned to users, which user assigned a training, and so on. I think it's wrong to do a join in the PHP model (I use ZF 1.x) to get users' first and last names instead of their identifiers, and it's wrong to merge the data from two selects (a select joining the trainings and trainings-to-users tables, and a select from users whose identifiers are in the list produced by that join) in the controller. It's good practice to do it in the view, isn't it?
But ExtJS 4 nested data loading tells me: build a hierarchical structure (with lots of duplicates of the nested data!) and pass it in. Why do I need to do this work? I'd rather merge the data in the controller!
My final question is: if I have JSON like this:
{
trainings: [{"id":"1", "user_id":278,"assigner_id":30, ...}, {...}],
users: ["278": {"firstname":"Guy", "lastname":"Fawkes"}, "300":{....}, ...]
}
Can I create another kind of nested data loading from this data?
Thanks.
Yes you can ... sort of. One way to do it is to define separate stores for users and trainings and then use loadData(JSON.users) to load data that was prefetched.