I need to import data from a csv of the form
id;name;targetset
1;"somenode",[1,3,5,8]
2,"someothernode",[3,8]
into the graph and I need to have targetset stored as collection (array) using cypher. I tried
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mytable.csv" AS row FIELDTERMINATOR ';'
CREATE (:MyNode {id: row.id, name: row.name, targetset: row.targetset});
but it stores targetset as a string, e.g. "[1,3,5,8]". There does not seem to be a function that converts array-encoded strings into actual arrays, the way toInt converts strings to integers. Is there another possibility?
APOC Procedures will be your best bet here. Use the function apoc.convert.fromJsonList().
An example of use:
WITH "[1,3,5,8]" as arr
RETURN apoc.convert.fromJsonList(arr)
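Putting it together with your LOAD CSV, a sketch (assuming the same file and that APOC is installed):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mytable.csv" AS row FIELDTERMINATOR ';'
CREATE (:MyNode {id: row.id, name: row.name, targetset: apoc.convert.fromJsonList(row.targetset)});
Since the string is valid JSON, the elements of the resulting list come back as integers rather than strings.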
You can try this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mytable.csv" AS row FIELDTERMINATOR ';'
CREATE (:MyNode {id: row.id, name: row.name, targetset: split(substring(row.targetset, 1, length(row.targetset) - 2), ',') });
The above code removes the [ and ] characters from the string [1,3,5,8] using the substring() and length() functions. The remaining string 1,3,5,8 is then split using , as the separator.
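Note that split() produces a list of strings. If you need actual integers, a list comprehension with toInteger() can convert them; a sketch based on the same query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mytable.csv" AS row FIELDTERMINATOR ';'
CREATE (:MyNode {id: row.id, name: row.name,
        targetset: [x IN split(substring(row.targetset, 1, length(row.targetset) - 2), ',') | toInteger(x)]});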
My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a CSV inside Snowflake that we can play with later:
create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);
copy into '@my_int_stage/fx3.csv'
from (
select *
from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;
list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
class X:
    def process(self):
        # open the imported file and parse it with a header-aware reader
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                # each row becomes one OBJECT (variant) in the output table
                yield (row, )
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
A limitation of what I showed here is that the Python UDF needs an explicit file name (for now), as it doesn't take a whole folder. Java UDFs do - it would just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html
My data is as follows:
[ {
"InvestorID": "10014-49",
"InvestorName": "Blackstone",
"LastUpdated": "11/23/2021"
},
{
"InvestorID": "15713-74",
"InvestorName": "Bay Grove Capital",
"LastUpdated": "11/19/2021"
}]
So far I have tried:
CREATE OR REPLACE TABLE STG_PB_INVESTOR (
    Investor_ID string, Investor_Name string, Last_Updated DATETIME
);  -- created the table

create or replace file format investorformat
    type = 'JSON'
    strip_outer_array = true;  -- created the file format

create or replace stage investor_stage
    file_format = investorformat;  -- created the stage

copy into STG_PB_INVESTOR from @investor_stage;
I am getting an error:
SQL compilation error: JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
You should be loading your JSON data into a table with a single column that is a VARIANT. Once in Snowflake you can either flatten that data out with a view or a subsequent table load. You could also flatten it on the way in using a SELECT on your COPY statement, but that tends to be a little slower.
Try something like this:
CREATE OR REPLACE TABLE STG_PB_INVESTOR_JSON (
var variant
);
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR_JSON from @investor_stage;
create or replace table STG_PB_INVESTOR as
SELECT
var:InvestorID::string as Investor_id,
var:InvestorName::string as Investor_Name,
TO_DATE(var:LastUpdated::string,'MM/DD/YYYY') as last_updated
FROM STG_PB_INVESTOR_JSON;
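If you prefer to flatten on the way in instead, as mentioned above, the COPY itself can do the transformation. A sketch, assuming the same stage and file format:
copy into STG_PB_INVESTOR
from (
    select
        $1:InvestorID::string,
        $1:InvestorName::string,
        to_date($1:LastUpdated::string, 'MM/DD/YYYY')
    from @investor_stage
);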
I am trying to "COPY INTO" command to load data from s3 to the snowflake
Below are the steps I followed to create the stage and loading file from stage to Snowflake
JSON file
{
"Name":"Umesh",
"Desigantion":"Product Manager",
"Location":"United Kingdom"
}
create or replace stage emp_json_stage
url='s3://mybucket/emp.json'
credentials=(aws_key_id='my id' aws_secret_key='my key');
-- create the table with a variant column
CREATE TABLE emp_json_raw (
  json_data_raw VARIANT
);
-- load data from the stage into Snowflake
COPY INTO emp_json_raw from @emp_json_stage;
I am getting the error below:
Field delimiter ',' found while expecting record delimiter '\n' File
'emp.json', line 2, character 18 Row 2, column
"emp_json_raw"["JSON_DATA_RAW":1]
I am using a simple JSON file, and I don't understand this error.
What causes it and how can I solve it?
The file format is not specified, so it defaults to CSV format; hence the error.
Try this:
COPY INTO emp_json_raw
from @emp_json_stage
file_format=(TYPE=JSON);
There are other options besides TYPE that can be specified with file_format. Refer to the documentation here: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#type-json
try:
file_format = (type = csv field_optionally_enclosed_by='"')
The default settings do not expect the " wrapping around your data.
So you could strip all the " characters, or just set field_optionally_enclosed_by to ". This does mean that if your data contains " things get messy.
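A complete COPY using this option might look like the following (the table and stage names here are hypothetical):
-- hypothetical table and stage names, for illustration only
copy into my_table
from @my_stage/mytable.csv
file_format = (type = csv field_optionally_enclosed_by = '"');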
https://docs.snowflake.com/en/user-guide/getting-started-tutorial-copy-into.html
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv
Also, as a standard practice, always specify the type of the file explicitly, whether it is CSV, JSON, Avro, Parquet, etc.
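For example (the format names are illustrative):
create or replace file format my_csv_format type = 'CSV';
create or replace file format my_json_format type = 'JSON';
create or replace file format my_parquet_format type = 'PARQUET';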
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
I'm having trouble creating a table in Athena that points at files with the following format:
string, string, string, array.
When I wrote the file, I delimited the array items with '|'.
I delimited each line with '\n' and each column with ','.
So, for example, a row in my CSV would look like this:
Garfield, 15, orange, fish|milk|lasagna
In Hive (according to the documentation I read), when you create a table with a delimited row format, you can specify a 'collection items' delimiter that separates the elements of array columns.
I could not find an equivalent for Presto in the documentation.
Is anyone aware if it's possible, and if so, what is the format, or where can I find it?
I tried "guessing" many forms, including 'collection items'; none seem to work.
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
Would really appreciate any insight, Thanks! :)
According to AWS Athena docs on using SerDe, your guess was 100% correct.
In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify ROW FORMAT DELIMITED
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
Now, when I simply tried your DDL statement, I would get
line 1:8: no viable alternative at input 'create external'
However, by deleting LINES TERMINATED BY '\n', I was able to create the table schema in the metadata catalog:
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
A sample file with lines like the ones you showed was parsed correctly, and I was able to UNNEST the foods column:
SELECT *
FROM "cats"
CROSS JOIN UNNEST(foods) as t(food)
which returned one row per food item: for the sample row above, that is three rows (fish, milk and lasagna), each repeating the name, age and color columns.
Moreover, it was also enough to simply swap the lines LINES TERMINATED BY '\n' and COLLECTION ITEMS TERMINATED BY '|' for the query to work (although I don't have an explanation for that):
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
(Note: this answer is applicable to Presto in general, but not to Athena)
Currently you cannot set collection delimiter in Presto.
Please create a feature request at https://github.com/prestosql/presto/issues/
Note, we plan to provide generic support for table properties to address cases like this holistically -- https://github.com/prestosql/presto/issues/954. You can track the issue and associated pull request for updates.
I use the Presto engine to create Hive tables and set the field delimiter in Presto, for example:
CREATE TABLE IF NOT EXISTS test (
id bigint COMMENT 'ID',
type varchar COMMENT 'TYPE',
content varchar COMMENT 'CONTENT',
create_time timestamp(3) COMMENT 'CREATE TIME',
pt varchar
)
COMMENT 'create time 2021/11/04 11:27:53'
WITH (
format = 'TEXTFILE',
partitioned_by = ARRAY['pt'],
textfile_field_separator = U&'\0001'
)
I am trying to create a table that has complex data types. The data types are listed below:
array<String>
map<String,String>
array<map<String,String>>
Is it possible to create this in Hive? My table DDL looks like this:
create table complexTest(names array<String>, infoMap map<String,String>, details array<map<String,String>>)
row format delimited
fields terminated by '/'
collection items terminated by '|'
map keys terminated by '='
lines terminated by '\n';
And my sample data looks like this:
Abhieet|Test|Complex/Name=abhi|age=31|Sex=male/Name=Test,age=30,Sex=male|Name=Complex,age=30,Sex=female
Whenever I query the data from the table, I get the values below:
["Abhieet"," Test"," Complex"] {"Name":"abhi","age":"31","Sex":"male"} [{"Name":null,"Test,age":null,"31,Sex":null,"male":null},{"Name":null,"Complex,age":null,"30,Sex":null,"female":null}]
which is not what I am expecting. Could you please help me figure out what the DDL should be, if it is possible at all, for the data type array<map<String,String>>?
I don't think this is possible using the built-in serde. If you know in advance what the values in your maps are going to be, then I think a better way of approaching this would be to convert your input data to JSON and then use the Hive JSON serde:
Sample data:
{"names": ["Abhieet", "Test", "Complex"],
 "infomap": {"Sex": "male", "Name": "abhi", "age": "31"},
 "details": [{"Sex": "male", "Name": "Test", "age": "30"}, {"Sex": "female", "Name": "Complex", "age": "30"}]
}
Table definition code:
create table complexTest
(
names array<string>,
infomap struct<Name:string,
age:string,
Sex:string>,
details array<struct<Name:string,
age:string,
Sex:string>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
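To read the data back, a query along these lines should work (a sketch; LATERAL VIEW explode() produces one row per element of the details array):
SELECT names,
       infomap.Name AS info_name,
       d.Name AS detail_name,
       d.age AS detail_age
FROM complexTest
LATERAL VIEW explode(details) t AS d;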
This can be handled with an array of structs, using the following query.
create table complexStructArray(
  custID String,
  nameValuePairs array<struct<key:String, value:String>>
)
row format delimited
fields terminated by '/'
collection items terminated by '|'
map keys terminated by '='
lines terminated by '\n';
Sample data:
101/Name=Madhavan|age=30
102/Name=Ramkumar|age=31
Though struct allows duplicate keys (unlike map), the above query should handle the requirement as long as the data has unique keys.
A select query gives the following output:
hive> select * from complexStructArray;
101 [{"key":"Name","value":"Madhavan"},{"key":"age","value":"30"}]
102 [{"key":"Name","value":"Ramkumar"},{"key":"age","value":"31"}]
Sample data: {"Name": ["Abhieet", "Test", "Complex"],"infoMap": {"Sex":"male", "Name":"abhi", "age":31},"details": [{"Sex":"male", "Name":"Test", "age":30}, {"Sex":"female", "Name":"Complex", "age":30}]}
Table definition code:
hive>
create table complexTest
(
  names array<string>,
  infomap struct<Name:string, age:string, Sex:string>,
  details array<struct<Name:string, age:string, Sex:string>>
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';