Importing a TSV file into DynamoDB

I have a dataset that I got from the Registry of Open Data on AWS.
This is the link to my dataset. I want to import this dataset into a DynamoDB table, but I don't know how to do it.
I tried to use AWS Data Pipeline to load it from an S3 bucket into DynamoDB, but it failed with the following error:
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: com.google.gson.stream.MalformedJsonException: Expected ':' at line 1 column 20
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1505)
at com.google.gson.stream.JsonReader.doPeek(JsonReader.java:519)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:414)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:157)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:187)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:145)
at com.google.gson.Gson.fromJson(Gson.java:803)
... 15 more
Exception in thread "main" java.io.
I get this error and I don't know how to fix it.
Then I downloaded the file locally, but I still can't import it into my DynamoDB table.
There is no code for the moment; all I have done is configuration.
I expect to end up with the dataset in my table, but unfortunately I haven't been able to reach that goal.

I finally converted the TSV file to a CSV file and then to a JSON file using a Python script.
With a JSON file it is much easier to load the data into DynamoDB.

Related

Presto query error: Error reading tail from

I'm trying to query data over a Presto connection. The data (Delta format) is in an S3 bucket, and the query fails with this error:
SQL Error [16777232]: Query failed (#20211005_122441_00037_s2r9w): Error reading tail from s3://*/*/*/table/*/part-00015-bc2cc6d2-706d-4859-ab57-5f87d93d81f5-c000.snappy.parquet with length 16384
When I look at the bucket the file doesn't exist.
It looks like your data has changed, but the metadata (I assume you're using AWS Glue as the metastore) hasn't.
You can try CALL system.sync_partition_metadata('<YOUR_SCHEMA>', '<YOUR_TABLE>', 'full'); to get it updated.
Also make sure you have a consistent schema across your partitions, if you're using them.
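For illustration, a minimal sketch of that call with hypothetical names (catalog hive, schema analytics, table events); the catalog prefix is needed unless a catalog is already set for your session, and the procedure assumes the table is registered in a Hive/Glue catalog:

-- hypothetical catalog/schema/table names; mode can be 'ADD', 'DROP' or 'FULL'
CALL hive.system.sync_partition_metadata(
    schema_name => 'analytics',
    table_name  => 'events',
    mode        => 'FULL');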

Error while loading CSV file through SnowSQL COPY INTO command

I was trying to load CSV files from an AWS S3 bucket with the COPY INTO command, and one of the CSV files throws an error like:
End of record reached while expected to parse column
'"RAW_PRODUCTS"["PACK_COUNT_UNITS":25]
With VALIDATION_MODE = RETURN_ALL_ERRORS it also gives me the 2 rows that have errors, but I am not sure what the errors actually are.
My concern is: can we get the specific error so that we can fix it in the file?
You might try using the VALIDATE table function. https://docs.snowflake.com/en/sql-reference/functions/validate.html
Thanks Eda, I already reviewed the link above, but it did not work with a plain COPY INTO the table from the S3 bucket, so I created a stage, placed the CSV file on the stage, and then ran the VALIDATE command; it gave me the same error rows.
There is another way to identify errors while executing the COPY INTO command: you can add VALIDATION_MODE = RETURN_ALL_ERRORS and you will get the same result.
By the way, I resolved the error: it was due to a "/,," sequence in the data. I removed the "/" and it loaded successfully. "/" or "/," worked as in the other rows, but "/,," did not.
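For reference, a minimal sketch of both approaches, using hypothetical names (my_table, @my_stage, my_csv_format) that you would replace with your own:

-- dry run: report every problem row without loading anything
COPY INTO my_table
  FROM @my_stage/raw_products.csv
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  VALIDATION_MODE = RETURN_ALL_ERRORS;

-- after a real load attempt, VALIDATE() returns the rows rejected by that job
COPY INTO my_table
  FROM @my_stage/raw_products.csv
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = CONTINUE;
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));

The ERROR, LINE, CHARACTER and REJECTED_RECORD columns in the output should point at the exact field, which is how a stray delimiter such as the "/,," above tends to show up.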

Snowflake "Error parsing JSON: incomplete object value" when parsing a JSON file in a Snowflake worksheet (the JSON file is verified and correct)

The problem is that I have a JSON file stored in a stage in one of my databases (newly created). I am not performing any database-related activities; I'm just trying to query the JSON data using:
SELECT parse_json($1):order_id FROM @my_stage/mahdi_test.json.gz t;
and here is a sample line from mahdi_test.json:
{"order_id": 67, "file_id": *****, "file_name": "name.pdf", "file_type": "pdf", "language_id": 1, "created_dt": "2030-11-17 19:39:25", "delivery_id": *****},
(the "*" are just to avoid showing actual data.)
The JSON file contains multiple lines just like the sample above, but the result of the query is:
"Error parsing JSON: incomplete object value, pos 17"
The trickiest part is that I put the same JSON file into another database's stage (a database that was created earlier, and not by me) and tried the same thing in the Snowflake worksheet (I changed the database in the panel at the top right of the worksheet to the older DB), then ran the exact same query with the exact same JSON file; this time it worked and showed me the results.
What is causing this problem, and how can I make the new database behave like the other one? Clearly there is nothing wrong with the JSON file itself, because it worked on the legacy database.
The answer to this question ended up being that the stage did not have a file format specified. Adding a file format that specified a JSON format to the stage fixed the issue.
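A minimal sketch of that fix, with hypothetical names (my_json_format, my_stage); you can either attach the format to the stage or pass it per query:

-- define a JSON file format once
CREATE OR REPLACE FILE FORMAT my_json_format TYPE = JSON;

-- option 1: attach it to the stage so every query uses it
ALTER STAGE my_stage SET FILE_FORMAT = (FORMAT_NAME = 'my_json_format');

-- option 2: pass it for this query only; with a JSON format, $1 already arrives as a VARIANT
SELECT $1:order_id
FROM @my_stage/mahdi_test.json.gz (FILE_FORMAT => 'my_json_format') t;

The older database presumably already had a JSON file format in place on its stage (or as a schema default), which would explain why the same query worked there.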

Snowflake "Max LOB size (16777216) exceeded" error when loading data from Parquet

I am looking to load data from S3 into Snowflake. My files are in Parquet format and were created by a Spark job. There are 199 Parquet files in my folder in S3, each with about 5,500 records. Each Parquet file is Snappy-compressed and is about 485 KB.
I have successfully created a storage integration and staged my data. However, when I read my data, I get the following message:
Max LOB size (16777216) exceeded, actual size of parsed column is 19970365
I believe I have followed the General File Sizing Recommendations, but I have not been able to find a solution to this issue, or even a clear description of this error message.
Here are the basics of my SQL query:
CREATE OR REPLACE TEMPORARY STAGE my_test_stage
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
URL = 's3://my-bucket/my-folder';
SELECT $1 FROM @my_test_stage (PATTERN => '.*\\.parquet');
I seem to be able to read each parquet file individually by changing the URL parameter in the CREATE STAGE query to the full path of the parquet file. I really don't want to have to iterate through each file to load them.
The VARIANT data type imposes a 16 MB (compressed) size limit on individual rows.
The result set is displayed as a virtual column, so the 16 MB limit also applies there.
Docs reference: https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#semi-structured-data-size-limitations
There may be an issue with one or more records in your files; try running the COPY command with the "ON_ERROR" option to find out whether all the records have the same problem or only a few.
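A minimal sketch of that suggestion, assuming a hypothetical single-VARIANT-column target table (my_raw_table) and the stage from the question; if CONTINUE is not accepted for your file format, ON_ERROR = SKIP_FILE will at least identify the offending files:

CREATE OR REPLACE TABLE my_raw_table (v VARIANT);

-- load what fits and skip rows that exceed the limit instead of aborting the whole load
COPY INTO my_raw_table
  FROM @my_test_stage
  PATTERN = '.*\\.parquet'
  ON_ERROR = CONTINUE;

-- inspect the rows rejected by that load
SELECT * FROM TABLE(VALIDATE(my_raw_table, JOB_ID => '_last'));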

Talend tGSCopy selective file copy to another bucket

Using Talend, I am trying to move App Engine datastore backup files to a new folder, specifically skipping the file whose name ends with just ".backup_info".
I have to load only files 2 and 3, skipping file 1.
File 1: ahFzfnZpcmdpbi1yZWQtdGVzdHJACxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiRyH8MCxIWX0FFX0JhY2t1cF9JbmZvcm1hdGlvbhgBDA.backup_info
File 2: ahFzfnZpcmdpbi1yZWQtdGVzdHJACxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiRyH8MCxIWX0FFX0JhY2t1cF9JbmZvcm1hdGlvbhgBDA.MasterContentType.backup_info
File 3: ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhjSz7UDDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.Timeline.backup_info
There are around 100 objects. What do I enter for the "Source Object Key" of tGSCopy in the component configuration to select particular files? This seems challenging; please assist.
