How to parse a single-line JSON file in a Snowflake external table - snowflake-cloud-data-platform

This is the first time I've tried to load a single-line JSON file into Snowflake via an external table.
The files are around 60 MB and stored on S3. They contain nested records and arrays, and no newline characters.
Right now I cannot see any data from the external table. However, if the file is small enough, say 1 MB, the external table works fine.
The closest solution I can find is this, but it doesn't give me a sufficient answer.
The files can be much bigger, and I have no control over them.
Is there a way to fix this issue?
Thanks!
Edit:
It looks like the only tangible solution is to make the files smaller, since the JSON is not NDJSON. What JSON format does STRIP_OUTER_ARRAY support?

If your nested records are wrapped in an outer array, you can potentially use STRIP_OUTER_ARRAY on the stage's file format definition to remove it, so that each JSON element within the outer array gets loaded as a record. You can also use METADATA$FILENAME in the table definition to derive which file each element came from.
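A minimal sketch of what that can look like; the format, stage, table, and bucket names below are placeholders, and the S3 credentials/integration details are omitted:
create or replace file format my_json_format
  type = json
  strip_outer_array = true;   -- each element of the outer array becomes its own row

create or replace stage my_json_stage
  url = 's3://my-bucket/path/'
  file_format = (format_name = 'my_json_format');

create or replace external table my_json_ext (
  src_file varchar as (metadata$filename)   -- which file each element came from
)
location = @my_json_stage
file_format = (format_name = 'my_json_format')
auto_refresh = false;   -- register files with ALTER EXTERNAL TABLE my_json_ext REFRESH;
-- each array element is exposed through the automatic VALUE column:
-- select src_file, value from my_json_ext;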

Related

How do you copy data from a JSON file in a external stage in Snowflake when the file is too large?

I have a JSON file (well, technically several) in an external GCS stage in Snowflake. I'm trying to extract data from it into tables, but the file is too large. The file doesn't contain an array of JSON objects; it is actually one giant JSON object. As such, setting STRIP_OUTER_ARRAY to true isn't an option in this case. Breaking the files into smaller files isn't really an option either, because they are maintained by an external program and I don't have any control over that.
The general structure of the JSON is:
{
  meta1: value1,
  meta2: value2,
  ...
  data: {
    component1: value1,
    component2: value2,
    ...
  }
}
The issue is due to the value of data. There can be a varying number of components and their names aren't reliably predictable. I could avoid the size issue if I could separate out the components, but I'm unable to do a lateral flatten inside of copy into. I can't load into a temporary variant column either because the JSON is too large. I tried to do an insert instead of a copy, but that complains about the size as well.
Are there any other options? I wondered if there might be a way to utilize a custom function or procedure, but I don't have enough experience with those to know if they would help in this case.
So it's external programs to GCP to Snowflake. Looking at all the things you have tried on the Snowflake side, it seems there may not be any other option available within Snowflake. However, how about handling it on the GCP end? Once a file lands in GCP, say in Folder 1, try to split the JSON into smaller pieces and move them into GCP Folder 2, then create your external stage on top of Folder 2.
I don't have much experience with GCP either, but I thought I'd share this idea in case it works out for you.
I could avoid the size issue if I could separate out the components,
but I'm unable to do a lateral flatten inside of copy into.
I think you can use LATERAL FLATTEN in a CREATE TABLE ... AS SELECT ... statement (see this). You leave your huge JSON file in an external stage like S3, and you access it directly from there with a statement like this, to split it up by the "data" elements:
-- assumes the stage's file format is JSON, so $1 is already a VARIANT
create table mytable (id string, value variant) as
select f.key::string, f.value
from @mystage/myfile.json.gz,
lateral flatten(input => $1:data) f;
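If you also need the top-level meta fields repeated on every row, the same pattern extends naturally. A sketch using the hypothetical field names from the question, again assuming the stage's file format is JSON (the table name is a placeholder):
create or replace table mytable_with_meta as
select
  $1:meta1::string as meta1,
  $1:meta2::string as meta2,
  f.key::string    as component,
  f.value          as component_value
from @mystage/myfile.json.gz,
lateral flatten(input => $1:data) f;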

How to create a logic app which creates a tab-delimited table?

Right now, I run a stored procedure whose output feeds a "Create CSV Table" Data Operations component. This component, not surprisingly, outputs a comma-delimited list of fields, which is not supported by our remote system. The fields need to be tab-delimited. One would think that the Data Operations component would have a tab (or other character) delimiter option. But no, only commas are available, and no other Data Operations component outputs a tab-delimited table.
Any mechanism that requires us to write code is very much a last resort, since producing a delimited file shouldn't require code. Also, any mechanism that requires paying for third-party components is categorically out, as is any solution that is still in preview.
The only option we've thought of is to revamp the stored procedure so it outputs a single "column" containing the tab-delimited fields, and then write that to a file - ostensibly a comma-delimited file, but one with no embedded commas (which my system allows), so that the single column isn't itself quoted.
Otherwise, I guess Function Apps is the solution. Anyone with ideas?
The easiest way is to use a string function to replace the commas with a different delimiter. If that approach is acceptable, then after creating the CSV table, initialize a string variable with an expression like replace(body('Create_CSV_table_2'), ',', decodeUriComponent('%09')), where decodeUriComponent('%09') produces the literal tab character. The result is a tab-delimited table.
And if you don't want to do it this way, then yes, you have to solve it with code, and a Function is one option.

Talend - Issue when using Dynamic Schema with CSV file

I'm reading a CSV file dynamically, but it seems that this data structure is deleting the first row (the most important one in my case).
You can see that my file contains 109 rows, but tFileInput seems to read only 108.
The same file, with the same configuration but with Dynamic changed to String, works perfectly.
I need to read the file dynamically because the number of columns is variable, and I need to pivot the file according to that first line - the one that goes missing.
Any Idea ?
Thank you :)
OK, this is a known behaviour when you are using a dynamic schema. I took this snippet from the Talend documentation for you to read:
Header: Enter the number of rows to be skipped at the beginning of the file.
Note that when dynamic schema is used, the first row of the input data is always treated as the header row no matter whether the Header field value is set or not. For more information about dynamic schema, see Talend Studio User Guide.

SSIS using too much memory to load large (40GB+) XML file into SQL Server table

I need to load a single large (40GB+) XML file into an SQL Server 2012 database table using SSIS. I'm having problems because SSIS seems to be trying to load the entire document in memory instead of streaming it.
Here's more details of my SSIS package.
I've created an XML Source with the following properties:
Data access mode: XML file from variable (but could be XML File Location)
Variable name: variable that specifies the XML file path in my computer.
XSD location: the path to the XSD that defines the XML being read.
The XML structure is simple, with only 3 hierarchical levels:
Root element with header information
One level defining collections of objects
The leaf level defining individual objects (each with a fixed set of fields)
I need to insert one database record per leaf element, repeating fields from the higher hierarchy levels. In other words, I need to flatten the XML hierarchy.
How can I make SSIS stream load the data, instead of trying to load the entire document in memory?
The XML source always loads the entire file. It uses XmlDocument to do so (last I checked).
The only thing you can do is split up the file somehow, then iteratively run each piece through your data flow.
Beyond that, you're looking at creating a custom data source, which is not trivial. It also represents a serious piece of code to maintain.
There may be third-party data sources which can do this. I had to write my own about five years ago.
Have you considered processing the files in smaller chunks?
I had the same issue before, so I created a Script Component to split that one big XML file into hundreds of smaller XML files, then used a loop to iterate over all of the smaller files and process them.
To do this you can't use StreamReader.ReadLine, because it will still end up pulling in that very large file; instead use System.IO.MemoryMappedFiles.MemoryMappedFile, which is a class designed for this scenario.
Have a look here http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx

How to import variable record length CSV file using SSIS?

Has anyone been able to get a variable record length text file (CSV) into SQL Server via SSIS?
I have tried time and again to get a CSV file into a SQL Server table, using SSIS, where the input file has varying record lengths. For this question, the two different record lengths are 63 and 326 bytes. All record lengths will be imported into the same 326 byte width table.
There are over 1 million records to import.
I have no control of the creation of the import file.
I must use SSIS.
I have confirmed with MS that this has been reported as a bug.
I have tried several workarounds. Most have involved writing custom code to intercept the record, and I can't seem to get that to work the way I want.
I had a similar problem and solved it with custom code: a Script Task, plus a Script Component under the Data Flow tab.
I have a Flat File Source feeding into a Script Component. Inside it I use code to manipulate the incoming data and fix it up for the destination.
My issue was that the provider was using '000000' to mean no date available, and another column had a padding/trim issue.
You should have no problem importing this file. Just make sure that when you create the Flat File connection manager, you select the Delimited format, then set the SSIS column length to the maximum column length in the file so it can accommodate any of the data.
It sounds like you are using the Fixed width format, which is not correct for CSV files (since you have variable-length records), or maybe you've set the column delimiter incorrectly.
Same issue. In my case, the target CSV file has header & footer records with formats completely different than the body of the file; the header/footer are used to validate completeness of file processing (date/times, record counts, amount totals - "checksum" by any other name ...). This is a common format for files from "mainframe" environments, and though I haven't started on it yet, I expect to have to use scripting to strip off the header/footer, save the rest as a new file, process the new file, and then do the validation. Can't exactly expect MS to have that out-of-the box (but it sure would be nice, wouldn't it?).
You can write a script task using C# to iterate through each line and pad it with the proper amount of commas to pad the data out. This assumes, of course, that all of the data aligns with the proper columns.
I.e. as you read each record, you can "count" the number of commas. Then, just append X number of commas to the end of the record until it has the correct number of commas.
Excel has an issue that causes this kind of file to be created when converting to CSV.
If you can do this "by hand" the best way to solve this is to open the file in Excel, create a column at the "end" of the record, and fill it all the way down with 1s or some other character.
Nasty, but can be a quick solution.
If you don't have the ability to do this, you can do the same thing programmatically as described above.
Why can't you just import it as a text file and set the column delimiter to "," and the row delimiter to CRLF?
