How to load a table with multiple zip files in Snowflake? - snowflake-cloud-data-platform

I'm trying to upload data to a Snowflake table using a zip file containing multiple CSV files, but I keep getting the following message:
Unable to copy files into table. Found character '\u0098' instead of
field delimiter ',' File 'tes.zip', line 118, character 42 Row 110,
column "TEST"["CLIENT_USERNAME":1] If you would like to continue
loading when an error is encountered, use other values such as
'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more
information on loading options, please run 'info loading_data' in a
SQL client.
If I skip the errors, some data loads, but it looks as if Snowflake is not properly opening the zip file: I just get random characters, as if the zip file had been opened in Notepad.
I tried changing the File Format Compression Method to every available option: Auto, Gzip, Deflate, Raw Deflate, Bz2, Brotli, Zstd and None. Each gives a different error message.
I know my Zip file is compressed using the standard Deflate compression method but when I select this type I'm getting the following error:
Invalid data encountered during decompression for file: 'test.zip',compression type used: 'DEFLATE', cause: 'data error'
The "Auto" method gives the same error message as None.
I also tried zip files containing only one file and got the same errors. The only files that loaded correctly were an uncompressed CSV and one compressed with gzip (.gz), but I need this to work with a zip file containing multiple CSVs.

A zip file is not a DEFLATE file, even though zip uses deflate internally. All of the supported compression methods are single-file compression methods, whereas zip is a file archive format, which is why it can hold many files. In that respect it is similar to tar.gz, which is also not supported.
Thus you will either need to uncompress your files yourself, in your S3 bucket, or alter your data export tool to conform.
See the CREATE FILE FORMAT help for the supported compression types.
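One way to "uncompress the files yourself" before staging is to repack each CSV inside the zip as its own gzip file, since the question reports that a .gz file loaded correctly. The following is a minimal sketch using only the Python standard library; the function name `zip_to_gzip_files` and the output layout are illustrative assumptions, not part of any Snowflake tooling.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_to_gzip_files(zip_path, out_dir):
    """Write every CSV member of a zip archive as its own .csv.gz file.

    Hypothetical helper: returns the list of gzip files it created.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a CSV.
            if info.is_dir() or not info.filename.lower().endswith(".csv"):
                continue
            # Flatten any folder structure inside the archive.
            target = out_dir / (Path(info.filename).name + ".gz")
            # Stream member -> gzip so large files are not held in memory.
            with archive.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

The resulting `.csv.gz` files are single-file compressed streams, so they can be staged and loaded with the Gzip (or Auto) compression setting.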

Related

Pervasive SQL(10.3) File size exceeding 2GB resulting in a .^01 file being created

We have a database with a data file exceeding 2 GB. This resulted in a .^01 file being generated with the same file name, so we now have a .DAT file and a .^01 file with the same name.
I have subsequently deleted the unnecessary data (old history, no longer required) and the .DAT file is now only 372 MB, but the .^01 file remains.
I would like to clone the .DAT file, save the data, and reload it into the cloned (blank) file. I normally use BUTIL (Clone, Save and Load), but I am unsure what to do with the .^01 file, because BUTIL -SAVE FileName.^01 FileName.seq returns an error as it does not recognise the ^:
BUTIL-14: The file that caused the error is FileName.01.
BUTIL-100: MicroKernel error = 12. The MicroKernel cannot find the specified file.
I would greatly appreciate some direction/input in this regard
Thank you and kind regards,
You don't need to do anything with the .^XX file(s). They are called Extended files and are automatically handled by the PSQL engine. A BUTIL -CLONE / -COPY will read all of the data (original file and extended file(s)) and copy it to the new file.
To rebuild it, you should do something like:
BUTIL -CLONE <NEWFILE.DAT> <OLDFILE.DAT>
BUTIL -COPY <OLDFILE.DAT> <NEWFILE.DAT>
Also, if the file grows above 2GB again, the Extended File (.^01) will come back.

how to get the type of the file before its compression

For example, suppose we have a file file.txt that after compression becomes file.new (new is the new extension). How can I recover the .txt extension, which has been lost?
I need that to decompress the file.
In general, if you lose the file name extension you can't get it back. It's as simple as this.
However, there may be a chance depending on the compression format. Some formats store the original file name (along with other information) in the compressed file, and the decompressor will be able to recreate those properties.
In any case, it is good practice to name a compressed file with an additional extension, in your case file.txt.new.
Also, you don't need to know the file name extension to uncompress the compressed file. Just uncompress it and give it a temporary name. As @MarcoBonelli said, file contents and file name extensions have no fixed relation. Extensions are just a convention for handling files conveniently.
For example, you can rename an EXE to DOCX. Windows will show the Word icon, but the file is still an executable. Windows will not attempt to run it, though.
To know what a file contains can be difficult. The magic number Marco linked to might give you some hint.
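As a rough illustration of the magic-number idea, the sketch below compares the first bytes of a file (or of already decompressed data) against a handful of well-known signatures. The table is deliberately short and not exhaustive, the names `SIGNATURES`, `sniff` and `sniff_file` are made up for this example, and plain text files have no signature at all, so a result of `None` is common.

```python
# Leading byte signatures of a few common formats (not exhaustive).
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"MZ", "windows executable"),
]


def sniff(data):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as fh:
        return sniff(fh.read(16))
```

After decompressing to a temporary name, calling `sniff_file` on the result can suggest which extension to give it.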

How would I store different types of data in one file

I need to store data in a file in this format
word, audio, jpeg
How would I store all of that in one file? Is it even possible, or would I need to store links to other data files in place of the audio and jpeg? Would I need a custom file format?
1. Your own filetype
As mentioned by @Ken White, you would need to create your own custom file format for this, which also means writing your own parser for it. This can be done in almost any language, but since you plan to use the Word format, C# may suit you best. This approach can be quite complicated, and thoroughly testing your own packer and unpacker takes a relatively large amount of time, but it may be the best option depending on your needs.
2. Command line utilities
Another approach is to use a script to concatenate all of the files into one file, and then split it apart again at the other end. For example, the steps could be:
Combine the files using the Windows copy or Linux cat command on the command line.
Create a metadata file of your own that records how many files are in the custom file and how many bytes each one occupies (a short XML or JSON file, for example).
Use the Linux split command, or install a Windows command-line file splitter, to split the file back into the components that made it up.
This way you only have to define a very small file format of your own, and the OS utilities handle combining and splitting the files for you.
Example on Windows:
Copy all of the files in your current directory into one output file called 'file.custom'
copy /b * file.custom
Generate a metadata file describing the contents (for example, each file's name and its size on disk, which can be read in C#). I would probably write this as JSON.
Use a Windows or Linux command-line splitting tool to extract each file at the exact length, and under the exact name, specified in the JSON metadata file.
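The concatenate-plus-metadata idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the functions `pack` and `unpack` are hypothetical, and instead of a separate metadata file the JSON index is written at the start of the container, preceded by a 4-byte length, so that everything travels in one file.

```python
import json
import struct
from pathlib import Path


def pack(paths, container):
    """Concatenate files into one container with a JSON index up front."""
    index, blobs = [], []
    for p in map(Path, paths):
        data = p.read_bytes()  # whole file in memory; fine for a sketch
        index.append({"name": p.name, "size": len(data)})
        blobs.append(data)
    header = json.dumps(index).encode("utf-8")
    with open(container, "wb") as out:
        out.write(struct.pack("<I", len(header)))  # 4-byte header length
        out.write(header)
        for blob in blobs:
            out.write(blob)


def unpack(container, out_dir):
    """Split a container back into files using the sizes in its index."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(container, "rb") as src:
        (header_len,) = struct.unpack("<I", src.read(4))
        index = json.loads(src.read(header_len).decode("utf-8"))
        for entry in index:
            (out_dir / entry["name"]).write_bytes(src.read(entry["size"]))
    return [entry["name"] for entry in index]
```

Storing sizes rather than delimiters means binary content such as audio or JPEG data cannot be confused with a separator.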
3. ZIP files
You could always store all of the files in a compressed zip file, and then use a zip compressor/expander as and when you like to retrieve any of the file formats stored within.
I found a few examples:
Combining multiple files into one ZIP file in only C# .net,
Unzipping ZIP files in C#
Zipping & Unzipping with only windows built-in utilities
Zipping & Unzipping in Linux command line
Good Zipping/Unzipping library in Java
Zipping/Unzipping in Python
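For the Python case, a minimal sketch with the standard `zipfile` module might look like the following. The helper names `bundle` and `read_member` are invented for this example; the archive member names are whatever you choose.

```python
import zipfile


def bundle(container, members):
    """Write a zip file; members maps archive names to bytes."""
    with zipfile.ZipFile(container, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)


def read_member(container, name):
    """Return the bytes of a single member without extracting the rest."""
    with zipfile.ZipFile(container) as archive:
        return archive.read(name)
```

Because each member keeps its own name, the word, audio and jpeg parts can be read back individually and in any order.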

jMimeMagic returning mime type for docx, pptx, jar files as application/zip

I read that the MIME type for a .docx file is application/vnd.openxmlformats-officedocument.wordprocessingml.document. But when I upload a .docx file (one that I just created, not one taken from a zip file) and check its MIME type in my application using
String mimeType = Magic.getMagicMatch(file1, false).getMimeType();
I get Mimetype as application/zip.
I get the same result when I try to upload a .jar file.
Given this, how can I check whether the user is uploading an MS Word file or a jar file to my application?
All of the .*x Office variants (.docx, .pptx, and so on) are XML-based content which is wrapped in a ZIP "container" to keep them compact, and your library is detecting the ZIP header correctly but then either not checking for, or failing to find, the additional information that would allow it to distinguish those from a ZIP file containing whatever random data someone put into it.
Similarly, the JAR file format is an extension of the ZIP file format, so if the library does not know to check for the "special type of ZIP" case, it would simply report it as a ZIP file.
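The "special type of ZIP" check described above can be approximated by looking at the entry names inside the archive. The sketch below is in Python for brevity (the same idea works in Java with `java.util.zip.ZipFile`); the function name is hypothetical, and the entry names tested are the conventional locations written by Office and by the jar tool, so treat them as heuristics rather than a complete detector. In particular, a JAR is not required to contain a manifest.

```python
import zipfile

OOXML_PREFIX = "application/vnd.openxmlformats-officedocument."


def classify_zip_container(path):
    """Guess a more specific MIME type for a ZIP-based file, or None if not a ZIP."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
    # Office Open XML packages carry a [Content_Types].xml part.
    if "[Content_Types].xml" in names:
        if "word/document.xml" in names:
            return OOXML_PREFIX + "wordprocessingml.document"
        if "ppt/presentation.xml" in names:
            return OOXML_PREFIX + "presentationml.presentation"
        if "xl/workbook.xml" in names:
            return OOXML_PREFIX + "spreadsheetml.sheet"
    # JAR files usually (not always) contain a manifest.
    if "META-INF/MANIFEST.MF" in names:
        return "application/java-archive"
    return "application/zip"
```

A result of `application/zip` then means only that none of the known markers were found, not that the upload is safe or is a generic archive.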

Detecting the database a .DAT file belongs to

I have a set of .DAT files present along side a set of .IDX files with the same name.
The goal is to open these files and read their contents, parsing them into a new format. The problem: I have no idea what database the data is stored in! The files contain no headers or clues, they are binary, and the source I received them from has no idea what the storage mechanism is.
So the question is: what are some common databases that store their data in .DAT files and their indexes in .IDX files with the same name? Is there an application I can use on Linux or Windows that can detect the database?
EDIT :-
File names:
price.dat
price.idx
Here is a hex dump of the beginning of the .DAT file:
030D04806420500FFE3E0500002078581001C000738054E0C0099804138100402550080442090082403C101F7406010080C0A010201002010C006FC0246C0403FE00B041C051F0091BFE042F812FE054F8177E066F81BFE078F8207E08AF824FE09CF8297E0AEF82DFE0C0F8327E0D2F836FE0E4F83B7E0F6F83FE5FEFF47C06608480FA91F003C0213101F1BFDFE804220100F500D2A00388430801E04028D4390D128B46804024010A067269FCA546003C0844060E11F084B9E1377850
Here is a hex dump of the beginning of the .IDX file:
030D04805820100FFD7E0000397FEB60050410007300246A3060068220009BE0401030088B3903F740E010C80402410281402030094004C708004DC058880FFC052F015EBFE042F812FE054F8177E066F81BFE078F8207E08AF824FE09CF8297E0AEF82DFE0C0F8327E0D2F836FE0E4F83B7E0F6F83FFE108F8447E11AF848FE12CF84D7E13EF851FE150F8567E162F85AFE174F85F7E186F863FE198F8687E1AAF86CFE1BCF8717E1CEF875FE1E0F87A7E1F2F87EF5FEFF005E30901714
Both files start with the same bytes, 030D0480 (the .DAT file continues with 6420500FF). I wonder if this is a good starting point?
Did a quick search on Google but it didn't return anything...
END EDIT :-
Any other ideas?
Thanks much in advance!
There is a FairCom ODBC driver called 'ctreeODBC_RO.exe' which should be capable of reading these files.

Resources