Query staged files in Snowflake

I am interested in doing a bulk load and adding the loaded file name to a column in the target table. The file name contains the timestamp of when the file was created, which lets us capture the insert timestamp as a column.
I know I can do something like
COPY INTO MYTABLE(FILENAME, FILE_ROW_NUMBER, COL1, COL2)
FROM (SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, T.$1, T.$2
      FROM @MYSTAGE (FILE_FORMAT => MYFORMAT) T);
But this works for a single file. The files do not all follow the same naming pattern, so I cannot use the PATTERN option. I am looking for something similar to COPY INTO where I can specify a list of files.
One option is to filter the files myself, but that does not seem very efficient if I have many files. Looking for options and suggestions.
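A hedged sketch of one option: the FILES copy option accepts an explicit list of staged file names and can be combined with the transformation subquery, so PATTERN is not needed (the file names below are placeholders):
COPY INTO MYTABLE(FILENAME, FILE_ROW_NUMBER, COL1, COL2)
FROM (SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, T.$1, T.$2
      FROM @MYSTAGE T)
FILES = ('load_20230101.csv', 'extract_20230102.csv')
FILE_FORMAT = (FORMAT_NAME = 'MYFORMAT');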

Related

Where is the option to load a CSV into Snowflake? I'm not seeing it

I'm testing out a trial version of Snowflake. I created a table and want to load a local CSV called "food" but I don't see any "load" data option as shown in tutorial videos.
What am I missing? Do I need to use a PUT command somewhere?
I don't think Snowsight has that option in the UI. It's available in the classic UI, though: go to the Databases tab, select a database, go to the Tables tab, and select a table; the option will be at the top.
If the classic UI is limiting you or you are already using Snowsight and don't want to switch back, then here is another way to upload a CSV file.
A prerequisite is that you have SnowSQL installed on your device (https://docs.snowflake.com/en/user-guide/snowsql-install-config.html).
Start SnowSQL and perform the following steps:
Switch to the database you want to upload the file to. You need privileges for creating a stage, a file format, and a table, e.g. USE MY_TEST_DB;
Create the file format you want to use for uploading your CSV file, e.g.
CREATE FILE FORMAT "MY_TEST_DB"."PUBLIC".MY_FILE_FORMAT TYPE = 'CSV';
If you don't configure the RECORD_DELIMITER, the FIELD_DELIMITER, and the other options, Snowflake uses defaults. I suggest having a look at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html. Some of the auto-detection behaviour can make your life hard, and sometimes it is better to disable it.
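For example, a more explicit version of the file format above (the delimiter and header settings here are assumptions; adjust them to your file):
CREATE OR REPLACE FILE FORMAT "MY_TEST_DB"."PUBLIC".MY_FILE_FORMAT
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  EMPTY_FIELD_AS_NULL = TRUE;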
Create a stage using the previously created file format:
CREATE STAGE MY_STAGE file_format = "MY_TEST_DB"."PUBLIC".MY_FILE_FORMAT;
Now you can PUT your file into this stage:
PUT file://<file_path>/file.csv @MY_STAGE;
You can find documentation for configuring the stage at https://docs.snowflake.com/en/sql-reference/sql/create-stage.html
You can check the upload with
SELECT d.$1, ..., d.$N FROM @MY_STAGE/file.csv d;
Then, create your table.
CREATE TABLE MY_TABLE (col1 varchar, ..., colN varchar);
Personally, I prefer creating a table with only varchar columns first and then creating a view or a table with the final types. I love the TRY_TO_* functions in Snowflake (e.g. https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html).
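A minimal sketch of that second step, assuming a hypothetical col1 holding numbers and col2 holding dates in the all-varchar table:
CREATE VIEW MY_TABLE_TYPED AS
SELECT TRY_TO_DECIMAL(col1, 18, 2) AS col1,
       TRY_TO_DATE(col2, 'YYYY-MM-DD') AS col2
FROM MY_TABLE;
Values that don't parse come back as NULL instead of failing the query, which makes bad rows easy to find afterwards.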
Then, copy the content from your stage to your table. If you want to transform your data at this point, you have to use an inner SELECT. If not, then the following command is enough.
COPY INTO MY_TABLE FROM @MY_STAGE/file.csv;
I suggest doing this without the inner SELECT, because then the ERROR_ON_COLUMN_COUNT_MISMATCH option works.
Be aware that the schema of the table must match the file format. As mentioned above, if you go with all-varchar columns first and then transform the columns of interest in a second step, you should be fine.
You can find documentation for copying the staged file into a table at https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
You can check the dropped (rejected) lines as follows:
SELECT error, line, character, rejected_record
FROM TABLE(VALIDATE("MY_TEST_DB"."PUBLIC"."MY_TABLE", job_id => 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'));
Details can be found at https://docs.snowflake.com/en/sql-reference/functions/validate.html.
If you want to add those lines to your success table, you can copy the dropped lines to a new table and transform the data until the schema matches the schema of the success table. Then, you can UNION both tables.
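A rough sketch of that flow (the table name MY_REJECTS is made up; job_id => '_last' refers to the most recent COPY in the session):
CREATE TABLE MY_REJECTS AS
SELECT rejected_record
FROM TABLE(VALIDATE(MY_TABLE, job_id => '_last'));
-- repair/split rejected_record into col1, ..., colN, then:
-- INSERT INTO MY_TABLE SELECT ... FROM MY_REJECTS;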
You can see that there is quite a lot to do just to load a simple CSV file into Snowflake. It becomes even more complicated when you take into account that every step can cause specific failures and that your file might contain erroneous lines. This is why my team and I at Datameer are working to make these kinds of tasks easier. We aim for a simple drag-and-drop solution that does most of the work for you. We would be happy if you tried it out here: https://www.datameer.com/upload-csv-to-snowflake/

Snowflake: Turning staged S3 URL into the columns for a table

I've staged some data from S3 into Snowflake, which I want to COPY into a table. However, I want some of the columns in the table to be values from the URL path of the staged data. For example:
The data is stored like this: s3://bucket1/subbucket1/object_ID/instance/type/file.json
I want to store the data in a table that looks like this:
object_ID | instance | type  | values from file (JSON)
2222      | 3333     | type1 | {JSON}
The only way I've found to filter files on a COPY INTO command is the PATTERN option, which lets you copy only the files matching a regex. Using it, I've only been able to pull back certain files, and the resulting table still ends up with the full path as a single value rather than split into columns.
I assume you are using the METADATA$FILENAME metadata column in your COPY INTO command? Have you tried parsing it directly as part of your SELECT?
Something along the lines of:
COPY INTO ...
FROM (SELECT SPLIT_PART(METADATA$FILENAME, '/', 5),
             SPLIT_PART(METADATA$FILENAME, '/', 6),
             SPLIT_PART(METADATA$FILENAME, '/', 7),
             $1
      FROM @STAGE);
I haven't tried this to see if it works in a COPY command, but it definitely works when selecting directly against the stage, so I'd imagine it'll work for the COPY as well. If it doesn't, let me know.
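For example, a quick way to sanity-check the parsing before running the COPY (a sketch only: the part indices depend on where your stage points within the bucket, and MY_JSON_FORMAT is a placeholder file format name):
SELECT METADATA$FILENAME,
       SPLIT_PART(METADATA$FILENAME, '/', 5) AS object_id,
       SPLIT_PART(METADATA$FILENAME, '/', 6) AS instance,
       SPLIT_PART(METADATA$FILENAME, '/', 7) AS type,
       $1
FROM @STAGE (FILE_FORMAT => 'MY_JSON_FORMAT')
LIMIT 10;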

Mass import txt files in a single SQL Server table, using filename as key column

I have a folder of txt files. The filenames are of the form [integer].txt (like 1.txt, 2.txt and so on).
I have a table, let's say TableA (id int not null, contents varchar(max))
I want a way to mass import the contents of those files into TableA, populating the id column from the filename. Each file will be a single record in the table. It's not a delimited file.
I've looked into SSIS and flat-file source, but I could not find a way to select a folder instead of a single file (this answer claims it can be done, but I could not find out how).
Bulk Insert is my next bet, but I'm not sure how I can populate the id column with the filename.
Any ideas?
For anyone that might need it, I ended up solving this by:
- Using a Foreach Loop container (thanks for the hint, @Panagiotis Kanavos)
- Using a flat-file source, setting as row and column delimiters a sequence I knew didn't exist in the file (for example '$$$')
- Assigning the filename to a variable, and the full path to a computed variable (check this great post on how to assign the variables)
- Using a derived column to pass the filename in the output (check out this answer)
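For reference, the BULK INSERT route mentioned in the question can also work if the statement is built per file. A rough T-SQL sketch, where the folder path is an assumption and @id would in practice come from the loop over file names:
DECLARE @id INT = 1;  -- taken from the file name, e.g. 1.txt
DECLARE @sql NVARCHAR(MAX) =
    N'INSERT INTO TableA (id, contents)
      SELECT ' + CAST(@id AS NVARCHAR(10)) + N', BulkColumn
      FROM OPENROWSET(BULK ''C:\files\' + CAST(@id AS NVARCHAR(10)) + N'.txt'', SINGLE_CLOB) AS f;';
EXEC sp_executesql @sql;
OPENROWSET(BULK ..., SINGLE_CLOB) reads the whole file as one value (the BulkColumn column), which matches the one-record-per-file requirement; the loop over file names still has to come from outside, e.g. the Foreach loop above.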

load matching files in table

I have a folder that contains multiple .csv files, one per employee, named like empname_date.csv, and I want to load the files into one table.
Not all files, though: only the files whose name matches an employee in the tbl_empmaster table, which contains the master list of employees.
I do not want to check each file individually because it would take too much time. I need to filter the files against the master list and then load only the matching employee files.
Please advise what I can do in this case.
I am using SSIS to do this.
Create an SSIS package with a Foreach Loop Container to read all the CSV files in the given folder.
Read the file name without the extension into a variable and, before inserting, perform a table lookup to see whether that file name matches an employee in tbl_empmaster; insert only if a match is found. A sketch of such a lookup query follows.
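A hedged sketch of the lookup that an Execute SQL Task or Lookup component could run, assuming tbl_empmaster has an empname column and the SSIS variable holds the file name without its extension (e.g. 'jsmith_20240101'):
SELECT COUNT(*) AS match_found
FROM tbl_empmaster
WHERE ? LIKE empname + '_%';  -- parameter = file-name variable; 'jsmith_20240101' matches empname 'jsmith'
Route the flow to the data-load step only when match_found is greater than zero.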

Need to map csv file to target table dynamically

I have several CSV files and their corresponding tables (which have the same columns as the CSVs, with appropriate data types) in the database, each table having the same name as its CSV. So every CSV has a table in the database.
I somehow need to map them all dynamically. Once I run the mapping, the data from all the CSV files should be transferred to the corresponding tables. I don't want to have a different mapping for every CSV.
Is this possible through informatica?
Appreciate your help.
PowerCenter does not provide such a feature out of the box. Unless the structures of the source files and target tables are the same, you need to define separate source/target definitions and create mappings that use them.
However, you can use Stage Mapping Generator to generate a mapping for each file automatically.
My understanding is that you have many CSV files with different column layouts and you need to load them into the appropriate tables in the database.
Approach 1: If you use any RDBMS, you should have some kind of import option. Explore that route to create tables based on the CSV files. This is a manual task.
Approach 2: Open the CSV file and write a formula using the header to generate a CREATE TABLE statement. Execute the formula's result in your DB, so that all the tables are created. Then use Informatica to read the CSVs, import the table definitions, and load the data into the tables.
Approach 3: Use Informatica alone. You need to do a lot of coding to create a dynamic mapping on the fly.
Proposed solution:
Mapping 1:
1. Read the CSV file and pass the header information to a Java transformation.
2. The Java transformation should normalize and split the header columns into rows; you can write them to a text file.
3. Now you have all the columns in a text file. Read this text file and use a SQL transformation to create the tables in the database.
Mapping 2:
Now that the table is available, read the CSV file excluding the header and load the data into the table above via a SQL transformation (INSERT statement) created by Mapping 1.
You can follow this approach for all the CSV files. I haven't tried this solution myself, but I am sure the above approach would work. A sketch of the kind of SQL these transformations would issue follows.
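For illustration only (the file, table, and column names below are made up), the statements generated and executed by the SQL transformations might look like:
-- produced by Mapping 1 from the header row of employees.csv
CREATE TABLE employees (emp_id VARCHAR(255), emp_name VARCHAR(255), hire_date VARCHAR(255));
-- produced by Mapping 2 for each data row
INSERT INTO employees (emp_id, emp_name, hire_date) VALUES ('101', 'J. Smith', '2020-01-15');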
If you're not using any transformations, it's wise to use the import option of the database (e.g. a BTEQ script in Teradata). But if you are doing transformations, then you have to create as many sources and targets as the number of files you have.
On the other hand, you can achieve this in one mapping:
1. Create a separate flow for every file (i.e. Source-Transformation-Target) in the single mapping.
2. Use the target load plan to choose which file gets loaded first.
3. Configure the file names and corresponding database table names in the session for that mapping.
If all the mappings (if you have to create them separately) are the same, use the Indirect file method. In the session properties, under the Mappings tab, source options, you will find this setting; the default is Direct, change it to Indirect.
I don't have the tool at hand right now to explore further and guide you precisely, but look into the Indirect file load type in Informatica. I am sure it will solve the requirement.
I have written a workflow in Informatica that does it, but some of the complex steps are handled inside the database. The workflow watches a folder for new files. Once it sees all the files that constitute a feed, it starts to process the feed. It takes a backup in a time stamped folder and then copies all the data from the files in the feed into an Oracle table. An Oracle procedure gets to work and then transfers the data from the Oracle table into their corresponding destination staging tables and finally the Data Warehouse. So if I have to add a new file or a feed, I have to make changes in configuration tables only. No changes are required either to the Informatica Objects or the db objects. So the short answer is yes this is possible but it is not an out of the box feature.
