Some use cases of Seeds in DBT (Data built tool) - snowflake-cloud-data-platform

Suppose we have a CSV file in DBT with fields like I'd and First_name but name is in Lower case ...
While loading the data covert it to upper case without using any extra space or creating macros for the same .

You are looking for the quote_columns config for seeds. From the docs:
An optional seed configuration, used to determine whether column names in the seed file should be quoted when the table is created.
The default is to not quote on Snowflake; Snowflake then converts your field names to SHOUT_CASE. If you set quote_columns=True (see the docs on how to do this for your specific seed or all seeds), Snowflake will respect the case of the quoted name that gets passed to it.

Related

Macro that loads multiple datasets that updates and change names?

Im working with a database in SAS that updates every so often. I want the macro to automatically load the most recent dataset of a given year. The datasets cover the years 2015-2018 and each year has a different updated version which is stated in the name of the dataset, i.e. 2015_version9. With my current code you need to update the macro manually everytime a dataset change its version and name.
You can scan through each library and find the max version number, then save those to a single macro variable string that you can supply to a set statement. Here are the assumptions of this solution:
Your libraries are named lib_2015, lib_2016, etc. and follow 8-char libname requirements
Your libraries are static for years 2015-2018
Your datasets are named _version1, _version2, etc.
Here's how we'll do it.
%let libraries = "LIB_2015", "LIB_2016", "LIB_2017", "LIB_2018";
proc sql noprint;
select cats(libname, '.', memname)
, input(compress(memname,,'KD'), 8.) as version
into :data separated by ' '
from dictionary.members
where upcase(libname) IN(&libraries.)
AND upcase(memname) LIKE "^_VERSION%" escape '^'
group by libname
having version = max(version)
;
quit;
data want;
set &data. indsname=name;
dsn = name;
run;
This code does the following:
Gets all dataset names from each library that starts with _VERSION. The ^ in the like clause is an escape character that we defined so that we can match _ literally.
Removes all non-digits from the dataset name and converts it to a version number, version. The KD option in the compress() function says to keep only digits from the string.
Keeps only names in each library where version is the highest value
Saves all the dataset names to a single macro variable, &data
&data will store a string of all the relevant datasets you want with the highest version number for each library. For example:
%put &data.;
LIB_2015._VERSION9 LIB_2016._VERSION19 LIB_2017._VERSION12 LIB_2018._VERSION8
The indsname option in the data step will store the full dataset name of each observation. We're saving that to a variable named dsn. This shows where each observation comes from so you can split them out to individual datasets as needed.

Snowflake:Export data in multiple delimiter format

Requirement:
Need the file to be exported as below format, where gender, age, and interest are columns and value after : is data for that column. Can this be achieved while using Snowflake, if not is it possible to export data using Python
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: Solution as given by #demircioglu
It displays as NULL values instead of other column values
Below the EMPLOYEES table data
When I ran below query
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES ;
Create your SQL with the desired format and write it to a file
COPY INTO #~/stage_data
FROM
(
SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
File format here is not important because each line will be treated as a field because of your delimiter requirements
Edit:
CONCAT function (aliased with ||) returns NULL if you have a NULL value.
In order to eliminate NULLs you can use NVL2 function
So your SQL will have series of NVL2s
NVL2 checks the first parameter and if it's not NULL returns first expression, if it's NULL returns second expression
So for User column
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
P.S. I am leaving up to you to create the rest of the SQL, because Stackoverflow's function is to help find the solution, not spoon feed the solution.
No, I do not believe multiple delimiters like this are supported in Snowflake at this time. Multiple byte and multiple character delimiters are supported, but they will need to be specified as the same delimiter repeated for either record or line.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this. Or even SQL transformative statements. This is not really my area of expertise so if someone has an example for you, I'll let them add to the discussion.

Configuring Fairness Monitoring for categorical features

When configuring fairness monitoring for a model, the prediction column only allows for integer numerical value even though the prediction label are categorical, is this an intended feature? How do I configure this for categorical feature (that is not integer)? is a manual conversion required?
The training data could have class labels such as “Loan Denied”, “Loan Granted”. The prediction value returned by WML scoring end point has values such as “0.0”, “1.0", etc. The scoring end point also has an optional column which contains the text representation of prediction. E.g., if prediction=1.0, the predictionLabel column could have a value “Loan Granted”. If such a column is available, then while configuring the favourable and unfavourable outcome for the model, you can specify the string values “Loan Granted” and “Loan Denied”. If such a column is not available then you need to specify the integer/double values of 1.0, 0.0 for the favourable/unfavourable class.
WML has a concept of output schema which defines the schema of the output of WML scoring end point and the role for the different columns. The roles are used to identify which column contains the prediction value, which column contains the prediction probability, and the class label value, etc. The output schema is automatically set for models created using model builder. It can also be set using the WML python client. Users can use the output schema to define a column which contains the string representation of the prediction. This is done by setting the modeling_role for the column to ‘decoded-target’. The documentation for the WML python client is available at: http://wml-api-pyclient-dev.mybluemix.net/#repository. Search for “OUTPUT_DATA_SCHEMA” to understand the output schema and the API to use is to store_model API which accepts the OUTPUT_DATA_SCHEMA as a parameter.

How to Dynamically render Table name and File name in pentaho DI

I have a requirement in which one source is a table and one source is a file. I need to join these both on a column. The problem is that I can do this for one table with one transformation but I need to do it for multiple set of files and tables to load into another set of specific files as target using the same transformation.
Breaking down my requirement more specifically :
Source Table Source File Target File
VOICE_INCR_REVENUE_PROFILE_0 VoiceRevenue0 ProfileVoice0
VOICE_INCR_REVENUE_PROFILE_1 VoiceRevenue1 ProfileVoice1
VOICE_INCR_REVENUE_PROFILE_2 VoiceRevenue2 ProfileVoice2
VOICE_INCR_REVENUE_PROFILE_3 VoiceRevenue3 ProfileVoice3
VOICE_INCR_REVENUE_PROFILE_4 VoiceRevenue4 ProfileVoice4
VOICE_INCR_REVENUE_PROFILE_5 VoiceRevenue5 ProfileVoice5
VOICE_INCR_REVENUE_PROFILE_6 VoiceRevenue6 ProfileVoice6
VOICE_INCR_REVENUE_PROFILE_7 VoiceRevenue7 ProfileVoice7
VOICE_INCR_REVENUE_PROFILE_8 VoiceRevenue8 ProfileVoice8
VOICE_INCR_REVENUE_PROFILE_9 VoiceRevenue9 ProfileVoice9
The table and file names are always corresponding i.e. VOICE_INCR_REVENUE_PROFILE_0 should always join with VoiceRevenue0 and the result should be stored in ProfileVoice0. There should be no mismatches in this case. I tried setting the variables with table names and file names, but it only takes on value at a time.
All table names and file names are constant. Is there any other way to get around this. Any help would be appreciated.
Try using "Copy rows to result" step. It will store all the incoming rows (in your case the table and file names) into a memory. And for every row, it will try to execute your transformation. In this way, you can read multiple filenames at one go.
Try reading this link. Its not the exact answer, but similar.
I have created a sample here. Please check if this is what is required.
In the first transformation, i read the tablenames and filenames and loaded it in the memory. After that i have used the get variable step to read all the files and table names to generate the output. [Note: I have not used table input as source anywhere, instead used TablesNames. You can replace the same with the table input data.]
Hope it helps :)

How to retrieve the name of a file and store it in the database using SSIS package?

I'm doing an Excel loop through fifty or more Excel files. The loop goes through each Excel file, grabs all the data and inputs it into the database without error. This is the typical process of setting delay validation to true, and making sure that the expression for the Excel Connection is a string variable called EFile that is set to nothing (in the loop).
What is not working: trying to input the name of the Excel file into the database.
What's been tried (edit; SO changed my 2 to 1 - don't know why):
Add a derived column between the Excel file and database input, and add a column using the EFile expression (so under Expression in the Derived Column it would be #[User::EFile]). and add the empty. However, this inputs nothing a blank (nothing).
One suggestion was to add ANOTHER string variable and set its properties EvaluateAsExpression to True and set the Expression to the EFile variable (#[User::EFile]). The funny thing is that this does the same thing - inputs a blank into the database.
Numerous people on blogs claim they can do this, yet I haven't seen one actually address this (I have a blog and I will definitely be showing people how to do this when I get an answer because, so far, these others have fallen short). How do I grab an Excel file's name and input it in a database during a loop?
Added: Forgot to add, no scripts; the claim is that it can be done without them, so I want to see the solution without them.
Note: I already have the ability to import the data from the Excel files - that's easy (see my GitHub account, as I have two different projects for importing all sorts of txt, csv, xls, xlsx data). I am trying to also get the actual name of the file being imported also into the database. So, if there are fifty Excel files, along with the data in each file, the database will have the fifty file names alongside that data (so if each file has 1000 rows of data, each 1000 rows would also have the name of the file they came from next to them as an additional column). This point seems to cause a lot of confusion, as people assume I'm having trouble importing data in files - NOPE, see my GitHub; again that's easy. It's the FILENAME that needs to also be imported.
Test package: https://github.com/tmmtsmith/SSISLoopWithFileName
Solution: #jaimet pointed out that the Derived Column needed to be the #[User::CurrentFile] (see the test package). When I first ran the package, I still got a blank value in my database. But when we originally set up the connection, we do point it to an actual file (I call this "fooling the package"), then change the expression on the connecting later to the #[User::CurrentFile], which is blank. The Derived Column, using the variable #[User::CurrentFile], showed a string of 0. So, I removed the Derived Column, put the full file path and name in the variable, then added the variable to the Derived Column (which made it think the string was 91 characters long), then went back and set the variable to nothing (English teacher would hate the THENs about right now). When I ran the package, it inputted the full file path. Maybe, like the connection, it needs to initially think that a file exists in order for it to input the full amount of characters?
Appreciate all the help.
The issue is because of blank value in the variable #[User::FileNameInput] and this caused the SSIS package to assume that the value of this variable will always be of zero length in the Derived Column transformation.
Change the expression on the Derived column transformation from #[User::FileNameInput] to (DT_STR, 2000, 1252)#[User::FileNameInput].
Type casting the derived column to 2000 sets the column length to that maximum value. The value 1252 represents the code page. I assumed that you are using ANSI code page. I took the value 2000 from your table definition because the FilePath column had variable VARCHAR(2000). If the column data type had been NVARCHAR(2000), then the expression would be (DT_WSTR, 2000)#[User::FileNameInput]
Tim,
You're using the wrong variable in your Derived Column component. You are storing the filename in #[User::CurrentFile] but the variable that you're using in your Derived Column component is #[User::FileNameInput]
Change your Derived Column component to use #[User::CurrentFile] and you'll be good.
Hope that helps.
JT
If you are using a ForEach loop to process the files in a folder then I have have used the technique described in SSIS Junkie's blog to get the filename in to an SSIS variable: SSIS: Enumerating files in a Foreach loop
You can use the variable later in your flow to write it to the database.
TO all intents and purposes your method #1 should work. That's exactly how I would attempt to do it. I am baffled as to why it is not working. Could you perhaps share your package?
Tony, thanks very much for the link. Much appreciated.
Regards
Jamie

Resources