How to modify the projection of a dataset in an ADF Data Flow

I want to optimize my data flow by reading only the data I really need.
I created a dataset that maps a view on my database. This dataset is used by several data flows, so it needs a generic projection.
Now I am creating a new data flow, and I want to read just a subset of the dataset.
Here is how I created the dataset:
And this is the generic projection:
Here is how I created the data flow. These are the source settings:
But now I want just a subset of my dataset:
It works, but I think I am doing it wrong:
I want to read data from my dataset (as you can see in the Source settings tab), but when I modify the projection, I end up reading from the underlying table (as you can see in the Source options). This seems inconsistent. What is the correct way to manage this kind of customization?
Thank you
EDIT
The proposed solution does not solve my problem. If I go into Monitor and analyze the executions, this is what I see.
Before, using the approach I wrote above, I got this:
As you can see, I read just 8 columns from the database.
With the proposed solution, I get this:
And then:
Just to be clear, the purpose of my question is:
How can I read only the data I really need, instead of reading all the data and filtering it afterwards?
I found a way (explained in my question), but it creates an inconsistency in the data flow configuration (I set a dataset as the input, but in the source options I write a query that reads from the database).

First, import the data as a source.
You can use the Select transformation in the Data Flow activity to select CustomerID from the imported dataset.
There you can remove unwanted columns.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/data-flow-select
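Note that the Select transformation removes the columns inside the data flow, after they have already been read. To avoid reading the unneeded columns from the database at all, the usual approach is the one you found: set the source to a query and project only the columns you need. A minimal sketch, assuming hypothetical view and column names:

-- Hypothetical: project only the needed columns at the source
SELECT CustomerID, FirstName, OrderDate
FROM dbo.MyView;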


Can I run or execute dbt model based on output from a SQL statement?

Background: I have a few models which are materialized as 'table'. These tables are populated with a wipe (truncate) and load. Now I want to protect the existing data in a table if the query used to populate it returns an empty result set. How can I make sure an empty result set does not replace my existing data in the table?
My table lives in Snowflake, and I am using dbt to model the output table.
In a nutshell: commit the transaction only when the SQL statement used returns a non-empty result set.
Have you tried using the dbt ref() function, which allows referencing one model within another?
https://docs.getdbt.com/reference/dbt-jinja-functions/ref
If you are loading data in a way that is not controlled via dbt and then using that table, the table is called a source. You can read more about this here.
dbt does not control what you load into a source; everything else, the T in ELT, is controlled wherever you reference a model via the ref() function. A great example, if you have a source that changes and you want to load it into a table while making sure the incoming data does not "drop" already recorded data, is "incremental" materialization. I suggest you read more here, and see the sketch below.
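A minimal sketch of an incremental model, assuming hypothetical model and column names (stg_my_source, id, payload, updated_at):

-- models/my_protected_table.sql (hypothetical)
{{ config(materialized='incremental', unique_key='id') }}

select id, payload, updated_at
from {{ ref('stg_my_source') }}

{% if is_incremental() %}
  -- on incremental runs only new or changed rows are picked up;
  -- an empty result set inserts nothing and leaves existing rows intact
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}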
Thinking incrementally takes time and practice, and it is recommended to do a --full-refresh every now and then.
You can have pre-hooks and post-hooks that check your sources with clever macros, and you can add dbt tests. We would really need a bit more context about what you have and what you wish to achieve to suggest a concrete answer.
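For instance, a pre-hook could abort the run before the table is wiped if the incoming data would be empty. A rough sketch, assuming a hypothetical macro name and relation argument; run_query, execute, and exceptions.raise_compiler_error are standard dbt Jinja:

-- macros/assert_not_empty.sql (hypothetical)
{% macro assert_not_empty(relation) %}
  {% if execute %}
    {% set result = run_query("select count(*) as n from " ~ relation) %}
    {% if result.columns[0].values()[0] == 0 %}
      {{ exceptions.raise_compiler_error(relation ~ " is empty; aborting load") }}
    {% endif %}
  {% endif %}
{% endmacro %}

The macro could then be invoked from the model's pre_hook config so the run fails before the existing table is touched.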

How to read and change series numbers in columns in SSIS?

I'm trying to manipulate a column in SSIS which looks like the sample below, after I removed unwanted rows with a Derived Column and a Conditional Split in my data flow task. The source for this is a flat file.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three digits from the left (starting with "008" in the first row) represent a series, and the next three ("001") represent a number within the series. What I need is to change all of the series numbers so they run sequentially from "001" to the end.
The desired result would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file into a temporary database table and query it with SQL from there, but I am trying to avoid this.
The final destination is a flatfile.
Does anybody have any ideas on how to pull this off in SSIS? Other solutions are also appreciated.
Thanks in advance
I would definitely use the staging-table approach and window functions to accomplish this. I could see a case for doing it in SSIS only if SSIS were on a different machine than the database engine and there were a need to offload the processing to the SSIS box.
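A minimal sketch of the staging-table query, assuming a hypothetical staging table stg_rows with a single column raw_value, and that the new series numbers should follow the order of the original ones:

-- Renumber the series (characters 4-6) sequentially from 001,
-- keeping the prefix and the within-series sequence untouched.
SELECT
    LEFT(raw_value, 3)
    + RIGHT('000' + CAST(DENSE_RANK() OVER (ORDER BY SUBSTRING(raw_value, 4, 3)) AS varchar(3)), 3)
    + SUBSTRING(raw_value, 7, LEN(raw_value) - 6) AS renumbered_value
FROM stg_rows;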
In that case I would create a Script transformation. You can process each row and make the necessary changes before passing it to the output. You can use C# or VB.
There are many examples out there. Here is an MSDN article: https://msdn.microsoft.com/en-us/library/ms136114.aspx

How to apply the same data manipulation code to a group of SSIS components' inputs?

I am new to SSIS.
I have a number of MS Access tables to transform to SQL. Some of these tables have datetime fields that need to go through some rules before landing in their respective SQL tables. I want to use a Script Component that deals with these kinds of fields, converting them to the desired values.
Since all of these fields need the same modification rules, I want to apply the same code base to all of them, thus avoiding code duplication. What would be the best option for this scenario?
I know I can't use the same Script Component and direct all of those datasets' outputs to it, because unfortunately it doesn't support multiple inputs. So the question is: is it possible to apply a set of generic data manipulation rules
to the fields of a group of different datasets without repeating the rules? I could use a Script Component for each OLE DB input and apply the same rule in each, but that would not be an efficient way of doing it.
Any help would be highly appreciated.
SQL Server Integration Services has a specific task to suit this need, called a Data Conversion Transformation. This can be accomplished on the data source or via the task, as noted here.
You can also use the Derived Column transformation to convert data. This transformation is also simple: select an input column, then choose whether to replace it or create a new output column. Then you apply an expression to the output column.
So why use one over the other?
The Data Conversion transformation (pictured below) will take an input, convert the type, and provide a new output column. If you use the Derived Column transformation, you get to apply an expression to the data, which allows you to do more complex manipulations.
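A rough SQL analogy of the difference, with hypothetical column names (SSIS expression syntax differs, but the idea is the same):

SELECT
    -- Data Conversion: a plain type change
    CAST(order_date_text AS datetime) AS order_date,
    -- Derived Column: an arbitrary expression, e.g. guarding against empty strings
    CASE WHEN order_date_text = '' THEN NULL
         ELSE CAST(order_date_text AS datetime)
    END AS order_date_clean
FROM source_table;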

SSIS: Adding multiple Derived Columns without using the GUI?

I have about 500 fixed-width columns in a flat file, and I want to apply the same logic to each of them to replace an empty value with NULL before it goes into the database.
I know the expression to replace an empty string with NULL, but I really don't want to have to use the GUI to enter that expression for every column.
So is there a tool out there that can do this all on the back end?
You could look at something like the EzAPI to create your data flow. In this answer, I have an example of how one creates an EzDerivedColumn and sets the formula within it.
Automatically mapping columns with EZApi with OLEDBSource
If you can install third-party components, I've seen a number of implementations of Trim-To-Null functionality on codeplex.com.
BIML might be an option to generate your package as well. I'd need to play with that to figure out the syntax, though.
My googlefu worked a little better after lunch.
I was able to modify roughly the 5th comment down on http://social.msdn.microsoft.com/Forums/sqlserver/en-US/222e70f5-0a21-4bb8-a3fc-3f365d9c701f/ssis-custom-component-derivedcolumn-programmatically-problems?forum=sqlintegrationservices to work for my needs.
My C# code now loops through all the input columns from a "Flat File Source" object and adds a derived column for each.

How to insert a row into a dataset using SSIS?

I'm trying to create an SSIS package that takes data from an XML data source and, for each row, inserts another row with some preset values. Any ideas? I'm thinking I could use a DataReader source to generate the preset values by doing the following:
SELECT 'foo' as 'attribute1', 'bar' as 'attribute2'
The question is, how would I insert one row of this type for every row in the XML data source?
I'm not sure if I understand the question... My assumption is that you have n number of records coming into SSIS from your data source, and you want your output to have n * 2 records.
In order to do this, you can do the following (a SQL sketch of the net effect follows below):
multicast to create multiple copies of your input data
derived column transforms to set the "preset" values on the copies
sort
merge
Am I on the right track w/ what you're trying to accomplish?
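A rough sketch in SQL of the net effect of that multicast/derived-column/merge pattern, assuming a hypothetical xml_source with the two attributes:

-- Each source row appears twice: once as-is, once with the preset values.
SELECT attribute1, attribute2 FROM xml_source
UNION ALL
SELECT 'foo' AS attribute1, 'bar' AS attribute2 FROM xml_source;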
I've never tried it, but it looks like you might be able to use a Derived Column transformation to do it: set the expression for attribute1 to "foo" and the expression for attribute2 to "bar".
You'd then transform the original data source, then only use the derived columns in your destination. If you still need the original source, you can Multicast it to create a duplicate.
At least I think this will work, based on the documentation. YMMV.
I would probably switch to using a Script Task and place your logic in there. You may still be able to leverage the file-reading and other objects in SSIS to save some code.
