SSIS - Group by and max of String-column

I have three columns from a database in SSIS:
Price
Number
Title
I want to GROUP BY "Price" and "Number". The problem is that there are rows with the same Number but different Titles, so I want the MAXIMUM of Title.
In other ETL tools such as Pentaho or OWB this works: there are functions that let me GROUP BY Price and Number and get the MAXIMUM of the Title.
Is there a workaround?

Have you looked at the Aggregate transformation?
Alternatively, you could do this operation in source SQL, with the advantage that the database engine would process it (the SSIS Aggregate transform is blocking, so it'll load all rows into memory before it spits out its results).
UPDATED: If pre-aggregating in raw SQL (before it gets into SSIS) isn't an option, you could add a surrogate key for the Title:
SELECT
    Price,
    Number,
    Title,
    ROW_NUMBER() OVER (PARTITION BY Price, Number ORDER BY Title) AS TitleOrdinal
FROM ...
Then your SSIS Aggregate can use MAX(TitleOrdinal) (which is a numeric column) as a surrogate for MAX(Title).
To get the actual MAX(Title), you'd have to join the original dataset back to this aggregated set on Price = Price, Number = Number, and TitleOrdinal = [MAX(TitleOrdinal) from the aggregated set].
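The ordinal-then-join idea can be checked outside SSIS. The following is a minimal Python sketch, not SSIS or T-SQL: the sample rows and the function names are made up for illustration, and Python compares strings by code point, whereas SQL Server would use the column's collation.

```python
from collections import defaultdict

# Hypothetical sample rows: (Price, Number, Title)
rows = [
    (10.0, 1, "Alpha"),
    (10.0, 1, "Beta"),
    (12.5, 2, "Gamma"),
    (12.5, 2, "Delta"),
    (12.5, 3, "Omega"),
]


def add_title_ordinal(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY Price, Number ORDER BY Title)."""
    groups = defaultdict(list)
    for price, number, title in rows:
        groups[(price, number)].append(title)
    ranked = []
    for (price, number), titles in groups.items():
        for ordinal, title in enumerate(sorted(titles), start=1):
            ranked.append((price, number, title, ordinal))
    return ranked


def max_title_via_ordinal(rows):
    ranked = add_title_ordinal(rows)
    # Aggregate step: MAX(TitleOrdinal) per (Price, Number), a numeric max
    max_ordinal = {}
    for price, number, _title, ordinal in ranked:
        key = (price, number)
        max_ordinal[key] = max(max_ordinal.get(key, 0), ordinal)
    # Join step: keep the row whose ordinal equals the group's maximum
    return {
        (price, number): title
        for price, number, title, ordinal in ranked
        if max_ordinal[(price, number)] == ordinal
    }


result = max_title_via_ordinal(rows)
```

Because the ordinals are assigned in Title order, the row carrying the largest ordinal in each group is also the row with the largest Title, which is why the numeric aggregate can stand in for the string one.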

Related

Power BI Aggregation of End Tables

I am new to Power BI and database management, and I want to clarify for myself how Power BI works, with reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought was a good thing to do in data modelling.
I want to aggregate the value column of the end table over the group column Year. I am surprised that, as far as I can tell, Power BI sums up the value column within the end table, when I would expect the aggregation to follow the group variable across the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
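# main_table: 25 firm-year rows; FK_id points at a row of end_table
main_table <- data.frame(id = 1:20, FK_id = sample(1:2, 20, replace = TRUE), Jahre = 2016:2020)
main_table <- rbind(main_table, data.frame(id = 21:25, FK_id = sample(2:3, 5, replace = TRUE), Jahre = 2015))
# end_table: one value per id
end_table <- data.frame(id = 1:3, value = c(10, 20, 30))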
The first 5 rows of the data including all columns looks like this:
If I take out all row-specific information and sum over value, it always shows the sum of the end table, which is 60, in each Year.
Making the connection bi-directional does not help: it just sums the values of the end_table that exist in each year. I get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value])
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently and it feels a bit tedious to always add the column using Related() in the main table while it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.

This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows in a visual and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need to write a measure that iterates over the rows of main_table and looks up the related value for each one. Since main_table sits on the many side of the relationship (each of its rows points to exactly one end_table row through FK_id, and there are several rows per year), you can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )
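To see why the row-by-row lookup gives a per-year total while a plain sum over end_table does not, here is a small Python sketch of the same logic. It is an illustration, not DAX: the main_table rows are a fixed, hypothetical stand-in (the question draws FK_id at random), and the function names are invented.

```python
from collections import defaultdict

# end_table from the question: id -> value
end_table = {1: 10, 2: 20, 3: 30}

# Hypothetical, fixed stand-in for main_table
main_table = [
    {"id": 1, "FK_id": 1, "Jahre": 2016},
    {"id": 2, "FK_id": 1, "Jahre": 2016},
    {"id": 3, "FK_id": 2, "Jahre": 2017},
    {"id": 4, "FK_id": 2, "Jahre": 2017},
    {"id": 5, "FK_id": 3, "Jahre": 2015},
]


def sum_value_by_year(main_rows, end_values):
    """Rough analogue of SUMX(main_table, RELATED(end_table[value])) per year."""
    totals = defaultdict(int)
    for row in main_rows:
        # RELATED: follow the foreign key from the many side to the one side
        totals[row["Jahre"]] += end_values[row["FK_id"]]
    return dict(totals)


def naive_sum(end_values):
    """What a plain SUM(end_table[value]) sees when the year does not filter end_table."""
    return sum(end_values.values())


per_year = sum_value_by_year(main_table, end_table)
unfiltered = naive_sum(end_table)
```

The naive sum is 60 regardless of the year, matching the behaviour described in the question, while the row-wise version counts a value once for every main_table row that references it.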

single or multiple tables in database?

My system collects a lot of data from different resources (each resource has a text ID) and sends it to the client bundled together in predefined groups. There are some hundreds of different resources, and each may write a record at intervals ranging from a second up to several hours. There are fewer than a hundred "view groups".
The data collector is single-threaded.
What is the best method to organize the data?
a. Make a different table for each source, where the name of the table is based on the source ID?
b. Make a single table and add the source ID as a text field (a key if possible)?
c. Make a table for each predefined display group, with the source ID as a text field?
Each record has a value (float) and a date (date). The query will be something like select * from ... where date < d1 and date > d2. In the case of a single table, it will also have "and sourceId in(...)".
The database type is not known yet; it might be lightsql, postgres, mysql, mssql ...
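For reference, this is roughly what option (b) and its query could look like. It is only a sketch in Python with the standard-library sqlite3 module standing in for whichever database ends up being used; the table name, column names, index and sample data are all hypothetical, and it is not meant as a recommendation of (b) over the other options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option (b): one table, source ID stored as a text column
conn.execute(
    """
    CREATE TABLE readings (
        source_id   TEXT NOT NULL,
        recorded_at TEXT NOT NULL,  -- ISO-8601 text, which sorts chronologically
        value       REAL NOT NULL
    )
    """
)
# Assumed composite index so a source filter plus a date range can use it
conn.execute("CREATE INDEX idx_readings_source_date ON readings (source_id, recorded_at)")

conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("temp-A", "2024-01-01T00:00:00", 1.0),
        ("temp-A", "2024-01-01T00:00:10", 1.5),
        ("temp-B", "2024-01-01T00:00:05", 2.0),
        ("flow-C", "2024-01-01T00:00:07", 9.0),
        ("temp-A", "2024-01-02T00:00:00", 3.0),
    ],
)


def fetch_group(conn, source_ids, after, before):
    """Rows for one view group: date > after AND date < before AND sourceId IN (...)."""
    placeholders = ",".join("?" for _ in source_ids)
    sql = (
        "SELECT source_id, recorded_at, value FROM readings "
        "WHERE recorded_at > ? AND recorded_at < ? "
        f"AND source_id IN ({placeholders}) "
        "ORDER BY recorded_at"
    )
    return conn.execute(sql, [after, before, *source_ids]).fetchall()


group_rows = fetch_group(
    conn, ["temp-A", "temp-B"], "2024-01-01T00:00:00", "2024-01-01T12:00:00"
)
```

With this layout a view group is just a list of source IDs passed to the query, so adding a resource or regrouping does not require creating or renaming tables.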

Sorting the view based on frequency in SQL Server

I have a StockinHand view generated from the stock_Outward and Stock_Inward tables. It now needs to be sorted by frequency, i.e. the fastest-moving stock items should be at the top of the table.
My tables are like below:
tbl_StockInward:
ID, Stock_Code, Units, Rate, Description, Vendor, DateOfPurchase, DateOfUpdate, Purchased_By, WareHouse, Remarks
and likewise for tbl_StockOutward.
Please help me.
Thanks in advance.

Just like in subqueries, you can't use ORDER BY in a view definition in SQL Server unless you also use TOP.
The reason for this is that views are treated as if they were tables, and tables in SQL Server (in fact, in any relational database) are considered unordered sets.
Just like there is no meaning to the order of records stored in a table, there is also no meaning to the order of records fetched by a view.
You can use a dirty hack and write SELECT TOP 100 PERCENT ... and then use ORDER BY, but I doubt it has any meaning at all.
Having said all that, you can of course use ORDER BY in any query that selects from a view.
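As a rough illustration of that last point, the sketch below keeps the view unordered and sorts in the query that reads from it. It uses Python's sqlite3 as a stand-in for SQL Server, so the dialect differs in details. The view name vw_StockMovement, the sample rows, and the choice to measure "frequency" as the number of outward rows per Stock_Code are assumptions, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl_StockOutward (
        ID INTEGER PRIMARY KEY,
        Stock_Code TEXT,
        Units INTEGER
    );
    INSERT INTO tbl_StockOutward (Stock_Code, Units) VALUES
        ('A100', 5), ('B200', 1), ('A100', 2),
        ('C300', 4), ('A100', 1), ('C300', 3);

    -- The view only computes the movement count; it carries no ORDER BY
    CREATE VIEW vw_StockMovement AS
        SELECT Stock_Code, COUNT(*) AS Movements, SUM(Units) AS UnitsOut
        FROM tbl_StockOutward
        GROUP BY Stock_Code;
    """
)

# The ordering lives in the query that selects from the view
ranked = conn.execute(
    "SELECT Stock_Code, Movements FROM vw_StockMovement "
    "ORDER BY Movements DESC, Stock_Code"
).fetchall()
```

The same shape should carry over to the StockinHand view: expose a movement count as a column, then sort on it wherever the view is queried.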

How to group rows (based on CustomerID) using Pivot in SSIS?

I am practicing SSIS and currently working on the Pivot transformation. Here's what I am working on.
I created a data source (table name: Pivot) with the following data.
Using SSIS, I created a package for pivoting the data to have the following columns:
PersonID --- Product1 --- Product2 --- Product3.
Here's where I am: I was able to write the pivoted data to a text file, but the output is not grouped by PersonID.
My Current Output is
As we can see, the transformation does not group the rows based on
SetKey (PersonID: PivotUsage = 1)
The output i am hoping to get is
Where the data is grouped based on PersonID.
What am I missing here?
Edit:
Going back to the example I was following, I re-ordered the input data as follows.
Does the input data need to be in this order/pattern every time? Most of the examples I came across follow a similar pattern.

Yes, the input data needs to be sorted by whatever you're pivoting on:
To pivot data efficiently, which means creating as few records in the
output dataset as possible, the input data must be sorted on the pivot
column. If the data is not sorted, the Pivot transformation might
generate multiple records for each value in the set key, which is the
column that defines set membership. For example, if the dataset is
pivoted on a Name column but the names are not sorted, the output
dataset could have more than one row for each customer, because a
pivot occurs every time that the value in Name changes.
That's a direct quote from the Pivot Transformation documentation on MSDN. (Emphasis added.)

When I first read this answer, I thought that the sorted column should be the one with PivotUsage = 2 in the pivot, since that is what I understood the pivot column to be. However, what finally worked for me was sorting by the column with PivotUsage = 1 (the set key). It's the column I would group by if writing the SQL by hand.
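The effect of unsorted input can be reproduced with a toy model. The Python sketch below is not how SSIS implements the Pivot transformation; it only mimics the behaviour described in the documentation quote, starting a new output row every time the set key changes. The function name and sample rows are invented for the illustration.

```python
def streaming_pivot(rows, pivot_values):
    """Pivot (set_key, pivot_key, value) rows in a single pass.

    A new output row is started whenever the set key differs from the
    previous input row, so unsorted input yields repeated keys.
    """
    output = []
    current = None
    for set_key, pivot_key, value in rows:
        if current is None or current["PersonID"] != set_key:
            current = {"PersonID": set_key}
            current.update({name: None for name in pivot_values})
            output.append(current)
        current[pivot_key] = value
    return output


products = ["Product1", "Product2"]

# Hypothetical input, interleaved by PersonID (i.e. not sorted on the set key)
unsorted_rows = [
    (1, "Product1", 10),
    (2, "Product1", 5),
    (1, "Product2", 7),
    (2, "Product2", 3),
]

fragmented = streaming_pivot(unsorted_rows, products)
grouped = streaming_pivot(sorted(unsorted_rows, key=lambda r: r[0]), products)
```

Here the interleaved input produces four partially filled rows, while the same data sorted on PersonID produces one complete row per person.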

SQL Server Select Query

I have to write a query to get the following data as result.
I have four columns in my database. ID is not null, all others can have null values.
EMP_ID EMP_FIRST_NAME EMP_LAST_NAME EMP_PHONE
1 John Williams +123456789
2 Rodney +124568937
3 Jackson +124578963
4 Joyce Nancy
Now I have to write a query which returns only the columns which are not null.
I do not want to specify the column names in my query.
I mean, I want to use SELECT * FROM TABLE WHERE - and add the filter, but I do not want to specify the column name after the WHERE clause.
This question may be foolish, so please correct me wherever necessary. I'm new to SQL and working on a project with C# and SQL.
The reason I do not want to use column names is that I have more than 250 columns and 1500 rows. If I select any row, at least one column will have a null value. I want to select the row, but the columns which have null values for that particular row should not appear in the result.
Please advise. Thank you in advance.
Regards,
Vinay S

Every row returned from a SQL query must contain exactly the same columns as the other rows in the set. There is no way to select only those columns which do not return null unless all of the results in the set have the same null columns and you specify that in your select clause (not your where clause).
To Anders Abels's comment on your question, you could avoid a good deal of the query complexity by separating your data into tables which serve common purposes (called normalizing).
For example, you could put names in one table (Employee_ID, First_Name, Last_Name, Middle_Name, Title), places in another (Address_ID, Address_Name, Street, City, State), relationships in another, then tiny 2-4 column tables which link them all together. Structuring your data this way avoids duplication of individual facts, like, "who is John Williams's supervisor and how do I contact that person."

Your question reads:
I want to get all the columns that don't have a null value.
And at the same time:
But I don't want to specify column names in the WHERE clause.
These are conflicting goals. Your only option is to use the sys.tables and sys.columns catalog views to build a series of dynamic SQL statements. In the end, this is going to be more work than just writing one query by hand the first time.

You can do this with a dynamic PIVOT / UNPIVOT approach, assuming your version of SQL Server supports it (you'll need SQL Server 2005 or better), which would be based on the concepts found in these links:
Dynamic Pivot
PIVOT / UNPIVOT
Effectively, you'll select a row, unpivot its columns into rows, filter out the NULL entries (UNPIVOT already drops them), and then pivot what remains back into a single row. It's going to be ugly and complex code, though.
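Since a result set always has a fixed column list, another route is to fetch the whole row with SELECT * and drop the NULL columns in the client. The asker uses C#; the sketch below shows the idea in Python with sqlite3 as a stand-in for SQL Server. The table name EMPLOYEE and the helper name are hypothetical, and for rows 2 and 3 the question does not show which name column is empty, so their placement here is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (
        EMP_ID INTEGER NOT NULL,
        EMP_FIRST_NAME TEXT,
        EMP_LAST_NAME TEXT,
        EMP_PHONE TEXT
    );
    -- NULL positions for rows 2 and 3 are assumed
    INSERT INTO EMPLOYEE VALUES
        (1, 'John',   'Williams', '+123456789'),
        (2, 'Rodney', NULL,       '+124568937'),
        (3, NULL,     'Jackson',  '+124578963'),
        (4, 'Joyce',  'Nancy',    NULL);
    """
)


def non_null_columns(conn, emp_id):
    """Return {column: value} for one row, skipping NULL columns.

    Column names come from cursor.description, so none are hard-coded.
    """
    cur = conn.execute("SELECT * FROM EMPLOYEE WHERE EMP_ID = ?", (emp_id,))
    names = [col[0] for col in cur.description]
    row = cur.fetchone()
    if row is None:
        return None
    return {name: value for name, value in zip(names, row) if value is not None}


row_four = non_null_columns(conn, 4)
```

This works one row at a time, which matches the question's "if I select any row" scenario; with 250 columns it avoids both naming columns in SQL and building dynamic statements.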
