How to group rows (based on CustomerID) using Pivot in SSIS? - sql-server

I am practicing SSIS and am currently working on the Pivot transformation. Here's what I am working on.
I created a Data Source (Table name: Pivot) with the following data.
Using SSIS, I created a package to pivot the data into the following columns:
PersonID --- Product1 --- Product2 --- Product3.
Here's where I am: I was able to write the pivoted data to a text file, but the output is not grouped by PersonID.
My current output is:
As we can see, the transformation does not group the rows based on the set key:
SetKey (PersonID: PivotUsage = 1)
The output I am hoping to get is:
Where the data is grouped based on PersonID.
What am I missing here?
Edit:
Going back to the example I was following, I re-ordered the input data as follows.
Does the input data need to be in this order/pattern every time? Most of the examples I came across follow a similar pattern.

Yes, the input data needs to be sorted by whatever you're pivoting on:
To pivot data efficiently, which means creating as few records in the
output dataset as possible, the input data must be sorted on the pivot
column. If the data is not sorted, the Pivot transformation might
generate multiple records for each value in the set key, which is the
column that defines set membership. For example, if the dataset is
pivoted on a Name column but the names are not sorted, the output
dataset could have more than one row for each customer, because a
pivot occurs every time that the value in Name changes.
That's a direct quote from the Pivot Transformation documentation on MSDN. (Emphasis added.)
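The behavior the quote describes can be modeled with a short sketch. This is not how SSIS is implemented internally; it is a toy model that assumes the pivot emits one output row per run of consecutive input rows sharing the same set key. The sample rows and the column names Product and Quantity are hypothetical, since the original tables were not preserved.

```python
from itertools import groupby


def streaming_pivot(rows, set_key, pivot_key, value_key):
    """Emit one output record per run of consecutive rows with the same set key."""
    output = []
    # groupby only groups *adjacent* rows, which mimics a pivot that
    # starts a new output row every time the set key value changes.
    for key, run in groupby(rows, key=lambda r: r[set_key]):
        record = {set_key: key}
        for r in run:
            record[r[pivot_key]] = r[value_key]
        output.append(record)
    return output


# Hypothetical input, deliberately not sorted by PersonID.
rows = [
    {"PersonID": 1, "Product": "Product1", "Quantity": 5},
    {"PersonID": 2, "Product": "Product1", "Quantity": 3},
    {"PersonID": 1, "Product": "Product2", "Quantity": 7},
    {"PersonID": 2, "Product": "Product2", "Quantity": 4},
]

unsorted_result = streaming_pivot(rows, "PersonID", "Product", "Quantity")

# Sorting on the set key first collapses each person into a single row.
sorted_rows = sorted(rows, key=lambda r: r["PersonID"])
sorted_result = streaming_pivot(sorted_rows, "PersonID", "Product", "Quantity")

print(len(unsorted_result))  # 4 rows: one per change in PersonID
print(len(sorted_result))    # 2 rows: one per PersonID
```

In a package, the same effect would typically come from sorting on the set key column before the Pivot, either in the source query or with a Sort transformation.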

When I first read this answer, I thought that the sorted column should be the one with PivotUsage=2 in the pivot. That's what I understood the pivot column to be. However, what finally worked for me was sorting by the column with PivotUsage=1, which is the column I would group by if I were writing the SQL by hand.

Related

How can I merge duplicate rows in a Spreadsheet while summing the numerical columns?

My project reports come from two different sources because we changed our project management tool in the middle of the year. Therefore, I have prepared two tabs within a Google Spreadsheet with the data from the two systems under the same set of headings. Then I combined the two sheets into one using the following query:
=QUERY({'Sheet1'!A1:I1000;'Sheet2'!A2:I1000},"select * where Col1 <>''")
Some of my projects are present in both lists because they were started early in the year. To avoid duplicates, I need to merge the two rows representing the same project into one. The project names are identical. However, I need the sum of some of the columns, such as 'Time Spent', in order to get the total value for the whole period. At the same time, columns such as 'Project Owner' are identical between the two rows.
How can I combine these duplicate rows into single rows while merging the selected columns?
Thank you in advance for your support!
The syntax is:
=QUERY({Sheet1!A1:B; Sheet2!A2:B},
"select Col1,sum(Col2)
where Col1 is not null
group by Col1
label sum(Col2)''")
where column A is text and column B holds numeric values.
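To make the logic of that formula concrete, here is a minimal Python sketch of the same steps: stack the two ranges, skip rows with an empty name, group by name, and sum the numeric column. The project names and hours are made-up sample data, and the sorted output is only a convenience of this sketch.

```python
from collections import defaultdict

# Hypothetical stand-ins for Sheet1!A:B and Sheet2!A:B (name, time spent).
sheet1 = [("Project A", 10.0), ("Project B", 4.5)]
sheet2 = [("Project A", 2.5), ("Project C", 8.0), ("", 0.0)]


def merge_and_sum(*sheets):
    """Group rows from all sheets by name and sum the numeric column."""
    totals = defaultdict(float)
    for sheet in sheets:
        for name, hours in sheet:
            if name:  # mirrors "where Col1 is not null"
                totals[name] += hours  # mirrors "sum(Col2) ... group by Col1"
    return sorted(totals.items())


merged = merge_and_sum(sheet1, sheet2)
for name, hours in merged:
    print(name, hours)
```

Columns that are identical across the duplicate rows, such as the project owner, would need to be part of the grouping key or carried with their own aggregate, because every selected column in a grouped query must be either grouped or aggregated.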

Conditional Formatting on specific columns in pivot table v2 chart in Apache Superset

I wanted to know if there's any way I can apply conditional formatting (coloring) to specific columns in a Pivot Table v2 chart in Apache Superset.
Tried:
In a Pivot Table v2 chart, when I try to apply formatting/coloring to the data, it is applied to every column of the table in the Superset application.
Expected:
For example: in the Superset application I have a pivot table with the columns total orders, canceled orders, and allocated orders, and I want to apply coloring/formatting to the total orders column only, so that only the total orders data is displayed in color.
I am not sure this is possible in pivot tables (without customizing them in any way, that is).
Since a pivot table shows the same metric (grouped on the basis of two or more columns), it might not make sense to show formatting on only one column.
If your pivot table is grouping by only two columns, you might try creating a table that shows the same data as your pivot table. And since you can write HTML in tables, you might be able to color only the column you need.

SSIS/Visual Studio: using Derived Column transform to count distinct values and sort by two variables?

I have three columns of data: incident_ID, date, state.
Each incident_ID is unique, date is in year format only and ranges from 2013 to 2016, and states are in no particular order and repeat if more than one incident occurs within the same state in the same year.
I'm going to be combining this data with another table, but first I need to organize the data to better match the other table's format, which is laid out showing the year, state, and dollars spent per state. So for 2013, I would have 51 rows (each state + DC), and each state would have a dollar amount; the same pattern then repeats down the table through 2016.
I'm pretty new to SSIS/Visual Studio, but from my understanding I should be able to use a Derived Column Transform to accomplish this. I just don't know how to get there.
Is there a way to use Derived Column to 'count' and rearrange the data in order to show how many incidents occurred per state in the given year?
There is no aggregate functionality (such as SUM, COUNT, MIN, or MAX) in the Derived Column transformation. For a count, you can use the Row Count transformation, or you can insert the data into a SQL table and then reference the result value in the derived column.
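The result the question asks for is a grouped count, which is an aggregation and not a per-row derived value. A minimal Python sketch of that aggregation, using made-up incident rows, looks like this:

```python
from collections import Counter

# Hypothetical sample rows: (incident_ID, date, state).
incidents = [
    (101, 2013, "TX"),
    (102, 2013, "TX"),
    (103, 2013, "CA"),
    (104, 2014, "TX"),
]

# Count incidents per (year, state) pair, the equivalent of
# GROUP BY date, state with COUNT(*) in SQL.
counts = Counter((year, state) for _incident_id, year, state in incidents)

# Order by year, then state, to match the layout of the other table.
report = sorted((year, state, n) for (year, state), n in counts.items())

for year, state, n in report:
    print(year, state, n)
```

Note that this only produces rows for states that had at least one incident in a given year. Getting all 51 rows per year would need a join against a full list of states, which is outside this sketch.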

SSIS - Group by and max of String-column

I have in SSIS three columns from a database.
Price
Number
Title
I want to GROUP BY "Price" and "Number". The problem is that there are rows with the same Number but different titles, so I want to have the MAXIMUM of Title.
In other ETL tools like Pentaho or OWB it works. There are functions where I can GROUP BY Price and Number and get the MAXIMUM of the Title.
Is there a workaround?
Have you looked at the Aggregate transformation?
Alternatively, you could do this operation in source SQL, with the advantage that the database engine would process it (the SSIS Aggregate transform is blocking, so it'll load all rows into memory before it spits out its results).
UPDATED: If pre-aggregating in raw SQL (before it gets into SSIS) isn't an option, you could add a surrogate key for the Title:
SELECT
    Price,
    Number,
    Title,
    ROW_NUMBER() OVER (PARTITION BY Price, Number ORDER BY Title) AS TitleOrdinal
FROM ...
Then your SSIS Aggregate can use MAX(TitleOrdinal) (which is a numeric column) as a surrogate for MAX(Title).
To get the actual MAX(Title), you'd have to join the original dataset to this aggregated set, on Price=Price,Number=Number,TitleOrdinal=[MAX(TitleOrdinal) from aggregated set].
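The three steps (number the titles within each group, take the maximum ordinal, join back) can be traced in a small Python sketch. The rows are invented sample data, and Python's default string ordering stands in for whatever collation the database would apply in ORDER BY Title, so results could differ for mixed-case or accented titles.

```python
from itertools import groupby

# Hypothetical sample rows: (Price, Number, Title).
rows = [
    (9.99, 1, "Alpha"),
    (9.99, 1, "Beta"),
    (4.50, 2, "Gamma"),
]

# Step 1: emulate ROW_NUMBER() OVER (PARTITION BY Price, Number ORDER BY Title).
ordered = sorted(rows)  # tuple sort: Price, then Number, then Title
numbered = []
for _key, group in groupby(ordered, key=lambda r: (r[0], r[1])):
    for ordinal, (price, number, title) in enumerate(group, start=1):
        numbered.append((price, number, title, ordinal))

# Step 2: emulate the Aggregate step, MAX(TitleOrdinal) per (Price, Number).
max_ordinal = {}
for price, number, _title, ordinal in numbered:
    key = (price, number)
    max_ordinal[key] = max(ordinal, max_ordinal.get(key, 0))

# Step 3: join back to the numbered rows to recover the actual MAX(Title).
result = [
    (price, number, title)
    for price, number, title, ordinal in numbered
    if max_ordinal[(price, number)] == ordinal
]

print(result)
```

Because the ordinal follows the title ordering within each group, the row holding the highest ordinal is also the row holding the maximum title.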

SQL Server Select Query

I have to write a query to get the following data as result.
I have four columns in my database. ID is not null, all others can have null values.
EMP_ID EMP_FIRST_NAME EMP_LAST_NAME EMP_PHONE
1 John Williams +123456789
2 Rodney +124568937
3 Jackson +124578963
4 Joyce Nancy
Now I have to write a query which returns only the columns which are not null.
I do not want to specify the column name in my query.
I mean, I want to use SELECT * FROM TABLE WHERE - and add the filter, but I do not want to specify the column name after the WHERE clause.
This question may be foolish, but please correct me wherever necessary. I'm new to SQL and working on a project with C# and SQL.
The reason I do not want to use column names is that I have more than 250 columns and 1500 rows. If I select any row, at least one column will have a null value. I want to select the row, but the columns which have null values for that particular row should not appear in the result.
Please advise. Thank you in advance.
Regards,
Vinay S
Every row returned from a SQL query must contain exactly the same columns as the other rows in the set. There is no way to select only those columns which do not return null unless all of the results in the set have the same null columns and you specify that in your select clause (not your where clause).
To Anders Abels's comment on your question, you could avoid a good deal of the query complexity by separating your data into tables which serve common purposes (called normalizing).
For example, you could put names in one table (Employee_ID, First_Name, Last_Name, Middle_Name, Title), places in another (Address_ID, Address_Name, Street, City, State), relationships in another, then tiny 2-4 column tables which link them all together. Structuring your data this way avoids duplication of individual facts, like, "who is John Williams's supervisor and how do I contact that person."
Your question reads:
I want to get all the columns that don't have a null value.
And at the same time:
But I don't want to specify column names in the WHERE clause.
These are conflicting goals. Your only option is to use the sys.tables and sys.columns catalog views to build a series of dynamic SQL statements. In the end, this is going to be more work than just writing one query by hand the first time.
You can do this with a dynamic PIVOT / UNPIVOT approach, assuming your version of SQL Server supports it (you'll need SQL Server 2005 or later), which would be based on the concepts found in these links:
Dynamic Pivot
PIVOT / UNPIVOT
Effectively, you'll select a row, unpivot its columns into rows, filter out the NULL entries, and then pivot the remaining entries back into a single row. It's going to be ugly and complex code, though.
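Since the question mentions a C# client, the same reshaping could also happen in application code after fetching the row. The idea is sketched here in Python only to show the steps (columns to rows, drop NULLs, back to one row). The function name is hypothetical, and the sample row is row 4 from the question, read as having a NULL phone number.

```python
def non_null_columns(row):
    """Return only the columns of a single row whose value is not NULL."""
    # Columns to rows: one (column_name, value) pair per column.
    pairs = list(row.items())
    # Filter out the NULL entries (None in Python).
    kept = [(column, value) for column, value in pairs if value is not None]
    # Back to a single row that carries only the remaining columns.
    return dict(kept)


# Row 4 from the question, assuming EMP_PHONE is the NULL column.
row = {
    "EMP_ID": 4,
    "EMP_FIRST_NAME": "Joyce",
    "EMP_LAST_NAME": "Nancy",
    "EMP_PHONE": None,
}

trimmed = non_null_columns(row)
print(trimmed)
```

This only works one row at a time, which matches the point made earlier: within a single result set, every row must have the same columns.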
