SQL Server BIDS, SSIS aggregate and group by - sql-server

I have an employee table with an employee_id, name, and working_division, where the employee_id is the primary key. I have an Excel source with these columns and more, in which employees have entered their hours, what type of work they did, what division of the company it was for, and so forth.
So for any given day, an employee could have multiple rows showing their type of work, what division they worked for, and their charged hours to that division.
How do I get this into the OLE DB destination in which the employee_id is the primary key?
I am trying to use the Aggregate transform to group by the employee_id; however, employee_id and working_division are not one-to-one. Thus, a group by operation on both of those columns will try to insert the same employee_id into the employee table more than once (the employee_id is the primary key!). If I do not include working_division in the Aggregate transform, then I lose the data.
How can I group my data by the employee_id and still keep all the other columns with that row?
Thanks for all the help!

I need the employee_id to be the PK. Basically I have a very large unorganized data source, and I am breaking it apart into 4 to 5 separate tables to fit my model so I can make sense of the data with some data mining algorithms.
OK, then why don't you split employee_id and working_division into two separate tables? The second table should keep a FK to the employee table (so one-to-many).
In the SSIS package you can then add a Multicast component, right after the Aggregate on employee_id, in order to split your data source into the two target tables.
I think that without a modification to your target model you won't be able to achieve what you want. It basically violates the rules of an RDBMS. The grouping you are talking about couldn't be done even in plain SQL and still yield correct results.
Note: If you're worried about modifying your target data model, then perhaps you can normalize it as I mentioned before and then denormalize it back through a view. You may even be able to create an indexed view in order to speed things up at read time (as far as I can see an indexed view should be possible, since all you have is an inner join between the two tables).
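The normalize-then-view idea above can be sketched outside SSIS. This is only an illustration, using Python's built-in sqlite3 module as a stand-in for SQL Server; the table name employee_division, the view name employee_hours, the hours column, and the sample rows are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical rows standing in for the Excel source:
# (employee_id, name, working_division, hours)
SOURCE_ROWS = [
    (1, "Ann", "Assembly", 4.0),
    (1, "Ann", "Paint", 3.5),
    (2, "Bob", "Assembly", 8.0),
]


def load_normalized(rows):
    """Load source rows into a one-to-many pair of tables plus a joining view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE employee (
            employee_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE employee_division (
            employee_id      INTEGER NOT NULL REFERENCES employee(employee_id),
            working_division TEXT NOT NULL,
            hours            REAL NOT NULL
        );
        -- Denormalized read model: one row per source row again.
        CREATE VIEW employee_hours AS
            SELECT e.employee_id, e.name, d.working_division, d.hours
            FROM employee AS e
            JOIN employee_division AS d ON d.employee_id = e.employee_id;
    """)
    # One row per employee: the equivalent of grouping by employee_id.
    employees = sorted({(emp_id, name) for emp_id, name, _, _ in rows})
    conn.executemany("INSERT INTO employee VALUES (?, ?)", employees)
    # Every source row keeps its division and hours in the child table.
    conn.executemany(
        "INSERT INTO employee_division VALUES (?, ?, ?)",
        [(emp_id, division, hours) for emp_id, _, division, hours in rows],
    )
    conn.commit()
    return conn
```

Querying employee_hours returns the same shape as the original source, while employee keeps employee_id unique.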

Related

How to apply ROW ID in snowflake? Oracle code conversion to Snowflake

Is there any way to convert the final a.ROWID > b.ROWID comparison in the code below to Snowflake? Below is the Oracle code. I need to carry the ROWID logic over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the same result?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
FROM rev_tag.emp_site_weekly b
WHERE a.number = b.ID
AND a.accountno = b.account_no
AND a.ROWID > b.ROWID)
So this Oracle code seems very broken, because ROWID is a table-specific pseudocolumn, so comparing its values between tables makes little sense, unless there is some aligned magic happening, such as rev_tag.emp_site_weekly also being written whenever user_tag.user_dim_default is inserted into. But even then I can imagine data flows where this will not get you what you want.
So, as with most things Snowflake, "there is no free lunch": the data life cycle that relies on ROWID needs to be implemented explicitly.
This implies that if you want to use two sequences, you should define them explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table insert or MERGE should be used, so you can access the first table's sequence value and relate it in the second.
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value when a DB maintenance job runs while your code is running or when someone else runs an update on the table. Some of these internal columns may even have the same value for more than one row.
When joining tables, the ROWID on one table has no relation to the ROWID on another table. When writing dedup logic or delete-before-insert logic, you should use the primary key, combined with an audit column that holds the date of insert or the date of last update. Check the data model or ERD diagram for the PK/FK relationships between the tables and for the audit columns that are available.
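As a rough illustration of the advice above, the sketch below replaces the ROWID comparison with an explicit, sequence-like column that the load process owns. It runs against SQLite through Python's sqlite3 module rather than Snowflake, and the table user_dim and the columns user_number, accountno, and load_seq are hypothetical names chosen for the example.

```python
import sqlite3


def make_table(conn):
    # load_seq plays the role of an explicit sequence/audit column
    # that the load process controls (an assumption of this sketch).
    conn.execute("""
        CREATE TABLE user_dim (
            load_seq    INTEGER PRIMARY KEY AUTOINCREMENT,
            user_number INTEGER NOT NULL,
            accountno   TEXT NOT NULL
        )
    """)


def dedupe_keep_first_loaded(conn):
    """Delete every row for which an earlier-loaded row with the same key exists."""
    conn.execute("""
        DELETE FROM user_dim
        WHERE EXISTS (
            SELECT 1
            FROM user_dim AS b
            WHERE b.user_number = user_dim.user_number
              AND b.accountno   = user_dim.accountno
              AND b.load_seq    < user_dim.load_seq
        )
    """)
    conn.commit()
```

Because load_seq is an ordinary column, its ordering is defined by the load process rather than by storage internals, which is the property the ROWID comparison was relying on without any guarantee.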

What kinds of indexes would you create to speed up queries to this table?

I had a question that I could really use someone's help with.
So suppose I have the following huge table with about one million rows:
ORDER (Order#, OrderDate, Customer#, OrderAmount, Product#, DiscountAmount, OrderStatus, OrderFullfillmentDate)
In this table, Order# is the PK, Customer# is a FK to the Customer table, and Product# is a FK to the Product table. What kinds of indexes could I create to speed up queries to this table?
Thanks.
Depends what you need to do with this table.
1. Apply an index on all fields.
2. Pay attention to your queries, because a query plan is built around the WHERE clause, and a query that is not optimised can load the whole table into memory even when the final result contains only a few rows.
3. Create many tables with fewer fields (columns) instead of a few tables with many columns.
I can help you if you give me more detail and an example of how you extract data from this table. I am curious where the unique Order_id is and how you query a specific order number.
There are many methods to optimize tables and queries and to output results quickly.
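To see how the WHERE clause decides whether an index helps, here is a minimal sketch. It uses SQLite through Python's sqlite3 module instead of SQL Server, and the table orders, its column names, and the index name idx_orders_customer are assumptions for illustration; EXPLAIN QUERY PLAN is SQLite's way of showing the chosen access path.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_no     INTEGER PRIMARY KEY,
        order_date   TEXT,
        customer_no  INTEGER,
        order_amount REAL,
        product_no   INTEGER,
        order_status TEXT
    )
""")

# A typical lookup that filters on the customer foreign key.
LOOKUP = "SELECT * FROM orders WHERE customer_no = ?"

plan_before = query_plan(conn, LOOKUP, (42,))  # no usable index yet
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_no)")
plan_after = query_plan(conn, LOOKUP, (42,))   # same query, index available
```

Comparing plan_before and plan_after shows the query moving from a full scan to an index search. The same comparison, done with the actual queries run against the table, is a reasonable way to decide which columns deserve an index.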

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists for the first round of an analysis that will involve many rounds.
You basically want to scrub the data and objects, correct? Here is what I would do.
1. Restore a backup of the db.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.).
3. Create a table with two columns and populate it so that each row holds an original table name and its new table name.
4. Write a script that iterates through the table, row by row, and renames your tables. Better yet, put the data into Excel and create a third column that builds the T-SQL you want, then cut/paste and execute it in SSMS.
5. Repeat step 4, but for all columns. Best to query sys.columns to get all the objects you need, put them into Excel, and build your T-SQL.
6. Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
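Instead of building the rename statements in Excel, the steps above could be scripted. The sketch below is one possible shape, assuming the table and column names have already been read (for example from sys.tables and sys.columns) into a Python dict; the function name build_rename_script and the sample names are hypothetical, and the names are assumed not to contain quotes or brackets. It only generates sp_rename calls as text and does not connect to a server.

```python
def build_rename_script(columns_by_table):
    """Build sp_rename statements for a {'schema.table': [column, ...]} mapping.

    Columns are renamed first, while the original table names still resolve;
    the tables themselves are renamed last.
    """
    column_renames = []
    table_renames = []
    for t_idx, (table, columns) in enumerate(columns_by_table.items(), start=1):
        for c_idx, column in enumerate(columns, start=1):
            column_renames.append(
                f"EXEC sp_rename '{table}.{column}', 'Column{c_idx}', 'COLUMN';"
            )
        table_renames.append(f"EXEC sp_rename '{table}', 'Table{t_idx}';")
    return column_renames + table_renames


# Hypothetical input; in practice this would come from the system catalog.
example = {
    "dbo.WeldLog": ["WeldId", "ShiftId"],
    "dbo.Scrap": ["ScrapId"],
}
script = "\n".join(build_rename_script(example))
```

Keeping the generated mapping somewhere private is worth considering, since it is the only way to translate findings on 'Table7.Column3' back to real names.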
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without a FK all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider a FK column customerID versus a materialID. Patterns have widely different meanings and analyses. I would correlate a quality measure with materialID or shiftID but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated, do a backup and restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget to drop the triggers and constraints, as they may disclose some trade secret. Then rename columns and then tables.

What is the advantage of avoiding nulls when outer join queries will just add them back in again?

Example: an Employee table with an optional DateOfBirth field can be normalized into two tables, one exclusively to hold the DateOfBirth field for employees with known dates of birth. But this adds no semantic meaning to the lack of a row in the DateOfBirth table, and when you query employees, you will almost certainly need to outer join those missing rows back into nulls.
You are talking about two completely different concepts here. One has to do with Normalization, and one has to do with query results. Having large numbers of nulls in query results is perfectly acceptable, and often desirable, to represent missing values. Having a large number of null values in a table column represents a design problem. Normally, if you find a table that contains a column with large numbers of null values, the table is denormalized. This might be ok for a reporting database, or a warehouse type database. However, for production data, you generally want to avoid this.
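The distinction between nulls in storage and nulls in query results can be made concrete with a small sketch. It uses SQLite through Python's sqlite3 module; the table names employee and employee_birth and the sample rows are assumptions for illustration.

```python
import sqlite3


def employees_with_birth_dates():
    """Store no NULLs in either table, then outer join them back together."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (
            employee_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        -- Only employees with a known date of birth get a row here.
        CREATE TABLE employee_birth (
            employee_id   INTEGER PRIMARY KEY REFERENCES employee(employee_id),
            date_of_birth TEXT NOT NULL
        );
        INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO employee_birth VALUES (1, '1980-02-03');
    """)
    return conn.execute("""
        SELECT e.name, b.date_of_birth
        FROM employee AS e
        LEFT JOIN employee_birth AS b ON b.employee_id = e.employee_id
        ORDER BY e.employee_id
    """).fetchall()
```

Neither table can hold a NULL, yet the result set contains one for the employee without a birth date row: the missing value appears only at query time.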

Best way to get distinct values from large table

I have a db table with about 10 or so columns, two of which are month and year. The table has about 250k rows now, and we expect it to grow by about 100-150k records a month. A lot of queries involve the month and year columns (e.g., all records from March 2010), and so we frequently need to get the available month and year combinations (i.e., do we have records for April 2010?).
A coworker thinks that we should have a separate table from our main one that only contains the months and years we have data for. We only add records to our main table once a month, so it would just be a small update on the end of our scripts to add the new entry to this second table. This second table would be queried whenever we need to find the available month/year entries on the first table. This solution feels kludgy to me and a violation of DRY.
What do you think is the correct way of solving this problem? Is there a better way than having two tables?
Using a simple index on the columns required (Year and Month) should greatly improve either a DISTINCT or a GROUP BY query.
I would not go with a secondary table, as this adds extra overhead for maintaining it (inserts, updates, and deletes will all require that you validate the secondary table).
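A quick way to check the effect of such an index is to compare query plans before and after creating it. The sketch below does this in SQLite through Python's sqlite3 module, not SQL Server; the table big_table, the payload column, and the index name idx_big_year_month are assumptions for illustration.

```python
import sqlite3


def explain(conn, sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE big_table (
        id      INTEGER PRIMARY KEY,
        year    INTEGER,
        month   INTEGER,
        payload TEXT
    )
""")
conn.executemany(
    "INSERT INTO big_table (year, month, payload) VALUES (?, ?, ?)",
    [(2010, 3, "a"), (2010, 3, "b"), (2010, 4, "c")],
)

DISTINCT_SQL = "SELECT DISTINCT year, month FROM big_table"

plan_before = explain(conn, DISTINCT_SQL)  # has to read the full rows
conn.execute("CREATE INDEX idx_big_year_month ON big_table (year, month)")
plan_after = explain(conn, DISTINCT_SQL)   # can be answered from the index

available = sorted(conn.execute(DISTINCT_SQL).fetchall())
```

With the index in place the engine can answer the query from the narrow index alone instead of reading every column of every row, which is the improvement the answer is pointing at.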
EDIT:
You might even want to consider using indexed views; see "Improving Performance with SQL Server 2005 Indexed Views".
Make sure to have a clustered index on those columns,
and partition your table on these date columns and place the data files on different disk drives.
I believe keeping your index fragmentation low is your best shot.
I also believe having a physical view with the desired select is not a good idea,
because it adds insert/update overhead.
On average there are 3.5 inserts per minute,
or about 17 seconds between each insert (on average; please correct me if I'm wrong).
The question is: are you selecting more often than every 17 seconds?
That's the key thought.
Hope it helped.
Use a 'Materialized View', which SQL Server calls an indexed view: create the view with schema binding and then index it. When you do this, SQL Server will essentially create and maintain the data in a secondary table behind the scenes and choose to use the index on this table when appropriate.
This is similar to what your co-worker suggested, the advantage being that you won't need to add logic to your query to take advantage of it. SQL Server will do this when it creates a query plan, and it will also automatically maintain the data in the indexed view.
Here is how you would accomplish this: create a view that returns the distinct [month], [year] values and then index [year], [month] on the view. Again, SQL Server will use the tiny index on the view and avoid the table scan on the big table.
Because SQL Server will not let you index a view that uses the DISTINCT keyword, use GROUP BY [year],[month] instead and include COUNT_BIG(*) in the SELECT. It will look something like this:
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
ON [dbo].[vwMonthYear](Year,Month)
Now when you SELECT DISTINCT [Year],[Month] on the big table, the query optimizer will scan the tiny index on the view instead of scanning millions of records on the big table.
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead on this will be that of maintaining an additional index, so each time the large table is updated the index on the view will also be updated.
If you find this index is substantially slowing down your load times, drop the index, perform your data load, and then recreate it.
Full working example:
CREATE TABLE YourBigTable(
YourBigTableID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
[Year] INT,
[Month] INT)
GO
CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS
SELECT
[year],
[month],
COUNT_BIG(*) [MonthCount]
FROM [dbo].[YourBigTable]
GROUP BY [year],[month]
GO
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month ON [dbo].[vwMonthYear](Year,Month)
SELECT DISTINCT
[year],
[month]
FROM YourBigTable
-- Actual execution plan shows SQL Server scanning ICU_vwMonthYear_Year_Month
Create a materialized (indexed) view of:
SELECT DISTINCT
MonthCol, YearCol
FROM YourTable
You will now get access to the pre-computed distinct values without going through the work every time.
Make the date the first column in the table's clustered index key. This is very typical for historic data, because most, if not all, queries are interested in specific ranges and a clustered index on time can address this. All queries like 'month of May' need to be addressed as ranges, e.g. WHERE DATECOLKEY >= '05/01/2010' AND DATECOLKEY < '06/01/2010'. Answering a question like 'are there any records in May' will involve a simple seek into the clustered index.
While this may seem complicated to a programmer's mind, it is the optimal way to approach a database design problem.
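The range-seek idea can be sketched as a small helper that turns a year and month into a half-open date range, so that the first day of the following month is excluded. This is an illustration in SQLite through Python's sqlite3 module, not SQL Server; the table readings, the column recorded_on (ISO 'YYYY-MM-DD' text), and the index name are assumptions for the example.

```python
import sqlite3


def make_readings(conn, dates):
    """Create a hypothetical readings table indexed on its date column."""
    conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, recorded_on TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_readings_recorded_on ON readings (recorded_on)")
    conn.executemany("INSERT INTO readings (recorded_on) VALUES (?)", [(d,) for d in dates])
    conn.commit()


def has_records_in_month(conn, year, month):
    """Answer 'are there any records in this month?' with a range predicate."""
    start = f"{year:04d}-{month:02d}-01"
    # The first day of the next month is the exclusive upper bound.
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = f"{next_year:04d}-{next_month:02d}-01"
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM readings"
        " WHERE recorded_on >= ? AND recorded_on < ?)",
        (start, end),
    ).fetchone()
    return bool(row[0])
```

Because the predicate is a plain range on the indexed column, the engine can seek to the start of the month and stop at the first matching row, rather than evaluating month and year for every row.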
