SQL: make rows into columns, PIVOT maybe? - sql-server

I have an MS SQL Server with a database for an e-commerce storefront.
These are some of the tables I have:
Products:
Id | Name | Price
ProductAttributeTypes: --Color, Size, Format
Id | Name
ProductAttributes: --Red, Green, 12x20 cm, Mirrored
Id | ProductAttributeTypeId | Name
Orders:
Id | DateCreated
OrderItems:
Id | OrderId | ProductId
OrderItemsToProductAttributes: --Relates an OrderItem to its product and selected attributes
OrderItemId | ProductAttributeId | ProductAttributeTypeId | ProductId
I want to select from the OrderItems table to see which items have been purchased.
To see which variants (ProductAttributes) were selected, I want those as "dynamic" columns in the result set.
So the resultset should look like this:
OrderItemId | ProductId | ProductName | Color | Size | Format
1234 | 123 | Mount. Bike | Red | 2x20 | Mirror
I don't know if PIVOT is the thing to use? I'm not using any aggregate functions, so I guess not...
Are there any SQL ninjas who can help me out?

If you are using SQL Server 2005 or 2008 you can use the PIVOT command. See here.
In the example below the OrderAttributes set will look like:
OrderItemId AttName AttValue
----- ------ -----
100 Color Red
100 Size Small
101 Color Blue
101 Size Small
102 Color Red
102 Size Small
103 Color Blue
103 Size Large
The final results after the PIVOT will be:
OrderItemId Size Color
----- ------ -----
100 Small Red
101 Small Blue
102 Small Red
103 Large Blue
WITH OrderAttributes (OrderItemId, AttName, AttValue)
AS
(
    SELECT x.OrderItemId,
           pat.Name AS AttName,
           pa.Name AS AttValue
    FROM OrderItemsToProductAttributes x
    INNER JOIN ProductAttributes pa
        ON x.ProductAttributeId = pa.Id
    INNER JOIN ProductAttributeTypes pat
        ON pa.ProductAttributeTypeId = pat.Id
)
SELECT AttrPivot.OrderItemId,
       [Size],
       [Color]
FROM OrderAttributes
PIVOT (
    MAX([AttValue])
    FOR [AttName] IN ([Color], [Size])
) AS AttrPivot
ORDER BY AttrPivot.OrderItemId
There is a way to dynamically build the columns (i.e. the Color and Size columns), as can be seen here. Make sure the compatibility level of your database is set to something higher than SQL Server 2000 (80), or you will get strange errors.
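For reference, here is a minimal sketch of that dynamic approach, assuming the table names from the question (the FOR XML PATH idiom for building the column list is a common trick; treat this as a starting point, not tested code):
-- Build the pivot column list from the attribute type names,
-- then run the same PIVOT query through dynamic SQL.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(Name)
    FROM ProductAttributeTypes
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'
WITH OrderAttributes (OrderItemId, AttName, AttValue) AS (
    SELECT x.OrderItemId, pat.Name, pa.Name
    FROM OrderItemsToProductAttributes x
    INNER JOIN ProductAttributes pa ON x.ProductAttributeId = pa.Id
    INNER JOIN ProductAttributeTypes pat ON pa.ProductAttributeTypeId = pat.Id
)
SELECT OrderItemId, ' + @cols + N'
FROM OrderAttributes
PIVOT (MAX(AttValue) FOR AttName IN (' + @cols + N')) AS p
ORDER BY OrderItemId;';

EXEC sp_executesql @sql;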

In the past, I've created physical tables for read purposes only. The structure you have above is GREAT for storage, but terrible for reporting.
So you could do the following:
Write a script (that is scheduled nightly) or a trigger (on data change) that does the following tasks:
First, you would dynamically go through each Product and build a static table "Product_[ProductName]"
Then go through each ProductAttributeTypes for each product and create/update/delete a physical column on the corresponding Product table.
Then, fill that table with the proper values based on OrderItemsToProductAttributes and ProductAttributes
This is just a rough idea. Make sure you are storing OrderID in the "Static"/"Flattened" tables. And make sure you do everything else you need to do. But after that, you should be able to start pulling from those flattened tables to get the data you need.
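As a rough, untested sketch of what the nightly script could produce for the order items (the flattened table name and the hard-coded attribute columns here are made up; a real script would build the column list dynamically per product type):
-- Rebuild one flattened reporting table from the normalized data.
IF OBJECT_ID('dbo.Flat_OrderItems', 'U') IS NOT NULL
    DROP TABLE dbo.Flat_OrderItems;

SELECT oi.Id AS OrderItemId,
       oi.OrderId,
       p.Id AS ProductId,
       p.Name AS ProductName,
       flat.[Color], flat.[Size], flat.[Format]
INTO dbo.Flat_OrderItems
FROM OrderItems oi
INNER JOIN Products p ON oi.ProductId = p.Id
LEFT JOIN (
    SELECT OrderItemId, [Color], [Size], [Format]
    FROM (
        SELECT x.OrderItemId, pat.Name AS AttName, pa.Name AS AttValue
        FROM OrderItemsToProductAttributes x
        INNER JOIN ProductAttributes pa ON x.ProductAttributeId = pa.Id
        INNER JOIN ProductAttributeTypes pat ON pa.ProductAttributeTypeId = pat.Id
    ) src PIVOT (MAX(AttValue) FOR AttName IN ([Color],[Size],[Format])) pvt
) flat ON flat.OrderItemId = oi.Id;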

Pivot is your best bet, but what I did for reporting purposes, and to make it work well with SSIS, was to create a view with this query:
SELECT [InputSetID], [InputSetName],
       CAST([470] AS int)      AS [Created By],
       CAST([480] AS datetime) AS [Created],
       CAST([479] AS int)      AS [Updated By],
       CAST([460] AS datetime) AS [Updated]
FROM (SELECT st.InputSetID, st.InputSetName, avt.InputSetID AS avtID, avt.AttributeID, avt.Value
      FROM app.InputSetAttributeValue avt
      JOIN app.InputSets st ON avt.InputSetID = st.InputSetID) AS p
PIVOT (MAX(Value) FOR AttributeID IN ([470], [480], [479], [460])) AS pvt
Then I can just interact with the view. I also have a trigger on the table that any new dynamic attributes must be added to; it recreates this view, so I can assume the view is always correct.
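For illustration, that trigger could look roughly like this (the attribute table and view names are my assumptions, and CREATE OR ALTER needs SQL Server 2016 SP1 or later; on older versions you would DROP and re-CREATE the view instead):
-- Hypothetical: rebuild the pivot view whenever the attribute list changes.
CREATE TRIGGER trg_RebuildInputSetView
ON app.Attributes   -- assumed: the table new dynamic attributes go into
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    DECLARE @cols nvarchar(max), @sql nvarchar(max);

    SELECT @cols = STUFF((
        SELECT ',' + QUOTENAME(CAST(AttributeID AS nvarchar(10)))
        FROM app.Attributes
        FOR XML PATH('')), 1, 1, '');

    SET @sql = N'CREATE OR ALTER VIEW app.v_InputSets AS
        SELECT InputSetID, InputSetName, ' + @cols + N'
        FROM (SELECT st.InputSetID, st.InputSetName, avt.AttributeID, avt.Value
              FROM app.InputSetAttributeValue avt
              JOIN app.InputSets st ON avt.InputSetID = st.InputSetID) AS p
        PIVOT (MAX(Value) FOR AttributeID IN (' + @cols + N')) AS pvt;';

    EXEC sp_executesql @sql;
END;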

Related

SQL - aggregate related accounts - manually set up ID

I have two tables:
one with Account and Amount columns
a list of related accounts
Data samples:
Account | Amount
--------+---------
001 | $100
002 | $150
003 | $200
004 | $300
Account | Related Account
--------+------------------
001 | 002
002 | 003
003 | 002
My goal is to be able to aggregate all related accounts. From table two - 001,002 & 003 are actually all related to each other. What I would like to be able to do is to get a sum of all related accounts. Possibly ID 001 to 003 as Account #1, so I can aggregate them.
Result below
ID | Account | Amount
-----+-----------+--------
#1 | 001 | $100
#1 | 002 | $150
#1 | 003 | $200
#2 | 004 | $300
I can then manipulate the above table as below (final result)
ID | Amount
-----+--------
#1 | $450
#2 | $300
I tried doing a join, but it doesn't quite achieve what I want. I still have a problem relating account 001 with 003 (they are indirectly related, because 002 is related to both 001 and 003).
If anyone can point me in the right direction, it will be much appreciated.
Well, you really made this harder than it should be.
If you could change the data in the second table so that it did not contain reversed duplicates (in your sample data, 2,3 and 3,2), it would simplify the solution.
If you could refactor both tables into a single table, where the related column is a self referencing nullable foreign key, it would simplify the solution even more.
Let's assume for a minute you can't do either, and you have to work with the data as provided. So the first thing you want to do is to ignore the reversed duplicates in the second table. This can be done using a common table expression and a couple of case expressions.
First, create and populate sample tables (Please save us this step in your future questions):
DECLARE @TAccount AS TABLE
(
    Account int,
    Amount int
)
INSERT INTO @TAccount (Account, Amount) VALUES
(1, 100),
(2, 150),
(3, 200),
(4, 300)

DECLARE @TRelatedAccounts AS TABLE
(
    Account int,
    Related int
)
INSERT INTO @TRelatedAccounts (Account, Related) VALUES
(1,2),
(2,3),
(3,2)
You want to keep only the first two records from the @TRelatedAccounts table (the third row, 3,2, is just 2,3 reversed).
This is the AccountAndRelated CTE.
Now, you want to left join the @TAccount table with the results of this query, so for each account we will have the Account, the Amount, and the related Account, or NULL if the account is not related to any other account or is the first in the relationship chain.
This is the CTERecursiveBase CTE.
Then, based on that you can create a recursive CTE (called CTERecursive), and finally select the sum of amount from the recursive CTE based on the root of the recursion.
Here is the entire script:
;WITH AccountAndRelated AS
(
    SELECT DISTINCT
        CASE WHEN Account > Related THEN Account ELSE Related END AS Account,
        CASE WHEN Account > Related THEN Related ELSE Account END AS Related
    FROM @TRelatedAccounts
)
, CTERecursiveBase AS
(
    SELECT A.Account, Related, Amount
    FROM @TAccount AS A
    LEFT JOIN AccountAndRelated AS R ON A.Account = R.Account
)
, CTERecursive AS
(
    SELECT Account AS Id, Account, Related, Amount
    FROM CTERecursiveBase
    WHERE Related IS NULL
    UNION ALL
    SELECT R.Id, B.Account, B.Related, B.Amount
    FROM CTERecursiveBase AS B
    JOIN CTERecursive AS R ON B.Related = R.Account
)
SELECT Id, SUM(Amount) AS TotalAmount
FROM CTERecursive
GROUP BY Id
Results:
Id TotalAmount
1 450
4 300
You can see a live demo on rextester.
Now, let's assume you can modify the data of the second table. You can use the AccountAndRelated CTE to get only the records you need to keep in the @TRelatedAccounts table. This means you can skip the AccountAndRelated CTE and use @TRelatedAccounts directly in the CTERecursiveBase CTE.
You can see a live demo of that as well.
Finally, let's assume you can refactor your database. In that case, I would recommend joining the two tables together - so your #TAccount table would look like this:
Account Amount Related
1 100 NULL
2 150 1
3 200 2
4 300 NULL
Then you only need the recursive CTE.
Here is a live demo of that option as well.
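For completeness, that last variant would boil down to something like this (untested sketch, assuming @TAccount has been refactored to carry the nullable Related column shown above):
;WITH CTERecursive AS
(
    SELECT Account AS Id, Account, Related, Amount
    FROM @TAccount
    WHERE Related IS NULL
    UNION ALL
    SELECT R.Id, B.Account, B.Related, B.Amount
    FROM @TAccount AS B
    JOIN CTERecursive AS R ON B.Related = R.Account
)
SELECT Id, SUM(Amount) AS TotalAmount
FROM CTERecursive
GROUP BY Id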

Group By with Where clause in SQL

I got different results when using a DateTime column in GROUP BY. Can someone explain why?
Query:
Select Name, Source, Description, CreatedDate
From testTable
Where Source like '%Validating err%'
And CreatedDate >='2016-12-01'
Group By Name, Source, Description, CreatedDate
Result : 15 rows
The above query returns some 15 rows. But when I remove the CreatedDate column from the GROUP BY clause, it returns only 4 rows.
Query:
Select Name, Source, Description
From testTable
Where Source like '%Validating err%'
And CreatedDate >='2016-12-01'
Group By Name, Source, Description
Result : 4 rows
I am adding this answer just for the benefit of @John, so he can visually understand why his two result sets have differing numbers of records.
Imagine a table called shirts, which has only two columns, size and color. Here is some sample data:
size | color
S | red
S | green
S | blue
M | red
M | green
M | blue
L | red
L | green
L | blue
In other words, there are three sizes of shirts, and each size has three possible colors.
Now, if you execute the following query:
SELECT size
FROM shirts
GROUP BY size
you will get three records back, containing only the three sizes. However, if you do the following:
SELECT size, color
FROM shirts
GROUP BY size, color
Then you would get back nine records, or groups. All that is happening here is that the addition of another column creates new possible group combinations, and hence more groups. And the same concept applies to what you are seeing in your two queries.

Metadata database design

I am trying to store metadata about a document in SQL Server. The documents are stored in a document archive, which returns an identifier so I can get a document back by asking the archive for it by that identifier.
Our users would like to be able to search for documents based on different metadata. The metadata could be 1 attribute or 5 depending on the document type, and the users should be able to create new document types from an admin site.
I can see two solutions here. One is that each document type gets its own metadata table, where all metadata attributes are predefined; if an attribute should be added, a new column needs to be created, and if a new document type is created, a new metadata table needs to be created. Our DBA will freak out with a solution like this, and I also see a problem with indexes: if the document type has 5 different metadata attributes, it needs to be searchable with 1 or 4 of them specified in the search, so I would need to write indexes for all the different combinations of possible searches.
Here is an example (fictive):
|documentId | Name | InsertDate | CustomerId | City
| 1 | John | 2014-01-01 | 2 | London
| 2 | John | 2014-01-20 | 5 | New York
| 3 | Able | 2014-01-01 | 10 | Paris
I could then say:
Give me all documents where Name = 'John'
Give me all documents where Name = 'John' and CustomerId = 5
Give me all documents where InsertDate = '2014-01-01' and City = 'London'
That would be 3 different indexes, and I still haven't covered all possible combinations. This isn't practical.
So I am looking into the evil 'EAV' (anti)pattern.
So instead of having the metadata as columns, I can have them as rows.
|documentId | MetaAttribute | MetaValue
| 1 | Name | John
| 1 | InsertDate | 2014-01-01
| 1 | CustomerId | 2
| 1 | City | London
| 2 | Name | John
| 2 | InsertDate | 2014-01-20
| 2 | CustomerId | 5
| 2 | City | New York
| 3 | Name | Able
| 3 | InsertDate | 2014-01-01
| 3 | CustomerId | 10
| 3 | City | Paris
Here it's simple to create one index on MetaAttribute and MetaValue, and it's covered. If a new document type is created, new metadata can be added for that document type in a MetaAttribute table (that contains all MetaAttributes for the different document types). So there is no need to create new tables or columns if a new document type is added, or if a new attribute is added to a document type. Instead, all MetaValues must be strings :( and the SQL query to find the document id is a bit more complicated.
This is what I figured out. (In this example the MetaAttribute is a string, but it would be an ID into the MetaAttribute table.)
SELECT *
FROM [Document]
WHERE ID IN
(
    SELECT documentId
    FROM [MetaData]
    WHERE (MetaAttribute = 'Name' AND MetaValue = 'John')
       OR (MetaAttribute = 'CustomerId' AND MetaValue = '5')
    GROUP BY [documentId]
    HAVING COUNT(1) = 2
)
Here I need to ask if Name = 'John' and CustomerId = 5. I do that by finding all records that have Name = 'John' or CustomerId = '5', grouping them on documentId, and counting the number of items in each group. If I get 2, then both Name = 'John' and CustomerId = '5' are true for this search. I return the documentId and use it to retrieve information about the document, like the document archive storage id.
There should be a better SQL statement for this, shouldn't there?
So my question is: is there a better approach than these 2? Is the EAV (anti)pattern so bad that I should stick with the first approach and have a freaked-out DBA and "ten millions of indexes"?
We are talking about a system that will have around 10-20 million new records each month, and contain data for at least 3 years... so the tables will be pretty big, and good indexes are necessary for performance.
Best Regards
Magnus
The EAV model is appealing if you have unbounded attributes--that is, anyone can set up anything as an attribute. However, it sounds from your description that this is not the case--the possible document attributes come from a known and fairly limited set. If this is the case, routine normalization suggests the following:
-- One per document
CREATE TABLE Document
(
DocumentId -- primary key
,DocumentType
,<etc>
)
-- One per "type" of document
CREATE TABLE DocumentType
(
DocumentTypeId -- primary key
,Name
)
-- One per possible document attribute.
-- Note that multiple document types can reference the same attribute
CREATE TABLE DocumentAttributes
(
AttributeId -- primary key
,Name
)
-- This lists which attributes are used by a given type
CREATE TABLE DocumentTypeAttributes
(
DocumentTypeId
,AttributeId
-- compound primary key on both columns
-- foreign keys on both columns
)
-- This contains the final association of document and attributes
CREATE TABLE DocumentAttributeValues
(
DocumentId
,AttributeId
,Value
-- compound primary key on DocumentId, AttributeId
-- foreign keys on both columns to their respective parent tables
)
A tighter model with more robust keys could be implemented to ensure at the database level that an attribute cannot be assigned to a document with an “inappropriate” type.
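For illustration, one untested way to build those more robust keys (this assumes a redundant DocumentTypeId on the values table, plus a UNIQUE constraint on Document (DocumentId, DocumentTypeId) so the composite foreign key has something to reference):
-- Composite foreign keys guarantee an attribute can only be attached
-- to a document whose type actually uses that attribute.
CREATE TABLE DocumentAttributeValues
(
    DocumentId     int NOT NULL
   ,DocumentTypeId int NOT NULL
   ,AttributeId    int NOT NULL
   ,Value          nvarchar(400) NOT NULL
   ,PRIMARY KEY (DocumentId, AttributeId)
   ,FOREIGN KEY (DocumentId, DocumentTypeId)
        REFERENCES Document (DocumentId, DocumentTypeId)
   ,FOREIGN KEY (DocumentTypeId, AttributeId)
        REFERENCES DocumentTypeAttributes (DocumentTypeId, AttributeId)
)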
Queries have to use joins, but (presumably) only the Document and DocumentAttributeValues tables will ever be large. An index on (AttributeId, Value) facilitates lookups by attribute type, and depending on cardinality an index on (Value, AttributeId) could make searches for specific attributes quite efficient.
(Edit)
Ooh, clever, I created two tables with the same name. I've renamed the last one to DocumentAttributeValues. (Free advice is clearly worth what you paid for it!)
This shows how ugly these systems can get in SQL, as you have to “look up” both attributes separately. On the plus side you don’t have to worry about “does this type go with this document”, as those rules have (better had) been applied when the data was loaded. Two examples:
This one spells everything out in joins, and as such I think it might perform worse than the next:
-- Top-down
SELECT do.DocumentId
from Document do
inner join DocumentAttributes da1
  on da1.Name = 'Name'
inner join DocumentAttributeValues dav1
  on dav1.DocumentId = do.DocumentId
  and dav1.AttributeId = da1.AttributeId
  and dav1.Value = 'John'
inner join DocumentAttributes da2
  on da2.Name = 'CustomerId'
inner join DocumentAttributeValues dav2
  on dav2.DocumentId = do.DocumentId
  and dav2.AttributeId = da2.AttributeId
  and dav2.Value = '5'
This one picks out the attributes, then finds which documents have all of them. It might perform better, as there’s one less table to process:
-- Bottom-up
SELECT xx.DocumentId
from (-- All documents with name "John"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'Name'
and dav.Value = 'John'
-- This combines the two sets, with "all" keeping any duplicate entries
union all
-- All documents with CustomerId = "5"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'CustomerId'
and dav.Value = '5') xx -- Have to give the subquery an alias
group by xx.DocumentId
having count(*) = 2
While further refinements might be possible, the more attributes you're filtering on, the uglier the queries will be. Five attributes max might work OK in SQL, but if you've got tons of attributes, a NoSQL solution might be what you're looking for.
(Please note that, as with my original post, I have not tested this code, so there may be typos or subtle--or not so subtle--errors in here.)
SQL Server 2008+ offers three related features for dealing with such cases:
Sparse Columns which allow you to define hundreds of columns even if only a subset are used at a time
Column Sets allow you to group these columns and treat them as a group
Filtered indexes can index only the rows that actually have values in them.
These features allow you to work with more-or-less normal SQL statements to handle all metadata columns.
These features were specifically added to address the EAV/metadata scenario.
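A minimal sketch of how the three features fit together (the table and column names here are invented for illustration; untested):
-- Sparse columns cost almost no storage when NULL; the column set exposes
-- them as a single XML column (and changes what SELECT * returns); the
-- filtered index covers only rows that actually carry a value.
CREATE TABLE DocumentMeta
(
    DocumentId int PRIMARY KEY
   ,CustomerId int           SPARSE NULL
   ,City       nvarchar(100) SPARSE NULL
   ,InsertDate date          SPARSE NULL
   ,AllMeta    xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);

CREATE INDEX IX_DocumentMeta_City
    ON DocumentMeta (City)
    WHERE City IS NOT NULL;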
EDIT
If you have a limited set of attributes that are always filled, there is no need for Sparse Columns or the EAV anti-pattern either.
You can create your tables as you normally would and add indexes to optimize the real workload you encounter. Certain types of queries will occur far more often than others, and SQL Server's Database Engine Tuning Advisor can propose the indexes and statistics to use based on a trace captured with SQL Server Profiler.
It's quite possible that only a subset of the columns will accelerate searches and the rest can be added as include columns in the index.
Full Text Search
A more powerful option is to use SQL Server's Full Text Search. This will allow you to execute queries using arbitrary attributes. This is another technique used by document/content management systems, ERPs and CRMs to handle arbitrary attributes.
With FTS you simply specify the columns to include in one FTS index and don't have to create separate indexes for each attribute.
You can use FTS predicates in SELECT queries like this:
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain')
This can result in much simpler queries (you just write a modified SELECT) and simpler administration (no worries about column order in indexes; only one FTS index to manage).
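The one-time setup is roughly this (the catalog name and key-index name are assumptions; KEY INDEX must name a unique index on the table):
-- One full-text catalog and one full-text index covering the searchable
-- columns; after that, CONTAINS/FREETEXT predicates work in queries.
CREATE FULLTEXT CATALOG ProductCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON Production.Product (Name)
    KEY INDEX PK_Product_ProductID;   -- assumed unique index name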

SQL View where rows not mapped

Basically what I'm trying to figure out is,
Say I have
table 1, tbl1:
ID | Name
and table 2, tbl2:
ID | Name
Then I have a mapping table mt
ID | tbl1ID | tbl2ID
The data really isn't important here; these tables are just examples.
How do I make a view that grabs all the items in tbl1 that aren't mapped in mt?
I'm using Microsoft SQL Server 2008, by the way.
CREATE VIEW v_unmapped
AS
SELECT *
FROM tbl1
WHERE id NOT IN
(
    SELECT tbl1ID
    FROM mt
)
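One caveat: if mt.tbl1ID is nullable and ever contains a NULL, NOT IN will return no rows at all. A NOT EXISTS version avoids that edge case:
-- Same view, written with NOT EXISTS so NULLs in mt.tbl1ID can't
-- silently empty the result.
CREATE VIEW v_unmapped
AS
SELECT t.*
FROM tbl1 AS t
WHERE NOT EXISTS
(
    SELECT 1
    FROM mt
    WHERE mt.tbl1ID = t.ID
)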

Detecting Correlated Columns in Data

Suppose I have the following data:
OrderNumber | CustomerName | CustomerAddress | CustomerCode
1 | Chris | 1234 Test Drive | 123
2 | Chris | 1234 Test Drive | 123
How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that SQL Server data mining is probably the right tool for the job, but I don't have much experience with it.
Thanks in advance.
UPDATE:
By "correlate", I mean in the statistics sense, that whenever column a is x, column b will be y. In the above data, The last three columns correlate with each other, and the first column does not.
The input of the operation would be the name of the table, and the output would be something like :
Column 1 | Column 2 | Certainty
CustomerName | CustomerAddress | 100%
CustomerAddress | CustomerCode | 100%
There is a 'functional dependency' test built into the SQL Server Data Profiling component (an SSIS component that ships with SQL Server 2008). It is described pretty well in this blog post:
http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx
I have played a little bit with accessing the data profiler output via some (under-documented) .NET APIs and it seems doable. However, since my requirement dealt with the distribution of column values, I ended up going with something much simpler based on the output of DBCC SHOW_STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:
select distinct
    case when a.OrderNumber < b.OrderNumber then a.OrderNumber
         else b.OrderNumber
    end as FirstOrderNumber,
    case when a.OrderNumber < b.OrderNumber then b.OrderNumber
         else a.OrderNumber
    end as SecondOrderNumber
from
    MyTable a
    inner join MyTable b on
        a.CustomerName = b.CustomerName
        and a.CustomerAddress = b.CustomerAddress
        and a.CustomerCode = b.CustomerCode
        and a.OrderNumber <> b.OrderNumber -- exclude each row matching itself
This would return you:
FirstOrderNumber | SecondOrderNumber
1 | 2
Correlation is defined on metric spaces, and your values are not metric.
This will give you the fraction (0 to 1) of customerName values whose customerAddress is not uniquely determined (note the 1.0 factor: without it, AVG over integers would truncate to 0):
SELECT AVG(1.0 * not_unique) AS share_not_unique
FROM (
    SELECT
        customerName,
        CASE
            WHEN COUNT(DISTINCT customerAddress) = 1
            THEN 0
            ELSE 1
        END AS not_unique -- 1 when one name maps to several addresses
    FROM orders
    GROUP BY customerName
) q
Substitute other columns for customerAddress and customerName in this query to find discrepancies between other column pairs.
