Detecting Correlated Columns in Data - sql-server

Suppose I have the following data:
OrderNumber | CustomerName | CustomerAddress | CustomerCode
1 | Chris | 1234 Test Drive | 123
2 | Chris | 1234 Test Drive | 123
How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that SQL Server data mining is probably the right tool for the job, but I don't have much experience with it.
Thanks in advance.
UPDATE:
By "correlate", I mean in the statistics sense, that whenever column a is x, column b will be y. In the above data, The last three columns correlate with each other, and the first column does not.
The input of the operation would be the name of the table, and the output would be something like :
Column 1 | Column 2 | Certainty
CustomerName | CustomerAddress | 100%
CustomerAddress | CustomerCode | 100%

There is a 'functional dependency' test built into the SQL Server Data Profiling component (an SSIS component that ships with SQL Server 2008). It is described pretty well in this blog post:
http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx
I have played a little with accessing the data profiler output via some (under-documented) .NET APIs, and it seems doable. However, since my requirement dealt with the distribution of column values, I ended up going with something much simpler based on the output of DBCC SHOW_STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
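For reference, the distribution information that simpler approach relies on can be pulled with DBCC SHOW_STATISTICS; a minimal sketch, assuming an Orders table with a statistics object named IX_Orders_CustomerName (both names are invented for illustration). The "All density" value it returns is 1 / number-of-distinct-values, which gives a crude way to compare how selective two columns are:
DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_CustomerName') WITH DENSITY_VECTOR;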

What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:
select distinct
    case when a.OrderNumber < b.OrderNumber then a.OrderNumber
         else b.OrderNumber
    end as FirstOrderNumber,
    case when a.OrderNumber < b.OrderNumber then b.OrderNumber
         else a.OrderNumber
    end as SecondOrderNumber
from MyTable a
inner join MyTable b
    on a.OrderNumber <> b.OrderNumber      -- don't pair a row with itself
    and a.CustomerName = b.CustomerName
    and a.CustomerAddress = b.CustomerAddress
    and a.CustomerCode = b.CustomerCode
This would return:
FirstOrderNumber | SecondOrderNumber
1 | 2

Correlation, strictly speaking, is defined for numeric (metric) data, and your values are not numeric.
This will give you the fraction (0 to 1) of customerName values that do not uniquely determine customerAddress:
SELECT AVG(1.0 * perfect)
FROM (
    SELECT
        customerName,
        CASE
            WHEN COUNT(DISTINCT customerAddress) = 1
            THEN 0
            ELSE 1   -- this customerName maps to more than one address
        END AS perfect
    FROM orders
    GROUP BY customerName
) q
Substitute other column pairs for customerAddress and customerName in this query to find discrepancies between them.
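For example, to measure how often CustomerCode fails to be determined by CustomerAddress in the question's data (a sketch assuming the table is called orders, as above):
SELECT AVG(1.0 * perfect) AS violation_fraction
FROM (
    SELECT CustomerAddress,
           -- 1 = this address maps to more than one code
           CASE WHEN COUNT(DISTINCT CustomerCode) = 1 THEN 0 ELSE 1 END AS perfect
    FROM orders
    GROUP BY CustomerAddress
) q;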

Related

SQL If Else Logic

I have a staging table with the following structure
ID | BookID | Title | Cost |
----------------------------
1  | Test   | 1234  | 1234 |
This is populated through my system when an Excel file is picked up; I place all of the values from that sheet into my staging table.
I also have another table, which for this example I'll call my Specials table. It has an identical structure to my staging table.
ID | BookID | Title  | Cost |
-----------------------------
1  | Test   | Mr Men | 4,99 |
What I'm doing now is amending a proc that is doing a whole host of calculations based on the data inside my staging table. A typical call to this looks like this:
BookTitle = dbo.StagingTable.Title
My amendment needs to check whether the book's name in my staging table is also in the Specials table. If it is, then I should bring back that data instead of the data inside my staging table.
The BookId values are the same in both and I'm doing a LEFT OUTER JOIN to tie them together. What I'm struggling with is figuring out the correct syntax to do what I want.
LEFT OUTER JOIN dbo.Specials s on dbo.StagingTable.BookId = s.BookId
Could someone point me in the right direction please?
The above is just small snippets from a larger proc that I can't share. So if things seem odd, that's why. I've simply taken the bits I could to help better explain my issue.
In T-SQL, you can do:
SELECT *
FROM dbo.StagingTable
WHERE StagingTable.Title IN (SELECT Specials.Title FROM dbo.Specials)
to get all the rows in the StagingTable that have a Title that is also present in the Specials table.
Please test the following SELECT statement; I hope it is what you require:
select
    staging.ID,
    staging.BookID,
    Title = case when staging.Title <> isnull(Specials.Title, staging.Title)
                 then Specials.Title
                 else staging.Title
            end,
    staging.Cost
from staging
left outer join Specials on staging.BookID = Specials.BookID
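The CASE/ISNULL expression above is equivalent to a COALESCE, which reads a little more directly as "use the Specials title when a match exists, otherwise keep the staging title". A minimal sketch against the question's table names (untested):
SELECT st.ID,
       st.BookID,
       COALESCE(s.Title, st.Title) AS Title,  -- prefer the Specials title when a match exists
       st.Cost
FROM dbo.StagingTable AS st
LEFT OUTER JOIN dbo.Specials AS s ON st.BookID = s.BookID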

SQL Server Insert Using View

This is a simplified example of what I want to do. Assume there is a table named contractor that looks like this:
name | paid_adjustment_amount | adj_date
Bob | 1000 | 4/7/2016
Mary | 2000 | 4/8/2016
Bill | 5000 | 4/8/2016
Mary | 4000 | 4/10/2016
Bill | (1000) | 4/12/2016
Ann | 3000 | 4/30/2016
There is a view of the contractor table, let's call it v_sum, that is just a SUM of the paid_adjustment_amount grouped by name. So it looks like this:
name | total_paid_amount
Bob | 1000
Mary | 6000
Bill | 4000
Ann | 3000
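For reference, the view described above would be defined something like this (a sketch only, since the actual definition is not shown here):
CREATE VIEW v_sum AS
SELECT name,
       SUM(paid_adjustment_amount) AS total_paid_amount
FROM contractor
GROUP BY name;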
Finally, there is another table called to_date_payment that looks like this:
name | paid_to_date_amount
Bob | 1000
Mary | 8000
Bill | 3000
Ann | 3000
Joe | 4000
I want to compare the information in the to_date_payment table to the v_sum view and insert a new row in the contractor table to show an adjustment. Something like this:
INSERT INTO contractor
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount - v_sum.total_paid_amount,
GETDATE()
FROM to_date_payment
LEFT JOIN v_sum ON to_date_payment.name = v_sum.name
WHERE to_date_payment.paid_to_date_amount - v_sum.total_paid_amount <> 0
OR v_sum.name IS NULL
Are there any issues with using a view for this? My understanding, please correct me if I'm wrong, is that a view is just a result set of a query. And, since the view is of the table I'm inserting new records into, I'm afraid there could be data integrity problems.
Thanks for the help!
In order to fully understand what you are doing, you should also provide the definition for v_sum. Generally speaking, views might provide some advantages, especially when indexed. More details can be found here and here.
Simple usage of views does not provide performance benefits, but views are very good at providing abstraction over tables.
In your particular case, I do not see any problem in JOINing with the view, but I would worry about potential problems related to:
1) JOIN using VARCHARs instead of integers - ON to_date_payment.name = v_sum.name - if possible, try to JOIN on integer values (ids or foreign key ids), as it is faster (indexes on integer columns have a smaller key, and comparisons are slightly faster).
2) OR in queries - it usually leads to performance problems. One thing to try is to change the SELECT like this:
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount - v_sum.total_paid_amount,
GETDATE()
FROM to_date_payment
JOIN v_sum ON to_date_payment.name = v_sum.name
WHERE to_date_payment.paid_to_date_amount - v_sum.total_paid_amount <> 0
UNION ALL
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount, -- or NULL if this is really intended
GETDATE()
FROM to_date_payment
-- NOT EXISTS is usually faster than LEFT JOIN ... IS NULL
WHERE NOT EXISTS (SELECT 1 FROM v_sum V WHERE V.name = to_date_payment.name)
3) Possible undesired result - by default, arithmetic involving NULL returns NULL. When there is no match in v_sum, then v_sum.total_paid_amount is NULL and to_date_payment.paid_to_date_amount - v_sum.total_paid_amount will evaluate to NULL. Is this correct? Maybe to_date_payment.paid_to_date_amount - ISNULL(v_sum.total_paid_amount, 0) is intended.
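Putting point 3) into the original statement gives something like the following sketch (column names are taken from the question; with the missing total treated as 0, the separate "OR v_sum.name IS NULL" branch is only needed if a brand-new name could legitimately have a paid_to_date_amount of 0):
INSERT INTO contractor (name, paid_adjustment_amount, adj_date)
SELECT t.name,
       t.paid_to_date_amount - ISNULL(v.total_paid_amount, 0),
       GETDATE()
FROM to_date_payment AS t
LEFT JOIN v_sum AS v ON t.name = v.name
WHERE t.paid_to_date_amount - ISNULL(v.total_paid_amount, 0) <> 0;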

Preventing SQL injection in a report generator with custom formulas

For my customers, I am building a custom report generator, so they can create their own reports.
The concept is this: In a control table, they fill in the content of report columns. Each column can either consist of data from DIFFERENT DATA SOURCES (=tables), or of a FORMULA.
Here is a reduced sample of how this looks:
Column | Source | Year | Account | Formula
----------------------------------------------
col1 | TAB1 | 2015 | SALES | (null)
col2 | TAB2 | 2014 | SALES | (null)
col3 | FORMULA | (null) | (null) | ([col2]-[col1])
So col1 and col2 get data from tables tab1 and tab2, and col3 calculates the difference.
A stored procedure then creates a dynamic SQL, and delivers the report data.
The resulting SQL query looks like this:
SELECT
(SELECT sum(val) from tab1 where Year=2015 and Account='SALES') as col1,
(SELECT sum(val) from tab2 where Year=2014 and Account='SALES') as col2,
(
(SELECT sum(val) from tab1 where Year=2015 and Account='SALES')
-
(SELECT sum(val) from tab2 where Year=2014 and Account='SALES')
) as col3 ;
In reality it is far more complex, because there are more parameters, and I'm using coalesce(), etc.
My main headache is the formulas. While they give users a very flexible tool, they are totally vulnerable to SQL injection.
I just wanted to know if there is some simple way to check a parameter for possible SQL injection.
Otherwise I think I need to limit the flexibility of the system for normal users, so that only "super users" get access to the fully flexible reports.
Not really - many injections involve comments (to comment out the rest of the regular statement), so you could check for comments (-- and /*) and the ; sign (end of statement).
On the other hand, if you allow your users to put anything into the filters, what stops someone from writing a filter such as 1 = (select password from users where username = 'admin') to provoke an error message like "Error converting 'ReallyStrongPassword' to integer"?
Furthermore, judging by your queries, I suspect that performance will be a much bigger problem than injection (they read tab1 and tab2 twice instead of only once, compared with writing them 'regularly').
Edit:
You could check for SQL keywords such as select, update, delete, exec ... in the filter parameter, to harden your code / queries.
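Going a step further than a keyword blacklist, a character whitelist on the formula text is usually safer. A minimal T-SQL sketch, assuming formulas only ever need column references such as [col1], digits, arithmetic operators, parentheses, dots and spaces (adjust the whitelist to your formula language):
DECLARE @formula nvarchar(400) = N'([col2]-[col1])';   -- example input
DECLARE @allowed nvarchar(100) = N' abcdefghijklmnopqrstuvwxyz0123456789[]()+-*/.';
DECLARE @pos int = 1, @ok bit = 1;

-- every character must come from the whitelist
WHILE @pos <= LEN(@formula)
BEGIN
    IF CHARINDEX(LOWER(SUBSTRING(@formula, @pos, 1)), @allowed) = 0
        SET @ok = 0;
    SET @pos += 1;
END;

-- comment markers and statement separators are never legitimate in a formula
IF @formula LIKE N'%--%' OR @formula LIKE N'%/*%' OR CHARINDEX(N';', @formula) > 0
    SET @ok = 0;

SELECT @ok AS formula_is_safe;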

Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
------------|-----
a | 1
a | 2
a | 3
a | 4
b | 2
b | 3
c | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
------|------|------
1 | 2 | 1
1 | 3 | 1
1 | 4 | 1
2 | 3 | 2
2 | 4 | 1
3 | 4 | 1
The idea is to store a denormalised table that allows one to look up, for a given sku, which other skus have been related to the same sessions, and how many times both skus were related to the same session.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or using a faceting search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. Also looking into the potential for using a graph database of some sort, but wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations) with around 2-3k variations of sku in about 5 minutes on my iMac. I will realistically be denormalising only up to 10k rows per day. Question is whether this is actually the best approach. Seems a little dirty to me.
BEGIN;

CREATE TEMPORARY TABLE working_tags (tag_id int, session_id varchar, sku varchar) ON COMMIT DROP;

INSERT INTO working_tags
SELECT id,
       session_id,
       sku
FROM tags
WHERE time < now() - interval '12 hours'
  AND processed_product_relation IS NULL
  AND sku IS NOT NULL
LIMIT 200000;

CREATE TEMPORARY TABLE working_relations (sku1 varchar, sku2 varchar, score int) ON COMMIT DROP;

INSERT INTO working_relations
SELECT a.sku AS sku1,
       b.sku AS sku2,
       count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b
        ON a.session_id = b.session_id
       AND a.sku < b.sku
WHERE a.sku IS NOT NULL
  AND b.sku IS NOT NULL
GROUP BY a.sku, b.sku;

UPDATE product_relations
SET score = working_relations.score + product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
  AND working_relations.sku2 = product_relations.sku2;

INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
       working_relations.sku2,
       working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations
             ON (working_relations.sku1 = product_relations.sku1
             AND working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;

UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN (SELECT tag_id FROM working_tags);

COMMIT;
If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
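For what it's worth, those settings can be tried per session before committing to postgresql.conf changes; the values below are purely illustrative and need to be tuned for your hardware:
SET work_mem = '256MB';
SET effective_cache_size = '8GB';
SET random_page_cost = 1.1;   -- bring closer to seq_page_cost when the data is mostly cached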
If you're creating a materialized view, just use CREATE UNLOGGED TABLE whatever AS SELECT .... That way you minimise the number of writes/rewrites/overwrites.
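A sketch of that, reusing the pairing query above (the table name is made up):
CREATE UNLOGGED TABLE product_relations_snapshot AS
SELECT s1.sku AS sku1,
       s2.sku AS sku2,
       count(DISTINCT session_id) AS score
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku;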

SQL make rows into columns, PIVOT maybe

I have an MS SQL Server with a database for an E-commerce storefront.
This is some of the tables I have:
Products:
Id | Name | Price
ProductAttributeTypes: --Color, Size, Format
Id | Name
ProductAttributes: --Red, Green, 12x20 cm, Mirrored
Id | ProductAttributeTypeId | Name
Orders:
Id | DateCreated
OrderItems:
Id | OrderId | ProductId
OrderItemsToProductAttributes: --Relates an OrderItem to its product and selected attributes
OrderItemId | ProductAttributeId | ProductAttributeTypeId | ProductId
I want to select from the OrderItems table, to see which items have been purchased.
To see what kind of variants (ProductAttributes) were selected, I want those as "dynamic" columns in the result set.
So the resultset should look like this:
OrderItemId | ProductId | ProductName | Color | Size | Format
1234        | 123       | Mount. Bike | Red   | 2x20 | Mirror
I don't know if PIVOT is the thing to use? I'm not using any aggregate functions, so I guess not...
Are there any SQL ninjas who can help me out?
If you are using SQL Server 2005 or 2008 you can use the PIVOT command. See here.
In the example below the OrderAttributes set will look like:
OrderItemId AttName AttValue
----- ------ -----
100 Color Red
100 Size Small
101 Color Blue
101 Size Small
102 Color Red
102 Size Small
103 Color Blue
103 Size Large
The final results after the PIVOT will be:
OrderItemId Size Color
----- ------ -----
100 Small Red
101 Small Blue
102 Small Red
103 Large Blue
WITH OrderAttributes(OrderItemId, AttName, AttValue)
AS (
SELECT
OrderItemId,
pat.Name AS AttName,
pa.Name AS AttValue
FROM OrderItemsToProductAttributes x
INNER JOIN ProductAttributes pa
ON x.ProductAttributeId = pa.id
INNER JOIN ProductAttributeTypes pat
ON pa.ProductAttributeTypeId = pat.Id
)
SELECT AttrPivot.OrderItemId,
[Size] AS [Size],
[Color] AS Color
FROM OrderAttributes
PIVOT (
MAX([AttValue])
FOR [AttName] IN ([Color],[Size])
) AS AttrPivot
ORDER BY AttrPivot.OrderItemId
There is a way to dynamically build the columns (i.e. the Color and Size columns), as can be seen here. Make sure the compatibility level of your database is set to something higher than SQL Server 2000 (80), or you will get strange errors.
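As a rough illustration of that dynamic approach (this sketch assumes the OrderAttributes CTE above has been saved as a view of the same name, and is untested):
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- build a column list such as [Color],[Size],[Format] from the attribute types
SELECT @cols = STUFF((SELECT N',' + QUOTENAME(Name)
                      FROM ProductAttributeTypes
                      FOR XML PATH('')), 1, 1, N'');

SET @sql = N'SELECT OrderItemId, ' + @cols + N'
             FROM OrderAttributes
             PIVOT (MAX(AttValue) FOR AttName IN (' + @cols + N')) AS p
             ORDER BY OrderItemId;';

EXEC sp_executesql @sql;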
In the past, I've created physical tables for read purposes only. The structure you have above is GREAT for storage, but terrible for reporting.
So you could do the following:
Write a script (that is scheduled nightly) or a trigger (on data change) that does the following tasks:
First, you would dynamically go through each Product and build a static table "Product_[ProductName]"
Then go through each ProductAttributeType for each product and create/update/delete a physical column on the corresponding Product table.
Then, fill that table with the proper values based on OrderItemsToProductAttributes and ProductAttributes
This is just a rough idea. Make sure you are storing OrderID in the "Static"/"Flattened" tables. And make sure you do everything else you need to do. But after that, you should be able to start pulling from those flattened tables to get the data you need.
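A rough sketch of what one such flattening step could look like for a single, hard-coded product (the approach above would generate this per product; table and column names come from the question, the product name is invented):
IF OBJECT_ID('dbo.Product_MountainBike', 'U') IS NOT NULL
    DROP TABLE dbo.Product_MountainBike;

SELECT oi.OrderId,
       oi.Id AS OrderItemId,
       p.Name AS ProductName,
       MAX(CASE WHEN pat.Name = 'Color'  THEN pa.Name END) AS Color,
       MAX(CASE WHEN pat.Name = 'Size'   THEN pa.Name END) AS Size,
       MAX(CASE WHEN pat.Name = 'Format' THEN pa.Name END) AS Format
INTO dbo.Product_MountainBike
FROM OrderItems oi
INNER JOIN Products p ON oi.ProductId = p.Id
INNER JOIN OrderItemsToProductAttributes x ON x.OrderItemId = oi.Id
INNER JOIN ProductAttributes pa ON x.ProductAttributeId = pa.Id
INNER JOIN ProductAttributeTypes pat ON pa.ProductAttributeTypeId = pat.Id
WHERE p.Name = 'Mountain Bike'
GROUP BY oi.OrderId, oi.Id, p.Name;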
PIVOT is your best bet, but what I did for reporting purposes, and to make it work well with SSIS, was to create a view containing this query:
SELECT [InputSetID],
       [InputSetName],
       CAST([470] AS int)      AS [Created By],
       CAST([480] AS datetime) AS [Created],
       CAST([479] AS int)      AS [Updated By],
       CAST([460] AS datetime) AS [Updated]
FROM (SELECT st.InputSetID, st.InputSetName, avt.InputSetID AS avtID, avt.AttributeID, avt.Value
      FROM app.InputSetAttributeValue avt
      JOIN app.InputSets st ON avt.InputSetID = st.InputSetID) AS p
PIVOT (MAX(Value) FOR AttributeID IN ([470], [480], [479], [460])) AS pvt
Then I can just interact with the view. I have a trigger on the table to which any new dynamic attributes must be added, and it recreates this view, so I can assume the view is always correct.
