SQL Server query on a non-normalized table

I have an unnormalized table of customer orders. I want to count how many of each product are sold and display the result in a table by type.
OnlineSalesKey  SalesOrderNumber  ProductKey
--------------  ----------------  ----------
             1          20121018         778
             2          20121018         774
             3          20121018         665
             4          20121019         772
             5          20121019         778
             9          20121019         434
            10          20121019         956
            11          20121020         772
            12          20121020         965
            15          20121020         665
            16          20121020         778
            17          20121021         665
My query:
My query:
SELECT s.ProductKey, COUNT(*) AS Purchased
FROM Sales s
GROUP BY s.ProductKey
Question #1.
That query does the job. But now I want to display, and take into account, only those orders where more than one item is purchased. I'm not sure how to do that in one query. Any ideas?
Question #2
Is it possible to normalize the results and get the data back separated by semicolons?
20121018 | 778; 774; 665
Thanks!

You don't say which SQL database you're using, and there will be different, more-or-less efficient answers for each database. (Sorry, just noticed MSSQL is in the question title.)
Here's a solution that will work in all or most databases:
SELECT s.ProductKey, COUNT(*) AS Purchased
FROM Sales s
WHERE s.SalesOrderNumber IN
    (SELECT SalesOrderNumber FROM Sales GROUP BY SalesOrderNumber HAVING COUNT(*) > 1)
GROUP BY s.ProductKey
This is not the most efficient approach, but it should work across most databases.
Also, please note that you're using the terms normalized and unnormalized in reverse. The table you have is normalized, the results you want are de-normalized.
There is no standard SQL statement to get the de-normalized results you want using SQL alone, but some databases (MySQL and SQLite) provide the group_concat function to do just this.

Q1: Look at HAVING clause
display and take into account only those orders...
SELECT s.SalesOrderNumber, COUNT(*) AS Purchased
FROM Sales s
GROUP BY s.SalesOrderNumber
HAVING COUNT(*) > 1
So we group by the orders, apply the condition in HAVING, then display the SalesOrderNumber in the SELECT clause.
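To see the GROUP BY / HAVING filter in action, here is a small self-contained sketch using Python's built-in sqlite3 module as a stand-in for SQL Server, loaded with the sample data from the question:

```python
import sqlite3

# In-memory copy of the question's Sales table (sqlite3 stands in for SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OnlineSalesKey INT, SalesOrderNumber TEXT, ProductKey INT)")
rows = [(1, '20121018', 778), (2, '20121018', 774), (3, '20121018', 665),
        (4, '20121019', 772), (5, '20121019', 778), (9, '20121019', 434),
        (10, '20121019', 956), (11, '20121020', 772), (12, '20121020', 965),
        (15, '20121020', 665), (16, '20121020', 778), (17, '20121021', 665)]
conn.executemany("INSERT INTO Sales VALUES (?,?,?)", rows)

# Orders with more than one line item: group by the order, filter with HAVING.
result = conn.execute("""
    SELECT SalesOrderNumber, COUNT(*) AS Purchased
    FROM Sales
    GROUP BY SalesOrderNumber
    HAVING COUNT(*) > 1
    ORDER BY SalesOrderNumber
""").fetchall()
print(result)  # [('20121018', 3), ('20121019', 4), ('20121020', 4)]
```

Note that order 20121021, which has only one line item, is filtered out by the HAVING clause.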
Q2: Look at several group concatenation techniques.
MySQL:
SELECT s.SalesOrderNumber,
       GROUP_CONCAT(DISTINCT ProductKey ORDER BY ProductKey SEPARATOR '; ')
FROM Sales s
GROUP BY s.SalesOrderNumber
SQL Server: See this answer to a duplicate question. Basically, using FOR XML.
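If you just want to experiment with the group-concatenation idea locally, SQLite's GROUP_CONCAT (available through Python's built-in sqlite3 module) behaves much like MySQL's, minus the ORDER BY / SEPARATOR extras. The rows below are a small hypothetical sample, not the asker's full data:

```python
import sqlite3

# Minimal sample to sketch GROUP_CONCAT; sqlite3 stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SalesOrderNumber TEXT, ProductKey INT)")
conn.executemany("INSERT INTO Sales VALUES (?,?)",
                 [("20121018", 778), ("20121018", 774), ("20121018", 665),
                  ("20121021", 665)])

# One row per order, products joined with '; ' (element order is not guaranteed
# in SQLite's GROUP_CONCAT, unlike MySQL's ORDER BY option).
result = dict(conn.execute("""
    SELECT SalesOrderNumber, GROUP_CONCAT(ProductKey, '; ')
    FROM Sales
    GROUP BY SalesOrderNumber
""").fetchall())
print(result)
```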

SELECT s.ProductKey, COUNT(*) AS Purchased
FROM Sales s
GROUP BY s.ProductKey
HAVING COUNT(*) > 1
EDIT -
Answer 1 - To display products for which orders had more than one product purchased -
SELECT s.ProductKey, COUNT(*) AS Purchased
FROM Sales s
WHERE s.SalesOrderNumber IN
(
    SELECT SalesOrderNumber FROM Sales
    GROUP BY SalesOrderNumber HAVING COUNT(*) > 1
)
GROUP BY s.ProductKey

1) Try this one:
SELECT s.ProductKey, COUNT(*) AS Purchased
FROM Sales s
GROUP BY s.ProductKey
HAVING COUNT(*) > 1

Related

Google BigQuery SQL: working with an array?

I'm a newbie at SQL and Google BigQuery.
I'm trying to run the following query to get a list of names and counts; however, I am getting an array error and don't know how to fix it. Any help appreciated.
ERROR MESSAGE:
Cannot access field harmonized on a value with type ARRAY at [5:27]
#standardSQL
-- Applications_Per_Assignee
SELECT assignee_harmonized.name AS Assignee_Name, COUNT(*) AS Number_of_Patent_Apps
FROM (
SELECT ANY_VALUE(assignee.harmonized.name) AS Assignee_Name
FROM `patents-public-data.patents.publications` AS patentsdb
GROUP BY Number_of_Patent_Apps
)
GROUP BY assignee_harmonized.name
ORDER BY Number_of_Patent_Apps DESC;
Below is for BigQuery Standard SQL
#standardSQL
SELECT
ah.name AS Assignee_Name,
COUNT(*) AS Number_of_Patent_Apps
FROM `patents-public-data.patents.publications`,
UNNEST(assignee_harmonized) ah
GROUP BY Assignee_Name
-- HAVING Number_of_Patent_Apps < 1000
ORDER BY Number_of_Patent_Apps DESC
-- LIMIT 10
with output
Row  Assignee_Name                   Number_of_Patent_Apps
---  ------------------------------  ---------------------
1    SAMSUNG ELECTRONICS CO LTD      600678
2    CANON KK                        579731
3    MATSUSHITA ELECTRIC IND CO LTD  560644
4    HITACHI LTD                     531286
5    SIEMENS AG                      486276
6    MITSUBISHI ELECTRIC CORP        461673
7    IBM                             438822
8    SONY CORP                       438039
9    FUJITSU LTD                     384270
10   NEC CORP                        357193
Looks like there are a few things wrong with your query:
- assignee is a string; I think you want to look at assignee_harmonized.name
- You will want to UNNEST() assignee_harmonized
- ANY_VALUE() only selects an arbitrary value, which does not sound like what you want
- You have a GROUP BY in your inner SELECT, which will not give you the results you want
- You don't really need a subquery for this type of query.
#standardSQL
SELECT ah.name AS Assignee_Name, COUNT(*) AS Number_of_Patent_Apps
FROM `patents-public-data.patents.publications` AS patentsdb
LEFT JOIN UNNEST(assignee_harmonized) ah
GROUP BY 1
ORDER BY 2 DESC
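Since you can't run BigQuery locally, here is a rough Python analogy of what UNNEST + GROUP BY does to a repeated field. The publications list below is entirely made up, just to show the flattening step:

```python
from collections import Counter

# Hypothetical miniature of the publications table: each row carries a
# REPEATED field (an array of assignee structs), as in BigQuery.
publications = [
    {"pub": "A1", "assignee_harmonized": [{"name": "ACME"}, {"name": "GLOBEX"}]},
    {"pub": "A2", "assignee_harmonized": [{"name": "ACME"}]},
    {"pub": "A3", "assignee_harmonized": []},  # no assignees on this one
]

# UNNEST: one output row per array element (rows with empty arrays drop out,
# as with a plain CROSS JOIN UNNEST), then GROUP BY name and count.
counts = Counter(a["name"] for row in publications for a in row["assignee_harmonized"])
print(counts.most_common())  # [('ACME', 2), ('GLOBEX', 1)]
```

The answer above uses LEFT JOIN UNNEST, which would instead keep the empty-array row with a NULL name; the CROSS JOIN behavior here is the simpler case.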

PostgreSQL - Filter column 2 results based on column 1

Forgive a novice question. I am new to postgresql.
I have a database full of transactional information. My goal is to iterate through each day since the first transaction, and show how many unique users made a purchase on that day, or in the 30 days previous to that day.
So the # of unique users on 02/01/2016 should show all unique users from 01/01/2016 through 02/01/2016. The # of unique users on 02/02/2016 should show all unique users from 01/02/2016 through 02/02/2016.
Here is a fiddle with some sample data: http://sqlfiddle.com/#!15/b3d90/1
The result should be something like this:
December 17 2014 -- 1
December 18 2014 -- 2
December 19 2014 -- 3
...
January 13 2015 -- 16
January 19 2015 -- 15
January 20 2015 -- 15
...
The best I've come up with is the following:
SELECT
to_char(S.created, 'YYYY-MM-DD') AS my_day,
COUNT(DISTINCT
CASE
WHEN S.created > S.created - INTERVAL '30 days'
THEN S.user_id
END)
FROM
transactions S
GROUP BY my_day
ORDER BY my_day;
As you can see, I have no idea how I could reference what exists in column one in order to specify what date range should be included in the filter.
Any help would be much appreciated!
I think if you do a self-join, it would give you the results you seek:
select t1.created, count(distinct t2.user_id)
from transactions t1
join transactions t2
  on t2.created between t1.created - interval '30 days' and t1.created
group by t1.created
order by t1.created
That said, I think this is going to do a form of cartesian join in the background, so for large datasets I doubt it's very efficient. If you run into huge performance problems, there are ways to make this a lot faster... but before you address that, find out if you need to.
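To sanity-check the self-join on a toy dataset, here's a sketch with Python's sqlite3, using SQLite's date() function in place of Postgres's interval arithmetic (the sample rows are invented):

```python
import sqlite3

# Invented sample transactions; sqlite3 stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (created TEXT, user_id INT)")
conn.executemany("INSERT INTO transactions VALUES (?,?)", [
    ("2016-01-01", 1), ("2016-01-15", 2), ("2016-01-15", 1), ("2016-02-10", 3),
])

# For each day with activity, count distinct users active in the
# 30 days up to and including that day (the self-join range condition).
result = conn.execute("""
    SELECT t1.created, COUNT(DISTINCT t2.user_id)
    FROM transactions t1
    JOIN transactions t2
      ON t2.created BETWEEN date(t1.created, '-30 days') AND t1.created
    GROUP BY t1.created
    ORDER BY t1.created
""").fetchall()
print(result)  # [('2016-01-01', 1), ('2016-01-15', 2), ('2016-02-10', 3)]
```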
-- EDIT 8/20/16 --
In response to your issue with the performance of this... yes, it's a pig. I admit it. I encountered a similar issue here:
PostgreSQL Joining Between Two Values
The same concept for your example is this:
with xtrans as (
  select created, created + generate_series(0, 30) as create_range, user_id
  from transactions
)
select t1.created, count(distinct t2.user_id)
from transactions t1
join xtrans t2
  on t2.create_range = t1.created
group by t1.created
order by t1.created
It's not as easy to follow, but it should yield identical results, only it will be significantly faster because it's not doing the "glorified cross join."
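The same expansion trick can be sketched in plain Python: each transaction "covers" the 31 days it can count toward, so the range join becomes a cheap dictionary lookup (sample data invented):

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample: (created, user_id) pairs.
transactions = [(date(2016, 1, 1), 1), (date(2016, 1, 15), 2),
                (date(2016, 1, 15), 1), (date(2016, 2, 10), 3)]

# The CTE trick: expand every transaction into the 31 days it can count
# toward, turning the range join into an equality lookup.
cover = defaultdict(set)
for created, user_id in transactions:
    for offset in range(31):  # mirrors generate_series(0, 30)
        cover[created + timedelta(days=offset)].add(user_id)

# Join back on the days that actually had transactions.
days = sorted({created for created, _ in transactions})
result = [(d, len(cover[d])) for d in days]
print(result)  # same numbers as the range-join version
```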

Delete latest entry in SQL Server without using datetime or ID

I have a basic SQL Server delete script that goes:
Delete from tableX
where colA = ? and colB = ?;
In tableX, I do not have any columns indicating sequential IDs or a timestamp; just varchar. I want to delete the latest entry that was inserted, and I do not have access to the row number from the insert script. TOP is not an option because the order is random. Also, this particular table does not have a primary key, and it's not a matter of poor design. Is there any way I can do this? I recall MySQL being able to call something like max(row_number) and also something along the lines of LIMIT 1.
ROW_NUMBER exists in SQL Server, too, but it must be used with an OVER (order_by_clause). So... in your case it's impossible for you unless you come up with another sorting algo.
MSDN
Edit: (Examples for George from MSDN ... I'm afraid his company has a Firewall rule that blocks MSDN)
SQL-Code
USE AdventureWorks2012;
GO
SELECT ROW_NUMBER() OVER(ORDER BY SalesYTD DESC) AS Row,
FirstName, LastName, ROUND(SalesYTD,2,1) AS "Sales YTD"
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0;
Output
Row  FirstName  LastName           SalesYTD
---  ---------  -----------------  ----------
1    Linda      Mitchell           4251368.54
2    Jae        Pak                4116871.22
3    Michael    Blythe             3763178.17
4    Jillian    Carson             3189418.36
5    Ranjit     Varkey Chudukatil  3121616.32
6    José       Saraiva            2604540.71
7    Shu        Ito                2458535.61
8    Tsvi       Reiter             2315185.61
9    Rachel     Valdez             1827066.71
10   Tete       Mensa-Annan        1576562.19
11   David      Campbell           1573012.93
12   Garrett    Vargas             1453719.46
13   Lynn       Tsoflias           1421810.92
14   Pamela     Ansman-Wolfe       1352577.13
Returning a subset of rows
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
Using ROW_NUMBER() with PARTITION
USE AdventureWorks2012;
GO
SELECT FirstName, LastName, TerritoryName, ROUND(SalesYTD,2,1),
ROW_NUMBER() OVER(PARTITION BY TerritoryName ORDER BY SalesYTD DESC) AS Row
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0
ORDER BY TerritoryName;
Output
FirstName  LastName           TerritoryName   SalesYTD    Row
---------  -----------------  --------------  ----------  ---
Lynn       Tsoflias           Australia       1421810.92  1
José       Saraiva            Canada          2604540.71  1
Garrett    Vargas             Canada          1453719.46  2
Jillian    Carson             Central         3189418.36  1
Ranjit     Varkey Chudukatil  France          3121616.32  1
Rachel     Valdez             Germany         1827066.71  1
Michael    Blythe             Northeast       3763178.17  1
Tete       Mensa-Annan        Northwest       1576562.19  1
David      Campbell           Northwest       1573012.93  2
Pamela     Ansman-Wolfe       Northwest       1352577.13  3
Tsvi       Reiter             Southeast       2315185.61  1
Linda      Mitchell           Southwest       4251368.54  1
Shu        Ito                Southwest       2458535.61  2
Jae        Pak                United Kingdom  4116871.22  1
Your current table design does not allow you to determine the latest entry. You have no field to sort on to indicate which record was added last.
You need to redesign, or pull that information from the audit tables. If you have a database without audit tables, you might have to find a tool to read the transaction logs, and it will be a very time-consuming and expensive process. Or, if you know the date the records you want to remove were added, you could possibly use a backup from just before this happened to find the records that were added. Just be aware that you might be looking at records changed after this date that you want to keep.
If you need to do this on a regular basis instead of one time to fix some bad data, then you need to properly design your database to include an identity field and possibly a DateUpdated field (maintained through a trigger) or audit tables. (In my opinion no database containing information your company depends on should be without audit tables; this is one of the many reasons why you should never allow an ORM to design a database, but I digress.) If you need to know the order in which records were added to a table, it is your responsibility as the developer to create that structure. Databases only store what they are designed to store; if you didn't design it in, then it is not available easily, or at all.
If (colA + '_' + colB) cannot be duplicated, try this:
DECLARE @delColumn NVARCHAR(250)

SET @delColumn = (SELECT TOP 1 DeleteColumn FROM (
        SELECT (colA + '_' + colB) AS DeleteColumn,
               ROW_NUMBER() OVER (ORDER BY colA DESC) AS Id
        FROM tableX
    ) b
    ORDER BY Id DESC
)

DELETE FROM tableX WHERE (colA + '_' + colB) = @delColumn

SQL Statement to total all employee records

I have a SQL statement that is not returning all employee names.
Table employee_list contains all employees for the company.
Table apps contain the employee that is assigned to the app
Table details contains the total dollar amount for the order
My query does not group and total for employees that did not have any apps. For example, employee John had 5 apps for $250, Bill had 2 apps for $75, and Henry had 0 apps for $0 (no rows in the apps or details tables for Henry).
My query returns:
John 5 250.00
Bill 2 75.00
I need it to return
John 5 250.00
Bill 2 75.00
Henry 0 0.00
Any ideas? Here is my current code
SELECT employee_list.Fullname,
       COUNT(apps.acntnum),
       SUM(details.cost)
FROM employee_list
LEFT JOIN apps ON employee_list.Fullname = apps.EmployeeName
LEFT JOIN details ON (apps.ID = details.ObjOwner_ID AND details.Active = 1)
GROUP BY employee_list.Fullname
The important thing is to be using a LEFT JOIN from your employee_list table and any subsequent tables you're joining to, and to not do anything that will filter out NULLs from the right-hand tables (because the NULLs would be for the 'missing' rows).
Your query is fine, but I suspect you're using it in a wider query, where you may inadvertently have an INNER JOIN or mention one of the columns in a WHERE clause.
I agree with all the other answers, however, you could also try this....
SELECT employee_list.Fullname,
       (SELECT COUNT(apps.acntnum)
        FROM apps
        WHERE employee_list.Fullname = apps.EmployeeName) AS Cnt,
       (SELECT SUM(details.cost)
        FROM apps
        LEFT JOIN details ON (apps.ID = details.ObjOwner_ID AND details.Active = 1)
        WHERE employee_list.Fullname = apps.EmployeeName) AS Cost
FROM employee_list
This will always return the full list of employees, and separately go and count/sum the other values.
This answer does not take performance into account.
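A quick way to convince yourself the correlated-subquery approach keeps zero-app employees is to run it against a toy dataset. Here's a sketch with Python's sqlite3 (table and column names follow the question; the rows are made up, and COALESCE is added so Henry's NULL sum shows as 0):

```python
import sqlite3

# Made-up miniature of the question's three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_list (Fullname TEXT);
    CREATE TABLE apps (ID INT, acntnum INT, EmployeeName TEXT);
    CREATE TABLE details (ObjOwner_ID INT, cost REAL, Active INT);
    INSERT INTO employee_list VALUES ('John'), ('Bill'), ('Henry');
    INSERT INTO apps VALUES (1, 101, 'John'), (2, 102, 'John'), (3, 103, 'Bill');
    INSERT INTO details VALUES (1, 100.0, 1), (2, 150.0, 1), (3, 75.0, 1);
""")

# Correlated subqueries: every employee appears even with zero apps.
rows = conn.execute("""
    SELECT e.Fullname,
           (SELECT COUNT(*) FROM apps a WHERE a.EmployeeName = e.Fullname),
           COALESCE((SELECT SUM(d.cost)
                     FROM apps a JOIN details d
                       ON a.ID = d.ObjOwner_ID AND d.Active = 1
                     WHERE a.EmployeeName = e.Fullname), 0.0)
    FROM employee_list e
""").fetchall()
totals = {name: (cnt, cost) for name, cnt, cost in rows}
print(totals)
```

Henry comes back with (0, 0.0) even though he has no rows in apps or details.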

For Loop in SQL Server - is this the right logic?

I'm trying to take transactional data and cleanse it to meet my analysis needs. There are some limitations to how transactions are recorded into the database, and I am trying to get around those limitations.
When a customer places an order with more than 1 product, the transactional database doesn't link the multiple products together. Each product will have a unique sales ID, but there is no way to group multiple sales ID into 1 order. Here is a sample:
OrderID  MultOrderID  CustomerID  SalesDate   SalesTime  ProductID  ProductCost  ShippingCost
6082346               7661X0A     2012-06-12  959        105        99.99        7.95
6082347  5809812YY6Y  T891002     2012-06-12  1005       222        99.95        7.95
6082348  5809812YY6Z  T891002     2012-06-12  1005       273        22.95        1.00
6082349  5809812YY71  T891002     2012-06-12  1005       285        499.95       1.00
6082350  5809812YY72  T891002     2012-06-12  1005       172        49.95        1.00
6082351  5809812YY73  T891002     2012-06-12  1005       105        99.99        7.95
6082352  5809812YY74  X637251     2012-06-12  1010       285        499.95       7.95
6082353  5809812YY75  X637251     2012-06-12  1010       30         1024.99      1.00
6082354               T512AT0     2012-06-12  1017       172        49.95        7.95
An additional limitation to this transaction system is that it can not ship more than 4 products together. If the customer places an order for 5 products, 4 products are shipped together (and charged 1 shipping charge), the remaining product is shipped separately and charged another shipping charge (yes, the overall business wants to rebuild this entire legacy system....).
What I am trying to determine is the number of products shipped per order, and the aggregate product costs and shipping costs.
If you look at the last 4 characters of the MultOrderID, you'll see that it's sequential, YY6Y becomes YY6Z, then rolls over to YY71, YY72. The logic is standardized - I know that if the CustomerID, SalesDate and SalesTime are the same, then I can pair off the products together. What I don't know is HOW I can accomplish this.
I believe the way to accomplish this is to break out the orders by CustomerID, SalesDate and SalesTime. Then, I get a for-loop, or something like that to cycle through the individual entries. Then, I look for the last 4 characters of the MultOrderID and say - If 1,2 and 3 are the same, and the 4th character is after the 4th character of the previous order, then pair it together, up to 4 orders. If the orderID is the 5th to 8th order in the range, then that's shipment 2, etc.
Can this be done in SQL Server? If not in that, what should I write this in? And is a for-loop what I should be using in this case?
Edit: Here is the output I am trying to get to. Keep in mind that after the 4th product is shipped, I need to restart the ordering (so 6 products get broken into 2 shipments [4 products and 2 products], and 9 products into 3 shipments [4, 4, and 1]).
PRODUCTSSHIPPED  SALESDATE  SALESTIME  CUSTOMERID  PRODUCTCOST  SHIPPINGCOST
4                6/12/12    1005       T891002     672.8        10.95
1                6/12/12    1005       T891002     99.99        7.95
2                6/12/12    1010       X637251     1524.94      8.95
1                6/12/12    1017       T512AT0     49.95        7.95
1                6/12/12    959        7661X0A     99.99        7.95
Well from this statement it seems like you want this:
What I am trying to determine is the number of products shipped per
order, and the aggregate product costs and shipping costs.
http://sqlfiddle.com/#!3/e0e71/30
So I'm not sure what you mean by using a foreach loop?
UPDATE:
Got it working using a subquery and ceiling function
Updated the fiddle
FYI SQL is:
SELECT
SalesDate,
SalesTime,
CustomerID,
SUM(ProductCost),
SUM(ShippingCost)
FROM
(
SELECT
SalesDate,
SalesTime,
CustomerID,
ProductCost,
ShippingCost,
ROW_NUMBER() OVER (PARTITION BY salesdate, salestime, customerid ORDER BY CustomerID) as ProdNumber
FROM Orders
) as Summary
group by SalesDate, SalesTime, CustomerID, ceiling(ProdNumber / 4.0)
I used ROW_NUMBER to get a running count of products for each order, then made this a subquery so I could do the grouping. The grouping just divides the number of products by 4 (as a float) and uses the ceiling function to round up to the nearest int, so the products group into sets of four.
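The ROW_NUMBER-then-ceiling bucketing can also be sketched in plain Python, which may make the "groups of four" logic easier to see (the order lines below are a made-up subset of the question's sample, already sorted by customer/date/time):

```python
import math

# Hypothetical order lines for one customer at one date/time:
# each tuple is (CustomerID, ProductCost, ShippingCost) for one product.
order_lines = [("T891002", 99.95, 7.95), ("T891002", 22.95, 1.00),
               ("T891002", 499.95, 1.00), ("T891002", 49.95, 1.00),
               ("T891002", 99.99, 7.95)]

# ROW_NUMBER within the order, then ceiling(n / 4) as the shipment bucket:
# products 1-4 land in shipment 1, product 5 starts shipment 2.
shipments = {}
for n, (cust, prod, ship) in enumerate(order_lines, start=1):
    bucket = math.ceil(n / 4)
    count, pcost, scost = shipments.get((cust, bucket), (0, 0.0, 0.0))
    shipments[(cust, bucket)] = (count + 1, pcost + prod, scost + ship)

for (cust, bucket), (count, pcost, scost) in sorted(shipments.items()):
    print(cust, bucket, count, round(pcost, 2), round(scost, 2))
```

Shipment 1 comes out as 4 products, 672.80 product cost, 10.95 shipping; shipment 2 as the single remaining product, matching the expected output in the question.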
This should give you the number of orders for that customer/date/time in the NumOrders field. It uses my new favorite function, Row_Number:
SELECT [CUSTOMERID], [SALESDATE], [SALESTIME], MAX(NumOrders)
FROM (
    SELECT [CUSTOMERID],
           [SALESDATE],
           [SALESTIME],
           ROW_NUMBER() OVER (PARTITION BY [CUSTOMERID], [SALESDATE], [SALESTIME] ORDER BY [CUSTOMERID]) AS NumOrders
    FROM Orders
) t1
GROUP BY [CUSTOMERID], [SALESDATE], [SALESTIME]
I don't think you need a loop here. Usually it's considered bad practice in SQL, unless completely unavoidable. Can you assume that if an order is made by a user at exactly the same datetime, it belongs to the same logical order (order group)?
Anyway, the whole problem can probably be solved using SQL Server's partition and over clauses. Look at sample D there; I think it's doing something close to what you need.
EDIT
The RANGE clause is available only in SQL 2012. However, you can still use partitioning and ROW_NUMBER, and then group your results using a simple calculation (ROW_NUMBER / 4) on the returned row number.
I'm not sure why a loop is needed at all..
SELECT COUNT(*) AS ProductsOnOrder, LEFT(CustomerID, 4) AS CID,
       SalesDate, SalesTime, SUM(ProductCost), SUM(ShippingCost)
FROM YOUR_TABLENAME
GROUP BY LEFT(CustomerID, 4), SalesDate, SalesTime
What order number do you want to display? Min? Max? All of them? Same question on products: do you want to list the products or just count them?
SELECT COUNT(*) AS ProductsOnOrder, LEFT(CustomerID, 4) AS CID,
       SalesDate, SalesTime, SUM(ProductCost), SUM(ShippingCost),
       MIN(OrderID), MAX(OrderID)
FROM YOUR_TABLENAME
GROUP BY LEFT(CustomerID, 4), SalesDate, SalesTime
Since you know orderID is sequential for each line on the order you could return min/max and subtract the two as well to get a count.
