I'm trying to take transactional data and cleanse it to meet my analysis needs. There are some limitations to how transactions are recorded into the database, and I am trying to get around those limitations.
When a customer places an order with more than 1 product, the transactional database doesn't link the multiple products together. Each product will have a unique sales ID, but there is no way to group multiple sales ID into 1 order. Here is a sample:
OrderID  MultOrderID  CustomerID  SalesDate   SalesTime  ProductID  ProductCost  ShippingCost
6082346               7661X0A     2012-06-12  959        105        99.99        7.95
6082347  5809812YY6Y  T891002     2012-06-12  1005       222        99.95        7.95
6082348  5809812YY6Z  T891002     2012-06-12  1005       273        22.95        1.00
6082349  5809812YY71  T891002     2012-06-12  1005       285        499.95       1.00
6082350  5809812YY72  T891002     2012-06-12  1005       172        49.95        1.00
6082351  5809812YY73  T891002     2012-06-12  1005       105        99.99        7.95
6082352  5809812YY74  X637251     2012-06-12  1010       285        499.95       7.95
6082353  5809812YY75  X637251     2012-06-12  1010       30         1024.99      1.00
6082354               T512AT0     2012-06-12  1017       172        49.95        7.95
An additional limitation of this transaction system is that it cannot ship more than 4 products together. If the customer places an order for 5 products, 4 products are shipped together (and charged 1 shipping charge), and the remaining product is shipped separately and charged another shipping charge (yes, the overall business wants to rebuild this entire legacy system....).
What I am trying to determine is the number of products shipped per order, and the aggregate product costs and shipping costs.
If you look at the last 4 characters of the MultOrderID, you'll see that it's sequential, YY6Y becomes YY6Z, then rolls over to YY71, YY72. The logic is standardized - I know that if the CustomerID, SalesDate and SalesTime are the same, then I can pair off the products together. What I don't know is HOW I can accomplish this.
I believe the way to accomplish this is to break out the orders by CustomerID, SalesDate and SalesTime. Then, I get a for-loop, or something like that to cycle through the individual entries. Then, I look for the last 4 characters of the MultOrderID and say - If 1,2 and 3 are the same, and the 4th character is after the 4th character of the previous order, then pair it together, up to 4 orders. If the orderID is the 5th to 8th order in the range, then that's shipment 2, etc.
Can this be done in SQL Server? If not in that, what should I write this in? And is a for-loop what I should be using in this case?
Edit: Here is the output I am trying to get to. Keep in mind that after the 4th product is shipped, I need to restart the ordering (so, 6 products get broken into 2 shipments [4 products and 2 products], 9 products into 3 shipments [4, 4, and 1]).
PRODUCTSSHIPPED SALESDATE SALESTIME CUSTOMERID PRODUCTCOST SHIPPINGCOST
4 6/12/12 1005 T891002 672.8 10.95
1 6/12/12 1005 T891002 99.99 7.95
2 6/12/12 1010 X637251 1524.94 8.95
1 6/12/12 1017 T512AT0 49.95 7.95
1 6/12/12 959 7661X0A 99.99 7.95
Well from this statement it seems like you want this:
What I am trying to determine is the number of products shipped per
order, and the aggregate product costs and shipping costs.
http://sqlfiddle.com/#!3/e0e71/30
So I'm not sure what you mean by using a foreach loop?
UPDATE:
Got it working using a subquery and ceiling function
Updated the fiddle
FYI SQL is:
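SELECT
    COUNT(*) AS ProductsShipped,
    SalesDate,
    SalesTime,
    CustomerID,
    SUM(ProductCost) AS ProductCost,
    SUM(ShippingCost) AS ShippingCost
FROM
(
    SELECT
        SalesDate,
        SalesTime,
        CustomerID,
        ProductCost,
        ShippingCost,
        -- number the products within each order; ordering by OrderID keeps the numbering deterministic
        ROW_NUMBER() OVER (PARTITION BY SalesDate, SalesTime, CustomerID ORDER BY OrderID) AS ProdNumber
    FROM Orders
) AS Summary
-- CEILING(ProdNumber / 4.0) maps products 1-4 to shipment 1, products 5-8 to shipment 2, and so on
GROUP BY SalesDate, SalesTime, CustomerID, CEILING(ProdNumber / 4.0)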
I used ROW_NUMBER to get a running count of products for each order, then made this a subquery so I could do the grouping. The grouping divides the product number by 4.0 (as a float) and applies the CEILING function to round up to the nearest integer, so the products fall into groups of 4.
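The grouping logic can also be checked outside the database. Below is a minimal Python sketch, assuming the sample rows from the question and assuming products are numbered by OrderID within each customer/date/time group; the function name `summarize_shipments` is hypothetical and not part of the original answer.

```python
import math
from collections import defaultdict

# Sample rows from the question:
# (OrderID, CustomerID, SalesDate, SalesTime, ProductCost, ShippingCost)
rows = [
    (6082346, "7661X0A", "2012-06-12", 959, 99.99, 7.95),
    (6082347, "T891002", "2012-06-12", 1005, 99.95, 7.95),
    (6082348, "T891002", "2012-06-12", 1005, 22.95, 1.00),
    (6082349, "T891002", "2012-06-12", 1005, 499.95, 1.00),
    (6082350, "T891002", "2012-06-12", 1005, 49.95, 1.00),
    (6082351, "T891002", "2012-06-12", 1005, 99.99, 7.95),
    (6082352, "X637251", "2012-06-12", 1010, 499.95, 7.95),
    (6082353, "X637251", "2012-06-12", 1010, 1024.99, 1.00),
    (6082354, "T512AT0", "2012-06-12", 1017, 49.95, 7.95),
]

def summarize_shipments(rows, max_per_shipment=4):
    """Hypothetical helper mirroring ROW_NUMBER() plus CEILING(n / 4.0)."""
    row_number = defaultdict(int)  # running count per (customer, date, time)
    totals = {}  # (customer, date, time, shipment) -> [count, product, shipping]
    for order_id, customer, sales_date, sales_time, product, shipping in sorted(rows):
        key = (customer, sales_date, sales_time)
        row_number[key] += 1
        shipment = math.ceil(row_number[key] / max_per_shipment)
        bucket = totals.setdefault(key + (shipment,), [0, 0.0, 0.0])
        bucket[0] += 1
        bucket[1] += product
        bucket[2] += shipping
    # Round the sums to cents to hide floating-point noise
    return {k: (c, round(p, 2), round(s, 2)) for k, (c, p, s) in totals.items()}

shipments = summarize_shipments(rows)
for key, value in sorted(shipments.items()):
    print(key, value)
```

For customer T891002 this yields one shipment of 4 products and a second shipment of 1, which matches the output requested in the question.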
This should give you the number of orders for that customer/date/time in the NumOrders field. It uses my new favorite function, Row_Number:
SELECT [CUSTOMERID], [SALESDATE], [SALESTIME], MAX(NumOrders) AS NumOrders
FROM (
SELECT [CUSTOMERID],
[SALESDATE],
[SALESTIME],
ROW_NUMBER() OVER(PARTITION BY [CUSTOMERID], [SALESDATE], [SALESTIME] ORDER BY [CUSTOMERID]) AS NumOrders
FROM Orders -- table name assumed; the original snippet had no FROM clause
) t1
GROUP BY [CUSTOMERID], [SALESDATE], [SALESTIME]
I don't think you need a loop here. Loops are usually considered bad practice in SQL unless completely unavoidable. Can you assume that if orders are placed by the same user at exactly the same date and time, they belong to the same logical order (order group)?
Anyway, the whole problem can probably be solved using SQL Server's PARTITION BY and OVER clauses. Look at sample D there; I think it's doing something close to what you need.
EDIT
The RANGE clause is available only in SQL Server 2012; however, you can still use partitioning and ROW_NUMBER, and then group your results with a simple calculation on the returned row number, such as (ROWNUMBER - 1) / 4 with integer division, so that rows 1 to 4 land in the first group.
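One detail in the row-number arithmetic is worth checking: because ROW_NUMBER starts at 1, dividing it directly by 4 with integer division puts only three rows in the first group. The short Python sketch below (illustrative only, not part of the original answer) compares the bucketing options for nine rows in one partition.

```python
import math

# Row numbers 1..9 within one customer/date/time partition
row_numbers = range(1, 10)

# Integer division on the 1-based row number splits the groups unevenly (3, 4, 2)
naive = [rn // 4 for rn in row_numbers]

# Subtracting 1 first gives buckets of exactly 4 rows (4, 4, 1)
shifted = [(rn - 1) // 4 for rn in row_numbers]

# The CEILING(rn / 4.0) form produces the same buckets, numbered from 1
ceiling = [math.ceil(rn / 4) for rn in row_numbers]

print(naive)    # [0, 0, 0, 1, 1, 1, 1, 2, 2]
print(shifted)  # [0, 0, 0, 0, 1, 1, 1, 1, 2]
print(ceiling)  # [1, 1, 1, 1, 2, 2, 2, 2, 3]
```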
I'm not sure why a loop is needed at all..
Select count(*) as ProductsOnOrder, LEFT(CustomerID,4) as CID,
SalesDate, SalesTime, sum(productCost), sum(ShippingCost)
FROM YOUR_TABLENAME
GROUP BY left(CustomerID,4), salesdate, salestime
Which order number do you want to display: min, max, or all of them? Same question for products: do you want to list the products or just count them?
Select count(*) as ProductsOnOrder, LEFT(CustomerID,4) as CID,
SalesDate, SalesTime, sum(productCost), sum(ShippingCost),
min(orderID), Max(orderID)
FROM YOUR_TABLENAME
GROUP BY left(CustomerID,4), salesdate, salestime
Since you know orderID is sequential for each line on the order, you could also return the min and max and subtract the two (max - min + 1) to get a count.
I have 3 tables as follows:
Product: (product_id, product_description)
Seller: (seller_id, seller_name)
Association: (seller_id, product_id, price)
Many sellers sell many products. I need to find the two cheapest prices for each product (ordered by increasing price) and their corresponding vendors. The ideal column outputs are:
product_id, product_description, seller_id, seller_name, price
p01, milk, s04, walmart, 1.50
p01, milk, s02, target, 2.25
p02, rice, s05, safeway, 1.30
p02, rice, s03, dillons, 1.75
Here's what I've tried on SQL-server; it's an intermediate step towards the answer. I'm triggering an error but don't understand why:
SELECT TOP 2 *
FROM
(SELECT A.seller_id, A.product_id, min(price) AS A.price
FROM Association A
GROUP BY A.seller_ID, A.product_id)
ORDER BY A.price ASC
And the error:
Msg 102, Level 15, State 1, Line 8
Incorrect syntax near '.'.
Edit: I used the solution proposed by Benjamin; it's near correct. Here's the query output:
seller_id, product_id, price, m
1 1 7.89 1
3 1 8.00 1
6 1 8.50 1
1 2 12.05 1
6 2 12.50 1
1 3 13.67 1
6 3 15.00 1
1 4 7.66 1
3 4 7.50 1
6 4 8.24 1
Of note, some product_id values, such as 1 and 4, occurred 3 times, where I only need the two lowest prices, not the third (or higher.) So I believe that this code is ordering by price, but not removing entries with a price higher than the second lowest.
It's easier to do it with a CTE:
with min1 as (
SELECT A.seller_id, A.product_id, A.price,
-- partition by product only, so sellers are ranked against each other within a product
row_number() over (partition by A.product_id order by A.price asc) as rn
FROM Association A
)
select * from min1
where rn < 3
order by product_id, price;
To answer your question, you have a derived table in your query (which some might call a subquery). It must have an alias - so give it one.
In the query used to form the derived table, you have used incorrect syntax. You should not attempt to give the aggregated column the name of a.price (and you SHOULD be consistent with your names and their spelling - one day this inconsistency will bite you). Why? Well, first it is the source of your error. If you want the column to be named "a.price", then you need to delimit it since it violates the rules for regular identifiers. But don't - the period (or dot) has a specific meaning / usage and using it in the column name is very, VERY misleading. So just give it an alias without the period.
... (select A.seller_id, A.product_id, min(A.price) as minprice
from ... ) as MinAssoc
As you can see in this snippet, I gave the derived table the alias "MinAssoc" - which is the first thing I mentioned. If you leave it out, you will encounter an error if you just fix the column alias problem.
Next, stop using single letter aliases. That is just lazy. Sure, this is a short example and it is easy to see what your code does NOW. But you are building and reinforcing habits that will not serve you well and it reflects poorly on your work when others see it and need to decipher more complicated queries (because a single letter doesn't provide any help to understanding the "thing" a row represents).
These will fix your errors but you will need to use a different approach to your goal - as already suggested.
You can achieve it using ROW_NUMBER() OVER in either a sub-query or a CTE. The error is due to AS A.price, which should be AS price in your example. The following is the sub-query example:
SELECT *
FROM
(SELECT A.seller_id,
A.product_id,
A.price,
-- rank sellers within each product by price
ROW_NUMBER() OVER (PARTITION BY A.product_id ORDER BY A.price) AS RN
FROM Association A
) AS T
WHERE T.RN <= 2
ORDER BY T.product_id, T.price
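The "two cheapest sellers per product" ranking can be sanity-checked outside the database. Here is a minimal Python sketch, assuming the (seller_id, product_id, price) rows shown in the question's edit; the function name `cheapest_per_product` is hypothetical. The key point it illustrates is that the ranking restarts for each product, not for each seller/product pair.

```python
from collections import defaultdict

# (seller_id, product_id, price) rows taken from the query output in the question
association = [
    (1, 1, 7.89), (3, 1, 8.00), (6, 1, 8.50),
    (1, 2, 12.05), (6, 2, 12.50),
    (1, 3, 13.67), (6, 3, 15.00),
    (1, 4, 7.66), (3, 4, 7.50), (6, 4, 8.24),
]

def cheapest_per_product(rows, n=2):
    """Hypothetical helper: keep the n lowest-priced sellers for each product."""
    by_product = defaultdict(list)
    for seller_id, product_id, price in rows:
        by_product[product_id].append((price, seller_id))
    result = []
    for product_id in sorted(by_product):
        # sorting by price and slicing stands in for ROW_NUMBER() ... WHERE rn <= n
        for price, seller_id in sorted(by_product[product_id])[:n]:
            result.append((product_id, seller_id, price))
    return result

top_two = cheapest_per_product(association)
for row in top_two:
    print(row)
```

Products 1 and 4 each have three sellers in the sample, and only the two lowest prices survive for each.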
I have a table that I use to figure out which sites/shops are due a visit, based on the date of the last visit.
There's one bit of process you need to understand for the requirements:
A visit is documented by the value Task Type = CASH. A review of a visit is shown as Task Type = SALE.
What I need is:
The most recent row in the table for each Asset ID, taken from its SALE or CASH lines only. (Sometimes CASH lines do not occur, but a SALE line is manually populated on the table instead.)
I've included all the columns I would like visible on the final table.
Here's a mock-up of the data. I'm still learning how to use SQLFiddle, and all the links I follow to it from this site end up in an error! :(
TASK_TYPE AVERAGE_REVENUE ASSET_ID POSTING_DATE
SALE 25 A001 01/05/2017
CASH 20 A002 27/04/2017
SALE 20 A003 25/04/2017
TESTING 0 A002 28/04/2017
REPAIR 0 A002 27/04/2017
SALE 22 A004 30/04/2017
CASH 25 A001 22/04/2017
CASH 22 A004 01/05/2017
Here's what i would be expecting from the above example:
TASK_TYPE AVERAGE_REVENUE ASSET_ID POSTING_DATE
SALE 25 A001 01/05/2017
CASH 20 A002 27/04/2017
SALE 20 A003 25/04/2017
CASH 22 A004 01/05/2017
Any examples I've found on Stack Overflow seem to solve part of the problem, but not all of it, and my knowledge isn't strong enough to fill in the gaps.
Any help is much appreciated.
In SQL server, you can team up row_number with top 1 with ties to find latest rows:
select top 1
with ties *
from your_table
where task_type in ('SALE', 'CASH')
order by row_number() over (
partition by asset_id order by posting_date desc
)
One solution is a LEFT JOIN on the table itself. What this query does is join all relevant rows with all other relevant rows (same ASSET_ID and type cash/sale) if the date of the latter is newer. Then we only retrieve those rows which do not have a row which is newer.
SELECT
A.*
FROM
mytable A LEFT JOIN mytable B ON (A.ASSET_ID = B.ASSET_ID AND
B.TASK_TYPE IN ('SALE','CASH') AND
A.POSTING_DATE < B.POSTING_DATE)
WHERE
A.TASK_TYPE IN ('SALE','CASH') AND
B.ASSET_ID IS NULL
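To see the self-join in action, the sketch below runs the same query against the question's sample data using Python's built-in sqlite3 module. This is an illustration under assumptions: SQLite stands in for SQL Server, the dates are stored as ISO strings (yyyy-mm-dd) so that the < comparison orders them correctly, and an ORDER BY is added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "TASK_TYPE TEXT, AVERAGE_REVENUE INTEGER, ASSET_ID TEXT, POSTING_DATE TEXT)"
)
# Sample rows from the question, with dates rewritten as ISO strings
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", [
    ("SALE", 25, "A001", "2017-05-01"),
    ("CASH", 20, "A002", "2017-04-27"),
    ("SALE", 20, "A003", "2017-04-25"),
    ("TESTING", 0, "A002", "2017-04-28"),
    ("REPAIR", 0, "A002", "2017-04-27"),
    ("SALE", 22, "A004", "2017-04-30"),
    ("CASH", 25, "A001", "2017-04-22"),
    ("CASH", 22, "A004", "2017-05-01"),
])

# Keep only SALE/CASH rows for which no newer SALE/CASH row exists for the same asset
latest = conn.execute("""
    SELECT A.TASK_TYPE, A.AVERAGE_REVENUE, A.ASSET_ID, A.POSTING_DATE
    FROM mytable A
    LEFT JOIN mytable B
      ON A.ASSET_ID = B.ASSET_ID
     AND B.TASK_TYPE IN ('SALE', 'CASH')
     AND A.POSTING_DATE < B.POSTING_DATE
    WHERE A.TASK_TYPE IN ('SALE', 'CASH')
      AND B.ASSET_ID IS NULL
    ORDER BY A.ASSET_ID
""").fetchall()

for row in latest:
    print(row)
```

Note that if two SALE/CASH rows for one asset share the same latest date, this approach returns both of them.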
You might try the following:
SELECT task_type,
average_revenue,
asset_id,
posting_date
FROM my_table first
WHERE task_type IN ('SALE', 'CASH')
AND posting_date = (SELECT MAX(posting_date)
FROM my_table second
WHERE second.task_type IN ('SALE', 'CASH') -- compare against the latest SALE or CASH row, not only rows of the same type
AND second.asset_id = first.asset_id)
ORDER BY asset_id;
A common beef I get when trying to evangelize the benefits of learning freehand SQL to MS Access users is the complexity of creating the effects of a crosstab query in the manner Access does it. I realize that strictly speaking, in SQL it doesn't work that way -- the reason it's possible in Access is because it's handling the rendering of the data.
Specifically, when I have a table with entities, dates and quantities, it's frequent that we want to see a single entity on one line with the dates represented as columns:
This:
entity date qty
------ -------- ---
278700-002 1/1/2016 5
278700-002 2/1/2016 3
278700-002 2/1/2016 8
278700-002 3/1/2016 1
278700-003 2/1/2016 12
Becomes this:
Entity 1/1/16 2/1/16 3/1/16
---------- ------ ------ ------
278700-002 5 11 1
278700-003 12
That said, the common way we've approached this is something similar to this:
with vals as (
select
entity,
case when order_date = '2016-01-01' then qty else 0 end as q16_01,
case when order_date = '2016-02-01' then qty else 0 end as q16_02,
case when order_date = '2016-03-01' then qty else 0 end as q16_03
from mydata
)
select
entity, sum (q16_01) as q16_01, sum (q16_02) as q16_02, sum (q16_03) as q16_03
from vals
group by entity
This is radically oversimplified, but I believe most people will get my meaning.
The main problem with this is not the limit on the number of columns -- the data is typically bounded, and I can make do with a fixed number of date columns -- 36 months, or whatever, depending on the context of the data. My issue is the fact that I have to change the dates every month to make this work.
I had an idea that I could leverage arrays to dynamically assign the quantity to the index of the array, based on the month away from the current date. In this manner, my data would end up looking like this:
Entity Values
---------- ------
278700-002 {5,11,1}
278700-003 {0,12,0}
This would be quite acceptable, as I could manage the rendering of the actual columns within whatever rendering tool I was using (Excel, for example).
The problem is I'm stuck... how do I get from my data to this. If this were Perl, I would loop through the data and do something like this:
foreach my $ref (@data) {
my ($entity, $month_offset, $qty) = @$ref;
$values{$entity}->[$month_offset] += $qty;
}
But this isn't Perl... so far, this is what I have, and now I'm at a mental impasse.
with offset as (
select
entity, order_date, qty,
(extract (year from order_date ) - 2015) * 12 +
extract (month from order_date ) - 9 as month_offset,
array[]::integer[] as values
from mydata
)
select
prod_id, playgrd_dte, -- oh my... how do I load into my array?
from fcst
The "2015" and the "9" are not really hard-coded -- I put them there for simplicity sake for this example.
Also, if my approach or my assumptions are totally off, I trust someone will set me straight.
As with all things imaginable and unimaginable, there is a way to do this with PostgreSQL. It looks like this:
WITH cte AS (
WITH minmax AS (
SELECT min(extract(month from order_date))::int,
max(extract(month from order_date))::int
FROM mytable
)
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
UNION ALL -- plain UNION would drop duplicate (entity, month, qty) rows and undercount
SELECT entity, extract(month from order_date)::int, qty FROM mytable
)
SELECT entity, array_agg(sum ORDER BY mon) AS values
FROM (
SELECT entity, mon, sum(qty) FROM cte
GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
A few words of explanation:
The standard way to produce an array inside a SQL statement is to use the array_agg() function. Your problem is that you have months without data and then array_agg() happily produces nothing, leaving you with arrays of unequal length and no information on where in the time period the data comes from. You can solve this by adding 0's for every combination of 'entity' and the months in the period of interest. That is what this snippet of code does:
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
All those 0's are UNIONed to the actual data from 'mytable' and then (in the main query) you can first sum up the quantities by entity and month and subsequently aggregate those sums into an array for each entity. Since it is a double aggregation you need the sub-query. (You could also sum the quantities in the UNION but then you would also need a sub-query because UNIONs don't allow aggregation.)
The minmax CTE can be adjusted to include the year as well (your sample data doesn't need it). Do note that the actual min and max values are immaterial to the index in the array: if min is 743 it will still occupy the first position in the array; those values are only used for GROUPing, not indexing.
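The zero-fill-then-aggregate idea is the same thing the Perl loop in the question does. For comparison, here is a minimal Python sketch of it, assuming the sample rows from the question; the helper names `month_index` and `monthly_arrays` are hypothetical. It also shows the point about indexing: the array position comes from the distance to the earliest month, not from the month value itself.

```python
from datetime import date

# Sample rows from the question: (entity, order_date, qty)
data = [
    ("278700-002", date(2016, 1, 1), 5),
    ("278700-002", date(2016, 2, 1), 3),
    ("278700-002", date(2016, 2, 1), 8),
    ("278700-002", date(2016, 3, 1), 1),
    ("278700-003", date(2016, 2, 1), 12),
]

def month_index(d):
    """Months since year 0, so consecutive months differ by exactly 1."""
    return d.year * 12 + d.month

def monthly_arrays(rows):
    """Hypothetical helper: one zero-filled list of monthly totals per entity."""
    first = min(month_index(d) for _, d, _ in rows)
    last = max(month_index(d) for _, d, _ in rows)
    width = last - first + 1
    values = {}
    for entity, d, qty in rows:
        # Every entity starts with zeros for all months in the period
        slots = values.setdefault(entity, [0] * width)
        slots[month_index(d) - first] += qty
    return values

arrays = monthly_arrays(data)
for entity, values in sorted(arrays.items()):
    print(entity, values)
```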
For ease of use you could wrap this query up in a SQL language function with parameters for the starting and ending month. Adjust the minmax CTE to produce appropriate min and max values for the generate_series() call and in the UNION filter the rows from 'mytable' to be considered.
I have a basic SQL Server delete script that goes:
Delete from tableX
where colA = ? and colB = ?;
In tableX, I do not have any columns indicating sequential IDs or timestamp; just varchar. I want to delete the latest entry that was inserted, and I do not have access to the row number from the insert script. TOP is not an option because it's random. Also, this particular table does not have a primary key, and it's not a matter of poor design. Is there any way I can do this? I recall mysql being able to call something like max(row_number) and also something along the lines of limit one.
ROW_NUMBER exists in SQL Server, too, but it must be used with an OVER (order_by_clause). So... in your case it's impossible for you unless you come up with another sorting algo.
MSDN
Edit: (Examples for George from MSDN ... I'm afraid his company has a Firewall rule that blocks MSDN)
SQL-Code
USE AdventureWorks2012;
GO
SELECT ROW_NUMBER() OVER(ORDER BY SalesYTD DESC) AS Row,
FirstName, LastName, ROUND(SalesYTD,2,1) AS "Sales YTD"
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0;
Output
Row FirstName LastName SalesYTD
--- ----------- ---------------------- -----------------
1 Linda Mitchell 4251368.54
2 Jae Pak 4116871.22
3 Michael Blythe 3763178.17
4 Jillian Carson 3189418.36
5 Ranjit Varkey Chudukatil 3121616.32
6 José Saraiva 2604540.71
7 Shu Ito 2458535.61
8 Tsvi Reiter 2315185.61
9 Rachel Valdez 1827066.71
10 Tete Mensa-Annan 1576562.19
11 David Campbell 1573012.93
12 Garrett Vargas 1453719.46
13 Lynn Tsoflias 1421810.92
14 Pamela Ansman-Wolfe 1352577.13
Returning a subset of rows
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
Using ROW_NUMBER() with PARTITION
USE AdventureWorks2012;
GO
SELECT FirstName, LastName, TerritoryName, ROUND(SalesYTD,2,1),
ROW_NUMBER() OVER(PARTITION BY TerritoryName ORDER BY SalesYTD DESC) AS Row
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0
ORDER BY TerritoryName;
Output
FirstName LastName TerritoryName SalesYTD Row
--------- -------------------- ------------------ ------------ ---
Lynn Tsoflias Australia 1421810.92 1
José Saraiva Canada 2604540.71 1
Garrett Vargas Canada 1453719.46 2
Jillian Carson Central 3189418.36 1
Ranjit Varkey Chudukatil France 3121616.32 1
Rachel Valdez Germany 1827066.71 1
Michael Blythe Northeast 3763178.17 1
Tete Mensa-Annan Northwest 1576562.19 1
David Campbell Northwest 1573012.93 2
Pamela Ansman-Wolfe Northwest 1352577.13 3
Tsvi Reiter Southeast 2315185.61 1
Linda Mitchell Southwest 4251368.54 1
Shu Ito Southwest 2458535.61 2
Jae Pak United Kingdom 4116871.22 1
Your current table design does not allow you to determine the latest entry. You have no field to sort on to indicate which record was added last.
You need to redesign or pull that information from the audit tables. If you have a database without audit tables, you might have to find a tool to read the transaction logs, and it will be a very time-consuming and expensive process. Or if you know the date the records you want to remove were added, you could possibly use a backup from just before this happened to find the records that were added. Just be aware that you might be looking at records changed after this date that you want to keep.
If you need to do this on a regular basis instead of one-time to fix some bad data, then you need to properly design your database to include an identity field and possibly a dateupdated field (maintained through a trigger) or audit tables. (In my opinion no database containing information your company is depending on should be without audit tables, one of the many reasons why you should never allow an ORM to design a database, but I digress.) If you need to know the order records were added to a table, it is your responsibility as the developer to create that structure. Databases only store what they are designed to store; if you didn't design it in, then it is not available easily or at all.
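As an illustration of the identity-field suggestion, here is a minimal sketch using Python's built-in sqlite3 module. It is an assumption-laden example, not the asker's schema: the `id` column is a hypothetical addition to tableX, and SQLite's AUTOINCREMENT stands in for a SQL Server IDENTITY column. Once such a column exists, "the latest entry" for a given colA/colB pair is simply the row with the highest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical redesign: `id` is an added identity column, not part of the original tableX
conn.execute(
    "CREATE TABLE tableX ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, colA TEXT, colB TEXT)"
)
conn.executemany(
    "INSERT INTO tableX (colA, colB) VALUES (?, ?)",
    [("a", "b"), ("x", "y"), ("a", "b")],
)

# Delete only the most recently inserted row that matches colA and colB
conn.execute(
    """
    DELETE FROM tableX
    WHERE id = (SELECT MAX(id) FROM tableX WHERE colA = ? AND colB = ?)
    """,
    ("a", "b"),
)

remaining = conn.execute("SELECT id, colA, colB FROM tableX ORDER BY id").fetchall()
print(remaining)
```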
If (colA +'_'+ colB) cannot be duplicated, try this.
declare @delColumn nvarchar(250)
set @delColumn = (select top 1 DeleteColumn from (
select (colA +'_'+ colB) as DeleteColumn ,
ROW_NUMBER() OVER(ORDER BY colA DESC) as Id from tableX
)b
order by Id desc
)
delete from tableX where (colA +'_'+ colB) =@delColumn
I have an unnormalized table of customer orders. I want to count how many products are sold and display the result in a table by type.
OnlineSalesKey SalesOrderNumber ProductKey
--------------------------------------------
1 20121018 778
2 20121018 774
3 20121018 665
4 20121019 772
5 20121019 778
9 20121019 434
10 20121019 956
11 20121020 772
12 20121020 965
15 20121020 665
16 20121020 778
17 20121021 665
My query:
SELECT
s.ProductKey, COUNT (*) As Purchased
FROM
Sales s
GROUP BY
s.ProductKey
Question #1.
That query does the job. But now I want to display and take into account only those orders where more than one item is purchased. I'm not sure how to do that in one query. Any ideas?
Question #2
Is it possible to normalize the results and get the data back separated by semicolons?
20121018 | 778; 774; 665
Thanks!
You don't say which SQL database you're using, and there will be different, more-or-less efficient answers for each database. (Sorry, just noticed MSSQL is in the question title.)
Here's a solution that will work in all or most databases:
SELECT s.ProductKey, COUNT (*) As Purchased
FROM Sales s
WHERE s.SalesOrderNumber IN
(SELECT SalesOrderNumber FROM Sales GROUP BY SalesOrderNumber HAVING COUNT(*) > 1)
GROUP BY s.ProductKey
This is not the most efficient, but should work across the most products.
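As a quick check of the IN-subquery approach, the sketch below runs it against the question's sample rows using Python's built-in sqlite3 module. SQLite is only a stand-in for SQL Server here, and the column name SalesOrderNumber follows the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales ("
    "OnlineSalesKey INTEGER, SalesOrderNumber INTEGER, ProductKey INTEGER)"
)
# Sample rows from the question
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    (1, 20121018, 778), (2, 20121018, 774), (3, 20121018, 665),
    (4, 20121019, 772), (5, 20121019, 778), (9, 20121019, 434), (10, 20121019, 956),
    (11, 20121020, 772), (12, 20121020, 965), (15, 20121020, 665), (16, 20121020, 778),
    (17, 20121021, 665),
])

# Count products, but only on orders that contain more than one line
purchased = dict(conn.execute("""
    SELECT s.ProductKey, COUNT(*) AS Purchased
    FROM Sales s
    WHERE s.SalesOrderNumber IN (
        SELECT SalesOrderNumber
        FROM Sales
        GROUP BY SalesOrderNumber
        HAVING COUNT(*) > 1)
    GROUP BY s.ProductKey
""").fetchall())

for product_key in sorted(purchased):
    print(product_key, purchased[product_key])
```

Product 665 appears three times in the table, but order 20121021 has a single line, so only two of its purchases are counted.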
Also, please note that you're using the terms normalized and unnormalized in reverse. The table you have is normalized, the results you want are de-normalized.
There is no standard SQL statement to get the de-normalized results you want using SQL alone, but some databases (MySQL and SQLite) provide the group_concat function to do just this.
Q1: Look at HAVING clause
display and take into account only those orders...
SELECT s.SalesOrderNumber, COUNT (*) As Purchased
FROM Sales s
GROUP BY s.SalesOrderNumber
HAVING COUNT(*) > 1
So we group by the orders, apply the condition in HAVING, then display the SalesOrderNumber in the SELECT clause.
Q2: Look at several group concatenation techniques.
MySQL:
SELECT s.SalesOrderNumber, GROUP_CONCAT(DISTINCT ProductKey
ORDER BY ProductKey SEPARATOR '; ')
FROM Sales s
GROUP BY s.SalesOrderNumber
SQL Server: See this answer to a duplicate question. Basically, using FOR XML.
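If the concatenation ends up being done in application code instead, the logic is small. Below is a minimal Python sketch that mirrors the GROUP_CONCAT(DISTINCT ... ORDER BY ... SEPARATOR '; ') call above, assuming the sample rows from the question; the function name `concat_products` is hypothetical.

```python
from collections import defaultdict

# (SalesOrderNumber, ProductKey) pairs from the question's sample table
sales = [
    (20121018, 778), (20121018, 774), (20121018, 665),
    (20121019, 772), (20121019, 778), (20121019, 434), (20121019, 956),
    (20121020, 772), (20121020, 965), (20121020, 665), (20121020, 778),
    (20121021, 665),
]

def concat_products(rows, separator="; "):
    """Hypothetical helper: distinct product keys per order, sorted and joined."""
    products = defaultdict(set)  # a set mirrors DISTINCT
    for order_number, product_key in rows:
        products[order_number].add(product_key)
    return {
        order: separator.join(str(key) for key in sorted(keys))
        for order, keys in products.items()
    }

by_order = concat_products(sales)
for order in sorted(by_order):
    print(order, "|", by_order[order])
```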
SELECT s.ProductKey, COUNT (*) As Purchased
FROM
Sales s
GROUP BY s.ProductKey
having count(*) > 1
EDIT -
Answer 1 - To display products for which orders had more than one product purchased -
SELECT s.ProductKey, COUNT (*) As Purchased
FROM Sales s
WHERE s.SalesOrderNumber in
(
select SalesOrderNumber from Sales
group by SalesOrderNumber having count(*) > 1
)
GROUP BY s.ProductKey
1) Try this one:
SELECT s.ProductKey, COUNT (*) As Purchased
FROM
Sales s
GROUP BY s.ProductKey
HAVING COUNT(*) > 1