PostgreSQL Crosstab - variable number of columns - arrays

A common beef I get when trying to evangelize the benefits of learning freehand SQL to MS Access users is the complexity of reproducing the effect of a crosstab query the way Access does it. I realize that, strictly speaking, SQL doesn't work that way -- the reason it's possible in Access is that Access handles the rendering of the data.
Specifically, when I have a table with entities, dates and quantities, it's frequent that we want to see a single entity on one line with the dates represented as columns:
This:
entity      date      qty
----------  --------  ---
278700-002  1/1/2016    5
278700-002  2/1/2016    3
278700-002  2/1/2016    8
278700-002  3/1/2016    1
278700-003  2/1/2016   12
Becomes this:
Entity      1/1/16  2/1/16  3/1/16
----------  ------  ------  ------
278700-002       5      11       1
278700-003              12
That said, the common way we've approached this is something similar to this:
with vals as (
    select
        entity,
        case when order_date = '2016-01-01' then qty else 0 end as q16_01,
        case when order_date = '2016-02-01' then qty else 0 end as q16_02,
        case when order_date = '2016-03-01' then qty else 0 end as q16_03
    from mydata
)
select
    entity, sum (q16_01) as q16_01, sum (q16_02) as q16_02, sum (q16_03) as q16_03
from vals
group by entity
This is radically oversimplified, but I believe most people will get my meaning.
The main problem with this is not the limit on the number of columns -- the data is typically bounded, and I can make do with a fixed number of date columns -- 36 months, or whatever, depending on the context of the data. My issue is that I have to change the dates every month to make this work.
I had an idea that I could leverage arrays to dynamically assign the quantity to the index of the array, based on the month away from the current date. In this manner, my data would end up looking like this:
Entity Values
---------- ------
278700-002 {5,11,1}
278700-003 {0,12,0}
This would be quite acceptable, as I could manage the rendering of the actual columns within whatever rendering tool I was using (Excel, for example).
The problem is I'm stuck... how do I get from my data to this? If this were Perl, I would loop through the data and do something like this:
foreach my $ref (@data) {
    my ($entity, $month_offset, $qty) = @$ref;
    $values{$entity}->[$month_offset] += $qty;
}
But this isn't Perl... so far, this is what I have, and now I'm at a mental impasse.
with offsets as (
    select
        entity, order_date, qty,
        (extract (year from order_date) - 2015) * 12 +
        extract (month from order_date) - 9 as month_offset,
        array[]::integer[] as values
    from mydata
)
select
    entity, order_date -- oh my... how do I load into my array?
from offsets
The "2015" and the "9" are not really hard-coded -- I put them there for simplicity sake for this example.
Also, if my approach or my assumptions are totally off, I trust someone will set me straight.

As with all things imaginable and unimaginable, there is a way to do this with PostgreSQL. It looks like this:
WITH cte AS (
    WITH minmax AS (
        SELECT min(extract(month from order_date))::int,
               max(extract(month from order_date))::int
        FROM mytable
    )
    SELECT entity, mon, 0 AS qty
    FROM (SELECT DISTINCT entity FROM mytable) entities,
         (SELECT generate_series(min, max) AS mon FROM minmax) allmonths
    UNION
    SELECT entity, extract(month from order_date)::int, qty FROM mytable
)
SELECT entity, array_agg(sum ORDER BY mon) AS values
FROM (
    SELECT entity, mon, sum(qty) FROM cte
    GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
A few words of explanation:
The standard way to produce an array inside a SQL statement is to use the array_agg() function. Your problem is that you have months without data and then array_agg() happily produces nothing, leaving you with arrays of unequal length and no information on where in the time period the data comes from. You can solve this by adding 0's for every combination of 'entity' and the months in the period of interest. That is what this snippet of code does:
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
All those 0's are UNIONed to the actual data from 'mytable' and then (in the main query) you can first sum up the quantities by entity and month and subsequently aggregate those sums into an array for each entity. Since it is a double aggregation you need the sub-query. (You could also sum the quantities in the UNION but then you would also need a sub-query because UNIONs don't allow aggregation.)
The minmax CTE can be adjusted to include the year as well (your sample data doesn't need it). Do note that the actual min and max values are immaterial to the index in the array: if min is 743, that month's sum will still occupy the first position in the array; the month values determine grouping and ordering, not the array index.
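A quick, throwaway illustration of that positional behaviour (not part of the solution itself):

SELECT array_agg(mon ORDER BY mon) FROM generate_series(743, 745) AS mon;
-- returns {743,744,745}: three elements at array positions 1..3, whatever the month numbers are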
For ease of use you could wrap this query up in a SQL language function with parameters for the starting and ending month. Adjust the minmax CTE to produce appropriate min and max values for the generate_series() call and in the UNION filter the rows from 'mytable' to be considered.
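A rough sketch of what such a wrapper could look like, reusing the table and column names from above (the function name, the parameter names, and the assumption that entity is text are all illustrative, not something prescribed by the answer):

CREATE FUNCTION monthly_qty_arrays(p_min int, p_max int)
RETURNS TABLE (entity text, qty_values integer[])
LANGUAGE sql AS $$
WITH cte AS (
    -- zero rows for every entity/month combination in the requested range
    SELECT entity, mon, 0 AS qty
    FROM (SELECT DISTINCT entity FROM mytable) entities,
         generate_series(p_min, p_max) AS mon
    UNION
    -- actual data, filtered to the requested range
    SELECT entity, extract(month from order_date)::int, qty
    FROM mytable
    WHERE extract(month from order_date)::int BETWEEN p_min AND p_max
)
SELECT entity, array_agg(monthly_sum ORDER BY mon) AS qty_values
FROM (
    SELECT entity, mon, sum(qty) AS monthly_sum
    FROM cte
    GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
$$;

-- example call for months 1 through 3:
-- SELECT * FROM monthly_qty_arrays(1, 3);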

Related

How to use the datebucket filter

Trying to use the :datebucket filter but it doesn't seem to work.
select date, address from database.table where address = 'xyz' group by :datebucket(date)
This returns the error that date isn't in the group by statement, but it is. If I add it separately to the group by statement, it just groups by the individual date instead of respecting the date bucket selection.
I'm not finding anything in the Snowflake documentation about how this filter is supposed to work, just that it exists.
On this site, https://www.webagesolutions.com/blog/querying-data-in-snowflake, there is an example like this of the datebucket function:
SELECT COUNT(ORDER_DATE) as COUNT_ORDER_DATE, ORDER_DATE
FROM ORDERS
GROUP BY :datebucket(ORDER_DATE), ORDER_DATE
ORDER BY COUNT_ORDER_DATE DESC;
So could your query work if it was modified like this:
SELECT
date,
address
FROM
database.table
WHERE
address = 'xyz'
GROUP BY :datebucket(date), date
Datebucket truncates the date into buckets, but you have selected the raw date.
This is like grouping by decade ('60s, '70s, '80s -- what great years) but wanting the actual year.
SELECT column1 as year,
truncate(year,-1) as decade
FROM VALUES (1),(2),(3),(14),(15),(16),(27),(28),(29);
gives:
YEAR  DECADE
   1       0
   2       0
   3       0
  14      10
  15      10
  16      10
  27      20
  28      20
  29      20
so if I try to select:
SELECT column1 as year
FROM VALUES (1),(2),(3),(14),(15),(16),(27),(28),(29)
GROUP BY truncate(year,-1)
ORDER BY 1;
gives the error
Error: 'VALUES.COLUMN1' in select clause is neither an aggregate nor in the group by clause. (line 15)
So if we move the decade into the selection, it makes sense:
SELECT truncate(column1,-1) as decade
FROM VALUES (1),(2),(3),(14),(15),(16),(27),(28),(29)
GROUP BY decade
ORDER BY 1;
and we get:
DECADE
0
10
20
So the problem is not :datebucket(date) but the fact that, while :datebucket(date) and date are related, from the perspective of GROUPING they are unrelated.
I've been trying to use datebucket(date) and daterange, and I also needed the results in a Snowflake graph.
It was a bit tricky, because the value returned by :datebucket(date) is actually a truncated date based on the selected date part. To handle that, I had to convert it to a char, and it worked!
select
to_char(:datebucket(start_time), 'YYYY.MM.DD # HH24') as start_time_bucket,
sum(credits_used) as credits_used
from snowflake.account_usage.warehouse_metering_history wmh
where
start_time = :daterange
group by :datebucket(start_time)
And if you're an ACCOUNTADMIN, you can now use the query to get the total credits usage by date :)
Lastly, to answer Tony's main question, the query should be:
select date, address
from database.table
where address = 'xyz'
group by :datebucket(date), date, address
// or
select :datebucket(date), address
from database.table
where address = 'xyz'
group by :datebucket(date), address
Try adding the :datebucket(date) in the select part as well (not only in the group by). Also, you will probably need an aggregate function for the field address (for example, any_value(address)):
select :datebucket(date), any_value(address)
from database.table
where address = 'xyz'
group by :datebucket(date)

how to always return rows in the same order

I have a table (call it "DimMonth") that I often want to select a subset of successive rows from by some numeric column (say "Month"). I always specify the min / max row in this subset, as well as the number of rows in the subset. DimMonth.Month is an integer that represents year and month (in format YYYYMM), with values like 202001, 202012, 202103, etc. There are no keys or indexes defined for the table (although, Month is a foreign key to other tables). What is the best way to go about selecting this subset of rows?
For example, say @month = 202103, and that I want to select it and the 3 months before it. So I expect a result like:
202103
202102
202101
202012
As far as I know, due to order of execution, even though the following solution works sometimes, I can't rely upon it to work all the time:
SELECT TOP 4
    Month
FROM
    dbo.DimMonth
WHERE
    Month <= @month
ORDER BY
    Month DESC
....since SELECT is executed before ORDER BY.
A solution which I know works but is tedious to write every time (and is costly for the CTE, since the result set will grow over time) is:
WITH
all_months_before_desired_month AS
(
    SELECT
        Month,
        ROW_NUMBER() OVER(
            ORDER BY
                Month DESC
        ) AS RowNum
    FROM
        dbo.DimMonth
    WHERE
        Month <= @month
)
SELECT
    Month
FROM
    all_months_before_desired_month
WHERE
    RowNum BETWEEN 1 AND 4
;
I think the right answer here is to define a key or an index (so that I can use my first solution, but without the ORDER BY), but I'm not sure.
The ORDER BY is applied before TOP picks its rows, so your first query is reliable.
It does not matter that ORDER BY is written at the end.
If you have no ORDER BY, then the results are in an unpredictable order, which is also subject to change.
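If you prefer not to rely on TOP, an equivalent formulation (SQL Server 2012 and later; shown here only as an alternative sketch) is OFFSET ... FETCH, which likewise requires the ORDER BY:

SELECT Month
FROM dbo.DimMonth
WHERE Month <= @month
ORDER BY Month DESC
OFFSET 0 ROWS FETCH NEXT 4 ROWS ONLY;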

Measure all distances return shortest value

I have one table with a list of stores, approximately 100 or so with lat/long. The second table I have a list of customers, with lat/long and has more than 500k.
I need to find the closest store to each customer. Currently I am using the geography data type with the STDistance function to calculate the distance between two points. This is functioning fine, but I am getting hung up on the most efficient ways to process this.
Option #1 - Cartesian join Customer_table to Store_table, compute the distance, rank the results and filter to #1. My concern is that with a 1 million row customer list and 100 stores, you are creating a 100 million row intermediate result, and the rank function on top of that may be taxing.
Option #2 - With some dynamic sql, create a pivoted table that has each customer in the first column, and each subsequent column has the calculated distance to each branch. From there, I can unpivot and then do the same rank/over function described in the first.
EXAMPLE
CUST_ID  LAT    LONG   STORE1DIST  STORE2DIST  STORE3DIST
1        20.00  30.00  4.5         5.6         7.8
2        20.00  30.00  7.4         8.1         8.5
I'm not clear which would be the most efficient, and which will keep the DBAs from wanting to come find me.
Thanks for the input in advance!
You can unpivot the data into multiple rows (one per store distance), then keep the row with the minimum STOREDIST for each customer.
select CUST_ID, STOREDIST as StoreDistance, STORES as StoreName
from
(
    select CUST_ID, STOREDIST, STORES,
        ROW_NUMBER() over (partition by CUST_ID order by STOREDIST) as rn
    from
        (select CUST_ID, LAT, LONG, STORE1DIST, STORE2DIST, STORE3DIST from Cus /*Your table*/) p
        UNPIVOT
        (
            STOREDIST FOR STORES IN (STORE1DIST, STORE2DIST, STORE3DIST)
        ) as unpvt
) ranked
where rn = 1
This will give you the result as:
CUST_ID StoreDistance StoreName
-----------------------------------
1 4.5 STORE1DIST
2 7.4 STORE1DIST
I have a similar situation at my job. I use a distance function like this (it returns km; use 3960* to return miles):
CREATE FUNCTION dbo.MySTDistance(@lat1 float, @lon1 float, @lat2 float, @lon2 float)
RETURNS smallmoney
AS
BEGIN
    RETURN IsNull(6373*acos((sin(radians(@lat1))*sin(radians(@lat2)))
        +(cos(radians(@lat1))*cos(radians(@lat2))*cos(radians(@lon1-@lon2)))),0)
END
then you look for the closest store by doing something like...
select C.Cust_Id
,Store_id=
(select top (1) Store_id
from Store_Table S
order by dbo.MySTDistance(S.lat, S.long, C.lat, C.long)
)
from Customer_Table C
Now you have each customer id with his closest store id. It's quite fast with a huge volume of customers (at least in my case).
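If you also want the distance itself in the result set, a CROSS APPLY variant of the same idea (just a sketch, assuming SQL Server 2005+ and the same table and function names) would be:

select C.Cust_Id,
       nearest.Store_id,
       nearest.Dist
from Customer_Table C
cross apply
    (select top (1)
            S.Store_id,
            dbo.MySTDistance(S.lat, S.long, C.lat, C.long) as Dist
     from Store_Table S
     order by Dist
    ) as nearest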

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to include the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows with OVER (PARTITION BY UnitId ...) and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
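A minimal sketch of that approach (assuming the JobUnits table and the JobUnitKeyID ordering column used elsewhere in this thread) might look like:

WITH Ranked AS (
    SELECT *,
           RANK() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS rnk
    FROM JobUnits
)
SELECT *
FROM Ranked
WHERE rnk = 2;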
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
    SELECT
        DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
        *
    FROM JobUnits
    WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
ORDER BY UnitID DESC;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case you would need Min(JobUnitKeyID) in conjunction with UnitID, joining back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Either way, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.
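For comparison, the join-back version that the first point alludes to might look roughly like this (a sketch only, reusing the JobUnits column names from above):

SELECT j.*
FROM JobUnits j
INNER JOIN (
    SELECT UnitID, Min(JobUnitKeyID) AS MinJobUnitKeyID
    FROM JobUnits
    WHERE DispatchDate = '20151004'
    GROUP BY UnitID
    HAVING COUNT(*) > 1
) d
    ON j.UnitID = d.UnitID
   AND j.JobUnitKeyID <> d.MinJobUnitKeyID
WHERE j.DispatchDate = '20151004';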

MS Access : Average and Total Calculation in Single Query

INTRODUCTION TO DATABASE TABLE BEING USED -
I am working on a “Stock Market Prices” database table. My table has data for the following FIELDS –
ID
SYMBOL
OPEN
HIGH
LOW
CLOSE
VOLUME
VOLUME CHANGE
VOLUME CHANGE %
OPEN_INT
SECTOR
TIMESTAMP
New data gets added to the table daily, Monday to Friday, based on the stock market price changes for that day. The current requirement is based on the VOLUME field, which shows the volume traded for a particular stock on a daily basis.
REQUIREMENT –
To get the Average and Total Volume for the last 10, 15 and 30 days respectively.
METHOD USED CURRENTLY -
I created these 9 SEPARATE QUERIES in order to get my desired results –
First I created these 3 queries to pull out the most recent 10, 15 and 30 dates from the current table:
qryLast10DaysStored
qryLast15DaysStored
qryLast30DaysStored
Then I have created these 3 queries for getting the respective AVERAGES:
qrySymbolAvgVolume10Days
qrySymbolAvgVolume15Days
qrySymbolAvgVolume30Days
And then I have created these 3 queries for getting the respective TOTALS:
qrySymbolTotalVolume10Days
qrySymbolTotalVolume15Days
qrySymbolTotalVolume30Days
PROBLEM BEING FACED WITH CURRENT METHOD -
Now, my problem is that I have ended up with so many different queries, whereas I wanted to get the output in one single query, as shown in the snapshot of the Excel sheet:
http://i49.tinypic.com/256tgcp.png
SOLUTION NEEDED -
Is there some way by which I can get these required fields into ONE SINGLE QUERY, so that I do not have to look into multiple places for the required fields? Can someone please tell me how to get all these separate queries into one -
A) Either by taking out or moving the results from these separate individual queries to one.
B) Or by making a new query which calculates all these fields within itself, so that these separate individual queries are no longer needed. This would be a better solution I think.
One Clarification about Dates –
Some might wonder why I used the method of taking the Top 10, 15 and 30 records to get the last 10, 15 and 30 date values. Why did I not just use the PC date for getting these values? Or use something like -
("VOLUME","tbl-B", "TimeStamp BETWEEN Date() - 10 AND Date()")
The answer is that I require my query to "Read" the date from the "TIMESTAMP" Field, and then perform its calculations accordingly for LAST / MOST RECENT "10 days, 15 days, 30 days” FOR WHICH THE DATA IS AVAILABLE IN THE TABLE, WITHOUT BOTHERING WHAT THE CURRENT DATE IS. It should not depend upon the current date in any way.
If there is any better method or more efficient way to create these queries, then please enlighten me.
You have separate queries to compute 10DayTotalVolume and 10DayAvgVolume. I suspect you can compute both in one query, qry10DayVolumes.
SELECT
b.SYMBOL,
Sum(b.VOLUME) AS 10DayTotalVolume,
Avg(b.VOLUME) AS 10DayAvgVolume
FROM
[tbl-B] AS b INNER JOIN
qryLast10DaysStored AS q
ON b.TIMESTAMP = q.TIMESTAMP
GROUP BY b.SYMBOL;
However, that makes me wonder whether 10DayAvgVolume can ever be anything other than 10DayTotalVolume / 10.
Similar considerations apply to the 15 and 30 day values.
Ultimately, I think you want something based on a starting point like this:
SELECT
q10.SYMBOL,
q10.[10DayTotalVolume],
q10.[10DayAvgVolume],
q15.[15DayTotalVolume],
q15.[15DayAvgVolume],
q30.[30DayTotalVolume],
q30.[30DayAvgVolume]
FROM
(qry10DayVolumes AS q10
INNER JOIN qry15DayVolumes AS q15
ON q10.SYMBOL = q15.SYMBOL)
INNER JOIN qry30DayVolumes AS q30
ON q10.SYMBOL = q30.SYMBOL;
That assumes you have created qry15DayVolumes and qry30DayVolumes following the approach I suggested for qry10DayVolumes.
If you want to cut down the number of queries, you could use subqueries for each of the qry??DayVolumes saved queries, but try it this way first to make sure the logic is correct.
In that second query above, there can be a problem due to field names which start with digits. Enclose those names in square brackets or re-alias them in qry10DayVolumes, qry15DayVolumes, and qry30DayVolumes using alias names which begin with letters instead of digits.
I tested the query as written above with the "2nd Upload.mdb" you uploaded, and it ran without error from Access 2007. Here is the first row of the result set from that query:
SYMBOL 10DayTotalVolume 10DayAvgVolume 15DayTotalVolume 15DayAvgVolume 30DayTotalVolume 30DayAvgVolume
ACC-1 42909 4290.9 54892 3659.46666666667 89669 2988.96666666667
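If you do later decide to cut down the number of saved queries by inlining them, a sketch of qry10DayVolumes written with a subquery instead of qryLast10DaysStored could look like this (assuming qryLast10DaysStored simply returns the 10 most recent TIMESTAMP values, and using letter-prefixed aliases as suggested above):

SELECT
    b.SYMBOL,
    Sum(b.VOLUME) AS Total10Day,
    Avg(b.VOLUME) AS Avg10Day
FROM
    [tbl-B] AS b INNER JOIN
    (SELECT DISTINCT TOP 10 TIMESTAMP FROM [tbl-B] ORDER BY TIMESTAMP DESC) AS q
    ON b.TIMESTAMP = q.TIMESTAMP
GROUP BY b.SYMBOL;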
Access doesn't support most advanced SQL syntax and clauses, so this is a bit of a hack, but it works, and it is fast on your small sample. You're basically running 3 queries, but the UNION clauses allow you to combine them into one:
select
Symbol,
sum([10DayTotalVol]) as 10DayTotalV,
sum([10DayAvgVol]) as 10DayAvgV,
sum([15DayTotalVol]) as 15DayTotalV,
sum([15DayAvgVol]) as 15DayAvgV,
sum([30DayTotalVol]) as 30DayTotalV,
sum([30DayAvgVol]) as 30DayAvgV
from (
select
Symbol,
sum(volume) as 10DayTotalVol, avg(volume) as 10DayAvgVol,
0 as 15DayTotalVol, 0 as 15DayAvgVol,
0 as 30DayTotalVol, 0 as 30DayAvgVol
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 10 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
sum(volume), avg(volume),
0, 0
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 15 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
0, 0,
sum(volume), avg(volume)
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 30 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
) s
group by
Symbol
