Postgresql inner select with distinct - database

I'm using Postgresql 9.2 and have a simple students table as follow
id | proj_id | mark | name | test_date
I have 2 queries which is described below
select * from (select distinct on (proj_id) proj_id , mark, name,
test_date from students )
t
where t.mark <= 1000
VS
select distinct on (proj_id) proj_id , mark, name, test_date from
students where mark <= 1000
when I run each query for more than 10000 records each query returns different result especially result count although for less than 3000 records the result would be the same.
is this postgresql 9.2 bug or I'm missing something ?

Your queries are producing two different sets of results because they are applying the logic differently.
The first query is getting a distinct set of results, and then applying the 'mark' filter.
The second query is applying the 'mark' filter, and then getting a distinct set of results.
As you don't have any ordering applied the first query could potential return a different number of rows each time it is run - as the mark field could contain any of the values that relate to the proj_id.

Related

SQL Selecting all but newest result per id

I need to set a "waived" flag in my table for all but the newest result per id. I thought I had a query that will work here, but when I run a select on the query, I'm getting incorrect results - I saw one case where it selected both of the only two results for a particular id. I'm also getting multiple results with the same exact data.
What am I doing wrong here?
Here's my select statement:
select t.test_row_id, t.test_result_id, t.waived, t.pass, t.comment
from EV.Test_Result
join EV.Test_Result as t on EV.Test_Result.test_row_id = t.test_row_id and EV.Test_Result.start_time < t.start_time and t.device_id = 1219 and t.waived = 0
order by t.test_row_id
Here's the actual query I want to run:
update EV.Test_Result
set waived = 1
from EV.Test_Result
join EV.Test_Result as t on EV.Test_Result.test_row_id = t.test_row_id and EV.Test_Result.start_time < t.start_time and t.device_id = 1219 and t.waived = 0
If I understand this correctly, you are having problems because the Cardinality of the ON predicate returns all matching rows.
EV.Test_Result.test_row_id = t.test_row_id
and EV.Test_Result.start_time < t.start_time
This ON will compare all of the start_time values that have the same id and return every combination of result sets where start_time is lesser than the t.start_time. Clearly, this is not what you want.
and t.device_id = 1219
and t.waived = 0
This is actually a predicate (ON technically is one), but I would prefer to use this in a subquery/CTE for several reasons: You limit the number of rows SQL has to retrieve and compare.
Something like the following might be what you needed:
SELECT A.test_row_id
, A.test_result_id
, A.waived
, A.pass
, A.comment
FROM EV.Test_Result A
INNER JOIN (SELECT MAX(start_time) AS start_time
, test_row_id
FROM EV.Test_Result
WHERE device_id = 1219
AND waived = 0
GROUP BY test_row_id
) AS T ON A.test_row_id = T.test_row_id
AND A.start_time < T.start_time
ORDER BY A.test_row_id
This query then returns a 1:M relationship between the values in the ON predicate, unlike the M:M query you had run.
UPDATE:
Since I sheepishly screwed up trying to alter my Query on SO, I'll redeem myself by explaining the physical and logical orders of basic SQL Query operators:
As you know, you write a simple SELECT statement like the following:
SELECT <aggregate column>, SUM(<non-aggregate column>) AS Cost
FROM <table_name>
WHERE <column> = 'some_value'
GROUP BY <aggregate column>
HAVING SUM(<non-aggregate column>) > some_value
ORDER BY <column>
Note that if you use a aggregate function, all other columns MUST appear in the GROUP BY or another function.
Now, SQL Server requires them to be written in that order although it actually processes this logically by the following order that is worth memorizing:
FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
There are more details found on SELECT - MSDN, but this is why any columns in the SELECT operator must be in the group by or in a aggregate function (SUM, MIN, MAX, etc)...and also why my lazy code failed on your first attempt. :/
Note also that the ORDER BY is last (technically TOP operator occurs after this), and that without it the result is not deterministic unless a function such as DENSE_RANK enforces it (thought this occurs in the SELECT statement).
Hope this helps solve the problem and better yet how SQL works. Cheers
Can you try ROW_NUMBER () function order by timestamp descending and filtering out values having ROW_NUMBER 1 ;
Below query should fetch all records per id except the latest one
I tried below query in Oracle with a table having fields : id,user_id, record_order adn timestamp and it worked :
select
<table_name_alias>.*
from
(
select
id,
user_id,
row_number() over (partition by id order by record_order desc) as record_number
from
<your_table_name>
) <table_name_alias>
where
record_number <>1;
If you are using Teradata DB, you can also try QUALIFY statement. I'm not sure if all DBs support this.
Select
table_name.*
from table_name
QUALIFY row_number() over (partition by id order by record_order desc) <>1;

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there is two separate records for 105. I only want the returned data set to return the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experience, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example: (I can't read your first column name, so I'm calling it JobUnitK
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use RANK function. Rank the rows OVER PARTITION BY UnitId and pick the rows with rank 2 .
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID, ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
ORDER BY UnitID Desc
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case using Min(JobUnitKeyID) in conjunction with UnitID to join back on the UnitID where the JobUnitKeyID <> MinJobUnitKeyID` is required.
Except, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

SSIS - Merging rows with aggregate determinations

I'm attempting to determine the best way, either in SSIS or straight TSQL, to merge two rows based on a given key, but taking specific data from each row based on various aggregate rules (MAX and SUM specifically). As an example, given the following dataset:
Customer Name Total Date Outstanding
12345 A 100 7/15/2015 500
12345 200 1/1/2015 300
456 B 500 1/2/2010 100
456 B 250 2/1/2015 900
78 C 100 9/15/2015 500
I wish to consolidate those to a single row per customer key, with the following rules as an example:
If any name is null, use a corresponding value for that customer that is not null
MAX(Total)
MAX(Date)
SUM(Outstanding)
The result set would be:
Customer Name Total Date Outstanding
12345 A 200 7/15/2015 800
456 B 500 2/1/2015 1000
78 C 100 9/15/2015 500
What's the best approach here? My first instinct is to query the table to join to itself on customer to get all values on a single row, and then use formulas in a Derived Column task in SSIS to determine the values to use. My concern there is that is not scalable - it works fine if I have a customer occur only twice in the main dataset, but the goal would be for the logic to work for N number of rows without needing to do a ton of rework. I'm sure there's also a TSQL approach that I'm missing here. Any help would be appreciated.
If name column in your query is not empty then you can do that simply by using aggregate function in one query
DECLARE #Customer TABLE
(
Customer INT, Name varchar(10), Total INT , PurchaseDate DATE , Outstanding INT
)
INSERT INTO #Customer
SELECT 12345,'A',100,'7/15/2015',500 UNION
SELECT 12345,'A',200,'1/1/2015',300 UNION
SELECT 456,'B',500,'1/2/2010',100 UNION
SELECT 456,'B',250,'2/1/2015',900 UNION
SELECT 78,'C',100,'9/15/2015',500
SELECT Customer,NAME ,MAX(Total), MAX(PurchaseDate), SUM(outstanding)
FROM #Customer
GROUP BY Customer, NAME
Demo
Now, if your name column is empty in few cases like you have mentioned in your example then you can update name table with correct name value

MS Access : Average and Total Calculation in Single Query

INTRODUCTION TO DATABASE TABLE BEING USED -
I am working on a “Stock Market Prices” based Database Table. My table has got the data for the following FIELDS –
ID
SYMBOL
OPEN
HIGH
LOW
CLOSE
VOLUME
VOLUME CHANGE
VOLUME CHANGE %
OPEN_INT
SECTOR
TIMESTAMP
New data gets added to the table daily “Monday to Friday”, based on the stock market price changes for that day. The current requirement is based on the VOLUME field, which shows the volume traded for a particular stock on daily basis.
REQUIREMENT –
To get the Average and Total Volume for last 10,15 and 30 Days respectively.
METHOD USED CURRENTLY -
I created these 9 SEPARATE QUERIES in order to get my desired results –
First I have created these 3 queries to take out the most recent last 10,15 and 30 dates from the current table:
qryLast10DaysStored
qryLast15DaysStored
qryLast30DaysStored
Then I have created these 3 queries for getting the respective AVERAGES:
qrySymbolAvgVolume10Days
qrySymbolAvgVolume15Days
qrySymbolAvgVolume30Days
And then I have created these 3 queries for getting the respective TOTALS:
qrySymbolTotalVolume10Days
qrySymbolTotalVolume15Days
qrySymbolTotalVolume30Days
PROBLEM BEING FACED WITH CURRENT METHOD -
Now, my problem is that I have ended up having these so many different queries, whereas I wanted to get the output into One Single Query, as shown in the Snapshot of the Excel Sheet:
http://i49.tinypic.com/256tgcp.png
SOLUTION NEEDED -
Is there some way by which I can get these required fields into ONE SINGLE QUERY, so that I do not have to look into multiple places for the required fields? Can someone please tell me how to get all these separate queries into one -
A) Either by taking out or moving the results from these separate individual queries to one.
B) Or by making a new query which calculates all these fields within itself, so that these separate individual queries are no longer needed. This would be a better solution I think.
One Clarification about Dates –
Some friend might think why I used the method of using Top 10,15 and 30 for getting the last 10,15 and 30 Date Values. Why not I just used the PC Date for getting these values? Or used something like -
("VOLUME","tbl-B", "TimeStamp BETWEEN Date() - 10 AND Date()")
The answer is that I require my query to "Read" the date from the "TIMESTAMP" Field, and then perform its calculations accordingly for LAST / MOST RECENT "10 days, 15 days, 30 days” FOR WHICH THE DATA IS AVAILABLE IN THE TABLE, WITHOUT BOTHERING WHAT THE CURRENT DATE IS. It should not depend upon the current date in any way.
If there is any better method or more efficient way to create these queries, then please enlighten.
You have separate queries to compute 10DayTotalVolume and 10DayAvgVolume. I suspect you can compute both in one query, qry10DayVolumes.
SELECT
b.SYMBOL,
Sum(b.VOLUME) AS 10DayTotalVolume,
Avg(b.VOLUME) AS 10DayAvgVolume
FROM
[tbl-B] AS b INNER JOIN
qryLast10DaysStored AS q
ON b.TIMESTAMP = q.TIMESTAMP
GROUP BY b.SYMBOL;
However, that makes me wonder whether 10DayAvgVolume can ever be anything other than 10DayTotalVolume / 10
Similar considerations apply to the 15 and 30 day values.
Ultimately, I think you want something based on a starting point like this:
SELECT
q10.SYMBOL,
q10.[10DayTotalVolume],
q10.[10DayAvgVolume],
q15.[15DayTotalVolume],
q15.[15DayAvgVolume],
q30.[30DayTotalVolume],
q30.[30DayAvgVolume]
FROM
(qry10DayVolumes AS q10
INNER JOIN qry15DayVolumes AS q15
ON q10.SYMBOL = q15.SYMBOL)
INNER JOIN qry30DayVolumes AS q30
ON q10.SYMBOL = q30.SYMBOL;
That assumes you have created qry15DayVolumes and qry30DayVolumes following the approach I suggested for qry10DayVolumes.
If you want to cut down the number of queries, you could use subqueries for each of the qry??DayVolumes saved queries, but try it this way first to make sure the logic is correct.
In that second query above, there can be a problem due to field names which start with digits. Enclose those names in square brackets or re-alias them in qry10DayVolumes, qry15DayVolumes, and qry30DayVolumes using alias names which begin with letters instead of digits.
I tested the query as written above with the "2nd Upload.mdb" you uploaded, and it ran without error from Access 2007. Here is the first row of the result set from that query:
SYMBOL 10DayTotalVolume 10DayAvgVolume 15DayTotalVolume 15DayAvgVolume 30DayTotalVolume 30DayAvgVolume
ACC-1 42909 4290.9 54892 3659.46666666667 89669 2988.96666666667
Access doesn't support most advanced SQL syntax and clauses, so this is a bit of a hack, but it works, and is fast on your small sample. You're basically running 3 queries but the Union clauses allow you to combine into one:
select
Symbol,
sum([10DayTotalVol]) as 10DayTotalV,
sum([10DayAvgVol]) as 10DayAvgV,
sum([15DayTotalVol]) as 15DayTotalV,
sum([15DayAvgVol]) as 15DayAvgV,
sum([30DayTotalVol]) as 30DayTotalV,
sum([30DayAvgVol]) as 30DayAvgV
from (
select
Symbol,
sum(volume) as 10DayTotalVol, avg(volume) as 10DayAvgVol,
0 as 15DayTotalVol, 0 as 15DayAvgVol,
0 as 30DayTotalVol, 0 as 30DayAvgVol
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 10 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
sum(volume), avg(volume),
0, 0
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 15 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
0, 0,
sum(volume), avg(volume)
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 30 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
) s
group by
Symbol

Performant way to get the maximum value of a running total in TSQL

We have a table of transactions which is structured like the following :
TranxID int (PK and Identity field)
ItemID int
TranxDate datetime
TranxAmt money
TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by. Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred. Note that TranxDate is not unique, and due to some backdating the ID field is not necessarily in the same sequence as TranxDate for a given Item.
Currently we're doing something like this (#tblTranx is a table variable containing just the transactions for a given Item) :
SELECT Top 1 #HighestTotal = z.TotalToDate, #DateHighest = z.TranxDate
FROM
(SELECT a.TranxDate, a.TranxID, Sum(b.TranxAmt) AS TotalToDate
FROM #tblTranx AS a
INNER JOIN #tblTranx AS b ON a.TranxDate >= b.TranxDate
GROUP BY a.TranxDate, a.TranxID) AS z
ORDER BY z.TotalToDate DESC
(The TranxID grouping removes the issue caused by duplicate date values)
This, for one Item, gives us the HighestTotal and the TranxDate when this occurred. Rather than run this on the fly for tens of thousands of entries, we only calculate this value when the app updates the relevant entry and record the value in another table for use in reporting.
The question is, can this be done in a better way so that we can work out these values on the fly (for multiple items at once) without falling into the RBAR trap (some ItemIDs have hundreds of entries). If so, could this then be adapted to get the highest values of subsets of transactions (based on a TransactionTypeID not included above). I'm currently doing this with SQL Server 2000, but SQL Server 2008 will be taking over soon here so any SQL Server tricks can be used.
SQL Server sucks in calculating running totals.
Here's a solution for your very query (which groups by dates):
WITH q AS
(
SELECT TranxDate, SUM(TranxAmt) AS TranxSum
FROM t_transaction
GROUP BY
TranxDate
),
m (TranxDate, TranxSum) AS
(
SELECT MIN(TranxDate), SUM(TranxAmt)
FROM (
SELECT TOP 1 WITH TIES *
FROM t_transaction
ORDER BY
TranxDate
) q
UNION ALL
SELECT DATEADD(day, 1, m.TranxDate),
m.TranxSum + q.TranxSum
FROM m
CROSS APPLY
(
SELECT TranxSum
FROM q
WHERE q.TranxDate = DATEADD(day, 1, m.TranxDate)
) q
WHERE m.TranxDate <= GETDATE()
)
SELECT TOP 1 *
FROM m
ORDER BY
TranxSum DESC
OPTION (MAXRECURSION 0)
You need to have an index on TranxDate for this to work fast.

Resources