Select additional column of row matched by `percentile_disc` - database

Say I want to select the value of the 50th percentile of a table in Postgres, which works fine like this:
SELECT percentile_disc(0.5) within group (order by value) FROM foo;
But now, I want to know the value of another column, say, created_at for the same row that was matched:
SELECT created_at, percentile_disc(0.5) within group (order by value) FROM foo;
But this raises an error:
column "foo.created_at" must appear in the GROUP BY clause or be used in an aggregate function.
In theory, Postgres should be able to know which created_at I'm talking about, since percentile_disc refers to one row at most. But I can't see a way to reference that value in the select query. Is it possible?

You cannot do this with the aggregate function directly, but you can use it in a subselect like this:
SELECT f.created_at, f.value
FROM foo f
WHERE f.value = (SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY f2.value) FROM foo f2);
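One caveat: if several rows share the percentile value, the subselect above returns all of them. A minimal sketch that returns at most one row, assuming ties should go to the earliest created_at (that tie-break is an assumption, not something stated in the question):

```sql
-- Sketch only: picks a single row when several rows share the median value.
-- The tie-break on created_at is an assumption; choose whatever fits your data.
SELECT f.created_at, f.value
FROM foo f
WHERE f.value = (SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY f2.value)
                 FROM foo f2)
ORDER BY f.created_at
LIMIT 1;
```

Without the ORDER BY, LIMIT 1 would still return one row, but which of the tied rows you get would be arbitrary.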

Related

How to identify which column(s) have different value in SQL Server

I have a table with more than 100 columns. Normally contract_id should be unique in this table, but sometimes there are duplicate values. I use this SQL statement to retrieve data from the table:
select distinct contract_id, col1, col2,...colM
from the_table;
but I still found duplicate contract_id values. Since I use distinct, I know some values must differ in the same column(s) for those rows. Is there a way to find all the columns that have different values? There are lots of fields and only a few columns differ, so it is difficult to compare each column one by one manually.
Try something along these lines:
SELECT contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(contract_id)>1;
or
WITH NumberedRows AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY contract_id ORDER BY(SELECT NULL)) AS RowNumber
,*
FROM the_table
)
SELECT *
FROM NumberedRows
WHERE RowNumber>1;
The first will show you all the contract_id values that occur at least twice; the second will show you all the rows you might want to manipulate (delete/change).
Attention: I used SELECT NULL in the ORDER BY of the OVER() clause. It is very important to use a fitting ORDER BY clause here, because it determines which row gets the number 1 and which rows get higher numbers and therefore show up in the result due to RowNumber>1.
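Neither query says which columns differ. A rough sketch for that part, assuming col1 and col2 stand in for the real column names as in the question: count the distinct values of each column per contract_id, and any count above 1 marks a column that differs within that contract_id.

```sql
-- Sketch only: col1 and col2 are placeholders taken from the question.
SELECT contract_id,
       COUNT(DISTINCT col1) AS col1_variants,
       COUNT(DISTINCT col2) AS col2_variants
       -- repeat for the remaining columns
FROM the_table
GROUP BY contract_id
HAVING COUNT(*) > 1;
```

Note that COUNT(DISTINCT ...) ignores NULLs, so a column that is NULL in one row and filled in another will still report 1.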

T-SQL: aggregate function for calculating Nth percentile

I am trying to calculate the Nth percentile of all of the values in a single column in a table. All I want is a scalar, aggregate value below which N percent of the values fall. For instance, if the table has 100 rows where the value is the same as the row index plus one (1 to 100 consecutively), then I'd want this value to tell me that 95% of the values are below 95.
The PERCENTILE_CONT analytic function looks closest to what I want. But if I try to use it like this:
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
I get one row per row in the table, all with the same value. I could use TOP 1 to just give me one of those rows, but now I've done an additional table scan.
I am not trying to create a whiz-bang table of results partitioned by some other column in the original table. I just want an aggregate, scalar value.
Edit: I have been able to use PERCENTILE_CONT in a query with a WHERE clause. For example:
DECLARE @P95 INT
SELECT TOP 1 @P95 = (PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER ())
FROM ExampleTable
WHERE LOWER(Color) = 'blue'
SELECT @P95
Including the WHERE clause gives a different result than I got without it.
From what I can tell, you will need to use a subquery here. For example, to find the number of records strictly below the 95th percentile, we can try:
WITH cte AS (
SELECT ValueColumn,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM yourTable
)
SELECT COUNT(*)
FROM cte
WHERE ValueColumn < P95;
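If all you need is the scalar itself, one commonly used workaround is to keep the window function and collapse the identical rows with DISTINCT. A minimal sketch, reusing ExampleTable and ValueColumn from the question:

```sql
-- The window function yields the same P95 on every row; DISTINCT collapses them to one.
SELECT DISTINCT
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM ExampleTable;
```

This does not avoid reading the table, since the percentile depends on every value, but it returns a single row without TOP 1.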

SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

I've been using the following inherited query where I'm trying to delete duplicate rows and I'm getting some unexpected results when first running it as a SELECT - I believe it has something to do with my lack of understanding of the Partition part of the statement:
WITH CTE AS(
SELECT [Id],
[Url],
[Identifier],
[Name],
[Entity],
[DOB],
RN = ROW_NUMBER()OVER(PARTITION BY Name ORDER BY Name)
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what I'm doing with the Partition BY Name part of this? This doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that it's looking for records where all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
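To treat rows as duplicates only when all the compared fields match, every one of them has to be listed in PARTITION BY. A sketch, assuming the five fields meant in the question are Url, Identifier, Name, Entity, and DOB, and that it does not matter which copy survives:

```sql
WITH CTE AS (
    SELECT [Id], [Url], [Identifier], [Name], [Entity], [DOB],
           RN = ROW_NUMBER() OVER (
                    PARTITION BY [Url], [Identifier], [Name], [Entity], [DOB]
                    ORDER BY (SELECT NULL))  -- arbitrary order: assumes any copy may be kept
    FROM Data.[Statistics]
    WHERE [Id] = 2170
)
-- Preview the rows that would be removed; switch to
-- DELETE FROM CTE WHERE RN > 1 once the result looks right.
SELECT * FROM CTE WHERE RN > 1;
```

If one particular copy should be kept, replace the ORDER BY with a column that ranks that copy first.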

In T-SQL how to select only the top(not max) value in a group of record

I have some sample data as follows
Name Value Timestamp
a 23 2016/12/23 11:23
a 43 2016/12/23 12:55
b 12 2016/12/23 12:55
I want to select the latest value for a and b. When I used Last_Value, I used the following query
Select Name, Last_Value(Value) over (partition by Name order by timestamp) from table
This returned 2 rows for a, but I wanted it grouped so that I get only the last entered value for each name. So I had to use sub queries.
select x.Name, x.Value from (Select Name, Last_Value(Value) over (partition by Name order by timestamp) as Value from table) as x group by x.Name, x.Value
This again returns 2 records for a. I just wanted to do a group by and order by and, instead of selecting the max(), select the top record.
Can anybody tell me how to solve this problem?
One method doesn't use window functions:
select t.*
from table t
where t.timestamp = (select max(t2.timestamp) from table t2 where t2.name = t.name);
Otherwise, the subquery method is fine, although I would often use row_number() and conditional aggregation rather than last_value() (or first_value() with a descending order by).
Unfortunately, SQL Server does not support first_value() or last_value() as an aggregation function, only as a window function.
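For reference, Last_Value in the question returns a different value per row because a window with an ORDER BY defaults to a frame that ends at the current row, so each row sees itself as the last one. A minimal sketch of the row_number() approach, filtering on the row number rather than using conditional aggregation, and assuming the table is called SampleData (a hypothetical name; the question only says table):

```sql
SELECT x.Name, x.Value, x.[Timestamp]
FROM (
    SELECT t.Name, t.Value, t.[Timestamp],
           -- number the rows of each Name from newest to oldest
           ROW_NUMBER() OVER (PARTITION BY t.Name ORDER BY t.[Timestamp] DESC) AS rn
    FROM SampleData AS t   -- hypothetical table name
) AS x
WHERE x.rn = 1;            -- keep only the newest row per Name
```

With the three sample rows from the question, this should return a with 43 and b with 12.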

advantage of select query in from clause

What is the advantage of using a select in the from clause over a normal select? For example:
select name, age, address from emp where empid = 12;
What is the advantage of the query below over the one above?
select A.name, A.age, A.address from (select * from emp where empid = 12) as A;
The inner query creates a temporary view, and the fields in the outer query are selected from that result, right? But the first query in this question can also be used to get the same result.
What is the advantage? Thanks.
One way this technique can be used is to derive results in the inner query that you don't want presented in the outer query. A simple example: I want to see the oldest person in each household. Here is one way to do it:
SELECT name, age, address FROM
(
SELECT name, age, address,
rn = ROW_NUMBER() OVER (PARTITION BY address ORDER BY age DESC)
FROM dbo.emp
) AS x
WHERE rn = 1;
This way the derived column (a ranking, essentially, of household members by age, oldest first) does not need to be a column in the result set (and this also makes it convenient for filtering).
And I'm sure there are plenty of others (such as not having to repeat elaborate expressions); this was just the first one that came to mind.
In the case you posted in your question, I don't see an advantage at all. If you post a real example, where someone said there was an advantage, we might be able to explain why (or at least why they may have thought that).
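The point about not repeating elaborate expressions can be sketched as follows. The age bands are made up for illustration; only dbo.emp and its columns come from the example above:

```sql
SELECT x.name, x.age_band
FROM (
    SELECT name,
           CASE WHEN age < 30 THEN 'under 30'
                WHEN age < 60 THEN '30 to 59'
                ELSE '60 and over'
           END AS age_band          -- hypothetical derived column
    FROM dbo.emp
) AS x
WHERE x.age_band <> '60 and over';  -- the CASE expression is written only once
```

Without the derived table, the WHERE clause would have to repeat the whole CASE expression, because a column alias is not visible in the WHERE clause of the same query level.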
