SQL query with even distribution of samples - sql-server

Is there a way to query SQL to get an even distribution of samples? For example, if one of my fields is a State field, I want to query the top 5000 results with 100 from each state. Or, as another example, if I have a field that says whether a client is a new client or an existing client, I want the top 500 results where 250 are new clients and 250 are existing clients.
I am trying to avoid running two different queries and combining the results manually.

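You can do this by using ROW_NUMBER. You partition your data on one or more columns, so the row numbering starts from 1 in every partition. You then select the top x rows and ORDER BY the row number column. Because every partition's row 1 sorts ahead of any row 2, the rows taken by TOP are spread across the partitions.
For example: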
WITH cte
AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY StateName ORDER BY NEWID() ) AS RN
FROM dbo.Sales
)
SELECT TOP 5 *
FROM cte
ORDER BY RN;
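If the goal is a fixed number of rows per group rather than an evenly spread overall sample, filtering on the row number may be a closer fit than TOP. A minimal sketch, reusing dbo.Sales and StateName from the example above; the cap of 100 is taken from the question, and a state with fewer than 100 rows simply contributes every row it has:

```sql
WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY StateName ORDER BY NEWID()) AS RN
    FROM dbo.Sales
)
SELECT *
FROM cte
WHERE RN <= 100          -- at most 100 randomly chosen rows per state
ORDER BY StateName, RN;
```

For the new/existing client case, the same pattern would presumably partition by the client-type column (its name is not given in the question) and filter on RN <= 250.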

Related

How to identify which column(s) have different value in SQL Server

I have a table with more than 100 columns. Normally contract_id should be unique in this table, but sometimes there are duplicate values. I use this SQL statement to retrieve data from the table:
select distinct contract_id, col1, col2,...colM
from the_table;
but I still found duplicate contract_id values. I know some values must differ in one or more columns, which is why I see duplicate contract_id values even though I use distinct. Is there a way to find all the columns that have different values? There are lots of fields and only a few columns differ, so it is difficult to compare each column one by one manually.
Try something along these lines:
SELECT contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(contract_id)>1;
or
WITH NumberedRows AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY contract_id ORDER BY(SELECT NULL)) AS RowNumber
,*
FROM the_table
)
SELECT *
FROM NumberedRows
WHERE RowNumber>1;
The first will show you all the contract_id values that occur at least twice; the second will show you all the rows you might want to manipulate (delete/change).
Note: I used SELECT NULL in the ORDER BY of the OVER() clause, which leaves the numbering within each contract_id arbitrary. It is very important to use a fitting ORDER BY clause here, because it decides which row gets the number 1 and which rows get higher numbers and therefore show up in the result because of the >1 filter.
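To see which columns actually differ for a duplicated contract_id, one option is to count the distinct values of each column per contract_id. This is only a sketch: col1 and col2 are the placeholder names from the question, the *_differs aliases are made up for illustration, and one CASE expression would be needed per column to check. Note that COUNT(DISTINCT ...) ignores NULLs, so a NULL next to a non-NULL value is not flagged as a difference here.

```sql
-- col1, col2 stand in for the real column names (placeholders from the question).
SELECT contract_id,
       CASE WHEN COUNT(DISTINCT col1) > 1 THEN 1 ELSE 0 END AS col1_differs,
       CASE WHEN COUNT(DISTINCT col2) > 1 THEN 1 ELSE 0 END AS col2_differs
FROM the_table
GROUP BY contract_id
HAVING COUNT(*) > 1;     -- only contract_id values that occur more than once
```

Each result row is one duplicated contract_id, with a 1 in every column flag where the duplicates disagree.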

Using Top in T-SQL

A question on using Top. For example, we have this SQL statement:
SELECT TOP (5) WITH TIES orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
It orders the returned rows by orderdate first and then selects the top five rows.
But doesn't the ORDER BY clause happen after the SELECT clause, which would mean that five rows in random order are returned first and then those five rows are ordered by orderdate?
The order of the clauses in the statement doesn't reflect the actual order of operations that SQL follows. See this article, which shows the order to be:
from
where
group by
having
select
order by
limit
As you can see, the TOP operation (limit) is the last to be executed.
The question already has an accepted answer, but I would like to quote the Microsoft documentation.
Logical Processing Order of the SELECT statement
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
But doesn't the ORDER BY clause happen after the SELECT clause, which
would mean that five rows in random order are returned first and then
those five rows are ordered by orderdate?
No. ORDER BY is processed after the SELECT, but limiting the result set to 5 rows happens even later.
The physical details of actual query processing may vary, but the end result is as if the server sorted the whole table by orderdate, picked the top 5 rows (or more if needed due to ties), returned those rows and discarded the rest.
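The effect can be checked on a small table. The following is a sketch with a made-up temp table, #Orders, and invented dates; it is not the Sales.Orders table from the question:

```sql
-- Hypothetical sample data, for illustration only.
CREATE TABLE #Orders (orderid int NOT NULL, orderdate date NOT NULL);

INSERT INTO #Orders (orderid, orderdate)
VALUES (1, '20240101'), (2, '20240103'), (3, '20240102'),
       (4, '20240103'), (5, '20240102');

-- Sorted by orderdate DESC the rows are: 2 and 4 (Jan 3), 3 and 5 (Jan 2), then 1.
-- TOP (3) is applied to that sorted set, and WITH TIES adds the row that ties
-- with the third one, so four rows come back: orderids 2, 4, 3 and 5
-- (the order within each date is not guaranteed).
SELECT TOP (3) WITH TIES orderid, orderdate
FROM #Orders
ORDER BY orderdate DESC;

DROP TABLE #Orders;
```

If TOP were applied before the sort, orderid 1 could appear in the result; it never does, because it sorts last.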

Retrieving X rows from an ordered CTE, TOP vs Range

Objective:
I want to know which performs better when retrieving a fixed number of rows from a CTE that is already ordered.
Example:
Say I have a CTE (intentionally simplified) that looks like this, and I only want the top 5 rows:
WITH cte
AS (
SELECT Id = RANK() OVER (ORDER BY t.ActionID asc)
, t.Name
FROM tblSample AS t -- tblSample is indexed on Id
)
Which is faster:
SELECT TOP 5 * FROM cte
OR
SELECT * FROM cte WHERE Id BETWEEN 1 AND 5 ?
Notes:
I am not a DB programmer, so to me the TOP solution seems better, as
once SQL Server finds the 5th row it will stop executing and "return"
(100% assumption), while with the other method I feel it will
unnecessarily process the whole CTE.
My question is about a CTE; would the answer be the same if it were a table?
The most important thing to note is that the two queries will not always produce the same result set. Consider the following data:
CREATE TABLE #tblSample (ActionId int not null, name varchar(10) not null);
INSERT #tblSample VALUES (1,'aaa'),(2,'bbb'),(3,'ccc');
Both of these will produce the same result:
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT TOP(2) * FROM CTE;
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;
Now let's do this update:
UPDATE #tblSample SET ActionId = 1;
After this update the first query still returns two rows, but the second query returns three, because RANK() now assigns 1 to every row. Keep in mind too that, without an ORDER BY in the TOP query, the results are not guaranteed because there is no default order in SQL.
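If the two forms need to return the same number of rows even when ActionId has ties, one option is to number the rows with ROW_NUMBER() instead of RANK(). This is a sketch of that swap, not part of the original question; it assumes #tblSample as it stands after the UPDATE above, and which of the tied rows get numbers 1 and 2 is arbitrary:

```sql
-- Assumes #tblSample from above, after every ActionId has been set to 1.
WITH CTE AS
(
    SELECT id = ROW_NUMBER() OVER (ORDER BY t.ActionId ASC), t.name
    FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;   -- two rows, even with ties on ActionId

WITH CTE AS
(
    SELECT id = ROW_NUMBER() OVER (ORDER BY t.ActionId ASC), t.name
    FROM #tblSample t
)
SELECT TOP (2) * FROM CTE ORDER BY id;        -- also two rows, with an explicit order
```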
With that out of the way - which performs better? It depends. It depends on your indexing, your statistics, number of rows, and the execution plan that the SQL Engine goes with.
TOP 5 selects any 5 rows, in whatever order the chosen index delivers them, whereas Id BETWEEN 1 AND 5 fetches data based on the Id column; whether that is an index seek or a scan depends on the selected columns. These are two different queries. The 'Id between' query might be slow if you do not have an index on Id.
Let me try to explain with an example...
Consider this is your data..
create index nci_name on yourcte(id) include(name)
--drop index nci_name on yourcte
;with cte as (
select * from yourcte )
select top 5 * from cte
;with cte as (
select * from yourcte )
select * from cte where id between 1 and 5
First I create an index on id with name included. With the index in place, the second query does an index seek, while the first does an index scan and takes the top 5, so in this case the second approach is better.
See the execution plan:
Now I remove the index by executing:
--drop index nci_name on yourtable
Now both approaches do a table scan.
Looking at the two table scans, the first reads only 5 rows, while the second reads 10 rows and then applies the predicate.
See execution plan properties for first plan
The second approach reads 10 rows.
Now the first approach is better.
In your case the index needs to be on ActionId, which determines the id.
Hence performance depends on how you index your base table.
In order to get the RANK() which you are calculating in your cte it must sort all the data by t.ActionID. Sorting is a blocking operation: the entire input must be processed before a single row is output.
So in this case whether you select any five rows, or if you take the five that sorted to the top of the pile is probably irrelevant.

TSQL (SQL Server) Sorting and paging with row_number

I have a database (SQL Server) and an app which fetches data and converts it into JSON.
I wrote a T-SQL query to order data by the userid column (DESC) and take only the first 10 rows, but it returns the wrong results.
For example, if I have the following table:
UserID
---
User1
User2
User3
...
User10
..
User25
I want UserID to be sorted DESC and to get the first ten results (then the second ten results, etc.). Simply put, I am looking for a substitute for MySQL's LIMIT in SQL Server.
My query
SELECT * FROM
(SELECT
system_users_ranks.RankName,
system_users.userid,
system_users.UserName,
system_users.Email,
system_users.LastIP,
system_users.LastLoginDate,
row_number() OVER (ORDER BY system_users.userid) as myrownum
FROM
system_users
INNER JOIN
system_users_ranks
ON system_users.UserRank = system_users_ranks.rankid
) as dertbl
WHERE myrownum BETWEEN #startval AND #endval
ORDER BY userid DESC
I can't move ORDER BY to inner SELECT.
You don't need it in the inner SELECT.
ROW_NUMBER has its own ORDER BY, and the final presentation is defined by the outermost ORDER BY anyway.
Your current query will work just fine.
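One thing worth checking is which rows end up on each page. Numbering by userid ascending and then displaying descending means the first page holds the ten lowest userid values, shown in reverse. If the intent is for the first page to hold the highest values, the ORDER BY inside ROW_NUMBER would need to be DESC as well. A sketch under that assumption; @startval and @endval are assumed to be integer parameters supplied by the app, and the table aliases u and r are added for brevity:

```sql
-- @startval / @endval are assumed parameters; 1 and 10 select the first page.
DECLARE @startval int = 1, @endval int = 10;

SELECT *
FROM (
    SELECT r.RankName, u.userid, u.UserName, u.Email, u.LastIP, u.LastLoginDate,
           ROW_NUMBER() OVER (ORDER BY u.userid DESC) AS myrownum  -- number in display order
    FROM system_users AS u
    INNER JOIN system_users_ranks AS r
        ON u.UserRank = r.rankid
) AS dertbl
WHERE myrownum BETWEEN @startval AND @endval
ORDER BY myrownum;
```

If userid is a character column holding values like User2 and User10, it sorts as text, so User10 comes before User2; that would also make the pages look wrong regardless of the paging logic.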

Variation on Select top n

Is it possible to do a variation of select top n rows that selects the top n rows starting at a row other than 0?
My (mobile) app has limited resources and no server-side caching available. The maximum number of rows returned is 100. I get the first 100 with select top 100. I would then like the user to be able to request rows 101-200 and so on. The database data is static and the re-query time is negligible.
Platform: SQL Server 2008
Here's an article which demonstrates such queries using the ROW_NUMBER function.
;With CTETable AS
(
SELECT ROW_NUMBER() OVER (ORDER BY Column_Name DESC) AS ROW_NUM, * FROM TABLENAME WHERE <CONDITION>
)
SELECT Column_List FROM CTETable WHERE ROW_NUM BETWEEN <StartNum> AND <EndNum>
Set <StartNum> and <EndNum> to any range you want, for example 123 to 147.
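For fixed-size pages, the start and end numbers can be derived from a page number. The sketch below uses hypothetical names (dbo.Items, ItemId, ItemName, @PageNumber, @PageSize), none of which come from the question; with page 2 and a page size of 100 it returns rows 101 to 200:

```sql
-- Hypothetical table and column names, for illustration only.
DECLARE @PageNumber int = 2, @PageSize int = 100;

WITH Numbered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY ItemId) AS RowNum, ItemId, ItemName
    FROM dbo.Items
)
SELECT ItemId, ItemName
FROM Numbered
WHERE RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1   -- first row of the page
                 AND @PageNumber * @PageSize             -- last row of the page
ORDER BY RowNum;
```

This form works on SQL Server 2008; the OFFSET ... FETCH clause is an alternative from SQL Server 2012 onward.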
