Query to select rows where JSONB arrays have variable lengths

I have a table (t) containing a column (c) of JSONB objects, each of which contains an array (a) whose length varies between 1 and 10 (inclusive). I need to build a query that selects 1,000 rows from t: 100 random rows for each possible length of a. What would be the most concise way to write this query? My current query structure looks like this:
WITH length_1 AS (
    SELECT *
    FROM t
    WHERE JSONB_ARRAY_LENGTH(c -> 'a') = 1
    ORDER BY RANDOM()
    LIMIT 100
),
length_2 AS (
    SELECT *
    FROM t
    WHERE JSONB_ARRAY_LENGTH(c -> 'a') = 2
    ORDER BY RANDOM()
    LIMIT 100
)
...
SELECT *
FROM length_1
UNION
SELECT *
FROM length_2
...

You can use a window function to label each row with a row number within its partition, then use an outer select to limit to 100 for each partition.
select *
from (
    select t.*,
           row_number() over (partition by jsonb_array_length(c -> 'a') order by random()) as rn
    from t
) foo
where rn <= 100;
The two levels are needed because you can't use a window function in a WHERE or a HAVING.
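As a quick sanity check (a sketch reusing the same table and column names as above), you can confirm that each array length contributes at most 100 rows:
select jsonb_array_length(c -> 'a') as len, count(*)
from (
    select t.*, row_number() over (partition by jsonb_array_length(c -> 'a') order by random()) as rn
    from t
) foo
where rn <= 100
group by len
order by len;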

Related

Show records in three tables with a definite number of rows in each table in an SSRS report

I have a table in a report, and I want to show the records in three tables on every page, each table containing only 20 records (so page 1 shows the first 60 rows across three tables, page 2 the next 60, and so on).
How can I achieve this type of pattern?
I can think of 2 ways to do this, as a MATRIX style report where the column group is your columns, and as a normal table where you JOIN the data to produce 3 copies of name, ID, and any other fields you want. The MATRIX style is definitely more elegant and flexible, but the normal table might be easier for customers to modify if you're turning the report over to power users.
Both solutions start with tagging the data with PAGE, ROW, and COLUMN information. Note that I'm sorting on NAME, but you could sort on any field. Also note that this solution does not depend on your ID being sequential and in the order you want; it generates its own sequence numbers based on NAME or whatever else you choose.
In this demo I'm setting RowsPerPage and NumberOfColumns as hard-coded constants, but they could easily be user-selected parameters if you use the MATRIX format.
DECLARE @RowsPerPage INT = 20
DECLARE @Cols INT = 3
;WITH
--Fake data generation BEGIN
cteSampleSize AS (
    SELECT TOP 70 ROW_NUMBER() OVER (ORDER BY O.name) AS ID
    FROM sys.objects AS O
),
cteFakeData AS (
    SELECT N.ID,
           CONCAT(CHAR(65 + N.ID / 26), CHAR(65 + ((N.ID - 1) % 26))
           --, CHAR(65 + ((N.ID) % 26))
           ) AS Name
    FROM cteSampleSize AS N
),
--Fake data generation END, real processing begins below
cteNumbered AS ( -- We can't count on ID being sequential and in the order we want!
    SELECT D.*, ROW_NUMBER() OVER (ORDER BY D.Name) AS SeqNum
           --Replace ORDER BY D.Name with ORDER BY D.{Whatever field}
    FROM cteFakeData AS D --Replace cteFakeData with your real data source
),
ctePaged AS (
    SELECT D.*
         , 1 + FLOOR((D.SeqNum - 1) / (@RowsPerPage * @Cols)) AS PageNum
         , 1 + ((D.SeqNum - 1) % @RowsPerPage) AS RowNum
         , 1 + FLOOR(((D.SeqNum - 1) % (@RowsPerPage * @Cols)) / @RowsPerPage) AS ColNum
    FROM cteNumbered AS D
)
--FINAL - use this for MATRIX reports (best)
SELECT * FROM ctePaged ORDER BY SeqNum
--FINAL - use this for MATRIX reports (best)
SELECT * FROM ctePaged ORDER BY SeqNum
If you want to use the JOIN method to allow this in a normal table, replace the --FINAL query above with this one. Note that it's pretty finicky, so test it with several degrees of fullness in the final report. I tested with 70 and 90 rows of sample data, so that I saw both a partial first column, and a full first column with a partial second.
--FINAL - use this for TABLE reports (simpler)
SELECT C1.PageNum, C1.RowNum, C1.ID AS C1_ID, C1.Name AS C1_Name
     , C2.ID AS C2_ID, C2.Name AS C2_Name
     , C3.ID AS C3_ID, C3.Name AS C3_Name
FROM ctePaged AS C1
LEFT OUTER JOIN ctePaged AS C2
    ON C1.PageNum = C2.PageNum AND C1.RowNum = C2.RowNum
   AND C1.ColNum = 1 AND (C2.ColNum = 2 OR C2.ColNum IS NULL)
LEFT OUTER JOIN ctePaged AS C3
    ON C1.PageNum = C3.PageNum AND C1.RowNum = C3.RowNum
   AND (C3.ColNum = 3 OR C3.ColNum IS NULL)
WHERE C1.ColNum = 1
1) Add a dataset with the query below to get the page number and table number. You can change the numbers 20 and 60 as required. In my case I need 20 records per section and have 3 sections, so the total number of records per page is 60.
Select *, (ROW_NUMBER() OVER (partition by PageNumber order by Id) - 1) / 20 AS TableNumber
from (
    Select (ROW_NUMBER() OVER (order by Id) - 1) / 60 AS PageNumber, *
    from Numbers
) Src
2) Add a table with one column and select the prepared dataset.
3) Add PageNumber to the Group expression for the Details group.
4) Add a column parent group by right-clicking on the detail row; select Group by TableNumber.
5) Delete the first two rows (select "Delete rows only").
6) Add one more table and select the ID and Name.
7) Drag this newly created table into the cell of the previously created table, and increase the size of the table.
Result:
Each table section contains 20 records, and the pattern continues on the following pages.

Get random data from SQL Server without performance impact

I need to select random rows from my SQL table. When I searched for this, the common suggestion was ORDER BY NEWID(), but that hurts performance. Since my table has more than 2,000,000 rows, that solution does not suit me.
I tried this code to get random data :
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also performs poorly at times.
Could you please suggest a good way to get random data from my table? I need a minimum number of rows per request, around 30. I tried TABLESAMPLE, but it returns nothing once I add my WHERE condition, because it samples on the basis of pages, not rows.
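For reference, my TABLESAMPLE attempt looked roughly like this (a reconstructed sketch; the filter column is hypothetical):
SELECT TOP 30 *
FROM Table1 TABLESAMPLE (1000 ROWS)
WHERE SomeColumn = 'x' -- once this filters the sampled pages, nothing may be left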
Try to calculate the random ids before filtering your big table.
Since your key is not an identity column, you need to number the records, and this will affect performance.
Note that I have used DISTINCT to be sure to get different numbers.
EDIT: I have modified the query to use an arbitrary filter on your big table.
declare @n int = 30
;with
t as (
    -- EXTRACT DATA AND NUMBER ROWS
    select *, ROW_NUMBER() over (order by YourPrimaryKey) n
    from YourBigTable t
    -- SOME FILTER
    WHERE 1 = 1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
    -- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
    select distinct top (@n) abs(CHECKSUM(NEWID()) % t.n) + 1 rnd
    from sysobjects s
    cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;

Generate a row range based on "FromRange" "ToRange" field of each row

I have a table with the following fields:
DailyWork(ID, WorkerID, FromHour, ToHour). Assume that all of the fields are of type INT.
This table needs to be expanded in a T-SQL statement to be part of a JOIN.
By expanding a row, I mean generating an hour for each number in the range FromHour to ToHour, and then joining it with the rest of the statement.
Example:
Assume I have another table like this: Worker(ID, Name). A simple SELECT statement would be like this:
SELECT * FROM
Worker JOIN DailyWork ON Worker.ID = DailyWork.WorkerID
The result has columns similar to this: WorkerID, Name, DailyWorkID, WorkerID, FromHour, ToHour
But, what i need, has columns like this: WorkerID, Name, Hour.
In fact, the range from FromHour to ToHour is expanded, and each individual hour is placed in a separate row, in the Hour column.
I read a similar question about generating a range of numbers, but it didn't really help.
If you start with a list of numbers, then this is pretty easy. Often, the table master.spt_values is used for this purpose:
with nums as (
    select row_number() over (order by (select null)) - 1 as n
    from master.spt_values
)
select dw.*, (dw.fromhour + nums.n) as specifichour
from dailywork dw
join nums on dw.tohour >= dw.fromhour + nums.n;
The table master.spt_values generally has a few thousand rows at least.
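If you ever need more numbers than spt_values provides, a common workaround (my own sketch, not part of the original answer) is to cross join the table with itself before numbering:
with nums as (
    select row_number() over (order by (select null)) - 1 as n
    from master.spt_values a
    cross join master.spt_values b
)
select dw.*, (dw.fromhour + nums.n) as specifichour
from dailywork dw
join nums on dw.tohour >= dw.fromhour + nums.n;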
Another solution would be...
WITH [DayHours] AS (
    SELECT 1 AS [DayHour]
    UNION ALL
    SELECT [DayHour] + 1 FROM [DayHours] WHERE [DayHour] + 1 <= 24
)
SELECT [DailyWork].*, [DayHours].[DayHour]
FROM [DailyWork]
JOIN [DayHours] ON [DailyWork].[FromHour] <= [DayHours].[DayHour]
              AND [DailyWork].[ToHour] >= [DayHours].[DayHour]
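As a quick check of the expansion (my own throwaway demo with hypothetical inline data), a row with FromHour = 9 and ToHour = 11 produces three rows with hours 9, 10, and 11:
WITH [DayHours] AS (
    SELECT 1 AS [DayHour]
    UNION ALL
    SELECT [DayHour] + 1 FROM [DayHours] WHERE [DayHour] + 1 <= 24
)
SELECT d.ID, h.[DayHour] AS [Hour]
FROM (VALUES (1, 9, 11)) AS d(ID, FromHour, ToHour)
JOIN [DayHours] AS h
    ON d.FromHour <= h.[DayHour]
   AND h.[DayHour] <= d.ToHour;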

Transact-SQL query performance advice

I have 2 SELECT queries. The first selects the rows for grid paging (25 rows per page), using a TOP @pagesize*@pagenum EXCEPT TOP @pagesize*(@pagenum-1) construction. The second returns the total row count.
So, will a WITH ... AS construction improve performance compared with two separate queries, and if so, why? Note that the query contains multiple columns, INNER JOINs, and WHERE conditions.
The WITH part by itself does not help performance; it makes the query easier to understand.
If I understand correctly, the count is over all the rows. You could save the second call by fetching all the rows to the client, but in most cases that will be more expensive.
Using TOP X to get only the last X/Y rows is a bad idea. You should instead number the rows and select those whose number falls in the desired range:
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY [Something] DESC) AS [RowNumber], ...
    FROM ...
) AS numbered
WHERE [RowNumber] BETWEEN 10 AND 20
Alternatively, if you use ORDER BY ... OFFSET instead of TOP ... ORDER BY, you can use COUNT(*) OVER () to get the total row count regardless of paging. Otherwise you have to isolate your data using WITH (as you do) and derive the paging from a second CTE, along with anything else you need (row number, page, total pages, and so on).
Example without OFFSET:
DECLARE @page INT = 1, @rows INT = 5
;WITH data AS (
    SELECT * FROM mytable WHERE id = 454545 --possible filters
),
rows ([page], [pages], [rows]) AS (
    SELECT @page, CEILING(CAST(COUNT(*) AS float) / @rows), COUNT(*) FROM data
)
SELECT TOP (@rows) *
FROM (
    SELECT row_number() OVER (ORDER BY data.id) rowNumber, * FROM rows, data
) pagination
WHERE rowNumber > (@page - 1) * @rows
ORDER BY rowNumber
Example with OFFSET:
DECLARE @page INT = 1, @rows INT = 5
SELECT row_number() OVER (ORDER BY id) rowNumber,
       @page [Page],
       CEILING(CAST(COUNT(*) OVER () AS float) / @rows) [Pages],
       COUNT(*) OVER () [Rows],
       *
FROM mytable
WHERE id = 454545 --possible filters
ORDER BY rowNumber
OFFSET (@page - 1) * @rows ROWS
FETCH NEXT @rows ROWS ONLY
In both cases, make sure the ORDER BY is deterministic; otherwise your paging is not guaranteed.
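For example (a sketch, assuming id is a unique key as in the queries above), append the key as a tie-breaker so rows with equal sort values cannot shuffle between pages:
ORDER BY [Something] DESC, id
OFFSET (@page - 1) * @rows ROWS
FETCH NEXT @rows ROWS ONLY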

How do I exclude outliers from an aggregate query?

I'm creating a report comparing total time and volume across units. Here's a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
You can exclude the top and bottom x percentiles with NTILE
SELECT m.Unit,
       COUNT(*) AS Count,
       SUM(m.TimeInMinutes) AS TotalTime
FROM
    (SELECT
        m.Unit,
        m.TimeInMinutes, -- must be passed through for the outer SUM
        NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
    FROM
        main_table m
    WHERE
        m.unit <> '' AND m.TimeInMinutes > 0
    ) m
WHERE
    Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
One way would be to exclude the outliers with a not in clause:
where m.ID not in
(
    select top 5 percent ID
    from main_table
    order by TimeInMinutes desc
)
And another not in clause for the bottom five percent.
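Putting both halves together (a sketch, assuming ID is the table's key as above):
where m.ID not in
(
    select top 5 percent ID
    from main_table
    order by TimeInMinutes desc
)
and m.ID not in
(
    select top 5 percent ID
    from main_table
    order by TimeInMinutes asc
)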
NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows instead of the central 90%. The suggestion to use TOP 95 PERCENT and then reverse with TOP 90 PERCENT is almost correct, except that 90% x 95% gives you only 85.5% of the original dataset. So you would have to do:
select top 94.7368 percent *
from (
    select top 95 percent *
    from main_table
    order by TimeInMinutes ASC
) X
order by TimeInMinutes DESC
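To see NTILE's unevenness concretely, here is a throwaway demo of my own (not from the original answer): with 10 rows and NTILE(4), the bucket sizes come out 3, 3, 2, 2 rather than exact quarters, because the remainder rows are assigned to the earlier buckets.
select v, ntile(4) over (order by v) as bucket
from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) d(v)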
First create a view to match your table column names
create view main_table
as
select type unit, number as timeinminutes from master..spt_values
Try this instead
select Unit, COUNT(*), SUM(TimeInMinutes)
FROM
(
    select *,
           ROW_NUMBER() over (order by TimeInMinutes) rn,
           COUNT(*) over () countRows
    from main_table
) N -- Numbered
where rn between countRows * 0.05 and countRows * 0.95
group by Unit, N.countRows * 0.05, N.countRows * 0.95
having count(*) > 20
The HAVING clause is applied to the remaining set after removing outliers.
For a dataset of 1,1,1,1,1,1,2,5,6,19, the use of ROW_NUMBER allows you to correctly remove just one instance of the 1's.
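A quick demo of that point (my own throwaway query, using the example data inline):
select v, row_number() over (order by v) as rn, count(*) over () as countRows
from (values (1),(1),(1),(1),(1),(1),(2),(5),(6),(19)) d(v)
-- rn 1..10 ranks the duplicates distinctly, so the rn-based filter trims
-- exactly the outermost rows instead of every row sharing a boundary value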
I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.
