Order by number of matches in array

I'm trying to get a list of best matching items given a list of tags, with the data below:
DROP TABLE IF EXISTS testing_items;
CREATE TEMP TABLE testing_items(
id bigserial primary key,
tags text[]
);
CREATE INDEX ON testing_items using gin (tags);
INSERT INTO testing_items (tags) VALUES ('{123,456, abc}');
INSERT INTO testing_items (tags) VALUES ('{222,333}');
INSERT INTO testing_items (tags) VALUES ('{222,555}');
INSERT INTO testing_items (tags) VALUES ('{222,123}');
INSERT INTO testing_items (tags) VALUES ('{222,123,555,666}');
I have the tags 222, 555 and 666. How can I get a list like the one below?
A GIN index must be used because there will be tons of records.
 id | matches
----+---------
  5 |       3
  3 |       2
  2 |       1
  4 |       1
id 1 should not be in the list because it doesn't match any tag:
 id | matches
----+---------
  1 |       0

Unnest the tags, filter the unnested elements and aggregate the remaining ones:
select id, count(distinct u) as matches
from (
select id, u
from testing_items,
lateral unnest(tags) u
where u in ('222', '555', '666')
) s
group by 1
order by 2 desc
 id | matches
----+---------
  5 |       3
  3 |       2
  2 |       1
  4 |       1
(4 rows)
Considering all the answers, it seems that this query combines the strengths of each of them:
select id, count(*)
from testing_items,
unnest(array['11','5','8']) u
where tags #> array[u]
group by id
order by 2 desc, 1;
It has the best performance in Eduardo's test.
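To verify that the planner actually uses the GIN index, EXPLAIN helps (a quick sketch; the exact plan depends on your data and settings, but you would hope to see a bitmap scan on the index rather than a sequential scan):
EXPLAIN (ANALYZE, BUFFERS)
select id, count(*)
from testing_items,
unnest(array['11','5','8']) u
where tags #> array[u]
group by id
order by 2 desc, 1;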

Here's my two cents using unnest and array contains:
select id, count(*)
from (
select unnest(array['222','555','666']) as tag, *
from testing_items
) as w
where tags #> array[tag]
group by id
order by 2 desc
Results:
+------+---------+
| id | count |
|------+---------|
| 5 | 3 |
| 3 | 2 |
| 2 | 1 |
| 4 | 1 |
+------+---------+
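One caveat: each search tag contributes at most one match per row, so duplicates inside a stored tags array are harmless, but a duplicated search tag would be counted twice. If the search list may contain repeats, count(distinct tag) guards against that (a small variant of the query above):
select id, count(distinct tag)
from (
select unnest(array['222','555','666']) as tag, *
from testing_items
) as w
where tags #> array[tag]
group by id
order by 2 desc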

This is how I tested with 10 million records, each with 3 tags holding random numbers between 1 and 99:
BEGIN;
LOCK TABLE testing_items IN EXCLUSIVE MODE;
INSERT INTO testing_items (tags)
SELECT (ARRAY[trunc(random() * 99 + 1), trunc(random() * 99 + 1), trunc(random() * 99 + 1)])
FROM generate_series(1, 10000000) s;
COMMIT;
I've added an ORDER BY c DESC, id LIMIT 5 so as not to wait for big responses.
@paqash's and @klin's solutions have similar performance: my laptop runs them in 12 seconds with the tags 11, 8 and 5.
But this runs in 4.6 seconds:
SELECT id, count(*) as c
FROM (
SELECT id FROM testing_items WHERE tags #> '{11}'
UNION ALL
SELECT id FROM testing_items WHERE tags #> '{8}'
UNION ALL
SELECT id FROM testing_items WHERE tags #> '{5}'
) as items
GROUP BY id
ORDER BY c DESC, id
LIMIT 5
But I still think there is a faster way.
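Another variant worth benchmarking (a sketch, not measured here) lets a single && probe on the GIN index prefilter the candidate rows and only then counts per-tag containment; whether it beats the UNION ALL depends on the data distribution:
SELECT id, count(*) as c
FROM testing_items, unnest(array['11','8','5']) u
WHERE tags && array['11','8','5'] -- one GIN-indexable overlap probe to prefilter rows
AND tags #> array[u] -- then count the per-tag matches
GROUP BY id
ORDER BY c DESC, id
LIMIT 5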

Check it here: http://rextester.com/UTGO74511
If you are using a GIN index, use &&:
select *
from testing_items
where not (ARRAY['333','555','666'] && tags);
 id |     tags
----+---------------
  1 | {123,456,abc}
  4 | {222,123}
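Note that && only answers a yes/no overlap question, so on its own it cannot rank rows by the number of matches; the positive form selects every row with at least one matching tag (ids 2, 3, 4 and 5 for the sample data), and one of the counting queries above is still needed for the ordering:
select *
from testing_items
where ARRAY['222','555','666'] && tags;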

Related

PostgreSQL - Filtering result set by array column

I have a function which returns a table. One of the columns happens to be a text array. Currently the values in this array column will only ever have at most 2 elements; however, there are instances where the same row is returned twice with the array elements in the opposite order. I'm hoping to find a way to return only 1 of these rows and discard the other. To give an example, I run a function as
SELECT * FROM schema.function($1,$2,$3)
WHERE conditions...;
which returns me something like this
ID | ARRAY_COL | ...
1 | {'Good','Day'} | ...
2 | {'Day','Good'} | ...
3 | {'Stuck'} | ...
4 | {'with'} | ...
5 | {'array'} | ...
6 | {'filtering'} | ...
So in this example, I want to return the whole result set, except that I only want either row 1 or row 2, as they have the same elements in the array (albeit inverted with respect to each other). I'm aware this is probably a bit of a messy problem, but it's something I need to get to the bottom of. Ideally I would like to stick a WHERE clause at the end of my function call which forces the result set to ignore any array value that has the same elements as a previous row. Pseudo code might be something like
SELECT * FROM schema.function($1,$2,$3)
WHERE NOT array_col #> (any previous array_col value);
Any pointers in the right direction would be much appreciated, thanks.
Not sure it's the best solution, but it could work, especially in cases where you might have partially overlapping arrays.
The solution is in 3 steps:
1. Unnest the array_col column and order by id and the item value:
select id,
unnest(array_col) array_col_val
from dataset
order by id,
array_col_val
2. Regroup by id; now the rows with id 1 and 2 have the same array_col value (the order by inside array_agg makes the element order deterministic, which the ordering of the previous step alone does not guarantee):
select id,
array_agg(array_col_val order by array_col_val) array_col
from ordering
group by id
3. Select the min id, grouping by array_col:
select array_col, min(id) from regrouping group by array_col
Full statement
with dataset as (
select 1 id, ARRAY['Good','Day'] array_col UNION ALL
select 2 id, ARRAY['Day','Good'] array_col UNION ALL
select 3 id, ARRAY['Stuck'] array_col UNION ALL
select 4 id, ARRAY['with'] array_col UNION ALL
select 5 id, ARRAY['array'] array_col UNION ALL
select 6 id, ARRAY['filtering'] array_col
)
, ordering as
(select id,
unnest(array_col) array_col_val
from dataset
order by id,
array_col_val)
, regrouping as
(
select id,
array_agg(array_col_val order by array_col_val) array_col
from ordering
group by id
)
select array_col, min(id) from regrouping group by array_col;
Result
  array_col  | min
-------------+-----
 {Stuck}     |   3
 {array}     |   5
 {filtering} |   6
 {with}      |   4
 {Day,Good}  |   1
(5 rows)
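The same idea can also be written more compactly (a sketch over the same dataset): normalize each array by sorting its elements in a subquery, then keep the lowest id per normalized value with DISTINCT ON:
with dataset as (
select 1 id, ARRAY['Good','Day'] array_col UNION ALL
select 2 id, ARRAY['Day','Good'] array_col UNION ALL
select 3 id, ARRAY['Stuck'] array_col UNION ALL
select 4 id, ARRAY['with'] array_col UNION ALL
select 5 id, ARRAY['array'] array_col UNION ALL
select 6 id, ARRAY['filtering'] array_col
)
select distinct on (sorted) id, array_col
from (
select id, array_col,
(select array_agg(x order by x) from unnest(array_col) x) sorted
from dataset
) s
order by sorted, id;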

Stored procedure: set output parameter from first row of select

I have a stored procedure in SQL Server 2014 that selects some rows from a table with pagination, along with total row count:
SELECT
[...], COUNT(*) OVER () AS RowCount
FROM
[...]
WHERE
[...]
ORDER BY
[...]
OFFSET ([..]) ROWS FETCH NEXT 3 ROWS ONLY
Output:
+----+------+----------+
| ID | Name | RowCount |
+----+------+----------+
| 1 | Bob | 55 |
| 123| John | 55 |
| 99 | Jack | 55 |
+----+------+----------+
I would like to return results with actual data only, passing RowCount in an output parameter.
+----+------+
| ID | Name |
+----+------+
| 1 | Bob |
| 123| John |
| 99 | Jack |
+----+------+
@OutRowCount = 55
I tried with a CTE, but the CTE is available only to the first SELECT:
WITH CTE AS
(
SELECT [...], COUNT(*) OVER () AS RowCount
FROM [...]
WHERE [...]
ORDER BY [...]
OFFSET ([..]) ROWS FETCH NEXT 3 ROWS ONLY
)
SELECT
ID, Name
FROM
CTE
SET @OutRowCount = (SELECT TOP 1 RowCount FROM CTE) -- here CTE is no longer defined
How can I do this? I think I can use a temp table, but I wonder if performance might be an issue in this case.
The "total row count" you have in mind is a bit unclear. Typically when paging you also display to total number of (filtered) rows, e.g. "Showing 3 of 42 Blue Widgets". That doesn't involve Max.
A CTE can have multiple queries, e.g.:
with
AllRows as ( -- All of the filtered rows.
select ..., Count(*) over (...) as RowCount
from ...
where ... -- Filter criteria.
),
SinglePage as ( -- One page of filtered rows.
select ...
from AllRows
order by ... -- Order here to get the correct rows in the page.
offset (...) rows fetch next 3 rows only
)
select SP.Id, SP.Name,
( select Count(42) from AllRows ) as TotalRowCount -- Constant over all rows.
from SinglePage as SP
order by ...; -- Keep the rows in the desired order.
Re: SET @OutRowCount = (SELECT TOP 1 RowCount FROM CTE)
Note that TOP 1 without order by isn't guaranteed to pick the row you have in mind.
Thanks to @Larnu and @Stu, I solved this using a table variable, this way:
CREATE PROCEDURE MyProc
@OutRowCount INT OUTPUT
AS
BEGIN
DECLARE @TempTbl TABLE (
ID INT,
Name VARCHAR(MAX),
RowCount INT
)
INSERT INTO
@TempTbl
SELECT
ID,
Name,
COUNT(*) OVER() AS RowCount
FROM
MyTable
WHERE
[...]
ORDER BY
Name
OFFSET ([...]) ROWS FETCH NEXT 3 ROWS ONLY
SELECT
ID,
Name
FROM
@TempTbl
SET @OutRowCount = ISNULL((SELECT TOP 1 RowCount FROM @TempTbl), 0)
END
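For reference, a call would look something like this (a sketch; the local variable name is illustrative):
DECLARE @Total INT;
EXEC MyProc @OutRowCount = @Total OUTPUT;
SELECT @Total AS TotalRows; -- 55 in the example above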

How To Avoid TempTable in Union All when queries contain DIFFERENT order by and inner join?

What I am trying to do is always send products with 0 quantity to the end of an already sorted temp table without losing the current sorting (as I described in the following question: How to send Zero Qty Products to the end of a PagedList<Products>?).
I have one sorted temp table which is already filled (it is sorted by whatever the user has selected, like alphabetic, by price or by newest product; the sorting is reflected in the identity id):
CREATE TABLE #DisplayOrderTmp
(
[Id] int IDENTITY (1, 1) NOT NULL,
[ProductId] int NOT NULL
)
sorted #DisplayOrderTmp :
+------------+---------------+
| id | ProductId |
+------------+---------------+
| 1 | 66873 | // Qty is 0
| 2 | 70735 | // Qty is not 0
| 3 | 17121 | // Qty is not 0
| 4 | 48512 | // Qty is not 0
| 5 | 51213 | // Qty is 0
+------------+---------------+
I want to pass this data to a web page, but before that I need to move the products with zero quantity to the end of this list without losing the current sorting.
My returned data should be like this (the sorting doesn't change; the 0 quantity products just move to the end of the list in their original order):
CREATE TABLE #DisplayOrderTmp4
(
[Id] int IDENTITY (1, 1) NOT NULL,
[ProductId] int NOT NULL
)
+------------+---------------+
| id | ProductId |
+------------+---------------+
| 1 | 70735 |
| 2 | 17121 |
| 3 | 48512 |
| 4 | 66873 |
| 5 | 51213 |
+------------+---------------+
P.S.: This is my Product table, which I have to inner join with the temp table to find the quantity of each product.
The Product table looks like this:
+------------+---------------+------------------+
| id | stockqty | DisableBuyButton |
+------------+---------------+------------------+
| 17121 | 1 | 0 |
| 48512 | 27 | 0 |
| 51213 | 0 | 1 |
| 66873 | 0 | 1 |
| 70735 | 11 | 0 |
+------------+---------------+------------------+
What I have tried so far is this (it works, but slowly, and has performance issues; I have almost 30k products):
INSERT INTO #DisplayOrderTmp2 ([ProductId])
SELECT p2.ProductId
FROM #DisplayOrderTmp p2 with (NOLOCK) -- the already sorted table
INNER JOIN Product prd with (NOLOCK)
ON p2.ProductId=prd.Id
and prd.DisableBuyButton=0 -- products with qty more than 0
group by p2.ProductId order by min(p2.Id) -- to preserve the current ordering

INSERT INTO #DisplayOrderTmp3 ([ProductId])
SELECT p2.ProductId
FROM #DisplayOrderTmp p2 with (NOLOCK) -- the already sorted table
INNER JOIN Product prd with (NOLOCK)
ON p2.ProductId=prd.Id
and prd.DisableBuyButton=1 -- products with qty equal to 0
group by p2.ProductId order by min(p2.Id) -- to preserve the current ordering

INSERT INTO #DisplayOrderTmp4 ([ProductId]) -- finally UNION ALL the two sets
SELECT p2.ProductId FROM
#DisplayOrderTmp2 p2 with (NOLOCK) -- more-than-0 qty products, ordering preserved
UNION ALL
SELECT p2.ProductId FROM
#DisplayOrderTmp3 p2 with (NOLOCK) -- 0 qty products, ordering preserved
Is there any way to avoid creating temp tables in this query, i.e. to send the 0 quantity products of the first temp table to the end of the data list without creating three other temp tables and without losing the current ordering based on the identity ID? My query has a performance problem.
I have to say again that the temp table has an identity ID column and is sorted based on the sorting type which the user passed to the stored procedure.
Thank you all :)
Make sure the temp table has an index or primary key with Id as the leading column. This will help avoid sort operators in the plan for the ordering:
CREATE TABLE #DisplayOrderTmp
(
[Id] int NOT NULL,
[ProductId] int NOT NULL
,PRIMARY KEY CLUSTERED(Id)
);
With that index, you should be able to get the result without additional temp tables with reasonable efficiency using a UNION ALL query, assuming ProductID is the Product table primary key:
WITH products AS (
SELECT p2.Id, p2.ProductId, prd.stockqty, 1 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd
ON p2.ProductId=prd.Id
WHERE prd.stockqty > 0
UNION ALL
SELECT p2.Id, p2.ProductId, prd.stockqty, 2 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd
ON p2.ProductId=prd.Id
WHERE prd.stockqty = 0
)
SELECT ProductId
FROM products
ORDER BY seq, Id;
You mentioned in comments that you ultimately want a paginated result. This can be done in T-SQL by adding OFFSET and FETCH to the ORDER BY clause as below. However, be aware that pagination over a large result set will become progressively slower the further into the result one queries.
WITH products AS (
SELECT p2.Id, p2.ProductId, prd.stockqty, 1 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd
ON p2.ProductId=prd.Id
WHERE prd.stockqty > 0
UNION ALL
SELECT p2.Id, p2.ProductId, prd.stockqty, 2 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd
ON p2.ProductId=prd.Id
WHERE prd.stockqty = 0
)
SELECT ProductId
FROM products
ORDER BY seq, Id
OFFSET @PageSize * (@PageNumber - 1) ROWS
FETCH NEXT @PageSize ROWS ONLY;
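If the deep-page slowdown becomes a problem, keyset pagination is the usual alternative (a sketch; @PageSize, @LastSeq and @LastId are illustrative parameters holding the last row the client saw):
WITH products AS (
SELECT p2.Id, p2.ProductId, 1 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd ON p2.ProductId = prd.Id
WHERE prd.stockqty > 0
UNION ALL
SELECT p2.Id, p2.ProductId, 2 AS seq
FROM #DisplayOrderTmp p2
JOIN Product prd ON p2.ProductId = prd.Id
WHERE prd.stockqty = 0
)
SELECT TOP (@PageSize) ProductId
FROM products
WHERE seq > @LastSeq
OR (seq = @LastSeq AND Id > @LastId)
ORDER BY seq, Id;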
You could use ORDER BY without using UNION ALL:
SELECT p2.ProductId
FROM #DisplayOrderTmp p2
JOIN Product prd
ON p2.ProductId=prd.Id
ORDER BY prd.DisableBuyButton, p2.id;
(DisableBuyButton = 0 corresponds to qty > 0; DisableBuyButton = 1 to qty = 0.)
It seems it only needs an extra key in the ORDER BY. An IIF or CASE can be used to give priority in the sorting.
SELECT tmp.ProductId
FROM #DisplayOrderTmp tmp
JOIN Product prd
ON prd.Id = tmp.ProductId
AND prd.DisableBuyButton IN (0,1)
ORDER BY IIF(prd.DisableBuyButton=0,1,2), tmp.id;

SQL Server : Bulk insert a Datatable into 2 tables

Consider this datatable :
word wordCount documentId
---------- ------- ---------------
Ball 10 1
School 11 1
Car 4 1
Machine 3 1
House 1 2
Tree 5 2
Ball 4 2
I want to insert these data into two tables with this structure :
Table WordDictionary
(
Id int,
Word nvarchar(50),
DocumentId int
)
Table WordDetails
(
Id int,
WordId int,
WordCount int
)
FOREIGN KEY (WordId) REFERENCES WordDictionary(Id)
But because I have thousands of records in the initial table, I have to do this in just one transaction (a batch query); for example, using bulk insert could serve this purpose.
The question here is how I can split this data between the two tables WordDictionary and WordDetails.
For more details :
Final result must be like this :
Table WordDictionary:
Id word
---------- -------
1 Ball
2 School
3 Car
4 Machine
5 House
6 Tree
and table WordDetails :
Id wordId WordCount DocumentId
---------- ------- ----------- ------------
1 1 10 1
2 2 11 1
3 3 4 1
4 4 3 1
5 5 1 2
6 6 5 2
7 1 4 2
Notice:
The words in the source can be duplicated, so I must check for word existence in table WordDictionary before inserting records into these tables, and if a word is already in WordDictionary, that word's Id must be inserted into WordDetails (see the word Ball).
Finally, the million-dollar problem: this insertion must be done as fast as possible.
If you're looking to just load the tables the first time, without any updates to them over time, you could potentially do it this way (note that SELECT ... INTO creates the target tables, so here they must not exist beforehand):
You can put all of the distinct words from the datatable into the WordDictionary table first; the IDENTITY function generates the Id values, since SELECT ... INTO would otherwise create the table without one:
SELECT IDENTITY(INT, 1, 1) AS Id, word
INTO WordDictionary
FROM (SELECT DISTINCT word FROM datatable) AS d;
Then after you populate your WordDictionary you can use the Id values from it, together with the rest of the information from datatable, to load your WordDetails table:
SELECT IDENTITY(INT, 1, 1) AS Id, WD.Id AS wordId, DT.wordCount AS WordCount, DT.documentId AS DocumentId
INTO WordDetails
FROM datatable AS DT
INNER JOIN WordDictionary AS WD ON WD.word = DT.word;
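If this is not a one-time load, later batches only need to add the words that are not in the dictionary yet before loading their details (a sketch, assuming the same table names; the IDENTITY column created above numbers new rows automatically):
INSERT INTO WordDictionary (word)
SELECT DISTINCT dt.word
FROM datatable AS dt
WHERE NOT EXISTS (SELECT 1 FROM WordDictionary AS wd WHERE wd.word = dt.word);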
There is a little discrepancy between the declared table schema and your example data, but it is resolved here:
1) Setup
-- this the table with the initial data
-- drop table DocumentWordData
create table DocumentWordData
(
Word NVARCHAR(50),
WordCount INT,
DocumentId INT
)
GO
-- these are result table with extra information (identity, primary key constraints, working foreign key definition)
-- drop table WordDictionary
create table WordDictionary
(
Id int IDENTITY(1, 1) CONSTRAINT PK_WordDictionary PRIMARY KEY,
Word nvarchar(50)
)
GO
-- drop table WordDetails
create table WordDetails
(
Id int IDENTITY(1, 1) CONSTRAINT PK_WordDetails PRIMARY KEY,
WordId int CONSTRAINT FK_WordDetails_Word REFERENCES WordDictionary,
WordCount int,
DocumentId int
)
GO
2) The actual script to put data in the last two tables
begin tran
-- this is to make sure that if anything in this block fails, then everything is automatically rolled back
set xact_abort on
-- the dictionary is obtained by considering all distinct words
insert into WordDictionary (Word)
select distinct Word
from DocumentWordData
-- details are generating from initial data joining the word dictionary to get word id
insert into WordDetails (WordId, WordCount, DocumentId)
SELECT W.Id, DWD.WordCount, DWD.DocumentId
FROM DocumentWordData DWD
JOIN WordDictionary W ON W.Word = DWD.Word
commit
-- just to test the results
select * from WordDictionary
select * from WordDetails
I expect this script to run very fast if you do not have a very large number of records (a few million at most).
This is the query; I'm using temp tables to be able to test it. If you use the 2 CTEs, you'll be able to generate the final result.
1. Set up sample data for the test:
create table #original (word varchar(10), wordCount int, documentId int)
insert into #original values
('Ball', 10, 1),
('School', 11, 1),
('Car', 4, 1),
('Machine', 3, 1),
('House', 1, 2),
('Tree', 5, 2),
('Ball', 4, 2)
2. Use cte1 and cte2. In your real database, you need to replace #original with the actual table holding all the initial records:
;with cte1 as (
select ROW_NUMBER() over (order by word) Id, word
from #original
group by word
)
select * into #WordDictionary
from cte1
;with cte2 as (
select ROW_NUMBER() over (order by #original.word) Id, #WordDictionary.Id as wordId,
#original.word, #original.wordCount, #original.documentId
from #WordDictionary
inner join #original on #original.word = #WordDictionary.word
)
select * into #WordDetails
from cte2
select * from #WordDetails
This will be the data in #WordDetails:
+----+--------+---------+-----------+------------+
| Id | wordId | word | wordCount | documentId |
+----+--------+---------+-----------+------------+
| 1 | 1 | Ball | 10 | 1 |
| 2 | 1 | Ball | 4 | 2 |
| 3 | 2 | Car | 4 | 1 |
| 4 | 3 | House | 1 | 2 |
| 5 | 4 | Machine | 3 | 1 |
| 6 | 5 | School | 11 | 1 |
| 7 | 6 | Tree | 5 | 2 |
+----+--------+---------+-----------+------------+

SQL insert based on maximum date

I have to insert values from one table into another, but the condition is that out of the rows sharing the same id I have to select the one having the maximum date and insert that one, like:
table 1
a | b
1 | 12/1/13
1 | 18/1/13
2 | 2/4/13
2 | 9/8/13
table 2
a | b
1 | 18/1/13
2 | 9/8/13
Please suggest the SQL query for it.
Could you try:
INSERT INTO Table2 (idcolumn, datecolumn)
SELECT idcolumn, datecolumn
FROM (
SELECT idcolumn, datecolumn,
ROW_NUMBER() OVER (PARTITION BY idcolumn ORDER BY datecolumn DESC) AS rn
FROM Table1
) t
WHERE rn = 1;
INSERT INTO table2(a,b)
SELECT a, MAX(b) AS b
FROM table1
GROUP BY a;
