SQL query optimization for select query over IN query

I have one view, and I want to add pagination logic on this view. It has over 1.5 million records. The query takes a long time when my WHERE condition selects only the records mapped to one Id.
I am thinking of first getting only those mapped records from the main table and then selecting only those records from the view. Will this be faster?
Select top 10 col1, col2, col3, ROW_NUMBER() OVER (ORDER BY col4 desc) from vMyView where someid=1
Then
Select top 10 col1, col2, col3 from vMyView where col1 in (Select col1 from tMyTable where someid=1)
FYI, I am not an expert.

Assuming typical cardinality, I tend to write it more like this:
select top 10 col1, col2, col3
from vMyView v
inner join tMyTable t ON t.col1 = v.col1
WHERE t.someid = 1
However, if it's possible to match more than one row in tMyTable for each col1 value in vMyView, this could possibly result in duplicating rows from vMyView. If duplicating rows is possible, a solution based on row_number() is typically the fastest option.
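The duplication risk can be seen on a tiny data set. The following is only a sketch: it uses SQLite through Python's sqlite3 module so that it runs anywhere, a plain table stands in for vMyView, and the sample rows are made up. It also swaps in an EXISTS filter as one way to avoid the duplicates, which is an alternative to the row_number() approach mentioned above, not the same technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vMyView (col1 INTEGER, col2 TEXT);   -- plain table standing in for the view
    CREATE TABLE tMyTable (col1 INTEGER, someid INTEGER);
    INSERT INTO vMyView VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- col1 = 1 is mapped twice to someid = 1 (hypothetical data)
    INSERT INTO tMyTable VALUES (1, 1), (1, 1), (2, 1), (3, 2);
""")

# The join returns one output row per matching row in tMyTable.
join_rows = conn.execute("""
    SELECT v.col1, v.col2
    FROM vMyView v
    INNER JOIN tMyTable t ON t.col1 = v.col1
    WHERE t.someid = 1
    ORDER BY v.col1
""").fetchall()

# EXISTS only tests for a match, so each view row appears at most once.
exists_rows = conn.execute("""
    SELECT v.col1, v.col2
    FROM vMyView v
    WHERE EXISTS (SELECT 1 FROM tMyTable t
                  WHERE t.col1 = v.col1 AND t.someid = 1)
    ORDER BY v.col1
""").fetchall()

print(join_rows)    # [(1, 'a'), (1, 'a'), (2, 'b')]
print(exists_rows)  # [(1, 'a'), (2, 'b')]
```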
"I want to add pagination logic on this view"
As for paging, you should look into OFFSET/FETCH syntax, rather than TOP n.
SELECT col1, col2, col3
FROM vMyView v
ORDER BY <need an order by clause for paging to work>
OFFSET <pagenumber * pagesize> FETCH NEXT <pagesize> ROWS ONLY
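As a rough illustration of the offset arithmetic, here is a sketch using SQLite through Python's sqlite3 module. SQLite spells paging as LIMIT/OFFSET rather than OFFSET/FETCH, the fetch_page helper is hypothetical, and the page number is assumed to be zero-based so that the offset is pagenumber * pagesize. A tie-breaker column is added to the ORDER BY so that pages stay stable when col4 has duplicates.

```python
import sqlite3

def fetch_page(conn, page_number, page_size):
    """Return one page of rows; page_number is zero-based."""
    return conn.execute(
        "SELECT col1, col4 FROM vMyView "
        "ORDER BY col4 DESC, col1 "   # deterministic order is required for paging
        "LIMIT ? OFFSET ?",
        (page_size, page_number * page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
# Plain table standing in for the view, filled with made-up rows.
conn.execute("CREATE TABLE vMyView (col1 INTEGER, col4 INTEGER)")
conn.executemany("INSERT INTO vMyView VALUES (?, ?)",
                 [(i, i * 10) for i in range(1, 8)])

first_page = fetch_page(conn, 0, 3)
third_page = fetch_page(conn, 2, 3)
print(first_page)  # [(7, 70), (6, 60), (5, 50)]
print(third_page)  # [(1, 10)]
```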

Related

Snowflake order by

Is there any way I can select from a table without specifying the order by column in the order by clause?
select col1 from table order by col2
This works in TSQL, but doesn't appear to be allowed in Snowflake.

Yes, this is possible:
CREATE OR REPLACE TABLE tab AS
SELECT 1 AS col1, 'B' AS col2 UNION ALL
SELECT 2, 'A';
SELECT col1
FROM tab
ORDER BY col2;

Effective way to delete duplicate rows from millions of records

I am looking for an effective way to delete duplicated records from my database. First, I used a stored procedure that uses joins and such, which caused the query to execute very slowly. Now, I am trying a different approach. Please consider the following queries:
/* QUERY A */
SELECT *
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value
This query just executed in 12 seconds, with a result of 182.400 records. The row count in the table is currently 420.930.407, and col1 and col3 are indexed.
The next query:
/* QUERY B */
WITH ALL_RECORDS AS
(SELECT id
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value)
SELECT *
FROM ALL_RECORDS
This query took less than 2 seconds, and gives me all the ids of the 182.400 records in the table (according to the where clause).
Then, my last query, is a query that selects the lowest (first) id of all records grouped on the columns I want to group on to check for duplicates:
/* QUERY C */
SELECT MIN(id)
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value
GROUP BY col1,
col2,
col3,
col4,
col5,
col6
Again, this query executes in less than 2 seconds. The result is 30.400 rows, which means there are 30.400 unique records among the 182.400 records.
Now, I'd like to delete (or, first select to make sure I have my query right) all records that are not unique. So, I'd like to remove 182.400 - 30.400 = 152.000 records from my_table.
I thought I'd combine the last two queries: get all ids that belong to my dataset according to the where clause on col1, col2 and col3 (query B), and then delete/select all records from that dataset whose id is not in the list of unique record ids (query C).
However, when I select all from query B where query B.id NOT IN query C, the query does not take 2, 4, or even 12 to 16 seconds, but seems to take forever (20.000 records shown after 1 minute and around 40.000 after 2 minutes; I canceled the query, since at that rate finding all 152.000 records would take about 8 minutes).
WITH ALL_RECORDS AS
(SELECT id
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value)
SELECT id
FROM ALL_RECORDS
WHERE id NOT IN
(SELECT MIN(id)
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value
GROUP BY col1,
col2,
col3,
col4,
col5,
col6)
I know NOT IN is slow, but I can't grasp how it's THIS slow (since both queries without the not in part execute in less than 2 seconds each).
Does anyone have some good advice for me on how to solve this puzzle?
------------------ Additional information ------------------
Previous solution was the following stored procedure. For some reason it executes perfectly on my acceptance environment, but not on my production environment. Currently, we have over 400 million records on production and a little over 2 million records on acceptance, so this might be a reason.
DELETE my_table
FROM my_table
LEFT OUTER JOIN
(SELECT MIN(id) AS RowId,
col1,
col2,
col3,
col4,
col5,
col6
FROM my_table
WHERE col1 = value
AND col2 = value
AND col3 = value
GROUP BY col1,
col2,
col3,
col4,
col5,
col6) AS KeepRows ON my_table.id = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
AND my_table.col1 = value
AND my_table.col2 = value
AND my_table.col3 = value
I have based this solution on another answer on Stack Overflow (can't find it at the moment), but I feel I should be able to create a query based on Query B and C that executes within a few seconds...

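WITH dupl AS (
    -- number the rows within each group of duplicates, lowest id first
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3, col4, col5, col6 ORDER BY id) AS rn,
           id, col1, col2, col3, col4, col5, col6
    FROM my_table
)
-- rn = 1 is the row to keep; every other row in the group is a duplicate
DELETE FROM dupl WHERE rn > 1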
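To see which rows such a row_number() delete keeps, here is a small sketch in SQLite via Python's sqlite3 module. It is not T-SQL: SQLite cannot delete through a CTE, so the numbered rows feed an IN list instead, and window functions need SQLite 3.25 or newer. The table is cut down to two grouping columns and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, col1, col2) VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x"), (5, "b", "z")],
)

# Number the rows inside each duplicate group by ascending id,
# then delete every row that is not the first of its group.
conn.execute("""
    DELETE FROM my_table
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY id) AS rn
            FROM my_table
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, col1, col2 FROM my_table ORDER BY id").fetchall()
print(remaining)  # [(1, 'a', 'x'), (3, 'b', 'y'), (5, 'b', 'z')]
```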

Combining two 2-second queries together will not, generally, result in a single 4-second query, because queries, unlike their underlying tables, are rarely indexed.
The usual approach for this kind of task is to cache the ids you want to keep in a temporary table, index it accordingly, and then use it in the left join (or NOT IN; I bet the resulting execution plans are practically the same).
You can probably get some more performance by experimenting with indexes on the main table. For example, I think that a composite index on (col1, col2, col3) should give your code some boost (the columns do not necessarily have to be listed in this order; the best order usually depends on their cardinalities).
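A minimal sketch of the temporary-table idea, again in SQLite through Python's sqlite3 module rather than T-SQL. The keep_ids table and ix_keep_ids index are hypothetical names, the table is reduced to two columns, and because SQLite has no DELETE ... JOIN the final step uses NOT EXISTS against the indexed temp table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, col1 TEXT, col4 TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "v", "p"), (2, "v", "p"), (3, "v", "q"),
     (4, "w", "p"), (5, "w", "p"), (6, "v", "q")],
)

# Step 1: cache the ids to keep (lowest id per duplicate group)
# for the slice of the table being cleaned up (col1 = 'v').
conn.execute("""
    CREATE TEMP TABLE keep_ids AS
    SELECT MIN(id) AS id
    FROM my_table
    WHERE col1 = 'v'
    GROUP BY col1, col4
""")

# Step 2: index the cached ids so the lookup in step 3 is cheap.
conn.execute("CREATE UNIQUE INDEX ix_keep_ids ON keep_ids (id)")

# Step 3: delete rows of the slice whose id was not cached.
deleted = conn.execute("""
    DELETE FROM my_table
    WHERE col1 = 'v'
      AND NOT EXISTS (SELECT 1 FROM keep_ids k WHERE k.id = my_table.id)
""").rowcount

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM my_table ORDER BY id")]
print(deleted)        # 2
print(remaining_ids)  # [1, 3, 4, 5]
```

Rows with col1 = 'w' are outside the slice and stay untouched, even though they contain a duplicate.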

mssql checksum on different tables

I need to find out whether two rows (one having the id of the other plus 50000) are the same. Is there any way to make this query work?
select 1
from table1 c1,
table2 c2
where c1.id=c2.id+50000 and CHECKSUM(c1.*) = CHECKSUM(c2.*)
CHECKSUM() apparently does not accept "table.*" expressions. It accepts either "*" alone or list of columns, but I can't do that as this query needs to work also for other tables with other columns.
EDIT: I just realized that CHECKSUM() will not work as the value will always be different if the IDs are different....
The original question still holds out of curiosity.

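Try something like this. It will work for most data types (not TEXT and some others). The id column is left out of the column list, and EXCEPT treats NULLs as equal, so NOT EXISTS is true only when every listed column matches:
SELECT 1
FROM table1 c1
JOIN table2 c2
ON c1.id = c2.id + 50000
AND NOT EXISTS (SELECT c1.col1, c1.col2, c1.col3, c1.col4
EXCEPT
SELECT c2.col1, c2.col2, c2.col3, c2.col4)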

You can do it using derived tables:
SELECT
SUM(CASE
WHEN a.cs <> b.cs THEN 1
ELSE 0
END)
FROM (SELECT RowNumber, CHECKSUM(*) AS cs FROM #A) a
FULL OUTER JOIN (SELECT RowNumber, CHECKSUM(*) AS cs FROM #B) b
ON a.RowNumber=b.RowNumber;
This is an excerpt from a script I've written previously. I have not changed any of the object names to match your example. The result of this query is the number of differences between #A and #B where the RowNumber columns match.
To apply this to your case, you can create two temporary tables, populating them from the originals, but replacing the ID column with a "RowNumber" column that matches between the rows you want to compare (i.e., c1.id=c2.id+50000). That way, you don't have mismatched IDs interfering with the CHECKSUM.

SQL for pulling targeted data from a table

Friends,
I have a quick question.
I have a log table and I need to pull specific info from it.
There are lots of columns, including a date/time stamp and some transaction codes.
Records are for multiple account numbers.
I would like to pull the following:
Pull a couple of fields from the record (for each account number) with transaction code of 100. There can be multiple records with this code.
Find the first transaction with transcode 101 AFTER that 100 code record and include the timestamp from this record.
Any help, as always, will be greatly appreciated!
Thanks.

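Your question is very vague. Please edit it and add more information. Based on what I have understood so far, this is what I could come up with:
SELECT AccountNumber, COL1, COL2, COL3 FROM YOURTABLE WHERE TRANSACTION_CODE = 100 -- no aggregation
I am assuming that by 'after' you mean the next timestamp for the same account. In that case, a correlated subquery returns it for each code 100 record (NULL when no later 101 exists):
SELECT T100.AccountNumber, T100.COL1, T100.COL2,
(SELECT MIN(T101.DATETIME_TSTAMP) FROM YOURTABLE T101
WHERE T101.AccountNumber = T100.AccountNumber AND T101.TRANSACTION_CODE = 101
AND T101.DATETIME_TSTAMP > T100.DATETIME_TSTAMP) AS NEXT_101_TSTAMP
FROM YOURTABLE T100 WHERE T100.TRANSACTION_CODE = 100
OR, with ROW_NUMBER (this form omits code 100 records that have no later 101 and assumes the code 100 timestamps are distinct per account):
SELECT AccountNumber, COL1, COL2, NEXT_101_TSTAMP FROM (
SELECT T100.AccountNumber, T100.COL1, T100.COL2, T101.DATETIME_TSTAMP AS NEXT_101_TSTAMP,
ROW_NUMBER() OVER (PARTITION BY T100.AccountNumber, T100.DATETIME_TSTAMP ORDER BY T101.DATETIME_TSTAMP) AS ROWNUM
FROM YOURTABLE T100 JOIN YOURTABLE T101
ON T101.AccountNumber = T100.AccountNumber AND T101.TRANSACTION_CODE = 101 AND T101.DATETIME_TSTAMP > T100.DATETIME_TSTAMP
WHERE T100.TRANSACTION_CODE = 100) A
WHERE ROWNUM = 1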
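The "first 101 after each 100" lookup can be checked on a handful of rows. This is only a sketch under assumptions: it runs on SQLite via Python's sqlite3 module, the log_table, account, trans_code and ts names are invented for the example, and timestamps are stored as sortable text. It pairs each code 100 record with the earliest later code 101 record of the same account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (account TEXT, trans_code INTEGER, ts TEXT)")
conn.executemany("INSERT INTO log_table VALUES (?, ?, ?)", [
    ("A", 100, "2020-01-01 10:00"),
    ("A", 101, "2020-01-01 09:00"),  # earlier than the 100 record, so ignored
    ("A", 101, "2020-01-01 11:00"),  # first 101 after the 100 record
    ("A", 101, "2020-01-01 12:00"),
    ("B", 100, "2020-01-02 08:00"),  # no later 101 exists for account B
    ("B", 101, "2020-01-01 07:00"),
])

# For every code 100 row, the correlated subquery finds the earliest
# code 101 timestamp of the same account that is strictly later.
rows = conn.execute("""
    SELECT t100.account, t100.ts,
           (SELECT MIN(t101.ts)
            FROM log_table t101
            WHERE t101.account = t100.account
              AND t101.trans_code = 101
              AND t101.ts > t100.ts) AS next_101_ts
    FROM log_table t100
    WHERE t100.trans_code = 100
    ORDER BY t100.account, t100.ts
""").fetchall()

print(rows)
# [('A', '2020-01-01 10:00', '2020-01-01 11:00'), ('B', '2020-01-02 08:00', None)]
```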

How can I use the plus operation for a column in SQL Server with the result in the next row?

I have a table in SQL Server, and I want to add up the amounts in a specific column and have the result in the next row.
How can I do that?

"I want to add up the amounts in a specific column and have the result in the next row."
In case you are looking to insert another row based on an existing row after adding a certain amount, you could use the following:
INSERT INTO MyTable (Col1, Col2, Col3)
SELECT Col1, Col2 + <additional amount>, Col3
FROM MyTable
WHERE
<Criteria to select that row of interest>
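In case you are looking to select all the rows in a table, aggregate the amount column, and show the result in a separate row, you could use the following. UNION ALL keeps duplicate rows, which a plain UNION would remove, and without an ORDER BY the position of the total row is not guaranteed:
SELECT Col1, Col2, Col3 FROM MyTable
UNION ALL
SELECT NULL, SUM(Col2), NULL FROM MyTable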
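To make the total reliably land in the last row, one option is an extra sort key. The sketch below uses SQLite via Python's sqlite3 module; the is_total column, the 'Total' label and the sample amounts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Col1 TEXT, Col2 INTEGER)")
conn.executemany("INSERT INTO MyTable VALUES (?, ?)",
                 [("a", 10), ("b", 20), ("c", 30)])

# Detail rows get is_total = 0 and the summary row gets is_total = 1,
# so sorting on is_total first pushes the total to the end.
rows = conn.execute("""
    SELECT Col1, Col2
    FROM (
        SELECT Col1, Col2, 0 AS is_total FROM MyTable
        UNION ALL
        SELECT 'Total', SUM(Col2), 1 FROM MyTable
    )
    ORDER BY is_total, Col1
""").fetchall()

print(rows)  # [('a', 10), ('b', 20), ('c', 30), ('Total', 60)]
```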
