SQL Server get delta items

I have a table with many millions of records and an integer primary key.
I also have a somewhat smaller set of IDs (still in the millions) that are on a "black list". They are stored in memory (read from a file on disk).
I have to select the records that are NOT in the black list, that is, all the records whose ID is not in my black list.
I solved this using a temp table (single column: ID): insert the unwanted IDs, then select all records whose IDs are not in this table.
My main concern is performance:
Inserting so many records into the temp table.
Selecting the items that are not in the temp table.
EDIT
At the moment I use a temp table like this:
create the temp table with a single column (ID)
fill the temp table with the IDs
create a nonclustered index on the column
get the delta items with a query similar to this:
select m.id from mytable m where m.id not in (select id from #tempTable)
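A minimal sketch of that flow in T-SQL (table and column names are placeholders; NOT EXISTS is shown as a commonly safer alternative to NOT IN when the subquery could return NULLs):
CREATE TABLE #tempTable (id int NOT NULL);
-- fill the temp table with the blacklisted IDs, ideally in large batches
-- INSERT INTO #tempTable (id) VALUES (1), (2), (3), ...;
CREATE NONCLUSTERED INDEX IX_tempTable_id ON #tempTable (id);
-- NOT EXISTS avoids the NULL pitfall of NOT IN and typically yields
-- the same anti-semi-join plan
SELECT m.id
FROM mytable AS m
WHERE NOT EXISTS (SELECT 1 FROM #tempTable AS t WHERE t.id = m.id);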

The best option you have here is to add a column that flags each row as blacklisted or not (e.g. call it isBlacklisted), and keep this column up to date.
You can also add a nonclustered index on this flag so you can quickly select your data where isBlacklisted = true / false.
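A hedged sketch of that idea, assuming the blacklisted IDs have already been loaded into a #blacklist temp table (all names here are illustrative):
ALTER TABLE mytable ADD isBlacklisted bit NOT NULL DEFAULT 0;
-- mark the rows whose IDs appear in the black list
UPDATE m
SET m.isBlacklisted = 1
FROM mytable AS m
JOIN #blacklist AS b ON b.id = m.id;
-- index the flag so both the 0 and 1 slices are cheap to read
CREATE NONCLUSTERED INDEX IX_mytable_isBlacklisted ON mytable (isBlacklisted);
-- the delta is then a simple filter on the flag
SELECT m.id FROM mytable AS m WHERE m.isBlacklisted = 0;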


(Alembic, SQLAlchemy) Can I copy data from a non-partitioned table to a partitioned one in the migration script?

I have a table that needs to be partitioned, but since postgresql_partition_by wasn't specified when the table was created, I am trying to:
create a new partitioned table that is similar to the original one.
move the data from the old one to the new one.
drop the original one.
rename the new one.
So what is the best practice for moving the data from the old table to the new one?
I tried this and it didn't work:
COPY partitioned_table
FROM original_table;
I also tried:
INSERT INTO partitioned_table (column1, column2, ...)
SELECT column1, column2, ...
FROM original_table;
but neither worked :(
Note that I am using Alembic to generate the migration scripts and SQLAlchemy from Python.
Basically you have two scenarios, described below.
- The table is large and you need to split the data in several partitions
- The table gets the first partition and you add new partition for new data
Let's use this setup for the non-partitioned table:
create table jdbn.non_part
(id int not null, name varchar(100));
insert into jdbn.non_part (id,name)
SELECT id, 'xxxxx'|| id::varchar(20) name
from generate_series(1,1000) id;
The table contains ids from 1 to 1000, and for the first case you need to split them into two partitions of 500 rows each.
Create the partitioned table
with a structure and constraints identical to the original table:
create table jdbn.part
(like jdbn.non_part INCLUDING DEFAULTS INCLUDING CONSTRAINTS)
PARTITION BY RANGE (id);
Add partitions
to cover current data
create table jdbn.part_500 partition of jdbn.part
for values from (1) to (501); /* 1 <= id < 501 */
create table jdbn.part_1000 partition of jdbn.part
for values from (501) to (1001);
for future data (as required)
create table jdbn.part_1500 partition of jdbn.part
for values from (1001) to (1501);
Use INSERT to copy the data
Note that this approach copies the data, which means you need twice the space and, later, a cleanup of the old data.
insert into jdbn.part (id,name)
select id, name from jdbn.non_part;
Check partition pruning
Note that only the partition part_500 is accessed
EXPLAIN SELECT * FROM jdbn.part WHERE id <= 500;
QUERY PLAN |
----------------------------------------------------------------+
Seq Scan on part_500 part (cost=0.00..14.00 rows=107 width=222)|
Filter: (id <= 500) |
Second Option - MOVE Data to one Partition
If you can live with one (big) initial partition, you may use the second approach.
Create the partitioned table
same as above
Attach the table as a partition
ALTER TABLE jdbn.part ATTACH PARTITION jdbn.non_part
for values from (1) to (1001);
Now the original table is the first partition of your partitioned table, i.e. no data duplication is performed.
EXPLAIN SELECT * FROM jdbn.part WHERE id <= 500;
QUERY PLAN |
---------------------------------------------------------------+
Seq Scan on non_part part (cost=0.00..18.50 rows=500 width=12)|
Filter: (id <= 500) |
A similar answer, with some hints on automating partition creation, is here.
After trying a few things, the solution was:
INSERT INTO new_table (fields ordered as in the result of the SELECT statement) SELECT * FROM old_table
I don't know if there was an easier way to get the fields in order; I used DBeaver's insert-row dialog to read off the column names.

Duplicate records in a table with identity column as PK

One of our developers inserted a few million rows from a table into a target table.
He inserted in batches using a while loop, and now the target table contains some 5 million duplicate rows. The issue is that the PK is an identity column, and while inserting he didn't do
SET IDENTITY_INSERT DBO.TABLE_NAME ON
So now the table contains duplicate entries with distinct identity column values.
If I group by as shown below:
group by COL2,COL3,COL4,COL5,COL6,COL7
I can get unique rows.
Can someone help me create a script to delete the duplicate records?
Create a #temp table with the final distinct records of your table. After that, delete all the records from your table that have duplicates.
Then you can use this #temp table to re-insert those records into your existing table.
Whether to use SET IDENTITY_INSERT ON/OFF is up to you.
For a hint, search for CHECKIDENT.
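A sketch of those steps in T-SQL (dbo.TABLE_NAME and COL2..COL7 come from the question; this assumes the grouping columns are non-NULL):
-- 1. capture one copy of each duplicated combination
SELECT COL2, COL3, COL4, COL5, COL6, COL7
INTO #temp
FROM dbo.TABLE_NAME
GROUP BY COL2, COL3, COL4, COL5, COL6, COL7
HAVING COUNT(*) > 1;
-- 2. delete every row that belongs to a duplicate group
DELETE t
FROM dbo.TABLE_NAME AS t
JOIN #temp AS d
ON d.COL2 = t.COL2 AND d.COL3 = t.COL3 AND d.COL4 = t.COL4
AND d.COL5 = t.COL5 AND d.COL6 = t.COL6 AND d.COL7 = t.COL7;
-- 3. re-insert one copy per group; the identity column generates fresh
-- values, so SET IDENTITY_INSERT is not needed here
INSERT INTO dbo.TABLE_NAME (COL2, COL3, COL4, COL5, COL6, COL7)
SELECT COL2, COL3, COL4, COL5, COL6, COL7 FROM #temp;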

SQL Server Insert If Not Exists - No Primary Key

I have Table A and Table B.
Table A contains data from another source.
Table B contains data that is inserted from Table A along with data from other tables. I have done the initial insert of data from A to B, but now what I am trying to do is insert, on a daily basis, the records from Table A that do not already exist in Table B. Unfortunately, there is no primary key or unique identifier in Table A, which makes this difficult.
Table A contains a field called file_name which has values that looks like this:
this_is_a_file_name_01011980.txt
There can be duplicate values in this column (multiple files from the same date).
In Table B I created a column data_date which holds the date extracted from the Table A file_name field. There is also a load_date field which just uses GETDATE() at the time the data is inserted.
I am thinking I can somehow compare the dates in these tables to decide what needs to be inserted. For example:
If the file date from Table A (I would need to extract it again, as sketched below) is greater than the load_date in Table B, then insert those records into Table B.
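For illustration only, extracting that date again could look something like this (a sketch assuming the file name always ends in _MMDDYYYY.txt, as in the example above):
SELECT CONVERT(date,
RIGHT(d.mmddyyyy, 4) + LEFT(d.mmddyyyy, 2) + SUBSTRING(d.mmddyyyy, 3, 2),
112) AS file_date
FROM tableA AS a
CROSS APPLY (SELECT SUBSTRING(a.file_name, LEN(a.file_name) - 11, 8) AS mmddyyyy) AS d;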
Let me know if any clarification is needed.
You could use EXISTS or EXCEPT. With the explanation here, it seems like EXCEPT would make short work of this. Something like this:
insert tableB
select * from tableA
except
select * from tableB
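For reference, a minimal sketch of the EXISTS variant (column names are illustrative, since Table A's full column list isn't shown; the comparison has to cover every column because there is no key):
INSERT INTO tableB (file_name, some_col)
SELECT a.file_name, a.some_col
FROM tableA AS a
WHERE NOT EXISTS (
SELECT 1
FROM tableB AS b
WHERE b.file_name = a.file_name
AND b.some_col = a.some_col
);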

Dependency in SQL

In SQL, I have to delete data from table A which is dependent on table B.
The data to be deleted should satisfy two conditions: WorkArea='123' and FileNo='45'.
Table B has WorkArea but does not contain data for FileNo.
And table A contains the record satisfying both conditions.
There isn't any reference key. For more clarity, here is a query:
Select * from A where WorkArea='123' and FileNo='45';
This will return the matching record. But as it is dependent on table B, I cannot delete it directly. Also, deleting it from table B isn't possible, because the data in WorkArea is a whole containing many files, and I have to delete one specific file.
So how can I delete the data from table A?
This is Table A with col1 and col2 as primary key.
This is Table B with col1 as a primary key.
If you have no foreign keys, the following statement will work:
DELETE FROM [A] WHERE [WorkArea] = '123' AND [FileNo] = '45';
Then you can programmatically check whether there are "orphans" in table B with the following query:
SELECT DISTINCT [B].[WorkArea]
FROM [B]
LEFT JOIN [A]
ON [A].[WorkArea] = [B].[WorkArea]
WHERE [A].[WorkArea] IS NULL
To enhance this last part and produce a DELETE statement from it, just store the result of this query in a temporary table and then use it in a WHERE clause with the IN keyword, as sketched below.
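A hedged sketch of that cleanup, collapsing the temporary table into a subquery for brevity:
DELETE FROM [B]
WHERE [B].[WorkArea] IN (
SELECT [B2].[WorkArea]
FROM [B] AS [B2]
LEFT JOIN [A]
ON [A].[WorkArea] = [B2].[WorkArea]
WHERE [A].[WorkArea] IS NULL
);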

Merging SQLite3 tables with identical primary keys

I am trying to merge two tables with financial information about the same list of stocks: the first is the prices table (containing daily, weekly, monthly, etc. price data) and the second is the ratios table (containing valuation and other ratios). Both tables have identical numerical primary key ID columns (referencing the same stock tickers). After creating a connection cursor cur, my code for doing this is:
CREATE TABLE IF NOT EXISTS prices_n_ratios AS SELECT * FROM
(SELECT * FROM prices INNER JOIN ratios ON prices.id = ratios.id);
DROP TABLE prices;
DROP TABLE ratios;
This works fine, except that the new prices_n_ratios table contains an extra column named ID:1 whose name causes problems during further processing.
How do I avoid the creation of this column, maybe by somehow excluding the second table's primary key ID column from * (listing all the column names is not an option)? Or, if I can't, how can I get rid of this extra column from the generated table? I have found it very hard to delete a column in SQLite3.
Just list all the columns you actually want in the SELECT clause, instead of using *.
Alternatively, join with the USING clause, which automatically removes the duplicate column:
SELECT * FROM prices JOIN ratios USING (id)
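Plugged into the original statement, that would look something like this (same table names as in the question):
CREATE TABLE IF NOT EXISTS prices_n_ratios AS
SELECT * FROM prices JOIN ratios USING (id);
DROP TABLE prices;
DROP TABLE ratios;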
