How to 'thin' a database table out? - sql-server

I have a large DB table which I am using for testing. It contains 7.3m phone call records. I would like to delete many of those, but still maintain a good spread over phone numbers and dates. Is there any way of achieving this? Maybe something to do with TABLESAMPLE?

Delete where the id finishes in a 1 or 6? Or similar, depending on exactly how many you need to remove.
i.e. to keep just 10% of the records for testing, delete all the records whose id doesn't end in (say) 7.
(Note that a delete like this could take a while. You might be better off doing a SELECT ... INTO a new table (SQL Server's equivalent of CREATE TABLE AS) with just the records you need.)
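For example, a minimal sketch of that approach, assuming an integer Id column (dbo.CallRecords and dbo.CallRecords_Thin are placeholder names, not from the question):

SELECT *
INTO dbo.CallRecords_Thin
FROM dbo.CallRecords
WHERE Id % 10 = 7;   -- keeps roughly 10% of rows, spread evenly across ids

If the ids are not evenly distributed, TABLESAMPLE is an alternative, though it samples whole pages rather than individual rows, so the spread is less even:

SELECT *
INTO dbo.CallRecords_Sample
FROM dbo.CallRecords TABLESAMPLE (10 PERCENT);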

Copy the data you want to keep:
SELECT TOP 1000 * INTO dbo.Buffer
FROM Data.Numbers
ORDER BY NewID()
Delete all data:
TRUNCATE TABLE Data.Numbers
Move back the kept data:
INSERT INTO Data.Numbers (column list) SELECT (column list) FROM dbo.Buffer

Related

INSERT after new id populated in base table

I have two pieces to this question; the second one is a bonus for me if you can answer it.
Question #1
I have two SQLite tables named tbl and tbl_search,
where tbl is a normal table and tbl_search is an FTS4 table.
Let's say we are at run#0 of my app: the app runs, starts filling tbl with data, and at the end copies everything from tbl to tbl_search, and the job is done. But this comes with a performance cost for incremental runs. For incremental runs (run#1), so far I've been deleting and re-creating the FTS table, because a plain copy would dump the run#0 items as well, creating duplicate data, and this was costing a lot in performance.
After giving it some thought, I found 4 methods to tackle this situation:
1) Create a SQL trigger for the tbl to tbl_search copy.
2) Create a temp table in every run, copy its data to tbl_search to avoid duplication, and then delete it.
3) Use the unique IDs in the ID column of tbl, find the missing ones, and replicate only those to tbl_search.
4) Create another table which holds the last-run details and use it to copy everything after that date. Let's say the last run finished on 04-05-2018 9:40 PM; then replicate everything after that time from tbl to tbl_search.
I found option #3 to be the most fruitful in terms of performance and would like to go for it, so how can I do that? Below is an example tbl for reference.
RUN#0
ID FILENAME LABEL_NUMBER
----------------------------------------------------------------------
1 C:/Test_Software/6.avi 11
2 C:/Test_Software/6.avi 10
3 C:/Test_Software/6.avi 8
4 C:/Test_Software/6.avi 6
5 C:/26.avi 10
6 C:/26.avi 8
RUN#1 (incremental)
ID FILENAME LABEL_NUMBER
----------------------------------------------------------------------
7 C:/Test_Software/36.avi 51
8 C:/Test_Software/556.avi 30
I would like to run a query like:
Select ID from tbl, If (ID not present in tbl_search) INSERT into tbl_search
Bonus Question: Out of all the methods that I shared, which one is the fastest and best method that I should choose? Please share the WHY and the how-to as well; I'll be grateful.
I'm going to answer this myself, as I've done enough testing now:
Method #1 : SQL Triggers
Great method, if you don't have loops to take care of (nested loops).
Painful method if you have a trillion loops going on before anything gets written into the tables; hence, disqualified for me.
Method #2 : Create a Temp Table
While this is a workable method, it won't give you any performance bump; rather, you end up duplicating the copy work. Thus, not advisable.
Method #3 : Use ID
Best method, because the IDs are created as a unique key in the table: you can simply query the last ID, store it somewhere, and then copy over only the rows above the last known ID. Got the best performance benefit here (a sketch is below).
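A minimal SQLite sketch of that, assuming tbl_search was created with FILENAME and LABEL_NUMBER columns and that its docid is kept equal to tbl's ID (both are my assumptions, not from the original post):

INSERT INTO tbl_search (docid, FILENAME, LABEL_NUMBER)
SELECT ID, FILENAME, LABEL_NUMBER
FROM tbl
WHERE ID > (SELECT IFNULL(MAX(docid), 0) FROM tbl_search);

Because docid is just the FTS table's rowid, MAX(docid) is cheap to look up, so only the rows added since the last copy get inserted.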
Method #4 : Create another table - Aux Table
Naah, a waste of effort and resources. Not recommended either.
Hope this helps fellow SO folks.
Thanks

Generate insert statements with foreign key constraints

I have an issue where an entire table's worth of data was deleted. It is a child table, and contains its own primary key, a foreign key to its parent and some other data.
I tried using Merge, generated from a stored procedure I found here:
https://github.com/readyroll/generate-sql-merge
This generates a giant MERGE statement for your whole table. That worked OK for a while, but I've since found that records from the parent table have also been deleted, and MERGE doesn't handle this too well.
I've tried rewriting it, but I'm getting bogged down in it and it feels like something somebody else will have done before.
What I'd really like is a way to generate 1000's of insert statements, each with an IF EXISTS check above it, along the lines of:
IF NOT EXISTS (select PK from ChildTable where ID = <about to be inserted>) AND EXISTS (select FK from ParentTable where ID = <about to be inserted>)
INSERT RECORD
OUTPUT PK TO LOG TABLE
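For reference, one generated statement in that style might look like this sketch (all table names, column names and values here are placeholders, not from the real schema):

IF NOT EXISTS (SELECT 1 FROM dbo.ChildTable WHERE Id = 12345)
   AND EXISTS (SELECT 1 FROM dbo.ParentTable WHERE Id = 678)
BEGIN
    INSERT INTO dbo.ChildTable (Id, ParentId, SomeData)
    OUTPUT inserted.Id INTO dbo.RestoreLog (Id)
    VALUES (12345, 678, 'recovered value');
END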
There are about 20,000 records, so it's really something I don't want to have to do by hand, and because the delete event happened several times over a few months, I need to generate the data from several different databases to recreate the whole picture.
I'd like to keep the inserted IDs in a log table, so I can tell what's been inserted, and so the data could be restored to its pre-script state for any reason.
Any advice on my approach would also be welcome.
Thanks :)
So, long story short: I tried a few ways to fix this, and the best was to generate INSERT statements for the table using the Generate Scripts tool, and bring them into a temp table.
Because I only wanted to import 90% of the data, and exclude specific records based on a few conditions, I originally thought I should wrap each INSERT with an IF, but 20,000 IFs broke down when SQL Server tried to create a query plan.
Instead, I inserted all the records without a filter into a temp table. I then deleted all the records I didn't want from this table with a couple of DELETE statements.
Lastly, for all remaining data in the temp table, I inserted it into the actual proper table the data was originally missing from.
This worked perfectly, and SQL Manager was able to run without crashing. It was also a lot clearer what I was doing, and I didn't have to add lots of complicated IFs in a string builder.
I also used OUTPUT on the INSERT INTO to log all the record IDs I'd inserted, to give the audit trail I needed.
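A minimal sketch of that last step, with hypothetical names (dbo.ChildTable_Staging for the temp table the scripts were loaded into, dbo.RestoreLog for the audit table):

INSERT INTO dbo.ChildTable (Id, ParentId, SomeData)
OUTPUT inserted.Id INTO dbo.RestoreLog (Id)
SELECT s.Id, s.ParentId, s.SomeData
FROM dbo.ChildTable_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.ChildTable AS c WHERE c.Id = s.Id)
  AND EXISTS (SELECT 1 FROM dbo.ParentTable AS p WHERE p.Id = s.ParentId);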
As for the string builder: the query was so long and complicated, and Excel was moaning about a 256-character limit, that I ended up using 20+ columns to build up my query after concatenating 26 columns of data. When I used the auto-fill formula drag feature in Excel, it would crash my machine going across so many records, and I have a pretty grunty machine!
Using Generate Scripts rather than the string builder also had the benefit of not altering the data in any way. It was purely like for like, so weird characters, new lines etc. were no issue.

sql server to dump big table into other table

I'm currently changing the Id field of table to be an IDENTITY field. This is simple: Create a temp-table, copy all the data to the temp-table, adjust all the references from and to the table to point from and to the new temp-table, drop the old table, rename the temp-table to the original name.
Now I've got the problem that the copy step is taking too long. Actually the table doesn't have too many entries (~7.5 million rows), but it still takes multiple hours to do this.
I'm currently moving the data with a query like this:
SET IDENTITY_INSERT MyTable_Temp ON
INSERT INTO MyTable_Temp ([Fields]) SELECT [Fields] FROM MyTable
SET IDENTITY_INSERT MyTable_Temp OFF
I've had a look at bcp in combination with cmdshell and a following BULK INSERT, but I don't like the solution of first writing the data to a temp-file and afterwards dumping it back into the new table.
Is there a more efficient way to copy or move the data from the old to the new table? And can this be done in "pure" T-SQL?
Keep in mind, the data is correct (no external sources involved) and no changes are being made to the data during transfer.
Your approach seems fair, but the single transaction generated by the INSERT is very large, and that is why it takes so long.
My approach when dealing with this in the past was to use a cursor and a batching mechanism.
Perform the operation on only 100,000 rows at a time, and you will see major improvements.
After the copy is made you can rebuild your references and eventually remove the old table, and so on. Be careful to reseed your new table accordingly after the data is copied. A batching sketch is below.
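A minimal sketch, assuming MyTable has an integer Id key and keeping the question's [Fields] placeholder for the real column list (the batch size is arbitrary):

SET IDENTITY_INSERT MyTable_Temp ON;

DECLARE @BatchSize int = 100000;
DECLARE @LastId int = 0;
DECLARE @MaxId int;
SELECT @MaxId = MAX(Id) FROM MyTable;

WHILE @LastId < @MaxId
BEGIN
    -- Each batch commits on its own, so the transaction log stays small.
    INSERT INTO MyTable_Temp ([Fields])
    SELECT [Fields]
    FROM MyTable
    WHERE Id > @LastId AND Id <= @LastId + @BatchSize;

    SET @LastId = @LastId + @BatchSize;
END

SET IDENTITY_INSERT MyTable_Temp OFF;

-- Reseed so the next generated identity value continues after the copied data:
DBCC CHECKIDENT ('MyTable_Temp', RESEED);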

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's job. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application layer there. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the DB? Save it to a file on the application server instead.
You could either update the column in the table as you suggested, but if it was me I'd log the event to another table, i.e. id of the record, datetime, userid (maybe ip address etc, browser version etc), just about anything else I could capture and that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage pattern is etc).
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll it up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never wish you hadn't kept it, and it would be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action, which wouldn't slow down the user's experience at all.
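A minimal sketch of the stored-procedure option, with hypothetical object names (dbo.MyTable for the table being read, dbo.ReadLog for the log):

CREATE PROCEDURE dbo.GetMyTableRow
    @Id int,
    @UserId int
AS
BEGIN
    SET NOCOUNT ON;

    -- Log the read first; either way it is a single round trip for the client.
    INSERT INTO dbo.ReadLog (RowId, UserId, ReadAt)
    VALUES (@Id, @UserId, GETDATE());

    SELECT *
    FROM dbo.MyTable
    WHERE Id = @Id;
END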
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is a major concern in database server administration.
Instead you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for SQL Server Audit or "database auditing for SELECT statements".
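For what it's worth, a minimal sketch of a SQL Server Audit that captures SELECTs on one table; the object and file names are placeholders, and database-level audit specifications require an edition/version that supports them, which may rule out a very old server:

CREATE SERVER AUDIT ReadAudit
    TO FILE (FILEPATH = 'C:\AuditLogs\');
ALTER SERVER AUDIT ReadAudit WITH (STATE = ON);

CREATE DATABASE AUDIT SPECIFICATION ReadAuditSpec
    FOR SERVER AUDIT ReadAudit
    ADD (SELECT ON dbo.MyTable BY public)
    WITH (STATE = ON);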
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query in the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY part is off the top of my head and is MySQL syntax. For conditionally inserting or updating in MSSQL you would need to use MERGE instead (a sketch follows).
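A minimal MERGE sketch for SQL Server, reusing the hypothetical countTable/myTable names from the example above:

MERGE dbo.countTable AS target
USING (SELECT id FROM dbo.myTable WHERE stuff1 = 'somevalue') AS src
    ON target.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET times = target.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);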

Efficient DELETE TOP?

Is it more efficient, and ultimately FASTER, to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approx 3 million rows from many tables. I first did the deletes in blocks of 100K rows, but the performance wasn't looking good. I changed to 10000 and seem to be removing them faster. Wondering if something even smaller, like 1K per DELETE statement, is better still.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server though. I mean, last time I did that I was using this approach to delete things in 64 million-row increments (on a table that at that point had around 14 billion rows, 80% of which ultimately got deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on the transaction log for other things operating on the table. You have to try it out and find where you are comfortable - there is no ultimate answer because it is totally dependent on the usage of the table and the hardware. (A simple batching loop is sketched below.)
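A minimal batching loop, where dbo.MyTable stands in for the real table and the batch size is something to tune:

WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.MyTable
    WHERE Date < '20120101';   -- unambiguous yyyymmdd date literal

    IF @@ROWCOUNT = 0 BREAK;   -- stop once nothing is left to delete
END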
We used Table Partitioning to remove 5 million rows in less than a sec but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value; the new data cannot be intertwined within the currently partitioned data.
If removing data, you have to remove an entire partition (or partitions). You will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.
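A minimal sketch of the switch-out step for deletes, with hypothetical names (dbo.Calls partitioned on a date column, dbo.Calls_Staging an empty table with identical structure and indexes on the same filegroup, and partition 2 holding the rows to remove):

-- Switching a partition out is a metadata-only operation, so it is near-instant:
ALTER TABLE dbo.Calls SWITCH PARTITION 2 TO dbo.Calls_Staging;

-- The unwanted rows now live in the staging table and can be dropped cheaply:
TRUNCATE TABLE dbo.Calls_Staging;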
Yes and no; it depends on the usage of the table, because of locking. I would try to delete the records at a slower pace - so, the opposite of what the OP is asking.
set rowcount 10000

while 1 = 1
begin
    waitfor delay '0:0:1'  -- throttle: pause a second between batches

    delete
    from table
    where date < convert(datetime, '20120101', 112)

    if @@rowcount = 0 break  -- nothing left to delete
end

set rowcount 0
