Remove duplicates from a SQL server rows using DISTINCT - sql-server

I need to remove SQL server duplicated rows when importing file into database with distinct method.
HallGroup is my table in database. I'm using this
Sql procedure:
SELECT DISTINCT * INTO tempdb.dbo.tmpTable
FROM HallGroup
DELETE FROM HallGroup
INSERT INTO HallGroup SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
With this procedure works fine duplicated rows are deleted, but the problem is when i try to import again data to SQL server rows are still duplicating. What i'm missing, So any hint?
How to remove SQL server duplicated rows properly when importing file into database with distinct method?

I am just getting back into SQL after being out for a bit but I would not have solved your problem in that way that you are trying (not that I completely understand why you are doing it that way) as I believe (even if it were working correctly) over time your process will take longer each time you do it as the size of the table increases.
It would be much more efficient if you inserted the new data based on the absence of a key (you indicate you are already using a stored proc). If you don't have a key to use (which very recently happened to me), make one. I just solved a similar problem to yours whereas I am importing data into a table from an external source and wanted to eliminate the possibility of duplicates. In my case, I associate name of the external source datafile (is distinct by dataset to import) with the data to be imported and use that to ensure I am not re-importing already imported data. I load the external data into a table using a dtsx and then run a stored proc to merge that data with an existing table. This gives me the added advantage of having a audit trail of where each record came from.
Hope this helps.

Related

How to run SSIS packages dynamically?

We have a large production MSSQL database (mdf appx. 400gb) and i have a test database. All the tables,indexes,views etc. are same eachother. I need to make sure that tha datas in the tables of this two database consistent. so i need to insert all the new rows and update all the updated rows into test db from production every night.
I came up with idea of using SSIS packages to make the data consistent by checking updated rows and new rows in all the tables. My SSIS Flow is ;
I have packages in SSIS for each tables seperately because;
Orderly;
Im getting the timestamp value in the table in order to get last 1 day rows instead of getting whole table.
I get the rows of the table in the production
Then im using 'Lookup' tool to compare this data with the test database table data.
Then im using conditional sprit to get a clue whether the data is new or updated.
If the data is new, i insert this data to the destination
5_2. If the data is updated, then i update the data in the destination table.
Data flow is in the MTRule and STBranch package in the picture
The problem is, im repeating creating all this single flow for each table and i have more than 300 table like this. It takes hours and hours :(
What im asking is;
Is there any way in SSIS to do this dynamically ?
PS: Every single table has its own columns and PK values but my data flow schema is always same. . (Below)
You can look into BiMLScript, which lets you create packages dynamically based on metadata.
I believe the best way to achieve this is to use Expressions. They empower you to dynamically set the source and Destination.
One possible solution might be as follows:
create a table which stores all your table names and PK columns
define a package which Loops through this table and which parses a SQL Statement
Call your main package and pass the stmt to it
Use the stmt as Data Source for your Data Flow
if applicable, pass the Destination Table as Parameter as well (another column in your config table)
This is how I processed several really huge tables: the data had to be fetched from 20 tables and moved to one single table.
You are better off writing a stored procedure that takes the tablename as parameter and doing your CRUD there.
Then call the stored procedure in a FOR EACH component in SSIS.
Why do you need to use SSIS?
You are better off writing a stored procedure that takes the tablename as parameter and doing your CRUD there. Then call the stored procedure in a FOR EACH component in SSIS.
In fact you might be able to do everything using a Stored Procedure and scheduling it in a SQL Agent Job.

Load data from multiple source into a destination

I have a desktop application through which data is entered and it is being captured in MS Access DB. The application is being used by multiple users(at different locations). The idea is to download data entered for that particular day into an excel sheet and load it into a centralized server, which is an MSSQL server instance.
i.e. data(in the form of excel sheets) will come from multiple locations and saved into a shared folder in the server, which need to be loaded into SQL Server.
There is a ID column with IDENTITY in the MSSQL server table, which is the primary key column and there are no other columns in the table which contains unique value. Though the data is coming from multiple sources, we need to maintain single auto-updating series(IDENTITY).
Suppose, if there are 2 sources,
Source1: Has 100 records entered for the day.
Source2: Has 200 records entered for the day.
When they get loaded into Destination(SQL Server), table should have 300 records, with ID column values from 1 to 300.
Also, for the next day, when the data comes from the sources, Destination has to load data from 301 ID column.
The issue is, there may be some requests to change the data at Source, which is already loaded in central server. So how to update the data for that row in the central server as the ID column value will not be same in Source and Destination. As mentioned earlier ID is the only unique value column in the table.
Please suggest some ides to do this or I've to take up different approach to accomplish this task.
Thanks in advance!
Krishna
Okay so first I would suggest .NET and doing it through a File Stream Reader, dumping it to the disconnected layer of ADO.NET in a DataSet with multiple DataTables from the different sources. But... you mentioned SSIS so I will go that route.
Create an SSIS project in Business Intelligence Development Studio(BIDS).
If you know for a fact you are just doing a bunch of importing of Excel files I would just create many 'Data Flow Task's or many Source to Destination tasks in a single 'Data Flow Task' up to you.
a. Personally I would create tables in a database for each location of an excel file and have their columns map up. I will explain why later.
b. In a data flow task, select 'Excel Source' as the source file. Put in the appropriate location of 'new connection' by double clicking the Excel Source
c. Choose an ADO Net Destination, drag the blue line from the Excel Source to this endpoint.
d. Map your destination to be the table you map to from SQL.
e. Repeat as needed for each Excel destination
Set up the SSIS task to automate from SQL Server through SQL Management Studio. Remember you to connect to an integration instance, not a database instance.
Okay now you have a bunch of tables right instead of one big one? I did that for a reason as these should be entry points and the logic to determinate dupes and import time I would leave to another table.
I would set up another two tables for the combination of logic and for auditing later.
a. Create a table like 'Imports' or similar, have the columns be the same except add three more columns to it: 'ExcelFileLocation', 'DateImported'. Create an 'identity' column as the first column and have it seed on the default of (1,1), assign it the primary key.
b. Create a second table like 'ImportDupes' or similar, repeat the process above for the columns.
c. Create a unique constraint on the first table of either a value or set of values that make the import unique.
c. Write a 'procedure' in SQL to do inserts from the MANY tables that match up to the excel files to insert into the ONE 'Imports' location. In the many inserts do a process similar to:
Begin try
Insert into Imports (datacol1, datacol2, ExcelFileLocation, DateImported) values
Select datacol1, datacol2, (location of file), getdate()
From TableExcel1
End try
-- if logic breaks unique constraint put it into second table
Begin Catch
Insert into ImportDupes (datacol1, datacol2, ExcelFileLocation, DateImported) values
Select datacol1, datacol2, (location of file), getdate()
From TableExcel1
End Catch
-- repeat above for EACH excel table
-- clean up the individual staging tables for the next import cycle for EACH excel table
truncate TableExcel1
d. Automate the procedure to go off
You now have two tables, one for successful imports and one for duplicates.
The reason I did what I did is two fold:
You need to know more detail than just the detail a lot of times like when it came in, from what source it came from, was it a duplicate, if you do this for millions of rows can it be indexed easily?
This model is easier to take apart and automate. It may be more work to set up but if a piece breaks you can see where and easily stop the import for one location by turning off the code in a section.

Export large amounts of binary data from one SQL database and import it into another database of the same schema

I have one database with an image table that contains just over 37,000 records. Each record contains an image in the form of binary data. I need to get all of those 37,000 records into another database containing the same table and schema that has about 12,500 records. I need to insert these images into the database with an IF NOT EXISTS approach to make sure that there are no duplicates when I am done.
I tried exporting the data into excel and format it into a script. (I have doe this before with other tables.) The thing is, excel does not support binary data.
I also tried the "generate scripts" wizard in SSMS which did not work because the .sql file was well over 18GB and my PC could not handle it.
Is there some other SQL tool to be able to do this? I have Googled for hours but to no avail. Thanks for your help!
I have used SQL Workbench/J for this.
You can either use WbExport and WbImport through text files (the binary data will be written as separate files and the text file contains the filename).
Or you can use WbCopy to copy the data directly without intermediate files.
To achieve your "if not exists" approache you could use the update/insert mode, although that would change existing row.
I don't think there is a "insert only if it does not exist mode", but you should be able to achieve this by defining a unique index and ignore errors (although that wouldn't be really fast, but should be OK for that small number of rows).
If the "exists" check is more complicated, you could copy the data into a staging table in the target database, and then use SQL to merge that into the real table.
Why don't you try the 'Export data' feature? This should work.
Right click on the source database, select 'Tasks' and then 'Export data'. Then follow the instructions. You can also save the settings and execute the task on a regular basis.
Also, the bcp.exe utility could work to read data from one database and insert into another.
However, I would recommend using the first method.
Update: In order to avoid duplicates you have to be able to compare images. Unfortunately, you cannot compare images directly. But you could cast them to varbinary(max) for comparison.
So here's my advice:
1. Copy the table to the new database under the name tmp_images
2. use the merge command to insert new images only.
INSERT INTO DB1.dbo.table_name
SELECT * FROM DB2.dbo.table_name
WHERE column_name NOT IN
(
SELECT column_name FROM DB1.dbo.table_name
)

Copying tables between databased with different authentication DB2

Hey StackOverflow community,
My question is as follows:
I have a table, say USER_ADDR with a bunch of columns in one database, say DB001
I need to copy the contents of this table(based on a criteria) to a similar table USER_ADDR (same name, yes) in another database DB002 with a different userID and pwd.
I need to do this in a stored procedure that will be executed using a .net framework.
I tried this:
INSERT INTO "DB002".USER_ADDR (--column names--)
SELECT *
FROM "DB001".USER_ADDR
WHERE ID = "APPLICATION_NO_IN";
I get:
0: Error occurred: [IBM][DB2/NT64] SQL0204N "DB002.USER_ADDR" is an undefined name. LINE NUMBER=15. SQLSTATE=42704 : -204: IBM.Data.DB2: 42704
What am I doing wrong?
Thanks in advance
Vashist
i'm deleting my other answer after seeing the additional info about your use case. Load is mainly for bulk loads of large numbers of records.
in this case i'd recommend you do something like open connection1 in .Net to your data source, select the data and hold it in a .Net DataTable. If required, you can do that select in a stored proc that returns either individual column values for a single row or return a cursor (rowset) that contains all the columns (and rows). Then in .Net open connection2 and insert the data from the DataTable to your destination. Again, that can be done with a stored proc.
Another approach is using an external script that connects to both databases.
From just one database is not possible, at least you use, as already mentioned, Information integration (federation) or by exporting the data and then loading it.

How do you get an SSIS package to only insert new records when copying data between servers

I am copying some user data from one SqlServer to another. Call them Alpha and Beta. The SSIS package runs on Beta and it gets the rows on Alpha that meet a certain condition. The package then adds the rows to Beta's table. Pretty simple and that works great.
The problem is that I only want to add new rows into Beta. Normally I would just do something simple like....
INSERT INTO BetaPeople
SELECT * From AlphaPeople
where ID NOT IN (SELECT ID FROM BetaPeople)
But this doesn't work in an SSIS package. At least I don't know how and that is the point of this question. How would one go about doing this across servers?
Your example seems simple, looks like you are adding only new people, not looking for changed data in existing records. In this case, store the last ID in the DB.
CREATE TABLE dbo.LAST (RW int, LastID Int)
go
INSERT INTO dbo.LAST (RW, LastID) VALUES (1,0)
Now you can use this to insert the last ID of the row transferred.
UPDATE dbo.LAST SET LastID = #myLastID WHERE RW = 1
When selecting OLEDB source, set data access mode to SQL Command and use
DECLARE #Last int
SET #Last = (SELECT LastID FROM dbo.LAST WHERE RW = 1)
SELECT * FROM AlphaPeople WHERE ID > #Last;
Note, I do assume that you are using ID int IDENTITY for your PK.
If you have to monitor for data changes of existing records, then have the "last changed" column in every table, and store time of the last transfer.
A different technique would involve setting-up a linked server on Beta to Alpha and running your example without using SSIS. I would expect this to be way slower and more resource intensive than the SSIS solution.
INSERT INTO dbo.BetaPeople
SELECT * FROM [Alpha].[myDB].[dbo].[AlphaPeople]
WHERE ID NOT IN (SELECT ID FROM dbo.BetaPeople)
Add a lookup between your source and destination.
Right click the lookup box to open Lookup Transformation Editor.
Choose [Redirect rows to no match output].
Open columns, map your key columns.
Add an entry with the table key in lookup column , lookup operation as
Connect lookup box to destination, choose [Lookup no Match Output]
Simplest method I have used is as follows:
Query Alpha in a Source task in a Dataflow and bring in records to the data flow.
Perform any needed Transformations.
Before writing to the Destination (Beta) perform a lookup matching the ID column from Alpha to those in Beta. On the first page of the Lookup Transformation editor, make sure you select "Redirect rows to no match output" from the dropdown list "Specify how to handle rows with now matching error"
Link the Lookup task to the Destination. This will give you a prompt where you can specify that it is the unmatched rows that you want to insert.
This is the classical Delta detection issue. The best solution is to use Change Data Capture with/without SSIS. If what you are looking for is a once in a life time activity, no need to go for SSIS. Use other means such as linked server and compare with existing records.
The following should solve issue of loading Changed and New records using SSIS:
Extract Data from Source usint Data flow.
Extract Data from Target.
Match on Primary key Add Unmatch records and split matched and unmatched records from Source and Matched records from Target call them Matched_Source,
Unmatch_Source and Matched_Target.
Compare Matched_Source and Matched_Target and Split Matched_Source to Changed and Unchanged.
Null load TempChanged Table.
Add Changed Records to TempChanged.
Execute SQL script/stored proc to Delete Records from Target for primary key in TempChanged and add records in TempChanged to Target.
Add Unmatched_Source to Target.
Another solution would be to use a temporary table.
In the properties for Beta's connection manager, change RetainSameConnection to true (by default SSIS runs each query in it's own connection, this would mean the temporary table would be killed as soon as it has been created).
Create a SQL Task using Beta's connection and use the following SQL to create your temporary table:
SELECT TOP 0 *
INTO ##beta_temp
FROM Beta
Next create a data flow that pulls data from Alpha and loads into ##beta_temp (you will need to run the SQL statement above on SSMS first so that Visual Studio can see the table at design time and you will also need to set the DelayValidation property to true on the Data Flow task).
Now you have two tables on the same server and you can just use your example SQL modified to use the temporary table.
INSERT INTO Beta
SELECT * FROM ##beta_temp
WHERE ID NOT IN (SELECT ID FROM Beta)

Resources