I have an ETL process (CSV to SQL database) that runs daily, but the data in the source sometimes changes, so I want to have it run again the next day with an updated file.
How do I write a SQL statement to find all the differences?
For example, let's say Table_1 has a composite PRIMARY KEY consisting of FK_1, FK_2 and FK_3.
Do I do this in SQL or in the ETL process?
Thanks.
Edit
I realize now this question is too broad. Disregard.
You can use EXCEPT to find which IDs are missing. For example:
SELECT FK_1, FK_2, FK_3
FROM new_data_table
EXCEPT
SELECT FK_1, FK_2, FK_3
FROM current_data_table;
It will be better, from a performance perspective, to materialize these IDs and then join that new table to new_data_table in order to insert all of the columns.
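For example, a minimal sketch of that materialize-then-join approach (the temp table name #missing_ids is purely illustrative):
SELECT FK_1, FK_2, FK_3
INTO #missing_ids
FROM (
    SELECT FK_1, FK_2, FK_3 FROM new_data_table
    EXCEPT
    SELECT FK_1, FK_2, FK_3 FROM current_data_table
) AS d;
INSERT INTO current_data_table
SELECT n.*
FROM new_data_table AS n
INNER JOIN #missing_ids AS m
    ON  n.FK_1 = m.FK_1
    AND n.FK_2 = m.FK_2
    AND n.FK_3 = m.FK_3;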
If you need to do this in one query, you can use a simple LEFT JOIN. For example:
INSERT INTO current_data_table
SELECT A.*
FROM new_data_table A
LEFT JOIN current_data_table B
ON A.FK_1 = B.FK_1
AND A.FK_2 = B.FK_2
AND A.FK_3 = B.FK_3
WHERE B.FK_1 IS NULL;
The idea is to get all records in new_data_table for which there is no match in current_data_table (WHERE B.FK_1 IS NULL).
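If "all the differences" should also cover rows that changed or disappeared, a hedged variant is to run EXCEPT over every column in both directions (this assumes both tables share the same column list):
-- New or changed in the latest file:
SELECT * FROM new_data_table
EXCEPT
SELECT * FROM current_data_table;
-- Missing from (or changed in) the latest file:
SELECT * FROM current_data_table
EXCEPT
SELECT * FROM new_data_table;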
I'm new to SQL Server. The scenario is the following:
I have a CSV with a bunch of Serial No values, which are unique.
Example:
Serial No
-----------
01561
21654
156416
89489
I also have a SQL Server database table in which several rows can be identified by the serial no. For example, I have 6 rows in the SQL Server table with the serial no. 01561. Now I want to update a field in all of these rows to "Yes". If it were only about this one number, I know the solution:
UPDATE dbo.Table1
SET DeleteFlag = 'Yes'
WHERE Serial_No = '01561';
Unfortunately, I have more than 10,000 serial numbers in the CSV for which I have to do that. Can you help me find a solution?
First, you should use the Import Data task to load the CSV: right-click the database and select Tasks > Import Data. It's a UI which is pretty self-explanatory, so it'll help you get the job done quickly and easily. Make note of the name you give the table; SQL Server will try to give it a default name with a "$" in it. Change that to something like "MyTableImport". If the data is already in SQL Server, go to the next step.
Step 2 - You can do the UPDATE for the entire table via a join. All you're doing is matching the IDs to another table, right? Looping would be a bad idea here, especially since it'll take a while to loop through 10,000+ rows and run an update FOR EACH ONE. That's against an idea known as the "set-based approach", which, to sum it up nicely, means doing things all at once (google it, though, because I'm horribly oversimplifying the idea for you). Here is a sample update-join query for you:
UPDATE x
SET x.DeleteFlag='Yes'
FROM yourimportable y
INNER JOIN yourLocal x ON y.SerialNo=x.SerialNo
Assuming that you have created a temp table and loaded the CSV's serial numbers into it, you can now update your permanent table from the temp table data using an update join like this:
UPDATE t1
SET t1.DeleteFlag = 'Yes'
FROM dbo.Table1 AS t1
INNER JOIN #TempTable2 AS t2
ON t1.Serial_No = t2.Serial_No
For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be too large to fit into a DataTable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the DataTable. So I take the GUIDs I want to check (they come in chunks of 1000) and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN can make full use of an index on [Group], if it can use one at all. However, if you had a second table containing the GUID values, and furthermore if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
    Guid varchar(255)
);
INSERT INTO #Guids (Guid)
VALUES
    (@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...;
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.
I'm using Azure SQL Database and SQL Server Management Studio, and I'm wondering if it's possible to create a self-referencing table that maintains itself.
I have three tables: Race, Runner, Names. The Race table includes the following columns:
Race_ID (PK)
Race_Date
Race_Distance
Number_of_Runners
The second table is Runner. Runner contains the following columns:
Runner_Id (PK)
Race_ID (Foreign Key)
Name_ID
Finish_Position
Prior_Race_ID
The Names Table includes the following columns:
Full Name
Name_ID
The column of interest is Prior_Race_ID in the Runner table. I'd like to automatically populate this field via a trigger or stored procedure, but I'm not sure if it's possible to do so or how to go about it. The goal would be to be able to get all of a runner's races quickly and easily by traversing the Prior_Race_ID field.
Can anyone point me to a good resource or reference that explains if and how this is achievable? Also, if there is a preferred approach to achieving my objective, please do share it.
Thanks for your input.
Okay, so we want, for each Competitor (better name than Names?), to find their two most recent races. You'd write a query like this:
SELECT
    * --TODO - Specific columns
FROM
    (SELECT
        *, --TODO - Specific columns
        ROW_NUMBER() OVER (PARTITION BY n.Name_ID ORDER BY r.Race_Date DESC) AS rn
     FROM
        Names n
        INNER JOIN Runner rs ON n.Name_ID = rs.Name_ID
        INNER JOIN Race r ON rs.Race_ID = r.Race_ID
    ) t
WHERE
    t.rn IN (1, 2)
That should produce two rows per competitor. If needed, you can then PIVOT this data if you want a single row per competitor, but I'd usually leave that up to the presentation layer, rather than do it in SQL.
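If you do want the single-row shape in SQL, a hedged sketch of the PIVOT is below; it reuses the ranking query above as a CTE, and the output names Latest_Race_ID / Prior_Race_ID are just labels I made up:
WITH LastTwo AS (
    SELECT n.Name_ID, r.Race_ID,
           ROW_NUMBER() OVER (PARTITION BY n.Name_ID ORDER BY r.Race_Date DESC) AS rn
    FROM Names n
    INNER JOIN Runner rs ON n.Name_ID = rs.Name_ID
    INNER JOIN Race r ON rs.Race_ID = r.Race_ID
)
SELECT Name_ID, [1] AS Latest_Race_ID, [2] AS Prior_Race_ID
FROM (SELECT Name_ID, Race_ID, rn FROM LastTwo WHERE rn IN (1, 2)) src
PIVOT (MAX(Race_ID) FOR rn IN ([1], [2])) p;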
And so, no, I wouldn't even have a Prior_Race_ID column. As a general rule, don't store data that can be calculated - that just introduces opportunities for that data to be incorrect compared to the base data.
Alternatively, you can populate the column directly with an UPDATE; TOP (1) with an ORDER BY picks the most recent earlier race and guards against a runner having more than one race on the same day:
UPDATE r1
SET Prior_Race_ID =
    (SELECT TOP (1) pr.Race_ID
     FROM Runner p
     INNER JOIN Race pr ON p.Race_ID = pr.Race_ID
     WHERE p.Name_ID = r1.Name_ID AND pr.Race_Date < cr.Race_Date
     ORDER BY pr.Race_Date DESC)
FROM Runner r1
INNER JOIN Race cr ON r1.Race_ID = cr.Race_ID;
I am very new to SQL and SQL server, would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' data for each, the table can get very big. So I have created a separate table called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by integer is faster than string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
Look for the Share_code_ID corresponding to 'ZZZ' in table 2
If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table1 as the target table and Table2 as the source table.
You want to take an action when a given ID is not found in Table2 - in the source table.
In the documentation that you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use a trigger for auto-insertion when you insert something into Table1, but it would not be able to insert the proper Share_name - the trigger just won't know it.
So you have two options, I guess:
1) Write a T-SQL code block - look into stored procedures; a rough sketch follows below. I think there is also a construct for executing an anonymous code block in MS SQL Server, like the EXECUTE BLOCK command in Firebird, but I don't know that for sure.
2) Create an updatable SQL VIEW joining Table1 and Table2 to show the most current data, so that when you insert a row into this view, the view's INSTEAD OF INSERT trigger actually inserts rows into both tables, and when you update data through the view, the INSTEAD OF UPDATE trigger modifies the data.
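For what it's worth, here is a minimal sketch of option 1 as a T-SQL block. The names Table1 and Table2, the assumption that Table2.Share_code_ID is an IDENTITY column, and the @variables are all assumptions based on the question, not a definitive implementation:
DECLARE @ShareName varchar(20) = 'ZZZ',
        @PriceDate date        = '20130523',
        @Price     money       = 1000,
        @ShareId   int;
-- 1. Look up (or create) the Share_code_ID in Table2.
SELECT @ShareId = Share_code_ID FROM Table2 WHERE Share_name = @ShareName;
IF @ShareId IS NULL
BEGIN
    INSERT INTO Table2 (Share_name) VALUES (@ShareName); -- assumes Share_code_ID is IDENTITY
    SET @ShareId = SCOPE_IDENTITY();
END;
-- 2. Upsert the price row in Table1.
MERGE INTO Table1 AS t
USING (VALUES (@ShareId, @PriceDate, @Price)) AS s (Share_code_ID, [Date], Price)
    ON t.Share_code_ID = s.Share_code_ID AND t.[Date] = s.[Date]
WHEN MATCHED THEN
    UPDATE SET t.Price = s.Price
WHEN NOT MATCHED THEN
    INSERT (Share_code_ID, [Date], Price)
    VALUES (s.Share_code_ID, s.[Date], s.Price);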
I have a database infrastructure where we are regularly (at least once a day) replicating the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries, with no control over or direct access to the source database), this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle, "I'm going to be repeatedly sorting this entire table." MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here, for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C.) We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)
You could look into the multi-table INSERT feature. It performs a single FULL SCAN and inserts into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.
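For instance, a sketch of what you would test, assuming a database link named source_link and two local staging tables with matching columns (all names here are made up):
INSERT ALL
    INTO stage_b VALUES (a, b, c)
    INTO stage_c VALUES (a, b, c)
SELECT a, b, c FROM tbl1@source_link;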
Some things that would help the sorting issue are indexes on the columns that you are sorting on (and also joining the tables on, if they're not there already). You could also create materialized views that pre-compute the sorted results, so Oracle keeps them stored rather than rebuilding them for every run.
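For illustration, a minimal sketch of that materialized view idea; all names and columns are made up, and the refresh mode would have to match how often the source table is rebuilt:
CREATE MATERIALIZED VIEW mv_tbl1
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
AS
    SELECT a, b, c, d FROM tbl1;
-- An index on the sort columns of the view's container table may help
-- reduce the cost of the repeated ORDER BY when reading from it.
CREATE INDEX mv_tbl1_sort_idx ON mv_tbl1 (a, b, c, d);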
You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In source database :
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy the file, then import it in the destination database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database directory structure) and QUERY to do the ordering.
You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
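For illustration, a minimal sketch of such an exchange, assuming the destination table is partitioned and the staging table has an identical column layout (all names here are made up):
ALTER TABLE dest_tbl
    EXCHANGE PARTITION p_current
    WITH TABLE staging_tbl
    INCLUDING INDEXES
    WITHOUT VALIDATION;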
This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
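As a rough sketch of that idea (all table and column names are made up, and the target is assumed to store the hash in a row_hash column):
MERGE INTO target_tbl t
USING (
    SELECT key_col, col_a, col_b,
           ORA_HASH(col_a || '|' || col_b) AS row_hash  -- hash of non-key attributes
    FROM   source_tbl
) s
ON (t.key_col = s.key_col)
WHEN MATCHED THEN
    UPDATE SET t.col_a = s.col_a, t.col_b = s.col_b, t.row_hash = s.row_hash
    WHERE t.row_hash <> s.row_hash   -- only touch rows that actually changed
WHEN NOT MATCHED THEN
    INSERT (key_col, col_a, col_b, row_hash)
    VALUES (s.key_col, s.col_a, s.col_b, s.row_hash);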