How to optimize oracle query for repeated full table sorts? - database

I have a database infrastructure where we are regularly (at least once a day) replicating the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular oracle queries, no control or direct access to source database) - this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell oracle "I'm going to be repeatedly sorting this entire table"? MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C.) We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)

you could look into the multi-table INSERT feature. It should perform a single FULL SCAN and will insert into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.

Some things that would help the sorting issue is to have indexes on the columns that you are sorting on (and also joining the tables on, if they're not there already). You could also create materialized views which are already sorted, and Oracle would keep the sorted results cached.

You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In source database :
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy file then, import in dest database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database directory structure) and QUERY to do the ordering.

You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.

This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.

Related

Update SQL Table Based On Composite Primary Key

I have an ETL process (CSV to SQL database) that runs daily, but the data in the source sometimes changes, so I want to have it run again the next day with an updated file.
How do I write a SQL statement to find all the differences?
For example, let's say Table_1 has a composite PRIMARY KEY consisting of FK_1, FK_2 and FK_3.
Do I do this in SQL or in the ETL process?
Thanks.
Edit
I realize now this question is too broad. Disregard.
You can use EXCEPT to find which are the IDs which are missing. For example:
SELECT FK_1, FK_2, FK_2
FROM new_data_table
EXCEPT
SELECT FK_1, FK_2, FK_2
FROM current_data_table;
It will be better (in performance prospective) to materialized these IDs and then to join this new table to the new_data_table in order to insert all of the columns.
If you need to do this in one query, you can use simple LEFT JOIN. For example:
INSERT INTO current_data_table
SELECT A.*
FROM new_data_table A
LEFT JOIN current_data_table B
ON A.FK_1 = B.FK_1
AND A.FK_2 = B.FK_2
AND A.FK_3 = B.FK_3
WHRE B.[FK_1] IS NULL;
The idea is to get all records in the new_data_table for which, there is no match in the current_data_table table (WHRE B.[FK_1] IS NULL).

SQL query runs into a timeout on a sparse dataset

For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be to large to fit into a datatable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the datatable. So I take the Guids I want to check (they come in chunks of 1000), and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (#Guid0, #Guid1, #Guid2, #Guid3, #Guid4, #Guid5, ..., #Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN will be able to maximally use an index on [Group], or if at all. However, if you had a second table containing the GUID values, and furthermore if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
Guid varchar(255)
)
INSERT INTO #Guids (Guid)
VALUES
(#Guid0, #Guid1, #Guid2, #Guid3, #Guid4, ...)
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.

TSQL Comparing 2 tables

I have 2 tables in 2 database. The scheme for the tables is identical. There are no timestamps or last updated information. Table A is a live table, that is, it's updated in "the" program. Update records, insert records and delete records all happen in Table A. Table B is a backup made weekly. Is there a quick way to compare the 2 tables and give me results similar to:
I | 54
D | 55
U | 60
So record 54 in the live table is new, record 55 in the live table was deleted, record 60 in the live table was updated.
This needs to work in SQL Server 2008 and up.
Fields: id, first_name, last_name, phone, email, address_id, birth_date, last_visit, provider_id, comments
I have no control over the scheme. I have read-only access to Table A, read-write to Table B.
Would it be easier to store a hash of each Table A's rows rather than a full copy of the table? Generally speaking I need to know what rows have been updated/inserted and deleted without a build in timestamp. I have the weekly backup table to look at but I could create a hash table if needed.
Using two full joins the first one isvused to check just for id existance and identify inserts and deletes the second would be used for row equality.
In the example I have used checksum for simplicity but I recommend you read up on the cons of using it and consider alternatives like hashbytes or checking each column for equality
Select id, checksum(*) hash
Into #live
From live.dbo.tbl
Select id, checksum(*) hash
Into #archive
From archive.dbo.tbl
Select l1.id,
Case when l1.id is null then 'd'
when a1.id is null then 'I'
when a2.id is null then 'u' end change_type
From #live l1
Full Join #archive a1 On a1.id = l1.id
Full Join #archive a2 On a2.id = l1.id
And a2.hash = l1.hash
I'm going to recommend a tool, but it's not free, although it has a fully functioning 30 day trial period. If you're going to compare data in SQL Server tables, look at Red Gate's SQL Data Compare. It's not cheap, and it will pay for itself many times over. (If you need to compare schemas, their SQL Compare does that.)
Barring that, having a third table, where you write a compare query and select those in one table and not the other (with a field indicating that), those in the other table and not the first, and then comparing field by field to find those different -- well that should work too. It will take longer, but if it's just one one table, the time it takes to write that code should be less than what you'll pay for the Red Gate tools.
If there is a column or set of columns that can uniquely identify each row, then a series of sql statements could be written to identify the inserts, updates and deletes. If there isn't a unique row identifier or the unique identifier (for example, one of the columns that makes it unique) changes, then no.

Populating a table with fields from two other tables

I have two tables in Filemaker:
tableA (which includes fields idA (e.g. a123), date, price) and
tableB (which includes fields idB (e.g. b123), date, price).
How can I create a new table, tableC, with field id, populated with both idA and idB, (with the other fields being used for calculations on the combined data of both tables)?
The only way is to script it (for repeating uses) or do it 'manually', if this is an ad-hoc process. Details depend on the situation, so please clarify.
Update: Sorry, I actually forgot about the question. I assume the ID fields do not overlap even across tables and you do not need to add the same record more than once, but update it instead. In such a case the simplest script would be like that:
Set Variable[ $self, Get( FileName ) ]
Import Records[ $self, Table A -> Table C, sync on ID, update and add new ]
Import Records[ $self, Table B -> Table C, sync on ID, update and add new ]
The Import Records step is managed via rather elaborate dialog, but the idea is that you import from the same file (you can just type file:<YourFileName> there), the format is FileMaker Pro, and then set the field mapping. Make sure to choose the Update matching records and Add remaining records options and select the ID fields as key files to sync by.
It would be a FileMaker script. It could be run as a script trigger, but then it's not going to be seamless to the user. Your best bet is to create the tables, then just run the script as needed (manually) to build Table C. If you have FileMaker Server, you could schedule this script to be run periodically to keep Table C up-to-date.
Maybe you can use the select into statement.
I'm unsure if you wish to use calculated field from TableA and TableB or if your intension was to only calculate fields from the same table?
If tableA.IdA exists also in tableB.IdA, you could join the two tables and select into.
Else, you run the statement once for each table.
Select into statement
Select tableA.IdA, tableA.field1A, tableA.field2A, tableA.field1A * tableB.field2A
into New_Table from tableA
Edit: missed the part where you mentioned FileMaker.
But maybe you could script this on the db and just drop the table.

Creating a partitioned view of detail tables when the CHECK is on the header tables

I've been reading documentation and looking at FAQs and haven't found an answer for this one which probably means it can't be done. My actual situation is a little more complex, but I'll try to simplify it for this question. For each of the past years, I have a header/detail tables with a foreign key linking them. The year datum is in the header records! I want to be able to query all tables concatenated across years.
I have set up views that follows a 'SELECT + UNION ALL' format. I've also put check constraints on the header tables to restrict their values to their respective year. This allows the SQL server query optimizer to only query specific tables when running a query that is restricted with a WHERE clause. Awesome. Up to this point, this information can be found anywhere and everywhere by searching for Partitioned Views.
I want to do the same sort of query optimization with the detail tables but can't figure it out. There is nothing in the detail record that indicates what year it belongs to without joining with the header record; Meaning, the foreign key constraint is the only constraint I have to go off of.
The only solution I've thought of is adding a 'year' column to the detail tables and then adding another where sub clause to the queries. Is there any thing I can do to create a partitioned view of the detail tables using the existing foreign key constraint?
Here is some DDL for reference:
CREATE TABLE header2008 (
hid INT PRIMARY KEY,
dt DATE CHECK ('2008-01-01' <= dt AND dt < '2009-01-01')
)
CREATE TABLE header2009 (
hid INT PRIMARY KEY,
dt DATE CHECK ('2009-01-01' <= dt AND dt < '2010-01-01')
)
CREATE TABLE detail2008 (
did INT PRIMARY KEY,
hid INT FOREIGN KEY REFERENCES header2008(hid),
value INT
)
CREATE TABLE detail2009 (
did INT PRIMARY KEY,
hid INT FOREIGN KEY REFERENCES header2009(hid),
value INT
)
GO
CREATE VIEW headerAll AS
SELECT * FROM header2008 UNION ALL
SELECT * FROM header2009
GO
CREATE VIEW detailAll AS
SELECT * FROM detail2008 UNION ALL
SELECT * FROM detail2009
GO
--This only hits the header2008 table (GOOD)
SELECT *
FROM headerAll h
WHERE dt = '2008-04-04'
--This hits the header2008, detail2008, and detail 2009 tables. (BAD)
SELECT *
FROM headerAll h
INNER JOIN detailAll d ON h.hid = d.hid
WHERE dt = '2008-04-04'
Since you're not going for partitioned tables, I'm assuming you can't target 2005+ Enterprise Edition or higher.
Here is an alternative to adding a new physical column to your tables:
CREATE VIEW detailAll AS
SELECT 2008 AS Year, * FROM detail2008
UNION ALL
SELECT 2009, * FROM detail2009
then,
SELECT *
FROM headerAll h
INNER JOIN detailAll d ON h.hid = d.hid
WHERE dt = '2008-04-04' AND d.Year = 2008
Before you run off and implement this, there is a catch; well, two catches actually.
This solution, like the headerAll view as it's written, cannot accommodate parameters on the partitioning column and still do partition elimination. Using a search predicate of WHERE dt = #date AND d.Year = YEAR(#date) causes table scans across all tables in both views because the query optimizer assumes #date is an arbitrary value (and there's no way to fix that). This is a recipe for a performance disaster if the view is exposed publicly in your database API: there is no restriction on parameterization in queries, and most query authors and ORMs tend to use parameterized queries wherever possible (it's almost always a good thing!).
To get the views to do partition elimination in a real application, you will have to resort to dynamic string execution. How you accomplish this will depend on your business requirements, data requirements, and application architecture. It will be a bit trickier if you're grabbing data from multiple years.
Note also that using dynamic string execution would allow you to write queries directly against the base tables instead of introducing a UNIONed view for each "table". I don't think there's anything wrong with the latter, but this is an option you may not have considered.

Resources