Netezza left outer join query performance

I have a question about Netezza query performance. I have two tables, Table A and Table B. Table B is a subset of Table A with altered data, and I need to apply those new values from Table B to Table A.
There are two possible approaches:
1) Left outer join the tables, select the relevant columns, and insert the result into the target table
2) Insert the Table A data into the target table, then update those rows from Table B using a join
I tried both, and logically they are the same, but the explain plan reports different costs:
For the normal select:
a) Sub-query Scan table "TM2" (cost=0.1..1480374.0 rows=8 width=4864 conf=100)
For the update:
b) Hash Join (cost=356.5..424.5 rows=2158 width=27308 conf=21)
For the left outer join:
Sub-query Scan table "TM2" (cost=51.0..101474.8 rows=10000000 width=4864 conf=100)
From these numbers, the left outer join looks better to me. Can anyone share some thoughts and guidance on this?
Thanks

The insert into table_c select ... from table_a; update table_c set ... from table_b; approach costs more because you are effectively inserting, deleting, then inserting again. An update in Netezza marks the affected records as deleted and then inserts new rows with the updated values. Once data is written to an extent, it is never (to my knowledge) altered.
With insert into table_c select ... from table_a join table_b using (...); you insert only once, so the zone maps are updated only once. The cost will be noticeably lower.
Netezza does an excellent job of keeping you away from the disk on reads, but it will write to the disk as often as you tell it to, and with updates seemingly even more often. Try to write only as often as necessary to gain the benefits of new distributions and co-located joins. Anything beyond that just spends excess commit actions.
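To make the two approaches concrete, here is a minimal sketch that runs both against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in: it does not reproduce Netezza's storage behaviour or plan costs. The table names (target_join, target_upd), the id/val columns, and the sample rows are all made up for illustration. The sketch shows only that the single-write form and the insert-then-update form end with the same rows. It also assumes that a NULL in table_b never means "overwrite with NULL", since COALESCE would keep the table_a value in that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE target_join (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE target_upd  (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO table_b VALUES (2, 'b2');  -- altered subset of table_a
""")

# Approach 1: a single write. The left outer join picks the table_b value
# where one exists and falls back to the table_a value otherwise.
conn.execute("""
    INSERT INTO target_join (id, val)
    SELECT a.id, COALESCE(b.val, a.val)
    FROM table_a AS a
    LEFT OUTER JOIN table_b AS b ON b.id = a.id
""")

# Approach 2: two writes. Copy everything, then overwrite the matching rows.
conn.execute("INSERT INTO target_upd (id, val) SELECT id, val FROM table_a")
conn.execute("""
    UPDATE target_upd
    SET val = (SELECT b.val FROM table_b AS b WHERE b.id = target_upd.id)
    WHERE EXISTS (SELECT 1 FROM table_b AS b WHERE b.id = target_upd.id)
""")

rows_join = conn.execute("SELECT id, val FROM target_join ORDER BY id").fetchall()
rows_upd = conn.execute("SELECT id, val FROM target_upd ORDER BY id").fetchall()
print(rows_join == rows_upd)
```

Both targets end up identical, so the choice between the approaches comes down to how many times the data is written.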

Related

Number of rows updated in an Oracle table

I have a table called t1 which has already been updated from a file. I have table t2, which was created as a backup of t1 before the modifications. Now I want to know how many records were updated in t1. Is there any way I can join with the backup table to find out how many records were altered? Or how can I use the sql%rowcount function on an already updated table? Or how should I proceed with ALL_TAB_MODIFICATIONS?
You can join the tables on their primary key (because you didn't update that, hopefully!) and then compare every column. You'll have to check for nulls too, and it will take quite a lot of typing. You could use all_tab_cols and a bit of SQL to generate the query, though (write SQL that produces SQL as its output).
Actually, thinking about it, you might get away with less typing by natural-joining the tables to get the set of rows that didn't change, and then removing that set from the original full set:
select * from original
Minus
select original.* from original natural inner join backup
I've never done it, but the theory is that a natural join joins on all identically named columns, so every column of each table features in the join condition. It's an inner join, so only rows that have not changed will be represented. Any row in which a column has become null, or has gone from null to a value, will also drop out of the join. Note that a row holding NULL in the same column in both tables fails the join too, so it will be reported as changed even though it is not. Apart from that caveat, the join result is the set of rows that have not changed. If all you're after is a count, take the count of the original table less the count of this join result. If you want to know which rows changed, use the MINUS query.
Ideally you shouldn't do this; instead, capture the number of rows affected at the point the update is run. However, this technique can be used long after the update was performed (but before some other update is run).
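The set-difference idea can be tried out with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Oracle, so EXCEPT takes the place of MINUS. The id/name/note columns and the three sample rows are invented for illustration. It runs the natural-join variant next to a plain difference against the backup table. The plain difference is not in the answer above; it is shown only because set operators treat two NULLs as equal, while a join condition does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE original (id INTEGER PRIMARY KEY, name TEXT, note TEXT);
    CREATE TABLE backup   (id INTEGER PRIMARY KEY, name TEXT, note TEXT);
    INSERT INTO backup   VALUES (1, 'ann', 'x'), (2, 'bob', 'y'), (3, 'cat', NULL);
    -- Only row 2 was really changed ('bob' became 'rob').
    INSERT INTO original VALUES (1, 'ann', 'x'), (2, 'rob', 'y'), (3, 'cat', NULL);
""")

# Natural-join variant: remove the rows that still match on every column.
changed_via_join = conn.execute("""
    SELECT * FROM original
    EXCEPT
    SELECT original.* FROM original NATURAL JOIN backup
    ORDER BY id
""").fetchall()

# Plain difference: compare whole rows directly against the backup.
changed_direct = conn.execute("""
    SELECT * FROM original
    EXCEPT
    SELECT * FROM backup
    ORDER BY id
""").fetchall()

print(changed_via_join)  # includes row 3, whose NULL note cannot match in a join
print(changed_direct)    # only the row that really changed
```

With this data the natural-join variant also flags the unchanged row 3 because of its NULL column, while the direct difference reports only row 2.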

What happens in an UPDATE statement in which the updated table isn't mentioned in the FROM/JOIN clauses?

I intended to run the following UPDATE statement on a SQL Server database table:
UPDATE TABLE_A
SET COL_1=B.COL_1
FROM TABLE_A A
INNER JOIN TABLE_B B
ON A.KEY_1=B.KEY_1
WHERE B.COL_2 IS NOT NULL
AND A.COL_1=91216599
By mistake, I ran the following statement instead:
UPDATE TABLE_A
SET COL_1=B.COL_1
FROM TABLE_A_COPY A
INNER JOIN TABLE_B B
ON A.KEY_1=B.KEY_1
WHERE B.COL_2 is not NULL
AND A.COL_1=91216599
Notice that in this second statement (wrong one), the FROM clause specifies table TABLE_A_COPY instead of TABLE_A. Both tables have exactly the same schema (i.e., same columns) and the same data (before any UPDATE is executed, that is).
Both TABLE_A and TABLE_A_COPY have about 100 million records and the update affects about 500,000 records. The second statement (the wrong one) runs for several hours and fails while the 1st statement (the correct one) runs for 40 seconds and succeeds.
Clearly, both statements are syntactically correct, but I am not sure what exactly I asked SQL Server to do with the second statement.
My questions are:
What was SQL Server trying to do in the second statement? With my mistake I didn't specify any linkage between records in TABLE_A and TABLE_A_COPY, so was it trying to do a CROSS JOIN between the two, and then update each record in TABLE_A a gazillion times?
If it isn't too broad a question to ask, what would be a valid scenario for such an UPDATE statement, in which the table being updated is not mentioned in the FROM/JOIN clauses? Why would anyone do that? Why would SQL Server even allow that?
I did try searching for an answer to my questions, but Google seems to think I'm asking about UPDATE FROM syntax.
1) There is no connection between TABLE_A and TABLE_A_COPY, so you get a CROSS JOIN and the same row is updated a massive number of times. The result can be non-deterministic if parallel execution is involved. Demo:
CREATE TABLE #TABLE_A(KEY_1 INT PRIMARY KEY,COL_1 INT);
CREATE TABLE #TABLE_A_COPY(KEY_1 INT PRIMARY KEY,COL_1 INT);
CREATE TABLE #TABLE_B(KEY_1 INT PRIMARY KEY, COL_1 INT, COL_2 INT);
INSERT INTO #TABLE_A VALUES (1,91216599),(2,91216599),(3,91216599),
(4,91216599),(5,91216599),(6,6);
INSERT INTO #TABLE_A_COPY VALUES (1,91216599),(2,91216599),(3,91216599),
(4,91216599),(5,91216599),(6,6);
INSERT INTO #TABLE_B VALUES (1,10,10),(2,20,20), (3,30,30);
/*
UPDATE #TABLE_A
SET COL_1=B.COL_1
--SELECT *
FROM #TABLE_A A
INNER JOIN #TABLE_B B
ON A.KEY_1=B.KEY_1
WHERE B.COL_2 IS NOT NULL
AND A.COL_1=91216599;
*/
UPDATE #TABLE_A
SET COL_1=B.COL_1
FROM #TABLE_A_COPY A
INNER JOIN #TABLE_B B
ON A.KEY_1=B.KEY_1
WHERE B.COL_2 is not NULL
AND A.COL_1=91216599;
SELECT *
FROM #TABLE_A;
In the code above, check how the #TABLE_A row with KEY_1 = 6 changed: it is updated even though its COL_1 is not 91216599, because the WHERE clause filters #TABLE_A_COPY rather than the table being updated.
2) SQL Server's UPDATE FROM/DELETE FROM syntax is much broader than the ANSI standard. The problem you encountered boils down to updating the same row multiple times, and with UPDATE you get no error or warning:
From "Let's deprecate UPDATE FROM!" and "Deprecate UPDATE FROM and DELETE FROM":
Correctness? Bah, who cares?
Well, most do. That’s why we test.
If I mess up the join criteria in a SELECT query so that too many rows
from the second table match, I’ll see it as soon as I test, because I
get more rows back then expected. If I mess up the subquery criteria
in an ANSI standard UPDATE query in a similar way, I see it even
sooner, because SQL Server will return an error if the subquery
returns more than a single value. But with the proprietary UPDATE FROM
syntax, I can mess up the join and never notice – SQL Server will
happily update the same row over and over again if it matches more
than one row in the joined table, with only the result of the last of
those updates sticking. And there is no way of knowing which row that
will be, since that depends in the query execution plan that happens
to be chosen. A worst case scenario would be one where the execution
plan just happens to result in the expected outcome during all tests
on the single-processor development server – and then, after
deployment to the four-way dual-core production server, our precious
data suddenly hits the fan…
If you use MERGE, for example, you will get an error saying:
The MERGE statement attempted to UPDATE or DELETE the same row more
than once. This happens when a target row matches more than one source
row. A MERGE statement cannot UPDATE/DELETE the same row of the target
table multiple times. Refine the ON clause to ensure a target row
matches at most one source row, or use the GROUP BY clause to group
the source rows.
So you need to be more careful and check your code. I also wish this raised an error, but as you can see in the Connect link, that won't happen.
One way to avoid this is to use the alias as the UPDATE target, so you are sure the updated table takes part in the FROM/JOIN and no other table is involved:
UPDATE A
SET COL_1=B.COL_1
FROM #TABLE_A A
INNER JOIN #TABLE_B B
ON A.KEY_1=B.KEY_1
WHERE B.COL_2 IS NOT NULL
AND A.COL_1=91216599;
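Since UPDATE ... FROM gives no warning, one possible safeguard is to count, before running the update, how many source rows each target row would be paired with. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, with the sample data from the demo above stored in ordinary tables. The pre-check queries are my own illustration and not part of the answer. The mistaken statement is modelled as an explicit CROSS JOIN between the target and the FROM-clause tables, which assumes that is how the uncorrelated update behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a      (key_1 INTEGER PRIMARY KEY, col_1 INTEGER);
    CREATE TABLE table_a_copy (key_1 INTEGER PRIMARY KEY, col_1 INTEGER);
    CREATE TABLE table_b      (key_1 INTEGER PRIMARY KEY, col_1 INTEGER, col_2 INTEGER);
    INSERT INTO table_a VALUES (1,91216599),(2,91216599),(3,91216599),
                               (4,91216599),(5,91216599),(6,6);
    INSERT INTO table_a_copy SELECT * FROM table_a;
    INSERT INTO table_b VALUES (1,10,10),(2,20,20),(3,30,30);
""")

# Intended update: the target t is tied to the source through the key,
# so no target row should have more than one candidate value.
intended = conn.execute("""
    SELECT t.key_1, COUNT(*) AS candidates
    FROM table_a AS t
    JOIN table_b AS b ON b.key_1 = t.key_1
    WHERE b.col_2 IS NOT NULL AND t.col_1 = 91216599
    GROUP BY t.key_1
    HAVING COUNT(*) > 1
""").fetchall()

# Mistaken update: the target t has no link to the FROM clause,
# so every target row is paired with every row the FROM clause produces.
mistaken = conn.execute("""
    SELECT t.key_1, COUNT(*) AS candidates
    FROM table_a AS t
    CROSS JOIN table_a_copy AS a
    JOIN table_b AS b ON b.key_1 = a.key_1
    WHERE b.col_2 IS NOT NULL AND a.col_1 = 91216599
    GROUP BY t.key_1
    HAVING COUNT(*) > 1
    ORDER BY t.key_1
""").fetchall()

print(intended)  # no target row has several candidates
print(mistaken)  # every target row, including key 6, has three candidates
```

An empty result means each target row has at most one candidate value. Any rows returned are the ones whose final value would depend on the execution plan.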
SQL will allow a lot of things that probably do not make sense.
Notice that tableB is on both sides of the ON condition:
select *
from tableA
join tableB
on tableB.col1 = tableB.col1
SQL just checks syntax; it is up to you to write a statement that makes sense.
There might be some cases where you really do want a cross-product type of update.
This is how I would write that statement.
I line the table names up so they are easier to see:
UPDATE TABLE_A
SET A.COL_1 = B.COL_1
FROM TABLE_A A
JOIN TABLE_B B
ON A.KEY_1 = B.KEY_1
AND B.COL_2 IS NOT NULL
AND A.COL_1 = 91216599
AND A.COL_1 <> B.COL_1

Join multiple table performance

In my current project, I have to left join multiple tables (about 10 to 20) together. About 1 to 3 of these tables are large, with millions of rows (80 million at most); the other tables have only thousands of rows at most.
Currently, my query looks like this:
SELECT *
FROM table1
left join table2 on table1.A=table2.A
left join table3 on table1.B=table3.B
left join table4 on table1.C=table4.C
left join table5 on table1.D=table5.D
....
left join table15 on table1.Z=table15.Z
table1 and table2 are large tables; the others are small.
I have a clustered index on all of these tables, but the performance is still poor. So I want to know if there is anything I can try to improve it.
P.S.: I have tried creating nonclustered indexes on these tables, but the performance became worse than before.
Well, the fastest query would come from de-normalizing table1, so that the values currently split out into the normalized tables are actually part of the table.
Another solution you might try is building a temp table that is one big collection of the 20 other small tables, and then joining that temp table back to table1.
First of all, do you really need all of the joined data? I suppose in most situations you don't. If you do, you probably need to review your requirements and architecture.
So the trick is to fetch only the data you want, instead of all of it, and to filter the data as early as possible (even before joining the next table; but don't worry, SQL Server will do some optimization for you).
I would start by checking the execution plan with Ctrl+L. Try to find the "Index Scan" nodes and build indexes for them. I can't go any further without seeing your execution plan.

Proper way to filter a table using values in another table in MS Access?

I have a table of transactions with some transaction IDs and Employee Numbers. I have two other tables which are basically just a column full of transactions or employees that need to be filtered out from the first.
I have been running my query like this:
SELECT * FROM TransactionMaster
Where TransactionMaster.TransID
NOT IN (SELECT TransID from BadTransactions)
AND etc...(repeat for employee numbers)
I have noticed slow performance when running these types of queries. I am wondering if there is a better way to build this query?
If you want all TransactionMaster rows which don't include a TransID match in BadTransactions, use a LEFT JOIN and ask for only those rows where BadTransactions.TransID Is Null (unmatched).
SELECT tm.*
FROM
TransactionMaster AS tm
LEFT JOIN
BadTransactions AS bt
ON tm.TransID = bt.TransID
WHERE bt.TransID Is Null;
That query should be relatively fast with TransID indexed.
If you have Access available, create a new query using the "unmatched query wizard". It will guide you through the steps to create a similar query.
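The two formulations can be compared side by side in a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Access; the EmpNo column, the index name, and the sample rows are invented for illustration. It checks that NOT IN and the LEFT JOIN / IS NULL form return the same rows, and then shows a difference that goes beyond speed: once the filter table contains a NULL, NOT IN returns nothing, while the unmatched-join form is unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TransactionMaster (TransID INTEGER PRIMARY KEY, EmpNo INTEGER);
    CREATE TABLE BadTransactions (TransID INTEGER);
    CREATE INDEX idx_bad_transid ON BadTransactions (TransID);
    INSERT INTO TransactionMaster VALUES (1, 100), (2, 101), (3, 100), (4, 102);
    INSERT INTO BadTransactions VALUES (2), (4);
""")

NOT_IN_SQL = """
    SELECT tm.TransID FROM TransactionMaster AS tm
    WHERE tm.TransID NOT IN (SELECT TransID FROM BadTransactions)
    ORDER BY tm.TransID
"""
ANTI_JOIN_SQL = """
    SELECT tm.TransID
    FROM TransactionMaster AS tm
    LEFT JOIN BadTransactions AS bt ON tm.TransID = bt.TransID
    WHERE bt.TransID IS NULL
    ORDER BY tm.TransID
"""

not_in_rows = [r[0] for r in conn.execute(NOT_IN_SQL)]
anti_join_rows = [r[0] for r in conn.execute(ANTI_JOIN_SQL)]
print(not_in_rows, anti_join_rows)

# A single NULL in the filter table makes every NOT IN comparison unknown,
# so NOT IN filters out all rows; the unmatched join keeps working.
conn.execute("INSERT INTO BadTransactions VALUES (NULL)")
not_in_with_null = [r[0] for r in conn.execute(NOT_IN_SQL)]
anti_join_with_null = [r[0] for r in conn.execute(ANTI_JOIN_SQL)]
print(not_in_with_null, anti_join_with_null)
```

So the join form is worth preferring whenever the filter column might contain NULLs, whatever the performance difference turns out to be.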

UPDATE query from OUTER JOINed tables or derived tables

Is there any way in MS Access to update a table where the data comes from an outer-joined dataset or a derived table? I know how to do it in MSSQL, but in Access I always receive an "Operation must use an updateable query" error. The table being updated is updateable; the source data is not. After reading up on the error, Microsoft tells me that it is caused when the query would violate referential integrity. I can assure you this dataset will not. This limitation is crippling when trying to update large datasets. I also read that this can supposedly be remedied by enabling cascading updates. If the relationship between my tables is defined in the query only, is this a possibility? So far, writing the dataset to a temp table and then inner joining that to the table being updated is my only solution, and it is incredibly clunky. I would like to do something along the lines of this:
UPDATE Table1
LEFT JOIN Table2 ON Table1.Field1=Table2.Field1
WHERE Table2.Field1 IS Null
SET Table1.Field1= Table2.Field2
or
UPDATE Table1 INNER JOIN
(
SELECT Field1, Field2
FROM Table2, Table3
WHERE Field3='Whatever'
) AS T2 ON Table1.Field1=T2.Field1
SET Table1.Field1= T2.Field2
Update queries are very problematic in Access, as you've been finding out.
The temp table idea is sometimes your only option.
Sometimes using the DISTINCTROW declaration solves the problem (Query Properties -> Unique Records set to 'Yes'), so it is worth trying.
Another thing to try is to use aliases on your tables; this seems to help the JET engine as well.
UPDATE Table3
INNER JOIN
(Table1 INNER JOIN Table2 ON Table1.uid = Table2.uid)
ON
(Table3.uid = Table2.uid)
AND
(Table3.uid = Table1.uid)
SET
Table2.field=NULL;
What I did was:
1. Created 3 tables
2. Established relationships between them
3. Used the query builder to update a field in Table2.
There seems to be a problem in the query logic. In your first example, you LEFT JOIN to Table2 on Field1, but then have
Table2.Field1 IS NULL
in the WHERE clause. This limits you to records where no JOIN could be made. But then you try to update Table1 with data from Table2, even though no Table2 row was joined, so every Table2 value is Null.
Perhaps you could explain what it is you are trying to do with this query?
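The logic problem can be seen by running the join from the first example as a plain SELECT. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for Access, with made-up sample rows in Table1 and Table2. It shows only what values the update would have available, not how Access executes an update query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Field1 INTEGER);
    CREATE TABLE Table2 (Field1 INTEGER, Field2 INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1, 10), (2, 20);
""")

# Same join and filter as the first UPDATE in the question, run as a SELECT
# so the candidate value for each Table1 row is visible.
rows = conn.execute("""
    SELECT Table1.Field1, Table2.Field2
    FROM Table1
    LEFT JOIN Table2 ON Table1.Field1 = Table2.Field1
    WHERE Table2.Field1 IS NULL
    ORDER BY Table1.Field1
""").fetchall()

print(rows)  # only unmatched Table1 rows, each with a NULL Table2.Field2
```

Every row that survives the filter has a NULL Table2.Field2, so the update could only ever write NULL into Table1.Field1.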

Resources