Let's suppose I have the following table with a clustered index on a column (say, a)
CREATE TABLE Tmp
(
a int,
constraint pk_a primary key clustered (a)
)
Then, let's assume that I have two sets of a very large number of rows to insert to the table.
1st set) values are sequentially increasing (i.e., {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 999999997, 999999998, 999999999})
2nd set) values are sequentially decreasing (i.e., {999999999, 999999998, 999999997, ..., 3, 2, 1, 0})
Do you think there would be a performance difference between inserting the values in the first set and in the second set? If so, why?
Thanks
SQL Server will generally try to sort large inserts into clustered index order before inserting anyway.
If the source for the insert is a table variable, however, it will not take account of the cardinality unless the statement is recompiled after the table variable is populated. Without this it will assume the insert will only be one row.
The script below demonstrates three possible scenarios:
The insert source is already exactly in the correct order.
The insert source is exactly in reversed order.
The insert source is exactly in reversed order but OPTION (RECOMPILE) is used so SQL Server compiles a plan suited for inserting 1,000,000 rows.
Execution Plans
The third one has a sort operator to get the inserted values into clustered index order first.
/*Create three separate identical tables*/
CREATE TABLE Tmp1(a int primary key clustered)
CREATE TABLE Tmp2(a int primary key clustered)
CREATE TABLE Tmp3(a int primary key clustered)
DBCC FREEPROCCACHE;
GO
DECLARE @Source TABLE (N INT PRIMARY KEY)
INSERT INTO @Source
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM sys.all_columns c1, sys.all_columns c2, sys.all_columns c3
SET STATISTICS TIME ON;
PRINT 'Tmp1'
INSERT INTO Tmp1
SELECT TOP (1000000) N
FROM @Source
ORDER BY N
PRINT 'Tmp2'
INSERT INTO Tmp2
SELECT TOP (1000000) 1000000 - N
FROM @Source
ORDER BY N
PRINT 'Tmp3'
INSERT INTO Tmp3
SELECT 1000000 - N
FROM @Source
ORDER BY N
OPTION (RECOMPILE)
SET STATISTICS TIME OFF;
Verify Results and clean up
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('Tmp1'), 1, NULL, 'DETAILED')
WHERE index_level = 0
UNION ALL
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('Tmp2'), 1, NULL, 'DETAILED')
WHERE index_level = 0
UNION ALL
SELECT object_name(object_id) AS name,
page_count,
avg_fragmentation_in_percent,
fragment_count,
avg_fragment_size_in_pages
FROM
sys.dm_db_index_physical_stats(db_id(), object_id('Tmp3'), 1, NULL, 'DETAILED')
WHERE index_level = 0
DROP TABLE Tmp1, Tmp2, Tmp3
STATISTICS TIME ON results
+------+----------+--------------+
| | CPU Time | Elapsed Time |
+------+----------+--------------+
| Tmp1 | 6718 ms | 6775 ms |
| Tmp2 | 7469 ms | 7240 ms |
| Tmp3 | 7813 ms | 9318 ms |
+------+----------+--------------+
Fragmentation Results
+------+------------+------------------------------+----------------+----------------------------+
| name | page_count | avg_fragmentation_in_percent | fragment_count | avg_fragment_size_in_pages |
+------+------------+------------------------------+----------------+----------------------------+
| Tmp1 | 3345 | 0.448430493 | 17 | 196.7647059 |
| Tmp2 | 3345 | 99.97010463 | 3345 | 1 |
| Tmp3 | 3345 | 0.418535127 | 16 | 209.0625 |
+------+------------+------------------------------+----------------+----------------------------+
Conclusion
In this case all three ended up using exactly the same number of pages. However, Tmp2 is 99.97% fragmented compared with only 0.4% for the other two. The insert into Tmp3 took the longest because it required an additional sort step first, but this one-time cost needs to be set against the benefit that minimal fragmentation brings to future scans of the table.
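If the table will mostly be scanned, the fragmentation in Tmp2 can also be removed after the load with a rebuild instead; a minimal sketch (ALTER INDEX ALL avoids needing the system-generated name of the inline primary key):
/*Rebuild all indexes on Tmp2 to restore logical page order*/
ALTER INDEX ALL ON Tmp2 REBUILD;
/*Re-check fragmentation afterwards*/
SELECT avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(db_id(), object_id('Tmp2'), 1, NULL, 'DETAILED')
WHERE index_level = 0;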
The reason why Tmp2 is so heavily fragmented can be seen from the query below.
WITH T AS
(
SELECT TOP 3000 file_id, page_id, a
FROM Tmp2
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%)
ORDER BY a
)
SELECT file_id, page_id, MIN(a), MAX(a)
FROM T
GROUP BY file_id, page_id
ORDER BY MIN(a)
With zero logical fragmentation, the page holding the next highest key value would be the next highest page in the file, but here the pages are in exactly the opposite order of what they are supposed to be.
+---------+---------+--------+--------+
| file_id | page_id | Min(a) | Max(a) |
+---------+---------+--------+--------+
| 1 | 26827 | 0 | 143 |
| 1 | 26826 | 144 | 442 |
| 1 | 26825 | 443 | 741 |
| 1 | 26824 | 742 | 1040 |
| 1 | 26823 | 1041 | 1339 |
| 1 | 26822 | 1340 | 1638 |
| 1 | 26821 | 1639 | 1937 |
| 1 | 26820 | 1938 | 2236 |
| 1 | 26819 | 2237 | 2535 |
| 1 | 26818 | 2536 | 2834 |
| 1 | 26817 | 2835 | 2999 |
+---------+---------+--------+--------+
The rows arrived in descending order, so for example values 2834 down to 2536 were put into page 26818; then a new page was allocated for value 2535, but this was page 26819, a higher page number even though it holds lower key values.
One possible reason why the insert into Tmp2 took longer than the insert into Tmp1 is that, because the rows arrive in exactly reverse order on each page, every insert into Tmp2 means the slot array on the page needs to be rewritten, with all previous entries moved up to make room for the new arrival.
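If you want to see that slot array for yourself, a page dump shows it; a sketch using the long-standing (undocumented) DBCC PAGE command, with your own database name and a file_id/page_id pair taken from the fn_PhysLocCracker output above:
/*Route DBCC PAGE output to the client instead of the error log*/
DBCC TRACEON(3604);
/*Dump page (1:26818); print option 2 includes the raw bytes and the slot array*/
DBCC PAGE('YourDatabase', 1, 26818, 2);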
To answer this question, you only need to look at what effect clustering has on the data and the manner in which it is logically ordered. When clustering on an ascending key, higher numbers get added to the end of the table, so inserts will be very fast. When inserting in reverse, each row lands between two existing records (read up on page splitting); this results in slower inserts. It also has other negative effects (read up on fill factor); see the sketch below.
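A minimal sketch of the fill factor mitigation (the 80 is illustrative only): leaving free space in each page when the index is built means mid-range inserts can land without immediately splitting the page.
CREATE TABLE Tmp
(
a int,
constraint pk_a primary key clustered (a) with (fillfactor = 80)
)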
It has to do with pages being allocated sequentially, as is done for a clustered index. With the first set the rows would naturally cluster together. But with the second, I think you would have to keep moving rows between pages to keep them sequentially ascending. However, I really only understand SQL Server at a conceptual level, so you'd have to test.
Related
As a follow-up to What columns generally make good indexes?, I am trying to work out which columns are good index candidates for my query.
My query uses ROWNUM; which columns should I add to an index to improve its performance on an Oracle database?
I have already created an index on startdate and enddate.
SELECT
ID,
startdate,
enddate,
a,
b,
c,
d,
e,
f, /*fk in another table*/
g /*fk in another table*/
FROM tab
WHERE (enddate IS NULL ) AND ROWNUM <= 1;
Below is the plan table output:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
Plan hash value: 3956160932
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 64 | 2336 (2)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
|* 2 | TABLE ACCESS FULL| tab | 2 | 64 | 2336 (2)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=1)
2 - filter("tab"."enddate " IS NULL)
Thanks for the help.
Oracle B-tree indexes do not store entries in which every indexed column is NULL, which is why the plain index on enddate cannot be used for the enddate IS NULL predicate. One workaround for NULL values is to create a function-based index, as below:
CREATE TABLE TEST_INDEX(ID NUMBER, NAME VARCHAR2(20));
INSERT INTO TEST_INDEX
SELECT LEVEL, NULL
FROM DUAL CONNECT BY LEVEL<= 1000;
--SELECT * FROM TEST_INDEX WHERE NAME IS NULL AND ROWNUM<=1;
CREATE INDEX TEST_INDEX_IDX ON TEST_INDEX(NVL(NAME, 'TEST'));
SELECT * FROM TEST_INDEX WHERE NVL(NAME,'TEST')= 'TEST' AND ROWNUM<=1;
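You can confirm the optimizer actually picks the function-based index with the same dbms_xplan approach used further down this page (you may need to gather table statistics first):
EXPLAIN PLAN FOR
SELECT * FROM TEST_INDEX WHERE NVL(NAME, 'TEST') = 'TEST' AND ROWNUM <= 1;
SELECT * FROM TABLE(dbms_xplan.display);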
Another common workaround is to index both the column and a literal. NULLs are indexed as long as at least one column in the index key is not NULL. The multi-column index will be a little larger than the function-based index, but it has the advantage of working directly with the NAME IS NULL predicate.
DROP INDEX TEST_INDEX_IDX;
CREATE INDEX TEST_INDEX_IDX ON TEST_INDEX(NAME, 1);
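With the composite index in place, the NAME IS NULL predicate itself becomes indexable, since every key in that index now has a non-NULL second column:
SELECT * FROM TEST_INDEX WHERE NAME IS NULL AND ROWNUM <= 1;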
I have an org chart table which is modeled like this:
+-------------+------------+-----------------+
| Employee_ID | Manager_ID | Department_Name |
+-------------+------------+-----------------+
| 1 | 2 | Level1 |
| 2 | 3 | Level2 |
| 3 | | Level3 |
+-------------+------------+-----------------+
So each employee refers to another row, forming a chain that represents the org chart. This model holds the hierarchy for all employees.
However, for reporting purposes, we'd need to query a denormalized table, i.e. where the data is represented like this:
+-------------+--------+--------+--------+
| Employee_ID | ORG_1 | ORG_2 | ORG_3 |
+-------------+--------+--------+--------+
| 1 | Level1 | | |
| 2 | Level1 | Level2 | |
| 3 | Level1 | Level2 | Level3 |
+-------------+--------+--------+--------+
with as many ORG_x columns as needed to represent all levels that can be found. Then you can do simple groupings such as GROUP BY ORG_1, ORG_2, ORG_3, as sketched below. Note that one could reasonably assume a maximum number of levels.
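For instance, with such a view in place (OrgDenorm is a hypothetical name for it), a per-branch headcount would be a plain aggregate:
SELECT ORG_1, ORG_2, COUNT(*) AS headcount
FROM OrgDenorm
GROUP BY ORG_1, ORG_2;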
So here's my question: since the database sits on SQL Server, can I expect this to be feasible in Transact-SQL, so that I could build a view?
Before I start learning T-SQL, I want to make sure I'm on the right track.
(BTW, if yes, I'd be interested in recommendations for a good tutorial!)
Thanks!
R.
I would use common table expressions with PIVOT:
DECLARE @T TABLE
(
Employee_ID int,
Manager_ID int,
Department_Name varchar(10)
);
INSERT @T VALUES
(1,2,'Level 1'),
(2,3,'Level 2'),
(3,NULL,'Level 3');
WITH C AS (
SELECT Employee_ID, Manager_ID, Department_Name
FROM @T
UNION ALL
SELECT T.Employee_ID, T.Manager_ID, C.Department_Name
FROM C
JOIN @T T ON C.Manager_ID=T.Employee_ID
), N AS (
SELECT ROW_NUMBER() OVER (PARTITION BY Employee_ID ORDER BY Department_Name) N, *
FROM C
)
SELECT Employee_ID, [1] ORG_1, [2] ORG_2, [3] ORG_3
FROM N
PIVOT (MAX(Department_Name) FOR N IN ([1],[2],[3])) P
ORDER BY Employee_ID
Result:
Employee_ID ORG_1 ORG_2 ORG_3
----------- ---------- ---------- ----------
1 Level 1 NULL NULL
2 Level 1 Level 2 NULL
3 Level 1 Level 2 Level 3
Note: If you have only 3 levels, you can also do a simple 3 x self-JOIN, as sketched below.
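A rough sketch of that fixed-depth version, against the same @T data as above (the left joins walk down the reporting chain, so the departments come out right-aligned per employee rather than left-aligned as in the PIVOT result):
SELECT e.Employee_ID,
       s2.Department_Name AS ORG_1,
       s1.Department_Name AS ORG_2,
       e.Department_Name  AS ORG_3
FROM @T e
LEFT JOIN @T s1 ON s1.Manager_ID = e.Employee_ID -- direct report of e
LEFT JOIN @T s2 ON s2.Manager_ID = s1.Employee_ID -- report of the report
ORDER BY e.Employee_ID;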
Yes, the pattern you have here is known as an adjacency list. It is very common. The downside is that building your tree requires recursion, which can lead to performance problems on large sets. Another approach that is a lot faster is the Nested Sets model. It is a little less intuitive at first, but once you understand the concept it is super easy; a minimal sketch follows.
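A nested-sets sketch for the same three-row chain (the lft/rgt column names are the usual convention, not from the original post): each row's interval encloses the intervals of everything beneath it, so a whole management chain comes back with a single range predicate and no recursion.
CREATE TABLE OrgNested
(
Employee_ID int PRIMARY KEY,
Department_Name varchar(10),
lft int NOT NULL,
rgt int NOT NULL
);
INSERT INTO OrgNested VALUES
(3, 'Level 3', 1, 6), -- top of the chain
(2, 'Level 2', 2, 5),
(1, 'Level 1', 3, 4);
-- every manager above employee 1, found without recursion
SELECT m.Employee_ID, m.Department_Name
FROM OrgNested e
JOIN OrgNested m ON e.lft > m.lft AND e.lft < m.rgt
WHERE e.Employee_ID = 1;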
No matter which model you use to store your data, it is going to require a dynamic pivot or a dynamic crosstab to get it into the denormalized format you need; a sketch follows.
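Here is a hedged sketch of that dynamic pivot, reusing the recursive CTE from the PIVOT answer above but pointed at a permanent table (OrgChart and its columns are assumed names; STRING_AGG needs SQL Server 2017+, and the output columns come back named [1], [2], ... rather than ORG_x to keep the string-building simple):
DECLARE @cols nvarchar(max), @sql nvarchar(max);
/*Build the [1],[2],... column list from the deepest chain actually present*/
WITH C AS (
    SELECT Employee_ID, Manager_ID, 1 AS lvl FROM OrgChart
    UNION ALL
    SELECT C.Employee_ID, O.Manager_ID, C.lvl + 1
    FROM C JOIN OrgChart O ON C.Manager_ID = O.Employee_ID
)
SELECT @cols = STRING_AGG(QUOTENAME(CONVERT(varchar(10), lvl)), ',')
                   WITHIN GROUP (ORDER BY lvl)
FROM (SELECT DISTINCT lvl FROM C) AS d;
/*Splice the column list into the same CTE + PIVOT shape as the static query*/
SET @sql = N'
WITH C AS (
    SELECT Employee_ID, Manager_ID, Department_Name FROM OrgChart
    UNION ALL
    SELECT T.Employee_ID, T.Manager_ID, C.Department_Name
    FROM C JOIN OrgChart T ON C.Manager_ID = T.Employee_ID
), N AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Employee_ID
                              ORDER BY Department_Name) AS N, *
    FROM C
)
SELECT Employee_ID, ' + @cols + N'
FROM N
PIVOT (MAX(Department_Name) FOR N IN (' + @cols + N')) P
ORDER BY Employee_ID;';
EXEC sys.sp_executesql @sql;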
Going off the diagram here: I'm confused about columns 1 and 3.
I am working on a data warehouse table where two columns together act as the business key that gets you to a row.
The first column is the source system; there are three possible values, let's say IBM, SQL, and ORACLE. The second part of the composite key is the transaction ID, which can be numeric or varchar. There is no third column, other than the surrogate key, which is generated by IDENTITY(1,1) as each record gets loaded. So given the graph below, I imagine I would pass in a query like:
Select a.Patient,
       b.SourceSystem,
       b.TransactionID
from Patient a
right join Transactions b
    on a.SourceSystem = b.SourceSystem and
       a.TransactionID = b.TransactionID
where b.SourceSystem = 'SQL'
The graph leads me to think that column 1 in the index should be SourceSystem, since it would immediately narrow the drill-down into the next level of the index by a third. But when I showed this graph to a coworker, they interpreted column 1 as the TransactionID and column 2 as the source system.
Cols
1 2 3
-------------
| | 1 | |
| A |---| |
| | 2 | |
|---|---| |
| | | |
| | 1 | 9 |
| B | | |
| |---| |
| | 2 | |
| |---| |
| | 3 | |
|---|---| |
First, you should qualify all column names in the query. Second, a left join usually makes more sense than a right join (the semantics are: keep all rows in the first table). Finally, if you have proper foreign key relationships, then you probably don't need an outer join at all.
Let's consider this query:
Select p.Patient, t.SourceSystem, t.TransactionID
from Patient p join
     Transactions t
     on t.SourceSystem = p.SourceSystem and
        t.TransactionID = p.TransactionID
where t.SourceSystem = 'SQL';
The correct index for this query is Transactions(SourceSystem, TransactionId).
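In T-SQL, that recommendation is simply the following (the index name is my own):
CREATE INDEX IX_Transactions_SourceSystem_TransactionID
    ON Transactions (SourceSystem, TransactionID);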
Notes:
Outer joins affect the choice of indexes. Basically, if one of the tables has to be scanned anyway, then an index on it might be less useful.
t.SourceSystem = 'SQL' and p.SourceSystem = 'SQL' would probably optimize differently.
Does the patient really have a transaction id? That seems strange.
Wrong Result
So I have two tables:
Order
Staging
The Order table has this structure:
+-------+---------+-------------+---------------+----------+
| PO | cashAmt | ClaimNumber | TransactionID | Supplier |
+-------+---------+-------------+---------------+----------+
| 12345 | 100 | 99876 | abc123 | 0101 |
| 12346 | 50 | 99875 | abc123 | 0102 |
| 12345 | 100 | 99876 | abc123 | 0101 |
+-------+---------+-------------+---------------+----------+
The Staging table has this structure:
+----------+------------+-------------+---------------+
| PONumber | paymentAmt | ClaimNumber | TransactionID |
+----------+------------+-------------+---------------+
| 12345 | 100 | 99876 | abc123 |
| 12346 | 50 | 99875 | abc123 |
+----------+------------+-------------+---------------+
The query I am executing is:
select sum(cashAmt) CheckAmount, count(ClaimNumber) TotalLines
FROM [order] with (nolock)
WHERE TransactionID='abc123'
union
select sum(paymentAmt) CheckAmount, count(ClaimNumber) TotalLines
from Staging with (nolock)
where TransactionID='abc123'
but the sum is getting messed up because there are duplicates in one of the tables.
How can I change the query so that I get only unique rows from the Order table and the sums come out correct?
First ask yourself why there are duplicates in the Order table. There must be a reason they are there; I would deal with that first.
That issue aside, if the duplicates in the Order table have a purpose yet should not be counted for this particular query, you should be able to leave them out by applying DISTINCT over whichever columns reliably identify a duplicate, for example:
select sum(cashAmt) CheckAmount, count(ClaimNumber) TotalLines
from (select distinct PO, cashAmt, ClaimNumber, TransactionID from [order] where TransactionID='abc123') d
Assuming duplicates in your table are OK.
Not sure why you are using NOLOCK; it seems like it shouldn't be included.
You could use a table variable to store the distinct values. You'll need to adjust the data types in the table variable to match your table structure.
I haven't tested the code below but it should look something like this.
DECLARE @OrderTmp TABLE (
cashAmt numeric(10,2)
, ClaimNumber int
, TransactionID varchar(20)
)
INSERT INTO @OrderTmp
select Distinct
cashAmt
,ClaimNumber
,TransactionID
FROM
[order]
WHERE TransactionID='abc123'
select sum(cashAmt) CheckAmount, count(ClaimNumber) TotalLines
FROM @OrderTmp
where TransactionID='abc123'
union all --UNION ALL here; plain UNION would collapse the two result rows if both tables returned identical totals
select sum(paymentAmt) CheckAmount, count(ClaimNumber) TotalLines
from Staging
where TransactionID='abc123'
I have a query:
select min(timestamp) from table
This table has 60+ million rows, and daily I delete a few off the end. To determine whether there is any data old enough to delete, I run the query above. There is an ascending index on timestamp containing only that one column, and the query plan in Oracle shows this as a full index scan. Should this not be the definition of a seek?
edit including plan:
| Id | Operation                  | Name       | Rows | Bytes | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT           |            |    1 |     8 |     4   (0)| 00:00:01 |
|  1 |  SORT AGGREGATE            |            |    1 |     8 |            |          |
|  2 |   INDEX FULL SCAN (MIN/MAX)| NEVENTS_I2 |    1 |     8 |     4 (100)| 00:00:01 |
Can you post the actual query plan? Are you sure it is not doing a MIN/MAX index full scan? As you can see in this example, we're getting the MIN value from a 100,000-row table using a MIN/MAX index full scan with only a handful of consistent gets.
SQL> create table foo (
2 col1 date not null
3 );
Table created.
SQL> insert into foo
2 select sysdate + level
3 from dual
4 connect by level <= 100000;
100000 rows created.
SQL> create index idx_foo_col1
2 on foo( col1 );
Index created.
SQL> analyze table foo compute statistics for all indexed columns;
Table analyzed.
SQL> set autotrace on;
<<Note that I ran this statement once just to get the delayed block cleanout to
happen so that the consistent gets number wouldn't be skewed. You could run a
different query as well>>
1* select min(col1) from foo
SQL> /
MIN(COL1)
---------
02-FEB-11
Execution Plan
----------------------------------------------------------
Plan hash value: 817909383
---------------------------------------------------------------------------------------------
| Id | Operation                  | Name         | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT           |              |    1 |     7 |     2   (0)| 00:00:01 |
|  1 |  SORT AGGREGATE            |              |    1 |     7 |            |          |
|  2 |   INDEX FULL SCAN (MIN/MAX)| IDX_FOO_COL1 |    1 |     7 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement (level=2)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
2 consistent gets
0 physical reads
0 redo size
532 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
At first I thought that the index would only be used if the column is declared NOT NULL. I tested with the following setup:
SQL> CREATE TABLE my_table (ts TIMESTAMP);
Table created
SQL> INSERT INTO my_table
2 SELECT systimestamp + ROWNUM * INTERVAL '1' SECOND
3 FROM dual CONNECT BY LEVEL <= 100000;
100000 rows inserted
SQL> CREATE INDEX ix ON my_table(ts);
Index created
SQL> EXPLAIN PLAN FOR SELECT MIN(ts) FROM my_table;
Explained
SQL> SELECT * FROM TABLE(dbms_xplan.display);
---------------------------------------------------------------------------------
| Id | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT           |      |     1 |    13 |    69   (2)| 00:00:01 |
|  1 |  SORT AGGREGATE            |      |     1 |    13 |            |          |
|  2 |   INDEX FULL SCAN (MIN/MAX)| IX   | 90958 | 1154K |            |          |
---------------------------------------------------------------------------------
Here we notice that the index is used, but all rows from the index are read. If we specify that the column is not null we get a much better plan:
SQL> ALTER TABLE my_table MODIFY ts NOT NULL;
Table altered
SQL> EXPLAIN PLAN FOR SELECT MIN(ts) FROM my_table;
Explained
SQL> SELECT * FROM TABLE(dbms_xplan.display);
---------------------------------------------------------------------------------
| Id | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT           |      |     1 |    13 |     2   (0)| 00:00:01 |
|  1 |  SORT AGGREGATE            |      |     1 |    13 |            |          |
|  2 |   INDEX FULL SCAN (MIN/MAX)| IX   | 90958 | 1154K |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------
In fact this is the same plan that is also used if we add a WHERE clause (Oracle will read a single row from the index):
SQL> EXPLAIN PLAN FOR SELECT MIN(ts) FROM my_table WHERE ts IS NOT NULL;
Explained
SQL> SELECT * FROM TABLE(dbms_xplan.display);
----------------------------------------------------------------------------------
| Id | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |      |     1 |    13 |     2   (0)| 00:00:01 |
|  1 |  SORT AGGREGATE             |      |     1 |    13 |            |          |
|  2 |   FIRST ROW                 |      | 90958 | 1154K |     2   (0)| 00:00:01 |
|  3 |    INDEX FULL SCAN (MIN/MAX)| IX   | 90958 | 1154K |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------
This last plan shows (line 2) that Oracle is indeed performing a "seek".
Just wanted to home in on the fact that an "INDEX FULL SCAN (MIN/MAX)" is simply not the same as an "INDEX FULL SCAN". An INDEX FULL SCAN really does scan the entire index (possibly with filtering). An INDEX FULL SCAN (MIN/MAX) or INDEX RANGE SCAN (MIN/MAX), however, only reads the leaf block holding the smallest or largest key (in the range), but it can only be employed when the column is NOT NULL (which is a bit silly, and really a bug, since a NULL value is by definition neither the smallest nor the largest value). The (MIN/MAX) optimization is an implicit FIRST_ROWS action and doesn't need the "WHERE ... IS NOT NULL" condition to kick in. Interestingly, the MIN/MAX optimization is normally not considered by the CBO for function-based indexes; that's another little bug.