Background
This is a simplified version of the postgres database I am managing:
TableA: id,name
TableB: id,id_a,prop1,prop2
This database has a peculiarity: when I select data, I only consider rows of TableB that have the same id_a. So I am never interested in selecting data from TableB with mixed values of id_a. Therefore, queries are always of this kind:
SELECT something FROM TableB INNER JOIN TableA ON TableA.id=id_a
Some time ago, TableA grew to 20,000 rows and TableB to 10^7 rows.
As a first step to speed up queries, I added a B-tree index on one of TableB's columns. Something like the following:
"my_index" btree (prop1)
The problem
Now I have to insert new data, and the database will grow to more than double its current size. Inserting data into TableB has become too slow.
I understand that the slowness comes from updating my_index.
When I add a new row to TableB, the database has to update the my_index structure to keep it ordered.
I feel like this could be sped up if my_index did not span all rows.
But I do not need a new row with a given id_a value to be ordered relative to rows that have a different id_a value.
The question
How can I create an index on a table where the elements are ordered only within groups that share a common property (e.g. a column called id_a)?
You can't.
The question that I would immediately ask you if you want such an index is: Yes, but for what values of id_a do you want the index? And your answer would be “for all of them”.
If you actually would want an index only for some values, you could use a partial index:
CREATE INDEX partidx ON tableb(prop1) WHERE id_a = 42;
But really you want an index for the whole table.
Besides, an INSERT would be just as slow unless the inserted row does not satisfy the WHERE condition of the index.
There are three things you can do to speed up INSERT:
Run as many INSERT statements as possible in a single transaction, ideally all of them.
Then you don't have to pay the price of a COMMIT after every single INSERT, and COMMITs are quite expensive: they require data to be written to the disk hardware (not just the cache), and that is incredibly slow (1 ms is a decent time).
You can speed this up even more if you use prepared statements. That way the INSERT doesn't have to be parsed and prepared every time.
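A minimal sketch of what that looks like in plain SQL, assuming id_a is a bigint, the prop columns are text, and tableb.id is generated by the database (the statement name ins_b is arbitrary):
PREPARE ins_b (bigint, text, text) AS
    INSERT INTO tableb (id_a, prop1, prop2) VALUES ($1, $2, $3);
EXECUTE ins_b (42, 'foo', 'bar');
EXECUTE ins_b (42, 'baz', 'qux');
Client drivers usually expose prepared statements through their own API, which amounts to the same thing.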
Use the SQL command COPY to insert many rows. COPY is specifically designed for bulk data import and will be faster than INSERT.
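For example, loading from a CSV file (the path is just a placeholder; use \copy in psql if the file lives on the client):
COPY tableb (id_a, prop1, prop2) FROM '/path/to/data.csv' WITH (FORMAT csv);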
If COPY is too slow, usually because you need to insert very large amounts of data, the fastest way is to drop all indexes, insert the data with COPY and then recreate the indexes. This can speed up the process by an order of magnitude, but of course the database is not fully usable while the indexes are dropped.
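A sketch of that workflow, assuming my_index is the only index on the table:
DROP INDEX my_index;
COPY tableb (id_a, prop1, prop2) FROM '/path/to/data.csv' WITH (FORMAT csv);
CREATE INDEX my_index ON tableb (prop1);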
Related
Is there any way of converting the final a.ROWID > b.ROWID condition in the code below to Snowflake? The code below is Oracle. I need to carry the ROWID over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the below and work around the ROWID issue?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
              FROM rev_tag.emp_site_weekly b
              WHERE a.number = b.ID
                AND a.accountno = b.account_no
                AND a.ROWID > b.ROWID)
This Oracle code seems very broken, because ROWID is a table-specific pseudo column, so comparing its values between two tables is dubious. Unless there is some alignment magic happening, e.g. a row is written to rev_tag.emp_site_weekly at the same moment it is inserted into user_tag.user_dim_default. But even then I can imagine data flows where this will not give you what you want.
So, as with most things Snowflake, "there is no free lunch": the part of the data life cycle that relies on ROWID needs to be implemented explicitly.
Which implies that if you want to use two sequences, you should define one explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table INSERT or MERGE should be used so you can access the first table's sequence value and relate it in the second.
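A hedged sketch of what the rewritten delete could look like, where row_seq is a hypothetical column on both tables, populated from explicit sequences at load time, standing in for Oracle's ROWID ordering:
-- row_seq is a hypothetical sequence-populated column; it replaces
-- the cross-table ROWID comparison from the Oracle version.
DELETE FROM user_tag.user_dim_default
USING rev_tag.emp_site_weekly b
WHERE user_dim_default.number = b.ID
  AND user_dim_default.accountno = b.account_no
  AND user_dim_default.row_seq > b.row_seq;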
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value while a DB maintenance job or someone else's update runs alongside your code. Some of these internal columns may even have the same value for more than one row.
When joining tables, the ROWID on one table has no relation to the ROWID on another. When writing dedup logic or delete-before-insert logic, you should use the primary key, combined with an audit column that holds the date of insert or date of last update. Check the data model or ERD diagram for the PK/FK relationships between the tables and which audit columns are available.
I have a Lookup Transformation on a table with 30 columns, but I am only using two of them: the ID column for the join and the Update column as output.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
Using a table in the drop-down, would this be like doing Select * From T1, or is SSIS clever enough to know I only need 2 columns?
I'm thinking I should go with the Query Select ID, Update From T1.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
It is best to specify which columns you want.
Using table in Drop down, would this be like doing Select * From T1
Yes, it is a SELECT *.
or is SSIS clever enough to know I only need 2 columns?
Nope.
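One caveat when writing the query yourself: Update is a reserved word in T-SQL, so the column may need to be bracketed, e.g.:
SELECT ID, [Update] FROM T1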
Keep in mind that Lookups are good for pulling data from Dimension Tables where the row count and record set is small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN, instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.
Lookups have the drawback of behaving like correlated sub-queries, in that they fire off a query to the server for every row passing through. You can have the Lookup cache the data, meaning SSIS will store the results in memory and check the memory before going to the server for subsequent rows. As a result, this is only effective if there is a large number of matching records for a small cache set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up; at that point, caching the data is almost pointless.
This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
When handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees; lookups can be huge performance bottlenecks. Handled correctly, though, a Lookup can simplify the design of the data flow and speed development by removing the extra work required to MERGE JOIN data flows.
The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.
If you need only two columns from the lookup table, then it is better to use a SELECT query than to pick the table from the drop-down list, but the columns specified must contain the join key (ID). Reading all columns consumes more resources, even if the effect may not be noticeable on small tables.
You can refer to the following answer on database administrators community for more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Note that what @JWeezy mentioned about lookups against large tables is right. Lookups are not designed for large tables; I would use SQL JOINs instead.
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, as you don't have to send the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (excluding BLOBs stored off-row) live in the same set of pages, so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you had a nonclustered index on just the columns you are after, as those would then be stored in their own, smaller structure of data pages (so there would be less to read).
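For example, a covering nonclustered index for the query above might look like this (a sketch using the names from the question; the index name is arbitrary):
CREATE NONCLUSTERED INDEX IX_table_covering
ON dbo.[table] (IndexedColumn)
INCLUDE (column1);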
Assuming that you are using the default clustered index created by SQL Server when defining the primary key on the table in both scenarios, then no, there shouldn't be any performance difference between the two. Maybe worth generating an actual execution plan to see for yourself? -- Actually, I'm not sure the above is true: given this is a rowstore, the wider table won't be able to fit as many rows onto each page, so it will suffer more I/O overhead when reading data.
I have 2 tables like:
-table1: id_1, id_2, id_3, ref_id (id_1, id_2 is the PK)
-table2: ref_id, id_4 (ref_id is the PK)
I want table1's id_3 field to be set equal to table2's id_4.
table1 has about 6 million records and table2 has about 2700 records.
I wrote SQL like this:
update table1
set id_3 = b.id_4
from table1
left join table2 b on id_1 = b.ref_id
Using SQL Server, the query has been running for about 16 hours with no response. How can I reduce the query time?
It does sound like it is taking absurdly long, and the lack of indexes could be the cause. Without indexes, the database basically has to walk through all 2,700 records of table2 for every single record in your 6M-row table.
So start by adding an index on ref_id (assuming the primary key isn't already backed by one) and also add an index on id_1.
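In DDL terms, something like this (the index names are placeholders):
CREATE INDEX ix_table2_ref_id ON table2 (ref_id);
CREATE INDEX ix_table1_id_1 ON table1 (id_1);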
To make the progress easier to monitor, simply loop through the 2,700 records in table2 and do an update per record (or per 10, 100, etc.), so you update in parts and can see how far it gets.
Also, to make sure you don't do anything useless, I would recommend adding AND table1.id_3 <> b.id_4 to the condition, so rows that already hold the right value are skipped.
Updating every row in a 6-million row table is likely to be slow regardless.
One way to get a benchmark for the maximum speed of updating every row would be to just time the query:
update table1
set id_3 = 100
Also, do you need to update rows in table1 that have no matching row in table2? If not, switching the left outer join to an inner join would greatly improve performance.
To answer this question we really need to know what the clustered indexes on the two tables are. I can make a suggestion for the clustered indexes to make this particular query fast, however, other factors should really be considered when choosing clustered indexes.
So with that in mind, see if these indexes help:
table1: UNIQUE CLUSTERED INDEX on (id_1, id_2)
table2: UNIQUE CLUSTERED INDEX on (ref_id)
Basically make your PKs clustered if they are not already.
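In DDL terms, that would be something like (the index names are placeholders):
CREATE UNIQUE CLUSTERED INDEX cix_table1 ON table1 (id_1, id_2);
CREATE UNIQUE CLUSTERED INDEX cix_table2 ON table2 (ref_id);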
The other important thing is whether the tables are seeing other traffic while you are running this update. If so, the long runtime may be due to blocking. In this case you should consider batching, i.e. updating only a small portion at a time instead of everything in a single statement.
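A minimal batching sketch in T-SQL (the batch size is arbitrary; the inner-join form plus the <> predicate suggested earlier keep the loop making progress):
while 1 = 1
begin
    -- assumes table2.id_4 is never NULL, so updated rows drop out of the predicate
    update top (100000) t1
    set id_3 = b.id_4
    from table1 t1
    inner join table2 b on t1.id_1 = b.ref_id
    where t1.id_3 <> b.id_4 or t1.id_3 is null;

    if @@rowcount = 0 break;
end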
As per the subject, I am looking for a fast way to count records in a table, without a table scan, with a where condition.
There are different methods; the most reliable one is
Select count(*) from table_name
But other than that you can also use one of the followings
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last two need sysindexes to be up to date; run the following to achieve this. If you don't update sysindexes, it is highly likely they'll give you wrong results, but as an approximation they might be acceptable.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: sorry, I did not read the part about counting with a WHERE clause. I agree with Cruachan: the solution to your problem is proper indexes.
The following page lists 4 methods of getting the number of rows in a table, with commentary on accuracy and speed.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
Simply, ensure that your table is correctly indexed for the where condition.
If you're concerned about this sort of performance, the approach is to create indexes that incorporate the field in question. For example, if your table contains a primary key of foo plus fields bar, parrot and shrubbery, and you know that you're regularly going to pull back records using a condition on shrubbery that needs data from that field alone, you should set up a compound index of [shrubbery, foo]. This way the RDBMS only has to query the index and not the table. Indexes, being tree structures, are far faster to query than the table itself.
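A sketch, using a hypothetical table name for the schema described above:
CREATE INDEX ix_shrubbery_foo ON mytable (shrubbery, foo);
A query such as SELECT COUNT(*) FROM mytable WHERE shrubbery = 'holly' can then be answered from the index alone.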
How much actual work the RDBMS needs to do depends on the RDBMS itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table without a where condition will on most RDBMSs return instantly, as the record count is held at the table level and a table scan is not required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead: if you update a field, the RDBMS has to update every index containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is reads and insert/update/delete activity is of lesser importance, and which are heavily indexed on various combinations of fields so that most queries can use just the indexes and never touch the actual table data.
ADDED: If you are using indexed access on a table that does have significant insert/update/delete activity, then make sure you schedule regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant insert/update/delete activity periodic maintenance is needed to keep them that way.
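In SQL Server terms (which the DBCC commands above suggest), that maintenance is a periodic rebuild or reorganize; mytable here is a placeholder:
ALTER INDEX ALL ON mytable REBUILD;
-- or the lighter-weight option for mildly fragmented indexes:
ALTER INDEX ix_shrubbery_foo ON mytable REORGANIZE;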