I have a table called tbl_WHO with 90 million records and a temp table #EDU with just 5 records.
I want to do pattern matching on the name field between the two tables (tbl_WHO and #EDU).
Query: The following query took 00:02:13 to execute.
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0
)
Sometimes I have to do pattern matching on more than one column, like:
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0 AND
(ISNULL(PATINDEX('%'+Tbl.PAddress+'%',Tmp.Addres),'0')) > 0 OR
(ISNULL(PATINDEX('%'+Tbl.PZipCode,Tmp.ZCode),'0')) > 0
)
Note: There is an index on each of the columns used in the conditions.
Is there any other way to tune the query performance?
Searches starting with % are not sargable, so even with an index on the column, you will not be able to use it effectively.
Are you sure you need to search with PATINDEX each time? A table with 90 million records is not huge, but having many columns and not applying normalization correctly can certainly hurt performance.
I would advise revisiting the table design and checking whether the data can be normalized further. In some cases this leads to better performance and reduces the table's storage as well.
For example, the zip code can be moved to a separate table so that, instead of comparing zip code strings, you join on an integer column. Try to normalize the address further: do you have city, street or block, and street or block number? For the names, if you need to search by first and last name, split them into separate columns.
String values can be sanitized, for example by trimming whitespace at the beginning and end. With data like that, we can create hash indexes and get extremely fast equality searches.
What I want to say is that if you normalize your data and add some rules (at the database and application level) to ensure the input data is correct, you will get very good performance. It is the long way, but you will have to do it eventually, and it is easier to do now than later (you are already late as it is).
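The difference between a leading-wildcard search and an equality search on sanitized data can be sketched roughly as follows. This is only an illustration, not the original schema: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table `people`, its columns, and the index name are all made up for the example. The exact plan text differs between engines, but the scan-versus-seek distinction is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, status TEXT)")
conn.execute("CREATE INDEX ix_people_first_name ON people (first_name)")

# Sanitize on the way in: trim whitespace so equality comparisons are reliable.
raw_names = ["  Alice", "Bob  ", " Carol ", "Dave"]
conn.executemany(
    "INSERT INTO people (first_name, status) VALUES (TRIM(?), 'A')",
    [(name,) for name in raw_names],
)

def plan(sql, params=()):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description

# A pattern that starts with % cannot seek into the index.
wildcard_plan = plan("SELECT id, status FROM people WHERE first_name LIKE ?", ("%lic%",))
# An equality predicate on the trimmed column can.
equality_plan = plan("SELECT id, status FROM people WHERE first_name = ?", ("Alice",))

print(wildcard_plan)
print(equality_plan)
```

In SQLite the first plan is reported as a scan of the table and the second as a search using the index, which is the behavior the sargability remark describes.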
Related
I have a table with about 50,000 records (a global index of corporate and government bonds).
I would like the user to be able to filter this master index firstly into a smaller subset index (based on permanent logic), and then apply further run time criteria that vary each time.
For example, let's say the user wanted to start from one of many subset indices of bonds, let's say of government bonds only, rather than government and corporate bonds, and also only wanted the US$ government bond index specifically. This would be a permanently defined subset of the master index, with a where clause something like "[Level1]='Government' AND [Currency]='USD' AND [CountryCode]='US'"
At run time, the user would supply further criteria, for example "AND [IssueSize] > 1,000,000,000 AND [Yield] > 0.0112".
I initially thought of having a separate table that stored the different criteria for these permanent sub-indices as where clauses, for example it might have columns "IndexCode, IndexLogic", and using the example above the values would be "UST", "[Level1]='Government' and [Currency]='USD' AND [CountryCode]='US'", and there would be dozens of rows in this table defining commonly used bond indices.
I had originally thought of creating a dynamic string at run time, where the user supplies their choice of sub-index code ('UST' in the example above), which then adds the relevant where conditions, plus any additional criteria passed as separate parameters, and then doing an exec(@tsql) type command. I had also thought of perhaps having a where clause that was a function call, but this seems very inefficient?
Is the dynamic string method the best way of doing this, or is there a better way involving some kind of 'eval' function equivalent which can take a field value and use that as a where clause?
The problem here is you don't know in advance what the filtered index is.
A solution I have used in this situation, where the filtered index can often change, is to pull the definition of the filter back into the client app and use that to dynamically generate the SQL batch. You can also do this with dynamic SQL in a stored procedure:
SELECT ISNULL(
(SELECT i.filter_definition
FROM sys.indexes i
WHERE i.object_id = OBJECT_ID(@tablename) AND
i.name = @indexname AND i.has_filter = 1),
'(1=1)');
You pass the table name and the index name, and you get back the exact filter definition for the index. This has the benefit that if the index is dropped, the condition becomes (1=1), i.e. every row. You can change this to (1=0) to return nothing instead.
Then you concatenate this into your dynamic query like so:
SELECT *
FROM table
WHERE regular_conditions_here
AND concated_filter_here
If you are joining other tables, I would advise you to put the filter in a subquery, otherwise you may get column name clashes as the filter definition has no aliases.
SELECT *
FROM (SELECT * FROM table WHERE concated_filter_here) table
JOIN othertables etc
WHERE regular_conditions_here
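A minimal sketch of the concatenation approach, under several assumptions: it uses Python's sqlite3 rather than SQL Server, and instead of reading `sys.indexes` it reads the stored predicate from a lookup table like the `IndexCode, IndexLogic` table proposed in the question. The table names, column names, and the `select_bonds` helper are hypothetical. The stored predicate is treated as trusted text maintained by the developers, while the run-time criteria are bound as parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bonds (id INTEGER PRIMARY KEY, level1 TEXT, currency TEXT,
                    country_code TEXT, issue_size REAL, bond_yield REAL);
CREATE TABLE index_logic (index_code TEXT PRIMARY KEY, logic TEXT NOT NULL);
INSERT INTO bonds VALUES
  (1, 'Government', 'USD', 'US', 2e9, 0.0150),
  (2, 'Government', 'USD', 'US', 5e8, 0.0200),
  (3, 'Corporate',  'USD', 'US', 3e9, 0.0300),
  (4, 'Government', 'EUR', 'DE', 4e9, 0.0120);
""")
conn.execute(
    "INSERT INTO index_logic VALUES (?, ?)",
    ("UST", "level1 = 'Government' AND currency = 'USD' AND country_code = 'US'"),
)

def select_bonds(index_code, min_issue_size, min_yield):
    """Apply the stored sub-index filter, then the run-time criteria."""
    row = conn.execute(
        "SELECT logic FROM index_logic WHERE index_code = ?", (index_code,)
    ).fetchone()
    # Fall back to "every row" when no definition exists, as in the answer.
    stored_filter = row[0] if row else "(1=1)"
    # The trusted stored filter is concatenated inside a subquery;
    # user-supplied values are only ever passed as bound parameters.
    sql = (
        "SELECT b.id FROM (SELECT * FROM bonds WHERE " + stored_filter + ") AS b "
        "WHERE b.issue_size > ? AND b.bond_yield > ? ORDER BY b.id"
    )
    return [r[0] for r in conn.execute(sql, (min_issue_size, min_yield))]

print(select_bonds("UST", 1_000_000_000, 0.0112))
print(select_bonds("UNKNOWN", 1_000_000_000, 0.0112))
```

Keeping the stored filter inside the subquery means its unaliased column names cannot collide with columns from any tables joined in the outer query.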
I have a table Customers with millions of records and 701 attributes (columns). I receive a CSV file that has one row and 700 columns. On the basis of these 700 column values, I have to extract the ids from the Customers table.
One obvious way is to fire a select query with all 700 values in the where clause.
My question is: if I first fetch a smaller table using only one attribute in the where clause, then fetch again on the basis of a second attribute, and repeat this process for all attributes, would it be any faster?
Or can you suggest any other method that could make it faster?
Try to understand the logic of those 700 attributes. There might be dependencies between them that can help reduce the number of the attributes to something more "realistic".
I would then use the same technique to see whether I can run smaller queries that benefit from indexes on the main table. Each time, I would store the result in a temporary table (reducing the number of rows in the temp table), index the temp table for the next step, and repeat until I have the final result.
Example: if you have date attributes, try to isolate all records for the year, then the day, etc.
Try to keep complex requests for the end as they will be running against smaller tmp tables.
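The step-by-step narrowing described above might look roughly like this. It is a sketch only, using Python's sqlite3 as a stand-in engine with a made-up three-attribute `customers` table in place of the real 701-column one. The `narrow` helper and all names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT,
                        signup_year INTEGER, segment TEXT);
CREATE INDEX ix_customers_country ON customers (country);
INSERT INTO customers VALUES
  (1, 'US', 2020, 'retail'),
  (2, 'US', 2021, 'retail'),
  (3, 'US', 2020, 'wholesale'),
  (4, 'DE', 2020, 'retail');
""")

# One incoming row of attribute values (stands in for the CSV row).
incoming = {"country": "US", "signup_year": 2020, "segment": "retail"}

def narrow(source, target, column, value):
    """Copy rows of `source` matching column = value into temp table `target`.

    Table and column names are fixed by the developer, never user input.
    """
    conn.execute(f"CREATE TEMP TABLE {target} AS SELECT * FROM {source} WHERE 0")
    conn.execute(
        f"INSERT INTO {target} SELECT * FROM {source} WHERE {column} = ?", (value,)
    )
    return conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]

# Step 1: use the attribute that is indexed on the main table.
step1_rows = narrow("customers", "step1", "country", incoming["country"])
# Index the smaller temp table for the next step.
conn.execute("CREATE INDEX ix_step1_year ON step1 (signup_year)")

# Step 2: filter the reduced set on the next attribute.
step2_rows = narrow("step1", "step2", "signup_year", incoming["signup_year"])

# Final step: the remaining check runs against the smallest table.
matches = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM step2 WHERE segment = ? ORDER BY id", (incoming["segment"],)
    )
]
print(step1_rows, step2_rows, matches)
```

Whether this beats a single query with all predicates depends on the engine and the available indexes, so it is worth measuring both on real data.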
I have a search procedure which has to search in five tables for the same string. I wonder which is better regarding read performance?
To have all tables combined in one table, then add a full text index on it
To create full text indexes in all those tables and issue a query on all of them, then to combine the results
One thing to consider when working on performance is that reading data once is almost always faster than reading it, moving it, and then reading it again.
So, from your question, if you combine all the tables into a single table, say a temporary table or table variable, this will most likely be slower, because it has to query all that data and move it (depending on how much data you are working with, this may or may not make much of a difference). Also, regarding your table structure, indexing on strings only really becomes efficient when the same string shows up a number of times throughout the tables.
For example, if you are searching on months (Jan, Feb, Mar, etc.), an index will be great because the value can only be 1 of 12 options; the more times a value is repeated and the fewer options there are to choose from, the better an index is. If, on the other hand, you are searching on user-entered values ("I am writing my life story....") and looking for a word inside that string, an index may not make any difference.
So assuming your query is searching on, in my example months, then you could do something like this:
SELECT value
FROM (
SELECT value FROM table1
UNION
SELECT value FROM table2
UNION
SELECT value FROM table3
UNION
SELECT value FROM table4
UNION
SELECT value FROM table5
) t
WHERE t.value = 'Jan'
This will combine your results into a single result set without moving data around. The query optimizer will also be able to find the most efficient way to query each table using the index provided on each table.
We have a table that stores information about clients and is loaded daily from the data warehouse by a scheduled job. There are more than 1 million records in that table.
I wanted to define a bitmap index on the Country column, as there would be a limited number of values.
Does it have any impact on the indexes if we delete and reload data into the table on a daily basis? Do we need to explicitly rebuild the index after every load?
A bitmap index is dangerous when the indexed column is frequently updated, because DML on a single row can lock many rows in the table. That is why it is more of a data warehouse tool than an OLTP one. Also, the true power of bitmap indexes comes from combining several of them with logical operations and translating the result into ROWIDs (and then accessing or aggregating the rows). In Oracle there are generally not many reasons to rebuild an index. When frequently modified, it adapts through 50/50 block splits, and it does not make sense to try to compact it to the smallest possible space. One million rows is nothing today unless each row contains a large amount of data.
Also be aware that bitmap indexes require an Enterprise Edition license.
The rationale for defining a bitmap index is not that a column has few values, but that one or more queries can profit from it when accessing the table rows.
For example, if you have, say, 4 equally populated countries, Oracle will not use the index, as a full table scan is cheaper.
If you have some "exotic" countries (very few records), a bitmap index could be used, but you will most probably see no difference from a conventional index.
I wanted to define BitMap Index on Country column as there would be limited number of values.
Just because a column is low cardinality does not mean it is a candidate for a bitmap index. It might be, it might not be.
Good explanation by Tom Kyte here.
Bitmap indexes are extremely useful in environments where you have
lots of ad hoc queries, especially queries that reference many columns
in an ad hoc fashion or produce aggregations such as COUNT. For
example, suppose you have a large table with three columns: GENDER,
LOCATION, and AGE_GROUP. In this table, GENDER has a value of M or F,
LOCATION can take on the values 1 through 50, and AGE_GROUP is a code
representing 18 and under, 19-25, 26-30, 31-40, and 41 and over.
For example,
You have to support a large number of ad hoc queries that take the following form:
select count(*)
from T
where gender = 'M'
and location in ( 1, 10, 30 )
and age_group = '41 and over';
select *
from t
where ( ( gender = 'M' and location = 20 )
or ( gender = 'F' and location = 22 ))
and age_group = '18 and under';
select count(*) from t where location in (11,20,30);
select count(*) from t where age_group = '41 and over' and gender = 'F';
You would find that a conventional B*Tree indexing scheme would fail you. If you wanted to use an index to get the answer, you would need at least three and up to six combinations of possible B*Tree indexes to access the data via the index. Since any of the three columns or any subset of the three columns may appear, you would need large concatenated B*Tree indexes on
GENDER, LOCATION, AGE_GROUP: For queries that used all three, or GENDER with LOCATION, or GENDER alone
LOCATION, AGE_GROUP: For queries that used LOCATION and AGE_GROUP or LOCATION alone
AGE_GROUP, GENDER: For queries that used AGE_GROUP with GENDER or AGE_GROUP alone
Having only a single bitmap index on a table is useless most of the time. You get the benefit of bitmap indexes when you have several of them on a table and your query combines them.
Maybe list partitioning is more suitable in your case.
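To make the point about combining bitmaps concrete, here is a toy model in Python. It is not how Oracle stores or processes bitmap indexes internally; it only illustrates the idea that each distinct value gets a bitmap over the rows, and that a query with several predicates becomes bitwise AND/OR operations. The sample rows and the `build_bitmaps` helper are invented for the example, reusing the GENDER, LOCATION, and AGE_GROUP columns from the quoted text.

```python
# Each tuple is one table row: (gender, location, age_group).
rows = [
    ("M", 1, "41 and over"),
    ("F", 10, "18 and under"),
    ("M", 10, "41 and over"),
    ("M", 30, "19-25"),
    ("F", 30, "41 and over"),
]

def build_bitmaps(column):
    """Map each distinct value of a column to an int whose bit i is set
    when row i holds that value (a stand-in for one bitmap index)."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

gender = build_bitmaps(0)
location = build_bitmaps(1)
age_group = build_bitmaps(2)

# select count(*) from T
# where gender = 'M' and location in (1, 10, 30) and age_group = '41 and over'
location_in = location.get(1, 0) | location.get(10, 0) | location.get(30, 0)  # OR for IN
hits = gender["M"] & location_in & age_group["41 and over"]                    # AND across columns

count = bin(hits).count("1")                                   # COUNT(*) needs no row access
matching_rows = [i for i in range(len(rows)) if (hits >> i) & 1]  # bit positions play the ROWID role

print(count, matching_rows)
```

With a single bitmap there is nothing to combine, which is why one bitmap index on its own rarely pays off.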
We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
Records that have the same customer_id and product_id, but different quantities within 20 units, and sale dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, you might be looking at a simple ETL process to do the heavy lifting for you. The load on the database should be manageable, in the sense that you will be loading into your ETL environment, running transformations, checks, and comparisons there, and then writing your results to, perhaps, a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I think being efficient here will mean adding indexes on the fields whose contents you are comparing. I am not sure offhand whether a mega-join is what you need, or whether it is enough to write the primary keys of the suspect records into a holding table and list the problems later. That is, do you need to know in the result set why each record is suspect?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose
You can do this with a single CASE expression. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Note that I have to check for the conditions of the third test before the second test.
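With InputData As
(
Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
, Case
-- Test 1: different customer, same sale date and quantity
When O2.customer_id <> O.customer_id
And O2.sale_date = O.sale_date
And O2.quantity = O.quantity Then 1
-- Test 3: same customer, within 5 days, quantities differ by at most 20
When O2.customer_id = O.customer_id
And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
And O2.quantity <> O.quantity
And Abs( O.quantity - O2.quantity ) <= 20 Then 3
-- Test 2: same customer and quantity, within 5 days
When O2.customer_id = O.customer_id
And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
And O2.quantity = O.quantity Then 2
Else 0
End As MarkedForReview
From Orders As O
Left Join Orders As O2
On O2.order_id <> O.order_id
And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id, MarkedForReview
From InputData
Where MarkedForReview <> 0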
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.
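For anyone who wants to experiment with the self-join idea without a SQL Server instance, here is a rough sketch using Python's sqlite3. It is an adaptation, not the query above: SQLite has no DateDiff, so the day difference is computed with julianday, and the sample orders, the rule_no column, and the other_order_id column (the complementary order mentioned above) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     product_id INTEGER, quantity INTEGER,
                     sale_date TEXT, price REAL);
INSERT INTO orders VALUES
  (1, 100, 7, 10, '2024-01-01', 5.0),
  (2, 200, 7, 10, '2024-01-01', 5.0),  -- rule 1 with order 1
  (3, 300, 8, 50, '2024-02-01', 9.0),
  (4, 300, 8, 50, '2024-02-04', 9.0),  -- rule 2 with order 3
  (5, 400, 9, 30, '2024-03-01', 2.0),
  (6, 400, 9, 45, '2024-03-03', 2.0),  -- rule 3 with order 5
  (7, 500, 9, 30, '2024-06-01', 2.0);  -- matches nothing
""")

suspects = conn.execute("""
SELECT order_id, other_order_id, rule_no
FROM (
  SELECT o.order_id,
         o2.order_id AS other_order_id,
         CASE
           -- Rule 1: different customer, same sale date and quantity
           WHEN o2.customer_id <> o.customer_id
                AND o2.sale_date = o.sale_date
                AND o2.quantity = o.quantity THEN 1
           -- Rule 2: same customer and quantity, within 5 days
           WHEN o2.customer_id = o.customer_id
                AND o2.quantity = o.quantity
                AND ABS(julianday(o2.sale_date) - julianday(o.sale_date)) <= 5 THEN 2
           -- Rule 3: same customer, quantities within 20 units, within 5 days
           WHEN o2.customer_id = o.customer_id
                AND ABS(o2.quantity - o.quantity) <= 20
                AND ABS(julianday(o2.sale_date) - julianday(o.sale_date)) <= 5 THEN 3
           ELSE 0
         END AS rule_no
  FROM orders AS o
  JOIN orders AS o2
    ON o2.product_id = o.product_id   -- every rule requires the same product
   AND o2.order_id <> o.order_id      -- never compare an order with itself
) AS pairs
WHERE rule_no <> 0
ORDER BY order_id, other_order_id
""").fetchall()

for row in suspects:
    print(row)
```

Each suspect pair shows up twice, once from each side, which matches the remark that both orders triggering a review are returned.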