SQL Server: one table fulltext index vs multiple table fulltext index

I have a search procedure that has to search five tables for the same string. Which of the following is better for read performance?
1. Combine all the tables into one table, then add a full-text index on it.
2. Create a full-text index on each of the tables, query all of them, then combine the results.

Something to consider when working on performance is that reading data once is almost always faster than reading it, moving it, and then reading it again.
So if you are combining all the tables into a single temporary table or table variable, this will most likely be slower, because the server has to query all that data and move it first (depending on how much data you are working with, this may or may not make much of a difference). As for your table structure, indexing on strings only really becomes efficient when the same string shows up a number of times throughout the tables.
For example, if you are searching on months (Jan, Feb, Mar, etc.), an index will be great because the value can only be 1 of 12 options; the more often a value is repeated and the fewer options there are to choose from, the better an index is. Whereas if you are searching on, say, user-entered values ("I am writing my life story....") and looking for a word inside that string, an ordinary index may not make any difference.
So assuming your query is searching on months, as in my example, you could do something like this:
SELECT value
FROM (
    SELECT value FROM table1
    UNION
    SELECT value FROM table2
    UNION
    SELECT value FROM table3
    UNION
    SELECT value FROM table4
    UNION
    SELECT value FROM table5
) t
WHERE t.value = 'Jan'
This combines your results into a single result set without moving data around. The query optimizer will also be able to find the most efficient way to query each table using the index on that table. (Note that UNION removes duplicate values; use UNION ALL if you want to keep them.)
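Since the question is about full-text indexes, the same pattern can be sketched with a full-text predicate instead of an equality test. This is only an illustration under assumptions: it presumes each of table1..table5 already has a full-text index on its value column, and @term is a hypothetical parameter that does not appear in the original post.

```sql
-- Assumption: table1..table5 each have a full-text index on the "value" column.
-- @term is a hypothetical parameter; the double quotes make it a simple term.
DECLARE @term nvarchar(100) = N'"story"';

-- Each branch is answered by that table's own full-text index;
-- UNION ALL concatenates the matches without a duplicate-removal step.
SELECT 'table1' AS source_table, value FROM table1 WHERE CONTAINS(value, @term)
UNION ALL
SELECT 'table2', value FROM table2 WHERE CONTAINS(value, @term)
UNION ALL
SELECT 'table3', value FROM table3 WHERE CONTAINS(value, @term)
UNION ALL
SELECT 'table4', value FROM table4 WHERE CONTAINS(value, @term)
UNION ALL
SELECT 'table5', value FROM table5 WHERE CONTAINS(value, @term);
```

The source_table column is optional; it is included so the caller can tell which table each hit came from.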

Related

WHERE clause partly based on a field value from another table: is there a better way than exec with a dynamic string?

I have a table with about 50,000 records (a global index of corporate and government bonds).
I would like the user to be able to filter this master index firstly into a smaller subset index (based on permanent logic), and then apply further run time criteria that vary each time.
For example, let's say the user wanted to start from one of many subset indices of bonds, let's say of government bonds only, rather than government and corporate bonds, and also only wanted the US$ government bond index specifically. This would be a permanently defined subset of the master index, with a where clause something like "[Level1]='Government' AND [Currency]='USD' AND [CountryCode]='US'"
At run time, the user would additionally request additional criteria, say for example "AND [IssueSize] > 1,000,000,000 AND [Yield] > 0.0112".
I initially thought of having a separate table that stored the different criteria for these permanent sub-indices as where clauses, for example it might have columns "IndexCode, IndexLogic", and using the example above the values would be "UST", "[Level1]='Government' and [Currency]='USD' AND [CountryCode]='US'", and there would be dozens of rows in this table defining commonly used bond indices.
I had originally thought of building a dynamic string at run time, where the user supplies their choice of sub-index code ('UST' in the example above), which adds the relevant where conditions plus any additional criteria passed as separate parameters, and then running it with an exec(@tsql) type command. I had also thought of perhaps having a where clause that was a function call, but this seems very inefficient?
Is the dynamic string method the best way of doing this, or is there a better way involving some kind of 'eval' function equivalent which can take a field value and use that as a where clause?
The problem here is you don't know in advance what the filtered index is.
A solution I have used in this situation, where the filtered index can change often, is to pull the definition of the filter back into the client app and use that to dynamically generate the SQL batch. You can also do this with dynamic SQL in a stored procedure:
SELECT ISNULL(
    (SELECT i.filter_definition
     FROM sys.indexes i
     WHERE i.object_id = OBJECT_ID(@tablename) AND
           i.name = @indexname AND i.has_filter = 1),
    '(1=1)');
You pass the table name and the index name, and you get back the exact filter definition for the index. A benefit is that if the index is dropped, the condition becomes (1=1), i.e. every row. You can change this to (1=0) to return nothing instead.
Then you concatenate this into your dynamic query like so:
SELECT *
FROM table
WHERE regular_conditions_here
AND concated_filter_here
If you are joining other tables, I would advise you to put the filter in a subquery; otherwise you may get column name clashes, as the filter definition has no aliases.
SELECT *
FROM (SELECT * FROM table WHERE concated_filter_here) table
JOIN othertables etc
WHERE regular_conditions_here
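Put together, the approach might look like the sketch below. The table name dbo.BondIndex, the index name IX_BondIndex_UST, and the parameters @MinIssueSize and @MinYield are hypothetical names chosen for illustration; only the filter text fetched from sys.indexes is concatenated, while the run-time criteria are passed as real parameters through sp_executesql.

```sql
-- Hypothetical names: dbo.BondIndex, IX_BondIndex_UST, @MinIssueSize, @MinYield.
DECLARE @tablename sysname = N'dbo.BondIndex';
DECLARE @indexname sysname = N'IX_BondIndex_UST';
DECLARE @filter nvarchar(max);
DECLARE @sql nvarchar(max);

-- Fetch the filter definition; fall back to (1=0) so a missing index returns no rows.
SELECT @filter = ISNULL(
    (SELECT i.filter_definition
     FROM sys.indexes AS i
     WHERE i.object_id = OBJECT_ID(@tablename)
       AND i.name = @indexname
       AND i.has_filter = 1),
    N'(1=0)');

-- Rebuild the table name from metadata and quote it rather than trusting the input string.
SET @sql = N'SELECT b.* FROM (SELECT * FROM '
    + QUOTENAME(OBJECT_SCHEMA_NAME(OBJECT_ID(@tablename))) + N'.'
    + QUOTENAME(OBJECT_NAME(OBJECT_ID(@tablename)))
    + N' WHERE ' + @filter + N') AS b'
    + N' WHERE b.IssueSize > @MinIssueSize AND b.Yield > @MinYield;';

-- @sql is NULL if the table does not exist, so guard before executing.
IF @sql IS NOT NULL
    EXEC sys.sp_executesql
        @sql,
        N'@MinIssueSize bigint, @MinYield decimal(9,6)',
        @MinIssueSize = 1000000000,
        @MinYield = 0.0112;
```

Keeping the user-supplied values as parameters means the only dynamic text is the filter that the database itself stored.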

Performance tuning on PATINDEX with JOIN

I have a table called tbl_WHO with 90 million records and a temp table #EDU with just 5 records.
I want to do pattern matching on the name field between the two tables (tbl_WHO and #EDU).
Query: the following query took 00:02:13 to execute.
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0
)
Sometimes I have to do pattern matching on more than one column, like:
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0 AND
(ISNULL(PATINDEX('%'+Tbl.PAddress+'%',Tmp.Addres),'0')) > 0 OR
(ISNULL(PATINDEX('%'+Tbl.PZipCode,Tmp.ZCode),'0')) > 0
)
Note: There is an index on each of the columns used in the conditions.
Is there any other way to tune the query performance?
Searches that start with % are not sargable, so even with an index on the given column you will not be able to use it effectively.
Are you sure you need to search with PATINDEX each time? A table with 90 million records is not huge, but having many columns and not applying normalization correctly can certainly decrease performance.
I would advise revisiting the table to check whether the data can be normalized further. This can lead to better performance in particular cases and reduce the table's storage as well.
For example, the zip code can be moved to a separate table, so that instead of using the zip code string you join on an integer column. Try to normalize the address further: do you have city, street or block, street or block number? As for the names, if you need to search by first and last name, just split them into separate columns.
String values can be sanitized, for example by removing whitespace at the beginning and at the end (trim). With such data, we can create hash indexes and get extremely fast equality searches.
What I want to say is that if you normalize your data and add some rules (at the database and application level) to ensure the input data is correct, you are going to get very good performance. It is the long way, but you are going to have to do it at some point, and it is easier to do now than later.
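As one possible illustration of the sanitize-then-compare idea, the sketch below stores a trimmed copy of the name and joins on equality. It deliberately swaps the pattern match for an exact match, so it only applies if exact matching on cleaned names is acceptable. The column name PNameClean and the index name are hypothetical, and it uses an ordinary nonclustered index rather than a hash index. It also assumes PName is a bounded (non-MAX) string type small enough to be an index key.

```sql
-- Hypothetical column/index names. Adding a persisted column to a 90M-row table
-- is a heavy, one-time operation, so treat this as a sketch, not a recipe.
ALTER TABLE tbl_WHO
    ADD PNameClean AS LTRIM(RTRIM(PName)) PERSISTED;
GO  -- batch separator understood by SSMS/sqlcmd

-- The INCLUDE columns let the query below be answered from the index alone.
CREATE INDEX IX_tbl_WHO_PNameClean
    ON tbl_WHO (PNameClean)
    INCLUDE (PName, PStatus);
GO

-- Equality on the cleaned column is sargable, so each of the 5 rows in #EDU
-- can be resolved with an index seek instead of scanning tbl_WHO.
SELECT Tbl.PName, Tbl.PStatus
FROM #EDU AS Tmp
INNER JOIN tbl_WHO AS Tbl
    ON Tbl.PNameClean = LTRIM(RTRIM(Tmp.FirstName));
```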

How do I return the database rows in a custom order?

I have a database table with two columns: a number and a name. The table is as follows:
I am expecting the result of select * from table where num=6 or num=3 to be:
But what I get is:
How do I order the results?
Note: I assume that your actual data is not just A, B, C, ..., F. Otherwise, you don't need a database for that, and can do it directly in your language (example in C#).
You should understand the difference between filtering and ordering.
The database returns rows in some order. When a specific order matters, you specify it in the query; when you don't, some databases may return the rows in a nearly nondeterministic order.
When you use where, you are simply filtering the results. Your query tells the database:
“Please give me every column of the table table, given that I'm only interested in the rows where num is 6 or 3.”
While you say what should be returned, you don't specify in which order.
The order by is used exactly for that:
select * from table where num=3 or num=6 order by num desc
will return the rows in the order where the highest num value will appear first.
You are not limited to asc and desc, by the way. If you need, for instance, to have an order such as 6, 7, 2, 3, you can do that with case in Microsoft SQL Server (and similar constructs in other databases). Example:
select [name] from table
where num in (2, 3, 6, 7)
order by case when [num] = 6 then 1
              when [num] = 7 then 2
              when [num] = 2 then 3
              when [num] = 3 then 4
              else 5 end asc
While this will do what you want, it's not a good solution, since you'll need to build your SQL query dynamically from the user's input, something you should avoid at all costs (because, aside from poor performance, you'll end up letting SQL injection through).
Instead:
1. Create a temporary table with two columns: an auto-incremented primary key column and a column containing numbers.
2. Insert your values (6, 7, 2, 3) in the second column.
3. Join the two tables (the table table and the temporary table).
4. Order by the primary key of the temporary table.
5. Remove the temporary table.
This has the benefit that you don't have to create your queries dynamically. The drawback is that the solution is slightly more complicated, especially if multiple users can select the data at the same time, which means that you have to choose the name of your temporary table wisely.
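The steps above can be sketched in SQL Server roughly as follows. The table #items and its rows are hypothetical sample data standing in for the question's table, and the sketch sets the sort position explicitly instead of relying on an auto-incremented key, so the order does not depend on how the rows happen to be inserted.

```sql
-- Hypothetical sample data standing in for the question's table.
CREATE TABLE #items (num int PRIMARY KEY, [name] varchar(10) NOT NULL);
INSERT INTO #items (num, [name])
VALUES (1, 'A'), (2, 'B'), (3, 'C'), (6, 'F'), (7, 'G');

-- sort_position is supplied explicitly: 6 first, then 7, 2, 3.
CREATE TABLE #wanted_order (sort_position int PRIMARY KEY, num int NOT NULL);
INSERT INTO #wanted_order (sort_position, num)
VALUES (1, 6), (2, 7), (3, 2), (4, 3);

-- The join does the filtering; ORDER BY on the helper column does the ordering.
-- Rows come back as 6, 7, 2, 3.
SELECT i.num, i.[name]
FROM #items AS i
JOIN #wanted_order AS w ON w.num = i.num
ORDER BY w.sort_position;

DROP TABLE #wanted_order;
DROP TABLE #items;
```

In SQL Server, a local temporary table (name starting with #) is private to the session that created it, so concurrent users can use the same name without clashing. In a real application the values would be inserted through parameters rather than literals.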
The easiest solution is to just do the ordering in your programming language instead of the database. Load all the required rows (don't forget the where), and order them afterwards. Since, according to your comments, the user is specifying the order, I'm pretty sure you deal with only a few rows, not thousands, which makes it an ideal solution. If you had to deal with hundreds of thousands of rows that you need to process as they arrive from the database without keeping them all in memory, then the previous solution with a temporary table (and a bulk insert) would be more appropriate.
Finally, if the number of rows in the overall table is low, you don't even have to filter the data. Just load it in memory and keep it there as a map.
Notes:
- Don't use select *. There are practically no cases where you really need it in your application. Using select * instead of explicitly specifying the columns has at least two drawbacks:
  - If a column is added later, the data set you get will be different, and it might take you time to notice that and debug the related issues. Also, if a column is removed or renamed, the bugs won't necessarily be easy to find, since the errors will occur not at the database level, with a clear, explicit error message, but somewhere within your application.
  - It has a performance impact, since the database needs to figure out which columns should be selected. If you do a select * on a table containing a few dozen columns, including some blobs or long strings, while all you need is a few columns, the performance will be quite terrible.
- If your database supports in, use it:
select * from table where num in (3, 6) order by num desc
A common construct I have seen used in numerous databases for metadata-like lists is to add a numeric column to the table with the boring name of "Column_order".
Then your select is
select list, list_value
from your_table
where enabled = 1
order by column_order;
This works great in allowing any arbitrary order you want but does not work well if you have an application in multiple languages where the order might be different by language.

join versus explicit in condition

Are there valid reasons, in an Oracle database, to prefer expressing a filter condition in a generic query as a join to a table rather than as an IN condition with a large number of elements (some hundreds)? I mean, can you write something like
SELECT .... FROM t1
WHERE t1. IN (......) with 100-200 items
or is it better to change it to
SELECT .... FROM t1
JOIN t2 ON t1. = T2.
where the t2 table contains the values needed for the filter
many thanks
Thanks for the answers
Let me try to explain the situation and my doubt.
I have a user interface where the user can choose many items in a control (for example, one or more people from a list of professionals).
I can use this list directly by adding it to an IN condition, that is
SELECT .... FROM t1 WHERE t1. IN (p1... p200)
but this solution could raise some problems:
- if many items are selected, the string can exceed the SQL string length limit (I remember a limit of 4000 bytes in Oracle)
- an IN condition with many values may be inefficient
So an alternative solution could be:
1. create a temporary table with the selected items
2. join the temporary table to the main table
Filling a temporary table is usually fast, and my question is whether this second solution is more efficient than the first.
The two queries are not functionally equivalent, so the question is somewhat odd--I can't imagine this comes up very often (if ever).
That said, if you have a table that contains exactly the rows that need to be filtered, a JOIN would be a more natural/standard way to handle it.
Is the idea in the first example to query t2 to get all the values, then add them to a collection and generate an IN clause? If so, I would say this is very bad practice.
From what I see, there are two different questions.
a) Using a Static List/table.
If the (100-200) item list is a list of static values, e.g. a list of countries or currencies, I think it would be better to put it in a static table/parameter table and change the query to use that table instead. If you need to track a new code, country, etc. later, all you need to do is insert a new code into the lookup table.
Also, if there are other queries that use the same conditions (and there usually are), this look up table will promote re-use.
select * from t1 where id in (select id from t2);
and
select * from t1,t2
where t1.id = t2.id
are both equivalent and better than
select * from t1 where
id in ('USD','EUR'..... ); -- 100 to 200 items to track.
b) The choice of Join vs IN:
It really does not matter a lot. The final query that Oracle executes will be a transformed version of your query, which might evaluate to the same query in both cases.
You should see which of the two queries is easier to read and conveys the intention correctly.
Useful Link : http://explainextended.com/2009/09/30/in-vs-join-vs-exists-oracle/
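The temporary-table alternative from the question might be sketched in Oracle as below. The table name selected_ids, the NUMBER type of id, and the bind variable :p1 are assumptions made for illustration; the original posts elide the actual column name, so id follows the naming used in the answer above.

```sql
-- Created once as part of the schema, not per request: in Oracle the definition of a
-- global temporary table is permanent, while its rows are private to each session.
CREATE GLOBAL TEMPORARY TABLE selected_ids (
    id NUMBER PRIMARY KEY
) ON COMMIT DELETE ROWS;

-- Per request: insert each selected item through a bind variable
-- (repeat, or use the client driver's batch insert, for all items).
INSERT INTO selected_ids (id) VALUES (:p1);

-- The filter becomes an ordinary join, so the statement text stays the same
-- however many items the user selects.
SELECT t1.*
FROM t1
JOIN selected_ids s ON s.id = t1.id;
```

Because of ON COMMIT DELETE ROWS, the inserted rows disappear at commit, so the insert and the select need to run in the same transaction.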

In SQL Server what is most efficient way to compare records to other records for duplicates with in a given range of values?

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
1. Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review.
2. Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review.
3. Records that have the same customer_id, product_id, but different quantities within 20 units, and sales dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, you might look at a simple ETL process to do the heavy lifting for you: you load the data into your ETL environment, run the transformations/checks/comparisons there, and then write the results to, say, a staging table that outputs the stats you need, which keeps the load on the database manageable. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I think being efficient here will mean adding indexes on the fields whose contents you are comparing. I'm not sure offhand whether a mega-join is what you need, or whether you just want to write the primary keys of the suspect records to a holding table and list the problems later. In other words, do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
-- 1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose
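The second criterion can be written as a self-join in the same style. The sketch below is only an illustration: the #orders temp table and its rows are hypothetical sample data, and the order_id comparison is an added choice so that each suspect pair is reported once rather than twice.

```sql
-- Hypothetical sample data; the columns follow the question's orders table.
CREATE TABLE #orders (
    order_id    int PRIMARY KEY,
    customer_id int NOT NULL,
    product_id  int NOT NULL,
    quantity    int NOT NULL,
    sale_date   date NOT NULL,
    price       decimal(10,2) NOT NULL
);
INSERT INTO #orders (order_id, customer_id, product_id, quantity, sale_date, price)
VALUES (1, 10, 100, 5, '20240101', 9.99),
       (2, 10, 100, 5, '20240104', 9.99),
       (3, 10, 100, 5, '20240120', 9.99),
       (4, 11, 100, 5, '20240102', 9.99);

-- Criterion 2: same customer, product and quantity, sale dates within five days.
-- o2.order_id > o.order_id reports each pair once and never pairs a row with itself.
SELECT o.order_id, o2.order_id AS suspect_order_id
FROM #orders AS o
JOIN #orders AS o2
    ON  o2.customer_id = o.customer_id
    AND o2.product_id  = o.product_id
    AND o2.quantity    = o.quantity
    AND o2.order_id    > o.order_id
    AND ABS(DATEDIFF(day, o.sale_date, o2.sale_date)) <= 5;
-- With the sample rows above, only the pair (1, 2) qualifies.

DROP TABLE #orders;
```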
You can do this in a single Case expression. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Each branch spells out its full condition, so the branches do not overlap.
With InputData As
    (
    Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            -- Test 1: same product, date and quantity, different customer
            When O2.customer_id <> O.customer_id
                And O2.sale_date = O.sale_date
                And O2.quantity = O.quantity Then 1
            -- Test 3: same customer, quantities differ by at most 20, dates within five days
            When O2.customer_id = O.customer_id
                And O2.quantity <> O.quantity
                And Abs( O.quantity - O2.quantity ) <= 20
                And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 3
            -- Test 2: same customer and quantity, dates within five days
            When O2.customer_id = O.customer_id
                And O2.quantity = O.quantity
                And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
            Else 0
            End As MarkedForReview
    From Orders As O
        Left Join Orders As O2
            On O2.order_id <> O.order_id
                And O2.product_id = O.product_id
    )
Select order_id, product_id, sale_date, quantity, customer_id, MarkedForReview
From InputData
Where MarkedForReview <> 0
By the way, if you are using a version prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will be returned.
