Fulltext index search has large number of page reads - sql-server

I have a full-text index on a column in a table that contains data like this:
searchColumn
90210 Brooks Diana Miami FL diana.brooks#email.com 5612233395
The column is an aggregate of Zip, last name, first name, city, state, e-mail and phone number.
I use this column to search for a customer based on any of this possible information.
The issue I am worried about is with the high number of reads that occurs when doing a query on this column. The query I am using is:
declare #searchTerm varchar(100) = ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" '
select *
from CustomerInformation c
where contains(c.searchColumn, #searchTerm)
Now, when running Profiler I can see that this search has about 50.000 page reads to return a single row, as opposed to when using a different approach using regular indexes and multiple variables, broken down like #firstName, #LastName, like below:
WHERE C.FirstName like coalesce(#FirstName + '%' , C.FirstName)
AND C.LastName like coalesce(#LastName + '%' , C.LastName)
etc.
Using this approach I get only around 140 page reads. I know the approaches are quite different, but I'm trying to understand why the full-text version has so much more reads and if there is any way I can bring that down to something closer to the numbers I get when using regular indexes.

I have a couple of thoughts on this. First the Select * will generate a great number of page reads because it has to pull all columns which may or may not be indexed. When you pull every column it most likely will not make use of the best Index plan out there.
As to your Where clauses, when using the #searchTerm and the value of "FL" AND "90210*" AND "Diana*" AND "Brooks*" it has to check the datapages multiple times each time it is run. Think of how you would look up this information if you had to do it. You look at a piece of paper with the info on it and see if the search column contains FL. Now does it contain FL and 90210*. Now does it contain both of those plus Diana...etc.
You can see why it would keep having to go back to the page to read over and over again. The second query only has to look at 2 columns narrowly defined.
If you want more information on this, I would suggest a class by Brent Ozar that is free right now.
How to think like the SQL Server Engine
I hope that helps.

Related

SQL Server Performance tips for like %

I have a table with 5-10 million records which has 2 fields
example data
Row Field1 Field2
------------------
1 0712334 072342344
2 06344534 083453454
3 06344534 0845645565
Given 2 variables
variable1 : 0634453445645
variable2 : 08345345456756
I need to be able to query the table for best matches as fast as possible
The above example would produce 1 record (e.g row 2)
What would be the fastest way to query the database for matches?
Note : the data and variables are always in this format (i.e always a number, may or may not have a leading zero, and fields are not unique however the combination of both will be )
My initial thought was to do something like this
Select blah where Field1 + "%" like variable1 and Field2 + "%" like variable2
Please forgive my pseudo-code if it's not correct, as this is more a fact-finding mission. However I think I'm in the ball park.
Note : I don't think any indexing can help here, though a memory-based table I'm guessing would speed this up.
Can anyone think of a better way of solving the problem?
You can get a plan with a seek on an index on Field1 with query like this.
declare #V1 varchar(20) = '0634453445645'
declare #V2 varchar(20) = '08345345456756'
select Field1,
Field2
from YourTable
where Field1 like left(#V1, 4) + '%' and
#V1 like Field1 + '%' and
#V2 like Field2 + '%'
It does a range seek on the first four characters on Field1 and uses the full comparison on Field1 and Field2 in a residual predicate.
There is no performance tip. SImple like that.
%somethin% is table scan, Indices are not used due to the beginning %. Ful ltext indexing won't work as it is not a full text you seek but part of a word.
Getting a faster machine to handle the table scans and denormalizing is the only thing you can do. 5-10 million rows should be faste enough on a decent computer. Memory based table is not needed - just enough RAM to cache that table.
And that pretty much is it. Either find a way to get rid of the initial % or get hardware (mostly memory) fast enough to handle this.
OR - handle it OUTSIDE sql server. Load the 5-10 million rows into a search service and use a better data structure. SQL being generic has to make compromises. But again, the partial match will kill pretty much most approaches.
Postgres has trigram indexes http://www.postgresql.org/docs/current/interactive/pgtrgm.html
Maybe SQL Server has something like that?
What is the shortest length in Column 'Field1' and 'Field2'? Call this number 'N'.
Then create a select statement which asks for all substrings starting at the first character of length N to the length of each variable. Example (say, N=10)
select distinct * from myTable
where Field1 in ('0634453445','06344534456','063445344564', '0634453445645')
and Field2 in ('0834534545','08345345456','083453454567', '0834534545675','08345345456756')
Write a small script which creates the query for you. Of course there is much more to optimize but this requires (imho) changes in the structure of your table and I can imagine that this is something you don't want. At least you can give it a fast try.
Also, you should include the query plan when you try this approach in SSMS. The query plan will give you a nice hint in how to organize your index.

Find columns that match in two tables

I need to query two tables of companies in the first table are the full names of companies, and the second table are also the names but are incomplete. The idea is to find the fields that are similar. I put pictures of the reference and SQL code I'm using.
The result I want is like this
The closest way I found to do so:
SELECT DISTINCT
RTRIM(a.NombreEmpresaBD_A) as NombreReal,
b.EmpresaDB_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE 'VoIP%' AND b.EmpresaDB_B LIKE 'VoIP%'
The problem with the above code is that it only returns the record specified in the WHERE and if I put this LIKE '%' it returns the Cartesian product of two tables. The RDBMS is Microsoft SQL Server. I would greatly appreciate if you help me with any proposed solution.
Use the short name plus appended '%' as argument in the LIKE expression:
Edit with info that we deal with SQL Server:
SELECT a.NombreEmpresaBD_A as NombreReal
,b.NombreEmpresaBD_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE (b.NombreEmpresaBD_B + '%');
According to your screenshot you had the column name wrong!
String concatenation in T-SQL with + operator.
Above query finds a case where
'Computex S.A' LIKE 'Computex%'
but not:
'Voip Service Mexico' LIKE 'VoipService%'
For that you would have to strip blanks first or use more powerful pattern matching functions.
I have created a demo for you on data.SE.
Look up pattern matching or the LIKE operator in the manual.
I would suggest adding a foreign key between the tables linking the data. Then you can just search for the one table and join the second to get the other results.

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an error in the application where I select from a table where the ID is generated in sequence, but the ID is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above is 3 IDs. I had code where these would be the results of a
select #p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember Sybase saves the last returned record into a variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above in an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or something technical magic stuff, the order could be messed up. The original error was when the procedure would return say 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easy as plonking a order by id desc would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOPn or similar. The problem comes when someone tries to understand your code, because some databases when they see multiple hits take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
Is this by any chance related to 1156837? :)

Apostrophes and SQL Server FT search

I have setup FT search in SQL Server 2005 but I cant seem to find a way to match "Lias" keyword to a record with "Lia's". What I basically want is to allow people to search without the apostrophe.
I have been on and off this problem for quite some time now so any help will really be a blessing.
EDIT 2: just realised this doesn't actually resolve your problem, please ignore and see other answer! The code below will return results for a case when a user has inserted an apostrophe which shouldn't be there, such as "abandoned it's cargo".
I don't have FT installed locally and have not tested this - you can use the syntax of CONTAINS to test for both the original occurrence and one with the apostrophe stripped, i.e.:
SELECT *
FROM table
WHERE CONTAINS ('value' OR Replace('value', '''',''))
EDIT: You can search for phrases using double quotes, e.g.
SELECT *
FROM table
WHERE CONTAINS ("this phrase" OR Replace("this phrase", '''',''))
See MSDN documentation for CONTAINS. This actually indicates the punctuation is ignored anyway, but again I haven't tested; it may be worth just trying CONTAINS('value') on its own.
I haven't used FT, but in doing queries on varchar columns, and looking for surnames such as O'Reilly, I've used:
surname like Replace( #search, '''', '') + '%' or
Replace( surname,'''','') like #search + '%'
This allows for the apostrophe to be in either the database value or the search term. It's also obviously going to perform like a dog with a large table.
The alternative (also not a good one probably) would be to save a 2nd copy of the data, stripped of non-alpha characters, and search (also?) against that copy. So there original would contain Lia's and the 2nd copy Lias. Doubling the amount of storage, etc.
Another attempt:
SELECT surname
FROM table
WHERE surname LIKE '%value%'
OR REPLACE(surname,'''','') LIKE '%value%'
This works for me (without FT enabled), i.e. I get the same results when searching for O'Connor or OConnor.

Finding alphabetical position in a large list

I have an as400 table containing roughly 1 million rows of full names / company names which I would like to convert to use another datastore while still matching the speed of the original.
Currently, a user enters the search and almost instantaneously gets the alphabetical position of the search term in the table and and a page of matches. The user can then paginate either up or down through the records very quickly.
There is almost no updating of the data and approximately 50 inserts per week. I'm thinking that any database can maintain an alphabetical index of the names, but I'm unsure of how to quickly find the position of the search within the dataset. Any suggestions are greatly appreciated.
This sounds just like a regular pagination of results, except that instead of going to a specific page based on a page number or offset being requested, it goes to a specific page based on where the user's search fits in the results alphabetically.
Let's say you want to fetch 10 rows after this position, and 10 rows before.
If the user searches for 'Smith', you could do two selects such that:
SELECT
name
FROM
companies
WHERE
name < 'Smith'
ORDER BY
name DESC
LIMIT 10
and then
SELECT
name
FROM
companies
WHERE
name >= 'Smith'
ORDER BY
name
LIMIT 10
You could do a UNION to fetch that in one query, the above is just simplified.
The term the user searched for would fit half way through these results. If there are any exact matches, then the first exact match will be positioned such that it is eleventh.
Note that if the user searches for 'aaaaaaaa' then they'll probably just get the 10 first results with nothing before it, and for 'zzzzzzzz' they may get just the 10 last results.
I'm assuming that the SQL engine in question allows >= and < comparisons between strings (and can optimise that in indexes), but I haven't tested this, maybe you can't do this. If, like MySQL, it supports internationalized collations then you could even have the ordering done correctly for non-ascii characters.
If by "the position of the search" you mean the number of the record if they were enumerated alphabetically, you may want to try something like:
select count(*) from companies where name < 'Smith'
Most databases ought to optimize that reasonably well (but try it--theories you read on the web don't trump empirical data).
Just to add to the ordering suggestions:
Add an index to the name if this is your standard means of data retrieval.
You can paginate efficiently by combining LIMIT and OFFSET.

Resources