Finding alphabetical position in a large list - database

I have an as400 table containing roughly 1 million rows of full names / company names which I would like to convert to use another datastore while still matching the speed of the original.
Currently, a user enters the search and almost instantaneously gets the alphabetical position of the search term in the table and a page of matches. The user can then paginate either up or down through the records very quickly.
There is almost no updating of the data and approximately 50 inserts per week. I'm thinking that any database can maintain an alphabetical index of the names, but I'm unsure of how to quickly find the position of the search within the dataset. Any suggestions are greatly appreciated.

This sounds just like a regular pagination of results, except that instead of going to a specific page based on a page number or offset being requested, it goes to a specific page based on where the user's search fits in the results alphabetically.
Let's say you want to fetch 10 rows after this position, and 10 rows before.
If the user searches for 'Smith', you could do two selects such that:
SELECT
name
FROM
companies
WHERE
name < 'Smith'
ORDER BY
name DESC
LIMIT 10
and then
SELECT
name
FROM
companies
WHERE
name >= 'Smith'
ORDER BY
name
LIMIT 10
You could do a UNION to fetch that in one query; the above is just simplified.
The term the user searched for would fit half way through these results. If there are any exact matches, then the first exact match will be positioned such that it is eleventh.
Note that if the user searches for 'aaaaaaaa' then they'll probably just get the 10 first results with nothing before it, and for 'zzzzzzzz' they may get just the 10 last results.
I'm assuming that the SQL engine in question allows >= and < comparisons between strings (and can optimise that in indexes), but I haven't tested this, so it may not be possible. If, like MySQL, it supports internationalized collations then you could even have the ordering done correctly for non-ASCII characters.
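As a rough illustration of the two queries above, here is a minimal Python sketch using the standard sqlite3 module. SQLite only stands in for whatever datastore ends up being used; the sample rows, the index name idx_companies_name and the helper page_around are made up for the example. It runs both selects and flips the first result so the whole page reads in ascending order.

```python
import sqlite3


def page_around(conn, term, before=10, after=10):
    """Return up to `before` names sorting below `term`, then up to `after` names at or above it."""
    prev_rows = conn.execute(
        "SELECT name FROM companies WHERE name < ? ORDER BY name DESC LIMIT ?",
        (term, before),
    ).fetchall()
    next_rows = conn.execute(
        "SELECT name FROM companies WHERE name >= ? ORDER BY name LIMIT ?",
        (term, after),
    ).fetchall()
    # The first query comes back in descending order, so reverse it for display.
    return [row[0] for row in reversed(prev_rows)] + [row[0] for row in next_rows]


# Hypothetical sample data; the real table would hold about a million rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_companies_name ON companies (name)")
sample = ["Adams", "Baker", "Jones", "Smith", "Smythe", "Taylor", "Young"]
conn.executemany("INSERT INTO companies (name) VALUES (?)", [(n,) for n in sample])

print(page_around(conn, "Smith", before=2, after=2))
```

With this sample data the search term lands in the middle of the page, and a term that sorts before or after everything simply gets a shorter page.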

If by "the position of the search" you mean the number of the record if they were enumerated alphabetically, you may want to try something like:
select count(*) from companies where name < 'Smith'
Most databases ought to optimize that reasonably well (but try it--theories you read on the web don't trump empirical data).
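To try the count on a small scale, a sketch along these lines may help. It again uses Python's sqlite3 module as a stand-in engine, with invented sample rows and a hypothetical helper alphabetical_position, and it prints the query plan so you can check whether an index is being used rather than take it on trust.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_companies_name ON companies (name)")
sample = ["Adams", "Baker", "Jones", "Smith", "Smythe", "Taylor", "Young"]
conn.executemany("INSERT INTO companies (name) VALUES (?)", [(n,) for n in sample])


def alphabetical_position(conn, term):
    """1-based position `term` would take if the names were listed alphabetically."""
    (below,) = conn.execute(
        "SELECT COUNT(*) FROM companies WHERE name < ?", (term,)
    ).fetchone()
    return below + 1


print(alphabetical_position(conn, "Smith"))

# Ask SQLite how it intends to run the count; the last column holds the description.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM companies WHERE name < ?", ("Smith",)
):
    print(row[-1])
```

Other engines have their own way of showing a plan, so the second half is specific to SQLite.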

Just to add to the ordering suggestions:
Add an index to the name if this is your standard means of data retrieval.
You can paginate by combining LIMIT and OFFSET, although large offsets get slower because the engine still has to step over the skipped rows.
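One alternative for paging forward is to seek from the last row shown instead of counting an offset. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module as a stand-in engine, assumes the table has a unique id column to break ties between identical names, and the helpers first_page and next_page are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_companies_name ON companies (name)")
sample = ["Adams", "Baker", "Jones", "Smith", "Smith", "Taylor"]
conn.executemany("INSERT INTO companies (name) VALUES (?)", [(n,) for n in sample])


def first_page(conn, size=10):
    """First page in alphabetical order, ties broken by id."""
    return conn.execute(
        "SELECT id, name FROM companies ORDER BY name, id LIMIT ?", (size,)
    ).fetchall()


def next_page(conn, last_name, last_id, size=10):
    """Rows strictly after (last_name, last_id), found by seeking rather than skipping."""
    return conn.execute(
        "SELECT id, name FROM companies "
        "WHERE name > ? OR (name = ? AND id > ?) "
        "ORDER BY name, id LIMIT ?",
        (last_name, last_name, last_id, size),
    ).fetchall()


page = first_page(conn, size=2)
while page:
    print(page)
    last_id, last_name = page[-1]
    page = next_page(conn, last_name, last_id, size=2)
```

Paging backwards works the same way with the comparisons and the sort order reversed.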

Related

Fulltext index search has large number of page reads

I have a full-text index on a column in a table that contains data like this:
searchColumn
90210 Brooks Diana Miami FL diana.brooks@email.com 5612233395
The column is an aggregate of Zip, last name, first name, city, state, e-mail and phone number.
I use this column to search for a customer based on any of this possible information.
The issue I am worried about is with the high number of reads that occurs when doing a query on this column. The query I am using is:
declare @searchTerm varchar(100) = ' "FL" AND "90210*" AND "Diana*" AND "Brooks*" '
select *
from CustomerInformation c
where contains(c.searchColumn, @searchTerm)
Now, when running Profiler I can see that this search takes about 50,000 page reads to return a single row, as opposed to a different approach using regular indexes and multiple variables, broken down like @FirstName and @LastName, as below:
WHERE C.FirstName like coalesce(@FirstName + '%' , C.FirstName)
AND C.LastName like coalesce(@LastName + '%' , C.LastName)
etc.
Using this approach I get only around 140 page reads. I know the approaches are quite different, but I'm trying to understand why the full-text version has so many more reads and if there is any way I can bring that down to something closer to the numbers I get when using regular indexes.
I have a couple of thoughts on this. First the Select * will generate a great number of page reads because it has to pull all columns which may or may not be indexed. When you pull every column it most likely will not make use of the best Index plan out there.
As to your Where clauses, when using the @searchTerm and the value of "FL" AND "90210*" AND "Diana*" AND "Brooks*" it has to check the datapages multiple times each time it is run. Think of how you would look up this information if you had to do it. You look at a piece of paper with the info on it and see if the search column contains FL. Now does it contain FL and 90210*. Now does it contain both of those plus Diana...etc.
You can see why it would keep having to go back to the page to read over and over again. The second query only has to look at 2 columns narrowly defined.
If you want more information on this, I would suggest a class by Brent Ozar that is free right now.
How to think like the SQL Server Engine
I hope that helps.

Database design for storing LCS information?

I have a table that contains 2 columns: one is an id, and the other is a column containing long strings
eg.
Id strings
1 AGTTAGGACCTTACTCTATATCTGTTCTGTTGGTATGGAG
2 GTACTTGTATTCTGATATCTAGGGTTTTCTAATTACTTCTG
3 GTATTCTCTTTCTAGCTGATCGTAATTAAATCTTATCTAA
When the user performs a search, I find the longest common subsequence between the search string and each row of data in the table.
Eg. the search sequence is
TCTGTTCTG
1. It's a 100% match, with the whole match found.
2. The LCS is TCTGTTCTG, but with some gaps.
3. The LCS is TCTGTTCT, with some gaps in between.
Is there a way to store information about the match: exactly where it started, up to where it ran, then where it started again, and so on?
So, that I can represent data in somewhat this format
First one =>
AGTTAGGACCTTACTCTATATCTGTTCTGTTGGTATGGAG
                    |||||||||
                    TCTGTTCTG
Second one =>
GTACTTGTATTCTGATATCTAGGGTTTTCTAATTACTTCTG
| || | |||||
T CT G TTCTG
Basically, I would like to store the start and end position of each subsequence found for each sequence, so that when I show this page again in the future, I don't have to compute this match again and can just pick the start and end data out of the database and show it in the format above. I know the question might be a little hazy, but please let me know how else I can elaborate if you have any doubts.
The first case is easy enough using PATINDEX.
Case 1:
select Id, PATINDEX('%TCTGTTCTG%', strings) FROM table
That should return the Id of all 'Full' matches and the starting position of the match.
Case 2:
select id, PATINDEX('%T%C%T%G%T%T%C%T%G%', strings) FROM table
This one appears to return a value for a partial match (it does not select the 'best' partial match).
Will get back to it when I can, lots of edge cases from what I see.
(Edge cases: What if there are multiple full matches, do you need to return a match with the least amount of gap in it or just a match with gaps? Same goes for partial matches)
That should give you a start, while I think about the rest of it.
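For the storage side of the question, one possible shape is to compute the match outside the database and save it as a list of segments. The following Python sketch is an assumption-laden illustration, not a tested design: match_segments, lcs_pairs and to_segments are hypothetical helpers, and each segment is a (start in stored string, start in search string, length) tuple with 0-based positions. It tries a plain substring match first (the full-match case) and otherwise falls back to one longest common subsequence, without trying to pick the alignment with the fewest gaps.

```python
def lcs_pairs(text, query):
    """Index pairs (text_index, query_index) of one longest common subsequence."""
    n, m = len(text), len(query)
    # table[i][j] holds the LCS length of text[i:] and query[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if text[i] == query[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if text[i] == query[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def to_segments(pairs):
    """Collapse consecutive index pairs into (text_start, query_start, length) runs."""
    segments = []
    for ti, qi in pairs:
        if segments:
            ts, qs, length = segments[-1]
            if ti == ts + length and qi == qs + length:
                segments[-1] = (ts, qs, length + 1)
                continue
        segments.append((ti, qi, 1))
    return segments


def match_segments(text, query):
    """Segments describing where `query` matches inside `text`."""
    start = text.find(query)
    if start != -1:  # whole search string found without gaps
        return [(start, 0, len(query))]
    return to_segments(lcs_pairs(text, query))


print(match_segments("AGTTAGGACCTTACTCTATATCTGTTCTGTTGGTATGGAG", "TCTGTTCTG"))
print(match_segments("GTACTTGTATTCTGATATCTAGGGTTTTCTAATTACTTCTG", "TCTGTTCTG"))
```

Each tuple could then be stored as one row keyed by the string's Id and the search string, and the display rebuilt later from the rows alone.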

SQL Server - Percent based Full Text Search

I want to search a particular column of a table in such a way that the returned result set satisfies the following 2 conditions:
The result set should contain records in which 90% of the characters match the given search text.
The result set should contain records in which 70% of the characters match the given search text consecutively.
It implies that when 10 character word Sukhminder is searched, then:
it should return records like Sukhmindes, ukhminder, Sukhmindzr, because it fulfils both of the above mentioned conditions.
But it should not return records like Sukhmixder because it does not fulfil the second condition. Likewise, It should not return record Sukhminzzz because it does not fulfil the first condition.
I am trying to use Full Text Search feature of SQL Server. But, could not formulate the required query yet. Kindly reply ASAP.
You could try using a combination of the SOUNDEX command and DIFFERENCE command with full text searching.
Check out this Google book online which talks about it
Do you mean 70% of the original word? I think the only way you could do this exactly as stated would be to work out all possible string permutations that could match the 70% criteria and bring back records matching any of those
Col LIKE '%min%' AND (
Col LIKE '%Sukhmin%' OR Col LIKE '%ukhmind%'
OR Col LIKE '%khminde%' OR Col LIKE '%hminder%' )
then do further processing to see if the 90% criteria is met.
Edit: Actually you might find this link on Fuzzy Searching to be of interest http://anastasiosyal.com/archive/2009/01/11/18.aspx
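The "further processing" step could look something like the Python sketch below. It rests on one interpretation of the two conditions, which is an assumption: "90% of the characters match" is read as the longest common subsequence covering at least 90% of the search word, and "70% consecutive" as the longest common substring covering at least 70% of it. The function names are made up, and the comparison is case-sensitive.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_run(a, b):
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def fuzzy_match(candidate, search, char_pct=90, run_pct=70):
    """True when both percentage conditions hold, measured against len(search)."""
    n = len(search)
    # Integer arithmetic avoids floating-point surprises at the thresholds.
    enough_chars = lcs_length(candidate, search) * 100 >= char_pct * n
    enough_run = longest_common_run(candidate, search) * 100 >= run_pct * n
    return enough_chars and enough_run


for word in ["Sukhmindes", "ukhminder", "Sukhmindzr", "Sukhmixder", "Sukhminzzz"]:
    print(word, fuzzy_match(word, "Sukhminder"))
```

Under this reading the five example words from the question come out as accepted or rejected the way the question describes, so it may be a usable filter to run over the rows the LIKE conditions bring back.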

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an error in the application where I select from a table where the ID is generated in sequence, but the ID is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above are 3 IDs. I had code where these would be the results of a
select @p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember Sybase saves the last returned record into a variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above in an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or something technical magic stuff, the order could be messed up. The original error was when the procedure would return say 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easy as plonking a order by id desc would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOP n or similar. The problem comes when someone tries to understand your code, because some databases when they see multiple hits take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
Is this by any chance related to 1156837? :)

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects by using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords' and then using CONTAINS to search for the keywords the user inputs in the search box in that index.
So, for example, let's say you have the Movie object, and you want to let the user search for keywords in the title and plot of the movie. Then we'd index both the Title and Plot columns, and do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') <-- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that, removed the "noise-words" from the list, and now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up, my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle is the same), I don't get anything, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum where people that specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While CONTAINS finds every 'Terminator' title, the LIKE condition in the WHERE clause eliminates 'Terminator 1'.
Of course the engine is smart enough to start with the CONTAINS not the like condition.
