Need some help (search algorithm) - database

I need some help with this issue:
As input I have a string that looks like "Blue cat green eyes 2342342", or "Cat blue eyes green 23242", or any other permutation of those words.
In my DB table I have some data. One of the columns is called, say, keyWords.
Here is an example of this table:
My task is to find the record in the keyWords column that matches some of the words from the input string.
For example: for the strings "Blue cat green eyes 2342342", "Cat blue eyes green 23242", and "Cat 23242 eyes blue green", the result must be "blue cat" (the first row of my table).
The only way I can imagine solving this task looks like this:
1. Take each word from the string in turn.
2. Search for that word with %like% in the table column.
3. If it is not found, the word is not a keyword and we have no interest in it.
4. If it is found exactly once: great! No doubt, this is what we are looking for.
5. If there is more than one result:
- take each of the not-yet-tested words from the string in turn;
- search for that word with %like% within the results from step 2;
- and so on.
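In Python, the iterative narrowing described above could be sketched roughly like this (a sketch only: the table is simulated as a plain list of keyword strings, and the `in` test stands in for each %like% query against the database):

```python
def find_keyword_record(input_string, keyword_rows):
    """Narrow the candidate rows down word by word, as in the steps above."""
    words = input_string.lower().split()
    candidates = list(keyword_rows)
    for word in words:
        # Equivalent of: SELECT ... WHERE keyWords LIKE '%word%'
        matches = [row for row in candidates if word in row.lower()]
        if not matches:
            continue            # word is not a keyword; ignore it
        if len(matches) == 1:
            return matches[0]   # unique match found
        candidates = matches    # more than one result: keep filtering
    return None                 # still ambiguous (or nothing matched)

rows = ["blue cat", "blue dog", "green parrot"]
print(find_keyword_record("Cat 23242 eyes blue green", rows))  # → blue cat
```

As the question suspects, this does one scan per word, which is why the inverted-table approach in the answer below scales much better.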
A graphical schema of this algorithm is here.
But it looks like this algorithm will be very slow if the table holds a lot of records and the input string consists of a large number of words.
So, my question is: are there any special algorithms that can help solve this task?

You can add another table, such as:
ID  KeywordID  Word
1   1          blue
2   2          blue
3   1          cat
and transform the string
"Blue cat green eyes 2342342"
in a series of indexes and counts:
SELECT KeywordID, COUNT(*) FROM ancillary WHERE Word IN ('blue','cat','green','eyes'...)
This would perform a series of exact matches and return, say,
KeywordID Count
1 2
2 1
Then you know that the keyword group with ID 1 has two words, which means that a count of 2 matches all of them, so KeywordID 1 is fully satisfied. Group 2 also has two words (say, "blue" and "dog"), but only one of them was found: the match is there, but it is not complete.
If you also record the keyword set size together with keyword ID, then all keywords from the same ID will have the same KeywordSize, and you can GROUP BY it too:
KeywordID KeywordSize Count
1 2 2
2 2 1
and you can even SELECT COUNT(*)*100/KeywordSize AS match ... ORDER BY match DESC and have keyword matches sorted by relevancy.
Of course, once you have KeywordID, you can find it in the keywords table.
Implementation
You want to add the keyword list "black angry cat" to your existing table.
So you explode this keyword list into words and get "black", "angry" and "cat".
You insert the keyword list normally in the table that you already have, and retrieve the ID for that newly created row, let's say it is 1701.
Now you insert the words into a new table that we call "ancillary". This table only contains the keyword row ID of your primary table, the single word, and the size of the word list from which that word comes.
We know we are inserting 3 words in all, for table row 1701, so size=3 and we insert these tuples:
(1701, 3, 'black')
(1701, 3, 'cat')
(1701, 3, 'angry')
(These rows will receive a unique ID of their own, but that does not concern us.)
Now some time later we receive a sentence which is,
'Schroedinger cat is black and angry'
We could first run the query against a list of stop-words to be removed, such as "is" and "and". But this is not necessary.
We could also run as many queries as there are words and discover that no row anywhere contains "Schroedinger", so we could drop it. But this, too, is not necessary.
Finally we build the real query against ancillary:
SELECT KeywordID, COUNT(*) AS total, COUNT(*)*100/ListSize AS match
FROM ancillary WHERE Word IN ('Schroedinger','cat','is','black','and','angry')
GROUP BY KeywordID, ListSize;
The WHERE will return, say, these rows:
(1234, 'black') -- from 'black cat'
(1234, 'cat') -- from 'black cat'
(1423, 'angry') -- from 'angry birds'
(1701, 'cat') -- from 'black angry cat'
(1701, 'angry') -- from 'black angry cat'
(1701, 'black') -- from 'black angry cat'
(1999, 'cat') -- from 'nice white cat'
So the GROUP BY will return the KeywordID of these rows together with its cardinality and match ratio:
1423 1 50%
1701 3 100%
1234 2 100%
1999 1 33%
Now you can sort by matching ratio descending, and then by list size descending (matching 100% of 3 words is better than matching 100% of 2):
1701 3 100% -- our best match
1234 2 100% -- second runner
1423 1 50%
1999 1 33%
You can also retrieve your first table in one query, with added match ratio:
SELECT mytable.*, total, match FROM
mytable JOIN (
SELECT KeywordID, COUNT(*) AS total, COUNT(*)*100/ListSize AS match
FROM ancillary WHERE Word IN ('Schroedinger','cat','is','black','and','angry')
GROUP BY KeywordID, ListSize
) AS ancil ON (mytable.KeywordID = ancil.KeywordID)
ORDER BY match DESC, total DESC;
The largest cost is the exact match against "ancillary", which therefore has to be indexed on the Word column.
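The whole scheme can be sanity-checked in a few lines of Python against an in-memory SQLite database (a sketch only: table and column names follow the answer, and SQLite's integer division conveniently produces the truncated percentages shown above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ancillary (KeywordID INT, ListSize INT, Word TEXT);
    CREATE INDEX idx_word ON ancillary(Word);  -- the index mentioned above
""")

def add_keyword_list(keyword_id, phrase):
    """Explode a keyword list into one ancillary row per word."""
    words = phrase.split()
    con.executemany("INSERT INTO ancillary VALUES (?, ?, ?)",
                    [(keyword_id, len(words), w) for w in words])

add_keyword_list(1234, "black cat")
add_keyword_list(1423, "angry birds")
add_keyword_list(1701, "black angry cat")
add_keyword_list(1999, "nice white cat")

words = "Schroedinger cat is black and angry".lower().split()
placeholders = ",".join("?" * len(words))
rows = con.execute(f"""
    SELECT KeywordID, COUNT(*) AS total, COUNT(*)*100/ListSize AS "match"
    FROM ancillary WHERE Word IN ({placeholders})
    GROUP BY KeywordID, ListSize
    ORDER BY "match" DESC, total DESC
""", words).fetchall()
print(rows)  # → [(1701, 3, 100), (1234, 2, 100), (1423, 1, 50), (1999, 1, 33)]
```

Note that "match" is quoted here because MATCH is a keyword in SQLite; depending on your database you may prefer a different alias altogether.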

You might want to look at a full-text search engine, like Sphinx: http://sphinxsearch.com/
Or, another way: write a stored procedure that splits the search string into keywords using a specified separator, then looks for the CHARINDEX of each keyword in your DB column (the details depend on your database management system).

Related

SQL query takes 20 minutes to finish and only contains 1300 rows

I have a table with a unique string column and a department description. The length of the unique string encodes the department hierarchy: a 4-character value is the lowest level, while a 2-character value is the highest.
My goal is to create new columns that show the hierarchy levels and the corresponding department descriptions for each row, and to use these new columns as filters.
My SQL code works; however, it takes more than 20 minutes to generate results for a 1300-row table.
Is there a better way to optimize this query? Note that I'm only using one table, joined to itself several times, to create the final version I'd like to achieve.
SELECT
    m.UniqueDescription as "Department Code",
    m.DepartmentDescription as "Department",
    Left(m.UniqueDescription,2) as "Level 2 Hierarchy",
    Left(m.UniqueDescription,3) as "Level 3 Hierarchy",
    Left(m.UniqueDescription,4) as "Level 4 Hierarchy",
    l2.DepartmentDescription as "L2 Department",
    l3.DepartmentDescription as "L3 Department",
    l4.DepartmentDescription as "L4 Department"
FROM department_table m
LEFT JOIN department_table l2
    ON Left(m.UniqueDescription,2) = l2.UniqueDescription
LEFT JOIN department_table l3
    ON Left(m.UniqueDescription,3) = l3.UniqueDescription
LEFT JOIN department_table l4
    ON Left(m.UniqueDescription,4) = l4.UniqueDescription
Below is the output that I would like to achieve:
Table Format
First of all, this structure, with no numeric IDs, is not good practice.
Check that the appropriate indexes exist.
Do not use functions on the left-hand side of your ON or WHERE clauses; it prevents the execution planner from using indexes on those columns.
Instead of FUNCTION(LeftTable.Column) = value use LeftTable.Column = INVERSE_FUNCTION(value)
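The effect is easy to demonstrate. LEFT() has no true inverse, but a prefix test like LEFT(col, 2) = 'AB' can be rewritten as the sargable col LIKE 'AB%'. A small SQLite sketch with made-up data (substr() plays the role of LEFT()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE department_table "
            "(UniqueDescription TEXT, DepartmentDescription TEXT)")
con.execute("CREATE INDEX idx_code ON department_table(UniqueDescription)")
con.executemany("INSERT INTO department_table VALUES (?, ?)",
                [("AB", "Top"), ("ABC", "Mid"), ("ABCD", "Leaf")])
con.execute("PRAGMA case_sensitive_like = ON")  # lets a LIKE prefix use the index

# Non-sargable: the function wrapped around the column forces a full scan.
plan_scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM department_table "
    "WHERE substr(UniqueDescription, 1, 2) = 'AB'").fetchall()
# Sargable: a plain prefix pattern lets the planner use idx_code.
plan_seek = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM department_table "
    "WHERE UniqueDescription LIKE 'AB%'").fetchall()
print(plan_scan[0][-1])  # SCAN ...
print(plan_seek[0][-1])  # SEARCH ... idx_code ...
```

The same principle applies to the JOIN conditions above: in this particular query the functions sit on the m side, so the l2/l3/l4 lookups can still use an index on UniqueDescription, which is the first thing to verify.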

SQL Containstable Multiple columns - Which Column Contains Result

I have received a spec to add a relevance score to search results, based on which column the result is in. As an example, I have a product table with, amongst other fields, keywords, productNames and brands.
I currently find a product by joining to:
JOIN CONTAINSTABLE(Products, (keywords, productNames, brands), '"NIKE*"')
Now, this will find records containing the search term, but I need to weight the results by column: e.g. keywords scores 1, productNames scores 2, brands scores 4, etc. I can then sum the scores to give the relevancy of the result, i.e. if "Nike" is in all three columns it scores 7; if it is only in brands, 4; and so on.
To facilitate this I need to know which columns containstable matches on, but haven't found any details on that.
I've looked at the ISABOUT option, but that's for weighting multiple search terms in a single column.
At the moment I have a CASE statement:
CASE WHEN CONTAINS (Keywords, '"Nike*"') THEN 1 ELSE 0 END +
CASE WHEN CONTAINS (productNames, '"Nike*"') THEN 2 ELSE 0 END +
CASE WHEN CONTAINS (brands, '"Nike*"') THEN 4 ELSE 0 END
AS Relevance
This does work, but it seems very wasteful, since CONTAINSTABLE must already be doing this work.
If anyone has any ideas then they'll be gratefully received.
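The thread has no answer, but the scoring rule in the spec is easy to pin down. A small Python sketch of the intended per-column weighting (plain substring matching stands in for the full-text CONTAINS predicate; the column names and weights come from the question, the sample product is made up):

```python
# Per-column weights from the spec: keywords=1, productNames=2, brands=4.
WEIGHTS = {"keywords": 1, "productNames": 2, "brands": 4}

def relevance(product, term):
    """Sum the weights of the columns the search term appears in."""
    term = term.lower()
    return sum(weight for column, weight in WEIGHTS.items()
               if term in product.get(column, "").lower())

nike_shoe = {"keywords": "nike running shoes",
             "productNames": "Nike Air Zoom",
             "brands": "Nike"}
print(relevance(nike_shoe, "Nike"))  # present in all three columns → 7
```

In SQL Server this maps naturally onto the CASE expression in the question, or onto three single-column CONTAINSTABLE calls combined with an outer join; CONTAINSTABLE itself only reports a KEY and RANK, not which column matched.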

Check sql table for values in another table

If I have the following data:
Results Table
.[Required]
I want one grape
I want one orange
I want one apple
I want one carrot
I want one watermelon
Fruit Table
.[Name]
grape
orange
apple
What I want to do is essentially say: give me all results where users are looking for a fruit. This is all just an example; I am actually looking at a table with roughly 1 million records and a string field of 4000+ characters. I am expecting a somewhat slow result, and I know the table could DEFINITELY be structured better, but I have no control over that. Here is the query I would essentially use, but it doesn't do what I want; it returns every record. And yes, [#Fruit] is a temp table.
SELECT * FROM [Results]
JOIN [#Fruit] ON
'%'+[Results].[Required]+'%' LIKE [#Fruit].[Name]
Ideally my output should be the following 3 rows:
I want one grape
I want one orange
I want one apple
If that kind of thing is doable, I would try it the other way round:
SELECT * FROM [Results]
JOIN [#Fruit] ON
[Results].[Required] LIKE '%'+[#Fruit].[Name]+'%'
This topic interests me, so I did a little bit of searching.
Suggestion 1 : Full Text Search
I think what you are trying to do is Full-Text Search.
You will need a Full-Text Index created on the table, if there isn't one already (Create FULLTEXT Index).
This should be faster than performing a LIKE.
Suggestion 2 : Meta Data Search
Another approach I'd take is to create a metadata table and maintain the information myself whenever the [Result].Required values are created or updated.
This looks more or less doable, but I'd start from the Fruit table, just for conceptual clarity.
Here's roughly how I would structure this, ignoring all performance/speed/normalization issues (note also that I've switched the operands of the LIKE comparison around):
SELECT f.name, r.required
FROM fruits f
JOIN results r ON r.required LIKE CONCAT('%', f.name, '%')
...and perhaps add a LIMIT 10 to keep the query from wasting time while you're testing it out.
This structure will:
give you one record per "match" (per Result row that matches a Fruit)
exclude Result rows that don't have a Fruit
probably be ungodly slow.
Good luck!
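The corrected join is easy to verify; a quick SQLite check against the sample data (SQLite uses || rather than CONCAT for string concatenation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (required TEXT)")
con.execute("CREATE TABLE fruit (name TEXT)")
con.executemany("INSERT INTO results VALUES (?)",
                [("I want one grape",), ("I want one orange",),
                 ("I want one apple",), ("I want one carrot",),
                 ("I want one watermelon",)])
con.executemany("INSERT INTO fruit VALUES (?)",
                [("grape",), ("orange",), ("apple",)])

# The column goes on the left of LIKE; the pattern is built from the fruit name.
rows = con.execute("""
    SELECT r.required FROM results r
    JOIN fruit f ON r.required LIKE '%' || f.name || '%'
""").fetchall()
print(sorted(r[0] for r in rows))
```

Exactly the three expected rows (grape, orange, apple) come back; carrot and watermelon are excluded because no fruit name appears in them.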

fulltext search over data with underscore

I have an indexed table where one of the indexed columns can contain data with an underscore.
ID Name
1 01_A3L
2 02_A3L
3 03_A3L
4 05_A3L
5 some name
6 another name
7 a name
When I search this table with the following query, however, I don't get any results:
SELECT * FROM MyAmazingTable WHERE( CONTAINS(*,'"a3l*"'))
What is the reason for this? And how can I make sure I get the results I expect (all records that end with A3L)?
Kees C Bakker is 100% correct, but if you just want to get the results you require without all of those steps, the quick/dirty way would be to change your search to a LIKE:
Select * from MyAmazingTable where Name like '%A3L'
The % here matches whatever comes before, while the rest of the pattern requires the last 3 characters to be A3L.
Which will give you the results that you are looking for.
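A quick check of that pattern against the sample data (in SQLite; its LIKE is case-insensitive for ASCII by default, much like a typical SQL Server collation, so '%A3L' also matches lower-case 'a3l'):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyAmazingTable (ID INTEGER, Name TEXT)")
con.executemany("INSERT INTO MyAmazingTable VALUES (?, ?)",
                [(1, "01_A3L"), (2, "02_A3L"), (3, "03_A3L"), (4, "05_A3L"),
                 (5, "some name"), (6, "another name"), (7, "a name")])
rows = con.execute(
    "SELECT Name FROM MyAmazingTable WHERE Name LIKE '%A3L'").fetchall()
print(sorted(r[0] for r in rows))  # → ['01_A3L', '02_A3L', '03_A3L', '05_A3L']
```

Note the trade-off: unlike the full-text CONTAINS query, a leading-wildcard LIKE cannot use an index and will scan the whole table.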

The FREETEXTTABLE on MS SQL 2012 returns strange ranks

I am trying to find several words in one table, but in different fields.
Why do records containing one of the words rank higher than records containing two of them?
The example:
Record 1
Title: Eddie Murphy
Description: An American stand-up comedian, actor, writer, singer, director, and musician.
Record 2
Title: Tom Cruise
Description: An American film actor and producer. He has won three Golden Globe Awards.
SELECT * FROM FREETEXTTABLE(SubjectContent, (Title, Description), 'tom actor')
returns Record 1 with rank 61 and Record 2 with rank 47, even though Record 2 contains both words ('tom' and 'actor') while Record 1 contains only one ('actor'). So the user receives a huge number of irrelevant records before the relevant one.
However, if I set the search parameter to 'tom cruise actor', the request returns a high rank.
My fulltext index:
CREATE FULLTEXT INDEX ON SubjectContent(Title, [Description])
KEY INDEX PK_SubjectContent
ON FullTextSearch;
I tried, unsuccessfully, to change the 'accent sensitive' property and other properties of the Full-Text Catalog. Thanks for any help.
Looking at the two strings, I see that the second one is a larger document from the full-text point of view, because of the sentence separator it contains. If you pass these strings to sys.dm_fts_parser, you will see that the max occurrence for the first string is 11 and for the second is 21. Full-text search normalizes document length into buckets of 16, 32, 128, 256, and so on, so your first document falls into the first bucket and the second into the second bucket; hence the first one gets a higher rank (rank is inversely proportional to document length). A reference for all of this is here: http://msdn.microsoft.com/en-us/library/cc879245.aspx
Thanks
Venkat
