Use Case
When a user goes to my website, they will be confronted with a search box much like SO. They can search for results using plain text: ".net questions", "closed questions", ".net and java", etc. The search will function a bit differently than SO, in that it will try to use as much as possible of the schema of the database rather than a straight full-text search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to the SO case, just an example here), "closed questions" will return questions that are closed, and ".net and java" questions will return questions that relate to .net and java and nothing else.
Problem
I'm not too familiar with the terminology, but I basically want to do a keyword-to-SQL driven search. I know the schema of the database and I can also datamine the database. I want to know whether any existing approaches are already out there before I try to implement this. I guess this question is asking what a good design for the stated problem would be.
Proposed
My proposed solution so far looks something like this
Clean the input. Just remove any special characters.
Parse the input into chunks of data. Break an input of "c# java" into c# and java. Also handle special cases, such as breaking "'c# java' questions" into 'c# java' and "questions".
Build a tree out of the input
Bind the data into metadata. So convert stuff like closed questions and relate it to the isclosed column of a table.
Convert the tree into a sql query.
Thoughts/suggestions/links?
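A minimal sketch of the five proposed steps, in Python, might look like the following. The table and column names (posts, post_tags, post_type, is_closed) and the METADATA dictionary are hypothetical, and the "tree" is simplified to a flat list of AND-ed clauses. Cleaning is limited to lower-casing here, because stripping every special character would break tokens such as c# and .net.

```python
import re

# Hypothetical metadata dictionary: maps a keyword to a SQL predicate.
# The table and column names used here are assumptions for illustration.
METADATA = {
    "questions": ("post_type = ?", "question"),
    "answers": ("post_type = ?", "answer"),
    "closed": ("is_closed = ?", 1),
}

# Fallback for tokens that are not in the metadata dictionary: treat as a tag.
TAG_CLAUSE = "id IN (SELECT post_id FROM post_tags WHERE tag = ?)"


def tokenize(text):
    """Split the input into tokens, keeping 'quoted phrases' together."""
    tokens = []
    for quoted, bare in re.findall(r"'([^']*)'|(\S+)", text):
        token = (quoted or bare).strip().lower()
        if token and token != "and":  # "and" is implied between tokens
            tokens.append(token)
    return tokens


def build_query(text):
    """Bind each token to metadata where possible and emit parameterized SQL."""
    clauses, params = [], []
    for token in tokenize(text):
        clause, value = METADATA.get(token, (TAG_CLAUSE, token))
        clauses.append(clause)
        params.append(value)
    where = " AND ".join(clauses) or "1 = 1"
    return f"SELECT id FROM posts WHERE {where}", params
```

Returning the parameters separately from the SQL text keeps user input out of the query string, which matters more here than in a plain full-text search because the query is assembled dynamically.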
I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".
Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.
Database Schema
Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:
Keyword ID (not the word),
The Word itself,
A Soundex Alpha value for the Word
Weight
ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:
Product ID,
Keyword ID,
Weight
Keyword Weighting
The weighting value is an indication of how often the word occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.
I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.
Indexing
Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:
All product keywords are disassociated and deleted if no longer referenced.
Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
The keyword weight is then incremented by a value that depends on the importance of the field. You can add or subtract weight based on the importance of the field: if Artist is more important than Title, subtract 1 or 2 from its ProductKeyword weight adjustment.
Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted.
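The indexing steps above could be sketched roughly as follows, using in-memory dictionaries in place of the Keywords and ProductKeywords tables. The FIELD_ADJUSTMENT values, the dictionary layout and the function names are assumptions, and the soundex function is a plain American Soundex that may not match a given database's built-in SOUNDEX exactly.

```python
import re

# Assumed per-field adjustments: a smaller number marks a more important field.
FIELD_ADJUSTMENT = {"artist": 1, "title": 2}

SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for ch in letters
}


def soundex(word):
    """American Soundex: first letter followed by three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


def index_product(product_id, fields, keywords, product_keywords):
    """Re-index one product.

    keywords: word -> {"soundex": str, "weight": int}
    product_keywords: (product_id, word) -> weight
    """
    # Step 1: drop the product's existing associations.
    # Simplification: keyword weights are not decremented and orphans are kept.
    for key in [k for k in product_keywords if k[0] == product_id]:
        del product_keywords[key]
    for field, value in fields.items():
        adjust = FIELD_ADJUSTMENT.get(field, 3)
        whole = value.strip().lower()
        words = re.sub(r"[^a-z0-9\s]", "", whole).split()
        # Whole value first (direct match), then each word; duplicates removed.
        for term in dict.fromkeys([whole] + words):
            entry = keywords.setdefault(term, {"soundex": soundex(term), "weight": 0})
            entry["weight"] += 1  # more occurrences means a less unique keyword
            key = (product_id, term)
            product_keywords[key] = product_keywords.get(key, 0) + adjust
```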
Querying
Take the input query search string in its entirety and look for a direct matching keyword. Retrieve all ProductKeywords for the keyword in an in-memory table along with Keyword weight (different from ProductKeyword weight).
Strip out all non-alphanumeric characters and split query into keywords. Retrieve all existing keywords (only a few will match). Join ProductKeywords to matching keywords to in-memory table along with Keyword weight, which is different from the ProductKeyword weight.
Repeat Step 2 but use soundex values instead, adjusting weights to be less relevant.
Join retrieved ProductKeywords to their related Products and retrieve each product's sales, which is a measure of popularity.
Sort results by Keyword weight, ProductKeyword weight and Sales. The final summing/sorting and/or weighting depends on your implementation.
Limit results and return product search results to client.
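As a rough illustration of the query steps, the sketch below scans in-memory dictionaries rather than joining tables. The data layout (keywords as word -> {"soundex", "weight"}, product_keywords as (product_id, word) -> weight), the SOUNDEX_PENALTY value and the phonetic callable are all assumptions; a real implementation would do this with joins and a temporary table as described.

```python
import re

# Assumed penalty that makes sound-alike matches rank below exact matches.
SOUNDEX_PENALTY = 100


def search(query, keywords, product_keywords, sales, phonetic, limit=10):
    """Return product ids ordered by keyword weight, product-keyword weight, sales.

    phonetic is any callable that maps a word to its phonetic code
    (for example a Soundex function matching the one used at index time).
    """
    whole = query.strip().lower()
    words = re.sub(r"[^a-z0-9\s]", "", whole).split()
    codes = {phonetic(w) for w in words}

    hits = {}  # product_id -> best (lowest) score tuple seen so far
    for (product_id, term), pk_weight in product_keywords.items():
        entry = keywords[term]
        if term == whole or term in words:
            penalty = 0                      # steps 1 and 2: direct matches
        elif entry["soundex"] in codes:
            penalty = SOUNDEX_PENALTY        # step 3: sound-alike match
        else:
            continue
        # Lower keyword weight means a rarer word; higher sales rank first.
        score = (entry["weight"] + penalty, pk_weight, -sales.get(product_id, 0))
        if product_id not in hits or score < hits[product_id]:
            hits[product_id] = score
    return sorted(hits, key=hits.get)[:limit]
```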
What you are looking for is Natural Language Processing. Strangely enough, this used to be included free as English Query in SQL Server 2000 and prior, but it's gone now.
Some other sources are :
http://devtools.korzh.com/eq/dotnet/
http://www.easyask.com/products/business-intelligence/index.htm
The concept is a metadata dictionary mapping words to tables, columns, relationships, etc., combined with an English sentence parser to convert an English sentence (or just some keywords) into a real query.
Some people even use English Query with speech recognition for some really cool demos, though I never saw it used in anger!
If you're using SQL Server, you can simply use its Full-Text Search feature, which is specifically designed to solve your problem.
You could use a hybrid approach: take the full text search results and further filter them based on the metadata from step 4 of your proposal. For something more intelligent, you could create a simple supervised learning solution by tracking which links the user clicks on after the search and storing that choice with the key search words in a decision tree. Searches would then be mined from this decision tree.
If I have an index with 10,000,000 documents and search text and ask to retrieve the top 1,000 items, is there a way to scope the facets to those 1,000 items?
My current problem is:
We have a very large index with a few different facets, including manufacturer. If I search for a product (WD-40 for instance), that matches a lot of different documents and document fields. It finds the product and it is the top scoring match, but because they only make 1 or 2 products, the manufacturer doesn't show up as a top facet option because it is sorted by count.
Is there a way to scope the facets to the top X documents? Or, is there a way to only grab documents which are above a certain #search.score?
The purpose of a refiner is to give users options to narrow down the result set. I would say the $top parameter and the returned facets work as they should. Limiting the refiners to the top 1000 results is a bad idea when we think about it: you'll end up with confusing usability and recall issues.
Your query for WD-40 returns a large result set. So large that there are 155347 unique manufacturers listed. I'm guessing you have several million hits. The intent of that query is to return the products called WD-40 (my assumption). But, since you search all properties in all types of content, you end up with various products like doors, hinges, and bikes that might have some text saying that "put some WD-40 on it to stop squeaks".
I'm guessing that most of the hits you get are irrelevant. Thus, you should either limit the scope of your initial query by default (for example, search only the title property) or add a filter to exclude a category of documents (like manuals, price lists, etc.).
You could also consider submitting different queries from your frontend application. One narrowly scoped query that retrieves the refiners and another, broader query that returns the results.
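A sketch of the two-query idea is shown below. It only builds the two request bodies and does not call the service. The body keys search, searchFields, facets and top are parameters of the Search Documents REST API; the index field names title and manufacturer, the facet count of 10 and the function name are assumptions.

```python
def build_search_requests(text, facet_field="manufacturer",
                          narrow_fields="title", top=50):
    """Build one narrow request for refiners and one broad request for results."""
    # Narrow query: only match the (assumed) title field, so facet counts
    # reflect products actually named like the search text.
    facet_request = {
        "search": text,
        "searchFields": narrow_fields,
        "facets": [f"{facet_field},count:10"],
    }
    # Broad query: search all searchable fields and return the top hits.
    result_request = {
        "search": text,
        "top": top,
    }
    return facet_request, result_request
```

The frontend would send both bodies to the index's search endpoint and render the facets from the first response next to the documents from the second.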
I don't have a relevant data set to test on, but I believe the $top parameter might do what you want. See this link:
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#top-optional
That said, there are other approaches to solve your use case.
Normalize your data
I don't know how clean your data is. But, for any data set of this size, it's common that the manufacturer name is not consistent. For example, your manufacturer may be listed as
WD40 Company
WD-40 Company
WDFC
WD 40
WD-40 Inc.
...
Normalizing will greatly reduce the number of values in your refiners. It's probably not enough for your use case, but still worth doing.
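As an illustration, a normalization step could look something like this. The suffix list and the alias table are assumptions and would need to be built from the real data.

```python
import re

# Assumed list of legal-form words to drop, and a hand-maintained alias table.
LEGAL_SUFFIXES = {"company", "co", "inc", "corp", "ltd", "llc"}
ALIASES = {"wdfc": "wd40"}


def normalize_manufacturer(name):
    """Collapse spelling variants of a manufacturer name into one key."""
    # Lower-case and turn every run of punctuation or spaces into one space.
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    core = "".join(w for w in words if w not in LEGAL_SUFFIXES)
    return ALIASES.get(core, core)
```

With these assumptions, all five variants listed above map to the same key, which can then be stored in a separate facetable field.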
Consider adding more refiners
When you have a refiner with too many options, it's always a good idea to consider adding more refiners with coarse values. For example, a category, or perhaps a simple refiner that splits the results in two, like "Physical vs. Digital" product as a first choice. Or consumer vs. professional product. In stock or on back-order. This pattern allows users to quickly reduce the result set without having to use the brand refiner.
Categorize your refiner with too many options
In your case, your manufacturer refiner contained too many options. I have seen examples where people add a search box within the refiner. Another option is to group your refiner options in buckets. For text values like a manufacturer, you could generate a new property with the first character of the manufacturer's name. That way you could present a refiner that lets users select a manufacturer from A-Z.
I have a list of companies on a spreadsheet that is rarely updated. I'll call it List A.
I also have a constantly updating weekly list of companies (List B) that should have entries that match some entries on List A.
The reality is that the company names extracted from List B are often inconsistent due to various business abbreviations (e.g. The Company, Company Ltd., Company Accountants Limited). Sometimes, these companies are under different trading names or have various misspellings.
My initial, not very intelligent reaction was to construct a table of employer alias names, with the first column being the true employer name and the following columns holding aliases, something like this: [https://i.stack.imgur.com/2cmYv.png]
On the left is a sample table, and the far right is a column where I am using the following array formula template:
=ArrayFormula(INDEX(A30:A33,MATCH(1,MMULT(--(B30:E33=H30),TRANSPOSE(COLUMN(B30:E33)^0)),0)))
I realized soon after that I needed to create a new entry for every single exact match variation (Ltd., Ltd, and Limited), so I looked into fuzzy lookups. I was really impressed by Alan's Fuzzy Matching UDFs, but my needs heavily lean towards using Google Spreadsheets rather than VBA.
Sorry for the long post, but I would be grateful if anyone has any good suggestions for fuzzy lookups or can suggest an alternative solution.
The comments weren't exactly what I was looking for, but they did provide some inspiration for me to come up with a bandaid solution.
My original array formula needed exact matches, but the problem was that there were simply too many company suffixes and alternate names, so I looked into fuzzy lookups.
My current answer is to abandon the fuzzy lookup proposal and instead focus on editing the original data string (i.e. company names) into a more simplified substring. Working from a few code snippets floating around, I came up with a combined custom formula that implements two lines in Google Apps Script:
var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
// Chain from the first result so that both replacements apply.
var alphanumericalmin = companysuffixremoval.replace(/[^A-Za-z0-9]/g, "");
The first line is simply my idea of removing popular company suffixes and the "the" from the string.
The second line removes all non-alphanumeric characters, including any spaces and periods.
This ensures "The First Company Limited." and "First Company Ltd" become "FirstCompany", which should work with returning the same values from the original array formula in the OP. Of course, I also implemented a trimming and cleaning line for any trailing/leading/extra spaces for the initial string, but that's probably unnecessary with the second line.
If someone can come up with a better idea, please do tell. As it stands, I'm just tinkering with a script with minimal experience.
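For comparison, here is the same idea sketched in Python rather than Apps Script, with one change that is my assumption rather than part of the approach above: word boundaries around the suffix list, so that "the" inside "Theatre" or "sons" inside "Parsons" is left alone. The function names and the case-insensitive lookup are also illustrative.

```python
import re

# Same suffix list as above, but matched as whole words only (assumption).
SUFFIX_RE = re.compile(
    r"\b(limited|ltd|group|holdings|plc|llp|sons|the)\b", re.IGNORECASE
)


def simplify(name):
    """Strip common company suffixes, then everything non-alphanumeric."""
    without_suffixes = SUFFIX_RE.sub("", str(name))
    return re.sub(r"[^A-Za-z0-9]", "", without_suffixes)


def match_company(name, list_a):
    """Return the List A entry whose simplified form equals that of name."""
    lookup = {simplify(entry).lower(): entry for entry in list_a}
    return lookup.get(simplify(name).lower())
```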
New to search databases and working with one. What is the difference between full text and free text search/index?
They are more or less the same. More precisely, they are just synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have large documents to search and can't use LIKE queries in a normal RDBMS for performance reasons.
Mainly, it follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
Suppose you typed John and Ryan: the query will return all the items in the documents that contain either "John" or "Ryan". Order and case sensitivity don't matter.
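The two stages can be illustrated with a toy inverted index. This is only a sketch of the general idea, not how Lucene is implemented; the tokenizer and the small stop-word list are assumptions.

```python
import re
from collections import defaultdict

# Assumed stop words that are ignored in queries.
STOP_WORDS = {"and", "or"}


def build_index(documents):
    """Indexing stage: map each lower-cased term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search stage: consult only the index and return documents matching any term."""
    matches = set()
    for term in re.findall(r"\w+", query.lower()):
        if term not in STOP_WORDS:
            matches |= index.get(term, set())
    return matches
```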
In a nutshell, unless you are using the terms in a specific use case, they are just different names for the same person.
Call him Cristiano or CR7, they are the same :)
I have a database with lots of customers in it. A user of the system wants to be able to look up a customer's account by name, amongst other things.
What I have done is create a new table called CustomerFullText, which just has a CustomerId and an nvarchar(max) field "CustomerFullText". In "CustomerFullText" I keep concatenated together all the text I have for the customer, e.g. First Name, Last Name, Address, etc, and I have a full-text index on that field, so that the user can just type into a single search box and gets matching results.
I found this gave better results than trying to search data stored in lots of different columns, although I suppose I'd be interested in hearing if this in itself is a terrible idea.
Many people have names which sound the same but which have different spellings: Katherine and Catherine and Catharine, and perhaps someone whose record in the database is Katherine but who introduces themselves as Kate. Also, McDonald vs MacDonald, Liz vs Elisabeth, and so on.
Therefore, what I'm doing is, whilst storing the original name correctly, making a series of replacements before I build the full text. So ALL of Katherine and Catherine and so on are replaced with "KATE" in the full text field. I do the same transform on my search parameter before I query the database, so someone who types "Catherine" into the search box will actually run a query for "KATE" against the full text index in the database, which will match Catherine AND Katherine and so on.
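To make the transform concrete, here is a small sketch of the replacement step in Python. Only the "KATE" target comes from the description above; the other canonical forms and the function name are made up for illustration.

```python
import re

# Variant -> canonical form. Only "KATE" comes from the description above;
# the other canonical forms are assumptions.
CANONICAL = {
    "katherine": "KATE", "catherine": "KATE", "catharine": "KATE", "kate": "KATE",
    "mcdonald": "MCDONALD", "macdonald": "MCDONALD",
    "liz": "ELISABETH", "elisabeth": "ELISABETH",
}


def normalize_names(text):
    """Replace each known name variant with its canonical form."""
    def replace(match):
        word = match.group()
        return CANONICAL.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", replace, text)
```

The same function has to run on both the text being indexed and the search parameter, otherwise the two sides will not agree.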
My question is: does this duplicate any part of existing SQL Server Full Text functionality? I've had a look, but I don't think that this is the same as a custom stemmer or word breaker or similar.
Rather than trying to phonetically normalize your data yourself, I would use the Double Metaphone algorithm, essentially a much better implementation of the basic SOUNDEX idea.
You can find an example implementation here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574, and more are listed in the Wikipedia link above.
It will generate two normalized code versions of your word. You can then persist those in two additional columns and compare them against your search text, which you would convert to Double Metaphone on the fly.
I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records, with product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (ok, a few seconds, but still), which is too long. At times it makes a user sit there and wait up to 10 seconds for queries and page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether trying it would even be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server, and would certainly negate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
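A minimal sketch of such an index table, using Python's built-in sqlite3 module so that it runs anywhere. The table and column names are hypothetical, and the word splitting is done in the application rather than in a trigger or stored procedure.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT, description TEXT);
    -- One row per (word, record); field records which column the word came from.
    CREATE TABLE word_index (word TEXT, product_id INTEGER, field TEXT);
    CREATE INDEX ix_word_index_word ON word_index (word);
""")


def save_product(product_id, sku, description):
    """Insert or update a product and rebuild its rows in the index table."""
    conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
                 (product_id, sku, description))
    conn.execute("DELETE FROM word_index WHERE product_id = ?", (product_id,))
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    conn.executemany("INSERT INTO word_index VALUES (?, ?, 'description')",
                     [(word, product_id) for word in words])


def find_products(word):
    """Equality lookup on the indexed word column instead of LIKE '%word%'."""
    rows = conn.execute(
        "SELECT DISTINCT p.id, p.sku FROM word_index w "
        "JOIN products p ON p.id = w.product_id "
        "WHERE w.word = ? ORDER BY p.id",
        (word.lower(),))
    return rows.fetchall()
```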
Also, ensure that all the relevant fields are indexed fully, though not necessarily in the same compound index (SKU, sizing, etc.). Any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase, or you don't care about insert/update speed as much).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable": your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited data set. In that case, simply store a numeric code for valid values, do the encoding/decoding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full text indexes. This will require very few application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
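The string-building step could be sketched like this in Python (the function name is made up). Embedded double quotes are stripped from each word so that user input cannot break out of the quoted terms.

```python
def to_fulltext_query(search):
    """Turn 'word1 word2' into '"word1*" AND "word2*"' (prefix match on every word)."""
    words = [word.replace('"', "") for word in search.split()]
    return " AND ".join(f'"{word}*"' for word in words if word)
```

The resulting string is best passed to the database as a query parameter rather than concatenated into the SQL text.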
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (0 - 1000, with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance is likely to be an issue. Make sure you have indexes on the columns you are searching on. If you are using LIKE, the query can't use an index if you do this: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you do it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to keep as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173