Simple search in App Engine

I want people to be able to search from a title field and a short description field (max 150 characters), so no real full-text search. Mainly they search for keywords, like "salsa" or "club", but I also want them to be able to search for "salsa" and match words like "salsaclub", so at least some form of partial matching.
Would the new Search API be useful for this kind of search, or would I be better off putting all keywords, including possible partial matches, in a list and filter on this list?

Putting all the keywords and partial matches (with some sort of support for stemming, etc.) in a list might work if you limit yourself to a small number of query terms (i.e. one or two); anything more complex will become costly. If you want more than one or two terms, I would look at the alternatives.
You haven't said whether you're using Python, Java, Go or PHP. If Python, have a look at Whoosh for App Engine, https://github.com/tallstreet/Whoosh-AppEngine, or go with the Search API.
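As a rough illustration of the keyword-list approach, here is a minimal Python sketch that expands the title and description into words plus their prefixes. The function name `build_keywords` and the minimum prefix length of 3 are assumptions made for this example, not part of any App Engine API.

```python
import re

MIN_PREFIX_LEN = 3  # assumption: shorter prefixes are too noisy to index


def build_keywords(title, description, min_len=MIN_PREFIX_LEN):
    """Return a sorted list of lowercase words plus their prefixes."""
    words = re.findall(r"[a-z0-9]+", (title + " " + description).lower())
    keywords = set()
    for word in words:
        if len(word) < min_len:
            # Keep very short words whole; they have no useful prefixes.
            keywords.add(word)
            continue
        # Add every prefix from min_len characters up to the full word.
        for end in range(min_len, len(word) + 1):
            keywords.add(word[:end])
    return sorted(keywords)


keywords = build_keywords("Salsaclub Havana", "Dance every Friday")
print("salsa" in keywords)  # the prefix "salsa" of "salsaclub" is indexed
```

The resulting list could be stored in a repeated string property and filtered with an equality test per query term. Note that prefixes let "salsa" match "salsaclub" but not "club"; covering that would need all substrings, which grows the list considerably.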

Related

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with very frequent (low-IDF) terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate the POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (stored="false") field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with a different boost for each field.

Does GAE Search API do spell checks

I'm talking about this API:
https://cloud.google.com/appengine/docs/java/search/
Does it allow spell checks? For example: if I create an index of documents, and in those documents I have words like "iphone", "android", etc. If I search for "iphoen" instead can it still return the correct results?
No, it cannot. It is just an index: what you put in, you get back.
You need to implement your own logic for spelling errors. If a user searches for "iphoen", you either return all results for "iphoen" and suggest the query "iphone" instead, or, if you are very confident that the search term was misspelled, search for "iphone" right away and ask the user whether "iphoen" was really intended. This is how Google search works. This is, obviously, not a trivial task.
No, it will not do this. It does direct text matching. Taken from the link you provided:
The simplest query, sometimes called a "global search" is a string that contains only field values. This search uses a string that searches for documents that contain the words "rose" and "water":
index.search("rose water");
Based on this, it's implied reasonably well that it will not do fuzzy matches for you. However, you could write an extension class that takes a string and tests variants against the Search API. You could then return any successful queries and report the fuzzy match. In this way, your class would take "ipohne" and eventually try "iphone" and return a successful query.
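A minimal sketch of the "test variants" idea, assuming you keep your own set of terms known to be in the index (the `known_terms` vocabulary and both function names are hypothetical, not part of the Search API). It generates every string within one edit of the input and keeps those that appear in the vocabulary; the surviving candidates could then be sent to the index as ordinary queries.

```python
import string


def edit_distance_one(term, alphabet=string.ascii_lowercase):
    """Return all strings one delete, swap, replace or insert away from term."""
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    variants = deletes | transposes | replaces | inserts
    variants.discard(term)
    return variants


def suggest(term, known_terms):
    """Return [term] if it is known, else known terms within one edit of it."""
    if term in known_terms:
        return [term]
    return sorted(edit_distance_one(term) & set(known_terms))


vocabulary = {"iphone", "android"}  # assumption: terms you have indexed
print(suggest("iphoen", vocabulary))
print(suggest("ipohne", vocabulary))
```

Checking candidates against a local vocabulary avoids issuing one query per variant, which matters because a six-letter word already has a few hundred one-edit variants.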

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for documents that have words containing a word token from the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to all terms. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*))
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
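The second option above (split the input and search each term both as-is and as a prefix) can be sketched client-side as follows. This is a minimal example that assumes whitespace-separated terms containing no query operators; the function name is made up for illustration, and real input would need escaping of special characters first.

```python
def with_prefix_alternatives(search_text):
    """Rewrite 'aa bb' as '(aa | aa*) (bb | bb*)' for the search parameter."""
    terms = search_text.split()
    # Each term is searched both as-is (so stemming still applies)
    # and as a prefix (so "matt" can reach "matthew").
    return " ".join("({0} | {0}*)".format(term) for term in terms)


print(with_prefix_alternatives("aa bb"))  # (aa | aa*) (bb | bb*)
```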
perhaps this page might be of interest..?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

How can I sort appengine search index results by relevance?

I'm working on a project that uses Google App Engine's text search API to allow users to search for documents that include a words field. I'm sorting using a MatchScorer, which according to the documentation "assigns a score based on term frequency in a document".
When a user enters a query like "business promo", I convert this into a query string that looks like words:business OR words:promo. I would have expected that this would return documents that contain both the words "business" and "promo" before documents that only contain one of the words (since the documentation says it assigns a score based on term frequency in the document). However, I frequently see results that contain only one of the words before documents that contain both.
I've also tried querying using the RescoringMatchScorer, but see the same problem using this scorer.
I've thought about doing separate queries - ones that AND the search terms and ones that OR the search terms - but this would require many queries if the user enters more than two search terms. For example, if I searched for "advanced business solutions", I'd need queries like this to cover all the bases:
words:advanced AND words:business AND words:solutions
words:advanced AND words:business
words:advanced AND words:solutions
words:business AND words:solutions
words:advanced OR words:business OR words:solutions
Does anyone have any hints on how to perform searches that return more relevant results (i.e. more search term matches) before less relevant results?
Perhaps it depends on how you interpret the phrase "term frequency". I think you're interpreting it to mean "how many of my search terms appear in the document". But it could also mean "how many times (any of) the search terms appear in each document", and indeed, at least according to some simple experiments I've done, the latter seems to be the actual behavior.
For example, a document that contains the word "business" 20 times and never mentions the word "promo" would be scored higher than a document that contains "business" and "promo" only once each. Does that jibe with the behavior you're seeing?
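To make the two readings of "term frequency" concrete, here is a small Python sketch that computes both counts for the example above. It only illustrates the distinction; it is not the scorer's actual formula, and the function names are invented for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def total_occurrences(query_terms, text):
    """Count every occurrence of any query term (the observed behavior)."""
    tokens = tokenize(text)
    return sum(tokens.count(term) for term in query_terms)


def distinct_terms_matched(query_terms, text):
    """Count how many different query terms appear (the expected behavior)."""
    tokens = set(tokenize(text))
    return sum(1 for term in query_terms if term in tokens)


query = ["business", "promo"]
doc_a = "business " * 20      # mentions "business" 20 times, never "promo"
doc_b = "business promo"      # mentions each term once

print(total_occurrences(query, doc_a), total_occurrences(query, doc_b))
print(distinct_terms_matched(query, doc_a), distinct_terms_matched(query, doc_b))
```

Under the first count doc_a wins; under the second doc_b wins. One possible workaround is to re-sort the results of the OR query client-side by the second count, at the cost of fetching the candidate documents first.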

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the stream.url GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
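For reference, the two steps above can be assembled programmatically. The following Python sketch builds the nested URL with the standard library; the function name and the `solr_base` argument are assumptions for illustration, while the parameter names (`mlt.fl`, `mlt.mintf`, `stream.url`) are the ones used in the example URL. Unlike the hand-written URL, it percent-encodes the inner URL fully, so the inner query survives as a single parameter value.

```python
from urllib.parse import quote, urlencode


def more_like_those_url(solr_base, doc_ids, field="text"):
    """Build an MLT handler URL whose input stream is a select over doc_ids."""
    # Step 1: the select query returning the specified documents.
    inner_query = " ".join("id:{}".format(doc_id) for doc_id in doc_ids)
    select_url = "{}/select?{}".format(
        solr_base, urlencode({"q": inner_query}, quote_via=quote)
    )
    # Step 2: pass that URL to the More Like This handler via stream.url.
    params = {"mlt.fl": field, "mlt.mintf": 0, "stream.url": select_url}
    return "{}/mlt?{}".format(solr_base, urlencode(params, quote_via=quote))


print(more_like_those_url("http://solrServer:8983/solr", [1, 2, 3]))
```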
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?
Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it to a (dedicated) SearchHandler.
I've already done something similar and it is quite easy if you look at the original implementation of the MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend using your own parameters in your implementation to prevent collisions with other components.
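The term selection the question asks for (top n terms per input document, boosted by TF-IDF) can be illustrated outside Solr with a short Python sketch. Everything here is an assumption for illustration: the function names, the pre-tokenized input, the `doc_freq` table of document frequencies, and the smoothed IDF formula log(N / (1 + df)), which is not Lucene's exact formula. A custom SearchComponent would do the equivalent in Java against the index itself.

```python
import math
from collections import Counter


def top_terms(doc_tokens, doc_freq, num_docs, n=20):
    """Return the n (term, tf-idf) pairs with the highest score for one document."""
    tf = Counter(doc_tokens)
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def feedback_query(docs, doc_freq, num_docs, n=20, field="text"):
    """Build a boosted query from the top n terms of each input document."""
    boosts = Counter()
    for tokens in docs:
        # Each document contributes its own top n terms, so a short
        # document is not drowned out by a long one.
        for term, score in top_terms(tokens, doc_freq, num_docs, n):
            if score > 0:
                boosts[term] += score
    ordered = sorted(boosts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join("{}:{}^{:.2f}".format(field, t, b) for t, b in ordered)


docs = [["salsa", "salsa", "club"], ["tango", "club"]]
doc_freq = {"salsa": 1, "club": 50, "tango": 2}  # hypothetical index statistics
print(feedback_query(docs, doc_freq, num_docs=100, n=1))
```

The resulting string uses the standard `field:term^boost` query syntax and could be sent to the regular select handler.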
