The experimental Search API for Google App Engine seems to support only exact phrase matching for queries. For example, the query 'best prices hotel' will only match that exact phrase; it will not match texts such as 'best hotel prices' or 'best price hotels'. Matching text in a general way is of course a much harder task, but I thought the Search API would at least be able to handle some of that.
Another example is the query 'new cars' which will not match the text 'new and used cars'.
You should be able to use the '~' operator to rewrite queries to include plurals.
E.g., ~hotel or ~"best prices hotel".
Documentation about this operator should be added in the next app engine SDK release.
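For what it's worth, here is a minimal sketch of what that could look like with the Python Search API (the index name 'hotels' is made up for illustration):

```python
from google.appengine.api import search

# Hypothetical index name, for illustration only.
index = search.Index(name='hotels')

# The '~' operator asks the API to also match plural/singular variants.
results = index.search('~"best prices hotel"')
for doc in results:
    print(doc.doc_id)
```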
I am trying to build a search over my domain with Solr, and I am having trouble producing a keyword search that fulfils our requirements. My issue:
When my users search, the requirement is that the search must return results with partial token matches. For example:
Consider the text field: "CA-1234-ABCD California project"
The following keyword searches (what the user puts in the search field) should match this field:
"California"
"Cali"
"CA-1234-ABCD"
"ABCD"
"ABCD-1234"
etc.
With a text_en field (as configured in the example schema), the tokenization, stemming, and grammar processing allow non-wildcard searches to match partial words/tokens in many cases, but Solr still seems limited to exact token matches in many situations. For example, the following query does not match:
name:cali
The only way I have found to get the user experience that is required is to use a wildcard search:
name:*cali*
The problem with this is that tf scoring (and, it seems, other functionality like fuzzy search) doesn't work with a wildcard search.
The question is: is there a way to get partial token matching (for all tokens, not just those with common stems) while retaining tf scoring and other advanced query functionality?
My best workaround at the moment is a query that includes both wildcard and non-wildcard clauses, such as:
name:cali OR name:*cali*
but I don't know if that is a good strategy here. Does Solr provide a way?
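For reference, a rough sketch of issuing that combined query from Python with the requests library might look like this (the host, collection, and field name are placeholders, not a recommendation):

```python
import requests

# Placeholder host and collection; adjust to your Solr setup.
SOLR_URL = "http://localhost:8983/solr/mycollection/select"

def search_partial(term):
    # Combine an analyzed-match clause (which keeps tf scoring)
    # with a wildcard clause (which catches partial tokens).
    query = "name:%s OR name:*%s*" % (term, term)
    resp = requests.get(SOLR_URL, params={"q": query, "wt": "json"})
    return resp.json()["response"]["docs"]

print(search_partial("cali"))
```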
We are trying to group phrases together in order to improve results.
For instance, if the user asks a question like "When do I have to change the filter of my air conditioning?", which contains the domain-specific phrase "air conditioning", R&R returns some answers containing the term "air" but not "conditioning", or answers containing other terms like "air bag" or "air filter".
This can be accomplished with a raw Solr instance by putting the phrase between quotes. The Solr query would then look like the following:
...
"debug": {
"rawquerystring": "When do I have to change the filter of my \"air conditioning\" ?",
"querystring": "When do I have to change the filter of my \"air conditioning\" ?",
"parsedquery": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my PhraseQuery(text:\"air conditioning\") text:?",
"parsedquery_toString": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my text:\"air conditioning\" text:?",
...
However, the R&R guide states:
The syntax is different from standard Solr syntax as follows:
You can search for a single term, or a phrase. You do not need to surround the phrase with double quotation marks as with Solr, but you can include phrases in the query and they are accounted for by the ranker models.
We could not find more details regarding this statement. But, as we understand it, the ranker is supposed to identify phrases. If that is the case, is there a way to supply a dictionary of phrases in order to tune the ranker? Or could we set our own model of legal phrases? What are the options for accomplishing this goal?
Thanks
Currently R&R doesn't support strict phrase querying, though there are features that take term ordering and term adjacency into consideration. We are working on a new version of the service in which users will be able to use the full standard Solr query syntax (including phrase specification) for document retrieval.
I want people to be able to search a title field and a short description field (max 150 characters), so no real full-text search. Mainly they search for keywords like "salsa" or "club", but I also want "salsa" to match words like "salsaclub", so at least some form of partial matching is needed.
Would the new Search API be useful for this kind of search, or would I be better off putting all keywords, including possible partial matches, in a list and filter on this list?
Putting all the keywords and partial matches (with some sort of support for stemming, etc.) in a list might work if you limit yourself to a small number of query terms (i.e., one or two); anything more complex will become costly. If you want anything more than one or two terms, I would look at the alternatives.
You haven't said whether you're using Python, Java, Go, or PHP. If Python, have a look at Whoosh for App Engine (https://github.com/tallstreet/Whoosh-AppEngine) or go with the Search API.
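If you go the keyword-list route, a minimal sketch with the Python NDB datastore API could look like this (the model and property names are hypothetical):

```python
from google.appengine.ext import ndb

class Event(ndb.Model):
    # Hypothetical model: store precomputed keywords and partial matches,
    # e.g. ['salsa', 'salsaclub', 'club'] for a title containing "salsaclub".
    title = ndb.StringProperty()
    keywords = ndb.StringProperty(repeated=True)

def search_events(term):
    # A repeated property matches if any element equals the filter value.
    return Event.query(Event.keywords == term.lower()).fetch(20)
```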
We have stemming in our Solr search, and we need to retrieve the word/phrase after stemming. That is, if I search for "oranges", stemming turns it into a search for "orange". If I turn on debugQuery I can see this, but we'd like to access it through the result if possible. We need this because we pass the searched word as a parameter to a 3rd-party application that highlights the word in an online PDF reader. Currently, if a user searches for "oranges" and a document contains "orange", the PDF reader doesn't highlight anything, since it tries to highlight "oranges", not "orange".
Thanks all in advance,
Krt_Malta
I've no experience with Solr, but if you need it just for presentation to users, you could stem their queries yourself using the same stemmer Solr uses. This would probably be faster since it avoids a trip to Solr's index. For English this would presumably be the Porter stemmer (http://tartarus.org/~martin/PorterStemmer/), or you could check Solr's implementation.
However, a word of caution: most stemming algorithms do not guarantee that stemmed words will be actual words. See http://snowball.tartarus.org/algorithms/english/stemmer.html for examples.
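As an illustration, here is a minimal sketch using NLTK's Porter implementation (my assumption; it approximates, but may not exactly match, whatever stemmer your Solr field is configured with):

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
print(stemmer.stem("oranges"))  # -> 'orang', not a real word
```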
You could use the implicit analysis request handler to get the stemmed word.
For your example, if you are using the text_en field and the Snowball Stemmer, the URL
<YOUR SOLR HOST>/solr/<YOUR COLLECTION>/analysis/field?analysis.query=oranges&analysis.fieldtype=text_en&verbose_output=1
would give you a json response, including the following:
"org.apache.lucene.analysis.snowball.SnowballFilter",
[
  {
    "text": "orang",
    ...
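In case it helps, here is a rough Python sketch of calling that handler and pulling out the final stemmed tokens (host and collection are placeholders, and the response layout should be verified against your Solr version):

```python
import requests

# Placeholder host/collection; mirrors the URL from the answer above.
url = "http://localhost:8983/solr/mycollection/analysis/field"
params = {
    "analysis.query": "oranges",
    "analysis.fieldtype": "text_en",
    "wt": "json",
}
resp = requests.get(url, params=params).json()

# The "query" entry alternates analyzer class names with token lists;
# the last token list holds the fully analyzed (stemmed) terms.
stages = resp["analysis"]["field_types"]["text_en"]["query"]
stemmed = [token["text"] for token in stages[-1]]
print(stemmed)  # e.g. ['orang']
```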
I am trying to search a SQL Server 2008 table (containing about 7 million records) for cities and countries based on free-form user input. The search string I get from the user can be anything like "Hotels in San Francisco, US", "New York, NY", "Paris sddgdfgxx", or "Toronto Canada". Terms are not always separated by commas, are not in a specific order, and may include useless data.
This is what I tried:
Method 1: FTS with contains:
ex: SELECT * FROM cityNames WHERE CONTAINS(cityname, 'word1 AND word2') -- with AND
SELECT * FROM cityNames WHERE CONTAINS(cityname, 'word1 OR word2') -- with OR
This didn't work very well, because a term like 'sddgdfgxx' would return nothing when combined with AND. Using OR works for one-word cities like 'Paris', but not for 'San Diego' or 'San Francisco'.
Method 2: this is actually a reverse search; the logic is to check whether the user input string contains any of the cities or countries from my table. This way I'd know for sure that 'Aix en Provence' or 'New York' was searched for.
ex: SELECT * FROM cityCountryNames WHERE 'Ontario, Canada, Toronto' LIKE cityCountryNames
Notes: I wasn't able to get results for two-word cities, and the query was slow.
Any help is appreciated.
I would strongly recommend using a 3rd-party API like the Google Geocoding API to take such input and parse it into a location with discrete parts (street address, city, state, country, etc.). Then you could use those discrete parts to search your database if necessary.
Map services like Google and Bing have solved this problem way better than you or I ever would, so why not leverage all the work they've done?
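As a sketch of that approach (you would need your own API key, and the example values are illustrative):

```python
import requests

# Google Geocoding API; 'YOUR_API_KEY' is a placeholder.
params = {"address": "Hotels in San Francisco, US", "key": "YOUR_API_KEY"}
resp = requests.get("https://maps.googleapis.com/maps/api/geocode/json",
                    params=params).json()

if resp["status"] == "OK":
    # Pull the discrete parts out of the first result.
    parts = resp["results"][0]["address_components"]
    city = next((p["long_name"] for p in parts if "locality" in p["types"]), None)
    country = next((p["long_name"] for p in parts if "country" in p["types"]), None)
    print(city, country)  # e.g. San Francisco United States
```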
SQL isn't designed for the kinds of queries you are performing, certainly not at scale.
My recommendation would be as follows:
Index all your places (cities + countries) into a Solr index. Solr is a FOSS search server built on Lucene, and it can easily query a 7-million-record index in milliseconds or less.
Query Solr with the user-typed string and, voilà, the first match is the best match.
So even if the user typed "Paris sddgdfgxx", Paris should be your first hit. If you want to get really sophisticated, use an n-gram approach (known in Lucene as shingles).
Since Solr offers a RESTful (HTTP) API, it should easily integrate into whatever platform you are on.
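A minimal sketch of that query from Python (the collection name 'places' and the field names are assumptions about your schema):

```python
import requests

def best_place_match(user_text):
    # edismax is a forgiving parser for free-form user input.
    resp = requests.get(
        "http://localhost:8983/solr/places/select",
        params={
            "q": user_text,
            "defType": "edismax",
            "qf": "cityname countryname",
            "rows": 1,
            "wt": "json",
        },
    ).json()
    docs = resp["response"]["docs"]
    return docs[0] if docs else None

print(best_place_match("Paris sddgdfgxx"))  # top-scoring hit should be Paris
```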