Dealing with spell check suggestions - solr

We are trying to use Solr's spell check to do a "Did you mean?" type suggestion.
The problem we are having is that we are replacing the original term in the query with Solr's suggestions.
For example: a search for "10ks" (we are creating an events site) will return a suggestion of "5ks".
However, it seems the spell check is using "ks" rather than "10ks" as the term so when we replace "ks" with "5ks" we get 105ks. This then causes an infinite "did you mean" loop because Solr always uses "ks" rather than "10ks" in the spell check suggestions.
Here's the code that we use to replace suggestions in the original query.
/// <summary>
/// Takes the first suggestion for each misspelled term and applies it to the keyword.
/// </summary>
private string GetSuggestedQuery(string keyword, List<SpellCheck> suggestions)
{
    if (suggestions == null)
    {
        return null;
    }

    foreach (var suggestion in suggestions)
    {
        // Plain substring replacement: when Solr reports "ks" as the term,
        // this is what turns "10ks" into "105ks".
        keyword = keyword.Replace(suggestion.Query, suggestion.Suggestions.First());
    }
    return keyword;
}
This works great for two-word queries; for example, "runnig events" becomes "running events".
The only thing I can think of is to do something naive like check for spaces in the original query and then replace the whole thing if the query contained spaces.

Look at the spellcheck.collate setting. It will return a re-written query in the way you are suggesting.
https://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

This is difficult to answer without seeing the field definition in your schema.xml. The analyzer setup that will probably work for your case is:
WordDelimiterFilterFactory with splitting on letter-number transitions turned off (splitOnNumerics="0"; see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory), along with the StandardTokenizerFactory.

Related

Azure search not behaving as expected for dashes

I'm having an issue when using azure search for the following example data set: abc-123-456, abc-123-457, abc-123-458, etc
When searching for abc-123-456, I expected only one result, but instead I get all results containing abc-123-...
Is there some setting or way to change this behavior?
Current search settings:
TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
Side = EdgeNGramTokenFilterSide.Front,
MinGram = 3,
MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
TokenFilters =
{
TokenFilterName.Lowercase,
new TokenFilterName("frontEdgeNGram"),
TokenFilterName.Classic,
TokenFilterName.AsciiFolding
}
});
SearchOptions UsersSearchOptions = new SearchOptions
{
QueryType = SearchQueryType.Simple,
SearchMode = SearchMode.All,
};
Using azure.search.documents ver 11.1.1
Edit: Searching for abc-123-456* (with the asterisk) gives me the one result, as expected. How can I get this behavior by default?
Just to add to this..
The portal version is 2020-06-30
The sdk version we use is azure.search.documents ver 11.1.1
abc-123-456 does NOT work as expected
"abc-123-456" does NOT work as expected
"abc-123-456"* does NOT work
"abc-123-456*" does NOT work
If we append an asterisk to the end of the search text, and it is not within a phrase, it works as expected.
For example:
abc-123-456* works as expected.
(abc-123-456* | abc-123-457* ) works as expected.
Why is the asterisk required? How can we make this work within a phrase?
This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text “abc-123-456” is broken into smaller tokens like “abc”, “abc-1”, “abc-12”, “abc-123”, …, “abc-123-456”. Check out the Analyzer API for the full list of tokens generated by a particular analyzer.
For the query abc-123, if the default analyzer is being used, the query terms will be abc and 123, and they will match all documents that contain these terms.
The prefix query on the other hand is not analyzed and looks for documents that contain the prefix as is “abc-123”. A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result is coming back. Full-text search is over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) is over verbatim strings in a separate unprocessed/internal index.
Another option is to set only the search analyzer on the field to keyword, so that the input query is not broken into tokens.
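To make the tokenization described above concrete, here is a minimal Java sketch of front edge n-gram generation with MinGram = 3 and MaxGram = 20. It is only an illustration, not the Azure implementation: the class and method names are made up for this example, and it assumes the whitespace tokenizer leaves "abc-123-456" as a single token. It shows that two different IDs share most of their indexed prefixes, which is why any query that is itself expanded into prefixes (or split into abc and 123) matches every document.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {

    // Front edge n-grams: every prefix of the token whose length lies
    // between minGram and maxGram (capped at the token length).
    public static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Prefixes that two tokens have in common once both are n-grammed.
    public static List<String> sharedGrams(String a, String b, int minGram, int maxGram) {
        List<String> shared = new ArrayList<>(frontEdgeNGrams(a, minGram, maxGram));
        shared.retainAll(frontEdgeNGrams(b, minGram, maxGram));
        return shared;
    }

    public static void main(String[] args) {
        // Tokens stored in the index for one document.
        System.out.println(frontEdgeNGrams("abc-123-456", 3, 20));
        // Tokens that a second, different document also contains.
        System.out.println(sharedGrams("abc-123-456", "abc-123-457", 3, 20));
    }
}
```

Only the full-length token "abc-123-456" is unique to the first document, so a query has to reach the index as that exact, unsplit string to return a single result.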

On Google App Engine (GAE), how do I search on the Key/ID field?

I've got this code (Java, GAE):
// Much earlier:
playerKey = KeyFactory.keyToString(somePlayer.key);
// Then, later...
PersistenceManager pm = assassin.PMF.get().getPersistenceManager();
Key targetKey = KeyFactory.stringToKey(playerKey);
Query query = pm.newQuery(Player.class);
query.setFilter("__key__ == keyParam");
query.declareParameters("com.google.appengine.api.datastore.Key keyParam");
List<Player> players = (List<Player>) query.execute(targetKey); // <-- line 200
which generates this error:
javax.jdo.JDOFatalUserException: Unexpected expression type while parsing query. Are you certain that a field named __key__ exists on your object?
at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:354)
at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:252)
at myapp.Player.validPlayerWithKey(Player.java:200)
// [etc., snip]
But I'm not sure what it wants. I'm trying to search on the JDO id field, which I thought I read in the documentation has the special name __key__.
I've tried it with both
query.setFilter("__key__ == keyParam");
and
query.setFilter("ID == keyParam");
with the same results. So, what am I doing wrong? Or, more importantly, how do I do it correctly?
Thanks!
Edit: For completeness's sake, here is the final, working code (based on Gordon's answer, which I have accepted as correct):
Player result = null;
if (playerKey == null)
{
    log.log(Level.WARNING, "Tried to find player with null key.");
}
else
{
    PersistenceManager pm = assassin.PMF.get().getPersistenceManager();
    try {
        result = (Player) pm.getObjectById(Player.class, playerKey);
    } catch (javax.jdo.JDOObjectNotFoundException notFound) {
        // Player not found; we will return null.
        result = null;
    } finally {
        // Close the PersistenceManager even if an unexpected exception is thrown.
        pm.close();
    }
}
return result;
If your objective is to get an object by key, then you should use the PersistenceManager's getObjectById() method. More details here.
As an aside, constructing a query to get something by its key is something you shouldn't need to do. Although this is how you would work with an SQL database, the Google Data Store does things differently, and this is one of those cases where, rather than going through the trouble of constructing a query, Google App Engine lets you get what you want directly. After all, there should only be one entity in the database with a particular key, so nothing in the rest of the machinery of a GQL query is needed in this case, and it can all be skipped for efficiency.
I would recommend using JPA ( http://code.google.com/appengine/docs/java/datastore/usingjpa.html ) to access your data in GAE. It has the important advantage that you can use the widely known and documented JPA standard (and its JPQL query language) to do this kind of thing in a portable way: if you stick to the JPA standard, your code will work on GAE, with Hibernate, or with EclipseLink without modification.

Using CONTAINS from HQL / Criteria API

I'm using NHibernate 2.1.2.4000GA. I'm trying to use SQL Server's CONTAINS function from within HQL and the criteria APIs. This works fine in HQL:
CONTAINS(:value)
However, I need to qualify the table in question. This works fine:
CONTAINS(table.Column, :value)
However, I need to search across all indexed columns in my table. I tried this:
CONTAINS(table.*, :value)
But I get:
NHibernate.Hql.Ast.ANTLR.QuerySyntaxException : Exception of type 'Antlr.Runtime.MissingTokenException' was thrown. near line ... [select table.Id from Entities.Table table where CONTAINS(table.*,:p0) order by table.Id asc]
at NHibernate.Hql.Ast.ANTLR.ErrorCounter.ThrowQueryException()
at NHibernate.Hql.Ast.ANTLR.HqlParseEngine.Parse()
at NHibernate.Hql.Ast.ANTLR.QueryTranslatorImpl.Parse(Boolean isFilter)
at NHibernate.Hql.Ast.ANTLR.QueryTranslatorImpl.DoCompile(IDictionary`2 replacements, Boolean shallow, String collectionRole)
at NHibernate.Hql.Ast.ANTLR.QueryTranslatorImpl.Compile(IDictionary`2 replacements, Boolean shallow)
at NHibernate.Engine.Query.HQLQueryPlan..ctor(String hql, String collectionRole, Boolean shallow, IDictionary`2 enabledFilters, ISessionFactoryImplementor factory)
at NHibernate.Engine.Query.QueryPlanCache.GetHQLQueryPlan(String queryString, Boolean shallow, IDictionary`2 enabledFilters)
at NHibernate.Impl.AbstractSessionImpl.GetHQLQueryPlan(String query, Boolean shallow)
at NHibernate.Impl.AbstractSessionImpl.CreateQuery(String queryString)
So it would seem the HQL parser chokes on the asterisk. I thought of doing this:
CONTAINS(table.Column1, :value) or CONTAINS(table.Column2, :value)
Not only is this inefficient, it can also yield incorrect results depending on the full-text predicate in :value.
I tried customizing my dialect as per these instructions, but that doesn't help because you're still left with the same problem: specifying table.* causes the HQL parser to fall over.
I thought of specifying query substitutions:
<property name="query.substitutions">TABLECONTAINS(=CONTAINS(table.*,</property>
And then simply doing:
TABLECONTAINS(:value)
But that does not work. I'm not sure why - judging by the resultant error, the substitution just doesn't take place because "TABLECONTAINS" is still present in the query. Besides, this wouldn't work for all cases because I'd need to know the alias of the table and dynamically substitute it in.
Therefore, I rolled an interception-based approach:
using System;
using NHibernate;
using NHibernate.SqlCommand;

public class ContainsInterceptor : EmptyInterceptor
{
    public override SqlString OnPrepareStatement(SqlString sql)
    {
        var indexOfTableContains = sql.IndexOfCaseInsensitive("TABLECONTAINS(");
        if (indexOfTableContains != -1)
        {
            var sqlPart = sql.ToString();
            // Find the table alias by searching backwards from the TABLECONTAINS( marker.
            var aliasIndex = sqlPart.LastIndexOf("Table ", indexOfTableContains, StringComparison.Ordinal);
            if (aliasIndex == -1)
            {
                return sql;
            }
            aliasIndex += "Table ".Length;
            var alias = sqlPart.Substring(aliasIndex, sqlPart.IndexOf(" ", aliasIndex, StringComparison.Ordinal) - aliasIndex);
            // Rewrite the placeholder into a real CONTAINS over all indexed columns.
            sql = sql.Replace("TABLECONTAINS(", "CONTAINS(" + alias + ".*,");
        }
        return base.OnPrepareStatement(sql);
    }
}
This works and I will now be able to sleep tonight, but I do feel a sudden desire to attend London's impending papal procession and shout out a confession.
Can anyone suggest a better way to achieve this?
Why not use an ISQLQuery instead?
You can still retrieve entities, see http://nhibernate.info/doc/nh/en/index.html#querysql-creating
I would think that a custom dialect would be the appropriate way to handle this. You can find some guidance in this article. I've used this approach to register SQL Server-specific functions like ISNULL for use in our projects.

Google Datastore problem with query on *User* type

On this question I solved the problem of querying Google Datastore to retrieve stuff by user (com.google.appengine.api.users.User) like this:
User user = userService.getCurrentUser();
String select_query = "select from " + Greeting.class.getName();
Query query = pm.newQuery(select_query);
query.setFilter("author == paramAuthor");
query.declareParameters("java.lang.String paramAuthor");
greetings = (List<Greeting>) query.execute(user);
The above works fine, but after a bit of messing around I realized this syntax is not very practical once the need to build more complicated queries arises. So I decided to build my filters manually, and now I have something like the following (the filter is usually passed in as a string variable, but is built inline here for simplicity):
User user = userService.getCurrentUser();
String select_query = "select from " + Greeting.class.getName();
Query query = pm.newQuery(select_query);
query.setFilter("author == '"+ user.getEmail() +"'");
greetings = (List<Greeting>) query.execute();
Obviously this won't work, even though this syntax with field = 'value' is supported by JDOQL and works fine on other fields (String types and Enums). The other strange thing is that, in the Data Viewer of the App Engine dashboard, the 'author' field is stored as type User but its value is 'user@gmail.com'. And in the parameterized case above (which works fine), I declare the parameter as a String and then pass in an instance of User (user), which gets serialized with a simple toString() (I guess).
Does anyone have any idea?
Using string substitution in query languages is always a bad idea. It's far too easy for a user to break out and mess with your environment, and it introduces a whole collection of encoding issues, etc.
What was wrong with your earlier parameter substitution approach? As far as I'm aware, it supports everything, and it sidesteps any parsing issues. As far as the problem with knowing how many arguments to pass goes, you can use Query.executeWithMap or Query.executeWithArray to execute a query with an unknown number of arguments.
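As a rough illustration of the suggestion above, the following Java sketch collects filter clauses, parameter declarations, and values side by side, so that a filter with an unknown number of conditions can still be passed to Query.executeWithMap. The FilterSketch class and its method names are hypothetical helpers invented for this example, and the "content" field is an assumed second property of Greeting; only setFilter, declareParameters, and executeWithMap are JDO's own methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FilterSketch {
    private final StringJoiner filter = new StringJoiner(" && ");
    private final StringJoiner declarations = new StringJoiner(", ");
    private final Map<String, Object> values = new LinkedHashMap<>();

    // Adds "field == pN" to the filter, declares pN with the given type,
    // and remembers the value under the parameter name.
    public FilterSketch equalTo(String field, String typeName, Object value) {
        String param = "p" + values.size();
        filter.add(field + " == " + param);
        declarations.add(typeName + " " + param);
        values.put(param, value);
        return this;
    }

    public String filter() { return filter.toString(); }
    public String declarations() { return declarations.toString(); }
    public Map<String, Object> values() { return values; }

    public static void main(String[] args) {
        // Placeholder values; in the real application these would be the
        // User object from userService.getCurrentUser() and a real string.
        FilterSketch f = new FilterSketch()
                .equalTo("author", "com.google.appengine.api.users.User", "user-object-placeholder")
                .equalTo("content", "java.lang.String", "hello");
        System.out.println(f.filter());
        System.out.println(f.declarations());
        // With JDO on the classpath, the pieces would be used roughly as:
        //   query.setFilter(f.filter());
        //   query.declareParameters(f.declarations());
        //   greetings = (List<Greeting>) query.executeWithMap(f.values());
    }
}
```

Because the values never become part of the filter string, there is nothing for a user to break out of, and the User object is passed as an object instead of being squeezed into a quoted e-mail address.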

Do you expect query string parameter names to be case sensitive?

Silverlight is case sensitive for query string parameters, so the following code would return false with "callid=5":
string callId;
if (System.Windows.Browser.HtmlPage.Document.QueryString.TryGetValue("callId", out callId))
{
....
}
Microsoft defends the decision by citing the www.w3.org spec, but I think it leads to a less friendly experience for people trying to link to you, or give a URL over the phone.
Looks like Stackoverflow is case insensitive:
https://stackoverflow.com/search?q=silverlight+bug
https://stackoverflow.com/search?Q=silverlight+bug
I think you should focus on your naming conventions rather than the implementations of standards, making sure to avoid similar field names and mixed case. For example, you can adopt a convention that is easy to state when reading a URL out over the phone, such as "all lowercase" or "all uppercase".
I did this. Don't know if it helps.
var keyName = "";
if (!string.IsNullOrEmpty(keyName = someDictionary.SomeKeys.FirstOrDefault(k => k.ToLowerInvariant() == "size")))
{
var someValue = someDictionary[keyName];
}
Yes, I'm used to it being case sensitive, and therefore have been programming to it for a long time. I know of some people that have implemented methods to do intermediate parsing to convert them all to lowercase, or other things server side, and it really depends on what you are working with specifically.
As for usability, yes it is harder to read. BUT, at the same time a URL over the phone that has a querystring is not easy to give out anyway.
This workaround does not use the power of dictionaries because it iterates through all keys, but it is likely to be sufficient for most scenarios.
var keyName = HtmlPage.Document.QueryString.Keys.SingleOrDefault(key => key.Equals("callid", StringComparison.OrdinalIgnoreCase));
string callid = null;
// TryGetValue throws on a null key, so only look up when a matching key was found.
if (keyName != null) HtmlPage.Document.QueryString.TryGetValue(keyName, out callid);
If you perform many lookups, you could also transform the whole QueryString dictionary into a new dictionary with a case-insensitive comparer.
var insensitiveQueryString = HtmlPage.Document.QueryString.ToDictionary(pair => pair.Key, pair => pair.Value, StringComparer.OrdinalIgnoreCase);
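For readers outside .NET, the same idea can be sketched in Java, assuming the query string has already been parsed into a Map (the parsing itself, and the class and method names here, are assumptions made for illustration). A TreeMap built with String.CASE_INSENSITIVE_ORDER plays the role of the dictionary with a case-insensitive comparer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveParams {

    // Copies the parameters into a map whose key comparison ignores case.
    // If two keys differ only in case, the one inserted later overwrites
    // the value of the earlier one.
    public static Map<String, String> caseInsensitiveCopy(Map<String, String> params) {
        Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        copy.putAll(params);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        parsed.put("callid", "5"); // as it appeared in the URL

        Map<String, String> params = caseInsensitiveCopy(parsed);
        System.out.println(params.get("callId")); // found despite the different case
    }
}
```

The copy costs one pass over the parameters, after which every lookup is case-insensitive without scanning the keys again.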
