AppEngine full-text-search queries inconsistent for devserver/unit test environment - google-app-engine

I've noticed an odd behavior of the "production" full-text-search functionality of AppEngine:
1) I build a text search index in App Engine with a field like "full_name" with values like "Kenny G" or "G Money" as well as one entry "G G":
final Index index = SearchServiceFactory.getSearchService()
        .getIndex(IndexSpec.newBuilder().setName("MY_IDX").build());
index.put(Document.newBuilder().setId("doc1")
        .addField(Field.newBuilder().setName("name").setText("G G")).build());
index.put(Document.newBuilder().setId("doc2")
        .addField(Field.newBuilder().setName("name").setText("G G G")).build());
index.put(Document.newBuilder().setId("doc3")
        .addField(Field.newBuilder().setName("name").setText("Kenny G")).build());
index.put(Document.newBuilder().setId("doc4")
        .addField(Field.newBuilder().setName("name").setText("G Money")).build());
2) I then search the index using a query string of simply "G":
final QueryOptions options = QueryOptions.newBuilder().build();
final Query query = Query.newBuilder().setOptions(options).build("G");
for (final ScoredDocument doc : index.search(query).getResults()) {
    for (final Field field : doc.getFields()) {
        System.out.println(field.getType() + ": " + field.getName() + " - " + field.getText());
    }
}
In the devserver/unit test environment I get output of:
name (TEXT): G G
name (TEXT): G G G
name (TEXT): Kenny G
name (TEXT): G Money
Running the same thing in a JSP on "production" App Engine yields only:
name (TEXT): Kenny G
name (TEXT): G Money
Note that the document with field value "G G" (or "G G G" for that matter) is not returned.
Is there a documented limitation on minimum field length for indexing? Clearly the tokenizer accepts single-character tokens, or the "Kenny G" example would fail. Very puzzling; it looks like a potential bug, but perhaps I've missed something in the documentation.

The production servers implement a feature where consecutive single letters in a document text field are treated as a group. If you search for [ggg] you should find the document that had the three G's. The idea is that consecutive single letters are probably an acronym and should be grouped; otherwise, consecutive single letters don't generally appear in normal (English-language) text.
This is an example of an advanced feature that is not simulated in the devserver, unfortunately.
By the way, this behavior is sensitive to the language specified with the document. For example, in French the behavior is different.
We plan to add documentation to clarify this feature.
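As a rough illustration (my own sketch, not Google's actual tokenizer), the grouping rule could be modeled like this in Python:

```python
def tokenize(text):
    """Rough sketch of the described behavior: lowercase the text,
    split on whitespace, and collapse each run of consecutive
    single-letter words into one 'acronym' token."""
    words = text.lower().split()
    tokens = []
    run = []  # current run of single-letter words
    for w in words:
        if len(w) == 1:
            run.append(w)
        else:
            if run:
                tokens.append("".join(run))
                run = []
            tokens.append(w)
    if run:
        tokens.append("".join(run))
    return tokens
```

With these tokens, a query for "g" matches "Kenny G" and "G Money" but not "G G" or "G G G", whose runs collapse to the single tokens "gg" and "ggg", which is exactly the production behavior described above.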

Related

Solr search with AND operator with strict ordering

Let's say I have a query like:
search_field:A AND search_field:B
that looks for a target that contains both A and B
so the result would be:
AcccccB
BcccccA
...
However, is there a way to make the query order-sensitive, so that it matches only when the terms appear in that order?
For example with pseudo query
search_field:A AND THEN search_field:B
which would yield
AcccccB
...
The logic is that, based on the query, it looks for A and B but in that order only. So while BcccccA contains both A and B, it is filtered out because B comes before A.
Tried
I tried wildcards, but they don't work for phrases:
AcccccB
Acc ccB
Bcc Acc < can't filter out
Thank you in advance; let me know if I can make the question clearer.
1. string: IF the value of search_field is stored as one token (string field type), THEN you may be able to use a wildcard pattern or a regular expression to match the value. To match single-token string type fields, where an A appears before a B:
q=search_field:*A*B*
or
q=search_field:/.*A.*B.*/
For more, see this Solr Regex Tutorial. In the tutorial example the same value is stored twice, once in a string field and once in a text field.
An example of this in the Solr "techproducts" example data is the field pair: author (text) and author_s (string). Order is significant: the query q=author_s:*g*t* returns books by George R.R. Martin, and the query
q=author_s:*t*g* returns a book by Grant Ingersoll.
2. text: IF the value of search_field is indexed as multiple tokens (such as when each word is a token), and A is in a separate token from B, THEN you may be able to use the Complex Phrase Query Parser with inOrder=true (default).
2a. text, adjacent tokens: IF A and B must appear in adjacent tokens in the field value, THEN a complex phrase query with no ~ proximity can be used:
{!complexphrase}search_field:"*A* *B*"
{!complexphrase}search_field:"/.*A.*/ /.*B.*/"
Adjacency example: In the "techproducts" sample data, {!complexphrase}author:"*t* *g*" does return the book by Grant Ingersoll, but {!complexphrase}author:"*g* *t*" does not return the books by George R.R. Martin.
2b. text, nearby: IF the tokens are not necessarily adjacent but are nearby, THEN use a complex phrase query, suffixed with a ~ proximity token count. For example, within 10 words or tokens:
{!complexphrase}search_field:"*A* *B*"~10
{!complexphrase}search_field:"/.*A.*/ /.*B.*/"~10
In the "techproducts" sample data, {!complexphrase}author:"*g* *t*"~10 does return the books by George R.R. Martin and not the book by Grant Ingersoll.
Note: neither 2a nor 2b will match single-token values where A is followed by B. To also include single-token matches, specify the OR of a single-token pattern and a multiple-token pattern:
{!complexphrase} search_field:*A*B* OR search_field:"*A* *B*"~10
If the distance between A and B is at most 99 positions, and they are tokens by themselves rather than parts of another token, you can use the surround query parser:
q={!surround}99w(A,B)
99 is the maximum distance; w means ordered search (n would mean unordered).
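To make the semantics concrete, here is a small Python model of what an ordered proximity query like 99w(A,B) checks over a token stream (ordered_within is a hypothetical helper for illustration, not Solr code):

```python
def ordered_within(tokens, a, b, max_dist):
    """True if some occurrence of token a precedes an occurrence of
    token b by at most max_dist positions (the ordered, 'w' case).
    Dropping the '0 <' check would give the unordered 'n' case."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(0 < pb - pa <= max_dist for pa in pos_a for pb in pos_b)
```

For example, ordered_within("A c c c c c B".split(), "A", "B", 99) is true, while the same call on "B c c c c c A".split() is false, matching the filtering described in the question.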

Entity Framework complex search function

I'm using Entity Framework with a SQL Express database, and now I have to build a search function that finds users based on a value typed into a textbox, where the end user can type anything he wants (like Google).
What is the best way to create such a search function? The input should search all columns.
So for example, I have 4 columns. firstname,lastname,address,emailaddress.
When someone types in the searchbox foo, all columns need to be searched for everything that contains foo.
So I thought I just could do something like
context.Users.Where(u =>
    u.Firstname.Contains("foo") ||
    u.Lastname.Contains("foo") ||
    u.Address.Contains("foo") ||
    u.EmailAddress.Contains("foo")
);
But... the end user may also type in foo bar, and then the space in the search value becomes an "and" requirement: all columns should be searched, and for example firstname might be foo while lastname is bar.
I think this is too complex for a LINQ query?
Maybe I should create a search index and combine all columns into the search index like:
[userId] [indexedValue] where indexedValue is [firstname + " "+ lastname + " "+ address +" " + emailaddress].
Then first split the search value based on spaces and then search for columns that have all words in the search value. Is that a good approach?
The first step with any project is managing expectations. Find the minimum viable solution for the business need and develop that; expand on it as the business value is proven. A really flexible, intelligent-feeling search capability would of course make the business happy, but it often won't do what they expect, or perform to the standard they need, where a simpler solution would do what they need, be simpler to develop, and execute faster.
If this represents the minimum viable solution and you want to "and" conditions based on spaces:
public IQueryable<User> SearchUser(string criteria)
{
    if (string.IsNullOrEmpty(criteria))
        return new List<User>().AsQueryable();

    var criteriaValues = criteria.Split(' ');
    var query = context.Users.AsQueryable();
    foreach (var value in criteriaValues)
    {
        query = query.Where(u =>
            u.Firstname.Contains(value)
            || u.Lastname.Contains(value)
            || u.Address.Contains(value)
            || u.EmailAddress.Contains(value));
    }
    return query;
}
The trouble with trying to index the combined values is that there is no guarantee, for a value like "foo bar", that "foo" represents a first name and "bar" a last name, or that "foo" is a complete rather than a partial value. You'd also want to consider stripping out commas and other punctuation, as someone might type "smith, john".
When it comes to searching, it might pay to perform a bit more of a pattern match to detect what the user might be searching for. For instance, a single word like "smith" might first search for an exact match on first name or last name and display results; if there were no matches, then perform a Contains search. If the value contains two words, try a first & last name match, assuming "first last" vs. "last, first". If the value has an "@" symbol, default to an e-mail address search; if it starts with a number, an address search. Each detected search option could have a first-pass search (expecting more exact values) and then a broader second-pass search if the first comes back empty. There could even be third and fourth passes with broader checks. When results are presented, a "more results..." button could trigger the next pass if the returned results didn't include what the user was expecting.
The idea: try the most typical, narrow search first and let the user broaden it if they so desire. The goal is to "hit" the most relevant results early, helping mold how users enter their criteria, and then tune based on user feedback rather than write queries that return as many hits as possible. Help users find what they are looking for on the first page of results. Either way, building a useful search will add complexity or mean leveraging new 3rd-party libraries, so first determine whether that capability is really required.
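A minimal sketch of that detection step, assuming the heuristics described above (the function name and category labels are my own, purely illustrative):

```python
def classify_criteria(criteria):
    """Heuristic first-pass classification of a search-box value:
    '@' suggests an e-mail, a leading digit suggests an address,
    a comma suggests 'last, first', two words suggest 'first last'."""
    value = criteria.strip()
    if "@" in value:
        return "email"
    if value and value[0].isdigit():
        return "address"
    if "," in value:
        return "last_first_name"
    if len(value.split()) == 2:
        return "first_last_name"
    return "name"  # single word: exact name match, then Contains fallback
```

Each category would then map to its own narrow first-pass query, with broader fallback passes as described.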

How to do better text search in neo4j

I have two types of nodes, Article and TAG, where TAG has two properties, id and name. Now I want to search all the articles according to tags.
(a : Article)-[:TAGGED]->(t : TAG)
e.g. if I have a tag like "i love my country" and my query string is "country", then the search succeeds using the following query.
Match (a : Article)-[:TAGGED]->(t : TAG)
where t.name =~ '.*country.*'
return a;
But the reverse is not possible: if my tag is "country" and I search for "i love my country", it should also display the articles related to country. It should also handle the case when the user has entered more than one space between two words. While searching I came across Lucene and Solr, but I don't know how to use them. And I am using PHP as my coding language.
[EDITED]
Original Answer
This should work for you:
MATCH (a: Article)-[:TAGGED]->(t:TAG)
WHERE ANY(word IN FILTER(x IN SPLIT({searchString}, " ") WHERE x <> '')
WHERE t.name CONTAINS word)
RETURN a;
{searchString} is your search string, with one or more spaces separating words; e.g.:
"i love my country"
This snippet generates a collection of the non-empty words in {searchString}:
FILTER(x IN SPLIT({searchString}, " ") WHERE x <> '')
Improved Answer
This query matches on words (e.g., if the query string is "i love you", the "i" will only match "i" or "I" as a word in the tag, not just any letter "i"). It is also case insensitive.
WITH REDUCE(res = [], w IN SPLIT({searchString}, " ") |
CASE WHEN w <> '' THEN res + ("(?i).*\\b" + w + "\\b.*") ELSE res END) AS res
MATCH (a: Article)-[:TAGGED]->(t:TAG)
WHERE ANY (regexp IN res WHERE t.name =~ regexp)
RETURN a;
The REDUCE clause generates a collection of words from {searchString}, each surrounded by "(?i).*\b" and "\b.*" to become a regular expression for doing a case insensitive search with word boundaries.
NOTE: the backslashes ("\") in the regular expression actually have to be doubled up because the backslash is an escape character.
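For reference, the same regex construction and matching can be sketched in Python (Cypher's =~ requires the regex to match the whole string, which re.fullmatch mirrors; these helper names are mine):

```python
import re

def build_regexes(search_string):
    # One case-insensitive, word-bounded regex per non-empty word,
    # mirroring the REDUCE clause in the Cypher query above.
    return [r"(?i).*\b" + re.escape(w) + r"\b.*"
            for w in search_string.split(" ") if w != ""]

def tag_matches(tag_name, search_string):
    # ANY(regexp IN res WHERE t.name =~ regexp)
    return any(re.fullmatch(rx, tag_name) for rx in build_regexes(search_string))
```

So a tag "country" matches the search string "i love my country" (the word "country" matches), while a search for just "i" does not match a tag like "in paris", because "\bi\b" requires "i" as a standalone word.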
Neo4j uses Lucene indices internally for fulltext search.
Based on this page from the user guide, it appears that the default indexing 'type' is exact using the Lucene Keyword Analyzer which doesn't tokenize the input.
What that means is that without changing this indexing setting you can only run queries that match the entire tag name (in your example, you're running a wildcard query '.*country.*' that matches the whole tag string).
What I think you actually want, based on your stated requirements, is tokenization on whitespace (type=fulltext) at the time you insert the graph data, so that the tag field actually contains one token per word: 1-i 2-love 3-my 4-country, any one of which can match a query term (without needing wildcards, e.g. "country" or "I love my chocolate").

Is this a bug in the GAE Search API?

I'm implementing a full text search based on the song database on GuitarParty.com. The data consists of lyrics in multiple languages, which is not a problem per se.
However, when search results are returned using snippeted_fields all accented characters within words, such as ÚúÉéÍí, are returned using their generic unaccented versions, UuEeIi.
This is how I form my query:
query = search.Query(
    query_string=qs,
    options=search.QueryOptions(
        sort_options=search.SortOptions(
            #match_scorer=search.MatchScorer(),
            match_scorer=search.RescoringMatchScorer(),
            expressions=[
                search.SortExpression(expression='_score + importance * 0.03', default_value=0)
                #search.SortExpression(expression='_score', default_value=0)
            ],
            limit=1000,
        ),
        cursor=cursor,
        returned_fields=['title', 'atomtitle', 'item', 'image'],
        snippeted_fields=['title', 'atomtitle', 'body', 'item'],
    )
)
I'm pretty sure this is not an encoding issue, since everything looks just right if I pull my document fields directly (as I do with the titles). It's only the snippeted expressions that display incorrectly.
To better see what I'm referring to you can take my test engine for a spin here: http://gp-search.appspot.com/ and search for something Icelandic. Example phrase: Vísur vatnsenda Rósu
This will return a document with this snippet:
Augun min og augun þin. O þa fogru steina. Mitt er þitt og þitt er mitt, þu veist hvað eg mei- na. Langt er siðan sa eg hann sannlega friður var hann.
Correctly spelled snippet should be:
Augun mín og augun þín. Ó þá fögru steina. Mitt er þitt og þitt er mitt, þú veist hvað eg mei- na. Langt er síðan sá ég hann sannlega friður var hann.
Am I better off generating my own snippet from the document data, or is there something I can do to pull snippets with accented characters within words?
The data you put in gets normalized so that you don't have to worry about accents, or missing accents, when searching it.
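That normalization can be approximated in Python: decompose the text to NFD and drop the combining marks (a sketch of the concept, not the Search API's actual code). Note that letters with no decomposition, such as þ and ð, survive, which is consistent with the snippet shown above:

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD (base letter + combining mark), then drop
    the combining marks, so 'í' becomes 'i', 'Ó' becomes 'O', etc."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, strip_accents("Vísur vatnsenda Rósu") yields "Visur vatnsenda Rosu", matching the snippet output described in the question.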

App engine - easy text search

I was hoping to implement an easy but effective text search for App Engine that I could use until official text search capabilities for App Engine are released. I see there are libraries out there, but it's always a hassle to install something new. I'm wondering if this is a valid strategy:
1) Break each property that needs to be text-searchable into a set(list) of text fragments
2) Save record with these lists added
3) When searching, just use equality filters on the list properties
For example, if I had a record:
{
firstName="Jon";
lastName="Doe";
}
I could save a property like this:
{
firstName="Jon";
lastName="Doe";
// not case sensitive:
firstNameSearchable=["j","o","n","jo","on","jon"];
lastNameSearchable=["d","o","e","do","oe","doe"];
}
Then to search, I could do this and expect it to return the above record:
//pseudo-code:
SELECT person
WHERE firstNameSearchable=="jo" AND
lastNameSearchable=="oe"
Is this how text searches are implemented? How do you keep the index from getting out of control, especially if you have a paragraph or something? Is there some other compression strategy that is usually used? I suppose if I just want something simple, this might work, but it's nice to know the problems that I might run into.
Update:::
Ok, so it turns out this concept is probably legitimate. This blog post also refers to it: http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html
Note: the source code in the blog post above does not work with the current version of Lucene. I installed the older version (2.9.3) as a quick fix since google is supposed to come out with their own text search for app engine soon enough anyway.
The solution suggested in the response below is a nice quick fix, but due to big table's limitations, only works if you are querying on one field because you can only use non-equality operators on one property in a query:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
If you want to query on more than one property, you can save indexes for each property. In my case, I'm using this for some auto-suggest functionality on small text fields, not actually searching for word and phrase matches in a document (you can use the blog post's implementation above for this). It turns out this is pretty simple and I don't really need a library for it. Also, I anticipate that if someone is searching for "Larry" they'll start by typing "La..." as opposed to starting in the middle of the word: "arry". So if the property is for a person's name or something similar, the index only has the substrings starting with the first letter, so the index for "Larry" would just be {"l", "la", "lar", "larr", "larry"}
I did something different for data like phone numbers, where you may want to search starting from the beginning or from middle digits. In this case, I just stored the entire set of substrings of length 3 or more, so the phone number "123-456-7890" would be: {"123", "234", "345", ..., "123456789", "234567890", "1234567890"}, a total of (10*(10+1)/2) - (10+9) = 36 index entries. Actually what I did was a little more complex, in order to remove some unlikely-to-be-used substrings, but you get the idea.
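The two index builders described above can be sketched in Python (function names are mine, and the real implementation reportedly also pruned unlikely substrings):

```python
def prefix_index(value):
    """Index for name-like fields: every prefix, lowercased,
    e.g. 'Larry' -> {'l', 'la', 'lar', 'larr', 'larry'}."""
    v = value.lower()
    return {v[:i] for i in range(1, len(v) + 1)}

def substring_index(value, min_len=3):
    """Index for phone-number-like fields: all substrings of
    length >= min_len, so middle digits are searchable too."""
    return {value[i:j]
            for i in range(len(value))
            for j in range(i + min_len, len(value) + 1)}
```

For a 10-digit string this yields the 36 entries counted above; for a name, the prefix index stays linear in the name's length.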
Then your query would be:
(Pseudo-code)
SELECT * FROM Person WHERE
firstNameSearchIndex == "lar" AND
phonenumberSearchIndex == "1234"
The way App Engine works, if the query value matches any of the substrings in the list property, that counts as a match.
In practice, this won't scale. A string of n characters has n(n+1)/2 possible substrings, so the index entries grow quadratically with field length: a 500-character string already needs 125,250 of them, and every one is a separate datastore index write, making the entity painfully slow and expensive to store.
Implementations like search.SearchableModel create one index entry per word, which is a bit more realistic. You can't search for arbitrary substrings, but there is a trick that lets you match prefixes:
From the docs:
db.GqlQuery("SELECT * FROM MyModel
WHERE prop >= :1 AND prop < :2",
"abc", u"abc" + u"\ufffd")
This matches every MyModel entity with
a string property prop that begins
with the characters abc. The unicode
string u"\ufffd" represents the
largest possible Unicode character.
When the property values are sorted in
an index, the values that fall in this
range are all of the values that begin
with the given prefix.
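The trick quoted above can be demonstrated with plain Python string comparison, since datastore indexes sort string values the same lexicographic way (a sketch, not App Engine code):

```python
def prefix_range(prefix):
    """The range [prefix, prefix + u'\ufffd') covers exactly the
    values that begin with prefix, because u'\ufffd' sorts after
    any character that could follow the prefix."""
    return prefix, prefix + u"\ufffd"

def match_prefix(values, prefix):
    # Equivalent of: WHERE prop >= :1 AND prop < :2, scanned in index order.
    lo, hi = prefix_range(prefix)
    return [v for v in sorted(values) if lo <= v < hi]
```

For example, match_prefix(["abc", "abcd", "abd", "ab"], "abc") returns only "abc" and "abcd": "ab" sorts before the lower bound, and "abd" sorts after the upper bound.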
