Wildcard search on Appengine in python - google-app-engine

I'm just starting with Python on Google App Engine building a contact database. What is the best way to implement wildcard search?
For example can I do query('name=', %ewman%)?

Unfortunately, Google App Engine can't do partial text matches.
From the docs:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
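To see why the two inequality filters select exactly the prefix matches, here is a minimal sketch that applies the same bounds to a sorted Python list instead of a datastore index. The function name prefix_range and the sample names are made up for illustration.

```python
import bisect

def prefix_range(sorted_values, prefix):
    # Same bounds as the GQL query: prefix <= value < prefix + U+FFFD.
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, prefix + u"\ufffd")
    return sorted_values[lo:hi]

# A sorted list stands in for the datastore's index on the property.
names = sorted([u"abacus", u"abc", u"abcdef", u"abd", u"xabc"])
matches = prefix_range(names, u"abc")
print(matches)
```

Note that u"xabc" is not returned: the range only covers values that start with the prefix, which is why this trick cannot answer a %ewman% style query.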

App Engine can't do 'like' queries, because it can't do them efficiently. Nor can your SQL database, though: A 'foo LIKE "%bar%"' query can only be executed by doing a sequential scan over the entire table.
What you need is an inverted index. Basic fulltext search is available in App Engine with SearchableModel. Bill Katz has written an enhanced version here, and there's a commercial solution for App Engine (with a free version) available here.
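As a rough illustration of what an inverted index is, here is a small in-memory sketch in plain Python. It is not how SearchableModel is implemented; the helper names and the sample contacts are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lower-case the text and split it into word tokens.
    return re.findall(r"\w+", text.lower())

def build_index(records):
    # Map each token to the set of record ids whose text contains it.
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index

def search(index, query):
    # Return ids of records containing every token of the query.
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

contacts = {1: "Paul Newman", 2: "Randy Newman", 3: "Paul Simon"}
index = build_index(contacts)
print(search(index, "newman"))
print(search(index, "paul newman"))
```

Each lookup is a dictionary access rather than a scan over every record. The trade-off is that this matches whole words only, so "ewman" finds nothing unless you also index substrings or n-grams.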

Related

Language support for Google Search API

I'd like to perform a partial text / phrase search against a Datastore record field using Ruby.
I've figured out how to do it with a range constraint (>= the search term and < the search term followed by "\ufffd"), but that only matches from the beginning of the field.
This works; querying for "Ener" returns "Energizer AA Batteries" but querying for "AA" does not return the same.
In the docs for the Python Google Client's Search API, it documents the ability to manually create indexes which allow for both atomic and partial word searches.
https://cloud.google.com/appengine/docs/standard/python/search/ says:
Tokenizing string fields When an HTML or text field is indexed, its
contents are tokenized. The string is split into tokens wherever
whitespace or special characters (punctuation marks, hash sign, etc.)
appear. The index will include an entry for each token. This enables
you to search for keywords and phrases comprising only part of a
field's value. For instance, a search for "dark" will match a document
with a text field containing the string "it was a dark and stormy
night", and a search for "time" will match a document with a text
field containing the string "this is a real-time system".
In the docs for Ruby and PHP, I cannot find such an API reference to enable me to do the same thing. Is it possible to perform this type of query in Ruby / PHP with Cloud Datastore?
If so, can you point me to the docs? If not, is there a workaround, for example creating indexes with the Python Search API and then configuring the PHP/Ruby client to execute its queries against those indexes?
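The tokenizing behaviour quoted above can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the Search API's actual tokenizer; the function names are hypothetical and the splitting rule (any non-alphanumeric character) is an assumption.

```python
import re

def tokens(text):
    # Split wherever whitespace or punctuation appears; drop empty pieces.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def token_match(query, field_value):
    # A query matches when it equals one of the field's tokens.
    return query.lower() in tokens(field_value)

print(tokens("this is a real-time system"))
print(token_match("AA", "Energizer AA Batteries"))
print(token_match("Ener", "Energizer AA Batteries"))
```

Token matching finds "AA" in the middle of the field, which the range constraint cannot do. It does not find "Ener", so the two techniques cover different cases.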

Azure Search Service a field that has . or # not decoding no results

We are using the .net library for azure search, I have successfully built the index and stored data in the index. One of our fields is called Tags which is a collection of strings and it is marked as searchable. So we put values in this field such as C# .NET.
The problem is when searching the search service will not hit on C#, it will on C, nor will it hit on .NET but it will on NET. I can see thru fiddler that the search term is encoding the # and also the ., but it doesn't seem like it's getting decoded on the azure side.
The behavior you're seeing is the result of tokenization performed by the standard analyzer used by Azure Search. By default it breaks on many punctuation characters like # and . (you can get all the details of text analysis in Azure Search here).
We're looking into adding support for custom analyzers that would let you exclude characters such as # and . from word-breaking, but this is still in the planning stages. In the meantime, as a workaround we suggest encoding these characters in your application before indexing and querying (e.g. -- C# -> CSharp, .NET -> dotNET).
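The encoding workaround could look roughly like the sketch below, written in Python for brevity even though the thread uses .NET. The mapping table and the function name encode_terms are hypothetical. The one essential point is that the same function has to run on values before indexing and on search terms before querying.

```python
# Hypothetical mapping; extend it with whatever terms your data contains.
ENCODINGS = {"c#": "CSharp", ".net": "dotNET"}

def encode_terms(text):
    # Replace each whitespace-separated word that has a known encoding.
    return " ".join(ENCODINGS.get(word.lower(), word) for word in text.split())

print(encode_terms("C# .NET"))
```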
Thanks Bruce, for now I have just created a function in our search implementation that removes punctuation from the search term provided by the end user. This way I don't have to go thru and update all search index/records.
private string SanitizeValue(string value)
{
    // Remove everything except letters, digits and whitespace.
    return Regex.Replace(value, @"[^a-zA-Z0-9\s]", "");
}
You could try using Regex search, like searching for this string: /.*c\#.*/. Also make sure you set SearchParameters.QueryType = QueryType.Full.

Implementing keyword search on Google App Engine?

I'm trying to implement a keyword/tags search for a certain entity type in GAE's datastore:
class Docs(db.Model):
    title = db.StringProperty()
    user = db.StringProperty()
    tags = db.StringListProperty()
I also wrote a very basic search function (using a fake list of tags, not the datastore values), that takes a query string and matches it to each set of tags. It ranks Docs based on how many query words match the tags. This is pretty much all I need it to do, except using the actual datastore.
I have no idea how to get the actual datastore values though. For my search function to work I need a list of all the entities in the datastore, which is impossible (?).
I also tried looking into GAE's experimental full-text search, and Relation Index Entities as a way to search the datastore without using the function I wrote. Neither was successful.
Any other ideas on how to search for entities based on tags?
It's a very simple query. If you need to find all Docs with the tag "findme", use an equality filter; on a list property it matches when any element of the list equals the value:
num_results = 10
query = Docs.all().filter("tags =", "findme")
results = query.fetch(num_results) # get list of results
It's well documented:
https://developers.google.com/appengine/docs/python/datastore/queries
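Once entities have been fetched with a query like the one above, the ranking described in the question (ordering Docs by how many query words match their tags) can be done in memory. The following is a minimal sketch under that assumption; plain (title, tags) tuples stand in for fetched Docs entities, and the function name is made up.

```python
def rank_by_tag_matches(docs, query):
    # Score each doc by the number of query words found among its tags.
    words = set(query.lower().split())
    scored = []
    for title, tags in docs:
        score = len(words & set(tag.lower() for tag in tags))
        if score:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scored]

docs = [
    ("Intro to GAE", ["python", "appengine"]),
    ("Datastore tips", ["python", "datastore", "appengine"]),
    ("Cooking", ["food"]),
]
print(rank_by_tag_matches(docs, "python datastore"))
```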

web2py like equivalents with google app engine

Is there any way to generate queries similar to the like, contains, startswith operators with app engine BigTable database?
So that I could do something similar to:
db(db.some_table.like('someting')).select()
with app engine in web2py.
App engine does not support full text search so short answer is no.
What you can do with web2py is create a computed field with a list of keywords to search for.
import re
def tokenize(r): return [x.lower() for x in re.findall(r'\w+', r.title)]
db.define_table('data',
    Field('title'),
    Field('keywords', 'list:string', compute=tokenize, writable=False, readable=False))
On GAE the keywords field is a StringListProperty().
Then instead of searching in title, you search in keywords:
rows = db(db.data.keywords.contains(my_keyword.lower())).select()
This works on GAE and it is very efficient. The problem now is that you will not be able to combine it in complex queries because of the GAE "exploding" indexes problem. For example, if you have N keywords and want to search for two keywords:
rows = db(db.data.keywords.contains(my_keyword1.lower())&
db.data.keywords.contains(my_keyword2.lower())).select()
Your index size becomes N^2. So you have to perform more complex queries locally:
query2 = lambda r: my_keyword2.lower() in r.keywords
rows = db(db.data.keywords.contains(my_keyword1.lower())).select().find(query2)
All of this will also work on GAE and not-on-GAE. It is portable.
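The two-step pattern (one keyword filtered by the datastore, the remaining keywords checked locally) can be tried without web2py. In this sketch a list of dicts stands in for the table, and the names make_row and find_all are hypothetical.

```python
import re

def make_row(title):
    # Mirrors the computed field: keywords are the lower-cased words of the title.
    keywords = [word.lower() for word in re.findall(r"\w+", title)]
    return {"title": title, "keywords": keywords}

def find_all(rows, first, *rest):
    # Step 1: stands in for the single contains() query run by the datastore.
    candidates = [r for r in rows if first.lower() in r["keywords"]]
    # Step 2: the remaining keywords are checked locally, like select().find().
    return [r for r in candidates
            if all(k.lower() in r["keywords"] for k in rest)]

rows = [make_row(t) for t in ("Red Apple Pie", "Green Apple", "Red Wine")]
print([r["title"] for r in find_all(rows, "red", "apple")])
```

Putting the most selective keyword first keeps the candidate set returned by step 1 small.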

What is a good pattern for inexact queries in the Google App Engine Datastore?

The Google App Engine Datastore query language (GQL) does not offer inexact operators like "LIKE" or even case insensitivity. One can get around the case-sensitivity issue by storing a lower-case version of a field. But what if I want to search for a person but I'm not sure of the spelling of the name? Is there an accepted pattern for dealing with this scenario?
Quoting from the documentation:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
Another option is the SearchableModel; however, I don't believe it supports partial matches.
http://billkatz.com/2008/8/A-SearchableModel-for-App-Engine
You could store a Soundex (http://effbot.org/librarybook/soundex.htm) version of the name in the datastore. Then, to query a name, compute the Soundex code of the query and look that up.
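For reference, here is a small self-contained sketch of the common American Soundex rules. It is not taken from the linked page. The idea would be to store soundex(name) in an extra property when saving an entity and to filter on soundex(query) with an equality filter; that property is an assumption of this example.

```python
# Letter-to-digit table for American Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, group in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                     ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in group:
        CODES[letter] = digit

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        # Emit a digit only when it differs from the preceding code.
        if code and code != previous:
            result += code
        # h and w do not separate two letters that share a code; vowels do.
        if c not in "hw":
            previous = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]

print(soundex("Newman"), soundex("Neumann"))
```

Names that sound alike, such as Newman and Neumann, get the same code, so an exact-match query on the stored code finds both spellings.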

Resources