Co-occurrence of words in documents with Google big table - google-app-engine

Given document-D1: containing words (w1,w2,w3)
and document D2 and words (w2,w3..)
and document Dn and words ( w1,w2, wn)
Can I structure my data in big table to answer the questions like:
which words occur most frequently with w1,
or which words occur most frequently with w1 and w2.
What I am trying to achieve is to find the third word Wx (suggestion) which ocures most frequently in documents togehter with given words W1 and W2
I know the solution in SQL, but is it possible with google-big table?
I know I would have to build my indices by myself, the question is how should I structure them to avoid index explosion
thanks
almir

The only way to do this that I'm aware of is to index all 3-tuples of words, with their counts. Your kind would look something like this:
class Tuple(db.Model):
words = db.StringListProperty()
count = db.IntegerProperty()
Then, you need to insert or update the appropriate tuple entity for each set of 3 unique words in your text. Eg, the string "the king is dead" would result in the tuples (the, king, is), (the, king, dead), (the, is, dead), (king, is, dead)... This obviously results in an exponential explosion in entries, but I'm not aware of any way around that for what you want to do.
To find the suggestions, you'd do something like this:
q = Tuple.all().filter('word =', w1).filter('word =', w2).order('-count')
In the broader sense of recommendation algorithms, however, there is a lot of research into more efficient ways to do this. It's an open question, as evidenced by the existence of the Netflix challenge.

Using list-properties and merge-join is the best way to answer set membership questions in Google App Engine: Building Scalable, Complex Apps on App Engine.
You could setup your model as follows:
class Document(db.Model):
word = db.StringListProperty()
name = db.StringProperty()
...
doc.word = ["google", "app", "engine"]
Then it would be easy to query for co-occurrence. For example, which documents have the words google and engine?
results = db.GqlQuery(
"SELECT * FROM Documents "
"WHERE word = 'google'"
" and word = 'engine'")
docs = [d.name for d in results]
There are some limitations, though. From the presentation:
Index writes are done in parallel on
Bigtable Fast-- e.g., update a list
property of 1000 items with 1000 row
writes simultaneously! Scales linearly
with number of items Limited to 5000
indexed properties per entity
But queries must unpackage all result
entities When list size > ~100, reads
are too expensive! Slow in wall-clock
time Costs too much CPU
You could also create a model of words and save in the StringListProperty only their keys, but depending on the size of your documents even that would not be feasible.

There is nothing inherent to the AppEngine datastore that will help you with this problem. You will need to index the words in the documents programatically.

Related

How to get the count of property in a Kind?

I have a Kind Students which stores the details of favorite colors of all students. They are allowed to pick their favorite color from a set of three colors : {Red,Blue,Green}
Let us assume there are 100 students, my code is like this for every student :
Entity arya = new Entity("Student","Arya");
arya.setProperty("Color","Red");
Entity robb = new Entity("Student","Robb");
robb.setProperty("Color","Green");
..
..
Entity jon = new Entity("Student","Jon");
jon.setProperty("Color","Blue");
How to find out how many students liked a particular color(say Red) in this Student Kind ? What Query I should write to fetch the count ?
Thanks in advance
The number you seek would be the number of items in the result of a query with an equality filter on the Color property.
You could use a keys-only query (a special kind of projection query) for this purpose, faster and less expensive:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
...
A keys-only query is a small operation and counts as only a single
entity read for the query itself.
Something along these lines (but note that I'm not a java user, the snippet is based only on the documentation examples)
Query<Key> query = Query.newKeyQueryBuilder()
.setKind("Student")
.setFilter(PropertyFilter.eq("Color", "Red")
.build();
I agree with the Dan Cornilescu's answer. Here is a direct Datastore API usage. I have prepared the request body for your use-case. You can run it by just adding your Project Id. This will return the entities that matches with the filter then you can count the number of them.

Google Datastore - Search Optimization Technique

I am dealing with a real-estate app. A Home will hvae typical properties like Price, Bed Rooms, Bath Rooms, SqFt, Lot size etc. User will search for Homes and such a query will require multiple inequality filters like: Price between x and y, rooms greater than z, bathrooms more than p... etc.
I know that multiple inequality filters are not allowed. I also do not want to perform any filtering in my code and/because I want to be able to use Cursors.
so I have come up with two solutions. I am not sure if these are right - so wonder if gurus can shed some light
Solution 1: I will discretize the values of each attribute and save them in a list-field, then use IN. For example: If there are 3 bed rooms, instead of storing beds=3, I will store beds = [1,2,3]. Now if a user searches for homes with say at least two bedrooms, then instead of writing the filter as beds>2, I will write the filter as "beds IN [2]" - and my home above [1,2,3] will qualify - so so will any home with 2 beds [1,2] or 4 beds [1,2,3,4] and so on
Solution 2: It is similar to the first one but instead of creating a list-property, I will actually add attributed (columns) to the home. So a home with 3 bed rooms will have the following attributed/columns/properties: col-bed-1:true, col-bed-2:true, col-bed-3:true. Now if a user searches for homes with say at least two bedrooms, then instead of writing the filter as beds>2, I will write the filter as "col-bed-2 = true" - and my home will qualify - so will any home with 2 beds, 3 beds, 4 beds and so on
I know both solutions will work, but I want to know:
1. Which one is better both from a performance and google pricing perspective
2. Is there a better solution to do this?
I do almost exactly your use case with a python gae app that lists posts with housing advertisements (similar to craigslist). I wrote it in python and searching with a filter is working and straightforward.
You should choose a language: Python, Java or Go, and then use the Google Search API (that has built-in filtering for equalities or inequalities) and build datastore indexes that you can query using the search API.
For instance, you can use a python class like the following to populate the datastore and then use the Search API.
class Home(db.Model):
address = db.StringProperty(verbose_name='address')
number_of_rooms = db.IntegerProperty()
size = db.FloatProperty()
added = db.DateTimeProperty(verbose_name='added', auto_now_add=True) # readonly
last_modified = db.DateTimeProperty(required=True, auto_now=True)
timestamp = db.DateTimeProperty(auto_now=True) #
image_url = db.URLProperty();
I definitely think that you should avoid storing permutations for several reasons: Permutations can explode in size and makes the code difficult to read. Instead you should do like I did and find examples where someone else has already solved an equal or similar problem.
This appengine demo might help you.

Search entries in Go GAE datastore using partial string as a filter

I have a set of entries in the datastore and I would like to search/retrieve them as user types query. If I have full string it's easy:
q := datastore.NewQuery("Products").Filter("Name =", name).Limit(20)
but I have no idea how to do it with partial string, please help.
q := datastore.NewQuery("Products").Filter("Name >", name).Limit(20)
There is no like operation on app engine but instead you can use '<' and '>'
example:
'moguz' > 'moguzalp'
EDIT: GAH! I just realized that your question is Go-specific. My code below is for Python. Apologies. I'm also familiar with the Go runtime, and I can work on translating to Python to Go later on. However, if the principles described are enough to get you moving in the right direction, let me know and I wont' bother.
Such an operation is not directly supported on the AppEngine datastore, so you'll have to roll your own functionality to meet this need. Here's a quick, off-the-top-of-my-head possible solution:
class StringIndex(db.Model):
matches = db.StringListProperty()
#classmathod
def GetMatchesFor(cls, query):
found_index = cls.get_by_key_name(query[:3])
if found_index is not None:
if query in found_index.matches:
# Since we only query on the first the characters,
# we have to roll through the result set to find all
# of the strings that matach query. We keep the
# list sorted, so this is not hard.
all_matches = []
looking_at = found_index.matches.index(query)
matches_len = len(foundIndex.matches)
while start_at < matches_len and found_index.matches[looking_at].startswith(query):
all_matches.append(found_index.matches[looking_at])
looking_at += 1
return all_matches
return None
#classmethod
def AddMatch(cls, match) {
# We index off of the first 3 characters only
index_key = match[:3]
index = cls.get_or_insert(index_key, list(match))
if match not in index.matches:
# The index entity was not newly created, so
# we will have to add the match and save the entity.
index.matches.append(match).sort()
index.put()
To use this model, you would need to call the AddMatch method every time that you add an entity that would potentially be searched on. In your example, you have a Product model and users will be searching on it's Name. In your Product class, you might have a method AddNewProduct that creates a new entity and puts it into the datastore. You would add to that method StringIndex.AddMatch(new_product_name).
Then, in your request handler that gets called from your AJAXy search box, you would use StringIndex.GetMatchesFor(name) to see all of the stored products that begin with the string in name, and you return those values as JSON or whatever.
What's happening inside the code is that the first three characters of the name are used for the key_name of an entity that contains a list of strings, all of the stored names that begin with those three characters. Using three (as opposed to some other number) is absolutely arbitrary. The correct number for your system is dependent on the amount of data that you are indexing. There is a limit to the number of strings that can be stored in a StringListProperty, but you also want to balance the number of StringIndex entities that are in your datastore. A little bit of math with give you a reasonable number of characters to work with.
If the number of keywords is limited you could consider adding an indexed list property of partial search strings.
Note that you are limited to 5000 indexes per entity, and 1MB for the total entity size.
But you could also wait for Cloud SQL and Full Text Search API to be avaiable for the Go runtime.

GQL query with "like" operator [duplicate]

Simple one really. In SQL, if I want to search a text field for a couple of characters, I can do:
SELECT blah FROM blah WHERE blah LIKE '%text%'
The documentation for App Engine makes no mention of how to achieve this, but surely it's a common enough problem?
BigTable, which is the database back end for App Engine, will scale to millions of records. Due to this, App Engine will not allow you to do any query that will result in a table scan, as performance would be dreadful for a well populated table.
In other words, every query must use an index. This is why you can only do =, > and < queries. (In fact you can also do != but the API does this using a a combination of > and < queries.) This is also why the development environment monitors all the queries you do and automatically adds any missing indexes to your index.yaml file.
There is no way to index for a LIKE query so it's simply not available.
Have a watch of this Google IO session for a much better and more detailed explanation of this.
i'm facing the same problem, but i found something on google app engine pages:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2",
"abc",
u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
maybe this could do the trick ;)
Altough App Engine does not support LIKE queries, have a look at the properties ListProperty and StringListProperty. When an equality test is done on these properties, the test will actually be applied on all list members, e.g., list_property = value tests if the value appears anywhere in the list.
Sometimes this feature might be used as a workaround to the lack of LIKE queries. For instance, it makes it possible to do simple text search, as described on this post.
You need to use search service to perform full text search queries similar to SQL LIKE.
Gaelyk provides domain specific language to perform more user friendly search queries. For example following snippet will find first ten books sorted from the latest ones with title containing fern
and the genre exactly matching thriller:
def documents = search.search {
select all from books
sort desc by published, SearchApiLimits.MINIMUM_DATE_VALUE
where title =~ 'fern'
and genre = 'thriller'
limit 10
}
Like is written as Groovy's match operator =~.
It supports functions such as distance(geopoint(lat, lon), location) as well.
App engine launched a general-purpose full text search service in version 1.7.0 that supports the datastore.
Details in the announcement.
More information on how to use this: https://cloud.google.com/appengine/training/fts_intro/lesson2
Have a look at Objectify here , it is like a Datastore access API. There is a FAQ with this question specifically, here is the answer
How do I do a like query (LIKE "foo%")
You can do something like a startWith, or endWith if you reverse the order when stored and searched. You do a range query with the starting value you want, and a value just above the one you want.
String start = "foo";
... = ofy.query(MyEntity.class).filter("field >=", start).filter("field <", start + "\uFFFD");
Just follow here:
init.py#354">http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/init.py#354
It works!
class Article(search.SearchableModel):
text = db.TextProperty()
...
article = Article(text=...)
article.save()
To search the full text index, use the SearchableModel.all() method to get an
instance of SearchableModel.Query, which subclasses db.Query. Use its search()
method to provide a search query, in addition to any other filters or sort
orders, e.g.:
query = article.all().search('a search query').filter(...).order(...)
I tested this with GAE Datastore low-level Java API. Me and works perfectly
Query q = new Query(Directorio.class.getSimpleName());
Filter filterNombreGreater = new FilterPredicate("nombre", FilterOperator.GREATER_THAN_OR_EQUAL, query);
Filter filterNombreLess = new FilterPredicate("nombre", FilterOperator.LESS_THAN, query+"\uFFFD");
Filter filterNombre = CompositeFilterOperator.and(filterNombreGreater, filterNombreLess);
q.setFilter(filter);
In general, even though this is an old post, a way to produce a 'LIKE' or 'ILIKE' is to gather all results from a '>=' query, then loop results in python (or Java) for elements containing what you're looking for.
Let's say you want to filter users given a q='luigi'
users = []
qry = self.user_model.query(ndb.OR(self.user_model.name >= q.lower(),self.user_model.email >= q.lower(),self.user_model.username >= q.lower()))
for _qry in qry:
if q.lower() in _qry.name.lower() or q.lower() in _qry.email.lower() or q.lower() in _qry.username.lower():
users.append(_qry)
It is not possible to do a LIKE search on datastore app engine, how ever creating an Arraylist would do the trick if you need to search a word in a string.
#Index
public ArrayList<String> searchName;
and then to search in the index using objectify.
List<Profiles> list1 = ofy().load().type(Profiles.class).filter("searchName =",search).list();
and this will give you a list with all the items that contain the world you did on the search
If the LIKE '%text%' always compares to a word or a few (think permutations) and your data changes slowly (slowly means that it's not prohibitively expensive - both price-wise and performance-wise - to create and updates indexes) then Relation Index Entity (RIE) may be the answer.
Yes, you will have to build additional datastore entity and populate it appropriately. Yes, there are some constraints that you will have to play around (one is 5000 limit on the length of list property in GAE datastore). But the resulting searches are lightning fast.
For details see my RIE with Java and Ojbectify and RIE with Python posts.
"Like" is often uses as a poor-man's substitute for text search. For text search, it is possible to use Whoosh-AppEngine.

Google App Engine: Is it possible to do a Gql LIKE query?

Simple one really. In SQL, if I want to search a text field for a couple of characters, I can do:
SELECT blah FROM blah WHERE blah LIKE '%text%'
The documentation for App Engine makes no mention of how to achieve this, but surely it's a common enough problem?
BigTable, which is the database back end for App Engine, will scale to millions of records. Due to this, App Engine will not allow you to do any query that will result in a table scan, as performance would be dreadful for a well populated table.
In other words, every query must use an index. This is why you can only do =, > and < queries. (In fact you can also do != but the API does this using a a combination of > and < queries.) This is also why the development environment monitors all the queries you do and automatically adds any missing indexes to your index.yaml file.
There is no way to index for a LIKE query so it's simply not available.
Have a watch of this Google IO session for a much better and more detailed explanation of this.
i'm facing the same problem, but i found something on google app engine pages:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2",
"abc",
u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
maybe this could do the trick ;)
Altough App Engine does not support LIKE queries, have a look at the properties ListProperty and StringListProperty. When an equality test is done on these properties, the test will actually be applied on all list members, e.g., list_property = value tests if the value appears anywhere in the list.
Sometimes this feature might be used as a workaround to the lack of LIKE queries. For instance, it makes it possible to do simple text search, as described on this post.
You need to use search service to perform full text search queries similar to SQL LIKE.
Gaelyk provides domain specific language to perform more user friendly search queries. For example following snippet will find first ten books sorted from the latest ones with title containing fern
and the genre exactly matching thriller:
def documents = search.search {
select all from books
sort desc by published, SearchApiLimits.MINIMUM_DATE_VALUE
where title =~ 'fern'
and genre = 'thriller'
limit 10
}
Like is written as Groovy's match operator =~.
It supports functions such as distance(geopoint(lat, lon), location) as well.
App engine launched a general-purpose full text search service in version 1.7.0 that supports the datastore.
Details in the announcement.
More information on how to use this: https://cloud.google.com/appengine/training/fts_intro/lesson2
Have a look at Objectify here , it is like a Datastore access API. There is a FAQ with this question specifically, here is the answer
How do I do a like query (LIKE "foo%")
You can do something like a startWith, or endWith if you reverse the order when stored and searched. You do a range query with the starting value you want, and a value just above the one you want.
String start = "foo";
... = ofy.query(MyEntity.class).filter("field >=", start).filter("field <", start + "\uFFFD");
Just follow here:
init.py#354">http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/init.py#354
It works!
class Article(search.SearchableModel):
text = db.TextProperty()
...
article = Article(text=...)
article.save()
To search the full text index, use the SearchableModel.all() method to get an
instance of SearchableModel.Query, which subclasses db.Query. Use its search()
method to provide a search query, in addition to any other filters or sort
orders, e.g.:
query = article.all().search('a search query').filter(...).order(...)
I tested this with GAE Datastore low-level Java API. Me and works perfectly
Query q = new Query(Directorio.class.getSimpleName());
Filter filterNombreGreater = new FilterPredicate("nombre", FilterOperator.GREATER_THAN_OR_EQUAL, query);
Filter filterNombreLess = new FilterPredicate("nombre", FilterOperator.LESS_THAN, query+"\uFFFD");
Filter filterNombre = CompositeFilterOperator.and(filterNombreGreater, filterNombreLess);
q.setFilter(filter);
In general, even though this is an old post, a way to produce a 'LIKE' or 'ILIKE' is to gather all results from a '>=' query, then loop results in python (or Java) for elements containing what you're looking for.
Let's say you want to filter users given a q='luigi'
users = []
qry = self.user_model.query(ndb.OR(self.user_model.name >= q.lower(),self.user_model.email >= q.lower(),self.user_model.username >= q.lower()))
for _qry in qry:
if q.lower() in _qry.name.lower() or q.lower() in _qry.email.lower() or q.lower() in _qry.username.lower():
users.append(_qry)
It is not possible to do a LIKE search on datastore app engine, how ever creating an Arraylist would do the trick if you need to search a word in a string.
#Index
public ArrayList<String> searchName;
and then to search in the index using objectify.
List<Profiles> list1 = ofy().load().type(Profiles.class).filter("searchName =",search).list();
and this will give you a list with all the items that contain the world you did on the search
If the LIKE '%text%' always compares to a word or a few (think permutations) and your data changes slowly (slowly means that it's not prohibitively expensive - both price-wise and performance-wise - to create and updates indexes) then Relation Index Entity (RIE) may be the answer.
Yes, you will have to build additional datastore entity and populate it appropriately. Yes, there are some constraints that you will have to play around (one is 5000 limit on the length of list property in GAE datastore). But the resulting searches are lightning fast.
For details see my RIE with Java and Ojbectify and RIE with Python posts.
"Like" is often uses as a poor-man's substitute for text search. For text search, it is possible to use Whoosh-AppEngine.

Resources