Clarify the usage of AppEngine Search API - google-app-engine

I have started trying out the new Search API. The demo runs smoothly, but there are some points I am still confused about, being an outsider to the search world.
First of all is how to build a document. Obviously you can't hard-code each line into a document, so what else can I do? Say I have a User class (I'm using Java, but I guess Python makes no difference here), and I want to add the user's information to a document and be able to do a full-text search against the address field.
class User {
    String username;
    String password;
    String address;
}
In my datastore I have 10,000 instances of this entity, and if I need to build the documents, do I have to:
Step 1: Retrieve the 10,000 instances from the datastore
Step 2: Iterate through each user entity and create 10,000 documents
Step 3: Add all 10,000 docs to an index, after which I will be able to search
Please correct me if the three steps I mentioned above are wrong.
If that is the case, does it mean that later, each time a new user registers, we need to create a new document and add it to the index?

Unfortunately I haven't played around with it that much, but I learned a few things.
When first implementing it, I had to create a lot of documents as well (as you describe), but I kept running into deadline exceptions. So I ended up using the task queue to build documents for all my old records.
Remember to create a cross-reference between the search Document and your datastore entity, so you can easily update your document record, and from a search result get the matching entity.
For the cross-reference, add a new property on your datastore model called something like search_document_id where you store the doc_id (I prefixed all my doc_ids with the datastore model name), and add a text field on your Document containing the entity key as a string.
But in a nutshell, I would say you are correct.
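To make this concrete, here is a minimal sketch in Python (the Java Search API mirrors it closely). The index name 'users' and the helper index_user are illustrative assumptions, not part of the original answer:

from google.appengine.api import search
from google.appengine.ext import ndb

class User(ndb.Model):
    username = ndb.StringProperty()
    address = ndb.StringProperty()
    search_document_id = ndb.StringProperty()  # cross-reference to the search doc

def index_user(user):
    # Prefix the doc_id with the model name, as suggested above.
    doc_id = 'User_%s' % user.key.id()
    document = search.Document(
        doc_id=doc_id,
        fields=[
            search.TextField(name='address', value=user.address),
            # Store the entity key so a search hit can be resolved
            # back to the datastore entity.
            search.TextField(name='entity_key', value=user.key.urlsafe()),
        ])
    search.Index(name='users').put(document)
    user.search_document_id = doc_id
    user.put()

You would call index_user from task queue tasks when backfilling the existing 10,000 entities, and once more whenever a new user registers.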

Related

Google Search API Wildcard

I have a Python project running on Google App Engine. I have a set of data currently stored in the datastore. On the user side, I fetch it from my API and show it to the user in a Google Visualization table with client-side search. Because of the limitations, I can only fetch 1,000 records per query. I want my users to search across all the records I have. I could fetch them with multiple queries before showing them, but fetching 1,000 records already takes 5-6 seconds, so this process could exceed the 30-second timeout, and I don't think putting around 20,000 records in one table is a good idea.
So I decided to put my records into the Google Search API. I wrote a script to sync the important data between the datastore and a Search API index. When performing a search, I couldn't find anything like a wildcard character. For example, let's say I have a user field that stores a string containing the value "Ilhan". When a user searches for "Ilha", that record does not show up. I want the record containing "Ilhan" to show up even if the query is only partially typed. So basically, the SQL equivalent of my search would be something like "select * from users where user like '%ilh%'".
I wonder if there is a way to do that, or is this not how the Search API works?
I set up similar functionality purely within the datastore. I have a repeated computed property that contains all the search substrings that can be formed for a given object:
from google.appengine.ext import ndb

class User(ndb.Model):
    # ... other fields
    # all_substrings is a helper (not shown here) that returns
    # every substring of the given strings.
    search_strings = ndb.ComputedProperty(
        lambda self: [i.lower() for i in all_substrings(strings=[
            self.email,
            self.first_name,
            self.last_name,
        ])],
        repeated=True)
Your search query would then look like this:
User.query(User.search_strings == search_text.strip().lower()).fetch_page(20)
If you don't need the other features of the Google Search API, and if the number of substrings per entity won't put you at risk of hitting the 900-property limit, then I'd recommend doing this instead, as it's pretty simple and straightforward.
As for taking 5-6 seconds to fetch 1,000 records: do you need to fetch that many? Why not fetch only 100, or even 20, and use a query cursor so the user can pull the next page only if they need it?
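The all_substrings helper isn't shown in the answer; here is one hypothetical implementation, together with the cursor-based paging suggested above (the min_length cutoff is an assumption to keep the property list small):

def all_substrings(strings, min_length=3):
    # Hypothetical helper: collect every substring of each input string
    # so that partial queries like "ilh" match "Ilhan".
    result = set()
    for s in strings:
        if not s:
            continue
        s = s.lower()
        for start in range(len(s)):
            for end in range(start + min_length, len(s) + 1):
                result.add(s[start:end])
    return list(result)

# Cursor-based paging instead of fetching 1,000 records at once:
results, cursor, more = User.query(
    User.search_strings == search_text.strip().lower()).fetch_page(20)
# On the next request, resume from the cursor:
# results, cursor, more = query.fetch_page(20, start_cursor=cursor)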

Does the Search API automatically create an index corresponding to a model?

I recently created some search documents and added them to a custom search index using the following, which generates an index per user for the model type.
def _post_put_hook(self, future):
    document = self.create_search_document()
    index = self.search_index
    index.put(document)
However, I noticed that in the admin panel there is an index for my model type that seems to be automatically generated and added to. Is this correct?
(Admin panel screenshot, Full Text Search: the top indexes are the ones I am creating, while the bottom one has been made automatically.)
If so, how would I go about cleaning up the documents that get added to this index when the corresponding entities are deleted? (I clean up my own index using a delete hook.)
No, the Search API doesn't create an index automatically.
Maybe the index was created during development, i.e. at some point your code used a different name pattern for these indices.
In the Google Cloud Developer console, click on that index. You should be able to see and search the documents in it. If it is empty, or contains only old documents, that would support this theory.
If the index contains up-to-date documents, maybe some part of your code is causing this behavior and you are just not aware of it. If the code above is the only code that writes documents into search, then you should investigate whether self.search_index somehow gets redefined while handling a request.
One more note, though:
You can delete all documents in an index in the Search API, but you cannot delete the index itself (at least not the last time I checked the docs).
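For the cleanup part of the question, a sketch along these lines should work; the index name and doc_id prefix are assumptions, and the loop follows the documented pattern of deleting documents in batches via get_range:

from google.appengine.api import search

@classmethod
def _post_delete_hook(cls, key, future):
    # Remove the matching document when the entity is deleted.
    search.Index(name='MyModel').delete('MyModel_%s' % key.id())

def delete_all_documents(index_name):
    # Empty an index entirely; the (empty) index itself still
    # shows up in the console afterwards.
    index = search.Index(name=index_name)
    while True:
        ids = [doc.doc_id for doc in index.get_range(ids_only=True)]
        if not ids:
            break
        index.delete(ids)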

Updating documents with Solr

I have a community website with 20,000 members who use a search form every day to find other members. The results are sorted by the most recently connected members. I'd like to use Solr for my search (right now it's MySQL), but first I'd like to know whether it's good practice to update the document of every member who logs in, in order to change their login date and time. There will be around 20,000 document updates a day; I don't really know if that's too much updating and could hurt performance. Thank you for your help.
20k updates/day is not unreasonable at all for Solr.
OTOH, for very frequently updated fields (imagine a user who logs in several times a day, with an update each time), you can use an ExternalFileField to keep that field stored outside the index (in a text file) and still use it for sorting in Solr.
Generally, Solr should not be used for this purpose; using your database is still better.
However, if you want to use Solr, you can treat it somewhat like a database: every user document should have a unique field, id for example. When a user logs in, you can issue an update for that user document's last_login_date field by its id. You can read more about partial updates at this link.
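As a sketch of what such a partial (atomic) update looks like over Solr's HTTP API, assuming a core named members and that last_login_date is a stored field with the update log enabled:

import requests

SOLR_UPDATE_URL = 'http://localhost:8983/solr/members/update'  # assumed core name

def update_last_login(user_id, login_time_iso):
    # Atomic update: only last_login_date is rewritten; the rest of the
    # document is preserved. commitWithin batches commits rather than
    # forcing a hard commit on every login.
    payload = [{'id': user_id, 'last_login_date': {'set': login_time_iso}}]
    resp = requests.post(SOLR_UPDATE_URL, json=payload,
                         params={'commitWithin': '10000'})
    resp.raise_for_status()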

How to get a list of objects in which each object contains a field with the substring "ho"

I have an object Person with fields firstName, lastName, etc.
After fetching the list of available persons, I need to find the persons whose firstName contains the substring "ho". How can I do this?
I would have used LIKE with wildcards, but my application is hosted on Google App Engine, so I can't use LIKE in the SQL query (I tried it before and it did not work). Any suggestions on how I can do this without traversing each object in the list?
You really need to think of the datastore in a different manner than a relational database. What that essentially means is that you have to be smart about how you store your data in order to get at it. Without full-text search, you can mimic it by building a list of searchable keywords and storing them in a child entity within the same entity group. You can then construct your query to return the keys of the parent objects that match your "query string". This gives you indexing without the overhead of full-text search.
Here's a great example of this using Objectify, but you can accomplish the same thing with anything (JPA, JDO, the low-level API):
http://novyden.blogspot.com/2011/02/efficient-keyword-search-with-relation.html
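The linked example is in Java/Objectify; a rough Python (NDB) analogue of the same relation-index pattern might look like this, where keywords would be populated with whatever tokens or substrings you want matchable:

from google.appengine.ext import ndb

class Person(ndb.Model):
    firstName = ndb.StringProperty()
    lastName = ndb.StringProperty()

class PersonIndex(ndb.Model):
    # Child entity in the Person's entity group holding search keywords.
    keywords = ndb.StringProperty(repeated=True)

def find_persons(token):
    # Keys-only query on the index entities, then resolve the parents.
    keys = PersonIndex.query(
        PersonIndex.keywords == token.lower()).fetch(keys_only=True)
    return ndb.get_multi([k.parent() for k in keys])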
You can't, at least, not if you're using the BigTable-based datastore.

Data storage: "grouping" entities by property value? (like a dictionary/map?)

I'm using the AppEngine datastore, but this might be platform-agnostic; no idea.
Assume a database entity called Comment. Each Comment belongs to a User. Every Comment has a date property, pretty standard so far.
I want something that will let me specify a User and get back a dictionary-ish data structure (coming from a Python background, pardon; hash table, map, or whatever it should be called in this context) where:
keys: every date appearing in the User's comments
values: the Comments that were made on that date.
I guess I could just iterate over a range of dates and build a map like this myself, but I seriously doubt I need to "invent" my own solution here.
Is there a way/tool/technique to do this?
The datastore supports both references and list properties. This lets you build one-to-many relationships in two ways:
Parent (User) has a list property containing keys of Child entities (Comment).
Child has a key property pointing to Parent.
Since you need to limit Comments by date, you'd best go with option two. Then you can query Comments which have date=somedate (or a date range) and user=someuserkey.
There is no native grouping functionality in the datastore, so to also "group" by date, you can add a sort on date to the query. Then, as you iterate over the results, whenever the date changes you can use/store it as a new grouping key.
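A minimal sketch of that iteration, assuming a Comment model with user (KeyProperty) and date (DateProperty) fields:

from collections import defaultdict

def comments_by_date(user_key):
    grouped = defaultdict(list)
    query = Comment.query(Comment.user == user_key).order(Comment.date)
    for comment in query:
        # Each new date simply becomes a new key in the mapping.
        grouped[comment.date].append(comment)
    return grouped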
Update
Designing NoSQL databases should be access-oriented (versus data-model-oriented in SQL): for often-used operations you should be getting data out as cheaply (= with as few operations) as possible.
So, as a rule of thumb, you should, in one operation, only get the data that is needed at that moment (= shown on that page to the user). I'm not sure about your app's design, but I doubt you need all of a user's full comments (with text and everything) at one time.
I'd start by saying you shouldn't apologize for having a Python background; App Engine originally supported only Python. Using the db module, you could have a User entity as the parent of several DailyCommentBatch entities, each the parent of a few Comment entities. IIRC, this will keep all related entities stored together (or close).
If you are using NDB (I love it), you might employ a StructuredProperty at either the User or DailyCommentBatch level.
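A sketch of the StructuredProperty variant, with hypothetical field names:

from google.appengine.ext import ndb

class Comment(ndb.Model):
    date = ndb.DateProperty()
    text = ndb.TextProperty()

class User(ndb.Model):
    username = ndb.StringProperty()
    comments = ndb.StructuredProperty(Comment, repeated=True)

# Query users who have a comment on a given date:
# User.query(User.comments.date == some_date)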
