Search API, create documents and indexes - google-app-engine

I need help in the search API
I am Brazilian and I'm using the google translator to communicate.
My question is:
For each item in the datastore persisted I create a document and an index?
And for those objects that are already persisted in the datastore, I go all the bank to create a document and an index for each, if I want to search for Search API?
I am using java.

It's reasonable to use the Search API to search for objects that are also stored in the Datastore. You can create a Search document for each Datastore entity (so that there's a one-to-one correspondence between them). But you don't need to use a separate Search Index for each one: all the Search documents can be added to one index. Or, if you have a huge number of documents, and if there is some natural partitioning between them, you could distribute them over some modest number of indexes. Assuming you can know via some external means which (single) index to choose for searching, preventing them from getting too big can help performance.
I've tried to answer the question that I think you're asking. It's difficult for me to understand the English that the Google translator has produced. In particular, what does "I go all the bank ..." mean?

Related

What's the difference between GAE Search API and Datastore queries?

I'm trying to understand which one of the search API or querying the datastore will be the most accurate for a search engine in my app.
I want something very scalable and fast. That's mean be able to find something among up to millions results and very quickly. I mean as soon as a data has been registered , this one must be immediately searchable.
I'm seeking to make an autocomplete search system inspired by the google search system (with suggested results in real time).
So, what is the more appropriate option to use for A google app engine user ?
Thanks
Both Search API and the Datastore are very fast, and their performance is not dependent on the number of documents that you index.
Search API was specifically designed to index text documents. This means you can search for text inside these text documents, and any other fields that you store in the index. Search API only supports full-word search.
The Datastore will not allow you to index text inside your documents. The Datastore, however, allows searching for entries that start with a specific character sequence.
Neither of these platforms will meet your requirements out of the box. Personally, I used the Datastore to build a search engine for music meta-data (artist names, album and recording titles) which supports both search inside a text string and suggestions (entries that contain words at any position which start with a specific character sequence). I had to implement my own inverted index to accomplish that. It works very well, especially in combination with Memcache.

Trying to understand the idea of Search Documents on the Google App Engine

I'm trying to understand the concept of Documents on Google App Engine's Search API. The concept I'm having trouble with is the idea behind storing documents. So for example, say in my database I have this:
class Business(ndb.Model):
name = ndb...
description = ndb...
For each business, I am storing a document so I can do full-text searches on the name and description.
My questions are:
Is this right? Does these mean we are essentially storing each entity TWICE, in two different places, just to make it searchable?
If the answer to above is yes, is there a better way to do it?
And again if the answer to number 1 is yes, where do the documents get stored? To the high-rep DS?
I just want to make sure I am thinking about this concept correctly. Storing entities in docs means I have to maintain each entity in two separate places... doesn't seem very optimal just to keep it searchable.
You have it worked out already.
Full Text Search Overview
The Search API allows your application to perform Google-like searches
over structured data. You can search across several different types of
data (plain text, HTML, atom, numbers, dates, and geographic
locations). Searches return a sorted list of matching text. You can
customize the sorting and presentation of results.
As you don't get to search "inside" the contents of the models in the datastore the search API provides the ability to do that for text and html.
So to link a searchable text document (e.g a product description) to a model in the datastore (e.g. that product's price) you have to "manually" make that link between the documents and the data-store objects they relate to. You can use the search api and the datastore totally independently of each other also so you have to build that in. AFAIK there is no automatic linkage between them.

Big Query vs Text Search API

I wonder if Big Query is going to replace/compete with Text Search API? It is kinda stupid question, but Text Search API is in beta for few months and has very strict API calls limit. Bug Big Query is already there and looks very promising. Any hints what to chose to search over constantly coming error logs?
Google BigQuery and the App Engine Search API fulfill the needs of different types of applications.
BigQuery is excellent for aggregate queries (think: full table scans) over fixed schema data structures in very very large tables. The aim is speed and flexibility. BigQuery lacks the concept of indexes (by design). While it can be used for "needle in a haystack" type searches, it really shines over large, structured datasets with a fixed schema. In terms of document type searches, BigQuery records have a fixed maximum size, and so are not ideal for document search engines. So, I would use BigQuery for queries such as: In my 200Gb log files, what are the 10 most common referral domains, and how often did I see them?
The Search API provides sorted search results over various types of document data (text, HTML, geopoint etc). Search API is really great for queries such as finding particular occurrences of documents that contain a particular string. In general, the Search API is great for document retrieval based on a query input.

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need have the ability to do full-text searching of a number of text fields but also filter by a number of numeric fields which record data which is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thoughts on the idea is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well-- either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.

Visual understanding of the GAE datastore

I am trying to understand how the Google App Engine (GAE) datastore is designed and how to use it. I am having a bit of a hard time to visualise the structure from the description at the getting started page.
Can somebody explain the datastore using figures for us visually oriented people? Or point to a good tutorial again with visual learning in mind?
I am specifically looking for answers with diagrams/figures that explains how GAE is used.
The 2008 IO session "Under the Covers of the Google App Engine Datastore" has a good visual overview of the datastore.
https://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore
http://snarfed.org/datastore_talk.html
For more IO talks go to:
https://developers.google.com/appengine/docs/videoresources
Very simplified I've understood that GAE can be viewed as a hashmap of hashmaps.
That said you could view it like this:
I guess there's no correct answer here, just different mind models. Depending on your programming background you may find mine enlightning, disturbing or both. I picture the datastore as a single huge distributed key-value collection of buckets that comprises all entity data of any kind in any namespace and all GAE apps of all users. A single bucket is called an entity group. It has a root key which (under the hood) consists of your appID, a namespace, a kind, an entity ID or name. In an entity group resides one ore more entities which have keys extending the root key. The entity belonging to the root key itself may or may not exist. Operations within a single entity group are atomic (transactional). An entity is a simple map-like datastructure. The 2 built-in indexes (ascending and descending) again are 2 giant sorted collections of index entries. Each index entry is a datastructure of appID,namespace,kind,property name,property type,property value,entity key - in that order.
Each (auto-)indexed value of each property of each entity creates 2 such index entries. There's another index with just entity keys in it. Custom indexes however go to yet another sorted collection with entries containing appID,namespace,index type,combined index value, entity key. That's the only part of the whole datastore that uses meta-data. It stores an index definition which tells the store how the combined index value is formed from the entity. This is the picture that's burnt into my mind and from which I know how to make the datastore happy.

Resources