I'm trying to understand whether the Search API or querying the Datastore would be the better fit for a search engine in my app.
I want something very scalable and fast. That means being able to find something among up to millions of results, very quickly. It also means that as soon as a piece of data has been saved, it must be immediately searchable.
I'm looking to build an autocomplete search system inspired by Google search (with suggested results in real time).
So, which option is more appropriate for a Google App Engine user?
Thanks
Both Search API and the Datastore are very fast, and their performance is not dependent on the number of documents that you index.
Search API was specifically designed to index text documents. This means you can search for text inside these text documents, and any other fields that you store in the index. Search API only supports full-word search.
The Datastore will not allow you to index text inside your documents. The Datastore, however, allows searching for entries that start with a specific character sequence.
Neither of these platforms will meet your requirements out of the box. Personally, I used the Datastore to build a search engine for music meta-data (artist names, album and recording titles) which supports both search inside a text string and suggestions (entries that contain words at any position which start with a specific character sequence). I had to implement my own inverted index to accomplish that. It works very well, especially in combination with Memcache.
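As a rough illustration of the inverted-index idea described above, here is a minimal in-memory sketch in plain Python. It is not the implementation from this answer: the class name `PrefixIndex` and its methods are hypothetical, and a sorted word list stands in for the Datastore's "starts with" range query.

```python
from bisect import bisect_left
from collections import defaultdict


class PrefixIndex:
    """Toy inverted index: word -> ids of the entries that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> set of entry ids
        self._words = []                   # sorted vocabulary, rebuilt lazily
        self._dirty = False

    def add(self, entry_id, text):
        """Index every whitespace-separated word of `text` under `entry_id`."""
        for word in text.lower().split():
            if word not in self._postings:
                self._dirty = True
            self._postings[word].add(entry_id)

    def suggest(self, prefix):
        """Return ids of entries with a word, at any position, starting with `prefix`."""
        if self._dirty:
            self._words = sorted(self._postings)
            self._dirty = False
        prefix = prefix.lower()
        matches = set()
        # Words sharing a prefix are contiguous in sorted order.
        start = bisect_left(self._words, prefix)
        for word in self._words[start:]:
            if not word.startswith(prefix):
                break
            matches |= self._postings[word]
        return matches
```

In a Datastore-backed version, each word would presumably be stored as an indexed property and the scan replaced by a range query on that property, with Memcache in front of frequent prefixes.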
I am currently working on a solution for searching brand names. So far we have about 10M different brands and we are using the Google Cloud Search API. We currently index the 3-grams of each brand name; given a user query, we extract its 3-grams as well and then search for documents containing all of them.
What we would like to do is find not only documents that contain all the 3-grams but also documents that contain at least one, sorted by the number of matches. Would it be possible to do that using the Google Cloud Search API? Or should I be looking into something like Elastic Search?
Best.
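To make the desired ranking concrete, independent of any search backend, here is a small sketch in plain Python. The function names are hypothetical and the brand list is kept in memory, which would not scale to 10M brands; it only shows the "at least one shared 3-gram, sorted by match count" behaviour.

```python
from collections import Counter


def trigrams(text):
    """Return the set of 3-character substrings of a lowercased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def rank_by_shared_trigrams(query, brands):
    """Rank brands that share at least one 3-gram with the query.

    Returns (brand, shared_count) pairs, highest count first.
    """
    query_grams = trigrams(query)
    scores = Counter()
    for brand in brands:
        shared = len(query_grams & trigrams(brand))
        if shared:  # keep only brands with at least one match
            scores[brand] = shared
    return scores.most_common()
```

For example, `rank_by_shared_trigrams("adidsa", ["adidas", "adobe", "didi"])` keeps "adidas" (two shared 3-grams) ahead of "didi" (one) and drops "adobe".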
For anyone in a similar situation: we ended up using Elastic Search, and it has proven to be a lot more flexible than Google Full Text Search.
Even though searching for a limited number of N-grams was not possible, Elastic allows edit distance queries, which helped us find misspellings and similar words. That was essential in our use case.
We also noticed a great improvement in search speed and especially in indexing.
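For readers unfamiliar with the edit distance mentioned above, here is a minimal sketch of the classic Levenshtein distance in plain Python. It only illustrates the concept; it is not a claim about how Elastic computes or indexes fuzzy matches internally.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A misspelling such as "adidsa" is at distance 2 from "adidas" under this definition, since a swap of two adjacent letters counts as two substitutions.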
I need help in the search API
I am Brazilian and I'm using the google translator to communicate.
My question is:
For each item persisted in the Datastore, do I create a document and an index?
And for those objects that are already persisted in the datastore, I go all the bank to create a document and an index for each, if I want to search for Search API?
I am using Java.
It's reasonable to use the Search API to search for objects that are also stored in the Datastore. You can create a Search document for each Datastore entity (so that there's a one-to-one correspondence between them). But you don't need to use a separate Search Index for each one: all the Search documents can be added to one index. Or, if you have a huge number of documents, and if there is some natural partitioning between them, you could distribute them over some modest number of indexes. Assuming you can know via some external means which (single) index to choose for searching, keeping the indexes from getting too big can help performance.
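The layout described above can be sketched in plain Python (the asker uses Java, but the idea is the same). Everything here is hypothetical: dictionaries stand in for the Datastore and for Search indexes, and the index names are invented. The same loop would also cover entities that were persisted before any documents existed.

```python
from collections import defaultdict


def partition_documents(entities, partition_key=None):
    """Create exactly one document per entity, grouped by index name.

    entities: dict of entity id -> dict of fields (Datastore stand-in).
    partition_key: optional field name used to spread documents over a
    modest number of indexes; when None, one shared index is used.
    """
    indexes = defaultdict(dict)  # index name -> {document id: fields}
    for entity_id, fields in entities.items():
        if partition_key is None:
            index_name = "items"
        else:
            index_name = "items-" + str(fields[partition_key])
        # Reusing the entity id as the document id keeps the mapping one-to-one.
        indexes[index_name][entity_id] = dict(fields)
    return dict(indexes)
```

With a partition key, a search only has to touch the one index whose name can be derived from what the caller already knows (here, the value of that field).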
I've tried to answer the question that I think you're asking. It's difficult for me to understand the English that the Google translator has produced. In particular, what does "I go all the bank ..." mean?
I'm trying to understand the concept of Documents on Google App Engine's Search API. The concept I'm having trouble with is the idea behind storing documents. So for example, say in my database I have this:
class Business(ndb.Model):
    name = ndb...
    description = ndb...
For each business, I am storing a document so I can do full-text searches on the name and description.
My questions are:
1. Is this right? Does this mean we are essentially storing each entity TWICE, in two different places, just to make it searchable?
2. If the answer to the above is yes, is there a better way to do it?
3. And again, if the answer to number 1 is yes, where do the documents get stored? In the high-rep DS?
I just want to make sure I am thinking about this concept correctly. Storing entities in docs means I have to maintain each entity in two separate places... doesn't seem very optimal just to keep it searchable.
You have it worked out already.
Full Text Search Overview
The Search API allows your application to perform Google-like searches
over structured data. You can search across several different types of
data (plain text, HTML, atom, numbers, dates, and geographic
locations). Searches return a sorted list of matching text. You can
customize the sorting and presentation of results.
Since you can't search "inside" the contents of the models in the Datastore, the Search API provides the ability to do that for text and HTML.
So to link a searchable text document (e.g. a product description) to a model in the Datastore (e.g. that product's price), you have to "manually" make the link between the documents and the Datastore objects they relate to. The Search API and the Datastore can also be used totally independently of each other, so you have to build that link yourself. AFAIK there is no automatic linkage between them.
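One common way to make that manual link is to reuse the entity's key as the document id. The sketch below shows the idea in plain Python with in-memory dictionaries standing in for the Datastore and a Search index; the class `ProductStore` and its methods are hypothetical and do not use the real App Engine APIs.

```python
class ProductStore:
    """In-memory stand-ins for the Datastore and a Search index."""

    def __init__(self):
        self.entities = {}   # entity key -> full record (Datastore stand-in)
        self.documents = {}  # document id -> searchable words (index stand-in)

    def save(self, key, name, description, price):
        """Write the entity and its search document together."""
        self.entities[key] = {
            "name": name, "description": description, "price": price,
        }
        # The document id is the entity key: this is the manual link.
        self.documents[key] = set((name + " " + description).lower().split())

    def search(self, word):
        """Full-word search, then resolve each hit back to its entity."""
        word = word.lower()
        return [self.entities[doc_id]
                for doc_id, words in sorted(self.documents.items())
                if word in words]
```

The text is indeed held in two places, so every write path has to update both, which is why `save` does the two writes in one method.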
I wonder if BigQuery is going to replace or compete with the Text Search API? It is kind of a stupid question, but the Text Search API has been in beta for a few months and has a very strict API call limit. BigQuery, on the other hand, is already available and looks very promising. Any hints on which to choose for searching over constantly incoming error logs?
Google BigQuery and the App Engine Search API fulfill the needs of different types of applications.
BigQuery is excellent for aggregate queries (think: full table scans) over fixed-schema data structures in very large tables. The aim is speed and flexibility. BigQuery lacks the concept of indexes (by design). While it can be used for "needle in a haystack" type searches, it really shines over large, structured datasets with a fixed schema. In terms of document-type searches, BigQuery records have a fixed maximum size, and so are not ideal for document search engines. So, I would use BigQuery for queries such as: In my 200 GB of log files, what are the 10 most common referral domains, and how often did I see them?
The Search API provides sorted search results over various types of document data (text, HTML, geopoint etc). Search API is really great for queries such as finding particular occurrences of documents that contain a particular string. In general, the Search API is great for document retrieval based on a query input.
I want to create a prospective search subscription with an empty query, but GAE raises an exception:
QuerySyntaxError: query:'' detail:'Query is empty.'
This is inconsistent with the Search API, which allows empty queries. Are there any workarounds? Should I file an issue?
The Prospective Search Service is intended to support applications that filter a stream of documents, that is, applications that want fewer than all documents matched. In such an application, an "empty query" would normally be considered evidence of a bug. Admittedly, empty queries might sometimes be useful for debugging purposes. However, the decision was made to design the interface's contracts with production use in mind.
As suggested by Will Brown, if you want a subscription that will match all documents, then insert some dummy field with a constant value into your documents and then create a query that matches just that field and value. Given that there is such an easy work-around available for those rare cases when "all documents" are needed, I think it unlikely that we would provide support for empty queries. It might also be interesting to note that the prohibition against empty queries is not just in the AppEngine code but also in the backend servers that AppEngine accesses to provide the Prospective Search Service.
Although the "Search API" (which really should be called the "Retrospective Search API") may support empty queries, it is important to realize that resource utilization patterns for prospective search are very, very different from those of retrospective search. For instance, you might have an application that is streaming hundreds of documents per second into both a document index (using retrospective search) and through a query index (using prospective search). In such a system, an empty retrospective query returns only a few documents each time it is submitted. A prospective query, on the other hand, would generate a real-time stream of all documents. The presence of just a few prospective queries could thus generate significant loads on your application. In general, if you want a firehose, real-time push feed of everything published, it is best to code that up explicitly.
You can file a feature request for this, but it is by design (I don't know why). If you know that incoming documents will have something in common, you can write a query for those; for example, if you add a field "alldocuments" with content "yes" to the document when you send the request, you could register a query like "alldocuments:yes" to match all documents.
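The dummy-field workaround can be illustrated with a toy matcher in plain Python. This does not call the real Prospective Search API; `tag_document` and `matches` are hypothetical helpers that only mimic the behaviour described here (empty queries rejected, `field:value` queries matched), using the "alldocuments" field and "yes" value from the example above.

```python
ALL_DOCS_FIELD = "alldocuments"
ALL_DOCS_VALUE = "yes"
MATCH_ALL_QUERY = ALL_DOCS_FIELD + ":" + ALL_DOCS_VALUE


def tag_document(document):
    """Return a copy of the document with the constant marker field added."""
    tagged = dict(document)
    tagged[ALL_DOCS_FIELD] = ALL_DOCS_VALUE
    return tagged


def matches(query, document):
    """Toy matcher for a single `field:value` query; rejects empty queries."""
    if not query.strip():
        raise ValueError("Query is empty.")
    field, _, value = query.partition(":")
    return document.get(field) == value
```

Because every tagged document carries the same field and value, a subscription registered with `MATCH_ALL_QUERY` matches all of them without ever sending an empty query.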