Complex Google App Engine Search - google-app-engine

A couple quick questions related to GAE search and datastore:
(1) Why can I apply inequality filters to more than one property using the Search service, but to at most one property when querying the Datastore? It seems odd that this limitation would exist in one service but not the other.
(2) I intend to use google app engine search to query very many objects (thousands or hundreds of thousands, maybe more). I plan to be doing many inequalities, for example: "time created" before x, "price" greater than y, "rating" less than z, "latitude" between a and b, "longitude" between c and d etc. This seems like a lot of filters and potentially expensive. Is App Engine Search an appropriate solution for this?
Thanks so much.

1) The SearchService basically gives you an API for the sorts of things you can't do with the Datastore. If the Datastore could do them, you wouldn't really need the SearchService. That's not a very satisfying answer, but many of the common operations you'd take for granted in a traditional RDBMS simply weren't possible before the Search API was available.
2) is a bit harder. Currently the Search API doesn't handle failure conditions very well; usually you'll get a SearchServiceException without a meaningful message. The team seems to have been improving this over the last year or so, although fixes in this space have been arriving very slowly.
From the tickets I've raised, failures are usually the result of queries running too long, which in practice means queries that are too complex. You can actually tune queries quite a lot through combinations of the query string and the parameters you apply to your search request. The downside is that it's all a total black box; I haven't seen any guides or tools on optimising queries. When they fail, they just fail.
The AppEngine search api is designed to solve the problems you describe, whether in your case it does may be hard to determine. You could set up some sample queries and deploy to a test environment to see if it even basically works for your typical set of data. I would expect that it will work fine for the example you gave. I have successfully been running similar searches in large scale production environments.
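For illustration, a minimal sketch of the kind of multi-inequality query the question describes, using the Python Search API. The index name and field names are assumptions, with the numeric values stored as NumberField and the date as DateField:

    from google.appengine.api import search

    def find_listings():
        index = search.Index(name='listings')
        # Several inequality comparisons in one query string; a single
        # Datastore query would allow at most one inequality property.
        query_string = ('price > 25 AND rating < 3 AND '
                        'latitude > 40.5 AND latitude < 40.9 AND '
                        'created < 2014-01-01')
        query = search.Query(query_string=query_string,
                             options=search.QueryOptions(limit=20))
        return index.search(query)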

Related

How to implement properly multiple inequality filters on Google App Engine Datastore?

Given App Engine's restriction on inequality filters, there are some suggestions to implement something like what advanced searches do (narrowing results by range limits on many properties) by filtering the properties manually in RAM:
how to effectively run two inequality filters on queries in app engine
So, is it feasible to do this amount of sorting and filtering in RAM for large datasets? Is there any Java sample code that demonstrates a proper implementation? Is it a good idea to stick with a traditional RDBMS in order to avoid this drawback?
As Andrei has mentioned, there isn't a general solution to your problem of needing multiple inequality filter conditions. It really depends on your data, queries and application requirements.
Here are some possible solutions you could use:
Perform some filtering in the application (sketched below). If you have two inequality conditions, A and B, and know that the majority (e.g. > 80%) of the entities that meet condition A will also meet condition B, then you could query the Datastore with condition A only, and filter the returned results by condition B in your application code. This lets you continue to use the Datastore, and the efficiency hit shouldn't be too bad, since you know > 80% will match.
However, extending this solution to more inequalities, or cases where the overlap between condition A and condition B is not great, will result in very inefficient data retrieval.
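A minimal sketch of this first approach, against a hypothetical ndb model with price and rating properties. The question above mentions Java; Python is used here for brevity, and the idea translates directly:

    from google.appengine.ext import ndb

    class Listing(ndb.Model):
        price = ndb.FloatProperty()
        rating = ndb.FloatProperty()

    def cheap_and_low_rated(max_price, max_rating):
        # Condition A (price) is the one the Datastore handles...
        candidates = Listing.query(Listing.price < max_price).fetch(1000)
        # ...condition B (rating) is applied in application code.
        return [c for c in candidates if c.rating < max_rating]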
Secondary Search Index. It's possible that if you have very complicated filtering / sorting logic, you have something more akin to a search problem, for which Google App Engine Search might be more suitable. Search allows you to run very flexible queries over documents in a search index, including multiple inequality queries.
I will point out that search only offers eventual consistency, and indexes are limited to 10GB (but can be extended to 200GB on request).
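As a rough sketch of the secondary-index idea, here is how an entity's filterable fields might be mirrored into a Search API document at write time. The entity and field names are assumptions for illustration:

    from google.appengine.api import search

    def index_listing(listing):
        # doc_id carries the urlsafe Datastore key, so a search hit can
        # be resolved back to the entity with ndb.Key(urlsafe=...).get().
        doc = search.Document(
            doc_id=listing.key.urlsafe(),
            fields=[
                search.TextField(name='name', value=listing.name),
                search.NumberField(name='price', value=listing.price),
                search.NumberField(name='rating', value=listing.rating),
            ])
        search.Index(name='listings').put(doc)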

Using Neo4j and Lucene in a distributed system

I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated, we have attached a great deal of meta-data to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of meta-data, we believe a graph database is in order.
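For the metadata case, a hypothetical sketch of the kind of query we have in mind, expressed as Cypher sent through the official neo4j Python driver. The Email/Tag labels and TAGGED relationship are invented for illustration:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'password'))

    def late_night_emails():
        with driver.session() as session:
            result = session.run(
                'MATCH (e:Email)-[:TAGGED]->(:Tag {name: "drunk-dialing"}) '
                'RETURN e.subject AS subject')
            return [record['subject'] for record in result]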
In a perfect world, I can use one platform to perform both queries. I appreciate that Neo4j (like OrientDB, Arango, etc.) is not designed as a full-text search database, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed search and connection-search functionality. We first went with Neo4j and were very impressed by the Cypher query language; we could express any request with it. However, once you throw billions of nodes at it you start to pay the price, and we began considering another graph DB. This time we did a lot of research and testing, and OrientDB was clearly the winner. OrientDB is highly scalable, but the catch is that you have to code your "search algorithm" yourself if you want to do advanced things (e.g. what do these two nodes have in common?). Otherwise you have its SQL-like query language (I don't know/remember whether it has a name), and you can do some interesting stuff with it.
So, in conclusion, I would definitely go with OrientDB.
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think Solr and Neo4j are not comparable, because they're trying to do completely different things. This site isn't about software recommendations, so I can't tell you what you should use, other than to say you should read up on Solr and Neo4j and figure out which set of functionality you want. As far as I know, this is an either/or decision, as I'm not aware of people integrating Solr with Neo4j.
Your question is very difficult to answer; I'd recommend expanding on what you are trying to do and what you have tried, and you'll probably get better responses.

Google App Engine ndb dynamic indexes alternative

Background
I'm creating an application that allows users to define their own datasets with custom properties (and property types).
A user could, while interacting with the application, define a dataset that has the following columns:
Name: String
Location: Geo
Weight: Float
Notes: Text (not indexed)
How many: Int
etc...
While there will be restrictions on the total number of properties (say 10-20 or something), there are no restrictions on the property types.
Google's ndb Datastore allows this to happen and will auto-generate simple (built-in) indexes, which support queries that combine equality filters with no sort orders, or that only sort.
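For context, a minimal sketch of how such user-defined properties might be stored, assuming ndb's Expando model; the class and field names are illustrative:

    from google.appengine.ext import ndb

    class Row(ndb.Expando):
        """One record of a user-defined dataset; properties are set at runtime."""

    def save_row(user_fields):
        row = Row()
        for name, value in user_fields.items():
            setattr(row, name, value)  # each becomes an indexed dynamic property
        return row.put()

    # e.g. save_row({'name': 'Crate', 'weight': 12.5, 'how_many': 3})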
Ideal
Multiple sorts
Equality and sorts
Combinations of inequalities
I'm trying to determine if I should use NDB at all, or switch to something else (SQL seems extremely expensive, comparatively, which is one of the reasons I'm hesitant).
For the multiple sorts, I could write server-side code that queries for the first, then sorts in memory by the second, third, etc. I could also query for the data and do the sorting on the client side.
For the combinations of inequalities, I could do the same (more or less).
These solutions are obviously not performant, and won't scale if there are a large number of items that match the first query.
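As a rough illustration of that workaround, here is a hedged Python sketch against a hypothetical Expando model; ordering dynamic properties through ndb.GenericProperty is the part I'd verify first:

    from google.appengine.ext import ndb

    class Row(ndb.Expando):
        pass

    def multi_sort(max_weight):
        # The Datastore handles the single inequality plus the matching
        # first sort order...
        rows = (Row.query(ndb.GenericProperty('weight') < max_weight)
                   .order(ndb.GenericProperty('weight'))
                   .fetch(1000))
        # ...then the remaining sort keys are applied in memory.
        rows.sort(key=lambda r: (r.weight, r.how_many))
        return rows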
BaaS providers like Kinvey (which runs on GAE unless I'm quite mistaken) use schemaless databases and allow you to both create them on the fly and make compound, complicated queries over the data.
Sanity Check:
Trying to force NDB into what I want seems like a bad idea, unless there's something I'm overlooking (possible) that would make this more doable. My solutions would work, but wouldn't scale well (though I'm not sure how far they'd stretch: would they work for 10k objects? 100k? 1M?).
Options I've investigated:
Kinvey, which charges by user and by data stored (since they just changed their pricing model), and ends up costing quite a bit.
Stackmob is also nice, but cloud code is crazy expensive ($200/month), and hosting and such all just costs more. Tasks cost more. The price looks very high.
Question:
I've done a fair bit of investigating, but there are just so many options. Assuming that my sanity check is correct (if it's not, and doing in-memory operations is sort-of-scalable, then fantastic!), are there other options out there that are inexpensive (BaaS providers get quite expensive once applications scale), fast, easily scalable, and that would solve my problem? Being able to run custom code in the cloud easily (and cheaply), with API calls and bandwidth costing next to nothing, is one of the reasons I've been investigating GAE (full hosting provider, any code I want in the cloud, etc.).

How can I relate search indexes to models in MVC?

I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present there's just a quick interim solution in place, which is fine for flexibility, but speed was always going to be a problem. Modules register the models (plus relationships and columns) they'd like to be searchable. Upon search, the search functionality queries data using those relationships, applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep: it is effectively a select * from x, y, z followed by mining through the data.
The benefit of the above is that there is a direct relation to the model that found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to link the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit is that it's already database-specific, and can therefore just be thrown onto a virtualhost and it works.
I have read about the various options, and they all seem very similar so it's unlikely that people are going to be able to suggest the 'right' one without inciting discussion or debate, but for the record; from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone so if anyone has any advice they'd like to share or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search, and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted, I haven't done as much reading on this as I typically would, because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know, before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information, e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
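For illustration, a hedged sketch of that round trip using the pysolr client. Python is used for brevity, and the core name, fields, and URL are invented; the same pattern applies from any Solr client:

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/products', timeout=10)

    # Index time: store the owning model and primary key with the text.
    solr.add([{
        'id': 'product-42',
        'model': 'Model_Product',
        'pk': '42',
        'title': 'Bananas, bunch of six',
    }])

    # Search time: each hit carries enough to route back into the MVC
    # layer, e.g. Model_Product::find(42) and then Model_Product::url().
    for hit in solr.search('title:bananas'):
        print(hit['model'], hit['pk'])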
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?
Lucene is a low level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server. You can call it from as many clients as you want. More scaling options explained here.

App Engine Full Text Search vs Geohashing for location queries

I'm thinking of porting an application from RoR to Python App Engine that is heavily geo-search-centric. I've been using one of the open source GeoModel (i.e. geohashing) libraries to allow the application to handle queries that answer questions like "what restaurants are near this point (lat/lng pair)" and things of that nature.
GeoModel uses a ListProperty, which creates a heavy index; that has me concerned about pricing, as I have about 10 million entities that would need to be loaded into production.
This article that I found this morning seems pretty scary in terms of costs:
https://groups.google.com/forum/?fromgroups#!topic/google-appengine/-FqljlTruK4
So my question is: is geohashing a moot concept now that Google has released its full text search with support for geo searching? It's not clear what's going on behind the scenes with this new API, though, and I'm concerned the index sizes might be just as big as with the GeoModel approach.
The other problem with the Search API is that it appears I'd have to create not only my models in the datastore, but also replicate some of that data (a GeoPtProperty and the entity_key of the model it represents, at a minimum) into Documents, which greatly increases my data set.
Any thoughts on this? At the moment I'm contemplating scrapping this port as being too expensive, although I've really enjoyed working in the App Engine environment so far and would love to get away from EC2 for some of my applications.
You're asking many questions here:
is geohashing a moot concept: Probably not. I suspect the Search API uses geohashing, or something similar, for its location search.
can you use the Search API vs implementing it yourself: Yes, but I don't know the cost one way or the other.
is geohashing expensive on app engine: In the message thread, the cost is bad due to high index write costs. You'll have to engineer your geohashing data to minimize the indexing. If GeoModel puts a lot of indexed values in the list, you may be in trouble; I wouldn't use it directly without knowing how its indexing works. My guess is that if you reduce the location accuracy you can reduce the number of indexed entries, and that could save you a lot of cost.
As mentioned in the thread, you could have the geohashing run in CloudSQL.
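To make the first point concrete, here is a minimal sketch of a "restaurants near this point" query through the Search API, assuming documents that carry a GeoField named location; index and field names are assumptions:

    from google.appengine.api import search

    def restaurants_near(lat, lng, radius_m=1000):
        index = search.Index(name='restaurants')
        # distance() over a GeoField replaces a hand-rolled geohash
        # ListProperty, and the index is maintained by the service.
        query = 'distance(location, geopoint(%f, %f)) < %d' % (lat, lng, radius_m)
        return index.search(query)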
