How does appengine implement query on a list efficiently? - google-app-engine

From the appengine blog:
Advanced Query Planning - We are removing the need for exploding indexes and reducing the custom index requirements for many queries. The SDK will suggest better indexes in several cases and an upcoming article will describe what further optimizations are possible.
As a test, I have an entity in App Engine that has a list property:
from google.appengine.ext import db

class Entity(db.Model):
    tags = db.StringListProperty()
I have 500,000 entities; half of them have tags = ['1'] and the other half have tags = ['2'].
My query is:
SELECT * FROM Entity WHERE tags = '1' AND tags = '2'
It returns no results really quickly. What plan is it using to achieve this? How is the list indexed to achieve this? In the old days, an exploding index would have been needed.

The algorithm used internally ('merge-join') was described in the Google I/O 2009 tech talk Building Scalable, Complex Apps on App Engine. This functionality has also been available since the launch of GAE; the 'exploding indexes' only happen if you create a compound index of multiple StringListProperties.
It's worth noting that this functionality is actually a bit more general than you may realize: any number of equality filters on any combination of properties can be satisfied without compound indexes, provided they are all equality filters and you don't have a sort order. The filters don't all have to be on a StringListProperty, and they can even be split across multiple StringListProperty fields.
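The merge join described above can be sketched roughly as follows. This is a minimal illustration and not the datastore's actual implementation: it assumes each equality filter yields a list of entity keys already sorted in ascending order with no duplicates, and the function name zigzag_merge_join and the integer keys are made up for the example.

```python
from bisect import bisect_left


def zigzag_merge_join(*sorted_key_lists):
    """Intersect several sorted, duplicate-free key lists by leapfrogging."""
    if not sorted_key_lists or any(not keys for keys in sorted_key_lists):
        return []
    positions = [0] * len(sorted_key_lists)
    results = []
    candidate = sorted_key_lists[0][0]
    while True:
        agreed = True
        for i, keys in enumerate(sorted_key_lists):
            # Seek forward to the first key >= candidate in this scan.
            positions[i] = bisect_left(keys, candidate, positions[i])
            if positions[i] == len(keys):
                return results  # one scan is exhausted: no further matches
            if keys[positions[i]] != candidate:
                # This scan skipped past the candidate: adopt its key and retry.
                candidate = keys[positions[i]]
                agreed = False
                break
        if agreed:
            results.append(candidate)
            # Move the first scan past the match to find the next candidate.
            positions[0] += 1
            if positions[0] == len(sorted_key_lists[0]):
                return results
            candidate = sorted_key_lists[0][positions[0]]


# Hypothetical data: keys 0-4 carry tag '1', keys 5-9 carry tag '2'.
print(zigzag_merge_join([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]))  # -> []
print(zigzag_merge_join([1, 3, 5, 7], [2, 3, 6, 7, 9]))     # -> [3, 7]
```

In the first call the scans run out after two seeks, so under these assumptions an empty result can come back without reading most of either list. How quickly it finishes in general depends on how the matching keys interleave.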

Related

sql | slow queries | avoid many joins

I am currently working with java spring and postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, both because of the number of joins that must be performed and because the table contains many rows.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres for storing a list of items in flattened form?
I know there is a JSON datatype. Could I use it to hold the information needed for the search and query it with jsonPath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like graph based. At this point the only problem I have is with this specific table, the rest of the problem is simple queries that adapt very well to relational bases.
Is there any scaling guideline, based on ratios and number of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
There are third-party tools specifically designed for this use case, including search tools (like Elasticsearch, Solr, etc.) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
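As a rough sketch of the trigger-maintained search table from the question, the example below uses Python's built-in sqlite3 module as a stand-in for Postgres, only so that it is self-contained and runnable. The tables item, category and item_category, the trigger name, and the sample rows are all hypothetical; info_search plays the role of the infoSearch table. Only inserts are handled here; updates and deletes would need their own triggers.

```python
import sqlite3


def build_demo_db():
    """Create normalized tables plus a flat search table kept current by a trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE category (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE item_category (item_id INTEGER, category_id INTEGER);

        -- Denormalized copy: one row per (item, category label).
        CREATE TABLE info_search (item_id INTEGER, name TEXT, category_label TEXT);
        CREATE INDEX idx_info_search_label ON info_search (category_label);

        -- Duplicate the joined data whenever a link row is inserted.
        CREATE TRIGGER item_category_ai AFTER INSERT ON item_category
        BEGIN
            INSERT INTO info_search (item_id, name, category_label)
            SELECT i.id, i.name, c.label
            FROM item i, category c
            WHERE i.id = NEW.item_id AND c.id = NEW.category_id;
        END;
    """)
    conn.executemany("INSERT INTO item VALUES (?, ?)",
                     [(1, "lamp"), (2, "desk")])
    conn.executemany("INSERT INTO category VALUES (?, ?)",
                     [(10, "lighting"), (20, "office")])
    conn.executemany("INSERT INTO item_category VALUES (?, ?)",
                     [(1, 10), (1, 20), (2, 20)])
    conn.commit()
    return conn


def search_by_label(conn, label):
    """Search the flat table only: no joins at query time."""
    rows = conn.execute(
        "SELECT name FROM info_search WHERE category_label = ? ORDER BY name",
        (label,),
    )
    return [name for (name,) in rows]


conn = build_demo_db()
print(search_by_label(conn, "office"))
```

The cost is paid at write time: every insert into item_category now also writes a row into info_search, which is the time versus space tradeoff described above.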

Is there any text or known algorithms or strategies for Database sharding?

I am building a scalable solution, and hence need to shard my data.
I know the specific usage map of my present shards, and based on it I want to break them up and create new shards. [Higher-usage key ranges get broken down into smaller parts and distributed to different machines to equalize load across nodes.]
Is there any theory/text/algorithm that gives the most efficient sharding strategy (sharding without breaking the key sequence/index) if it's known which key ranges are used the most?
It is best to match the sharding algorithm or strategy to the business scenario.
There are some regular algorithms, such as: Hash, Range, Mod, Tag, HashMod, Time, etc.
You may also need customized algorithms, for example: use user_id mod for database sharding, and order_id mod for table sharding.
You could have a look at Apache ShardingSphere; this project defines some standard sharding algorithms and lets developers plug in their own.
The documentation related is: https://shardingsphere.apache.org/document/current/en/dev-manual/sharding/
The source code FYI: https://github.com/apache/shardingsphere/blob/master/shardingsphere-features/shardingsphere-sharding/shardingsphere-sharding-core/src/main/resources/META-INF/services/org.apache.shardingsphere.sharding.spi.ShardingAlgorithm
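For the range-based case in the question, where a usage map is known and the key order must be preserved, one simple option is a greedy cut at equal-load marks. The sketch below is only an illustration under assumptions: the usage map is given as an ordered list of (key, hits) pairs, and split_ranges_by_load and the sample numbers are hypothetical. It is not taken from ShardingSphere and makes no claim to being optimal.

```python
def split_ranges_by_load(usage, shard_count):
    """Cut an ordered list of (key, hits) into contiguous shards of similar load.

    Returns a list of (first_key, last_key, load) tuples in key order.
    """
    total = sum(hits for _, hits in usage)
    shards = []
    first_key = None
    last_key = None
    shard_load = 0
    cumulative = 0
    for key, hits in usage:
        if first_key is None:
            first_key = key
        last_key = key
        shard_load += hits
        cumulative += hits
        # Close the shard once the running total reaches the next equal-load
        # mark, but always keep the final shard open to absorb the remainder.
        next_mark = total * (len(shards) + 1) / shard_count
        if cumulative >= next_mark and len(shards) < shard_count - 1:
            shards.append((first_key, last_key, shard_load))
            first_key, shard_load = None, 0
    if first_key is not None:
        shards.append((first_key, last_key, shard_load))
    return shards


# Hypothetical usage map: key "c" is much hotter than the others.
usage = [("a", 10), ("b", 10), ("c", 80), ("d", 50), ("e", 50)]
print(split_ranges_by_load(usage, 2))
```

Because shards stay contiguous, hot regions end up in narrower key ranges and cold regions in wider ones. A single key that is hotter than the per-shard target cannot be split this way and would need a different scheme, such as hashing within that key.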

App Engine NDB query with multiple inequalities?

The only two answers on here involve essentially restructuring the database to accommodate this limitation, but I am unsure how to do that in my case.
I have a list of thousands of contacts, each with many many properties. I'm making a page that has an ability to filter on multiple properties at once.
For example: Age < 15, Date Added > 15 days ago, Location == Santa Cruz, etc. Potentially a ton of inequality filters required. How does one achieve this in GAE?
According to the docs (for python),
Limitations: The Datastore enforces some restrictions on queries.
Violating these will cause it to raise exceptions. For example,
combining too many filters, using inequalities for multiple
properties, or combining an inequality with a sort order on a
different property are all currently disallowed. Also filters
referencing multiple properties sometimes require secondary indexes to
be configured.
If you check back in a few months, this may change. GAE keeps changing pretty quickly.
For now, though, you'll have to make multiple queries and combine them in your code.
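Combining multiple queries in code could look roughly like the sketch below. It uses an in-memory dictionary instead of the datastore so that it runs on its own: CONTACTS, query_keys and filter_contacts are hypothetical names, and each query_keys call stands in for one keys-only query that uses at most one inequality property.

```python
from datetime import date, timedelta

# Hypothetical contacts, keyed by entity key.
CONTACTS = {
    "k1": {"age": 12, "added": date(2024, 1, 20), "location": "Santa Cruz"},
    "k2": {"age": 30, "added": date(2024, 1, 25), "location": "Santa Cruz"},
    "k3": {"age": 10, "added": date(2023, 12, 1), "location": "Santa Cruz"},
    "k4": {"age": 9, "added": date(2024, 1, 28), "location": "Oakland"},
}


def query_keys(predicate):
    """Stand-in for one keys-only query with a single filter."""
    return {key for key, contact in CONTACTS.items() if predicate(contact)}


def filter_contacts(today):
    """Run one query per filter, then intersect the key sets in code."""
    cutoff = today - timedelta(days=15)
    young = query_keys(lambda c: c["age"] < 15)
    recent = query_keys(lambda c: c["added"] > cutoff)
    local = query_keys(lambda c: c["location"] == "Santa Cruz")
    return sorted(young & recent & local)


print(filter_contacts(date(2024, 2, 1)))  # -> ['k1']
```

Since equality filters can share a query with one inequality, the location filter could be folded into either of the other two queries rather than run separately. With thousands of contacts, the intersected key sets should stay small enough to handle in memory.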

Lucene - few or a lot of indexes

Is it better to use
a lot of indexes in Lucene (e.g. one for every user, if your application allows that)
or just one, having every document in it
... if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elasticsearch, based on your information, I think I would use one index. My understanding is that users only search their own documents, and the documents seem to be relatively similar.
Performance - When searching you can use a Filtered Query to restrict results to the documents belonging to the user. The user id filter is cacheable and fast.
Scalability - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes; I just think configuring appropriate shards and replicas for one whole index is more valuable.
In a single index, you can still easily wipe away data (see delete by query), and there should be little concern about seeing other users' data unless you write your queries wrong. A filtered query that restricts results to those associated with a user id is very simple, and similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better. Based on what I have so far, though, I would choose one index.
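A request body for the user-scoped search might look roughly like the sketch below, built as a plain Python dictionary. It uses a bool query with a filter clause, which is the form that replaced the filtered query in later Elasticsearch versions. The field names user_id and body and the helper user_scoped_query are assumptions for illustration, not part of any real mapping.

```python
import json


def user_scoped_query(user_id, text):
    """Build a search body that scores on text but only within one user's docs."""
    return {
        "query": {
            "bool": {
                # Scored full-text part of the search (hypothetical field).
                "must": {"match": {"body": text}},
                # Non-scoring, cacheable restriction to one user's documents.
                "filter": {"term": {"user_id": user_id}},
            }
        }
    }


print(json.dumps(user_scoped_query("u42", "invoice"), indent=2))
```

The same body works against a single shared index for every user; only the user_id value changes between requests.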

Tagging schema for AppEngine

Hey,
I'm using AppEngine for an application that I'm writing, and I need to assign tags to each object. I wanted to know the best way of doing this.
Should I create a space-separated string of tags and then query something like %search_tag% (I'm not sure if you can do that in JDOQL)?
What other options do I have?
Should I create another class which will map every object to a tag?
Which would be the best from the point of view of scalability, performance and ease of use?
Thanks
First, '%search_tag%' type 'LIKE' queries do not work on App Engine's datastore. The best you can do is a prefix search.
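A prefix search is usually expressed as a range: everything greater than or equal to the prefix and less than the prefix followed by a very high character. The sketch below imitates that range scan over a sorted Python list rather than the datastore itself; prefix_search and the sample values are hypothetical, and u"\ufffd" is used as the upper-bound character on the assumption that the stored strings contain no higher code points.

```python
from bisect import bisect_left


def prefix_search(sorted_values, prefix):
    """Return the values in a sorted list that start with the given prefix."""
    # Lower bound: first value >= prefix.
    lo = bisect_left(sorted_values, prefix)
    # Upper bound: first value >= prefix followed by a very high character.
    hi = bisect_left(sorted_values, prefix + u"\ufffd")
    return sorted_values[lo:hi]


values = sorted(["apple", "applet", "apply", "banana", "app"])
print(prefix_search(values, "appl"))  # -> ['apple', 'applet', 'apply']
```

This only matches from the start of the string, which is why a space-separated tag string cannot be searched for a tag in the middle.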
It is difficult to answer very general questions like this. The best solution will depend on several factors: How many tags do you expect per entity? Is there a limit to the number of tags? How will you use the tags? For searching? For display only? The answers to all these questions affect how you should design your models.
One general solution for tagging is to use a multi-valued property, such as a list of tags.
http://code.google.com/appengine/docs/java/datastore/dataclasses.html#Collections
Be aware that if you have many tags on your entities, it will add overhead at write time, since the index writes take time too. Also, you should try to avoid using multi-valued properties multiple times (or multiple multi-valued properties) in queries with inequalities or sort orders. That can lead to 'exploding indexes,' since one index row gets written for every combination of the indexed fields.
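The growth in index rows can be illustrated with a small count. This is only arithmetic on hypothetical data, assuming a compound index over two multi-valued properties gets one row per combination of their values, as described above; index_rows, tags and labels are made-up names.

```python
from itertools import product


def index_rows(*multi_valued_properties):
    """List the value combinations a compound index would need for one entity."""
    return list(product(*multi_valued_properties))


tags = ["red", "green", "blue"]
labels = ["small", "large"]

rows = index_rows(tags, labels)
print(len(rows))  # -> 6, i.e. 3 tags x 2 labels for a single entity
```

With 20 values in each of two properties, the same entity would need 400 rows, which is where the write overhead comes from.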
