From what I've understood, App Engine indexes are costly both in terms of increased overall storage size and by slowing down your writes.
But do indexes only cost when they're actually used in queries and are explicitly defined in index.yaml? Or do properties such as StringProperty cost more than their non-indexed counterpart (e.g. TextProperty) simply by existing, even though they're not used in index.yaml?
You could also set indexed=False for properties which you don't want indexed:
http://code.google.com/appengine/docs/python/datastore/propertyclass.html#Property
There are built-in/default indices which would contribute overhead even if you don't explicitly define any additional indices in your index.yaml file. Thus you should use TextProperty instead of StringProperty for a field if you know you will never need to filter on that field.
Details on these implicit indices are provided by the article How Entities and Indices are Stored.
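To make the overhead of the built-in indices concrete, here is a rough, pure-Python sketch that counts built-in index rows for one entity. It assumes the layout described in that article (one ascending and one descending row per indexed value); the function name and the sample entity are made up for illustration and are not part of any App Engine API.

```python
def builtin_index_entries(entity, unindexed=()):
    """Count built-in single-property index rows for one entity.

    Assumption: each indexed value gets one ascending and one descending
    row, and a list property contributes one pair per element.
    """
    rows = 0
    for name, value in entity.items():
        if name in unindexed:
            continue  # TextProperty or indexed=False: no index rows written
        values = value if isinstance(value, list) else [value]
        rows += 2 * len(values)
    return rows


# Hypothetical entity with three properties, one of them a list.
post = {"title": "Hello", "body": "long text ...", "tags": ["a", "b"]}
all_indexed = builtin_index_entries(post)
body_unindexed = builtin_index_entries(post, unindexed={"body"})
print(all_indexed, body_unindexed)
```

Under these assumptions, leaving "body" unindexed saves two index rows on every put, even though nothing in index.yaml mentions it.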
It is well documented that fast writing into an entity kind with monotonically increasing values as key or indexed properties is a bad idea for performance.
How about indexing the entities on boolean properties or properties with enum-like values such as Genders?
My guess is indexing on a low-cardinality property will probably suffer from the same problem, because there is no built-in type for such properties. But maybe there is special treatment for boolean properties?
Cloud Datastore has optimizations in place for low-cardinality data such as booleans and enums. Each index entry also contains the entity key, which can then allow our underlying Bigtable tablets to efficiently split and hence handle larger load. This works since we don't need to consider sort order for the same value, so having them randomly distributed within their own key space makes no difference to queries, and the entity key is guaranteed to be unique so we avoid collisions.
When we index a value we also add a 'scatter key' property to the end, which is essentially a randomized integer. This scatter key can then be used for query splitting later, allowing things like Cloud Dataflow to efficiently parallelize queries against this dataset.
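As a rough illustration of why the trailing entity key matters, the sketch below models index rows as sorted tuples. The row layout (kind, property, value, entity key) and all names are assumptions made for this example, not the real Bigtable schema, and the scatter key is left out.

```python
import bisect

# Hypothetical row layout: (kind, property, value, entity_key).
rows = sorted(
    ("User", "gender", gender, key)
    for key, gender in [("k07", "f"), ("k01", "m"), ("k93", "f"),
                        ("k42", "m"), ("k15", "f")]
)


def scan_equal(rows, kind, prop, value):
    """Return entity keys whose rows share the (kind, prop, value) prefix."""
    prefix = (kind, prop, value)
    start = bisect.bisect_left(rows, prefix + ("",))
    keys = []
    for row in rows[start:]:
        if row[:3] != prefix:
            break  # left the contiguous range for this value
        keys.append(row[3])
    return keys


female_keys = scan_equal(rows, "User", "gender", "f")

# Because the unique key is part of every row, the range for one value
# can be cut at any key boundary, e.g. into two "tablets".
midpoint = len(female_keys) // 2
left_tablet, right_tablet = female_keys[:midpoint], female_keys[midpoint:]
print(female_keys, left_tablet, right_tablet)
```

An equality query only needs the whole range for the value, so it does not matter where inside that range the split happens.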
I'm having some trouble understanding how indexes work in the GAE Datastore; in particular, the limits related to indexes are really unclear to me.
From what I understood, one can create custom indexes in the datastore-indexes.xml file, and additionally the Datastore will generate some automatic indexes to match the user queries.
A first question is: does the "Number of indexes" quota limit defined on the quotas page (https://cloud.google.com/appengine/docs/quotas#Datastore) refer only to the custom indexes defined in datastore-indexes.xml, or does it also apply to automatically generated indexes?
Another concept eluding me is the "index entry for a single query".
Assume I don't have multi-dimensional properties (i.e. not lists) and I have some entities of kind "KindA". Then I define two groups of entity properties:
- Group1: properties with an arbitrary name and a boolean value
- Group2: properties with an arbitrary name and a double value
In my world any KindA entity can have at most N properties of Group1 and N properties of Group2. For any property P an index table is created and each entity having that P set will add a row in the P index table (right?). Thus initially any KindA entity will have 1 entry for each of the max. 2N properties (thus in total max 2N index entries per entity) right?
If this is correct, then it follows that I can only create an entity with a limited number of properties. However, this is strange, since I've always read that an entity can have unlimited properties (without taking the size limit into account).
However, let's assume now that my application allows users to query for KindA entities using an arbitrarily long sequence of AND filters on properties of Group1 (the boolean ones). Thus one can query something like:
find all entities in KindA where prop1=true AND prop2=true ... AND propM = true
This is a situation in which query only contains equalities and thus no custom indexes are required (https://cloud.google.com/appengine/docs/python/datastore/indexes#Index_configuration).
But what if I want to order using properties of Group2? In this case I need an index for each different query, right (different in terms of the combination of filtered property names)?
On my development server I tried without specifying any custom index and GAE generated them for me (however, every time I restart, the previously generated indexes are removed). In this case, how many index entries does a single KindA entity have in a single query index? I'd say 1, because of what the GAE docs say:
The property may also be included in additional, custom indexes declared in your index configuration file (index.yaml). Provided that an entity has no list properties, it will have at most one entry in each such custom index (for non-ancestor indexes) or one for each of the entity's ancestors (for ancestor indexes)
Thus, in theory, if N is limited I'm safe with respect to the "Maximum number of index entries for an entity" (https://cloud.google.com/appengine/docs/java/datastore/#Java_Quotas_and_limits), is that right?
But what about receiving over 200 different queries? Does that lead GAE to automatically generate over 200 custom indexes (one per distinct query)? If yes, do these automatically generated indexes count towards the index number limit (which is 200)?
If yes, then it follows that I can't let users run these (IMHO very basic) queries. Am I misunderstanding something?
First of all, I was trying to understand your question, which I find difficult to follow.
The 200-index limit only counts the indexes that you define (or that the dev appserver defines for you automatically) based on your queries. This means that the built-in indexes created for each of your indexed properties are not counted towards this limit.
You are correct about the 2N automatic index entries, one for every indexed property.
You can have any number of properties indexed in any entity as long as you don't go over the 1MB limit per entity. But this really depends on the content of the stored properties.
For the indexes created for you on your indexed properties, there is no actual limit, only an increasing cost: the writes and storage per entity put increase with each added property.
When using sort orders, you are limited to one sort order when using automatic indexes. More sort orders will require a composite index (your custom index). Thus, if you are already using an equality filter, you need a custom index anyway.
So yes, in your example the dev appserver will create a composite index for each query you execute. However, you can reduce the number of these indexes manually by deleting the ones you don't need. The query planner can find your results at query time by merging different indexes, as explained here:
https://cloud.google.com/appengine/articles/indexselection
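To give an idea of how merging indexes at query time can work, here is a minimal pure-Python sketch of a zigzag-style intersection over sorted key lists, one list per equality filter. It only illustrates the idea under simplified assumptions (keys are plain strings, each list stands for the built-in index rows of one "prop = true" filter); it is not the Datastore's actual implementation, and the names are made up.

```python
import bisect


def zigzag_merge(key_lists):
    """Intersect several sorted key lists, one per equality filter."""
    if not key_lists or any(not keys for keys in key_lists):
        return []
    result = []
    candidate = key_lists[0][0]
    while True:
        agreed = True
        for keys in key_lists:
            pos = bisect.bisect_left(keys, candidate)
            if pos == len(keys):
                return result  # one index is exhausted: no more matches
            if keys[pos] != candidate:
                candidate = keys[pos]  # jump ahead, then re-check every list
                agreed = False
                break
        if agreed:
            result.append(candidate)
            first = key_lists[0]
            nxt = bisect.bisect_right(first, candidate)
            if nxt == len(first):
                return result
            candidate = first[nxt]


# Hypothetical entity keys for which prop1, prop2 and prop3 are true.
prop1 = ["e1", "e3", "e4", "e7"]
prop2 = ["e3", "e4", "e5", "e7"]
prop3 = ["e2", "e3", "e7"]
matching = zigzag_merge([prop1, prop2, prop3])
print(matching)
```

Each list is only probed with binary searches, so no composite index covering prop1, prop2 and prop3 together is needed for the equality-only case.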
Yes, every index definition in your index.yaml will count towards the 200 limit.
I have found that you really don't need composite indexes much once you know how GAE apps can be programmed. You need to balance what users need to do against what they don't. You also need to balance doing the work in the query against querying everything and filtering in code (it really depends on the maximum number of entities you can have in that particular kind).
However, if you're trying to make complex queries available to your users, then maybe the datastore is not the right choice.
I have the following questions about GAE NDB indexes.
I assume you can specify an index via index.yaml or within the model definition using the property option indexed=True. Am I correct? If yes, is one preferred over the other?
Is there a way to add/drop an index during the life cycle of the data objects?
Can I specify an index on a structured property field?
If so, can you please let me know the syntax for this?
Thanks in advance
By default, the properties that can be indexed (i.e. those that aren't variants of Blob) are indexed, which means you can filter or sort by them on their own. Adding single-property indexes to index.yaml would be unusual. Setting indexed=False for a property will mean fewer write-operations when saving entities, but will mean filtering or sorting by the property is no longer possible. I'd suggest reading the documentation on indexes.
If you want to filter or sort (in combination) by more than one property, then you need to include them in index.yaml. However, as you run code in the development server, if it requires an index that hasn't yet been specified, then index.yaml will be modified to contain a suitable index for the query being run. Adding indexes manually isn't necessarily something you'll ever have to do.
You can't index an entire StructuredProperty; the properties of a StructuredProperty are individually indexed, and you don't need to think about them any differently than regular properties. If you want to manually specify a multi-property index that includes a sub-property, then you should be able to do so by using 'property.subproperty' (e.g. 'address.city').
1) Yes, you can set certain properties as being indexed. Some property types do not allow indexing at all. It's preferable to set the indexes programmatically within each model definition.
2) Although you can drop the index programmatically (i.e. remove indexed=True), I would not recommend it. It will leave your datastore in an inconsistent state.
3) It's not possible to set an index on a structured property; however, you can set a Key relationship between your model and the models in the structured property.
See:
https://developers.google.com/appengine/docs/python/ndb/entities
https://developers.google.com/appengine/docs/python/ndb/properties
"You can specify the usual property options for structured properties
(except indexed)."
As I discovered the hard way (see GAE python NDB projection query working in development but not in production), there's a big difference between having an index (and therefore needing an entry in index.yaml) and marking properties as indexed or not indexed. These things are there for different purposes:
Having an index allows you to search, sort, or do projection queries.
Marking properties as indexed or not tells an index which entities to include in the index and which to ignore.
Yes, absolutely, you can add or drop an index at any time:
Update your index.yaml
Then run one of the gcloud commands such as '$ gcloud datastore indexes create index.yaml' or '$ gcloud datastore indexes cleanup index.yaml'
No, you can't create an index on a structured property. See more info here https://cloud.google.com/appengine/docs/standard/python/ndb/entity-property-reference#structured
I'm a bit confused by some of the GAE documentation. While I intend to add indexes to optimize performance of my application, I wanted to get some clarification on if they are only suggested for this purpose or if they are truly required.
Queries can't find property values that aren't indexed. This includes properties that are marked as not indexed, as well as properties with values of the long text value type (Text) or the long binary value type (Blob).
A query with a filter or sort order on a property will never match an entity whose value for the property is a Text or Blob, or which was written with that property marked as not indexed. Properties with such values behave as if the property is not set with regard to query filters and sort orders.
from http://code.google.com/appengine/docs/java/datastore/queries.html#Introduction_to_Indexes
The first paragraph leads me to believe that you simply cannot sort or filter on unindexed properties. However, the second paragraph makes me think that this limitation is only confined to Text or Blob properties or properties specifically annotated as unindexed.
I'm curious about the distinction because I have some numeric and string fields that I am currently sorting/filtering against in a production environment which are unindexed. These queries are being run in a background task that mostly doesn't care about performance (I would rather optimize for size/cost in this situation). Am I somehow just lucky that these are returning the right data?
In the GAE datastore, single property indexes are automatically created for all properties that are not unindexable (explicitly marked, or of those types).
The language in that doc, I suppose, is a tad confusing.
You only need to explicitly define indexes when you want to index by more than one property (say, for sorting by two different properties.)
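As a rough rule of thumb, the decision can be sketched like this. It is a simplification based on the documented cases that need no custom index (equality-only queries, queries touching a single property, a single sort order); ancestor and key filters are ignored, and the function name and parameters are invented for this example.

```python
def needs_composite_index(equality=(), inequality=None, sort=()):
    """Rough check of whether a query needs an entry in the index config.

    equality:   property names used with '=' filters
    inequality: the single property with an inequality filter, or None
    sort:       property names in the sort order
    """
    equality, sort = set(equality), list(sort)
    if inequality is None and not sort:
        return False  # equality-only: built-in indexes can be merged
    if not equality:
        ordered = set(sort) | ({inequality} if inequality else set())
        return len(ordered) > 1  # one property: its built-in index suffices
    return True  # equality filters combined with an inequality or a sort


print(needs_composite_index(equality=["foo", "bar"]))
print(needs_composite_index(sort=["created"]))
print(needs_composite_index(equality=["foo"], sort=["created"]))
print(needs_composite_index(inequality="age", sort=["age", "name"]))
```

Treat the result as a hint only; the development server's generated index configuration is the authoritative answer for a given query.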
In GAE, unfortunately if the property is marked as unindexed
num = db.IntegerProperty(required=True, indexed=False)
Then it is impossible to include it in the custom index... This is counterproductive (Most built-in indices are never used by my code, but take lots of space). But it is how GAE currently works.
Datastore Indexes - Unindexed properties:
Note: If a property appears in an index composed of multiple properties, then setting it to unindexed will prevent it from being indexed in the composed index.
Never add a property to a model without EXPLICITLY entering either indexed=True or indexed=False. Indices take substantial resources: space, write ops costs, and latency increases when doing put()s. We never, never add a property without explicitly stating its indexed value, even if it is indexed=False. This saves costly oversights, and forces one to always think a bit about whether or not to index. (You will at some point find yourself cursing the fact that you forgot to override the default of True.) GAE engineers would do a great service by not allowing this to default to True, imho. I would simply not provide a default if I were them. HTH. -stevep
You must define an index if you want to use two or more filter functions in a single query.
e.g.:
Foobar.all().filter('foo =', foo).filter('bar =', bar)
If you query with just one filter, there is no need to define an index, since it is auto-generated.
For Blob and Text, you can't generate an index for them, even if you specify it in index.yaml, and you can't filter on them either.
e.g.
from google.appengine.ext import db

class Foobar(db.Model):
    content = db.TextProperty()

Foobar.all().filter('content =', content)
The code above will raise an error because a TextProperty can't be assigned an index and can't be matched.
One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list of properties like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries that you'd want to execute that require a composite index either couldn't be done with a list property, or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices you need means fewer indices to update (though your one index will have more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
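A minimal sketch of what such a denormalization could look like, in plain Python. The "name:value" token format, the helper names and the sample property names are all assumptions for illustration; in a real model the resulting list would be stored in a string list property next to (or instead of) the original fields.

```python
def build_flags(entity, flag_props):
    """Collapse boolean and choice properties into one list of string tokens."""
    flags = []
    for name in flag_props:
        value = entity.get(name)
        if value is None:
            continue  # unset properties contribute no token
        if isinstance(value, bool):
            value = "true" if value else "false"
        flags.append("%s:%s" % (name, value))
    return sorted(flags)


def matches(flags, wanted):
    """Mimic equality filters on a list property: every token must be present."""
    return all(token in flags for token in wanted)


# Hypothetical entity; "price" stays a normal property used for sorting.
listing = {"furnished": True, "pets": False, "type": "flat", "price": 900}
flags = build_flags(listing, ["furnished", "pets", "type"])
print(flags)
print(matches(flags, ["furnished:true", "type:flat"]))
```

With this layout, every user-selected filter becomes one more equality filter on the same list property rather than a filter on a new property name.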
This is complicated a little bit since the index on the list will contain one entry for each element in the list on each entity (rather than just one entry per entity). This will certainly impact space, and query performance. Also, be wary of creating an index which contains multiple list properties or you could run into a problem with exploding indices (multiple list properties => one index entry for each combination of values from each list).
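The growth is easy to see with a small sketch that enumerates the rows a composite index would hold for one entity, assuming one row per combination of values across the indexed properties. The function name and the sample entity are hypothetical.

```python
from itertools import product


def composite_index_rows(entity, index_props):
    """Enumerate index rows for one entity: one per combination of values."""
    value_lists = []
    for name in index_props:
        value = entity[name]
        value_lists.append(value if isinstance(value, list) else [value])
    return list(product(*value_lists))


# Hypothetical entity with two list properties and one scalar.
entity = {"flags": ["a", "b", "c"], "tags": ["x", "y"], "price": 900}
one_list = composite_index_rows(entity, ["flags", "price"])
two_lists = composite_index_rows(entity, ["flags", "tags", "price"])
print(len(one_list), len(two_lists))
```

With a single list property the row count equals the list length; with two list properties it is the product of both lengths, which is what makes such indexes "explode".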
Try experimenting and see how it works in practice for you (use AppStats!).
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.