Reducing Google App Engine Built-in Index Size


I've got a Google App Engine entity with over 2 million entries, which takes up about 2GB.
According to the datastore statistics, the built-in indexes are 13GB (75 million entries), and the composite indexes are 1GB (4 million entries).
I understand that the size of my composite indexes is related to how many indexes I have defined in my index.yaml file.
However, why are my built-in indexes so much larger than the data itself, and what can I do to reduce the built-in indexes?

Most model properties are indexed by default; see:
https://developers.google.com/appengine/docs/python/ndb/properties
Pickle, JSON, LocalStructured and Blob properties are not indexed by default. Every other property gets a built-in index unless you specify indexed=False on it.
from google.appengine.ext import ndb

class User(ndb.Model):
    display_name = ndb.StringProperty(indexed=False)  # will not be indexed
    modified = ndb.DateTimeProperty(indexed=False)  # will not be indexed
Most of the time you have a lot of properties that you never query on. Note, however, that right now you can't have a composite index on an unindexed property; this is an already-reported feature request: https://code.google.com/p/googleappengine/issues/detail?id=4231
Once you have added indexed=False, removing the existing built-in index entries requires calling entity.put() again on every existing entity.
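To put the numbers from the question in perspective, here is a back-of-the-envelope calculation. It uses only the figures quoted in the question and assumes decimal gigabytes; the variable names are just for illustration.

```python
# Figures quoted in the question (decimal GB assumed).
entities = 2_000_000
entity_bytes = 2 * 10**9
builtin_index_entries = 75_000_000
builtin_index_bytes = 13 * 10**9

# How many built-in index entries exist per stored entity?
entries_per_entity = builtin_index_entries / entities

# Average sizes, to compare one index entry against one entity.
bytes_per_index_entry = builtin_index_bytes / builtin_index_entries
bytes_per_entity = entity_bytes / entities

print(entries_per_entity)            # 37.5
print(round(bytes_per_index_entry))  # 173
print(round(bytes_per_entity))       # 1000
```

Each entity carries dozens of index entries on average, so even though a single entry is much smaller than the entity itself, the total easily exceeds the size of the data. Every property switched to indexed=False removes its share of those entries.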

Related

How can you delete individual indexes on properties in an Entity?

I know you can delete composite indexes with gcloud datastore indexes cleanup but what about deleting individual indexes on properties in an Entity?
For example, let's say you have been indexing a property on an entity for a while, but then you decide not to anymore and upload a new version of your app that excludes it. I presume the index entries are still stored in a table somewhere. Is there a way to clear these out?

The index is updated for an entity when you put the entity. You could put all of your entities to clear that index for all of them.
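As a rough illustration of why a re-put is needed, here is a toy in-memory model. ToyDatastore and its put() signature are hypothetical and are not the Datastore API; the sketch only assumes what is stated above, namely that index rows for an entity are rewritten when that entity is put.

```python
class ToyDatastore:
    """In-memory stand-in: an entity table plus one built-in index table."""

    def __init__(self):
        self.entities = {}       # key -> dict of property values
        self.index_rows = set()  # (property name, value, key)

    def put(self, key, props, indexed):
        # Drop every index row previously written for this key ...
        self.index_rows = {row for row in self.index_rows if row[2] != key}
        # ... then write rows only for properties that are indexed *now*.
        for name, value in props.items():
            if name in indexed:
                self.index_rows.add((name, value, key))
        self.entities[key] = dict(props)


store = ToyDatastore()
for i in range(3):
    store.put(i, {"display_name": "user%d" % i, "age": 30 + i},
              indexed={"display_name", "age"})
rows_before = len(store.index_rows)

# The model changes so that display_name is no longer indexed.
# Nothing is rewritten yet, so the old index rows are still there.
rows_after_deploy = len(store.index_rows)

# Re-putting every entity rewrites its index rows under the new definition.
for key, props in list(store.entities.items()):
    store.put(key, props, indexed={"age"})
rows_after_reput = len(store.index_rows)

print(rows_before, rows_after_deploy, rows_after_reput)  # 6 6 3
```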

How to ensure that a document field is unique in ArangoDB Cluster

I have a 3-node ArangoDB cluster (Community edition).
I created a database with writeConcern=3 and replicationFactor=3 and a collection with shards=3 and replicationFactor=3.
I have a hash index on a field of that collection with the unique property set to true. However, I am still able to create different documents with the same field value.
I would like to know whether there are any strategies to ensure uniqueness of a collection field in the cluster.

The section Indexes On Shards in the Arango docs says the following:
Unique indexes (hash, skiplist, persistent) on sharded collections are only allowed if the fields used to determine the shard key are also included in the list of attribute paths for the index
The reason behind this is simple - it would be very expensive to ensure uniqueness of an attribute x if it is not guaranteed that all documents with identical values of x are stored on the same node.
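The reasoning above can be pictured with a small simulation. This is a toy model, not ArangoDB code: ShardedCollection, shard_for and the email field are made up for illustration, and each shard is assumed to check uniqueness only against the documents it holds itself.

```python
import zlib

NUM_SHARDS = 3


def shard_for(value):
    # Stable hash so the example behaves the same on every run.
    return zlib.crc32(str(value).encode("utf-8")) % NUM_SHARDS


class ShardedCollection:
    """Each shard enforces uniqueness only among its own documents."""

    def __init__(self, shard_key, unique_field):
        self.shard_key = shard_key
        self.unique_field = unique_field
        self.shards = [set() for _ in range(NUM_SHARDS)]

    def insert(self, doc):
        shard = self.shards[shard_for(doc[self.shard_key])]
        value = doc[self.unique_field]
        if value in shard:
            raise ValueError("unique constraint violated: %r" % (value,))
        shard.add(value)


# Pick a second key that lands on a different shard than "k0".
other_key = next(k for k in ("k%d" % i for i in range(1, 100))
                 if shard_for(k) != shard_for("k0"))

# Sharded by _key, unique on email: the duplicate goes to another shard
# and no single shard ever sees both documents.
by_key = ShardedCollection(shard_key="_key", unique_field="email")
by_key.insert({"_key": "k0", "email": "a@example.com"})
by_key.insert({"_key": other_key, "email": "a@example.com"})
copies = sum("a@example.com" in shard for shard in by_key.shards)

# Sharded by email, unique on email: equal values always meet on one shard.
by_email = ShardedCollection(shard_key="email", unique_field="email")
by_email.insert({"_key": "k0", "email": "a@example.com"})
try:
    by_email.insert({"_key": other_key, "email": "a@example.com"})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(copies, duplicate_rejected)  # 2 True
```

When the unique field is part of the shard key, a local check on one shard is sufficient. Otherwise every insert would have to consult all shards, which is the expense the documentation refers to.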

Is query by key faster than query by indexed property in Google Datastore?

Consider the below datastore entity:
public class Employee {
    @Id String id;
    @Index String userName;
}
My understanding is that only those properties which are part of the filter criteria in queries need to be annotated with @Index. Indexing in datastore is not for performance but for fetching the data.
Should id also be annotated with @Index to query by id? If not, does datastore automatically create indexes for keys?
The @Id annotation makes sure uniqueness is managed, but it has no performance advantage over indexed properties. Is that right?
Will query by id be faster than query by userName in the above example?

1:
No, you don't need to explicitly index it. Datastore uses your key as a primary key for your entities (in the Entities table).
2 & 3:
Querying by primary key is more efficient (you only require a single scan on the primary table instead of a scan on the index followed by a lookup in the primary table). However, it also allows you to do a Lookup instead of a query:
Employee e = ofy().load().type(Employee.class).id("<id>").now();
Besides avoiding the query planning and index scan needed to look up this Employee, this is Strongly Consistent. If you don't do this, you may write a new Employee but then not actually see them when you query for them.
While Strong Consistency is important from an application correctness point-of-view, it will be slower. In particular, when you do a strongly consistent lookup, Datastore may need to talk to the other replicas (in other data centers) to catch up your entity group.
If you are OK with eventual consistency, you can perform a Lookup with eventual consistency, using a read policy, to avoid both the index scan and the replica catch-up. In Objectify, this looks like:
Employee e = ofy().consistency(Consistency.EVENTUAL).load()
    .type(Employee.class).id("<id>").now();
Note: This answer talks a lot about indexes and tables. In general I recommend not thinking about Datastore in terms of indexes and tables (since it is not a relational storage system). However, it is implemented on a relational DB, so those terms are useful for answering your questions. This page has a lot of good background.
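The two access paths described in this answer can be pictured with a toy model in plain Python. It is not Datastore or Objectify code: the dict stands in for the primary table, the sorted list for the userName index, and the read counter is an illustrative assumption rather than a measurement.

```python
import bisect

# Primary "table": key -> entity.
entities = {
    "e1": {"userName": "alice"},
    "e2": {"userName": "bob"},
    "e3": {"userName": "carol"},
}

# Index on userName: sorted (value, key) pairs.
index = sorted((props["userName"], key) for key, props in entities.items())


def get_by_key(key):
    """Lookup: a single read against the primary table."""
    return entities.get(key), 1


def query_by_user_name(name):
    """Query: an index scan, then one primary-table read per match."""
    reads = 1  # the index scan itself
    results = []
    start = bisect.bisect_left(index, (name,))
    for value, key in index[start:]:
        if value != name:
            break
        results.append(entities[key])
        reads += 1
    return results, reads


entity, lookup_reads = get_by_key("e2")
matches, query_reads = query_by_user_name("bob")
print(lookup_reads, query_reads)  # 1 2
```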

1. No, it will be created automatically.
2. @Id makes sure the field is the key.
3. I can't find confirmation, but it must be faster. It's also cheaper than a query: 1 read for a get vs 2 reads for a query. See https://cloud.google.com/datastore/docs/pricing
Also, keep in mind that if you decide to add the @Index annotation later, index entries will be created only for new entities; all existing entities will remain unindexed. That means you need to reindex the db, or only new records will be returned from a query that filters by this field.

Objectify always does a get by key - if you run a query, it does a keys-only query, then fetches results by id. This works well because it has cache integration, and it also means that you get accurate results (as in the data is strongly consistent, even though the query results aren't). You can control this using the .hybrid(boolean) method on a query.
You cannot query by id - you can only get by key. If you want to do that, you need a duplicate indexed field, and to query on that. This is an artifact of how keys work in the datastore.

Why do entities need keys in Appengine datastore

What is the usage of keys in the appengine datastore: I am new to Appengine, any info on it would be great.

Comparison
To keep things simple, let's assume MySQL stores all the rows of a table in a single file. That way, it can find all the rows by scanning that file.
App Engine's datastore (BigTable) does not have a concept of tables. Each entity (~row in MySQL) is stored separately. [It can also have an individual structure (~columns).] Because entities are not connected in any way, there is no "default" method to go through all of them. Each entity needs an ID and must be indexed.
Key Structure
A key consists of:
App ID (the closest thing in MySQL is a database).
Kind (the closest thing in MySQL is a table).
ID or name (the closest thing in MySQL is a primary key).
(Optionally) Parent key (all the above of another entity). (Details omitted for the sake of simplicity.)
Please note that what is meant by the closest thing is conceptual similarity. Technically, these things are not related. In MySQL, databases and tables represent actual storage structures. In BigTable they are just IDs, and the storage is actually flat, i.e. every entity is essentially a file.
In other words, identity-wise, a key is to an entity as the database + table + primary key are to a row in a MySQL table.
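The components listed above can be sketched as a small data structure. This is an illustration only, not the App Engine Key class: the field names, the path() helper and the sample app ID "my-app" are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)  # frozen: a key cannot be modified after creation
class Key:
    app_id: str                     # ~ database
    kind: str                       # ~ table
    id_or_name: Union[int, str]     # ~ primary key (numeric ID or text name)
    parent: Optional["Key"] = None  # optional parent key

    def path(self):
        """Return the (kind, id_or_name) pairs from the root ancestor down."""
        prefix = self.parent.path() if self.parent else ()
        return prefix + ((self.kind, self.id_or_name),)


user = Key("my-app", "User", 42)
article = Key("my-app", "Article", "intro", parent=user)

print(article.path())  # (('User', 42), ('Article', 'intro'))
```

Two keys with the same app ID, kind, ID or name, and parent compare equal, which is the sense in which a key identifies an entity uniquely.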
Key's Responsibilities
An entity's key:
States what application the entity belongs to.
What kind (class, table) it is of.
By the means of the above and either a numeric key ID or a textual key name, identifies the entity uniquely.
(Optionally) What the parent entity of the entity is. (Details omitted for the sake of simplicity.)
Usage
So that you can retrieve all entities of a kind, App Engine automatically builds indexes. That means App Engine maintains a list of all your entities. More specifically, it maintains a list of your entities' keys.
Complex indexes may be defined to run queries on multiple properties (~columns).
In contrast to MySQL, every BigTable query requires an index. Whenever a query is run, the corresponding index is scanned to find the entities that meet the query's conditions, and then the individual entities are retrieved by key.
A common high-level use is to identify an entity in a URL, as every key can be represented as a URL-safe string. When an entity's key is passed in the URL, the entity can be retrieved unambiguously, as the key identifies it uniquely.
Moreover, retrieving an entity by its key is strongly consistent, as opposed to queries on indexes, which means that when an entity is retrieved by its key, it's guaranteed to be the latest version.
Tips
Every entity stored in BigTable has a key. Such a key may be programmatically created in your application and given an arbitrary key name. If it's not, a numeric ID will be allocated transparently, as the entity is being stored.
Once an entity is stored, its key may not be changed.
The optional parent component might be used to define a hierarchy of entities, but what it's really important for is transactions and strong consistency.
Entities that share a parent are said to belong to the same entity group.
Queries within a group are strongly consistent.
Just to reiterate, retrieving an entity by its key or querying an index by a parent key are strongly consistent. Retrieving entities in other ways (e.g. by a query on a property) is eventually consistent.
Glossary
Entity - a single key-value document.
Eventual consistency - retrieving an entity (often a number of them) without the guarantee that the replication has completed, which may result in some entities being an old version and some being missing, as they have not yet been brought from the server they were stored on.
Key - an entity's ID.
Kind - arbitrary textual name of a class of entities, such as User or Article.
Key ID - a numeric identifier of a key. Usually automatically allocated.
Key name - a textual identifier of a key.
Strong consistency - retrieving an entity in such a way that its latest version is retrieved.
(I intentionally used MySQL in the examples, as I'm much more familiar with it than with any other relational database.)

Please read https://developers.google.com/appengine/docs/java/datastore/#Java_Entities ... you may want to delete your question and ask again after you have studied this documentation section.
(This is meant to help you, not complain.)

Sort entities in query by reverse order of creation w/o timestamp. GAE/J

We use GAE with Java and JDO 2.3.
Is there a way to sort the entities of a JDO query in reverse order of creation?
I think that using index columns instead of a timestamp would increase performance. Is that right?

You can't rely on ids to be continuously allocated by the datastore service.
But you can either set a timestamp as part of your key name, or allocate a continuous ID range using DatastoreService.allocateIds, to ensure your keys increase monotonically.
You should then be able to sort the entities by key using KEY_RESERVED_PROPERTY.
Compared to an indexed timestamp property, you would save the additional index lookup if you are querying the full entities, but it would make no difference for keys-only queries.
Note that a descending sort order will require an additional index, as described in the App Engine JDO Query documentation.
Also beware of hot tablet issues if you have a high write throughput of entities with monotonically increasing keys or indexes.
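One way to combine the timestamp-in-key-name idea with newest-first ordering is to store a reversed timestamp, so that an ordinary ascending sort by key already yields reverse creation order. This variant is an assumption added here, not something stated above; reverse_time_key_name, MAX_MILLIS and the suffix scheme are hypothetical, and the sketch is plain Python rather than JDO code.

```python
# Largest 13-digit number; current epoch-millisecond timestamps fit below it.
MAX_MILLIS = 10**13 - 1


def reverse_time_key_name(created_millis, suffix):
    """Build a key name that sorts ascending from newest to oldest.

    The reversed timestamp is zero-padded to a fixed width so that string
    order matches numeric order; the suffix keeps names unique when two
    entities are created in the same millisecond.
    """
    reversed_part = MAX_MILLIS - created_millis
    return "%013d-%s" % (reversed_part, suffix)


# Three entities, listed in creation order (oldest first).
creation_order = [
    (1_700_000_000_000, "a"),
    (1_700_000_000_500, "b"),
    (1_700_000_001_000, "c"),
]
key_names = [reverse_time_key_name(ms, s) for ms, s in creation_order]

# An ascending sort on the key name now returns the newest entity first.
newest_first = [name.rsplit("-", 1)[1] for name in sorted(key_names)]
print(newest_first)  # ['c', 'b', 'a']
```

Such key names still change monotonically over time, so the hot tablet caveat continues to apply under high write throughput.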

If you're doing a query and sorting on a single timestamp column, App Engine will create an index for you and the query will be very fast.
