I need to do some continuous aggregation on a data set. I am using App Engine's High Replication Datastore.
Let's say we have a simple object with a property that holds a string of the date when it was created. There are other fields associated with the object, but they're not important in this example.
Let's say I create and store some objects. Below is the date associated with each object. Each object is stored in the order below. These objects will be created in separate transactions.
Obj1: 2012-11-11
Obj2: 2012-11-11
Obj3: 2012-11-12
Obj4: 2012-11-13
Obj5: 2012-11-14
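For concreteness, a minimal sketch of such an object, assuming the Python ndb API (the kind and property names are illustrative):

from google.appengine.ext import ndb

class Obj(ndb.Model):
    # The creation date, stored as a string as described above;
    # the other fields are omitted.
    date = ndb.StringProperty()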
The idea is to use a cursor to continually check for newly indexed objects, then perform aggregation on those new entities.
Here are the questions I have:
1) Are objects indexed in order? That is, is it possible for Obj4 to be indexed before Obj1, Obj2, and Obj3? This would be an issue if I use an ORDER BY query and a cursor to continue searching: some entities would not be found if there is a delay in indexing.
2) If no ORDER BY is specified, what order are entities returned in a query?
3) How would I go about checking for newly indexed entities? That is, grab all entities, store the cursor, then later check whether any new entities were indexed since the last query?
A little less important, but food for thought:
4) Are all fields indexed together? That is, if I have a date property and, let's say, a name property, will both properties appear to be indexed at the same time for a given object?
5) If multiple entities are written in the same transaction, are all entities in the transaction indexed at the same time?
6) If all entities belong to the same entity group, are all entities indexed at the same time?
Thanks for the responses.
1) All entities have default indexes for every property. If you use ORDER BY someProperty, you will get entities ordered by the values of that property. You are correct about index building: queries use indexes, and indexes are built asynchronously, meaning it's possible that a query will not find an entity immediately after it was added.
2) ORDER BY defaults to ASC, i.e. ascending order.
3) Add a created timestamp to your entity, then order by it and repeat the cursor. See Cursors and Data Updates.
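A minimal sketch of that pattern with ndb (the kind name and page size are illustrative):

from google.appengine.ext import ndb

class Obj(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def aggregate_new(saved_cursor=None):
    # Order by creation time and resume from the cursor saved last run.
    query = Obj.query().order(Obj.created)
    entities, next_cursor, more = query.fetch_page(100, start_cursor=saved_cursor)
    for entity in entities:
        pass  # aggregate each newly seen entity here
    # Persist the cursor (e.g. via next_cursor.urlsafe()) and pass it back
    # in on the next run to pick up only entities indexed since then.
    return next_cursor or saved_cursor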
4) Indexes are built after the put() operation returns. They are also built in parallel, meaning that when you query, some indexes may be built and some not yet. See Life of a Datastore Write. Note that if you want to force an "apply" on an entity, you can issue a get() after put(), which will force the changes to be applied (= indexes written).
5) and 6) All entities touched in the same transaction must be in the same entity group (= have a common parent). The transaction isolation docs state that transactions can be unapplied, meaning that a query right after put() may not find the new entities. Again, you can force an entity to be applied via a read or an ancestor query, as sketched below.
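A sketch of that read-after-write trick in ndb; note that ndb caches reads, so the get() must bypass the caches to actually reach the Datastore:

from google.appengine.ext import ndb

class Obj(ndb.Model):
    date = ndb.StringProperty()

obj = Obj(date='2012-11-11')
key = obj.put()
# Reading the entity back by key forces the pending write to be applied
# (= indexes written). Skip ndb's in-context cache and memcache so the
# read really goes to the Datastore.
obj = key.get(use_cache=False, use_memcache=False)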
Related
Say I have an Objectify entity with 1 unindexed and 5 indexed fields. If I were to update the entity by modifying the unindexed property alone, would that cause the indices for the five indexed fields to be rewritten as well? Essentially I am worried about the write cost here.
Google charges per-entity write, irrespective of the number of indexes.
See https://cloud.google.com/appengine/pricing#costs-for-datastore-calls
Yes, every update of an entity causes updates of all indexed properties. In other words, the write costs are the same whether only one property is updated or all of them.
This is not specific to Objectify - it's how the Datastore works.
I have a table in the ndb datastore. In it I have:
updated = ndb.DateTimeProperty(auto_now_add=True, indexed=False)
created = ndb.DateTimeProperty(auto_now_add=True, indexed=False)
With this structure I have many records in the table. Now I am changing the unindexed fields to indexed=True. Will this index all the updated and created data already present in the table, or will it only index data written after the change?
And how do I index the existing unindexed rows of these columns?
These properties will not be indexed on existing entities until you rewrite those entities with the index enabled. This is because indexes are set at a per-entity level.
To ensure you index all these fields, you'll need to read every entity and then write it back down. For smaller datasets, you can do this with a simple query and loop, as sketched below. For larger datasets you will want to explore something like Cloud Dataflow.
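A minimal sketch of that query-and-loop approach with ndb (model_class stands in for your model; the batch size is arbitrary). Note that any property with auto_now=True would get refreshed by the re-put; auto_now_add values are left alone.

from google.appengine.ext import ndb

def reindex_all(model_class, batch_size=500):
    # Page through every entity and write each batch straight back;
    # the re-put stores entities under the new indexed=True settings.
    cursor = None
    more = True
    while more:
        entities, cursor, more = model_class.query().fetch_page(
            batch_size, start_cursor=cursor)
        if entities:
            ndb.put_multi(entities)

Call it as, e.g., reindex_all(MyModel).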
If you have a large dataset and concerns about costs, you could do some optimizations. For example, do a keys-only query against the indexed fields; if an entity's key appears in that result, don't write it back (since it's already indexed).
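A sketch of that optimization, again in ndb (model_class and prop are placeholders; keeping all keys in memory only suits modest datasets). A keys-only query ordered by the now-indexed property can only return entities already present in that property's index, so every other entity still needs the rewrite:

from google.appengine.ext import ndb

def reindex_missing(model_class, prop):
    # Ordering on the property restricts this keys-only query to
    # entities that already appear in that property's index.
    indexed = set(model_class.query().order(prop).fetch(keys_only=True))
    for key in model_class.query().fetch(keys_only=True):
        if key not in indexed:
            entity = key.get()
            entity.put()  # rewrite only entities missing from the index

For example, reindex_missing(MyModel, MyModel.created).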
There are many properties in my model that I currently don't need indexed but can imagine I might want indexed at some unknown point in the future. If I explicitly set indexed=False for a property now but change my mind down the road, will Datastore rebuild the entire indices automatically at that point, including for previously written data? Are there any other repercussions for taking this approach?
No, changing indexed=True to indexed=False (and vice-versa) will only affect entities written after that point to the datastore. Here is the documentation that talks about it and the relevant paragraph:
Similarly, changing a property from indexed to unindexed only affects entities subsequently written to the Datastore. The index entries for any existing entities with that property will continue to exist until the entities are updated or deleted. To avoid unwanted results, you must purge your code of all queries that filter or sort by the (now unindexed) property.
If you decide later that you want to start indexing properties, you'll have to go through your entities and re-put them into the datastore.
Note, however, that changing a property from unindexed to indexed does not affect any existing entities that may have been created before the change. Queries filtering on the property will not return such existing entities, because the entities weren't written to the query's index when they were created. To make the entities accessible by future queries, you must rewrite them to the Datastore so that they will be entered in the appropriate indexes. That is, you must do the following for each such existing entity:
Retrieve (get) the entity from the Datastore.
Write (put) the entity back to the Datastore.
To index properties of existing entities (as per the documentation):
Retrieve (get) the entity from the Datastore.
Write (put) the entity back to the Datastore.
didn't work for me. I employed the appengine-mapreduce library and wrote a MapOnlyMapper<Entity, Void> using DatastoreMutationPool to index all the existing entities in the Datastore.
Let's assume the property name was unindexed and I want to index it in all the existing entities. What I had to do is:
@Override
public void map(Entity value) {
    String property = "name";
    Object existingValue = value.getProperty(property);
    // Re-set the property as indexed, then queue the entity for a batched put.
    value.setIndexedProperty(property, existingValue);
    datastoreMutationPool.put(value);
}
Essentially, you will have to set the property as indexed property using setIndexedProperty(prop, value) and then save (put) the entity.
I know I am very late in posting an answer. I thought I could help someone who might be struggling with this problem.
I have entities in app engine which I query as:
foo = Foo.all().filter('bar =', baz).get()
#baz is unicode, bar is a StringProperty
#Foo inherits from db.Model
This works for most entities, but for some values of baz, no entity is returned, even though the entity certainly exists, as can be verified at https://console.cloud.google.com/datastore/entities/. The cause is that for that specific entity there is no index on its value of bar, as evidenced by the lack of a checkmark in the 'Indexed' column on that web page.
The docs state that
Indexes for simple queries, such as queries over a single property, are created automatically
So I would have expected that all entities of that type would have an index on that property, but evidently that is incorrect. Questions:
Q1: when the index is created, is it added to entities that were put prior to the first time a query is run using that index? (or is the index created the first time any entity of that type is put?)
Q2: if not, what changes to the entity (if any) will cause the index to be added to that property? (i tried changing a property other than bar, and putting, and that did not cause the entity to be added)
Q3: would explicitly listing the index in index.yaml change this behavior?
Q4: is there a way to programmatically determine whether an entity has an index on a specific property?
Q5: (bonus) is there any google documentation on the above?
thanks
Q1) The index for an individual property is created automatically when you write the first entity that has that property (with indexed=true). However, whether or not a property is added to the index is an entity/property-level attribute that is set when you write the entity.
Q2) For every property there is a flag that tells the back end whether it should index that property. If you read the entity and write it back down with the flag set to true on bar, it will be inserted into the index.
Q3) index.yaml is only for composite indexes (multi-property indexes). Individual properties are controlled by a property-level flag when you write/update the entity and do not need to be pre-configured.
Q4) Only by reading back every entity and checking the index flag for the property in question.
Q5) For composite indexes you can read the Datastore Indexes documentation. For property indexes, read the Entities, Properties, and Keys page down at the "Property and Value Types" section; you'll see lots about indexes there.
What's the length of the data you're storing? Documentation says:
Short strings (up to 1500 bytes) are indexed and can be used in query filter conditions and sort orders.
Long strings (up to 1 megabyte) are not indexed and cannot be used in query filters and sort orders.
More information on index creation in general here + its "related articles".
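In ndb terms, those two cases map onto the two string property types (a sketch; the model is illustrative):

from google.appengine.ext import ndb

class Doc(ndb.Model):
    # Short string, up to 1500 bytes: indexed, usable in filters and sorts.
    title = ndb.StringProperty()
    # Long string, up to 1 MB: never indexed, unusable in filters and sorts.
    body = ndb.TextProperty()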
I have some entities of a kind, and I need to keep them within a limited amount by discarding old ones, just like log entry maintenance. Any good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property to each of these entities. Create a cron job to check datastore statistics daily. If the size exceeds the limit, query entities of that kind sorted by date, oldest first, and delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional property with an index. I have observed that the entity key IDs appear to be increasing. So I'd like to query keys only, sort in ascending order, and delete the ones with smaller IDs. This requires no additional property (date) or index. But I'm seriously worried: is the key ID guaranteed to increase?
I think this is a common data maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID generation only guarantee that:
IDs allocated in this manner will not be used by the Datastore's automatic ID sequence generator and can be used in entity keys without conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element with an ID as the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
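A minimal sketch of the rotating-counter approach in ndb (LIMIT, the kind names, and the slot-key scheme are all illustrative). The counter and the entries live in different entity groups, hence the cross-group transaction:

from google.appengine.ext import ndb

LIMIT = 1000  # maximum number of entries to keep

class Counter(ndb.Model):
    value = ndb.IntegerProperty(default=0)

class LogEntry(ndb.Model):
    message = ndb.StringProperty()

@ndb.transactional(xg=True)
def add_entry(message):
    key = ndb.Key(Counter, 'log-counter')
    counter = key.get() or Counter(key=key)
    counter.value = (counter.value + 1) % LIMIT
    counter.put()
    # A fixed key per slot means this put() overwrites the entry written
    # LIMIT insertions ago, i.e. the oldest one.
    LogEntry(id='slot-%d' % counter.value, message=message).put()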
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it, so you're safe) and create entities with IDs equal to its value, then delete the old entities when the limit is reached.
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (pseudo-code):
class A:
List<Key> limitedEntities;
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and remove the corresponding entity.
Notice that when you add or delete an entity, you should modify the list in entity A in a transaction. Since these entities belong to different entity groups, you should consider using cross-group transactions.
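A sketch of that bookkeeping in ndb (the names and limit are illustrative). The new entity, the tracker, and the evicted entity can span up to three entity groups, hence xg=True; note the tracker itself is bounded by the 1 MB entity size limit, so this only suits modest limits:

from google.appengine.ext import ndb

LIMIT = 1000

class KeyList(ndb.Model):
    keys = ndb.KeyProperty(repeated=True)

@ndb.transactional(xg=True)
def add_limited(entity):
    new_key = entity.put()
    tracker = KeyList.get_by_id('tracker') or KeyList(id='tracker')
    tracker.keys.append(new_key)
    if len(tracker.keys) > LIMIT:
        # Drop the oldest key and delete the corresponding entity.
        oldest = tracker.keys.pop(0)
        oldest.delete()
    tracker.put()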
Hope this helps!