I am trying to understand how to tackle these two limitations of Google Datastore:
1. A write throughput limit of about one transaction per second for a single entity group.
2. All the data accessed by a transaction must be contained in at most 25 entity groups.
Suppose I wanted to store user info. Due to the 1st limitation I can't store all users in a single entity group, as several users may update their info at the same time. But if I try to save all users as root entities, the 2nd limitation says I can't use any query on users (like finding users whose age > 10). Now I am wondering how the datastore is even usable with such limitations.
You're misinterpreting the 2nd limitation: you can, of course, query all users for those whose age > 10; you just must not do it inside a transaction.
If consistency is important you can:
- perform a keys-only query outside a transaction
- obtain a list of up to 25 keys to operate on (for example by using Query Cursors)
- inside a transaction, access (by key) the entities corresponding to the keys in your list; these accesses will be consistent (see the sketch below)
A query isn't a transaction; the returned results can be spread across any number of entity groups.
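For illustration, here is a minimal Python ndb sketch of that pattern; the User model, its age property, and the min_age parameter are assumptions for the example, not anything from your code:

    from google.appengine.ext import ndb

    class User(ndb.Model):
        age = ndb.IntegerProperty()

    def operate_on_users_older_than(min_age):
        # Keys-only query, outside any transaction (eventually consistent).
        keys = User.query(User.age > min_age).fetch(25, keys_only=True)

        @ndb.transactional(xg=True)  # cross-group: up to 25 entity groups
        def operate_on_batch():
            users = ndb.get_multi(keys)  # key lookups are strongly consistent
            for user in users:
                if user is not None and user.age > min_age:  # re-verify in txn
                    pass  # ... operate on the user here
        operate_on_batch()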
Summary
I have an issue where the database writes from my task queue (approximately 60 tasks, at 10/s) are somehow being overwritten/discarded during a concurrent database read of the same data. I will explain how it works. Each task in the task queue assigns a unique ID to a specific datastore entity of a model.
If I run an indexed datastore query on the model and loop through the entities while the task queue is in progress, I would expect that some of the entities will have been operated on by the task queue (i.e. assigned an ID) and others are still yet to be affected. Unfortunately, what seems to be happening is that during the loop through the query results, entities that were already operated on (i.e. successfully assigned an ID) are being overwritten or discarded, appearing as if they were never operated on, even though, according to my logs, they were.
Why is this happening? I need to be able to read the status of my data without affecting the task queue write operations in the background. I thought maybe it was a caching issue, so I tried enforcing use_cache=False and use_memcache=False on the query, but that did not solve the issue. Any help would be appreciated.
Other interesting notes:
If I allow the task queue to complete fully before doing a datastore query, it acts as expected and nothing is overwritten/discarded.
This is typically an indication that the write operations to the entities are not performed in transactions. Transactions can detect such concurrent write (and read!) operations and retry them, ensuring that the data remains consistent.
You also need to be aware that queries (if they are not ancestor queries) are eventually consistent, meaning their results are a bit "behind" the actual datastore information: it takes some time from the moment the datastore information is updated until the corresponding indexes that the queries use are updated accordingly. So when processing entities from query results you should also transactionally verify their content. Personally, I prefer to make keys-only queries and then obtain the entities via key lookups, which are always consistent (done inside transactions if I intend to update the entities or need consistent reads).
For example, if you query for entities which don't have a unique ID yet, you may get entities which were in fact recently operated on and do have an ID. So you should (transactionally) check whether the entity actually has an ID and skip its update if it does.
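A minimal ndb sketch of that verify-then-write pattern; the Record model, its unique_id property, and generate_id() are hypothetical names for this example:

    from google.appengine.ext import ndb

    @ndb.transactional
    def assign_id_if_missing(entity_key, new_id):
        entity = entity_key.get()  # fresh, consistent read inside the txn
        if entity.unique_id is not None:
            return False  # already operated on by a task; skip the update
        entity.unique_id = new_id
        entity.put()
        return True

    # Outside the transaction: eventually consistent keys-only query.
    keys = Record.query(Record.unique_id == None).fetch(keys_only=True)
    for key in keys:
        assign_id_if_missing(key, generate_id())  # verify + write atomically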
Also make sure you're not updating entities obtained from projection queries: results obtained from such queries may not represent the entire entities, and writing them back will wipe out properties not included in the projection.
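If you do need to update something found via a projection query, a safe pattern (sketched here with a hypothetical Task model) is to re-fetch the full entity by key first:

    from google.appengine.ext import ndb

    # Projection results are partial entities; never write them back.
    partials = Task.query().fetch(projection=[Task.title])
    for p in partials:
        full = p.key.get()                 # re-fetch the complete entity
        full.title = full.title.strip()
        full.put()                         # safe: all properties present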
I have an Entity that represents a Payment Method. I want to have an entity group for all the payment attempts performed with that payment method.
The 1 write-per-second limitation is fine and actually good for my use case, as there is no good reason to charge a specific credit card more frequently than that, but I could not find any specifications on the max size of an entity group.
My concern is would a very active corporate account hit any limitations in terms of number of records within an entity group (when they perform their 1 millionth transaction with us)?
No, there isn't a limit on the entity group size; all datastore-related limits are documented at Limits.
But be aware that the entity group size matters when it comes to data contention, see Keep entity groups small. Please note that contention happens not only when writing entities, but also when reading them inside a transaction (see Contention problems in Google App Engine) or, occasionally, maybe even outside transactions (see TransactionFailedError on GAE when no transaction).
IMHO your use case is not worth the risk of dealing with these issues (fairly difficult to debug and address); I wouldn't use a single entity group in this case.
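One alternative, sketched below in ndb, is to keep each payment attempt as a root entity and link it to the payment method with an indexed key property; the PaymentAttempt kind and the method_key variable are assumptions for the example. You lose strong consistency on the listing query, but there is no shared entity group to grow or to contend on:

    from google.appengine.ext import ndb

    class PaymentAttempt(ndb.Model):
        # Root entity: no ancestor, so no shared entity-group write limit.
        payment_method = ndb.KeyProperty()  # indexed by default
        created = ndb.DateTimeProperty(auto_now_add=True)

    # All attempts for one payment method (eventually consistent query):
    attempts = PaymentAttempt.query(
        PaymentAttempt.payment_method == method_key).fetch()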
I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
And there are only about 1400 records, yet the query takes 500 ms-1.2 s to return the data. Another query on a different entity kind also takes 300-400 ms for just 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to a user. Gmail doesn't show you 1000 emails; it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
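As a concrete illustration, here is a paging sketch using ndb query cursors; the Level model and PAGE_SIZE are assumptions standing in for your entity kind:

    from google.appengine.ext import ndb

    PAGE_SIZE = 100

    def get_page(cursor_token=None):
        cursor = ndb.Cursor(urlsafe=cursor_token) if cursor_token else None
        # fetch_page returns (results, next_cursor, more) in one batched RPC.
        levels, next_cursor, more = Level.query().fetch_page(
            PAGE_SIZE, start_cursor=cursor)
        next_token = next_cursor.urlsafe() if more and next_cursor else None
        return levels, next_token  # hand next_token back to the client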
Not sure if this will help, but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities; they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities, each with those 4 properties. The max size an entity can have is 1 MB, so you'll want to pack the array to get as close to that 1 MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.
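In ndb terms this could look like the following sketch; LevelEntry and LevelBatch are hypothetical names modeled on the properties in the question:

    from google.appengine.ext import ndb

    class LevelEntry(ndb.Model):
        calendar_id = ndb.StringProperty()
        level_id = ndb.StringProperty()
        level_name = ndb.StringProperty()
        level_value = ndb.IntegerProperty()

    class LevelBatch(ndb.Model):
        # LocalStructuredProperty stores the nested 'entities' as blobs
        # inside this one entity, packing many logical records into a
        # single datastore entity (up to the 1 MB entity size limit).
        entries = ndb.LocalStructuredProperty(LevelEntry, repeated=True)

A single get of one LevelBatch then returns hundreds of logical records in one round trip, at the cost of losing the ability to query on the nested properties (LocalStructuredProperty contents are not indexed).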
In Google App Engine Datastore HRD in Java,
We can't do joins or query multiple tables using the Query object or GQL directly.
I just want to know whether my idea is a correct approach or not.
Suppose we build an index in hierarchical order, like Parent - Child - Grandchild, where each node looks like:
Node
- Key
- IndexedProperty
- Set
In case we want to collect all the sub-children and grandchildren, we can collect all the keys matching the hierarchy filter condition and return that set of keys as the result.
And in Memcache we can hold each key pointing to its DB entity; even if the cache does not have them, we can still get all the records from the DB in a single batch query using the set of keys (see the sketch below).
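For what it's worth, here is a Python ndb sketch of that batch lookup (the question is about Java, but the pattern is the same); the Node kind and the matching_ids list are assumptions for the example:

    from google.appengine.ext import ndb

    # Keys gathered from the custom hierarchy index described above.
    keys = [ndb.Key('Node', node_id) for node_id in matching_ids]

    # get_multi issues one batched lookup; ndb's built-in caching layer
    # checks memcache first and falls back to the datastore for misses.
    nodes = ndb.get_multi(keys)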
Pros
1) Fast retrieval - Google recommends getting entities by keys.
2) A single transaction is enough to collect data from multiple tables.
3) Memcache and the persistent datastore will represent the same form.
4) It will scan only the data related to the group, like a user or parent node.
Cons
1) The metadata of the DB will grow, so the DB size increases.
2) If the index of a single parent grows beyond 1 MB, we have to split it and save it as a blob in the DB.
Is this structure a good approach or not?
In case we have deep levels in the hierarchy, this will avoid running lots of query operations to collect all the items dependent on a parent.
In case of multiple parents:
Collect all the indexes and get the keys related to the query.
Collect all the data in a single transaction using the list of keys.
If anyone has found more pros or cons, please add them and say whether this approach is correct or not.
Many thanks
Krishnan
There are quite a few things going on here that are important to think about:
Datastore is not a relational database. You definitely should not be approaching your data storage from a tables and join perspective. It will lead to a messy and most likely inefficient setup.
It seems like you are trying to restructure your use of Datastore to provide complete transactional and strongly consistent use of your data. The reason Datastore cannot provide this natively is that it is too inefficient to provide these guarantees along with high availability.
With the Datastore, you want to be able to support many (thousands, hundreds of thousands, millions, etc.) writes per second to different entities. The reason the Datastore provides the notion of an entity group is that it allows the developer to specify a specific scope of consistency.
Consider an example todo tracking service. You might define a User and a Todo kind. You wouldn't want to provide strong consistency for all Todos, since every time a user adds a new note, the underlying system would have to ensure that it was put transactionally with all other users writing notes. On the other hand, using entity groups, you can say that a single User represents your unit of consistency. This means that when a user writes a new note, this has to be updated transactionally with any other modification to that user's notes. This is a much better unit of consistency since as your service scales to more users, they won't conflict with each other.
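Here is a brief ndb sketch of that consistency unit (the model names follow the example above; assume user_key is the Key of a User):

    from google.appengine.ext import ndb

    class User(ndb.Model):
        name = ndb.StringProperty()

    class Todo(ndb.Model):  # stored as a child of a User
        text = ndb.StringProperty()

    @ndb.transactional
    def add_todo(user_key, text):
        # Writes within one user's entity group are serialized and atomic;
        # different users' groups never contend with each other.
        Todo(parent=user_key, text=text).put()

    # Strongly consistent ancestor query over one user's todos:
    todos = Todo.query(ancestor=user_key).fetch()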
You are talking about creating and managing your own indexes. You almost certainly don't want to do this from an efficiency point of view. Further, you'd have to be very careful since it seems you would have a huge number of writes to a single entity / range of entities which represent your table. This is a known Datastore anti-pattern.
One of the hard parts about the Datastore is that each project may have very different requirements and thus a different data layout. There is definitely no one-size-fits-all way to structure your data, but here are some resources:
What actually happens when you do a write to Datastore
How Datastore stores data
Datastore Entity relationship modeling
Datastore transaction isolation
I'm thinking about introducing entity groups in my application to enable strong consistency. Suppose I have an Order entity and an OrderRow entity, with each Order as a parent of its OrderRows. Then it would be natural to update the Order with the sum of all OrderRows when adding an OrderRow.
But because the datastore is limited to 1 write per second per entity group, each time I edit/add an OrderRow it would take at least one second because of the additional update to the Order.
Is this correct? If so, the one-second limit is extremely limiting, because it's very common to update two entities within the same entity group in one user request.
If it is within a single request, then you can run them all within the same transaction (which is the purpose of the entity group). The whole transaction commits as a single write to the entity group, so updating the Order and its OrderRow together does not count as two separate writes against the limit.
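A minimal ndb sketch of that single-transaction update; the Order model with a total property and the OrderRow model with an amount property are assumptions for the example:

    from google.appengine.ext import ndb

    @ndb.transactional
    def add_order_row(order_key, amount):
        order = order_key.get()
        row = OrderRow(parent=order_key, amount=amount)
        order.total += amount          # keep the running sum on the Order
        ndb.put_multi([order, row])    # one atomic commit to the group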