I am using Java SDK of app-engine.
I am using Master-slave datastore.
I have only two tables, each having 30 columns, and none of the values is larger than 20 bytes.
After entering 300 rows in each table, it shows Datastore Write Operations at 0.03 million out of 0.05 million.
I have checked the tables. They contain 300 entries only. There is no infinite loop kind of bug in my code.
Could someone please point out where I might be going wrong?
Thanks,
Amrish.
As noted in the previous answer, those write totals include your index writes.
All entity properties have associated default indexes (unless the property is configured to be unindexed), even if you have not defined any custom indexes.
See http://code.google.com/appengine/articles/indexselection.html for more detail, and http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost for more specifics on write costs.
For example, a new entity 'put' is:
2 Writes + 2 Writes per indexed property value (these are for the default indexes for that property) + 1 Write per composite index value (for any relevant custom indexes that you have defined).
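Running the numbers from the question through that formula (assuming all 30 properties use the default indexes and there are no custom indexes): each put costs 2 + 2 × 30 = 62 writes, so 600 new entities (300 per table) come to roughly 600 × 62 = 37,200 write operations, i.e. close to 0.04 million, which is the same order as the 0.03 million you are seeing.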
Datastore write operations include index updates. Make sure you don't have any exploding indexes. Keep in mind also that by default all fields have a built-in index; make any fields that you're not using unindexed to save quota.
Also, for better reliability and availability, consider switching to the High Replication Datastore (this doesn't directly fix your problem though).
I think the problem is due to the size of list_flightinfo. Also, this code might be getting called several times per second.
The key of the entity is:
src+"_"+dest
This key does not change inside the loop, so the same entity is being overwritten again and again.
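If each row is meant to be its own entity, the key name has to vary per row. A minimal sketch with the low-level API, where src, dest and flightNumber are hypothetical per-row variables:

    // Adding something row-specific to the key name means each put creates a
    // new entity instead of overwriting the single "src_dest" entity.
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Entity flight = new Entity("FlightInfo", src + "_" + dest + "_" + flightNumber);
    flight.setUnindexedProperty("status", status);
    datastore.put(flight);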
I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
There are only about 1400 records, yet the query takes 500 ms to 1.2 sec to return the data. Another query on a different entity also takes 300-400 ms for just 313 records.
I am wondering what might be causing such a delay. Can anyone please give some pointers on how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to a user. Gmail doesn't show you 1000 emails; it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
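For example, page-at-a-time fetching with the low-level Java API and a cursor could look roughly like this (the "Level" kind and the page size of 100 are assumptions):

    import com.google.appengine.api.datastore.*;
    import java.util.List;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("Level");
    PreparedQuery pq = ds.prepare(q);

    // First page: fetch 100 entities and remember where we stopped.
    QueryResultList<Entity> page =
        pq.asQueryResultList(FetchOptions.Builder.withLimit(100));
    String savedCursor = page.getCursor().toWebSafeString();

    // Later request: resume from the saved cursor instead of re-reading everything.
    List<Entity> nextPage = pq.asList(
        FetchOptions.Builder.withLimit(100)
            .startCursor(Cursor.fromWebSafeString(savedCursor)));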
Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.
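A rough sketch of that packing with the low-level API's EmbeddedEntity; the kind, key name, and row count are placeholders, and the real limit is the ~1 MB entity size:

    import com.google.appengine.api.datastore.*;
    import java.util.ArrayList;
    import java.util.List;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // One "container" entity holding many small rows as embedded entities,
    // instead of one datastore entity per row.
    Entity container = new Entity("LevelBatch", "calendar-2016-01");

    List<EmbeddedEntity> rows = new ArrayList<>();
    for (int i = 0; i < 500; i++) {
        EmbeddedEntity row = new EmbeddedEntity();
        row.setProperty("calendarId", "cal-1");
        row.setProperty("levelId", (long) i);
        row.setProperty("levelName", "level-" + i);
        row.setProperty("levelValue", 42L);
        rows.add(row);
    }
    container.setUnindexedProperty("rows", rows);
    ds.put(container);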
The Datastore documentation is very clear that there is an issue with "hotspots" if you index monotonically increasing values (like the current Unix time); however, no good alternative is mentioned, nor does it address whether storing the exact same value (rather than increasing values) would create "hotspots":
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
https://cloud.google.com/datastore/docs/best-practices
I would like to store the time when each particular entity is inserted into the datastore, if that's not possible though, storing just the date would also work.
That almost seems more likely to cause "hotspots" though, since every new entity for 24 hours would get added to the same index (that's my understanding anyway).
Perhaps there's something more going on with how indexes work (I am having trouble finding great explanations of exactly how they work), and having the same value indexed over and over again is fine while incrementing values are not.
I would appreciate if anyone has an answer to this question, or else better documentation for how datastore indexes work.
Is your application actually planning on querying the date? If not, consider simply not indexing that property. If you only need to read that property infrequently, consider writing a mapreduce rather than indexing.
That advice is given due to the way BigTable tablets work, which is described here: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
To the best of my knowledge, it's more important to have the primary key of an entity not be a monotonically increasing number. It would be better to have a string key, so the entity can be stored with better distribution.
But saying this as a non-expert, I can't imagine that indexes on individual properties with monotonic values would be as problematic, if it's legitimately needed. I know with the Nomulus codebase for example, we had a legitimate need for an index on time, because we wanted to delete commit logs older than a specific time.
One cool thing I think happens with these monotonic indexes is that, when these tablet splits don't happen, fetching the leftmost or rightmost element in the index actually has better latency properties than fetching stuff in the middle of the index. For example, if you do a query that just grabs the first result in the index, it can actually go faster than a key lookup.
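For illustration, a "grab the first result" query of the kind described might look like this with the low-level API (the "CommitLog" kind and "timestamp" property are assumptions):

    import com.google.appengine.api.datastore.*;
    import java.util.List;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Ask only for the first entity in ascending timestamp order; this can be
    // answered from the leftmost rows of the index.
    Query q = new Query("CommitLog")
        .addSort("timestamp", Query.SortDirection.ASCENDING);
    List<Entity> result = ds.prepare(q).asList(FetchOptions.Builder.withLimit(1));
    Entity oldest = result.isEmpty() ? null : result.get(0);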
There is a key quote in the page that Justine linked to that is very helpful:
As a developer, what can you do to avoid this situation? ... Lower your write rate, or figure out how to better distribute values.
It is ok to store an indexed time stamp as long as that entity has a low write rate.
If you have an entity where you want to store an indexed time stamp and the entity has a high write rate, then the solution is to split the entity into two entities. Entity A will have properties that need to be updated frequently and entity B will have the time stamp and properties that don't get updated often.
When I do this, I have a common ID for the two entities to make it really easy to get from one to the other.
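A sketch of that split with the low-level API; the kinds, the properties, and the shared "order-42" ID are just placeholders:

    import com.google.appengine.api.datastore.*;
    import java.util.Date;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    String commonId = "order-42";

    // Entity A: updated frequently, no indexed timestamp on it.
    Entity hot = new Entity("OrderCounters", commonId);
    hot.setUnindexedProperty("viewCount", 1234L);
    ds.put(hot);

    // Entity B: written rarely, carries the indexed timestamp.
    Entity cold = new Entity("OrderInfo", commonId);
    cold.setProperty("createdAt", new Date());   // indexed by default
    ds.put(cold);

    // The shared key name makes it easy to hop from one entity to the other.
    try {
        Entity info = ds.get(KeyFactory.createKey("OrderInfo", commonId));
    } catch (EntityNotFoundException e) {
        // not created yet
    }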
You could try storing just the date and put random hours, minutes, and seconds into the timestamp, then throw away that extra data later. (Or keep the hours and minutes and use random seconds, for example). I'm not 100% sure this would work but if you need to index the date it's worth trying.
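If you want to try that, a quick sketch of fuzzing the time-of-day (plain Java, nothing datastore-specific is assumed here):

    import java.util.Calendar;
    import java.util.Date;
    import java.util.concurrent.ThreadLocalRandom;

    // Keep the real date but randomize the time-of-day so successive writes
    // don't produce strictly increasing index values.
    Calendar cal = Calendar.getInstance();
    cal.set(Calendar.HOUR_OF_DAY, ThreadLocalRandom.current().nextInt(24));
    cal.set(Calendar.MINUTE, ThreadLocalRandom.current().nextInt(60));
    cal.set(Calendar.SECOND, ThreadLocalRandom.current().nextInt(60));
    cal.set(Calendar.MILLISECOND, 0);
    Date fuzzedTimestamp = cal.getTime();   // store this in the indexed property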
I've read throughout the Internet that the Datastore has a limit of 1 write per second for an Entity Group. Most of what I read indicate a "write to an entity", which I would understand as an update. Does the 1 write per second also apply to adding entities into the group?
A simple case would be a Thread where multiple posts can be added by different users. The way I see it, it's logical to have the Thread be the ancestor of the Posts. Thus, forming a wide entity group. If the answer to my question above is yes, a "trending" thread would be devastated by the write limit.
That said, would it make sense to get rid of the ancestry altogether or should I switch to the user as the ancestor? What I'd like to avoid is having the user be confused when they don't see the post due to eventual consistency.
A quick clarification to start with
1 write per second doesn't mean 1 entity per second. You can batch writes together, up to a maximum of 500 entities (transactions also have a 10 MiB limit). So if you can batch posts, you can improve your write rate.
Note: you can technically go higher than 1 write per second, although the longer you exceed that limit, the greater your risk of contention errors and the further behind the system's eventual consistency can lag.
You can read more on the limits here.
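For example, a batched put with the low-level Java API might look like this (the Thread/Post kinds and key names are assumptions based on the question):

    import com.google.appengine.api.datastore.*;
    import java.util.ArrayList;
    import java.util.List;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Key threadKey = KeyFactory.createKey("Thread", "thread-123");

    // Buffer several posts and write them in one batched put (up to 500
    // entities per call), rather than issuing one put per post.
    List<Entity> batch = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
        Entity post = new Entity("Post", threadKey);   // child of the thread
        post.setUnindexedProperty("body", "post body " + i);
        batch.add(post);
    }
    ds.put(batch);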
Client-side sharding
If you need to use ancestor queries for strong consistency AND 1 write per second is not enough, you could implement client-side sharding. This essentially means that you write the posts to up to N different entity groups using a known key scheme, for example:
Primary parent: "AncestorA"
Optional shard 1: "AncestorA-1"
Optional shard N: "AncestorA-(N-1)"
To query for your posts, issue N ancestor queries. Naturally, you'll need to merge these results on the client side to display them in the correct order.
This will allow you to do N writes per second.
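A rough sketch of that scheme with the low-level Java API; the kinds, the shard count N = 5, and the "createdAt" sort property are assumptions:

    import com.google.appengine.api.datastore.*;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    int numShards = 5;   // N

    // Write: pick one of the N parents at random so writes spread across groups.
    int shard = ThreadLocalRandom.current().nextInt(numShards);
    String parentName = (shard == 0) ? "AncestorA" : "AncestorA-" + shard;
    Entity post = new Entity("Post", KeyFactory.createKey("Thread", parentName));
    post.setProperty("createdAt", System.currentTimeMillis());
    ds.put(post);

    // Read: one ancestor query per shard, merged on the client side.
    List<Entity> allPosts = new ArrayList<>();
    for (int i = 0; i < numShards; i++) {
        String name = (i == 0) ? "AncestorA" : "AncestorA-" + i;
        Query q = new Query("Post")
            .setAncestor(KeyFactory.createKey("Thread", name))
            .addSort("createdAt", Query.SortDirection.ASCENDING);
        allPosts.addAll(ds.prepare(q).asList(FetchOptions.Builder.withDefaults()));
    }
    // allPosts still needs a final sort by createdAt before display.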
I might have an Entity with possibly thousands of columns, and was wondering if it would pose any problem (nothing will be indexed):
Will queries be slower if the number of columns increases?
Can there be in theory an unlimited number of columns?
While I have never had thousands of columns myself, so I can't speak to speed and performance, judging from the data viewer on the dashboard the number of columns should be unlimited.
Considering that the GAE Datastore is essentially a very large key-value store right down to property level, in principle an unlimited number of properties are allowed. Just not all together in one record for space reasons, as others already said.
Datastore is schemaless, but many libraries such as JDO, JPA and Objectify aim to "fix" this "deficiency" by introducing some schema of their own. That is unhelpful in your scenario.
I suggest you bypass those libraries and directly call the Datastore low-level API as per this example instead. You can avoid the overheads of indexing if you change the setProperty calls to setUnindexedProperty as often as possible. Remember to test for a null return from a getProperty call for a property that may be absent in some records.
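A minimal sketch of that approach; the "WideRecord" kind and property names are just placeholders:

    import com.google.appengine.api.datastore.*;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Write: unindexed properties avoid the per-property index writes.
    Entity record = new Entity("WideRecord", "row-1");
    record.setUnindexedProperty("col1", "value1");
    record.setUnindexedProperty("col2", 42L);
    // ...as many more setUnindexedProperty calls as you need...
    ds.put(record);

    // Read: getProperty returns null for a property absent on this record.
    try {
        Entity fetched = ds.get(KeyFactory.createKey("WideRecord", "row-1"));
        Object col999 = fetched.getProperty("col999");
        if (col999 == null) {
            // property not present on this record; fall back to a default
        }
    } catch (EntityNotFoundException e) {
        // no such record
    }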
I have a couple of entities with properties numbering in the range of 40-50. All these properties are unindexed. These entities are part of a larger entity-group tree structure, and are always retrieved by their key. None of the properties (except the key) are indexed. I am using Objectify to work with entities on BigTable.
I want to know if there is any performance impact in reading or writing an entity with large number of properties from/to BigTable.
Since these large entities are only fetched by their keys and never participate in any query, I was wondering if I should serialize the entity POJO and store it as a blob. It is pretty straightforward to do this in Objectify using the @Serialized annotation. I understand that by serializing my entity and storing it as a blob, I render the blob totally opaque to any other program or non-Java code, but this is not a concern.
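For concreteness, the kind of mapping meant here would look roughly like this, assuming Objectify 3.x-style annotations (the annotation packages changed in later Objectify versions) and a hypothetical FlightLeg POJO:

    import javax.persistence.Id;
    import com.googlecode.objectify.annotation.Serialized;
    import java.io.Serializable;

    public class FlightLegEntity {
        @Id Long id;

        // All 40-50 fields collapsed into one serialized blob property.
        @Serialized FlightLeg details;
    }

    class FlightLeg implements Serializable {
        String origin;
        String destination;
        // ...remaining fields...
    }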
I am yet to benchmark the performance difference, but before doing so, I want to know if anybody has done this before or has any advice/opinion to share.
There is always some overhead per property, and serializing won't help much, as it just moves the processing from one place to another.
I have entities with up to 25 properties and I fetch them by key on almost every request. The performance difference is negligible for me, hardly +/- 1 ms. Performance problems normally occur in the query parts. The number of unindexed properties won't count for much in performance, while indexed properties can significantly delay a put due to the index modifications.
If you must, you can split the properties across multiple entities if you are not going to need them all at once.
Going purely by what little I know of how it works, I'd say having a bunch of unindexed properties wouldn't be any different from having the whole thing serialized.