Google Cloud Datastore queries too slow when fetching all records - google-app-engine

I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
There are only about 1,400 records, yet the query takes 500 ms to 1.2 s to return the data. Another query on a different entity takes 300-400 ms for just 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.

You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to a user. Gmail doesn't show you 1000 emails, it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
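As a rough illustration of the paging idea above, here is a minimal sketch. It uses an in-memory list as a stand-in for the query result set, and the function name fetch_page and the integer cursor are assumptions made for the example. A real Datastore query would take a limit and an opaque cursor instead of slicing a list.

```python
def fetch_page(records, page_size, cursor=0):
    """Return one page of records plus the cursor for the next page.

    `records` is a stand-in for a query result set (an assumption for
    this sketch); the cursor here is just an integer position.
    """
    page = records[cursor:cursor + page_size]
    next_cursor = cursor + len(page)
    # None signals that there are no further pages.
    if next_cursor >= len(records):
        next_cursor = None
    return page, next_cursor


# Hypothetical data: 1400 small records, as in the question.
levels = [{"levelId": i, "levelName": "level-%d" % i} for i in range(1400)]

first_page, cursor = fetch_page(levels, 100)
second_page, cursor = fetch_page(levels, 100, cursor)
```

Each request then touches only 100 records, and the returned cursor is handed back by the client when it asks for the next page.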

Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.
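The packing step could look roughly like the sketch below. It is only an approximation: the JSON-encoded length is used as a stand-in for the stored size of each embedded record, and pack_records and MAX_ENTITY_BYTES are names invented for the example. The real serialized size in Datastore will differ, so the budget should leave some headroom.

```python
import json

# Assumed budget, kept a little under the 1MB entity limit.
MAX_ENTITY_BYTES = 1000 * 1000


def pack_records(records, max_bytes=MAX_ENTITY_BYTES):
    """Group records into lists whose estimated size stays under max_bytes."""
    chunks = []
    current = []
    current_size = 0
    for record in records:
        # JSON length is only a rough proxy for the stored size.
        size = len(json.dumps(record).encode("utf-8"))
        # Start a new chunk when adding this record would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(record)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned list would become the array property of one real entity, so 1,400 small records would likely fit in a single entity.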

Related

Appengine Datastore Read Operations limit exceeded

I keep hitting the Read Operations quota, which is 50k, in less than two hours and without much activity. In the datastore I have about 200 records of a class that contains 8 variables of the short type in Java. The user can add new instances of this class to the datastore.
Each time a user visits the website I have to show the results, so I can serve at most 50,000 / 200 = 250 users (usually far fewer).
Is there any other way I can store the results persistently? Maybe I can store the 200 records as one and parse them manually in the code.
I read about Blobstore, but I understand it's more about uploading files than about databases and querying. Should I use it? I want to keep the application in the free tier.
If you need to show the same records to all users, keep them in Memcache - or even in your instance memory.
Check out Objectify if you're using Java. It has a first- and second-level cache (the second level uses Memcache, as Andrei recommended). Objectify will help you avoid repeated trips to the datastore; it all happens out of the box with no re-coding on your part. Just read about the @Cache annotation for entity objects as well as the Objectify.cache(true) method.
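To make the "keep them in instance memory" suggestion concrete, here is a minimal read-through cache sketch. It is plain Python with no App Engine or Objectify dependency; ReadThroughCache, load_records and the 300-second TTL are assumptions made for the example, and load_records stands in for the datastore query that costs read operations.

```python
import time


class ReadThroughCache(object):
    """Keep one loaded value in instance memory for ttl_seconds."""

    def __init__(self, loader, ttl_seconds=60, clock=time.time):
        self._loader = loader    # function that performs the expensive read
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache miss or expired entry: hit the backing store once.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


calls = []


def load_records():
    # Hypothetical stand-in for the 200-record datastore query.
    calls.append(1)
    return ["record-%d" % i for i in range(200)]


cache = ReadThroughCache(load_records, ttl_seconds=300)
first = cache.get()
second = cache.get()  # served from memory, no second load
```

With this shape, the 200 reads happen once per TTL window per instance rather than once per visitor. New records added by users only show up after the entry expires, unless the cache is cleared on write.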

Index Builder for Fast Retrieval similar Multiple table retrieval in Single Query in App Engine

In the Google App Engine Datastore (HRD) with Java,
we can't do joins or query multiple tables directly using the Query object or GQL.
I would like to know whether my idea is a sound approach.
Suppose we build an index in hierarchical order (parent, child, grandchild), with one node per level:
Node
- Key
- IndexedProperty
- Set
To collect all the children and grandchildren, we can gather every key that matches the hierarchy filter condition and return that set of keys.
In Memcache, each key can point to its DB entity. If the cache does not hold an entry, we can still fetch all the records from the DB in a single batch get using the set of keys.
Pros
1) Fast retrieval - Google recommends using get entities by keys.
2) Single Transaction is enough to collect multiple table data.
3) Memcache and Persistent Datastore will represent the same form.
4) It will scan only the related data to the group like user or parent node.
Cons
1) The metadata grows, so the overall DB size increases.
2) If the index of a single parent takes more than 1MB, we have to split it and save it as a blob in the DB.
Is this structure a good approach or not?
With deep hierarchies, it avoids running a lot of queries to collect all the items that depend on a parent.
In the case of multiple parents:
Collect all the indexes and get the keys related to the query.
Collect all the data in a single transaction using the list of keys.
If anyone finds more pros or cons, please add them and say whether this approach is sound.
Many thanks
Krishnan
There are quite a few things going on here that are important to think about:
Datastore is not a relational database. You definitely should not be approaching your data storage from a tables and join perspective. It will lead to a messy and most likely inefficient setup.
It seems like you are trying to restructure your use of Datastore to provide complete transactional and strongly consistent use of your data. The reason Datastore cannot provide this natively is that it is too inefficient to provide these guarantees along with high availability.
With the Datastore, you want to be able to provide the ability to support many (thousands, hundreds of thousands, millions, etc) writes per second to different entities. The reason that the Datastore provides the notion of an entity group is that it allows the developer to specify a specific scope of consistency.
Consider an example todo tracking service. You might define a User and a Todo kind. You wouldn't want to provide strong consistency for all Todos, since every time a user adds a new note, the underlying system would have to ensure that it was put transactionally with all other users writing notes. On the other hand, using entity groups, you can say that a single User represents your unit of consistency. This means that when a user writes a new note, this has to be updated transactionally with any other modification to that user's notes. This is a much better unit of consistency since as your service scales to more users, they won't conflict with each other.
You are talking about creating and managing your own indexes. You almost certainly don't want to do this from an efficiency point of view. Further, you'd have to be very careful since it seems you would have a huge number of writes to a single entity / range of entities which represent your table. This is a known Datastore anti-pattern.
One of the hard parts about the Datastore is that each project may have very different requirements and thus data layout. There is definitely not one size fits all for how to structure your data, but here are some resources:
What actually happens when you do a write to Datastore
How Datastore stores data
Datastore Entity relationship modeling
Datastore transaction isolation

Is the number of columns limited for an Entity?

I might have an Entity with possibly thousands of columns, and was wondering if it would pose any problem (nothing will be indexed):
Will queries be slower if the number of columns increases?
Can there be in theory an unlimited number of columns?
I have never had thousands of columns, so I can't speak to speed and performance, but judging from the data viewer on the dashboard, the number of columns appears to be unlimited.
Considering that the GAE Datastore is essentially a very large key-value store right down to property level, in principle an unlimited number of properties are allowed. Just not all together in one record for space reasons, as others already said.
Datastore is schemaless, but many libraries such as JDO, JPA and Objectify aim to "fix" this "deficiency" by introducing some schema of their own. That is unhelpful in your scenario.
I suggest you bypass those libraries and directly call the Datastore low-level API as per this example instead. You can avoid the overheads of indexing if you change the setProperty calls to setUnindexedProperty as often as possible. Remember to test for a null return from a getProperty call for a property that may be absent in some records.

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. EG: Could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since salesforce has upwards of a dozen server clusters, each with their own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But, in the past I've seen performance issues start to creep in around 2M records. One possible remedy is you can ask salesforce to index fields that you plan to filter on.

Insert thousands of entities in a reasonable time into BigTable

I'm having some issues when I try to insert the 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
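import csv
from databaseModel import *  # provides InseeCity (and presumably strip_accents)
from google.appengine.ext import db
from google.appengine.ext.db import GqlQuery

def add_cities():
    # Tab-separated file, one city per row
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        # One datastore query per row to resolve the region key
        region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
    # Single batch put once all rows have been parsed
    db.put(mylist)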
It takes around 5 minutes (!!!) on the local dev server, and as much as 10 when deleting them with the db.delete() function.
When I try it online by calling a test.py page containing add_cities(), the 30 s timeout is reached.
I'm coming from the MySQL world and I think it's a real shame not to be able to add 36k entities in less than a second. I may be going about it the wrong way, so I'm turning to you:
Why is it so slow?
Is there any way to do it in a reasonable time?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:
Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
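The second suggestion can be sketched in plain Python as follows. The dict-based rows and the names build_region_index and attach_regions are assumptions made for the example, not part of the original code. The point is only that the per-row query becomes a dictionary lookup against a list loaded once. For the first suggestion, the old db API lets a key be built directly from a kind and key name (db.Key.from_path), which avoids even that lookup.

```python
def build_region_index(regions):
    """Map region code -> region key, built once before the import loop."""
    return {region["code"]: region["key"] for region in regions}


def attach_regions(rows, region_index):
    """Resolve each row's region with a dict lookup instead of a query."""
    cities = []
    for row in rows:
        cities.append({
            "name": row["name"],
            # Unknown codes resolve to None rather than raising.
            "region": region_index.get(row["region_code"]),
        })
    return cities


# Hypothetical sample data standing in for the Region kind and the CSV rows.
regions = [{"code": "11", "key": "Region:11"}, {"code": "93", "key": "Region:93"}]
rows = [
    {"name": "Paris", "region_code": "11"},
    {"name": "Nice", "region_code": "93"},
    {"name": "Nowhere", "region_code": "00"},
]

cities = attach_regions(rows, build_region_index(regions))
```

For 36k rows this replaces 36k queries with one load of the region list, which is where most of the time in the original loop goes.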
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.
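Splitting the work into offset-based subsets could look like the sketch below. The helper name batch_ranges is an assumption made for the example, and actually enqueueing each task is left out. Each (offset, count) pair would become the payload of one task.

```python
def batch_ranges(total, batch_size=500):
    """Yield (offset, count) pairs that cover `total` rows in order."""
    for offset in range(0, total, batch_size):
        yield offset, min(batch_size, total - offset)


# 36k rows, as in the question, split into subsets of 500.
ranges = list(batch_ranges(36000))
```

Each task then reads only its own slice of the file, so no single request has to run for long.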
FWIW we process large CSVs into the datastore using mapreduce, with some initial handling/validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
If you're doing inserts etc., make sure you batch as much as possible. Don't insert individual records, and the same goes for lookups: get_by_key_name allows you to pass in a list of key names. (I believe db put has a limit of 200 records at the moment?)
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!

Resources