Appengine Datastore Read Operations limit exceeded - google-app-engine

I keep reaching the limit of Read Operations exceeded which is 50k for less than two hours without much activity. In the datastore I have about 200 records of a class that contains 8 variables in the short type in Java. The user can add new instances in this class in the datastore.
Each time the user reaches the website I have to show the results so I can show to max 50 000/ 200 = 250 users (usually even much less).
Is there any other way I can store the results persistently? Maybe I can put the 200 records as one and parse them manually in the code.
I read about blobstore but I understand it's more about uploading files rather than database and querying. Should I use it. I want to keep the application in the free tier.

If you need to show the same records to all users, keep them in Memcache - or even in your instance memory.

Check out Objectify if you're using Java. It has first and second level cache (second level uses Memecache as Andrei recommended). Objectify will help you avoid repeated trips to the datastore--it all happens out of the box with no re-coding on your part. Just read about the #Cache annotation for entity objects as well as the Objectify.cache(true) method.

Related

Google Cloud Datastore queries too slow when fetching all records

I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
And there are only about 1400 records and yet the query takes 500ms-1.2 sec to give back the data. Another query on a different entity also takes 300-400 ms just for 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to user. Gmail doesn't show you 1000 emails, it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.

What's the maximum resultset size in Google Cloud SQL? If there is how can it be increased?

I have an issue with trying to read data from a google cloud sql database and write it to a google app engine datastore. The read of the database looks like it is not returning all the values it should when the value of entries retrieved is around 210, as i've checked that if it retrieves 189 entries or lower it's able to write all retrieved entries properly, and if it's higher I've achieved it to write at most 209 entries un datastore, although sometimes, it writes less than those ones.
I understand than if million of rows with thousands of columns were returned, yeah, of course, a limit should be stablished in memory usage of the app, but it's not even 300 entries with 5 columns, I've checked how much it would occupy in plain text and it doesn't even reach the 15K, so I think it should be able to deal with this with no problems.
Anyy idea of what might be happening?

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. EG: Could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since salesforce has upwards of a dozen server clusters, each with their own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But, in the past I've seen performance issues start to creep in around 2M records. One possible remedy is you can ask salesforce to index fields that you plan to filter on.

How does google appengine measure datastore put operations

With the appengine pricing changes, we've been paying attention to our datastore puts. According to the pricing comparison chart we're making 2.18 million puts a day. This seems a lot higher than expected. We receive about 0.6 queries per second which means that each request is making about 60 puts!!
Using the sample code for db profiling http://code.google.com/appengine/articles/hooks.html
we measured this for a day and the most we counted was ~14,000 which seems more reasonable. Does anyone have experience with something similar on their site?
The discrepancy you're seeing is because every index write is counted separately. When you do a datastore put, you're charged for the number of rows that have to be modified, so if you modified a single indexed field, you'd expect to be charged for:
One write for the entity itself
Two writes for the ascending index for the modified property
Two writes for the descending index for the modified property
For a total of 5 writes. As you can see, setting properties to indexed=False can have a big impact on your quota usage here.

Insert thousands entities in a reasonnable time into BigTable

I'm having some issues when I try to insert the 36k french cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery
def add_cities():
spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
mylist = []
for i in spamReader:
region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
db.put(mylist)
It's taking around 5 minutes (!!!) to do it with the local dev server, even 10 when deleting them with db.delete() function.
When I try it online calling a test.py page containing add_cities(), the 30s timeout is reached.
I'm coming from the MySQL world and I think it's a real shame not to add 36k entities in less than a second. I can be wrong in the way to do it, so I'm refering to you:
Why is it so slow ?
Is there any way to do it in a reasonnable time ?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There's two things you can do to speed things up:
Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.
FWIW we process large CSV's into datastore using mapreduce, with some initial handling/ validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
Make sure if you're doing inserts,etc. you batch as much as possible - don't insert individual records, and same for lookups - get_by_keyname allows you to pass in an array of keys. (I believe db put has a limit of 200 records at the moment?)
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!

Resources