What counts towards the GAE datastore quotas? - google-app-engine

The docs say that there are 50,000 free Read, Write, and Small Operations to the datastore.
I guess read and write are obvious, but what falls under the "small ops" category? If I ran a query would it be a small or read operation?

Here's the documentation on this. As I understand it, keys-only queries and key/id allocation are "small operations," so fetching just keys or generating new ids depletes the small operations quota, while fetching full entities depletes the read quota.
So a query can touch either quota depending on its form: a regular query costs 1 read for the query plus 1 read per entity retrieved, while a keys-only query costs 1 read for the query plus 1 small operation per key retrieved.
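As a rough illustration with the Python ndb API (the model here is just a placeholder):
from google.appengine.ext import ndb

class Greeting(ndb.Model):          # placeholder kind, only for illustration
    content = ndb.StringProperty()

# Regular query: billed against the read quota, one read per entity retrieved.
entities = Greeting.query().fetch(100)

# Keys-only query: the keys come straight from the index, so it is billed
# against the small operations quota instead of the read quota.
keys = Greeting.query().fetch(100, keys_only=True)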

Related

Choosing proper database in AWS when all items must be read from the table

I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins, so each customer has a balance attribute. The balance is managed by a 3rd party service that keeps the up-to-date value, and the balance attribute in my table is just a cached version of it. The 3rd party service requires its own id of the customer as input, so the customers table also contains this externalId attribute, which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with the balance greater than some specified constant value. They need to be sorted by the balance.
Perform some processing for all of the customers - the processing must be performed in proper order - starting from the customer with the greatest balance in descending order (by balance).
Question: which database is the most suitable one for this use case?
My analysis:
In terms of costs it looks to be quite similar, i.e. paying for capacity units in the case of DynamoDB vs paying for hours of micro instances in the case of RDS. I'm not sure though whether a micro RDS instance is enough for this purpose - I'm going to check, but I guess it should be enough.
In terms of performance - I'm not sure here. It's something I will need to check but wanted to ask you here beforehand. Some analysis from my side:
In the case of DynamoDB it involves two scan operations, which is something I really don't want to have. The first scan can be limited to the externalId attribute; then the balances are queried from the 3rd party service and updated in the table. The second scan requires a range key defined on the balance attribute to return customers sorted by balance.
I'm not convinced that any kind of index can help here. Basically, there won't be too many read operations on the balance - sometimes it will need to be queried for a single customer using its primary key. The number of reads won't be much greater than the number of writes, so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500 000 customers in the database, the average size of a single customer is 200 bytes. So the total size of the customers in the database is 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day) but the necessity to retrieve sorted data is only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4KB blocks. If doing eventually consistent reads that's 12,500 read units. If we assume the cost is $0.25 per million (On Demand mode) that's 12,500/1,000,000*$0.25 = $0.003 per full table scan. Want to do it 30 times per day? Costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which if in On Demand at $1.25 per million will be about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
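As a quick sanity check of that arithmetic (a rough sketch, using the On Demand prices quoted above):
TABLE_MB = 100
blocks = TABLE_MB * 1000 / 4          # ~25,000 4 KB read units for a full scan
ec_reads = blocks / 2                 # eventually consistent reads cost half -> 12,500
scan_cost = ec_reads / 1e6 * 0.25     # ~$0.003 per full table scan
update_cost = 500000 / 1e6 * 1.25     # ~$0.63 to rewrite all 500,000 items once
print("scan: $%.4f per full scan, update: $%.3f per full update" % (scan_cost, update_cost))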
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.
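A minimal parallel-scan sketch with boto3 (the table name and segment count are illustrative, not taken from the question):
import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('customers')
SEGMENTS = 8

def scan_segment(segment):
    # each worker scans its own slice of the table in 1 MB pages
    items, kwargs = [], {'Segment': segment, 'TotalSegments': SEGMENTS}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp['Items'])
        if 'LastEvaluatedKey' not in resp:
            return items
        kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=SEGMENTS) as pool:
    all_items = [item for chunk in pool.map(scan_segment, range(SEGMENTS)) for item in chunk]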

Limits on the amount of data in Google App Engine

I'm writing a Google App Engine database that, once it goes live, will probably hold over 10 million records, with fairly constant queries, insertions and deletions.
Is this much data going to be a problem? I'm not worried about the cost ($$$) just the performance of the database. The queries will be based on two fields that are both StringProperty and return less than 100 records.
The database has two 'tables'; the one that will get most of the queries has records of around 100 bytes. The larger table won't get as many queries (maybe 1/10th as many as the small table) and its records are around 30 KB each.
Are deletions an expensive operation? Is it better not to delete old records and just mark them as deleted, and maybe delete them in bulk in a cron job?
I am aware of the distributed nature of Google App Engine and replication and those issues won't be a problem.
10 million records is not a big amount for the datastore, so you don't have to worry, as long as your queries can take advantage of indexes. For instance, if you have to walk a larger data set 100 records at a time, instead of saying that you want to start from a certain position in the dataset, you can remember the last ORDER BY field value at the end of the page and ask for elements coming after it (WHERE field > '...', assuming ascending order).
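A rough ndb sketch of that pattern (model and property names are placeholders):
from google.appengine.ext import ndb

class Item(ndb.Model):              # placeholder model for illustration
    field = ndb.StringProperty()

def fetch_next_page(last_value=None, page_size=100):
    # ask only for entities that sort after the last value seen on the
    # previous page, so every page is a cheap indexed query
    q = Item.query().order(Item.field)
    if last_value is not None:
        q = Item.query(Item.field > last_value).order(Item.field)
    return q.fetch(page_size)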
You can use task queues instead of cron jobs to do deletions; it all depends how fast you want to get back to the user. Datastore operations tend to be slow, but if it's just one record to delete, it could be acceptable. However, if you have to do multiple operations it can get really slow, so it's better to execute these kinds of tasks in a task queue and keep the application responsive.
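If you go the mark-as-deleted route, a hedged sketch of a chained task-queue purge using the deferred library (the model and its flag are assumptions):
from google.appengine.ext import deferred, ndb

class Record(ndb.Model):            # placeholder model with a soft-delete flag
    deleted = ndb.BooleanProperty(default=False)

def purge_deleted(cursor=None, batch_size=500):
    # delete one batch of marked entities per task, chaining a new task until done
    keys, next_cursor, more = Record.query(Record.deleted == True).fetch_page(
        batch_size, keys_only=True, start_cursor=cursor)
    ndb.delete_multi(keys)
    if more:
        deferred.defer(purge_deleted, next_cursor, batch_size)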
Datastore records can't exceed 1 MB; 30 KB is a big record size, but it shouldn't cause any problems. Remember that only short strings (500 characters or fewer) can be indexed.

Appengine: help needed in designing my entities to be more scalable yet inexpensive

As an example, consider an online survey site.
Entities:
Survey (created with questions, answers)
Respondent (takes surveys in parallel; huge in number), with attributes:
    survey id
    List of (question id, answer id)
Problem:
Need to get a summary of responses, i.e. for any particular survey and question, the number of respondents who chose answer 1 vs 2 vs 3 (say).
The summary should be retrieved cheaply, i.e. with as few calls as possible.
This code is just to help you get more understanding:
Survey sampleSurvey = ...
// get all respondents of the above survey
List<Respondent> respondents = getAllRespondents(sampleSurvey);

// update the summary for each chosen question/answer pair
for each respondent in respondents:
    List<QuestionAnswer> qa = respondent.getChosenAnswers()
    for each (question, answer) in qa:
        // increment the corresponding answer count by 1
        sampleSurvey.updateSummary(question.getId(), answer.getId())
// summary update done

// process the summary
Summary summary = sampleSurvey.getSummary();
for each available (question, answer):
    print 'No. of respondents who chose answer %s for question %s: %s' % (answer.text(), question.text(), answer.count())
My thoughts:
a. Create a summary entity for each survey and update the counters inside it as each respondent takes the survey (Q1 -> A1 -> 4, Q1 -> A2 -> 222, ...).
Pros: get summary by reading just 1 entity; cheap
Cons: since a huge number of respondents take the same survey in parallel, there will be datastore contention; is sharding a solution? The number of shards would have to scale with the number of respondents per survey.
b. Query the count against indexes. With my limited knowledge of App Engine indexing, I don't know how the index will be formed for the Respondent entity above or how large it will be. I'm also worried about the number of extra writes needed for indexes; could index explosion happen?
The query would be something like:
select count(*) from Respondent where surveyId=xx and questId=yy and ansId=zz
Are there other, better solutions? And what about the above - which one would you recommend, and why? Thanks a lot for looking and for your suggestions. Ping me if something is unclear.
I think this depends on two main factors:
Do you know the queries you'll want to run ahead-of-time? (i.e. while the respondents are answering, as opposed to slicing up the data later.)
How many respondents do you expect? (Both # and rate.)
If you don't know the queries ahead-of-time, then I think the best you can do is fetch all the entities and compute the information you need. (And then cache that, perhaps, in another entity or memcache or both.) If you have a lot of respondents, you might have to do this computation on a backend or via a task queue to avoid hitting the request timeout/quotas. If you have a truly enormous amount of data, you might even consider Mapreduce, which is currently experimental and Python-only.
If you do know the queries ahead-of-time, then I think your approach of a single entity is on the right track. You can use a standard sharded counter technique to reduce contention if you expect more than about one write per second. If you don't expect more than that, you can just use a single entity group.
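A minimal sharded-counter sketch in Python ndb (the entity and key layout are assumptions, and NUM_SHARDS would be tuned to your expected write rate):
import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class AnswerCountShard(ndb.Model):
    # one shard of the counter; the key name encodes survey, question, answer and shard index
    count = ndb.IntegerProperty(default=0)

def _shard_key(survey_id, question_id, answer_id, index):
    return ndb.Key(AnswerCountShard, '%s:%s:%s:%d' % (survey_id, question_id, answer_id, index))

@ndb.transactional
def increment(survey_id, question_id, answer_id):
    # pick a random shard so concurrent respondents rarely contend on the same entity group
    key = _shard_key(survey_id, question_id, answer_id, random.randint(0, NUM_SHARDS - 1))
    counter = key.get() or AnswerCountShard(key=key)
    counter.count += 1
    counter.put()

def total(survey_id, question_id, answer_id):
    keys = [_shard_key(survey_id, question_id, answer_id, i) for i in range(NUM_SHARDS)]
    return sum(c.count for c in ndb.get_multi(keys) if c)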
If you only expect around one write per second, with possible spikes above this, another option would be to use a single entity but use a task in a task queue to update its counter asynchronously; you can throttle the task queue rate to reduce contention, so long as you won't be creating tasks faster than it can complete them. This might be easier to write, especially if you have lots of statistics to compute, though I think the sharded counters technique above is ultimately more scalable.
Updating a summary with every write is not practical, since you'll quickly run into contention issues; counting the results dynamically will be extremely inefficient. In this case, you're better off computing aggregates using a batch process such as mapreduce - just write a task that scans over all the survey answers and accumulates the relevant statistics, and run this task periodically.
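A cursor-driven batch pass along those lines (a simpler stand-in for the mapreduce library; every model and property name here is an assumption):
from google.appengine.ext import ndb

class Respondent(ndb.Model):        # assumed shape of the respondent entity
    survey = ndb.KeyProperty()
    answers = ndb.StringProperty(repeated=True)   # e.g. "q1:a2" pairs

class Summary(ndb.Model):           # one summary entity per survey
    counts = ndb.JsonProperty()

def aggregate_survey(survey_key, batch_size=500):
    # walk all respondents of one survey in pages and write the summary once at the end
    counts = {}
    cursor, more = None, True
    while more:
        page, cursor, more = Respondent.query(
            Respondent.survey == survey_key).fetch_page(batch_size, start_cursor=cursor)
        for r in page:
            for qa in r.answers:
                counts[qa] = counts.get(qa, 0) + 1
    Summary(id=survey_key.id(), counts=counts).put()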

How to get the number of rows in a table in a Datastore?

In many cases, it could be useful to know the number of rows in a table (a kind) in a datastore using Google App Engine. There is no clear and fast solution. At least I have not found one. Have you?
You can efficiently get a count of all entities of a particular kind (i.e., number of rows in a table) using the Datastore Statistics. Simple example:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat().all().filter("kind_name =", "NameOfYourModel").get()
count = kind_stats.count
You can find a more detailed example of how to get the latest stats here (GAE may keep multiple copies of the stats - one for 5min ago, one for 30min ago, etc.).
Note that these statistics aren't constantly updated so they lag a little behind the actual counts. If you really need the actual count, then you could track counts in your own custom stats table and update it every time you create/delete an entity (though this will be quite a bit more expensive to do).
Update 03-08-2015: Using the Datastore Statistics can lead to stale results. If that's not an option, two other methods are keeping a counter or sharded counters. (You can read more about those here.) Only look at these two if you need real-time results.
There's no concept of "Select count(*)" in App Engine. You'll need to do one of the following:
Do a "keys-only" (index traversal) of the Entities you want at query time and count them one by one. This has the cost of slow reads.
Update counts at write time - this has the benefit of extremely fast reads at a greater cost per write/update. Cost: you have to know what you want to count ahead of time. You'll pay a higher cost at write time.
Update all counts asynchronously using Task Queues, cron jobs or the new Mapper API. This has the tradeoff of being semi-fresh.
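For the keys-only option, a minimal Python sketch (the kind name is a placeholder); ndb's count() walks the index without fetching entities, so it should be billed largely as small operations rather than reads:
from google.appengine.ext import ndb

class MyModel(ndb.Model):           # placeholder kind to count
    pass

# count every entity of the kind via a keys-only index traversal
total = MyModel.query().count(limit=None)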
You can count the number of rows in Google App Engine using com.google.appengine.api.datastore.Query as follows:
import com.google.appengine.api.datastore.*;

// prepare a query for the kind and count matching entities via an index scan
DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
Query qry = new Query("EmpEntity");
int count = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0 GB), I get 60 to 70 writes per millisecond. After, say, 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to an extreme in database tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
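A minimal sketch of the flat-file idea, assuming fixed-size records of (timestamp, two int readings); the layout and index format are illustrative only:
import struct

RECORD = struct.Struct('<dii')      # assumed layout: 8-byte timestamp + two 4-byte ints

def append_record(data_file, index_file, last_indexed_minute, ts, a, b):
    # append the record to the flat data file
    offset = data_file.tell()
    data_file.write(RECORD.pack(ts, a, b))
    # once per minute, note the byte offset where that minute starts so a
    # range read can seek straight to the start of its time window
    minute = int(ts // 60)
    if minute != last_indexed_minute:
        index_file.write('%d %d\n' % (minute, offset))
    return minute

A range read then looks up the offset recorded for the first minute of the window, seeks there, and scans forward until the timestamps pass the end of the window.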
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of indices applied also influences insertion performance - for example, with SQL Server a clustered index update's I/O is used for data distribution as well as index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (data skew, page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
