Optimizing tasks to reduce CPU in a trading application - google-app-engine

I have designed a trading application that handles customers stocks investment portfolio.
I am using two datastore kinds:
Stocks - Contains unique stock name and its daily percent change.
UserTransactions - Contains information regarding a specific purchase of a stock made by a user : the value of the purchase along with a reference to Stock for the current purchase.
db.Model python modules:
class Stocks (db.Model):
stockname = db.StringProperty(multiline=True)
dailyPercentChange=db.FloatProperty(default=1.0)
class UserTransactions (db.Model):
buyer = db.UserProperty()
value=db.FloatProperty()
stockref = db.ReferenceProperty(Stocks)
Once an hour I need to update the database: update the daily percent change in Stocks and then update the value of all entities in UserTransactions that refer to that stock.
The following python module iterates over all the stocks, update the dailyPercentChange property, and invoke a task to go over all UserTransactions entities which refer to the stock and update their value:
Stocks.py
# Iterate over all stocks in datastore
for stock in Stocks.all():
# update daily percent change in datastore
db.run_in_transaction(updateStockTxn, stock.key())
# create a task to update all user transactions entities referring to this stock
taskqueue.add(url='/task', params={'stock_key': str(stock.key(), 'value' : self.request.get ('some_val_for_stock') })
def updateStockTxn(stock_key):
#fetch the stock again - necessary to avoid concurrency updates
stock = db.get(stock_key)
stock.dailyPercentChange= data.get('some_val_for_stock') # I get this value from outside
... some more calculations here ...
stock.put()
Task.py (/task)
# Amount of transaction per task
amountPerCall=10
stock=db.get(self.request.get("stock_key"))
# Get all user transactions which point to current stock
user_transaction_query=stock.usertransactions_set
cursor=self.request.get("cursor")
if cursor:
user_transaction_query.with_cursor(cursor)
# Spawn another task if more than 10 transactions are in datastore
transactions = user_transaction_query.fetch(amountPerCall)
if len(transactions)==amountPerCall:
taskqueue.add(url='/task', params={'stock_key': str(stock.key(), 'value' : self.request.get ('some_val_for_stock'), 'cursor': user_transaction_query.cursor() })
# Iterate over all transaction pointing to stock and update their value
for transaction in transactions:
db.run_in_transaction(updateUserTransactionTxn, transaction.key())
def updateUserTransactionTxn(transaction_key):
#fetch the transaction again - necessary to avoid concurrency updates
transaction = db.get(transaction_key)
transaction.value= transaction.value* self.request.get ('some_val_for_stock')
db.put(transaction)
The problem:
Currently the system works great, but the problem is that it is not scaling well… I have around 100 Stocks with 300 User Transactions, and I run the update every hour. In the dashboard, I see that the task.py takes around 65% of the CPU (Stock.py takes around 20%-30%) and I am using almost all of the 6.5 free CPU hours given to me by app engine. I have no problem to enable billing and pay for additional CPU, but the problem is the scaling of the system… Using 6.5 CPU hours for 100 stocks is very poor.
I was wondering, given the requirements of the system as mentioned above, if there is a better and more efficient implementation (or just a small change that can help with the current implemntation) than the one presented here.
Thanks!!
Joel

There are several obvious improvements to be made:
You should use a keys_only query in the first snippet: since you don't actually refer to the properties of the stock object at any point, there's no point in retrieving it. You may as well retrieve only the key.
You can add tasks in bulk using the Queue object's .add method, documented here. This is more efficient than adding tasks individually.
Your tasks chain new ones every 10 transactions, but tasks can run for up to 10 minutes, and 10 datastore transactions are likely to take no more than a second or two. Instead, set a timer at the beginning of your request, and check it each time around the loop, aborting and chaining the next task when you get close to the 10 minute limit.
If you expect to iterate over a large number of entities, use .fetch and cursors, rather than iterating; iterating fetches in small batches of 20 entities.
In the individual entity update, you're again doing a regular query, but only using the key. Do a keys_only query instead.
Is the task the only thing that will update UserTransaction entities after they're originally written? If so, you can skip the transaction and update them in batches.
Finally, I'd suggest an overall refactoring: instead of starting a new task for each stock, run the outer loop inside a task, with the timer mentioned above. When you chain the next task, use cursors to pass the current state and pick up where you left off.
The only other thing to consider is if there's some way you can restructure your data to avoid the need for so many updates. Can you, for instance, make the UserTransaction entities reference some value in the Stock entities, so that you can calculate their actual value at runtime, and you only need to update the single Stock entity with the change?

Related

how to use Google Cloud Memcach to save/update unique items

In my application I run a cron job to loop over all users (2500 user) to choose an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user take a unique item that wasn't taken by any one else, so relation is one-to-one
To achieve this I have to run this cron job and loop over the users one by one sequentially and pick up the item for each then remove it from the list (not to be chosen by next user(s)) then move to the next user
actually in my system the number of users/items is getting bigger and bigger every single day, this cron job now takes 2 hours to set items to all users.
I need to improve this, one of the things I've thought about is using Threads but I cant do that since Im using automatic scaling, so I start thinking about push Queues, so when the cron jobs run, will make a loop like this:
for(User user : users){
getMyItem(user.getId());
}
where getMyItem will push the task to a servlet to handle it and choose the best item for this person based on his data.
Let's say I'll start doing that so what will be the best/robust solution to avoid setting an item to more than one user ?
Since Im using basic scaling and 8 instances, can't rely on static variables.
one of the things that came across my mind is to create a table in the DB that accept only unique items then I insert into it the taken items so if the insertion is done successfully it means no body else took this item so i can just assign it to that person, but this will make the performance a bit lower cause I need to make write DB operation with every call (I want to avoid that)
Also I thought about MemCach, its really fast but not robust enough, if I save a Set of items into it which will accept only unique items, then if more than one thread was trying to access this Set at the same time to update it, only one thread will be able to save its data and all other threads data might be overwritten and lost.
I hope you guys can help to find a solution for this problem, thanks in advance :)
First - I would advice against using solely memcache for such algorithm - the key thing to remember about memcache is that it is volatile and might dissapear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the
cache's LRU policy. Changes in the cache configuration or datacenter
maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache
prior to expiration for reasons other than memory pressure. While
memcache is resilient to server failures, memcache values are not
saved to disk, so a service failure can cause values to become
unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
to query for unassigned items when you want to make assignments
to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
to be certain that an item can uniquely be assigned to only a single user
to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
To get a general idea of such a solution works take a peek at Google appengine: Task queue performance (it's python, but the general idea is the same).
If you are planning for push jobs inside a cron and you want the jobs to be updating key-value pairs as an addon to improvise the speed and performance, we can split the number of users and number of items into multiple key-(list of values) pairs so that our push jobs will pick the key random ( logic to write to pick a key out of 4 or 5 keys) and then remove an item from the list of items and update the key again, try to have a locking before working on the above part. Example of key value paris.
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]

GAE Datastore Concurrently update a record's field

Theoretical scenario: I have an ecomerce website that keeps an inventory in the GAE Datastore. Each record contains the name of the item and a count. This store has very little traffic and a large inventory so it is unlikely that multiple customers purchase the same item at the same time; this is here so that we do not focus on the datastore limitation about updating the same record several times per second for a sustained period of time.
How do I correctly update the record for an item in the unlikely event that two customers purchase the same item at the same time? It is crucial that we keep the right count for each item.
In the mongo world I would use something like $inc operation. How would you address this scenario for GAE Datastore?
Thanks
Cloud Datastore transactions use optimistic concurrency, so you can attempt 2 transactions at the same time. If the entity changes, one will fail - at which point you simply retry the transaction that failed.
For example, in the Python NDB you can specific number of retry attempts in the transaction decorator.
Example code:
from google.appengine.ext import ndb
...
#ndb.transactional(retries=5)
def inc_item_count(item_key, count_purchased):
item = item_key.get()
item.count_purchased += count_purchased
item.put()
return True
Web scale?
If you scale beyond where this solution work, you'll want to either explore a Taskqueue based system, or move towards sharding counters the per-item counters.

Best way to access averaged static data in a Database (Hibernate, Postgres)

Currently I have a project (written in Java) that reads sensor output from a micro controller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns worth of data every second. Once the data is written it will stay static forever.This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful to not specify too long of a time period (startTimem and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google visualization API that powers Google Finance. With regards to time scaling, i.e. the longer the time period the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for say half a year. I imagine the interface in this scenario would scale the amount of rows to average based on the user request. I.E. if the user asks for a month of data the interface will request an avg of every 86400 rows which would return ~30 data points whereas if the user asks for a day of data the interface will request an avg of every 2880 rows which will also return 30 data points but of more granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do is to periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views to show summary data by using the summary tables which have been created, combined with detail table where the summary table doesn't exist.
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Database design - google app engine

I am working with google app engine and using the low leval java api to access Big Table. I'm building a SAAS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name: Price: Commission: Time estimate
Full Detail, Regular Size: 160 75 3.5 hours
$10 Off Full Detail Coupon: -10 0 0 hours
Premium Detail: 220 110 4.5 hours
Derived totals(not a line item): $370 $185 8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through resulting in the detailer had to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, i'd just make another appointment on the second day with a line item that said "Finish Up" and the cost would be $0. Or if I had one customer pay for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data in-congruency. Data in-congruency can be a serious problem especially for cases involving financial information such as the third exmaple where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below, is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data incongruity errors. With this structure, changes to the data can be done with one operation without having to worry about updating other tables. Reads, however, can require multiple reads coupled with in-memory organization of data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task que so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
start_time
etc...
Invoice
due_date
etc...
Payment
invoice_Key_List
amount_paid
etc...
Line_Item
appointment_Key_List
invoice_Key
name
price
etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment and weather or not payment as been received for each appointment. This would be a common query when loading the calendar for appointment scheduling or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" who's "start_time" field lies between the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" who's appointment_key_List field includes any of the returns appointments
Add each invoice_key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice ket set (this can be done in one asynchronous operation using app engine)
Add each key from the returned invoices into a List
QUERY for all "Payments" who's invoice_key_list field contains a key matching any of the returned invoices
Reorganize in memory so that each appointment reflects the line_items that are scheduled for it, the total price, total estimated time, and weather or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory will be pretty fast)
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive, some are more write intensive. Below, I describe a typical use-case and break down operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries customer for their information, I pictured this to be a succession of asynchronous reads as the manager enters each piece of information such as phone number, name, e-mail, address, etc... Or if necessary, perhaps one write at the end after the client application has gathered all of the information and it is then submitted.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read Manager loads the calendar
Read Manager loads the appointment for the customer he wants to call
Write Manager clicks "Call" button, a call is initiated and a new CallReacord entity is written
Read Call server responds to call request and reads CallRecord to find out how to handle the call
Write Call server writes updated information to the CallRecord
Write when call is closed, call server makes another request to the server to update the CallRecord resource (note: this request is not time-critical)
Accepted answer::
Both of the top two answers were very thoughtful and appreciated. I accepted the one with few votes in order to imperfectly equalize their exposure as much as possible.
You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of itself isn't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each of appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1 then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 are small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need to it to scale to huge numbers of users (it won't).
Suggestions for improvement:
Denormalization! Try storing the keys for Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListProperty are not indexed to avoid problems with exploding indices
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
So since you should optimize for reads, oftentimes your writes will grow to gigantic proportions. So gigantic that you can't fit it in the 30 second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task queue to pull it back out, generate the HTML to be rendered, and put it in there in the background. This might mean your model is immediately ready to display until the task has finished with it, so you'll need a graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number
Here are a few app-engine specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. for example, if you are filtering on an appt date being between July 1st and July 4th, you couldn't also filter by price > 200
Transactions on app engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".

How to get the number of rows in a table in a Datastore?

In many cases, it could be useful to know the number of rows in a table (a kind) in a datastore using Google Application Engine. There is not clear and fast solution . At least I have not found one.. Have you?
You can efficiently get a count of all entities of a particular kind (i.e., number of rows in a table) using the Datastore Statistics. Simple example:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat().all().filter("kind_name =", "NameOfYourModel").get()
count = kind_stats.count
You can find a more detailed example of how to get the latest stats here (GAE may keep multiple copies of the stats - one for 5min ago, one for 30min ago, etc.).
Note that these statistics aren't constantly updated so they lag a little behind the actual counts. If you really need the actual count, then you could track counts in your own custom stats table and update it every time you create/delete an entity (though this will be quite a bit more expensive to do).
Update 03-08-2015: Using the Datastore Statistics can lead to stale results. If that's not an option, another two methods are keeping a counter or sharding counters. (You can read more about those here). Only look at these 2 if you need real-time results.
There's no concept of "Select count(*)" in App Engine. You'll need to do one of the following:
Do a "keys-only" (index traversal) of the Entities you want at query time and count them one by one. This has the cost of slow reads.
Update counts at write time - this has the benefit of extremely fast reads at a greater cost per write/update. Cost: you have to know what you want to count ahead of time. You'll pay a higher cost at write time.
Update all counts asynchronously using Task Queues, cron jobs or the new Mapper API. This has the tradeoff of being semi-fresh.
You can count no. of rows in Google App Engine using com.google.appengine.api.datastore.Query as follow:
int count;
Query qry=new Query("EmpEntity");
DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
count=datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());

Resources