I am building a demo for a banking application in App Engine.
I have a Users table and Stocks table.
In order for me to be able to list the "Top Earners" in the application, I save a "Total Amount" field in each User's entry so I will later be able to SELECT it with ORDER BY.
I am running a cron job that runs over the Stocks table and updates each user's "Total Amount" in the Users table. The problem is that I often get timeouts, since the Stocks table is pretty big.
Is there any way to overcome the time limit restriction in App Engine, or is there a workaround for this kind of update (where you MUST select many entries from a table, which results in a timeout)?
Joel
The usual way is to split the job into smaller tasks using the task queue.
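To make the idea concrete, here is a minimal sketch in plain Python. It does not use the real App Engine task queue: the `STOCKS` list, the `deque` standing in for the queue, and the function names are all assumptions for illustration. The point is only that each task touches a small slice of rows, so no single request has to read the whole table.

```python
from collections import defaultdict, deque

# Hypothetical stand-in for the Stocks table: (user_id, amount) rows.
STOCKS = [("alice", 100.0), ("bob", 50.0), ("alice", 25.0),
          ("carol", 10.0), ("bob", 5.0)]


def make_tasks(row_count, batch_size):
    """Split the row range into (start, end) slices, one slice per task."""
    return deque((start, min(start + batch_size, row_count))
                 for start in range(0, row_count, batch_size))


def run_task(rows, start, end, totals):
    """One small task: add up only the rows in its own slice."""
    for user_id, amount in rows[start:end]:
        totals[user_id] += amount


def run_all(rows, batch_size):
    """Drain the stand-in queue one task at a time, as a worker would."""
    totals = defaultdict(float)
    tasks = make_tasks(len(rows), batch_size)
    while tasks:
        start, end = tasks.popleft()
        run_task(rows, start, end, totals)
    return dict(totals)


totals = run_all(STOCKS, batch_size=2)
top_earners = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

In a real app the cron handler would only enqueue the slices, and each task would write its partial sums back to the Users entities.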
You have several options; all of them involve some form of background processing.
One choice would be to use your cron job to kick off a task which starts as many tasks as needed to summarize your data. Another choice would be to use one of Brett Slatkin's patterns and keep the data updated in (nearly) realtime. Check out his high performance data-pipelines talk for details.
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
You could also check out the Mapper API (App Engine MapReduce) and see if it can do what you need.
http://code.google.com/p/appengine-mapreduce/
Related
In our application we currently use DynamoDB to store the notification details. A scheduler runs twice a day and queries by "notificationType" (pk -> notificationType, sk -> userId).
Each item has a timestamp attribute; if the timestamp is more than the current time, the scheduler sends a trigger (there is more business logic: for some records a mail needs to be sent one day after the timestamp). Once the user performs the activity for which the notification was sent, we delete the entry.
My concern is that if the data for a notificationType grows large, retrieving all of it is wasteful, because for some records no notification is going to be sent. More read capacity is used, which could increase the cost later on.
In this case, would it be wise to stay with DynamoDB or to move to another database such as MongoDB or Cassandra?
Note: My primary concern is the cost
Another option is to use a workflow engine that can model the notification process per user instead of a batch job. This way you can avoid scanning large amounts of data as the engine would rely on durable timers to execute actions at the appropriate time.
My open-source project temporal.io, which I led at Uber, is used by multiple companies for notification-like scenarios and was tested with up to 200 million open parallel workflows.
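As a rough illustration of the timer idea (not of any workflow engine's actual API), the sketch below keeps one timer per user in a heap, so each run only touches entries that are already due instead of scanning every record. `TimerQueue` and its methods are hypothetical names, and the structure is in-memory only, so it is not durable the way a real engine's timers would be.

```python
import heapq


class TimerQueue:
    """In-memory stand-in for per-user notification timers."""

    def __init__(self):
        self._heap = []          # (due_time, user_id), smallest due_time first
        self._cancelled = set()  # users who already performed the activity

    def schedule(self, user_id, due_time):
        heapq.heappush(self._heap, (due_time, user_id))

    def cancel(self, user_id):
        # Called when the user performs the activity; the timer is skipped later.
        self._cancelled.add(user_id)

    def pop_due(self, now):
        """Return users whose timers have fired, looking only at due entries."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, user_id = heapq.heappop(self._heap)
            if user_id in self._cancelled:
                self._cancelled.discard(user_id)
            else:
                due.append(user_id)
        return due


timers = TimerQueue()
timers.schedule("u1", due_time=10)
timers.schedule("u2", due_time=20)
timers.schedule("u3", due_time=5)
timers.cancel("u3")              # u3 acted before the timer fired
first_run = timers.pop_due(now=10)
second_run = timers.pop_due(now=25)
```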
I'm starting to build a bulk upload tool and I'm trying to work out how to accomplish one of the requirements.
The idea is that a user will upload a CSV file and the tool will parse it and send each row of the CSV to the task queue as a task to be run. Then once all the tasks (relating to that specific CSV file) are completed, a summary report will be sent to the user.
I'm using Google App Engine and in the past I've used the standard Task Queue to handle tasks. However, with the standard Task Queue there is no way of knowing when the queue has finished; no event is fired to trigger the report generation, so I'm not sure how to achieve this.
I've looked into it more and I understand that Google also offers Google PubSub. This is more sophisticated and seems more suited, but I still can't find out how to trigger an event when a PubSub queue is finished. Any ideas?
It seems that you could use a counter for this. Create an entity with an Integer property that is set to the number of lines in the CSV file. Each task decrements the counter in a transaction when it finishes processing its row. One task will bring the counter down to 0, and that task can trigger the event. This might cause too much contention, though.
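A minimal sketch of the counter idea, in plain Python rather than against the datastore: a `threading.Lock` plays the role of the transaction, and `RemainingCounter`, `process_row` and `on_complete` are hypothetical names. Whichever worker sees the counter reach 0 fires the report.

```python
import threading


class RemainingCounter:
    """Stand-in for a datastore entity holding the number of unprocessed rows."""

    def __init__(self, total_rows):
        self._remaining = total_rows
        self._lock = threading.Lock()  # plays the role of the transaction

    def decrement(self):
        """Atomically decrement and return the new value."""
        with self._lock:
            self._remaining -= 1
            return self._remaining


def process_row(row, counter, on_complete):
    # Real row processing would happen here.
    if counter.decrement() == 0:
        on_complete()  # only the task that reaches 0 triggers the report


reports = []
rows = ["a,1", "b,2", "c,3"]
counter = RemainingCounter(len(rows))
threads = [
    threading.Thread(target=process_row,
                     args=(row, counter, lambda: reports.append("summary sent")))
    for row in rows
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

One caveat with this design: if a task is retried after it has already decremented, the count will be off, so the decrement would need to be made idempotent in practice.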
Another possibility could be to have each task create an entity of a specific kind when it finishes processing a row. You can then count the number of these entities to determine when all the rows have been processed.
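This variant can be sketched the same way. Here a Python `set` stands in for the marker entities, keyed by row number (the class and method names are made up for illustration). Because writing the same key twice changes nothing, a retried task does not distort the count.

```python
class CompletionMarkers:
    """Stand-in for 'one marker entity per finished row' of a single upload."""

    def __init__(self, total_rows):
        self.total_rows = total_rows
        self._done = set()  # each element represents one marker entity

    def mark_done(self, row_number):
        # Writing the same marker twice is harmless, so task retries are safe.
        self._done.add(row_number)

    def all_done(self):
        return len(self._done) == self.total_rows


markers = CompletionMarkers(total_rows=3)
markers.mark_done(0)
markers.mark_done(1)
markers.mark_done(1)   # simulated retry of the same task
finished_early = markers.all_done()
markers.mark_done(2)
finished = markers.all_done()
```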
It might be easier to use the GAE Pipeline API, which takes care of this as a basic part of its functionality.
There's a nice article explaining it a bit here.
And a related SO question which happens to mention the same reason for moving to this API and has an excellent answer: Google AppEngine Pipelines API
I haven't used it myself yet, but it's just a matter of time :)
It's also possible to implement a scheme to track the related tasks still being active, see Figure out group of tasks completion time using TaskQueue and Datastore.
You can also check the queue (approximate) status, see Get number of tasks in a named queue?
I faced a similar problem earlier this week and managed to find a nice workaround for it. I created an extra column in the table that the task inserts data into. Once a specific task is completed, it updates this 'task_status' column to 'done'; otherwise the column is left as the default NULL. Then, when the user refreshes the page, goes to a specific URL, or you make an AJAX call to query the task status for a specific id in your table, you can see whether it is complete or not.
select * from table where task_status is not null and id = ?;
You can also create a separate 'tasks' table to store the relevant columns instead of modifying existing tables.
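Here is a small runnable sketch of the separate 'tasks' table, using an in-memory SQLite database purely for illustration. The table layout and the helper names `create_task`, `mark_done` and `is_done` are assumptions, not something prescribed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# task_status stays NULL until the task finishes.
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, task_status TEXT)")


def create_task(conn):
    """Register a new task and return its id."""
    cur = conn.execute("INSERT INTO tasks (task_status) VALUES (NULL)")
    conn.commit()
    return cur.lastrowid


def mark_done(conn, task_id):
    """Called by the task itself once it has finished."""
    conn.execute("UPDATE tasks SET task_status = 'done' WHERE id = ?", (task_id,))
    conn.commit()


def is_done(conn, task_id):
    """Same check as the query above: a non-NULL status means finished."""
    row = conn.execute(
        "SELECT 1 FROM tasks WHERE task_status IS NOT NULL AND id = ?",
        (task_id,),
    ).fetchone()
    return row is not None


task_id = create_task(conn)
before = is_done(conn, task_id)
mark_done(conn, task_id)
after = is_done(conn, task_id)
```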
Hope this is of some use.
I am working on a Salesforce integration for a high-traffic app where we want to be able to automate the process of importing records from Salesforce to our app. To be clear, I am not working from the Salesforce side (i.e. Apex), but rather using the Salesforce REST API from within the other app.
The first idea was to use a cutoff time based on when each record was created, advancing the cutoff on each poll to the creation time of the last applicant from the previous poll. We quickly realized this wouldn't work. There can be other filters in the query, for example a status field in Salesforce, where the record should only be imported after a certain status is set. This would make checking the creation time or anything like it unreliable, since an older record could later become relevant to our auto-importing.
My next idea was to poll the Salesforce API to find records every few hours. In order to avoid importing the same record twice, the only way I could think to do this is by keeping track of the IDs we already attempted to import and using these to do a NOT IN condition:
SELECT #{columns} FROM #{sobject_name}
WHERE Id NOT IN #{ids_we_already_imported} AND #{other_filters}
My big concern at this point was whether or not Salesforce had a limitation on the length of the WHERE clause. Through some research I see there are actually several limitations:
https://developer.salesforce.com/docs/atlas.en-us.salesforce_app_limits_cheatsheet.meta/salesforce_app_limits_cheatsheet/salesforce_app_limits_platform_soslsoql.htm
The next thing I considered was running queries to find all of the IDs in Salesforce that meet the conditions of the other filters, without checking the ID itself. Then we could take that list of IDs and remove the ones we have already tracked on our end, leaving a smaller IN condition we could use to fetch all of the data for the records we actually need.
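Roughly, that two-step idea could look like the sketch below. It only builds query strings and makes no API calls. The function names, the example IDs and the `chunk_size` default are assumptions (the chunk size is not a documented Salesforce limit), and the IDs are assumed to be plain alphanumeric values that need no escaping.

```python
def ids_to_import(candidate_ids, already_imported):
    """Drop IDs we have already tracked, keeping the original order."""
    seen = set(already_imported)
    return [record_id for record_id in candidate_ids if record_id not in seen]


def build_in_queries(columns, sobject_name, ids, chunk_size=200):
    """Build one query per chunk of IDs so each WHERE clause stays short."""
    queries = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        id_list = ", ".join("'{}'".format(record_id) for record_id in chunk)
        queries.append("SELECT {} FROM {} WHERE Id IN ({})".format(
            ", ".join(columns), sobject_name, id_list))
    return queries


pending = ids_to_import(["001A", "001B", "001C"], already_imported=["001B"])
queries = build_in_queries(["Id", "Name"], "Contact", pending, chunk_size=1)
```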
This still doesn't seem completely reliable though. I see a single query can only return 2000 rows and only have an offset up to 2000. If we already imported 2000 records the first query might not have any necessary rows we'd want to import, but we can't offset it to get the relevant rows because of these limitations.
With these limitations I can't figure out a reliable way to find the relevant records to import as the number of records we already imported grows. I feel like this would be common usage of a Salesforce integration, but I can't find anything on this. How can I do this without having to worry about issues when we reach a high volume?
I'm not sure what all of your requirements are, or whether the solution needs to be generic, but you could do a few things:
Flag records that have been imported, and modify your query to exclude flagged records. This means making a call back to Salesforce to update the records, but that can be bulkified to reduce the number of calls.
Reverse the way you get the data from pull to push: have Salesforce push records to your app whenever they meet the criteria, using workflow and outbound messages.
Use the Streaming API to set up a push topic that your app can subscribe to, so it is notified when a record meets the criteria.
I have a web widget with 15,000,000 hits/month and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:
SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)
But as that's not possible on App Engine, I'm now looking into how to do it. It doesn't need to be fast.
A solution I was thinking of was to have an empty Unique-IP table, then have a MapReduce job go through all session entities; if an entity's IP is not in the table, I'd add it and add one to a counter. Then I'd have another MapReduce job that would clear the table. Would this be crazy? If so, how would you do it?
Thanks!
The mapreduce approach you suggest is exactly what you want. Don't forget to use transactions to update the record in your task queue task, which will allow you to run it in parallel with many mappers.
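As a sketch of why the transaction matters, here is the "insert if absent, then count" step in plain Python. A lock stands in for the datastore transaction, and `UniqueIpStore` and `map_session` are hypothetical names rather than part of the mapper API. The check and the insert happen atomically, so two mappers seeing the same IP cannot both count it.

```python
import threading


class UniqueIpStore:
    """Stand-in for the Unique-IP table plus its counter."""

    def __init__(self):
        self._ips = set()
        self._count = 0
        self._lock = threading.Lock()  # plays the role of the transaction

    def record(self, ip):
        """Insert the IP if absent; count it only on first insert."""
        with self._lock:
            if ip in self._ips:
                return False
            self._ips.add(ip)
            self._count += 1
            return True

    def count(self):
        with self._lock:
            return self._count

    def clear(self):
        """Equivalent of the second job that empties the table."""
        with self._lock:
            self._ips.clear()
            self._count = 0


def map_session(session, store):
    """What each mapper call would do for one session entity."""
    store.record(session["ip"])


sessions = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"},
            {"ip": "10.0.0.1"}, {"ip": "10.0.0.3"}]
store = UniqueIpStore()
for session in sessions:
    map_session(session, store)
unique_ips = store.count()
```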
In future, reduce support will make this possible with a single straightforward mapreduce and no hacking around with your own transactions and models.
If time is not important, you could try the task queue with a task limit of 1. Basically, you'd use a recursive task that works through a batch of log records until it hits DeadlineExceededError. Then you'd write the results to the datastore, and the task would re-enqueue itself with the query's end cursor (or the last record's key) so that the fetch resumes where it stopped.
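The chaining pattern might be sketched like this. To keep it runnable and deterministic, a fixed `budget` of records per run stands in for the real deadline, an integer index stands in for the query cursor, and the in-memory set stands in for results written to the datastore between runs. All names are made up for the example.

```python
def run_task(records, cursor, unique_ips, budget):
    """Process records from `cursor` until the per-run budget is used up.

    Returns the cursor at which the next run should resume.
    """
    processed = 0
    while cursor < len(records) and processed < budget:
        unique_ips.add(records[cursor])
        cursor += 1
        processed += 1
    return cursor


def run_chain(records, budget):
    """Simulate the task re-enqueueing itself until every record is read."""
    unique_ips = set()
    cursor = 0
    runs = 0
    while cursor < len(records):
        # In a real app this call would be a freshly enqueued task.
        cursor = run_task(records, cursor, unique_ips, budget)
        runs += 1
    return len(unique_ips), runs


log_ips = ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "2.2.2.2"]
unique_count, task_runs = run_chain(log_ips, budget=2)
```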
I want some ideas on the best practice for implementing an activity stream for a social network I'm building on App Engine (Python).
First, I want to keep a log of all activities of each user, so that we have a history, i.e. someone became a friend, added a picture, changed their address, etc. This way we have a user's history available should we need it. It also means we can remove friendship joins or change user data but still have a historical log.
I also want to stream a user's activity to their friends. For this, only the last X activities need to be kept, in the scenario where messages are sent to friends when an activity occurs.
It's pretty straightforward to design a history log, i.e. when, what, where. The complication is how we notify a user's friends of their activity.
In our app friendships are not mutual, i.e. they are based on the Twitter following model. Some accounts could have thousands of followers.
What is the best approach to model this?
Using a many-to-many join table and doing a costly query?
Using a feed class that fires a copy of the activity to all the subscribers, maybe into memcache? As there may be a need to fire thousands of messages, I would imagine a cron job would be needed.
Any help, ideas, or thoughts on this would be appreciated.
Thx
There's a great talk by Brett Slatkin called Building Scalable, Complex Apps on App Engine from last year's Google I/O, in which the example is a Twitter-like application, where users' updates are pushed to their followers. Basically exactly what you're trying to do.
I highly recommend the video for anyone writing an App Engine app, it's really helpful.
Don't do joins. They're too expensive, you'll burn through your quota in no time.
You can use a task queue; it's a bit like a cron job (i.e. stuff happens outside of the original request), but you can start tasks at will. memcache would be good if you're OK with losing some activity when the cache is flushed...
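To illustrate the push-to-followers idea without any App Engine APIs, here is a plain-Python sketch. The `followers` mapping, `FEED_SIZE`, `BATCH_SIZE` and the function names are assumptions. Each follower's feed is a bounded `deque`, so only the last X activities are kept, and followers are split into batches, where each batch would be one task-queue task in a real app.

```python
from collections import defaultdict, deque

FEED_SIZE = 3    # "last X activities" kept per follower (assumed value)
BATCH_SIZE = 2   # followers handled per task (assumed value)

followers = {"alice": ["bob", "carol", "dave"]}
feeds = defaultdict(lambda: deque(maxlen=FEED_SIZE))  # newest item first
history = defaultdict(list)                           # full per-user log


def fan_out_batches(user, batch_size):
    """Split a user's followers into batches, one batch per task."""
    subs = followers.get(user, [])
    return [subs[i:i + batch_size] for i in range(0, len(subs), batch_size)]


def deliver(batch, activity):
    """Body of one task: copy the activity into each follower's feed."""
    for follower in batch:
        feeds[follower].appendleft(activity)  # oldest entry drops off when full


def post_activity(user, activity):
    history[user].append(activity)            # the history log keeps everything
    for batch in fan_out_batches(user, BATCH_SIZE):
        deliver(batch, activity)              # would be enqueued as a task


for activity in ["a1", "a2", "a3", "a4"]:
    post_activity("alice", activity)
```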