Calculating unique elements from huge list in Google App Engine - google-app-engine

I've got a web widget with 15,000,000 hits/month, and I log every session. When I generate a report I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:
SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)
But as that's not possible with App Engine, I'm now looking for ways to do it. It doesn't need to be fast.
One solution I was thinking of: start with an empty Unique-IP table, then run a MapReduce job over all session entities; if an entity's IP is not in the table, add it and increment a counter. Afterwards, another MapReduce job would clear the table. Would this be crazy? If so, how would you do it?
Thanks!

The MapReduce approach you suggest is exactly what you want. Don't forget to use transactions to update the record in your task-queue task, which will allow you to run many mappers in parallel.
In the future, reduce support will make this possible with a single straightforward MapReduce and no hacking around with your own transactions and models.
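A minimal sketch of that approach with the Python db API, assuming a cross-group transaction and hypothetical UniqueIP/IPCounter models (none of these names are from the original post):

```python
from google.appengine.ext import db


class UniqueIP(db.Model):
    """Marker entity; the key_name is the IP address itself."""
    pass


class IPCounter(db.Model):
    count = db.IntegerProperty(default=0)


def record_ip(ip):
    """Called from each mapper: insert the IP if unseen and bump the counter."""
    def txn():
        if UniqueIP.get_by_key_name(ip) is None:
            UniqueIP(key_name=ip).put()
            counter = (IPCounter.get_by_key_name('total')
                       or IPCounter(key_name='total'))
            counter.count += 1
            counter.put()

    # The two entities live in different entity groups, so this needs an
    # XG (cross-group) transaction; a sharded counter would reduce contention.
    db.run_in_transaction_options(db.create_transaction_options(xg=True), txn)
```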

If time is not important, you could try the task queue with a concurrency limit of 1. Basically you'd use a recursive task that queries through batches of log records until it hits DeadlineExceededError. Then you'd write the results to the datastore, and the task would enqueue itself with the query's end cursor (or the last record's key) so the next run starts the fetch where the previous one stopped.
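A rough sketch of that recursive-task pattern with the Python db API and the deferred library; Session, its ip property, and record_ip are placeholders, not from the question:

```python
from google.appengine.ext import db, deferred
from google.appengine.runtime import DeadlineExceededError


def scan_sessions(cursor=None):
    """Walk the session log in batches; re-enqueue itself when time runs out."""
    query = Session.all()            # placeholder model for the session log
    if cursor:
        query.with_cursor(cursor)
    try:
        while True:
            batch = query.fetch(500)
            if not batch:
                return               # finished the whole log
            for session in batch:
                record_ip(session.ip)          # e.g. the transactional helper above
            query.with_cursor(query.cursor())  # continue from the last fetch
    except DeadlineExceededError:
        # Out of time: chain a new task that resumes from the current cursor.
        deferred.defer(scan_sessions, query.cursor())
```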

Related

Google App Engine - Event when queue finishes

I'm starting to build a bulk upload tool and I'm trying to work out how to accomplish one of the requirements.
The idea is that a user will upload a CSV file and the tool will parse it and send each row of the CSV to the task queue as a task to be run. Then once all the tasks (relating to that specific CSV file) are completed, a summary report will be sent to the user.
I'm using Google App Engine, and in the past I've used the standard Task Queue to handle tasks. However, with the standard Task Queue there is no way of knowing when the queue has finished: no event is fired to trigger the report generation, so I'm not sure how to achieve this.
I've looked into it more and I understand that Google also offers Google Pub/Sub. This is more sophisticated and seems more suited, but I still can't find out how to trigger an event when a Pub/Sub queue is finished. Any ideas?
It seems you could use a counter for this. Create an entity with an integer property set to the number of lines in the CSV file. Each task decrements the counter in a transaction when it finishes processing its row. The task that brings the counter to 0 could trigger the event. This might cause too much contention, though.
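A minimal sketch of the counter idea with the Python ndb API; UploadJob and send_summary_report are hypothetical names:

```python
from google.appengine.ext import ndb


class UploadJob(ndb.Model):
    remaining = ndb.IntegerProperty()  # set to the CSV row count when the upload starts


@ndb.transactional
def mark_row_done(job_key):
    """Decrement the counter; return True only for the task that finishes last."""
    job = job_key.get()
    job.remaining -= 1
    job.put()
    return job.remaining == 0


# In each row-processing task, after the row has been handled:
#     if mark_row_done(job_key):
#         send_summary_report(job_key)   # hypothetical report trigger
```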
Another possibility could be to have each task create an entity of a specific kind when it finishes processing a row. You can then count the number of these entities to determine when all the rows have been processed.
It might be easier to use the GAE Pipeline API, which takes care of this as a basic part of its functionality.
There's a nice article explaining it a bit here.
And a related SO question which happens to mention the same reason for moving to this API and has an excellent answer: Google AppEngine Pipelines API
I haven't used it myself yet, but it's just a matter of time :)
It's also possible to implement a scheme to track which related tasks are still active; see Figure out group of tasks completion time using TaskQueue and Datastore.
You can also check the queue's (approximate) status; see Get number of tasks in a named queue?
I faced a similar problem earlier this week and managed to find a nice workaround for it. I created an extra column in the table a task inserts data into. Once a specific task is completed, it updates this 'task_status' column to 'done'; otherwise it's left at the default null. Then, when the user refreshes the page, visits a specific URL, or you make an AJAX call to query the task status for a specific id in your table, you can see whether it is complete.
select * from table where task_status is not null and id = ?;
You can also create a 'tasks' table where you can store relevant columns there instead of modifying existing tables.
Hope this is of some use.

AppEngine & BigQuery - Where would you put stat/monitoring data?

I have an AppEngine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the health/performance of the application, now and in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert (regardless of the solution chosen above)? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in the Datastore, if you want to update a stats entity transactionally, you will need to shard it or you will be limited to about 5 updates per second per entity group.
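If a single stats entity does become a bottleneck, the usual fix is a sharded counter; a minimal ndb sketch (the shard count and names are arbitrary):

```python
import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # raise this if writes still contend


class StatShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)


@ndb.transactional
def _bump(shard_key):
    shard = shard_key.get() or StatShard(key=shard_key)
    shard.count += 1
    shard.put()


def increment(stat_name):
    """Spread writes across shards so no single entity is updated too often."""
    _bump(ndb.Key(StatShard, '%s-%d' % (stat_name, random.randint(0, NUM_SHARDS - 1))))


def total(stat_name):
    """Sum all shards to read the current value."""
    keys = [ndb.Key(StatShard, '%s-%d' % (stat_name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)
```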
If the data to insert is small (one row in BigQuery or one entity in the Datastore) then you can do it with a direct call, but you must accept that the call may fail. If you want to retry on failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be careful, because a task can be run more than once.
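A small sketch of the task-based insert, assuming a push queue; the handler URL and payload names are made up:

```python
from google.appengine.api import taskqueue

# Enqueue the insert instead of calling BigQuery directly; the push queue
# retries the task automatically if the handler returns an error status.
taskqueue.add(
    url='/tasks/bq-insert',               # hypothetical handler path
    params={'gcs_path': '/bucket/file'},  # hypothetical payload
    queue_name='bq-inserts')

# The handler behind /tasks/bq-insert must be idempotent, because a task can
# be delivered more than once (e.g. use a deterministic BigQuery job id or an
# insertId on streaming inserts).
```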

Preventing duplicates with MapReduce to BigQuery pipeline

I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from the datastore to Cloud Storage to BigQuery.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I have to have some way of knowing whether the entities have been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options: I can put a flag on the entities, update it when each entity is processed, and filter on it in subsequent runs; or I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the kind of incremental situation you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
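A minimal sketch of the cursor approach with ndb; Activity, ExportState, and the export step are placeholders, not from the question:

```python
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb


class ExportState(ndb.Model):
    cursor = ndb.StringProperty()  # url-safe cursor from the previous run


def export_next_batch(batch_size=500):
    """Resume the export query exactly where the previous run left off."""
    state = ExportState.get_or_insert('singleton')
    start = Cursor(urlsafe=state.cursor) if state.cursor else None
    entities, next_cursor, more = Activity.query().order(Activity.created).fetch_page(
        batch_size, start_cursor=start)
    # ... write `entities` to Cloud Storage / append to BigQuery here ...
    if next_cursor:
        state.cursor = next_cursor.urlsafe()
        state.put()
    return more
```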

How to do bulk deletes from the Google App Engine Datastore

I need to bulk-delete records from the datastore. I went through all the previous links, but they all just talk about fetching the entities from the datastore and then deleting them one by one. The problem in my case is that I have around 80K entities, and the read times out whenever I try to do it using the datastore db.delete() method.
Does anyone here by any chance know a method closer to SQL for performing a bulk delete?
You can use the Task Queue + a DB cursor for deletion.
A task can execute for up to 10 minutes, which is probably enough time to delete all the entities. But if it takes longer, you can get the current cursor position, call the task one more time with this cursor as a parameter, and continue processing from the last position.
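A minimal sketch of that pattern with the db API and the deferred library; LogEntry is a placeholder kind:

```python
from google.appengine.ext import db, deferred


def delete_batch(cursor=None):
    """Delete one keys-only batch, then chain a task to continue from the cursor."""
    query = db.Query(LogEntry, keys_only=True)  # LogEntry is a placeholder kind
    if cursor:
        query.with_cursor(cursor)
    keys = query.fetch(500)
    if keys:
        db.delete(keys)
        # More entities may remain: re-enqueue starting from the new cursor.
        deferred.defer(delete_batch, query.cursor())
```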
Define what API you're using. JDO? GAE? JPA? You refer to some db.delete, yet tag this as JDO; they are not the same. JDO obviously provides pm.deletePersistentAll(), and if you want more than that you can make use of the Google Mapper API.
You can use Cloud Dataflow to bulk delete entities in Datastore. You can use a GQL query to select the entities to delete:
https://cloud.google.com/datastore/docs/bulk-delete

timeout workarounds in app engine

I am building a demo for a banking application in App Engine.
I have a Users table and Stocks table.
In order for me to be able to list the "Top Earners" in the application, I save a "Total Amount" field in each User's entry so I will later be able to SELECT it with ORDER BY.
I am running a cron job that goes over the Stocks table and updates each user's "Total Amount" in the Users table. The problem is that I often get TIMEOUTS, since the Stocks table is pretty big.
Is there any way to overcome the time-limit restriction in App Engine, or is there any workaround for these kinds of updates (where you MUST select many entries from a table, which results in a timeout)?
Joel
The usual way is to split the job into smaller tasks using the task queue.
You have several options, all will involve some form of background processing.
One choice would be to use your cron job to kick off a task which starts as many tasks as needed to summarize your data. Another choice would be to use one of Brett Slatkin's patterns and keep the data updated in (nearly) realtime. Check out his high performance data-pipelines talk for details.
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
You could also check out the Mapper API (App Engine MapReduce) and see if it can do what you need.
http://code.google.com/p/appengine-mapreduce/
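A rough sketch of the fan-out idea with the deferred library; the User and Stock kinds are from the question, but the property names are guesses:

```python
from google.appengine.ext import db, deferred


def cron_handler():
    """Cron entry point: fan out one small task per user instead of scanning
    the whole Stocks table in a single request."""
    for user_key in User.all(keys_only=True):
        deferred.defer(update_total, user_key)


def update_total(user_key):
    """Recompute one user's Total Amount from that user's Stock entities."""
    total = sum(stock.amount for stock in Stock.all().filter('user =', user_key))
    user = db.get(user_key)
    user.total_amount = total   # property name is a guess
    user.put()
```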
