What I am trying to do is make a datastore put and, if it succeeds, make a Search API put as well. This link has some useful info on how to do this with Python, but I need an Objectify (Java) example.
I read that the way to do this is to run a transaction and enqueue a task inside it that only executes if the transaction succeeds? I'm not sure how to do that, so I'm looking for a solid example.
There is an overload of Queue.add() that accepts a transaction object. So, during your transaction, enqueue a deferred task that will sync to the search index, passing ofy().getTransaction() to Queue.add().
The task can be quite simple - the only state it needs is a key object. It loads the entity, builds the search document, and writes it to the index.
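A rough sketch of how that can fit together, assuming a version of Objectify where ofy().getTransaction() hands back the low-level datastore transaction that Queue.add() accepts; MyEntity, its getName() field, and the index name are all illustrative:

import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.SearchServiceFactory;
import com.google.appengine.api.taskqueue.DeferredTask;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.VoidWork;
import static com.googlecode.objectify.ObjectifyService.ofy;

public class EntitySaver {

    // Deferred task: the only state it carries is the entity's key.
    public static class SyncToSearchTask implements DeferredTask {
        private final Key<MyEntity> key;   // MyEntity is your own datastore entity

        public SyncToSearchTask(Key<MyEntity> key) {
            this.key = key;
        }

        @Override
        public void run() {
            MyEntity entity = ofy().load().key(key).now();
            if (entity == null) {
                return;  // entity was deleted before the task ran
            }
            Document doc = Document.newBuilder()
                    .setId(KeyFactory.keyToString(key.getRaw()))
                    .addField(Field.newBuilder().setName("name").setText(entity.getName()))
                    .build();
            SearchServiceFactory.getSearchService()
                    .getIndex(IndexSpec.newBuilder().setName("my-index").build())
                    .put(doc);
        }
    }

    // Datastore put and task enqueue happen in the same transaction,
    // so the task is only dispatched if the commit succeeds.
    public static void saveAndSync(final MyEntity entity) {
        ofy().transact(new VoidWork() {
            @Override
            public void vrun() {
                Key<MyEntity> key = ofy().save().entity(entity).now();
                QueueFactory.getDefaultQueue().add(
                        ofy().getTransaction(),
                        TaskOptions.Builder.withPayload(new SyncToSearchTask(key)));
            }
        });
    }
}

Because the deferred task carries only the key, the payload stays small and the task indexes whatever the freshest version of the entity is when it actually runs.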
As part of migrating my Google App Engine Standard project from Python 2 to Python 3, it looks like I also need to switch from using the Taskqueue API & library to google-cloud-tasks.
With the taskqueue library I could enqueue up to 100 tasks at a time, like this:
taskqueue.Queue('default').add([...task objects...])
as well as enqueue tasks asynchronously.
In the new library, as well as the new API, it looks like you can only enqueue tasks one at a time:
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create
https://googleapis.dev/python/cloudtasks/latest/gapic/v2/api.html#google.cloud.tasks_v2.CloudTasksClient.create_task
I have an endpoint that receives a batch with thousands of elements, each of which needs to be processed in an individual task. How should I go about this?
According to the official documentation (reference 1, reference 2), adding tasks to queues asynchronously (as this post suggests for adding a large number of tasks to a queue) is NOT available via the Cloud Tasks API. It is available to users of the App Engine SDK, though.
However, there is a reference in the documentation to adding a large number of Cloud Tasks to a queue via a double-injection pattern workaround (this post may be useful too).
To implement this scenario, you'll create a new injector queue, where a single task contains the information needed to add multiple (e.g. 100) tasks of your original queue. On the receiving end of this injector queue is a service that does the actual addition of the intended tasks to your original queue. Although the addition of tasks in that service is synchronous and one-by-one, it provides an asynchronous interface to your main application for bulk-adding tasks. In this way you can overcome the limit of synchronous, one-by-one task addition in your main application.
Note that the 500/50/5 pattern of ramping up task addition is the suggested method for avoiding overload of the queue or its target.
As I have not found an official example of this implementation, I will edit the answer as soon as I find one; in the meantime, a rough sketch follows.
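Here is a rough, illustrative sketch of the service on the receiving end of the injector queue, using the Cloud Tasks client library (shown in Java; the Python client's create_task works the same way, and the project, location, queue name and worker URL below are placeholders):

import com.google.cloud.tasks.v2.AppEngineHttpRequest;
import com.google.cloud.tasks.v2.CloudTasksClient;
import com.google.cloud.tasks.v2.HttpMethod;
import com.google.cloud.tasks.v2.QueueName;
import com.google.cloud.tasks.v2.Task;
import com.google.protobuf.ByteString;
import java.util.List;

public class Injector {

    // Handler logic for the injector queue: one injector task carries the
    // payloads for up to ~100 elements, and this service fans them out as
    // individual tasks on the original queue, one createTask call each.
    public static void fanOut(List<String> elementPayloads) throws Exception {
        QueueName originalQueue =
                QueueName.of("my-project", "us-central1", "default");  // placeholders

        try (CloudTasksClient client = CloudTasksClient.create()) {
            for (String payload : elementPayloads) {
                Task task = Task.newBuilder()
                        .setAppEngineHttpRequest(AppEngineHttpRequest.newBuilder()
                                .setRelativeUri("/worker")           // your real worker endpoint
                                .setHttpMethod(HttpMethod.POST)
                                .setBody(ByteString.copyFromUtf8(payload))
                                .build())
                        .build();
                client.createTask(originalQueue, task);  // synchronous, one at a time
            }
        }
    }
}

The main application then makes only one enqueue call per ~100 elements, while the synchronous one-by-one fan-out happens here, off the request path.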
Since you are in the middle of a migration, I figured this link would be useful, as it concerns migrating from Task Queue to Cloud Tasks (which is what you are doing).
Additional details on migrating your code can be found here and here, covering the migration of pull queues to Cloud Pub/Sub and of push queues to Cloud Tasks, respectively.
In order to recreate a batch pull mechanism, you would have to switch to Pub/Sub. Cloud Tasks does not have pull queues. With Pub/Sub you can batch push and batch pull messages.
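For the Pub/Sub route, a minimal sketch of batch publishing with the client library (Java shown; project, topic and thresholds are placeholders), where the library groups published messages into batched requests for you:

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.threeten.bp.Duration;

public class BatchPublish {

    public static void publishAll(List<String> elements) throws Exception {
        // Placeholders: substitute your own project and topic.
        TopicName topic = TopicName.of("my-project", "work-items");

        // The client buffers messages and sends them in batched publish requests.
        Publisher publisher = Publisher.newBuilder(topic)
                .setBatchingSettings(BatchingSettings.newBuilder()
                        .setElementCountThreshold(100L)             // flush after 100 messages...
                        .setRequestByteThreshold(100_000L)          // ...or ~100 KB of data...
                        .setDelayThreshold(Duration.ofMillis(100))  // ...or 100 ms, whichever first
                        .build())
                .build();
        try {
            for (String element : elements) {
                publisher.publish(PubsubMessage.newBuilder()
                        .setData(ByteString.copyFromUtf8(element))
                        .build());
            }
        } finally {
            publisher.shutdown();                       // flushes pending batches
            publisher.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}

On the consuming side you can likewise pull and acknowledge messages in batches.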
If you are using a push queue architecture, I would recommend passing those elements in the task payload; note, however, that the maximum task size is limited to 100 KB.
I'm starting to build a bulk upload tool and I'm trying to work out how to accomplish one of the requirements.
The idea is that a user will upload a CSV file and the tool will parse it and send each row of the CSV to the task queue as a task to be run. Then once all the tasks (relating to that specific CSV file) are completed, a summary report will be sent to the user.
I'm using Google App Engine, and in the past I've used the standard Task Queue to handle tasks. However, with the standard Task Queue there is no way of knowing when the queue has finished; no event is fired to trigger the report generation, so I'm not sure how to achieve this.
I've looked into it more and I understand that Google also offers Pub/Sub. This is more sophisticated and seems better suited, but I still can't find out how to trigger an event when a Pub/Sub queue is finished. Any ideas?
It seems that you could use a counter for this. Create an entity with an integer property set to the number of lines in the CSV file. Each task decrements the counter in a transaction when it finishes processing its row, and the task that brings the counter to 0 could trigger the event. This might cause too much contention, though.
Another possibility could be to have each task create an entity of a specific kind when it finishes processing a row. You can then count the number of these entities to determine when all the rows have been processed.
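A minimal sketch of the first (counter) approach, using Objectify here; the entity and field names are illustrative, and the contention caveat above still applies:

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import static com.googlecode.objectify.ObjectifyService.ofy;

@Entity
public class CsvJob {
    @Id public String csvId;       // one counter entity per uploaded CSV
    public int remainingRows;      // initialised to the row count at upload time

    // Called by each row-processing task after it finishes its row.
    // Returns true for exactly one caller: the one that brings the count to zero.
    public static boolean markRowDone(String csvId) {
        return ofy().transact(() -> {
            CsvJob job = ofy().load().type(CsvJob.class).id(csvId).now();
            job.remainingRows--;
            ofy().save().entity(job).now();
            return job.remainingRows == 0;
        });
    }
}

// In the row-processing task:
//   if (CsvJob.markRowDone(csvId)) { /* enqueue the summary-report task */ }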
It might be easier to use the GAE Pipeline API, which handles this as a basic part of its functionality.
There's a nice article explaining it a bit here.
And a related SO question which happens to mention the same reason for moving to this API and has an excellent answer: Google AppEngine Pipelines API
I haven't used it myself yet, but it's just a matter of time :)
It's also possible to implement a scheme to track which of the related tasks are still active; see Figure out group of tasks completion time using TaskQueue and Datastore.
You can also check the queue's (approximate) status; see Get number of tasks in a named queue?
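For the latter, a tiny illustrative check with the task queue API (Java; the queue name is a placeholder, and the statistics are approximate):

import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.QueueStatistics;

public class QueueCheck {
    // Statistics are approximate and can lag behind the real queue state.
    public static boolean looksDrained(String queueName) {
        QueueStatistics stats = QueueFactory.getQueue(queueName).fetchStatistics();
        return stats.getNumTasks() == 0;
    }
}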
I faced a similar problem earlier this week and managed to find a nice workaround for it. I created an extra column in the table a task inserts data into. Once a specific task is completed, it sets this 'task_status' column to 'done'; otherwise it is left as the default null. Then, when the user refreshes the page, visits a specific URL, or you make an AJAX call to query the task status for a specific id in your table, you can see whether it is complete:
select * from table where task_status is not null and id = ?;
You can also create a 'tasks' table where you can store relevant columns there instead of modifying existing tables.
Hope this is of some use.
I have a Rails app that uses Sunspot, and it is generating a high volume of individual updates which put unnecessary load on Solr. What is the best way to send these updates to Solr in batches?
Assuming the changes from the Rails app also update a persistent store, you can look at the Data Import Handler (DIH), which can be scheduled to update the Solr indexes periodically.
So instead of triggering an update and commit on Solr for every change, you can choose a frequency at which Solr is updated in batches.
However, expect some latency in the search results.
Also, are you updating individual records and committing each time? If you are using Solr 4.0, you can look into soft and hard commits as well.
Sunspot makes indexing a batch of documents pretty straightforward:
Sunspot.index(array_of_docs)
That will send off just the kind of batch update to Solr that you're looking for here.
The trick for your Rails app is finding the right scope for those batches of documents. Are they being created as the result of a bunch of user requests, and scattered all around your different application processes? Or do you have some batch process of your own that you control?
The sunspot_index_queue project on GitHub looks like a reasonable approach to this.
Alternatively, you can always turn off Sunspot's "auto-index" option, which fires off updates whenever your documents are updated. In your model, you can pass in auto_index: false to the searchable method.
searchable auto_index: false do
# sunspot setup
end
Then you have a bit more freedom to control indexing in batches. You might write a standalone Rake task which iterates through all objects created or updated in the last N minutes and indexes them in batches of 1,000 docs or so. An infinite loop of that should stand up to a pretty solid stream of updates.
At a really large scale, you want all your updates going through some kind of queue. Inserting your document data into a queue like Kafka or AWS Kinesis, for later batch processing by another standalone indexing process, would be ideal.
I used a slightly different approach here:
I was already using auto_index: false and processing Solr updates in the background with Sidekiq. So instead of building an additional queue, I used the sidekiq-grouping gem to combine Solr update jobs into batches. Then I use Sunspot.index in the job to index the grouped objects in a single request.
We need to keep our Firebase data in sync with other databases for full-text search (in ElasticSearch) and other kinds of queries that Firebase doesn't easily support.
This needs to be as close to real-time as possible; we can't just export a nightly dump of the Firebase JSON or anything like that, aside from the fact that it would get rather large.
My initial thought was to run a Node.js client which listens to child_changed, child_added, child_removed etc. events of all the main lists, but this could get a bit unwieldy, and would it be a reliable way of syncing if the client re-connects after a period of time?
My next thought was to maintain a list of "items changed" events and write to that every time an item is created/updated, similar to the Firebase work queue example. The queue could contain the full path to the data which has changed and the worker just consumes that and updates the local database accordingly.
The problem here is that every bit of code which makes updates has to remember to write to this queue, otherwise the two systems will get out of sync. Some proxy code shouldn't be too hard to write, though.
Has anyone else done anything similar with any success?
For search queries, you can integrate directly with ElasticSearch; there is no need to sync with a secondary database. Firebase has a blog post about integrating and a lib, Flashlight, to make this quick and painless.
Another option is to use the logstash-input-firebase Logstash plugin in order to listen to changes in your Firebase real-time database(s) and forward the data in real-time to Elasticsearch using an elasticsearch output.
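If you'd rather roll your own syncing process along the lines sketched in the question, a minimal listener with the Firebase Admin SDK might look like this (Java shown; the path and the indexing calls are placeholders, and it assumes FirebaseApp has already been initialised):

import com.google.firebase.database.ChildEventListener;
import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.FirebaseDatabase;

public class SearchSync {

    public static void start() {
        // Placeholder path: the list you want mirrored into the search index.
        FirebaseDatabase.getInstance().getReference("items")
                .addChildEventListener(new ChildEventListener() {
                    @Override
                    public void onChildAdded(DataSnapshot snapshot, String previousChildName) {
                        indexDocument(snapshot.getKey(), snapshot.getValue());
                    }

                    @Override
                    public void onChildChanged(DataSnapshot snapshot, String previousChildName) {
                        indexDocument(snapshot.getKey(), snapshot.getValue());
                    }

                    @Override
                    public void onChildRemoved(DataSnapshot snapshot) {
                        removeDocument(snapshot.getKey());
                    }

                    @Override
                    public void onChildMoved(DataSnapshot snapshot, String previousChildName) {
                        // ordering changes don't affect the search index
                    }

                    @Override
                    public void onCancelled(DatabaseError error) {
                        // log and decide whether to re-attach the listener
                    }
                });
    }

    // Placeholder: write the document to your ElasticSearch index,
    // e.g. via its REST API.
    private static void indexDocument(String key, Object value) { /* ... */ }

    // Placeholder: delete the document from the index.
    private static void removeDocument(String key) { /* ... */ }
}

Note that onChildAdded also fires for every existing child when the listener attaches, which gives you a crude backfill after a reconnect, at the cost of re-indexing everything.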
Is there a way to determine when a set of Google App Engine tasks (and the child tasks they spawn) have all completed?
Let's say that I have 100 tasks to execute and 10 of those spawn 10 child tasks each. That's 200 tasks. Let's also say that those child tasks might spawn more tasks, recursively, etc...
Is there a way to determine when all tasks have completed? I tried using the app engine pipeline API, but it doesn't look like it's going to work out for my particular use case, even though it is a great API.
My use case is that I want to make a whole bunch of rate limited URL fetch calls while concurrently writing to a blob. At the end of all the URL fetch calls, I want to finalize the blob.
I found the solution with the pipeline API, but it does so much writing to the datastore that it wouldn't be cost effective for me with how often I need to run the pipeline.
There's no way around writing to a persistent storage medium of some sort, and the datastore is the only game in town. You could write your own server to track completions using a backend, but that's an awful lot of overhead for a simple task. Using the Pipeline API is your best bet.