When I add a task to the task queue, sometimes the task goes missing. I don't get any errors, but I just don't find the task in my logs. Suppose I add n tasks. The computation cannot go forward without these n tasks finishing. However, I find that one or more of these n tasks simply go missing after they were added, and my whole algorithm stops in the middle.
What could be the reason?
I keep a variable w to check the number of times a task was added. I observe w = n even though some tasks were not created.
def addtask_whx(index, user, seqlen, vp_compress, iseq_compress):
    global w
    while True:
        timeout_ms = 100
        taskq_name = 'whx' + '--' + str(index[0]) + '-' + str(index[1]) + '-' + str(index[2]) + '-' + str(index[3]) + '-' + str(index[5]) + '--' + user
        try:
            taskqueue.add(name=taskq_name + str(timeout_ms), queue_name='whx', url='/whx',
                          params={'m': index[0], 'n': index[1], 'o': index[2], 'p': index[3],
                                  'q': 0, 'r': index[5], 'user': user, 'seqlen': seqlen,
                                  'vp': vp_compress, 'iseq': iseq_compress})
            w = w + 1
            break
        except DeadlineExceededError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Timeout Retrying")
        except TransientError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Transient Error Retrying")
        except TombstonedTaskError:
            logging.error("WHX Task Queue Tombstoned Error")
            break
Disclaimer: this is not the answer you're looking for, but I hope it will help you nonetheless.
The computation cannot go forward
without these n tasks finishing
It sounds like you are using the task queue for something it was not designed to do. You should read: http://code.google.com/appengine/docs/java/taskqueue/overview.html#Queue_Concepts
Tasks are not guaranteed to be executed in the order they arrive, and they are not guaranteed to be executed exactly once. In some cases, a single task may be executed more than once or not at all. Further, a task can be cancelled and re-queued at the discretion of the App Engine based on available resources. For example, your timeout_ms = 100 is very low; if a new JVM has to be started, which could take several seconds, tasks n+1 and n+2 may be executed before task n.
In short, the task queue is not a reliable mechanism for performing strictly sequential computation. You've been warned.
-tjw
Related
This may be a trivial question, but I was just hoping to get some practical experience from people who may know more about this than I do.
I wanted to generate a database in GAE from a very large series of XML files -- as a form of validation, I am calculating statistics on the GAE datastore, and I know there should be ~16,000 entities, but when I perform a count, I'm getting more on the order of 12,000.
The way I'm doing the counting is basically: I apply a filter, fetch a page of 1000 entities, and then enqueue a task for each entity (using its key). Each task then adds "1" to a counter that I'm storing.
I think I may have juiced the datastore writes too much; I set the rate of my task queue to 50/s. I did get some write errors, but not nearly enough to explain the 4,000 difference. Could it be that I was rushing the counting calls so much that it led to inconsistency? Would slowing the rate at which I process the task queue to something like 5/s solve the problem? Thanks.
You can count your entities very easily (no tasks and almost for free):
int total = 0;
Query q = new Query("entity_kind").setKeysOnly();
// set your filter on this query
QueryResultList<Entity> results;
Cursor cursor = null;
FetchOptions queryOptions = FetchOptions.Builder.withLimit(1000).chunkSize(1000);
do {
    if (cursor != null) {
        queryOptions.startCursor(cursor);
    }
    results = datastore.prepare(q).asQueryResultList(queryOptions);
    total += results.size();
    cursor = results.getCursor();
} while (results.size() == 1000);
System.out.println("Total entities: " + total);
UPDATE:
If looping like I suggested takes too long, you can spin a task for every 100/500/1000 entities - it's definitely more efficient than creating a task for each entity. Even very complex calculations should take milliseconds in Java if done right.
For example, each task can retrieve a batch of entities, spin a new task (and pass a query cursor to this new task), and then proceed with your calculations.
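If it helps, here is a rough Python sketch of that batch-and-requeue pattern (the answer above is in Java, but the idea is the same; MyModel, /count_batch and the subtotal parameter are illustrative names, not from your code):
import logging
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

BATCH_SIZE = 500  # entities counted per task

class MyModel(ndb.Model):  # placeholder for the entity kind being counted
    pass

class CountBatchHandler(webapp2.RequestHandler):
    def post(self):
        cursor_arg = self.request.get('cursor')
        cursor = ndb.Cursor(urlsafe=cursor_arg) if cursor_arg else None
        subtotal = int(self.request.get('subtotal', '0'))

        # Fetch one batch of keys, starting where the previous task left off.
        keys, next_cursor, more = MyModel.query().fetch_page(
            BATCH_SIZE, keys_only=True, start_cursor=cursor)
        subtotal += len(keys)

        if more and next_cursor:
            # Hand the cursor to the next task instead of one task per entity.
            taskqueue.add(url='/count_batch',
                          params={'cursor': next_cursor.urlsafe(),
                                  'subtotal': subtotal})
        else:
            logging.info('Final count: %d', subtotal)

app = webapp2.WSGIApplication([('/count_batch', CountBatchHandler)])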
I need to loop over a large dataset within App Engine. Of course, as the datastore times out after a small amount of time, I decided to use tasks to solve this problem. Here's an attempt to explain the method I'm trying to use:
Initialization of task via http post
0) Create query (entity.query()), and set a batch_size limit (i.e. 500)
1) Check if there are any cursors--if this is the first time running, there won't be any.
2a) If there are no cursors, use iter() with the following options: produce_cursors = true, limit= batch_size
2b) If there are cursors, use iter() with the same options as 2a + set start_cursor to the cursor.
3) Do a for loop to iterate through the results pulled by iter()
4) Get cursor_after()
5) Queue new task (basically re-run the task that was running) passing the cursor into the payload.
So if this code were to work the way I wanted, there'd only be 1 task running at any particular time in the queue. However, I started running the task this morning and 3 hours later when I looked at the queue, there were 4 tasks in it! This is weird because the new task should only be launched at the end of the task launching it.
Here's the actual code with no edits:
class send_missed_swipes(BaseHandler):  # disabled
    def post(self):
        """Loops across entire database (as filtered)"""
        # Settings
        BATCH_SIZE = 500
        cursor = self.request.get('cursor')
        start = datetime.datetime(2014, 2, 13, 0, 0, 0, 0)
        end = datetime.datetime(2014, 3, 5, 0, 0, 0, 0)

        # Filters
        swipes = responses.query()
        swipes = swipes.filter(responses.date > start)

        if cursor:
            num_updated = int(self.request.get('num_updated'))
            cursor = ndb.Cursor.from_websafe_string(cursor)
            swipes = swipes.iter(produce_cursors=True, limit=BATCH_SIZE, start_cursor=cursor)
        else:
            num_updated = 0
            swipes = swipes.iter(produce_cursors=True, limit=BATCH_SIZE)

        count = 0
        for swipe in swipes:
            count += 1
            if swipe.date > end:
                pass
            else:
                uKey = str(swipe.uuId.urlsafe())
                pKey = str(swipe.pId.urlsafe())
                act = swipe.act
                taskqueue.add(queue_name="analyzeData", url="/admin/analyzeData/send_swipes",
                              params={'act': act, 'uKey': uKey, 'pKey': pKey})
                num_updated += 1

        logging.info('count = ' + str(count))
        logging.info('num updated = ' + str(num_updated))
        cursor = swipes.cursor_after().to_websafe_string()
        taskqueue.add(queue_name="default", url="/admin/analyzeData/send_missed_swipes",
                      params={'cursor': cursor, 'num_updated': num_updated})
This is a bit of a complicated question, so please let me know if I need to explain it better. And thanks for the help!
p.s. Threadsafe is false in app.yaml
I believe a task can be executed multiple times; therefore it is important to make your process idempotent.
From the docs: https://developers.google.com/appengine/docs/python/taskqueue/overview-push
Note that this example is not idempotent. It is possible for the task
queue to execute a task more than once. In this case, the counter is
incremented each time the task is run, possibly skewing the results.
You can create the task with a name to handle this:
https://developers.google.com/appengine/docs/python/taskqueue/#Python_Task_names
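For example, adding the task with an explicit name makes a duplicate add fail instead of creating a second copy (a minimal sketch; the name pattern and URL below are just placeholders):
from google.appengine.api import taskqueue

def enqueue_once(batch_id):
    # A task name can only be used once, and stays tombstoned for a while
    # after the task runs, so a duplicate add raises instead of enqueuing
    # a second copy of the task.
    try:
        taskqueue.add(name='send-swipes-%s' % batch_id,  # placeholder name
                      url='/admin/analyzeData/send_swipes',
                      params={'batch_id': batch_id})
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        pass  # already enqueued (or ran recently); safe to ignore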
I'm curious why threadsafe=False in your yaml?
A bit off topic (since I'm not addressing your problems), but this sounds like a job for map reduce.
On topic: you can create a custom queue with max_concurrent_requests=1. You could still have multiple tasks in the queue, but only one would be executing at a time.
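A minimal queue.yaml sketch of that setup (the queue name matches the "analyzeData" queue used above; the rate is just an example value):
queue:
- name: analyzeData
  rate: 5/s
  max_concurrent_requests: 1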
I'm hoping someone can confirm what is actually happening here with TPL and SQL connections.
Basically, I have a large application which, in essence, reads a table from SQL Server, and then processes each row - serially. The processing of each row can take quite some time. So, I thought to change this to use the Task Parallel Library, with a "Parallel.ForEach" across the rows in the datatable. This seems to work for a little while (minutes), then it all goes pear-shaped with...
"The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
Now, I surmised the following (which may of course be entirely wrong).
The "ForEach" creates tasks for each row, up to some limit based on the number of cores (or whatever). Let's say 4, for want of a better idea. Each of the four tasks gets a row and goes off to process it. TPL waits until the machine is not too busy, and fires up some more. I'm expecting a max of four.
But that's not what I observe - and not what I think is happening.
So... I wrote a quick test (see below):
Sub Main()
    Dim tbl As New DataTable()
    FillTable(tbl)
    Parallel.ForEach(tbl.AsEnumerable(), AddressOf ProcessRow)
End Sub

Private n As Integer = 0

Sub ProcessRow(row As DataRow, state As ParallelLoopState)
    n += 1 ' I know... not thread safe
    Console.WriteLine("Starting thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    Using cnx As SqlConnection = New SqlConnection(My.Settings.ConnectionString)
        cnx.Open()
        Thread.Sleep(TimeSpan.FromMinutes(5))
        cnx.Close()
    End Using
    Console.WriteLine("Closing thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    n -= 1
End Sub
This creates way more than my guess at the number of tasks. So, I surmise that TPL fires up tasks to the limit it thinks will keep my machine busy, but hey, what's this, we're not very busy here, so let's start some more. Still not very busy, so... etc. (seems like roughly one new task a second).
This is reasonable-ish, but I expected it to go pop 30 seconds (the SQL connection timeout) after it reached 100 open SQL connections - the default connection pool size - which it doesn't.
So, to scale it back a bit, I change my connection string to limit the max pool size.
Sub Main()
    Dim tbl As New DataTable()
    Dim csb As New SqlConnectionStringBuilder(My.Settings.ConnectionString)
    csb.MaxPoolSize = 10
    csb.ApplicationName = "Test 1"
    My.Settings("ConnectionString") = csb.ToString()
    FillTable(tbl)
    Parallel.ForEach(tbl.AsEnumerable(), AddressOf ProcessRow)
End Sub
I count the real number of connections to the SQL server and, as expected, it's 10. But my application has fired up 26 tasks - and then hangs. So, setting the max pool size for SQL somehow limited the number of tasks to 26, but why not 27, and especially, why doesn't it fall over at task 11 because the pool is full?
Obviously, somewhere along the line I'm asking for more work than my machine can do, and I can add "MaxDegreeOfParallelism" to the ForEach, but I'm interested in what's actually going on here.
PS.
Actually, after sitting with 26 tasks for (I'm guessing) 5 minutes, it does fall over with the original (max pool size reached) error. Huh?
Thanks.
Edit 1:
Actually, what I now think happens in the tasks (my "ProcessRow" method) is that after 10 successful connections/tasks, the 11th does block for the connection timeout, and then does get the original exception - as do any subsequent tasks.
So... I conclude that the TPL is creating tasks at about one per second, and it gets enough time to create about 26/27 before task 11 throws an exception. All subsequent tasks then also throw exceptions (about a second apart) and the TPL stops creating new tasks (because it gets unhandled exceptions in one or more tasks?).
For some reason (as yet undetermined), the ForEach then hangs for a while. If I modify my ProcessRow method to use the state to say "stop", it appears to have no effect.
Sub ProcessRow(row As DataRow, state As ParallelLoopState)
    n += 1
    Console.WriteLine("Starting thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    Try
        Using cnx As SqlConnection = fnNewConnection()
            Thread.Sleep(TimeSpan.FromMinutes(5))
        End Using
    Catch ex As Exception
        Console.WriteLine("Exception on thread {0}", Thread.CurrentThread.ManagedThreadId)
        state.Stop()
        Throw
    End Try
    Console.WriteLine("Closing thread {0}({1})", n, Thread.CurrentThread.ManagedThreadId)
    n -= 1
End Sub
Edit 2:
Dur... The reason for the long delay is that, while tasks 11 onwards all crash and burn, tasks 1 to 10 don't, and all sit there sleeping for 5 minutes. The TPL has stopped creating new tasks (because of the unhandled exception in one or more of the tasks it has created), and then waits for the un-crashed tasks to complete.
The edits to the original question add more detail and, eventually, the answer becomes apparent.
TPL creates tasks repeatedly because the tasks it has created are (basically) idle. This is fine until the connection pool is exhausted, at which point the tasks which want a new connection wait for one to become available, and timeout. In the meantime, the TPL is still creating more tasks, all doomed to fail. After the connection timeout, the tasks start failing, and the ensuing exception(s) cause the TPL to stop creating new tasks. The TPL then waits for the tasks that did get connections to complete, before an AggregateException is thrown.
The TPL is not made for IO-bound work. It uses heuristics to steer the number of active threads, and those heuristics fail for long-running and/or IO-bound tasks, causing it to inject more and more threads without a practical limit.
Use PLINQ to set a fixed number of threads using WithDegreeOfParallelism; you should probably test different amounts. It could look like a plain PLINQ query over the rows, something along the lines of tbl.AsEnumerable().AsParallel().WithDegreeOfParallelism(n).ForAll(...). I have written much more about this topic on SO, but I can't find it at the moment.
I have no idea why you are seeing exactly 26 threads in your example. Note that when the pool is depleted, a request for a connection only fails after a timeout. This entire system is very non-deterministic, and I'd consider any number of threads plausible.
I use NDB for my app and use iter() with a limit and starting cursor to iterate through 20,000 query results in a task. A lot of the time I run into a timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit but I thought a task can run as long as 10 mins. 10 mins should be more than enough time to go through 20000 results. In fact, on a good run, the task can complete in just about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same underlying mechanism. The exception is when you know how many entities you want up front - then fetch can be more efficient, as you end up with just one round trip.
You can increase the batch size for iter; that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
From the docs the default batch size is 20, which would mean a lot of batches for 20,000 entities.
Other things can help too. Consider using map and/or map_async for the processing, rather than explicitly calling process(entity). Have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map - introducing async into your processing can also mean improved concurrency.
Having said all of that, you should profile so you can understand where the time is spent. For instance, the delays could be in your processing, due to things you are doing there.
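As a rough sketch of the batch_size and map_async suggestions (MyModel and the per-item work are placeholders, and the numbers are only examples):
from google.appengine.ext import ndb

class MyModel(ndb.Model):  # placeholder kind with one illustrative flag
    processed = ndb.BooleanProperty(default=False)

@ndb.tasklet
def process_item_async(item):
    # Placeholder for your process(item); using the *_async variants lets
    # the datastore writes overlap with fetching the next batches.
    item.processed = True
    yield item.put_async()

def run(start_cursor=None):
    query = MyModel.query()
    # The default batch size is 20; a larger value means far fewer
    # round trips when walking 20,000 results.
    query.map_async(process_item_async,
                    limit=20000,
                    batch_size=500,
                    start_cursor=start_cursor,
                    produce_cursors=True).get_result()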
There are other things to consider with ndb, like context caching, which you need to disable. I also used the iter method for these, and I made an ndb version of the mapper API that the old db had.
Here is my ndb mapper API, which should solve the timeout and ndb caching problems and makes it easy to create this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper API you can create a job like the one below, or you can just improve it.
class NameYourJob(Mapper):
    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)

# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number. The code excerpt below is from an on-line maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again thanks to our timer check. In your case, you would likely break the process after passing the cursor to your next queue task. Here is some sample code which is edited down to hopefully give you a good idea of our logic. One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False

# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
I want to do some heavy processing in the map() call of the mapper.
I was going through the source file MapReduceServlet.java:
// Amount of time to spend on actual map() calls per task execution.
public static final int PROCESSING_TIME_PER_TASK_MS = 10000;
Does it mean the map call can last only 10 secs? What happens after 10 secs?
Can I increase this to a larger number, like 1 min or 10 min?
-Aswath
MapReduce operations are executed in tasks using Push Queues, and as said in the documentation the task deadline is currently 10 minutes (limit after which you will get a DeadlineExceededException).
If a task fails to execute, by default App Engine retries it until it succeeds. If you need a longer deadline than 10 minutes, you can use a backend to execute your tasks.
Looking at the actual usage of PROCESSING_TIME_PER_TASK_MS in Worker.java, this value is used to limit the number of map calls done in a single task.
After each map call has been executed, if more than 10 s have elapsed since the beginning of the task, a new task is spawned to handle the rest of the map calls:
1) Worker.scheduleWorker spawns a new Task for each given shard
2) Each task calls Worker.processMapper
3) processMapper executes one map call
4) If less than PROCESSING_TIME_PER_TASK_MS has elapsed since step 2, go back to step 3
5) Otherwise, if processing is not finished, reschedule a new worker task
In the worst-case scenario, the default task request deadline (10 minutes) should apply to each of your individual map calls.