Loop with requests.get and avoid timeout errors

I am trying to scrape information.
I go to the introduction page to determine the number of search results. Often the results span more than one page, so I need to run further requests in a loop. Occasionally one of these extra requests raises an error or hangs.
Is there a way to check a request and, if it fails, try again, and if it still fails, log it and move on to the next one?
Here is a sample script:
import requests

t = 0.3  # timeout in seconds
urls = ['https://stackoverflow.com/', 'https://www.google.com/']  # list of up to 200 urls
for url in urls:
    print(url)
    response = requests.get(url, timeout=t)
    t = t + 10
Running this occasionally raises a timeout, and processing stops. My workaround is to print the URL that failed and then rerun, updating the list manually.
I would like a way that, if a request error occurs, retries the request five times and, if it still fails, logs and stores the failed URL, then goes on to the next one, so that I can retry the failed URLs at a later stage.
Any suggestions?

I haven't used Python in a while, but I'm pretty sure that this will work.
import requests

t = 0.3  # timeout in seconds
urls = ['https://stackoverflow.com/', 'https://www.google.com/']  # list of up to 200 urls
for url in urls:
    print(url)
    try:
        response = requests.get(url, timeout=t)
    except requests.exceptions.RequestException:
        # if the request fails: try again
        try:
            response = requests.get(url, timeout=t)
        except requests.exceptions.RequestException:
            # if the request fails again: do nothing (continue to the next url)
            pass
    t = t + 10
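The question also asks for five retries and a stored list of failed URLs, which the two nested try blocks do not cover. A minimal sketch of that behaviour follows. The names `fetch_all`, `fetch`, `attempts` and `delay` are made up for illustration, and `fetch` is assumed to be any callable that takes a URL and raises an exception on failure.

```python
import logging
import time

logger = logging.getLogger("scraper")


def fetch_all(urls, fetch, attempts=5, delay=0.0):
    """Call fetch(url) for each URL, retrying failures up to `attempts` times.

    Returns (results, failed): a dict of url -> response, and a list of the
    URLs that still failed after the last attempt.
    """
    results, failed = {}, []
    for url in urls:
        for attempt in range(1, attempts + 1):
            try:
                results[url] = fetch(url)
                break  # success: stop retrying this URL
            except Exception as exc:  # narrow this to the errors fetch can raise
                logger.warning("attempt %d/%d failed for %s: %s",
                               attempt, attempts, url, exc)
                if attempt < attempts:
                    time.sleep(delay)  # optional pause before the next attempt
        else:
            # the inner loop ended without a break: every attempt failed
            failed.append(url)
    return results, failed
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=10)`, and the `except` clause could be narrowed to `requests.exceptions.RequestException`. The returned `failed` list can be passed back into `fetch_all` later for another round.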

Related

Gatling overrides `Cookie` header with the `Set-Cookie` header of previous response

I have an http request in Gatling that gets executed 10 times:
val scnProxy = scenario("Access proxy")
  .exec(session => session.set("connect.sid", sessionId))
  .repeat(10) {
    exec(
      http("Access endpoint")
        .get("/my-api")
        .header(
          "Cookie",
          session => "connect.sid=" + session("connect.sid").as[String]
        )
        .check(status is 200)
    )
  }
For some reason, I get the intended response only on the first iteration. On every other iteration I get a 401. I changed the log level to TRACE to investigate and found odd behavior: on the first iteration the request carries the header Cookie: connect.sid=..., but on the second and later iterations the cookie value is overridden by the Set-Cookie of the previous response. Since the Cookie header value is a string, the cookies are not merged.
Is there a way to add a cookie instead of having my cookie overridden?
Use the proper Gatling components for manipulating cookies.

Gmail API: messages.list response has no 'messages' key even though the previous iteration returned a nextPageToken

I have been using the Gmail API for the past year with the example at https://developers.google.com/gmail/api/v1/reference/users/messages/list#try-it but now the example fails: the nextPageToken suggests there are more messages, yet the next page comes back empty.
The problem is in this part of the code:
while 'nextPageToken' in response:
    page_token = response['nextPageToken']
    response = service.users().messages().list(userId=user_id, q=query,
                                               pageToken=page_token).execute()
    messages.extend(response['messages'])
The error is raised when accessing response['messages'], because the only key in the response is 'resultSizeEstimate', and its value is 0. It looks like the page_token points to an empty next page.
Is anyone else experiencing this issue?
If your last page perfectly contains the last email with that particular query, you will get a nextPageToken to a page with a response like this:
{
  "resultSizeEstimate": 0
}
The easiest way around this is to just add a check if messages is part of the response:
while 'nextPageToken' in response:
    page_token = response['nextPageToken']
    response = service.users().messages().list(userId=user_id, q=query, pageToken=page_token).execute()
    if 'messages' in response:
        messages.extend(response['messages'])
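The same check can also be written with `dict.get`, which treats a page without a 'messages' key as an empty list. The sketch below is only an illustration: `collect_messages` and `list_page` are hypothetical names, and `list_page` is assumed to be a callable that takes a page token (None for the first page) and returns one response dict.

```python
def collect_messages(list_page):
    """Gather the messages from every page.

    list_page(page_token) is a hypothetical callable returning one response
    dict; page_token is None for the first page.
    """
    messages = []
    page_token = None
    while True:
        response = list_page(page_token)
        # .get() tolerates a page that carries no 'messages' key at all
        messages.extend(response.get('messages', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            return messages
```

With the client from the question, `list_page` might wrap `service.users().messages().list(userId=user_id, q=query, pageToken=token).execute()`, assuming the client accepts `pageToken=None` for the first page.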

How can an HTTP 403 be returned from an apache web server input filter?

I have written an apache 2.x module that attempts to scan request bodies, and conditionally return 403 Forbidden if certain patterns match.
My first attempt used ap_hook_handler to intercept the request, scan it, and then return DECLINED so the real handler could take over (or 403 if the conditions were met).
The problem with that approach is that reading the POST body of the request (using ap_get_client_block and friends) apparently consumed the body, so that if the request was subsequently handled by mod_proxy, the body was gone.
I think the right way to scan the body would be to use an input filter, except an input filter can only return APR_SUCCESS or fail. Any return codes other than APR_SUCCESS get translated into HTTP 400 Bad Request.
I think maybe I can store a flag in the request notes if the input filter wants to fail the request, but I'm not sure which later hook to get that.
It turned out to be pretty easy: just drop an error bucket into the brigade:
apr_bucket_brigade *brigade = apr_brigade_create(f->r->pool, f->r->connection->bucket_alloc);
apr_bucket *bucket = ap_bucket_error_create(403, NULL, f->r->pool,
f->r->connection->bucket_alloc);
APR_BRIGADE_INSERT_TAIL(brigade, bucket);
bucket = apr_bucket_eos_create(f->r->connection->bucket_alloc);
APR_BRIGADE_INSERT_TAIL(brigade, bucket);
ap_pass_brigade(f->next, brigade);

App Engine generating infinite retries

I have a backend that is normally invoked by cron a few times every day. Yesterday I noticed it was restarting without stopping. I don't see anywhere in my code where that invocation happens. Rather, the task queue seems to indicate that it is running because of retries due to errors. One error is that the status is saved to BigQuery, and that is failing because a quota is exceeded. But this seems to generate an infinite loop. Is this a bug in App Engine, or am I doing something wrong? Is there a way to indicate that a task should not be restarted if it fails? My other App Engine tasks that terminate without a 200 status don't do that...
Here is a trace of the queue from which the restarts keep happening:
Here is the logging showing continuous running
And here is the http header inside the logging
UPDATE1
Here is the cron:
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
<cron>
<url>/uploadToBigQueryStatus</url>
<description>Check fileNameSaved Status</description>
<schedule>every 15 minutes from 02:30 to 03:30</schedule>
<timezone>US/Pacific</timezone>
<target>checkuploadstatus-backend</target>
</cron>
</cronentries>
UPDATE 2
As for the comment about catching the error: I believe the error is that the BigQuery job fails because a quota has been hit. The strange thing is that it happened yesterday, and the quota should have been reset, so the error should have gone away for at least a while. I don't understand why the task retries; I never selected that option, as far as I am aware.
I killed the servlet and emptied the task queue, so at least it has stopped. But I don't know the root cause. If the BigQuery table quota was the reason, that shouldn't cause an infinite retry!
UPDATE 3
I have not trapped the servlet call that produced the error that led to the infinite retry. But I checked this cron-activated servlet today and found another non-200 result. The return value this time was 500, caused by a datastore timeout exception.
Here is the screenshot of the response showing the 500 return code.
Here is the exception info, page 1
And the following data
The offending line is the for loop iterating over the datastore query:
if (keys[0] != null) {
    /* Define the query */
    q = new Query(bucket).setAncestor(keys[0]);
    pq = datastore.prepare(q);
    gotResult = false;
    // First system time stamp
    Date date = new Timestamp(new Date().getTime());
    Timestamp timeStampNow = new Timestamp(date.getTime());
    for (Entity result : pq.asIterable()) {
I will add a try-catch around this for loop, as it is crashing in this iteration:
if (keys[0] != null) {
    /* Define the query */
    q = new Query(bucket).setAncestor(keys[0]);
    pq = datastore.prepare(q);
    gotResult = false;
    // First system time stamp
    Date date = new Timestamp(new Date().getTime());
    Timestamp timeStampNow = new Timestamp(date.getTime());
    try {
        for (Entity result : pq.asIterable()) {
Hopefully the datastore read will no longer crash the servlet and will just report a failure. At least the cron will run again and pick up other unhandled results.
By the way, is this a Java error or an App Engine one? I see a lot of these datastore timeouts, and I will add a try-catch around all the result loops. Still, it should not cause the infinite retry that I experienced. I will see if I can find the actual crash; the problem is that it overloaded my logging. More later.
UPDATE 4
I went back to the logs to see when the infinite loop began. In the logs below, I opened the run at the head of the continuous running. You can see that it fails with a 500 every fifth time. It was not the cron that invoked it; it was me calling the servlet to check the BigQuery upload status. (I write the job info to the datastore, then read it back in the servlet, write the job status to BigQuery and, if the job is done, erase the datastore entry.) I cannot explain the steady 500 errors on every fifth call, but it is always the datastore timeout exception.
UPDATE 5
Can the infinite retries be happening because of the queue configuration?
CheckUploadStatus
20/s
10
100
10
200
2
I just noticed that another task queue had a 500 return code and was continuously retrying. I did some searching and found that some people have tried to configure the queues for no retry. They said that didn't work.
See this link:
Google App Engine: task_retry_limit doesn't work?
But one retry is possible? That is far better than infinite.
It is contradictory that Google enforces quotas but seems to prefer infinite retries. I would much prefer that retries be blocked by default on a non-200 return code, and then have NO QUOTAS!!!
According to Retrying cron jobs that fail:
If a cron job's request handler returns a status code that is not in
the range 200–299 (inclusive) App Engine considers the job to have
failed. By default, failed jobs are not retried.
To set failed jobs to be retried:
Include a retry-parameters block in your cron.xml file.
Choose and set the retry parameters in the retry-parameters block.
Your cron config doesn't specify the necessary retry parameters, so the jobs returning the 500 code should, indeed, not be retried, as you expect.
So this looks like a bug, possibly a variant of the (older) known issue 10075. The 503 code mentioned there might have changed in the meantime, but it is also a quota-related failure.
The suggestion from GAEfan's comment is likely a good workaround:
You will need to catch the error, and send a 200 response to stop the
task queue from retrying. – GAEfan 1 hour ago
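The workaround in that comment can be sketched roughly as follows. This is not App Engine code: `run_task` and `work` are hypothetical names, and the function only shows the idea of catching every failure and still reporting a 200 status so that the queue does not schedule a retry.

```python
import logging

logger = logging.getLogger("status-check")


def run_task(work):
    """Run work() and return the HTTP status the handler should send.

    Failures are logged and still reported as 200, so the queue sees a
    success and does not retry the task.
    """
    try:
        work()
    except Exception:
        # keep the stack trace in the logs, since the status code hides it
        logger.exception("task failed; returning 200 to suppress retries")
    return 200
```

The trade-off is that failures become visible only in the logs, so the next scheduled cron run has to pick up whatever was left unprocessed.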

Camel Apache: can I use a retryWhile to re-send a request?

I would like to achieve the following kind of orchestration with CAMEL:
1. Client sends an HTTP POST request to CAMEL
2. CAMEL sends an HTTP POST request to the external endpoint (server)
3. External server replies with a 200 OK
4. CAMEL sends an HTTP GET request to the external endpoint (server)
5. External server replies
After step 5, I want to check the reply: if the reply is a 200 OK and state = INPROGRESS (this state can be retrieved from the received XML body), I want to re-transmit the HTTP GET to the external endpoint until the state is different from INPROGRESS.
I was thinking to use the retryWhile statement, but I am not sure how to build the routine within the route.
Eg, for checking whether the reply is a 200 OK and state = INPROGRESS, I can easily introduce a Predicate. So the retryWhile already becomes like:
.retryWhile(Is200OKandINPROGRESS)
but where should I place it in the route so that the HTTP GET will be re-transmitted ?
Eg: (only taking step 4 and 5 into account)
from("...")
    // here format the message to be sent out
    .to("external_server")
    // what code should I write here ??
    // something like:
    // .onException(alwaysDo.class)
    //     .retryWhile(Is200OKandINPROGRESS)
    //     .delay(2000)
    // .end()
    // or maybe it should not be here ??
I am also a bit confused about what the "alwaysDo.class" should look like.
Or should I use something completely different to solve this orchestration?
(I just want to re-transmit as long as I get a 200 OK with INPROGRESS state ...)
Thanks in advance for your help.
Someone answered my question on the Camel Nabble forum. See:
http://camel.465427.n5.nabble.com/Camel-Apache-can-I-use-a-retryWhile-to-re-send-a-request-td5498382.html
By using a loop statement, I could re-transmit the HTTP GET request until I received a state different from INPROGRESS. The check on the state needs to be put inside the loop statement using a choice statement. So something like:
.loop(60)
    .choice()
        .when(not(Is200OKandINPROGRESS)).stop() // if state is not INPROGRESS, then stop the loop
    .end() // choice
    .log("Received an INPROGRESS reply on QueryTransaction ... retrying in 5 seconds")
    .delay(5000)
    .to("httpendpoint")
.end() // loop
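Outside Camel, the same control flow is a bounded polling loop. The Python sketch below only illustrates the logic of the route above; it does not use any Camel API. `poll_until_done` and `get_state` are hypothetical names, and `get_state` is assumed to perform the HTTP GET and return the state parsed from the XML body.

```python
import time


def poll_until_done(get_state, max_attempts=60, delay=5.0, sleep=time.sleep):
    """Call get_state() until it returns something other than "INPROGRESS".

    Returns the last state seen, which is still "INPROGRESS" if
    max_attempts repeats were not enough.
    """
    state = get_state()  # the first GET, before the loop
    for _ in range(max_attempts):
        if state != "INPROGRESS":
            break  # finished: stop polling
        sleep(delay)  # wait before sending the GET again
        state = get_state()
    return state
```

The `sleep` parameter is there only so the delay can be swapped out in tests; in normal use the default `time.sleep` applies.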
I have never tried what you are attempting, but it does not seem right.
In the code you show, the retry will only occur when an alwaysDo exception is thrown.
The alwaysDo.class you are referring to should be the Java exception class you expect to handle. See http://camel.apache.org/exception-clause.html for more details.
The idea should be to make the call, inspect the response content, and then do content-based routing (CBR) on the state attribute: either call the GET again or terminate/continue the route.
You should probably write to the Apache Camel mailing list (or via Nabble). Committers watch it and are very responsive.

Resources