How to ensure that a bot/scraper does not get blocked - screen-scraping

I coded a simple scraper whose job is to go to several different pages of a site, do some parsing, call some URLs that are otherwise called via AJAX, and store the data in a database.
The trouble is that my IP is sometimes blocked after my scraper runs. What steps can I take so that my IP does not get blocked? Are there any recommended practices? I have added a 5-second gap between requests, to almost no effect. The site is medium to large (I need to scrape several URLs) and my internet connection is slow, so the script runs for over an hour. Would being on a faster connection (like on a hosting service) help?
Basically, I want to code a well-behaved bot.
Lastly, I am not POSTing or spamming.
Edit: I think I'll break my script into 4-5 parts and run them at different times of the day.

You could use rotating proxies, but that wouldn't be a very well-behaved bot. Have you looked at the site's robots.txt?

Write your bot so that it is more polite, i.e. don't fetch everything sequentially; add delays in strategic places.

Following the guidelines set in robots.txt is a good first step. There are hosted tools such as import.io and morph.io, and there are also packages/plugins you can run yourself. For example, x-ray, a node.js library, has options that help you write responsible scrapers quickly, e.g. throttling, delays, max connections, etc.
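As an illustration (not taken from the answers above), a minimal sketch of the same ideas in Python; the site URL, user-agent string and delay are placeholder values:

# Minimal sketch of a polite scraper: it honors robots.txt, identifies itself,
# and throttles requests. All values below are illustrative placeholders.
import time
import urllib.robotparser
import requests

BASE_URL = "https://example.com"
USER_AGENT = "my-polite-bot/1.0 (contact: me@example.com)"
DEFAULT_DELAY = 10  # seconds between requests; err on the generous side

robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE_URL + "/robots.txt")
robots.read()

# Respect the site's Crawl-delay directive if it declares one.
delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY

def fetch(url):
    if not robots.can_fetch(USER_AGENT, url):
        return None  # skip paths the site disallows
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(delay)  # throttle between requests
    return response.text if response.ok else None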

Related

Host server bloating with numerous fake session files created by the hundreds every minute on a PHP (7.3) website

I'm experiencing an unusual problem with my PHP (7.3) website: it creates a huge number of unwanted session files on the server every minute (around 50 to 100 files), and I noticed all of them have a fixed size of either 125K or 0K in cPanel's file manager. The inode count spirals out of control into the thousands within hours and past a hundred thousand in a day, whereas my website really has small traffic of less than 3K visits a day, plus the Google crawler on top of that. I'm denying all bad bots in .htaccess.
I'm able to control the situation with the help of a cron command that runs every six hours and cleans all session files older than 12 hours from /tmp. However, this isn't an ideal solution, as the fake session files keep being created in great numbers, eating my server resources (RAM, processor and, most importantly, storage) and hurting overall site performance.
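For reference, a rough Python equivalent of that scheduled cleanup might look like the sketch below; it assumes PHP's default file-based sessions stored in /tmp as sess_* files, so adjust the path if session.save_path differs:

# Sketch of a stale-session sweep, equivalent to the six-hourly cron described above.
# Assumes PHP's default file-based sessions in /tmp named sess_*.
import glob
import os
import time

SESSION_DIR = "/tmp"
MAX_AGE_SECONDS = 12 * 60 * 60  # remove files older than 12 hours

cutoff = time.time() - MAX_AGE_SECONDS
for path in glob.glob(os.path.join(SESSION_DIR, "sess_*")):
    try:
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
    except OSError:
        pass  # the file may have been removed by PHP's own GC in the meantime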
I opened many of these files to examine them but found they were not associated with any valid user (I add the user id, name and email to the session upon successful authentication). Even assuming a session is created for every visitor (without an account/login), it shouldn't go beyond 3K in a day, but the session count goes as high as 125,000+ in a single day. I couldn't figure out the glitch.
I've gone through relevant posts and made checks such as adding the IP and User-Agent to sessions to track suspicious server monitoring, bot crawling or excessive proxy activity, but with no luck. I can also confirm, by watching their timestamps, that no human or crawler activity is taking place when they are created. I can see files being created every single minute, without any break, throughout the day.
I haven't found any clue yet to the root cause of this weird behavior, and I would highly appreciate any sort of help troubleshooting it. Unfortunately, the server team was unable to help much beyond adding the clean-up cron. The content of example session files is pasted below:
0K Sized> favourites|a:0:{}LAST_ACTIVITY|i:1608871384
125K Sized> favourites|a:0:{}LAST_ACTIVITY|i:1608871395;empcontact|s:0:"";encryptedToken|s:40:"b881239480a324f621948029c0c02dc45ab4262a";
Valid Ex.File1> favourites|a:0:{}LAST_ACTIVITY|i:1608870991;applicant_email|s:26:"raju.mallxxxxx#gmail.com";applicant_phone|s:11:"09701300000";applicant|1;applicant_name|s:4:Raju;
Valid Ex.File2> favourites|a:0:{}LAST_ACTIVITY|i:1608919741;applicant_email|s:26:"raju.mallxxxxx#gmail.com";applicant_phone|s:11:"09701300000";IP|s:13:"13.126.144.95";UA|s:92:"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0 X-Middleton/1";applicant|N;applicant_name|N;
We found that the issue was triggered by the hosting server's PHP version change from 5.6 to 7.3. However, we noticed that the overwhelming number of unwanted session files was not created on PHP 7.0. It's the same code base tested against all three versions. Posting this as it may help others facing a similar issue after a PHP version change.

NDB query().iter() of 1000<n<1500 entities is wigging out

I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities if a batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
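A minimal sketch of that approach, run under the usual remote_api setup, using a query cursor persisted between runs (the batch size, cursor file name and BarModel reference are illustrative, not from the answer):

# Sketch of the per-execution limit workaround: each run fetches at most
# BATCH_SIZE entities and saves the query cursor so the next run can resume.
import os
from google.appengine.datastore.datastore_query import Cursor

BATCH_SIZE = 500
CURSOR_FILE = 'barmodel.cursor'

cursor = None
if os.path.exists(CURSOR_FILE):
    cursor = Cursor(urlsafe=open(CURSOR_FILE).read())

items, next_cursor, more = BarModel.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
print 'Fetched {} entities'.format(len(items))

if more and next_cursor:
    open(CURSOR_FILE, 'w').write(next_cursor.urlsafe())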
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
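As a rough sketch of that approach in the Python runtime (the route, handler name and the work done inside get() are placeholders):

# Sketch of an admin-only maintenance handler built on the users module.
import webapp2
from google.appengine.api import users

class MaintenanceHandler(webapp2.RequestHandler):
    def get(self):
        user = users.get_current_user()
        if not user or not users.is_current_user_admin():
            self.abort(403)  # admins only
        # ... run the maintenance task here, e.g. iterate over entities ...
        self.response.write('done')

app = webapp2.WSGIApplication([('/admin/maintenance', MaintenanceHandler)])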

Will Gatling actually perform the operation, or will it only check the URLs' response time?

I have a Gatling test for an application that answers a survey; upon answering the survey, the application identifies possible answers that may pose a risk and creates what we call riskareas. These riskareas are normally created in the background as soon as the survey answering is finished. My question: I have a Gatling test with ten users who answer the survey and log out (I used the recorder to record the test); after these ten users are finished, I do not see any riskareas being created in the application. Am I missing something? Should the survey really be answered by the Gatling user (as it is in Selenium), or is it just the URLs that the Gatling test will touch?
I am new to Gatling, please help.
Gatling should be indistinguishable from a user in a web browser (or Selenium) as far as the server is concerned, so the end result should be exactly the same as if you'd gone through the process yourself. However, writing a Gatling script is a little more work than writing a Selenium script.
For performance reasons, Gatling operates at a lower level than Selenium. Gatling works with the actual data that is sent and received from the server (i.e, the actual GETs and POSTs sent to the server), rather than with user-level interactions (such as clicking links and filling forms).
The recorder will generally produce a relatively "dumb" script. It records the exact data that was sent to the server, and makes no attempt to account for things that may change from run to run. For example, the web application you are testing might have hidden form fields that contain session information, or the link addresses might contain a unique identifier or a session id.
This means that your script may not be doing what you think it's doing.
To debug the script, the first thing to do is to add checks on each of the requests, to validate that you are getting the response you expect (for example, check that when you submit page 1 of the survey, you are taken to page 2 - check for something that you'd only expect to find on page 2, like a specific question).
Once you know which requests are failing, look at what data was sent with the request, and try to figure out where it came from. You will probably find that there are session ids, view state, or similar, that must be extracted from the previous page.
It will help to enable request and response logging, as per the documentation.
To simplify testing of web apps, we wrote some helper functions to allow tests to be written in a more Selenium-like way. Once you understand what your application is doing, you may find that it simplifies scripting for you too. However, understanding why your current script doesn't work the way you expect should be your first step.

Consecutive XML HTTP Requests seem to block on Google App Engine

I am working on an application on Google App Engine. Roughly this is what I do:
The user screen is split into 2 parts (actually 3, but let's leave that out for now). The left part (this takes up to 75% of the screen) has a document with some words highlighted. When one of these highlighted words is clicked, the right part displays various meanings of it, example usage, etc. The way this works is that clicking the word sends an XML HTTP request to the server, where the sample usage(s)/meaning(s) are retrieved from the datastore. This data is returned and displayed.
My problem:
After I click on a few words consecutively, the application seems to "hang": say I click on 5 words in quick succession; clicking on the 6th word (or any word after that) doesn't replace the info for the 5th word in my right panel.
Since some datastore properties (at least single-valued properties) are indexed by default, I'm guessing retrieval is not the bottleneck here. It is probably the requests.
Is such an issue known with the GAE? Any workarounds possible?
Kind of in a soup with this - the application was supposed to go live today. Urgent help required!
Thanks! :)
You're probably being limited to two simultaneous requests by your browser, not by App Engine. If you click on a third link before the first two have had a chance to return, make sure your app can deal with requests returning for links that should no longer be displayed.
If you were hitting a limit on App Engine, you'd see exceptions in your server logs. If you're not seeing those exceptions, it's probably a client-side issue.
Sorry for the late ack (for some reason I received the notification for the responses a day late, by which time we had managed to fix a few things). It does look like the problem was at the data end: our code was doing some inserts, and it turns out you can't do too many of them quickly; the logs reported a transaction time-out error. The reason we couldn't spot it earlier in the logs was that we were simply writing too much info out, and this was buried somewhere in it.
The clicks on the user-side were pulling data from this table.
Unfortunately, the GAE simulator doesn't simulate this timeout error, so even though we had tested with comparable volumes of data before deployment, the error never happened during development.
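As an illustration of the kind of change that helps here (assuming the Python NDB API, which may not match the actual stack; the model is hypothetical), batching the writes instead of inserting entities one at a time:

# Sketch: batch datastore writes instead of committing entities one at a time.
# WordInfo is a hypothetical model standing in for the table mentioned above.
from google.appengine.ext import ndb

class WordInfo(ndb.Model):
    word = ndb.StringProperty()
    meaning = ndb.TextProperty()

def save_entries(rows):
    entities = [WordInfo(word=w, meaning=m) for w, m in rows]
    # A single put_multi call batches the writes under the hood,
    # avoiding a long-running transaction per insert.
    ndb.put_multi(entities)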
Thanks again for your responses!
And yet again, I apologize for responding late.

How can I find out why some classic ASP pages randomly take a really long time to execute?

I'm working on a rather large classic ASP / SQL Server application.
A new version was rolled out a few months ago with a lot of new features, and I must have a very nasty bug somewhere: some very basic pages randomly take a very long time to execute.
A few clues :
It isn't the database: when I run the query profiler, it doesn't detect any long-running queries
When I launch IIS Diagnostic tools, reqviewer shows that the request is in state "processing"
This can happen on ANY page
I can't reproduce it easily, it's completely random.
To give an idea of "a very long time": this morning I had a page take more than 5 minutes to execute, when it normally should be returned to the client in less than 100 ms.
The application can handle rather large uploads and downloads of files (up to 2 GB in size). This is also handled with a classic ASP script, using SoftArtisans FileUp. I don't think it can cause the problem, though; we've had these uploads for quite a while now.
I've had the problem on two separate servers (in two separate locations, with different sets of data). One is running the application with good ol' SQL Server 2000 and the other runs SQL Server 2005. The web server is IIS 6 in both cases.
Any idea what the problem is, or how to solve that kind of problem?
Thanks.
Sebastien
Edit :
The problem came from memory fragmentation. Some ASP pages were used to download files from the server. File sizes could range from a few KB to more than 2 GB. These variations in size induced memory fragmentation. The ASP pages could also take quite some time to execute (the time for the user to download the file, minus what is cached at IIS's level), which is not really standard for server pages that should execute quickly.
This is what I did to improve things:
Put all the download logic in a single ASP page with session state turned off
That allowed me to put that ASP page in a specific application pool that could be recycled every so often (downloads would no longer disturb the rest of the application)
Turn on the LFH (Low Fragmentation Heap), which is not on by default on Windows 2003, in order to reduce memory fragmentation
References for LFH :
http://msdn.microsoft.com/en-us/library/aa366750(v=vs.85).aspx
Link (there is a dll there that you can use to turn on LFH, but the article is in French. You'll have to learn our beautiful language now!)
I noticed the same thing on a classic ASP + AJAX application that I worked on. Using Timer, I timed the page load to be 153 milliseconds, but in the Firebug waterfall chart it randomly says 3.5 seconds. The Timer output is on the response, and the waterfall chart claims that it's Firefox waiting for a response from the server. Because the waterfall chart also shows the response, I can compare the waterfall chart to the timer, and there's a huge discrepancy 'every so often'.
Can you establish whether this is a problem for all pages or a common subset of pages?
If it is a subset, examine what these pages have in common; for example, do they all use a specific COM DLL that other pages don't?
Does this problem affect multiple clients or just a few?
In other words, is there an issue with a specific browser or OS version?
Is this public or intranet?
Can you reproduce the problem from a client you own?
Is there any chance there are some full-text search queries running on SQL Server?
Because if so, and if SQL Server has no access to the internet, it may cause a 45-second delay every few hours or so when it tries to check its certificates (though this does not apply to SQL Server 2000).
For a detailed explanation of what I'm referring to, read this.
Are any other apps running on your web server? If so, is your problematic app in the same app pool as any of them? If so, try creating a dedicated app pool for it. Maybe one of the other apps is having a problem and is adversely affecting yours.
One thing to watch out for is that if you have server-side debugging turned on in IIS, the web server will run in single-threaded mode.
So if you try to load a page, and someone else has hit that URL at the same time, you will be queued up behind them. It will seem like pages take a long time to load, but it's simply because the server is doling out page requests in a single-file line and sometimes you aren't at the front of the line.
You may have turned this on for debugging and forgot to turn it off for production.
