Blocking Web Scrapers [duplicate] - screen-scraping

This question already has answers here:
Detecting 'stealth' web-crawlers
(11 answers)
Closed 9 years ago.
What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?

Captchas
Form submitted in less than a second
Hidden (by css) field gets a value submitted during form submit
Frequent page visits
Simple bots can not scrap text from flash, images or sound.

Unfortunately your question is similar to people asking how do you block spam. There's no fixed answer, and it won't stop someone/bot which is persistent.
However, here are some methods that can be implemented:
Check User-Agent (this could be spoofed though)
Use robots.txt (proper bots will - hopefully respect this)
Detect IP addresses that access a lot of pages too consistently (every "x" seconds).
Manually, or create flags in your system to check who all are going on your site and block certain routes the scrapers take.
Don't use a standard template on your site, and create generic css classes - and don't put in HTML comments in your code.

You can use robots.txt to block bots that take notice of it (but still let through other known instances such as google, etc) - but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted you could block particular user agents from accessing your website, just be returning either a empty/default screen and/or a particular server code.

I don't think there is a way of doing exactly what you need, because in websites crawlers/scrapers you can edit all headers when requesting a page, like User-Agent, and you won't be able to identify if there is a user from Mozilla Firefox or just a scraper/crawler...

Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.

Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/
From their site:
Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.
Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.

Related

Detect user location in React app, Can Fastly be used with Non Server Side Rendering react app?

this is my first task of detecting users' geo locations and I am a fairly new dev.
The app uses React and backend is node.js.
Currently we have some functions that calls an api which returns users' locations.( this takes a while)
But, two other options right now is use:
Geolocation API <--- this might need users' permission?
Fastly
For Fastly, I am asking
Does it work with non server side rendering app?
For production site, we have fastly set up in route53. but need to ask devops for staging environment. ( I got this info from others but do not know what that means )
Can someone even explains to me how fastly work and what needs to be set up?
Basically any information is appreciated. I do not know what should be googled to find out the answers.
Thanks.
If you have Fastly fronting your app, then YES you can definitely use Fastly to provide geolocation information.
Just to be clear (as you mentioned you were unfamiliar with Fastly and more generally are a "new dev"), when I say "fronting your app" I mean: when a client (e.g. a user's web browser) makes a request for https://yourapp.com/, does the request first get routed through Fastly? If it does, then Fastly will proxy the request through to your app and any data you send back through Fastly to the client will likely be cached to make future requests for all your users much quicker (this is one of the many functions Fastly provides).
Fastly has lots of products, but for your primary purposes there are two platform services Fastly offers:
Content Delivery (CDN) which is built on Varnish/VCL (if your ops team already has Fastly setup then this is likely what they have).
Compute#Edge which is built upon WebAssembly.
I would highly recommend reading the following resources to understand more about the Fastly platform options:
Content Delivery with VCL
Content Delivery with Compute#Edge
As far as using Fastly to handle geolocation information, I'll point you to the following resources:
https://developer.fastly.com/solutions/examples/geo-ip-api-at-the-edge
https://developer.fastly.com/solutions/examples/decorating-origin-requests-with-geoip
Also search the following page for references to "geolocation" as there are quite a few 'examples' that you might be interested in:
https://developer.fastly.com/solutions/examples/
I would also suggest having a play around with https://fiddle.fastly.dev which let's you use either VCL or any of the supported Compute#Edge languages to test out ideas without needing to have a real Fastly service setup. This will give you a chance to trial out some geolocation code.
Lastly, you can also have a read through the first half of https://www.integralist.co.uk/posts/fastly-varnish/ which covers some basics about Fastly's use of Varnish/VCL (but I'd suggest reading the official references, linked above, first).
Any other questions, then please feel free to reach out to support#fastly.com who will be happy to help.

How a hacker can perform an xss attack if he does not have access to the user computer? [duplicate]

This question already has answers here:
What does it mean when they say React is XSS protected?
(2 answers)
Closed 2 years ago.
I am reading some articles about security in React applications. Indeed, I use localstorage to store the user's infos and I've seen that an xss attack could easily allow a hacker to steal them.
However, I understand that in React, an xss attack can only be performed through a setDangerouslyInnerHtml tag that displays a content written in an input. This way, you can steal his infos, cookies session, ect. and send them to your website.
But a hacker could only do this if he has the chance to write his script on the user's computer right? So, if I don't use any setDangerouslyInnerHtml tag, is the localstorage safe in a React app? If not, how a hacker could run such an attack on the website?
If the user uses a public computer it might be possible.
If you have some functionality which allows external users to post content on your site, for example comments or reactions then someone might write a script which sends localstorage data to a hacker.
There are a lot of ways to exploit this, check owasp for more detailed explanation.
https://owasp.org/www-project-top-ten/
Developers must accept what attackers can do:
They can retheme an entire site,
They can too make "bot" scripts to automate tasks and in other words flood your server if that was the task.
All limits defined in JS/HTML can and will be bypassed, (e.g: character lengths in forms/etc)
The entire page can be re-written to not talk to your server-right, in other words crashing it and more if not handled/detected.
The list goes on but accept it's pretty much all off the table if someone wants to pry hard enough.
There's not a whole lot you can do to prevent this, to explain! You can add an external script from randomxyxsite.com and though trusted could under-go an attack where that script now runs "loggers or some type of analytic grabbing bot", this in my opinion is easily avoided by not adding external scripts if you can.
Though I said what I said originally, here's where you're stuck...
Any user can open console/build extensions or use a third-party loader like Tampermonkey and other alternatives and execute script at their will. This too can become "shared" and comparable to botnet behavior.
So what can you do to stop clients from mis-behaving or "super-modding" their content for malicious server-use?
Some ways to safe-guard:
Server-sided requests should pass through some form of check/sanitization to ensure that whatever any of the clients pass-to it is absolutely safe to absorb.
Never let the user tell you who they beyond login, define these users by sessionid; know these users by their session and when user<>user, get between them and follow the above point.
Keep as much as possible private. Public variables/classes/functions are easily re-written during run-time leaving some features you maybe intended on to fall apart.
window.PayFeature = function(){};
ALLOW XSS:
If feared, a developer should study it more. As much as a user can distort/change their end it's only an issue if the traffic changes or the data received from them starts becoming attack like. So for a developer your best bet is to actually rate-limit, set rules and more for users so that abuse is detected and stopped. As long as you do that, you should never fear it but welcome it, when server is secured it becomes a matter of spam (potential botnet)

will gatling actually perform the operation or will it check only the urls' response time?

I have a gatling test for an application that will answer a survey and upon answering this survey, the application will identify possible answers that may pose a risk and create what we call riskareas. These riskareas are normally created in the background as soon as the survey answering is finished. My question is I have a gatling test with ten users who will go and answer the survey and logout, I used recorder to record the test; now after these ten users are finished I do not see any riskareas being created in the application. Am I missing something--should the survey be really answered by gatling (like it does in selenium) user or is it just the urls that the gatling test will touch ?
I am new to gatling please help.
Gatling should be indistinguishable from a user in a web browser (or Selenium) as far as the server is concerned, so the end result should be exactly the same as if you'd gone through the process yourself. However, writing a Gatling script is a little more work than writing a Selenium script.
For performance reasons, Gatling operates at a lower level than Selenium. Gatling works with the actual data that is sent and received from the server (i.e, the actual GETs and POSTs sent to the server), rather than with user-level interactions (such as clicking links and filling forms).
The recorder will generally produce a relaitvely "dumb" script. It records the exact data that was sent to the server, and makes no attempt to account for things that may change from run to run. For example, the web application you are testing might have hidden form fields that contain session information, or the link addresses might contain a unique identifier or a session id.
This means that your script may not be doing what you think it's doing.
To debug the script, the first thing to do is to add checks on each of the requests, to validate that you are getting the response you expect (for example, check that when you submit page 1 of the survey, you are taken to page 2 - check for something that you'd only expect to find on page 2, like a specific question).
Once you know which requests are failing, look at what data was sent with the request, and try to figure out where it came from. You will probably find that there are session ids, view state, or similar, that must be extracted from the previous page.
It will help to enable request and response logging, as per the documentation.
To simplify testing of web apps, we wrote some helper functions to allow tests to be written in a more Selenium-like way. Once you understand what your application is doing, you may find that it simplifies scripting for you too. However, understanding why your current script doesn't work the way you expect should be your first step.

Heuristics to discover spammers/bots (In forums, blogs etc)

The ways I can think of are:
Measure the time between actions.
Compare the posts' content (if they're too similar to each other) or, better yet, only the posted links.
Checking the distribution over a period of time the user is active (if the user is active, say posting once every hour, for a week, then either we have a superman or a bot here).
Some special activity expected: like in stackoverflow, I would expect users to press their user name link (top middle) to see their new answers, comments, questions etc.
(added by chakrit) Number of links in a post.
Not heuristic. Use some async JS for user login. (Just makes life a bit harder on the bot programmer).
(added by Alekc) Not heuristic. User-agent values.
And, How could I forget Google's approach (mentioned down by Will Hartung). Give users the ability to mark someone as Spam, enough Spam votes means this is a Spam user. (calculating what is enough users, is the work here).
Any more ideas?
I might be over estimating the intelligence of bot creators, but number 6 is completely useless against any semi decent bot creator. Using the C# browser control to create your bot would pretty much render 6 useless. From what I've seen with that type of software that's a pretty common approach.
Validating on the useragent is pretty much useless too all of the blog spam I use to get was from bots appearing to be valid web browsers.
I use to get a lot of blog spam. I would literally be deleting hundreds of comments a day. I made use of reCaptcha and now I might get 1 a month.
If you really try to make something like this. I would attempt by doing the following:
User starts off with no ability to post a url.
After X number of posts have been analyzed in relation to the other posts in the thread then give them access to post urls.
The users activity on the site, the post quality, and what ever other factors you deem necessary will be a reputation for that users IP.
Then based the reputation of the IP and the other IPs on the same subnet you can make other decisions on whatever you want.
That was just the first thing that came to mind. Hope it helps.
The number of links in a post.
I believe I've read somewhere that Akismet use the number of links as one of its major heuristics.
And most of spam comments at my blog contains 10+ links in them.
Speaking of which... you just might want to check out the Akismet API itself .. they are extremely effective.
How about a search for spam related keywords in the post body?
Not a heuristic but an effective approach: You can also keep up-to-date with the stats published by StopForumSpam using their APIs.
Time between page visits is common I believe.
I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.
You might want to check if they've come from a Spam blacklist IP address (See http://www.spamhaus.org/)
There is another answer that suggests using Akismet for detecting spam, which I completely endorse.
However, they are not the only player on the block.
There is TypePad AntiSpam which uses the same heuristics as Akismet, as well as the same API (just a different URL and api key, the structure of the calls is the same). It can be safe to say they pretty much take the same approach as Akismet.
You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the IP address of the user, and if it is a known malicious IP, it will tell you (harvester or something like that).
Finally, you can check LinkSleeve which approaches comment spam with what it claims to be a different way. Basically, it checks the links that are being linked to in comments, and based on where the links are going to, makes a determination.
Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
I have some doubts about 4° point, anyway i would also add User-Agent. It's pretty easy to fake, but in my experience, about 90% of bots are using Perl as UA
I am sure there is a webservice of some kind that you can get a list of top SEO keywords, check the content for those keywords. if the content is to rich in keywords suspect it as being spam.

How do screen scrapers work? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general a screen scraper is a program that captures output from a server program by mimicing the actions of a person sitting in front of the workstation using a browser or terminal access program. at certain key points the program would interpret the output and then take an action or extract certain amounts of information from the output.
Originally this was done with character/terminal outputs from mainframes for extracting data or updating systems that were archaic or not directly accessible to the end user. in modern terms it usually means parsing the output from an HTTP request to extract data or to take some other action. with the advent of web services this sort of thing should have died away, but not all apps provide a nice api to interact with.
A screen scraper downloads the html page, and pulls out the data interested either by searching for known tokens or parsing it as XML or some such.
In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Here's a tiny bit of screen scraping implemented in Javascript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
//Show My SO Reputation Score
var repval = $('span.reputation-score:first'); alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() + '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
Typically You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decents APIs. I've worked with screen scraping companies and often the APIs are so problematic (ranging from cryptic errors to bad results) and often don't give the full functionality that the website provides that it can be better to screen scrape (web scrape if you will). The extranet/website portals are used my more customers/brokers than API clients and thus are better supported. In big companies changes to extranet portals etc.. are infrequent, usually because it was originally outsourced and now its just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on particular route and time, an insurance quote, a shipping quote etc..
In terms of doing it, it can be as simple as web client to pull the page contents into a string and using a series of regular expressions to extract the information you want.
string pageContents = new WebClient("www.stackoverflow.com").DownloadString();
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the html
page, and pulls out the data
interested either by searching for
known tokens or parsing it as XML or
some such.
That is cleaner approach than regex... in theory.., however in practice its not quite as easy, given that most documents will need normalized to XHTML before you can XPath through it, in the end we found the fine tuned regular expressions were more practical.

Resources