Identifying hostile web crawlers - screen-scraping

I am wondering whether there are any techniques to identify a web crawler that collects information for illegal use - plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (i.e. one not on a whitelist of known crawlers such as Googlebot) and send bogus information to the scraping crawler.
If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
If, as a defender, I detect the same agent/IP, the attacker will randomize the agent.
And this is where I get lost: if an attacker randomizes both the intervals and the agent, how do I avoid discriminating against legitimate proxies and machines hitting the site from the same network?
I am thinking of probing the suspect agent for JavaScript and cookie support. If the bogey can't do either consistently, then it's a bad guy.
What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?
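For what it's worth, here is a rough sketch of what that JavaScript/cookie check could look like, assuming an Express backend with cookie-parser; the cookie name, threshold, and response text are made up for illustration:

    const express = require('express');
    const cookieParser = require('cookie-parser');
    const app = express();
    app.use(cookieParser());

    // In the page's JavaScript, a real browser would execute something like:
    //   document.cookie = 'js_check=1; path=/';
    // Clients that keep requesting pages without ever presenting that cookie
    // are likely non-browser crawlers.
    const suspects = new Map(); // ip -> page hits without a passed check

    app.use((req, res, next) => {
      if (req.cookies.js_check === '1') {
        suspects.delete(req.ip);                       // behaves like a browser
      } else {
        const hits = (suspects.get(req.ip) || 0) + 1;
        suspects.set(req.ip, hits);
        if (hits > 50) {
          return res.status(403).send('Please enable JavaScript and cookies');
        }
      }
      next();
    });

    app.listen(3000);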

My solution would be to set a trap. Put some pages on your site where access is disallowed by robots.txt. Link to them from your pages, but hide the links with CSS, then IP-ban anybody who visits them.
This forces the offender to obey robots.txt, which means you can keep important information or services permanently out of his reach, making his carbon-copy clone useless.
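A minimal sketch of such a trap, assuming an Express server; the /trap-page URL and the in-memory ban list are placeholders (a real setup would persist bans and probably enforce them at the firewall):

    const express = require('express');
    const app = express();
    const banned = new Set(); // IPs that followed the hidden link

    // Reject anyone previously caught in the trap.
    app.use((req, res, next) => {
      if (banned.has(req.ip)) return res.status(403).send('Forbidden');
      next();
    });

    // robots.txt tells well-behaved crawlers to stay away from the trap.
    app.get('/robots.txt', (req, res) => {
      res.type('text/plain').send('User-agent: *\nDisallow: /trap-page\n');
    });

    // The trap itself: linked from every page with a CSS-hidden anchor,
    // e.g. <a href="/trap-page" style="display:none">stats</a>.
    // Only scrapers that ignore robots.txt and follow invisible links land here.
    app.get('/trap-page', (req, res) => {
      banned.add(req.ip);
      res.status(403).send('Forbidden');
    });

    app.listen(3000);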

Don't try to recognize crawlers by IP, timing, or intervals--use the data you send to the crawler to trace them.
Create a whitelist of known good crawlers--you'll serve them your content normally. For the rest, serve pages with an extra bit of unique content that only you will know how to look for. Use that signature to later identify who has been copying your content and block them.
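A minimal sketch of that idea, assuming an Express server; the HMAC secret, the route, and the comment-based embedding are illustrative only - the point is that each response carries a signature you can later search for in a suspected copy and match against your logs:

    const express = require('express');
    const crypto = require('crypto');
    const app = express();

    // Derive a stable, invisible token from the requester's identity.
    // The secret and the token format are placeholders for this sketch.
    const SECRET = process.env.WATERMARK_SECRET || 'change-me';
    function watermarkFor(req) {
      return crypto.createHmac('sha256', SECRET)
        .update(`${req.ip}|${req.headers['user-agent'] || ''}`)
        .digest('hex')
        .slice(0, 16);
    }

    app.get('/listing/:id', (req, res) => {
      const token = watermarkFor(req);
      // Log which token was served to whom, so a copied page can be traced back.
      console.log(new Date().toISOString(), req.ip, token);
      // Embed the token somewhere a copier is unlikely to notice it.
      res.send(`<html><body>
        <h1>Listing ${req.params.id}</h1>
        <p>...public content...</p>
        <!-- w:${token} -->
      </body></html>`);
    });

    app.listen(3000);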

And how do you keep someone from hiring a person in a low-wage country to use a browser to access your site and record all of the information? Set up a robots.txt file, invest in security infrastructure to prevent DoS attacks, obfuscate whatever code is exposed (like JavaScript), patent your inventions, and copyright your site. Let the legal people worry about someone ripping you off.

Related

IPFS can't really force nodes to delete an uploaded file - isn't that a problem?

As this decentralisation wave sweeps across the digital world, I was wondering how you can remove content that you have just uploaded to a decentralized network.
As I understand it, more and more people want decentralized services because, as opposed to the client-server architecture, they give you more ownership of your stuff and everything is more transparent. But what happens if you mess up, or the company you're a client of messes up, and personal information you clearly don't want others to access gets uploaded? Since it's a peer-to-peer network, everybody has access to it and there's no way to force them to delete it.
I think what I am trying to understand is how this decentralized future will play out with private data. Will there be a centralized place for private data while we do other things on IPFS and similar apps? Because if so, what's the purpose - why not continue as things are right now? Maybe I am still not seeing all the use cases...
IPFS does allow you to delete a file; you just need to do so on every node hosting it.
If some nodes aren't under your control, the process is fairly simple: run ipfs dht findprovs <CID of the file you want deleted> to list all peers hosting the file, then for each of them find its IP with ipfs dht findpeer <Peer ID>, then use whois or BGP data to identify the ISP and send them a C&D, a GDPR claim, or whatever applies.
Apart from the tools being IPFS-centric, it's the exact same process as for regular good old web2 over HTTP.
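A minimal sketch of that process, assuming a local go-ipfs/kubo daemon with the ipfs CLI on the PATH; the CID is taken from the command line, and the whois/ISP step stays manual:

    // Enumerate providers of a CID and print their peer IDs and addresses,
    // so the hosting ISPs can be identified and contacted.
    const { execSync } = require('child_process');

    const cid = process.argv[2]; // the CID you want removed
    const peers = execSync(`ipfs dht findprovs ${cid}`)
      .toString().trim().split('\n').filter(Boolean);

    for (const peer of peers) {
      let addrs = '';
      try {
        addrs = execSync(`ipfs dht findpeer ${peer}`).toString().trim();
      } catch (e) {
        addrs = '(peer not reachable)';
      }
      console.log(peer, '\n', addrs, '\n');
      // Next step (manual): resolve the public IPs via whois/BGP data
      // and contact the hosting ISP with a takedown or GDPR request.
    }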
You might think that with multiple nodes it's unlikely that everyone will comply, whatever jurisdiction you rely on to claim your right to be forgotten.
But that already happens with HTTP: you can host your server in a country that doesn't recognize whatever law you invoke to have those files removed, or use Tor and mostly not worry about legal threats.
The GDPR, or any law like it, is already ineffective at removing things from the web; the goal is more to scare big players and help politicians keep their jobs (putting an ineffective solution to a problem few people understand in place earns a good public reception and re-election).
Yes, it can be a problem. Companies that store customer data should not store it on a blockchain: in Europe, under the GDPR, they are obliged to delete the data if the customer requests it.
I ran into a similar issue at my company when we were evaluating whether to use a decentralized network in a project. This link contains a statement from R3 (which developed Corda, a DLT for business) on the topic. It is from three years ago, but in my opinion it is still relevant.
So the solution is to store only a reference to the user (like an ID) on-chain and keep the sensitive data off-chain.
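A toy sketch of that pattern; the Map and the array below are hypothetical stand-ins for the real off-chain store and the ledger, just to show that only an ID and a hash ever go on-chain:

    const crypto = require('crypto');

    const offChainDb = new Map();   // id -> personal data (mutable, deletable)
    const ledger = [];              // stand-in for the append-only chain

    function storeCustomer(record) {
      const id = crypto.randomUUID();
      offChainDb.set(id, record);   // sensitive data stays off-chain
      const digest = crypto.createHash('sha256')
        .update(JSON.stringify(record)).digest('hex');
      ledger.push({ id, digest });  // only a reference and a hash on-chain
      return id;
    }

    function forgetCustomer(id) {
      // GDPR-style erasure: drop the off-chain record; the on-chain entry
      // becomes a dangling reference that reveals nothing by itself.
      offChainDb.delete(id);
    }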
Another interesting project is Atala Prism, but unfortunately I have not yet had time to dive into it.

Ad Blocker w/ Segment.io

I'm considering using segment.io for several of my client-side 3rd party API needs, but I'm a little concerned about ad-blockers.
My app has no ads, but I do a lot of event-tracking for product analytics, as well as error tracking.
Segment.io offers a nice all-in-one solution, but if it's blocked, and all my eggs are in that basket, then, well, I won't have any eggs left, or however that idiom ends.
So my question is: is there a way to integrate multi-purpose event tracking (segment.io, keen.io, etc.) that isn't as susceptible to ad-blocking?
My app is mostly serverless, using a Firebase+AWS Lambda setup, so I've tried to think of some kind of back-end solution, but no big ideas so far.
ETA: I'm not looking to track adblocking users or violate anyone's trust. My question is about event tracking unrelated to a user's identity, and whether that's possible with an all-in-one event-tracking library that might be ad-blocked.
First, I'd typically consider such blocking to be "privacy" blocking rather than ad blocking. So instead of Adblock, it's more likely to be Ghostery or uBlock Origin.
Although most websites' uses of analytics are benign (improving conversion rates, catching browser exceptions, etc.), the concern many have is that it allows third-party analytics providers (including Segment, etc.) to track users across multiple websites. Most of these analytics providers aren't actually interested in that, but better safe than sorry?
The ethics of wanting analytics on all your webapp use are far more nuanced than "privacy good, tracking bad", and I don't think this is the forum for it, so I'll give you a technical answer. Just note that your disclaimer about not wanting to "track adblocking users" doesn't really hold: if your aim is to gather analytics about them, that's still essentially tracking. Otherwise, just use a hosted solution and accept that maybe 10-20% of users won't show up in your analytics.
The bad news: basically every "hosted" analytics solution is, or will be, in the block lists. Not only are their API hosts blocked directly, but there are also blocks in place based on the names of the JS files you may try to include.
The good news: you can work around it by relaying events through your own API, and AWS API Gateway, which you may already be using, is perfect for this.
There are multiple steps to this.
Step 1: The analytics provider needs to offer a fully bundled/built JS file. If they require you to pull the script dynamically from their own servers, it will be blocked before it even downloads.
Step 2: Rename the bundled script so that it doesn't trigger any filename-based blocks, e.g. from mixpanel.umd.js to mp.js, and serve it from your own server.
Step 3: Create an API gateway that relays events back to the "correct" API (e.g. to api.analyticshost.com). You can actually do this with AWS API Gateway alone (no Lambda required) if you pass through the right headers and URL parameters.
Step 4: Initialise the library to use your API host rather than the default one.
The result is that (a) the browser no longer pulls the analytics script from the analytics provider's CDN, getting it from your server instead, and (b) the browser sends events to your API, which relays them on to the analytics provider.
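As a rough illustration, here is the same relay idea with a plain Express server instead of API Gateway; the upstream host, the /collect path, and the /assets location are placeholders, and you would point the library's apiHost (step 4) at /collect on your own domain - check your provider's documentation for the real ingestion endpoint:

    const express = require('express');
    const app = express();
    app.use(express.json());

    // Placeholder for the provider's real ingestion endpoint.
    const UPSTREAM = 'https://api.analyticshost.example';

    // The renamed, self-hosted analytics bundle (step 2) is served from here,
    // so the browser never requests anything from the provider's CDN.
    app.use('/assets', express.static('public/assets'));

    // Steps 3-4: every event the page sends to /collect is relayed upstream.
    // Uses the global fetch available in Node 18+.
    app.post('/collect/*', async (req, res) => {
      const upstreamUrl = UPSTREAM + req.originalUrl.replace(/^\/collect/, '');
      const upstream = await fetch(upstreamUrl, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(req.body),
      });
      res.status(upstream.status).end();
    });

    app.listen(3000);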
Segment also provides server-side tracking libraries. These can be quite useful when you want to gather metrics for certain types of events that might be blocked on the client. At its simplest, Segment has an HTTP source, but a number of popular languages are supported as well.
https://segment.com/docs/connections/sources/catalog/#server
The classic example is the order-completed event; this would typically fire on your server once the transaction has been committed to the database. Regardless of browser configuration, you can send and track this event from the server.
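A short sketch of that order-completed example using Segment's Node library (analytics-node); the write key, the hook point, and the order fields are placeholders:

    // `npm install analytics-node`
    const Analytics = require('analytics-node');
    const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

    function onOrderCommitted(order) {
      // Called after the transaction is committed to the database,
      // so it fires regardless of what the user's browser blocks.
      analytics.track({
        userId: order.userId,
        event: 'Order Completed',
        properties: { orderId: order.id, revenue: order.total },
      });
    }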
Be sure you respect the user's consent-management settings here, though.
A lot of valid points are already mentioned in the accepted answer; I would add a few technical considerations for minimizing the impact of ad blockers on tracking tools (Segment, Google Tag Manager, etc.):
Develop for server-side tracking. Whatever runs on the server cannot be blocked by ad blockers. However, this is usually tricky and very custom; Segment gives some examples of it as well.
Use managed client-side proxy solutions like DataUnlocker. This is great and does not require many code changes.
Use self-hosted open-source solutions for proxying Google Analytics and Google Tag Manager like this or this. I believe these solutions can be extended to support Segment as well.

Secure database and webpage against modification

My website provides extremely sensitive information (think of bank account numbers) publicly through webpages and web services. Customers may modify this information once authenticated with a username and a password.
Any intrusion that successfully modified the entries of the database, or the information displayed on the webpage, would be disastrous, as account numbers might then be incorrect and money could be directed to a malicious bank account.
Do you have any general advice about an architecture that would make such a service as robust as possible? I would not be responsible in the case of a weak password, so my main concern is attacks that simply bypass the authentication process and modify the database without triggering any alert on my side; the HTML of the webpage could also be modified directly to show different information...
Thank you
In this case I would make sure to harden the system itself as well as possible. This covers a very broad spectrum, ranging from security roles, transaction-based use of the database, and logging to the prevention of all sorts of attacks such as SQL injection and cross-site scripting, and, if the system really is that sensitive, client certificates and general IP checks (for example, a whitelist of IPs whose requests are not instantly refused). Your host architecture also has to be protected regardless of the security features inside your application (keywords: firewalls, user privileges, etc.). During development there should always be automatic code-analysis software (like Sonar) running to detect logical errors and the like.
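To make one of those points concrete, here is a tiny sketch of the IP whitelist idea as Express middleware; the addresses are placeholders, and in practice you would enforce this at the firewall as well:

    const express = require('express');
    const app = express();

    // Placeholder allowlist -- in practice, your known office/partner ranges.
    // Note that req.ip may need `app.set('trust proxy', ...)` behind a proxy.
    const ALLOWED_IPS = new Set(['203.0.113.10', '203.0.113.11']);

    app.use((req, res, next) => {
      if (!ALLOWED_IPS.has(req.ip)) {
        console.warn('Refused request from', req.ip); // feed this into monitoring
        return res.status(403).send('Forbidden');
      }
      next();
    });

    app.listen(3000);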
It could also be a good idea to have a second system just to monitor your primary system's status. This system should log and notify you of:
changes made to the system itself (for instance, if someone gets access to your business logic and removes authentication code)
changes made to the database that are not consistent with your primary system's state
suspicious actions: banks, for example, have rules that apply to your account. If you have only ever made payments within Europe and then, out of nowhere, make a huge payment to, say, China, you receive a notification asking you to confirm the payment, and the payment is not triggered without that second confirmation from the customer.
In the end, as you correctly pointed out, you can only harden the system as well as possible but never make it "100%" safe (at least in theory), so a good level of security also means being able to detect unwanted changes, identify the exact changes that were made, and have enough information about the overall state of your system to allow a rollback or manual correction of a corrupted state if it does happen.
Even after implementing these techniques, you would have to continuously check for security bugs in the frameworks and libraries you use and in the system as a whole (for example, with penetration-testing frameworks that automatically try to break your system).
What I want to show with my answer is what the comments already suggest: this is a very broad and complex topic with multiple layers of security concerns, which you will either have to study yourself or cover with framework solutions that promise to take care of them (web frameworks, for example, often include basic XSS prevention).
Without wanting to sound harsh: if you have to ask this question on Stack Overflow, you're not really qualified to work on this project.
The financial value of your data sounds like it's enough for an attacker to expend significant resources breaching your defenses - and the consequences of such a breach would be disastrous for your organization and its customers; it could lead to the organization having to close down. You really don't want to be learning about security from strangers on the internet in this case.
One place to start learning is the established standards for managing financial information, often referred to as the "PCI standards"; these provide guidelines for hardware, software, and processes for organizations that handle payment details.
There are numerous books on IT security; I like the "Hacking Exposed" series, and "Security Engineering".
You might also bring in specialized IT security consultants; I've worked with a number of these guys, and many of them are very good at helping you engineer security into your solution.

Website content crawling

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers.
We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue?
All suggestions are highly appreciated.
You could try a spider trap, but they could add a check for that.
You could also add a rate limiter, and after a certain rate force them to solve a CAPTCHA, but you might also annoy your regular users.
But really, anything you create they can probably adapt to and work around. Your best bet might just be what Developer Art said: get a lawyer.
If there are many pages of data, you can monitor the IPs of visitors and make sure a given IP sees no more than a fraction of your pages per day.
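A minimal sketch of such a per-IP daily cap as Express middleware; the limit, the in-memory counters, and the daily reset are placeholders for a real store:

    const express = require('express');
    const app = express();

    const DAILY_LIMIT = 200;            // placeholder threshold
    let pagesSeen = new Map();          // ip -> pages served today

    // Reset the counters once a day.
    setInterval(() => { pagesSeen = new Map(); }, 24 * 60 * 60 * 1000);

    app.use((req, res, next) => {
      const count = (pagesSeen.get(req.ip) || 0) + 1;
      pagesSeen.set(req.ip, count);
      if (count > DAILY_LIMIT) {
        return res.status(429).send('Daily page limit reached');
      }
      next();
    });

    app.listen(3000);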
Ultimately what you want is a contradiction: you do want people to download it to their computers (to view it now); but you don't want them to download it to their computers (to view it later).

How do I create a web application where I do not have access to the data?

Premise: The requirements for an upcoming project include the fact that no one except authorized users has access to certain data. This is usually fine, but this circumstance is not usual. The requirements state that there must be no way for even the programmer or any other IT employee to access this information. (They want me to store it without being able to see it, ever.)
In all of the scenarios I've come up with, I can always find a way to access the data. Let me describe some of them.
Scenario I: Restrict the table on the live database so that only the SQL Admin can access it directly.
Hack I: I roll out a change that sends the data to a different table for later viewing. Also, the SQL admin can see the data, which breaks the requirement.
Scenario II: Encrypt the data so that it requires a password to decrypt. This password would be known by the users only. It would be required each time a new record is created as well as each time the data from an old record was retrieved. The encryption/decryption would happen in JavaScript so that the password would never be sent to the server, where it could be logged or sniffed.
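As a rough illustration of Scenario II, here is a browser-side sketch using the Web Crypto API, deriving a key from the user's password with PBKDF2 and encrypting with AES-GCM; the function names and parameters are just for this sketch:

    // Derive a key from the user's password and encrypt a field before it
    // is ever sent to the server.
    async function deriveKey(password, salt) {
      const material = await crypto.subtle.importKey(
        'raw', new TextEncoder().encode(password), 'PBKDF2', false, ['deriveKey']);
      return crypto.subtle.deriveKey(
        { name: 'PBKDF2', salt, iterations: 100000, hash: 'SHA-256' },
        material, { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
    }

    async function encryptField(password, plaintext) {
      const salt = crypto.getRandomValues(new Uint8Array(16));
      const iv = crypto.getRandomValues(new Uint8Array(12));
      const key = await deriveKey(password, salt);
      const ciphertext = await crypto.subtle.encrypt(
        { name: 'AES-GCM', iv }, key, new TextEncoder().encode(plaintext));
      // Salt and IV are not secret; they travel alongside the ciphertext.
      return { salt, iv, ciphertext: new Uint8Array(ciphertext) };
    }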
Hack II: Roll out a change that logs keypresses in JavaScript and posts them back to the server so that I can retrieve the password. Or, roll out a change that simply stores the unencrypted data in a hidden field that can be posted to the server for later viewing.
Scenario III: Do the same as Scenario II, except that the encryption/decryption happens on a website that we do not control. This magic website would allow a user to input a password and the encrypted or plain-text data, then use JavaScript to decrypt or encrypt it. The user would then copy the encrypted text and paste it into the field for new records, and would also have to use this site to see the plain text for old records.
Hack III: Besides installing a full-fledged key logger on their system, I don't know how to break this one.
So, Scenario III looks promising, but it's cumbersome for the users. Are there any other possibilities that I may be overlooking?
If you can have javascript on the page, then I don't think there's anything you can do. If you can see it in a browser, then that means it's in the DOM, which means you can write a script to get it and send it to you after it has been decrypted.
Aren't these problems usually solved via controls:
All programmers need a certain level of clearance and background checks
They are trained to understand that rolling out code to access the data is a fireable or worse offense
Every change in certain areas needs some kind of signoff
For example -- no JavaScript on page without signoff.
If you are allowed to add any code you want, then there's always a way, IMO.
Ask the client to provide a non-disclosure agreement for you to sign, sign it, then look at as much data as you want.
What I'm wondering is: what exactly will you be able to do with encrypted data anyway? Pretty much all apps require you to do some filtering of the data, whether that is moving it to a required place, modifying it, sanitizing it, or displaying it. Otherwise, you're just a glorified pipe, and you don't have to do any work.
The only way I can think of where you wouldn't be looking at the data or doing anything with it would be a simple form to table mapping with CRUD options. If you know what format the data will be coming in as you should be able to roll something out with RoR, a simple skin, put SSL into the mix, and roll it out. Test with dummy data in the same format, and you're set.
In fact, is your client unable to supply dummy data for testing? If they can, then your life is simple as all you do is provide an "installable" and tell them how to edit a config file.
I think you could still create the app in the following way:
Create a dev database and set up a user for it.
Ask them for: the data type, size, and name of each field that needs to be on the screen.
Set up the screens, create columns in the database that accept the data type and size they specify.
Deploy the app to production, hooked up to an empty database. Get someone with permission (not you) to go in and set the password on the database user and set the password for the DB user in the web app.
Authorized users can then do whatever they want and you never saw what any of the data looked like.
Of course, maintaining the app and debugging is gonna be a bitch!
--In answer to comments:
Ok, so after setting up the password for the Username in the database and in the web app's config, write a program that connects to the database, sets a randomized password, then writes that same randomized password to the web config.
Prevent any outgoing packets from the machine except to a set of authorized workstations - so you can't install your spyware.
Then set the Admin password on both servers to the same random password, then delete all other users on the servers, delete the program, and delete the program source code.
Wipe the hard drives of the developer machines with the DOD algorithm, and then toss them into an industrial shredder.
If the server ever needs debugging, toss it in the trash, buy a new one, and start over from the first step.
But seriously - this is an insolvable problem. The best answer to this really is:
Tell them they can't have an application. Write your stuff on paper. Put it in a folder. Lock it in a vault. Rinse, repeat.
Wouldn't scenario 3 just expose all the data to the magic website? This doesn't sound like a solvable problem (at least I can't think of a solution).
Go with whatever solution is easiest for you to implement. I think the requirements show that the client does not understand software development, so it should be easy to sell any approach you take.
I have to say I really don't like the idea of using JavaScript on the client to decrypt the data. That is a huge hole as any script (hacker, GreaseMonkey, IE7Pro, etc.) can access the DOM and get data out of the page.
Also, it is very hard to get around the problem of keystroke loggers. If you throw those into the mix, your options are limited. At that point you need a security fob such as an RSA token (commonly used with corporate VPNs) to generate truly random PINs. That will probably be expensive, and it is a pain, and I have only seen it used with VPNs, but I assume it could work with websites as well.
As far as the website, I'd stick with HTTPS and find a way to encrypt/decrypt through the WebServer rather than relying on JavaScript. The SSL traffic isn't very prone to sniffing (very difficult to decrypt), so that allows the encryption and decryption to happen server-side which (IMHO) is more secure.
Look at banking scenarios and other financial institutions for a starting point, and then go from there. Try not to over-complicate if possible.
You can't guarantee against hacking into the data as long as you have access to the server it lives on. So tell the employer they have to host the data somewhere else and grant access to the client's browser via a secure HTTPS connection.
You can design your web page to dynamically load an XML data stream securely, and format it into a web page using an XSLT script on the client.
See http://www.w3schools.com/xsl/xsl_client.asp for examples
That way you produce the code, but you never have access to the data. Only the user has access to their own data.
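A minimal browser sketch of that approach; /data.xml and /view.xsl are placeholder paths, and the output element is assumed to exist on the page:

    // Fetch XML data and an XSLT stylesheet over HTTPS and render the result
    // client-side, so the server code never formats (or needs to read) the data.
    async function render() {
      const parser = new DOMParser();
      const [xmlText, xslText] = await Promise.all([
        fetch('/data.xml').then(r => r.text()),
        fetch('/view.xsl').then(r => r.text()),
      ]);
      const xml = parser.parseFromString(xmlText, 'application/xml');
      const xsl = parser.parseFromString(xslText, 'application/xml');

      const processor = new XSLTProcessor();
      processor.importStylesheet(xsl);
      const fragment = processor.transformToFragment(xml, document);
      document.getElementById('output').appendChild(fragment);
    }
    render();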
As for how the employer is going to host the data without granting any IT people access to it, that's their problem. It's a foolish requirement.
I think that I'll just tell them that they either have to trust a couple of us to have access (and not look at it) or they don't get a project.
Thanks for the answers. Feel free to post more thoughts if you have them.
You can never have 100% security, and extra security comes at a cost of speed/price/convenience etc.
Let's suppose you take scenario 3 - one of your programmers can use social engineering to get the password from one of the users. Goodbye security.
There's no point having a high-security iron door as a gate if people can just walk around it. Just implement a decent level of security.
(They want me to store it without being able to see it, ever.)
Hey, the recording industry wants people to be able to listen to their music, but not copy it. Sounds like they should get together sometime!
Their idea won't work for the same reason DRM doesn't work: the trust chain is inherently compromised. Encryption examples often use Alice, Bob, and Charlie where Alice is trying to communicate with Bob without Charlie listening in. With DRM, the trust chain is compromised because Bob and Charlie are the same person. With your situation, Charlie is the guy writing the software that Alice and Bob use to communicate. There's an implied trust, because if you don't trust Charlie then you can't trust Charlie's software, either.
That's the root of the issue: trust. If they can't trust the programmer, the game is over before it starts.
There are lots of options based on what their goal really is, but I am confused by their paranoia, er, intent:
Is this their (and end-user) data that they wish to keep private or end-user data to be kept private from everyone?
Is it just that your (or any contracted) company is suspect?
Are they afraid of over-the-wire snooping?
Are they afraid of DOM access through JavaScript or browser plugins?
Are they planning staged deployment? In that case you work on test/dev server w/o real data but have no access to the production server with the real data, and DNS logging and/or firewall rules inhibit all of your hacks from working undetected.
Ultimately if the data is stored in a DB then the programmer and DB admin can, by working together, get it. Period. A good audit should uncover that, though.
If this is truly a requirement, the only way to guard against this is to hire an outside firm to audit the code prior to releasing the software, and that's going to be very expensive.
