Website content crawling - screen-scraping

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers.
We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue?
All suggestions are highly appreciated.

You could try a spider trap, but a determined scraper can add a check for that.
You could also add a rate limiter and, beyond a certain rate, force them to solve a CAPTCHA, but you might also annoy your regular users.
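A minimal sketch of the rate-limiter idea (in Python/Flask rather than the IIS stack from the question); the per-IP window lives in memory, and the /captcha challenge page is hypothetical:

```python
import time
from collections import defaultdict, deque

from flask import Flask, redirect, request

app = Flask(__name__)

WINDOW_SECONDS = 60        # look at the last minute of traffic per IP
MAX_REQUESTS = 30          # above this, the visitor must solve a CAPTCHA
hits = defaultdict(deque)  # ip -> timestamps of recent requests

@app.before_request
def rate_limit():
    ip = request.remote_addr
    now = time.time()
    window = hits[ip]
    window.append(now)
    # drop timestamps that have fallen out of the window
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS and request.path != "/captcha":
        # hypothetical challenge page; clear the IP's counter once it is solved
        return redirect("/captcha")

@app.route("/")
def listing():
    return "Business listing page..."
```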
But really, anything you create they can probably adapt to and work around. Your best bet might just be what Developer Art said, and get a lawyer.

If there are many pages of data, you can monitor the IPs of visitors and make sure a given IP sees no more than a fraction of your pages per day.
Ultimately what you want is a contradiction: you do want people to download it to their computers (to view it now); but you don't want them to download it to their computers (to view it later).
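A sketch of that per-IP daily quota, assuming you know roughly how many pages the directory has; the in-memory dict stands in for whatever shared store you would really use:

```python
import datetime
from collections import defaultdict

TOTAL_PAGES = 10_000
DAILY_FRACTION = 0.05            # one IP may see at most 5% of the directory per day
pages_seen = defaultdict(set)    # (ip, date) -> page ids served to that IP today

def allow(ip: str, page_id: str) -> bool:
    """Return True if this IP may still view this page today."""
    seen = pages_seen[(ip, datetime.date.today())]
    if page_id in seen:
        return True              # re-viewing an already-seen page costs nothing
    if len(seen) >= TOTAL_PAGES * DAILY_FRACTION:
        return False             # quota exhausted; serve an error or a CAPTCHA instead
    seen.add(page_id)
    return True
```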

Related

IPFS can't really force nodes to delete an uploaded file; isn't that a problem?

As this decentralisation wave takes place around the digital world, I was wondering how you can remove content that you just uploaded to a decentralized network.
As I understand it, more and more people want decentralized services because, as opposed to the client-server architecture, this gives you more ownership of your stuff and everything is more transparent. But what happens if you messed up, or the company you're a client of messed up, and they/you upload some personal info that you clearly don't want others to have access to? Since it's a peer-to-peer network, everybody has access to it and there's no way to force them to delete it.
I think what I am trying to understand is how this decentralized future will play out with private data. Will there be a centralized place for private data while we do other things on IPFS and similar apps? Because if that's so, then what's the purpose; why not continue as it is right now? Maybe I am still not seeing all the use cases...
IPFS does allow you to delete a file; you just need to do so on all the different nodes hosting it.
If some nodes aren't under your control, the process is fairly simple: monitor ipfs dht findprovs <A file you want to delete> to find all peers hosting the file, then for each one find its IP with ipfs dht findpeer <Peer ID>, then use a database like whois or BGP data to find the ISP and send them a C&D, a GDPR claim, or whatever applies.
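That process can be scripted; a rough sketch, assuming the ipfs CLI is installed and that the two commands print one peer ID or address per line:

```python
import subprocess

def find_providers(cid: str) -> list[str]:
    """Peer IDs currently announcing the given content, via `ipfs dht findprovs`."""
    out = subprocess.run(["ipfs", "dht", "findprovs", cid],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def find_addresses(peer_id: str) -> list[str]:
    """Multiaddresses for one peer, via `ipfs dht findpeer`; the IPs in them are what you look up with whois."""
    out = subprocess.run(["ipfs", "dht", "findpeer", peer_id],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

for peer in find_providers("<CID of the file you want removed>"):
    print(peer, find_addresses(peer))
```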
Apart from the tools you use being IPFS-centred, it's the exact same process as for the regular good old web2 with HTTP.
You might think that, with multiple nodes, it's unlikely everyone will comply with whatever jurisdiction you use to claim your right to be forgotten.
But that already happens with HTTP: you can host your server in a country that doesn't follow whatever law you invoke to have those files removed, or use Tor and mostly not worry about legal threats.
GDPR or any other law like that is already ineffective at removing stuff from the web; the goal is more to scare big players and help politicians keep their jobs (putting in place an ineffective solution to a problem not many people understand can earn a good public reception and get them re-elected).
Yes, it can be a problem. Companies that store their customers' data should not store it on a blockchain: in Europe, under the GDPR, they are obliged to delete the data if the customer requests it.
I had a similar issue at my company when we were working out whether we should use a decentralized network in a project. In this link is a statement from R3 (which developed Corda, a DLT for business) about this topic. It is from 3 years ago but it's still relevant in my opinion.
So the solution is to store only a reference to the user (like an ID) on-chain and keep the sensitive data off-chain.
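A minimal illustration of that split, with put_on_chain as a hypothetical stand-in for whatever ledger API you use:

```python
import hashlib
import json

off_chain_store = {}  # stands in for a conventional database you fully control

def put_on_chain(key: str, value: str) -> None:
    """Placeholder for a real DLT transaction; only an opaque fingerprint is anchored."""
    print(f"anchoring {key} -> {value}")

def store_user(user_id: str, personal_data: dict) -> None:
    """Keep personal data off-chain; put only a hash of it on-chain."""
    record = json.dumps(personal_data, sort_keys=True)
    off_chain_store[user_id] = record
    put_on_chain(user_id, hashlib.sha256(record.encode()).hexdigest())

def forget_user(user_id: str) -> None:
    """GDPR erasure: delete the off-chain record; only the fingerprint remains on-chain."""
    off_chain_store.pop(user_id, None)

store_user("user-123", {"email": "alice@example.com"})  # illustrative data only
forget_user("user-123")
```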
Another interesting project is Atala Prism, but unfortunately I have not yet had the time to dive into it.

How is it that global sites with millions of users are so fast?

Could you please help me with some references/literature/sources/websites where I can find more information about ways to optimize a site with a lot of data? The goal is to create a site where everyone can access data/files without needing, e.g., MS Terminal or VMware Viewer to reach data at our other location. Should I invest in internal servers or look to the cloud? I would like to know where to start. It's a question from a dummy, but thank you for the help!
One of the aspects that make sites like Facebook, Google, and so on so fast is that they replicate their servers in different parts of the world, so users do not suffer the latency that occurs when you contact a website on the other side of the world.
The servers used are also very high-end, so they are able to handle a lot of simultaneous requests at a time.
And finally, probably one of the most important aspects is that the database schemas, the code, and the architecture are constantly refined to work as fast as possible.

Resolving Overloaded Webserver Issues

I am new to the area of web development and am currently interviewing with companies; among the favorite questions people ask are:
How do you scale your webserver if it starts hitting a million queries?
What would you do if you have just one database instance running at that time? How do you manage that?
These questions are really interesting and I would like to learn about them.
Please pour in your suggestions / practices (that you follow) for such scenarios.
Thank you
How to scale:
Identify your bottlenecks.
Identify the correct solution for the problem.
Check to see if you can implement the correct solution.
Identify an alternate solution and check it.
Typical Scaling Options:
Vertical Scaling (bigger, faster server hardware)
Load balancing
Split tiers/components out onto more/other hardware
Offload work through caching/cdn
Database Scaling Options:
Vertical Scaling (bigger, faster server hardware)
Replication (active or passive)
Clustering (if DBMS supports it)
Sharding
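As a quick sketch of the sharding option above: split the data across several database instances and route each key deterministically to one of them (the connection names here are placeholders):

```python
import hashlib

# placeholder identifiers for the shard instances
SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]

def shard_for(customer_id: str) -> str:
    """Pick a shard deterministically from the customer id, so all of that
    customer's queries always go to the same instance."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))
```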
At the most basic level, scaling web servers consists of writing your app in such a way that it can run on > 1 machine, and throwing more machines at the problem. No matter how much you tune them, the eventual scaling will involve a farm of web servers.
The database issue is way more sticky to deal with. What is your read / write percentage? What kind of application is this? OLTP? OLAP? Social Media? What is the database? How do we add more servers to handle the load? Do we partition our data across multiple dbs? Or replicate all changes to loads of slaves?
Your questions raise more questions. In an interview, if someone just "has the answer" to a generic question like the one you've posted, then they only know one way of doing things, and that way may or may not be the best one.
There are a few approaches I'd take to the first question:
Are there hardware upgrades that could raise capacity enough to handle the million queries in short order? If so, this is likely the first point to investigate.
Are there software changes that could be made to optimize the performance of the server? I know IIS has a ton of different settings that could be used to improve performance to some extent.
Consider moving to a web farm rather than using a single server. I actually was in a situation at a place I worked where we had millions of hits a minute, and it was thrashing our web servers rather badly and taking down a number of sites. Our solution was to change the load balancer so that a few servers were dedicated to the site that was thrashing the servers, so the other servers could keep the remaining sites up; this was in the fall, and in retail that is your big quarter. While some would start here, I'd likely come here last, as this can open a big can of worms compared to the other two options.
As for the database instance, it would be a similar set of options to my mind, though I might do the multi-server option first, as redundancy may be an important side benefit here that I'm not sure is as easy to get with a web server. I may be way off, but that is how I'd initially tackle it.
Use a caching proxy
If you serve identical pages to all visitors (say, a news site) you can reduce load by an order of magnitude by caching generated content with a caching proxy such as Varnish or Apache Traffic Server.
The proxy will sit between your server and your visitors. If you get 10,000 hits to your front page, it will only have to be generated once; the proxy will send the same response to the other 9,999 visitors without asking your app server again.
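For the proxy to be allowed to reuse a generated page, the app has to mark it cacheable; a small sketch (using Flask as a stand-in for the app server), telling Varnish or Apache Traffic Server it may serve the front page from cache for 60 seconds:

```python
from flask import Flask, make_response

app = Flask(__name__)

def render_front_page() -> str:
    """Placeholder for the expensive page generation."""
    return "<html>...front page...</html>"

@app.route("/")
def front_page():
    resp = make_response(render_front_page())
    # the caching proxy may hand out this copy for 60s without hitting the app again
    resp.headers["Cache-Control"] = "public, max-age=60"
    return resp
```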
Before the developers start building the system, they will probably consider the specification of the server.
You could also reduce load by limiting SEO and blocking search engines from crawling the site (crawling can consume a lot of resources).
Try to index everything well, and avoid making expensive searches too easy to trigger.
Deploy it in the cloud, and make sure your web server and web app are cloud-ready and can scale across different nodes. I recommend the Cherokee web server (very easy to load-balance across different servers, and benchmarks show it to be faster than Apache). For example, Google's cloud (appspot) needs your web app to be Python or Java.
Use a caching proxy, e.g. Nginx.
For the database, use memcached for queries that are expected to repeat (sketched below).
If the company wants the data to be private, build a private cloud. Ubuntu is doing a very good job at it, fully free and open source: http://www.ubuntu.com/cloud/private
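A sketch of the memcached suggestion, assuming the pymemcache client library; run_query stands in for the real database call:

```python
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def run_query(sql: str):
    """Placeholder for the real database call."""
    return [{"id": 1, "name": "example"}]

def top_listings():
    """Serve a frequently repeated query from memcached when possible."""
    cached = cache.get("top_listings")
    if cached is not None:
        return json.loads(cached)
    rows = run_query("SELECT id, name FROM listings ORDER BY views DESC LIMIT 20")
    cache.set("top_listings", json.dumps(rows), expire=60)  # refresh at most once a minute
    return rows
```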

Identifying hostile web crawlers

I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the scraping crawler.
If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
If, as a defender, I detect the same agent/IP, the attacker will randomize the agent.
And this is where I get lost - if an attacker randomizes the intervals and the agent, how would I not discriminate against proxies and machines hitting the site from the same network?
I am thinking of checking the suspect agent with javascript and cookie support. If the bogey can't do either consistently, then it's a bad guy.
What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?
My solution would be to make a trap. Put some pages on your site where access is banned by robots.txt. Make a link to them on your page, but hide it with CSS, then IP-ban anybody who visits such a page.
This will force the offender to obey robots.txt, which means that you can put important information or services permanently out of his reach, which will make his carbon-copy clone useless.
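A small sketch of that trap (Flask is just the example framework here); the trap URL is disallowed in robots.txt and linked only through a CSS-hidden anchor, so only a crawler that ignores both ends up on the ban list:

```python
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # a real setup would persist this and/or feed it to the firewall

@app.before_request
def block_banned():
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/robots.txt")
def robots():
    return "User-agent: *\nDisallow: /trap\n", 200, {"Content-Type": "text/plain"}

@app.route("/trap")
def trap():
    # only a crawler that ignores robots.txt and follows hidden links gets here
    banned_ips.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    # the trap link is invisible to humans but present in the HTML for scrapers
    return '<a href="/trap" style="display:none">listings</a> ...page content...'
```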
Don't try to recognize them by IP and timing or intervals; use the data you send to the crawler to trace them.
Create a whitelist of known good crawlers--you'll serve them your content normally. For the rest, serve pages with an extra bit of unique content that only you will know how to look for. Use that signature to later identify who has been copying your content and block them.
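One way to sketch that unique-content signature: every non-whitelisted visitor gets an invisible, per-IP token embedded in each page, which only you can map back to the IP it was served to (the secret key and whitelist entries are placeholders):

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # placeholder; keep it out of the pages themselves
WHITELIST = {"198.51.100.4"}            # placeholder IPs of crawlers you trust

def watermark_for(ip: str) -> str:
    """Deterministic token derived from the visitor's IP; meaningless without SECRET."""
    return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]

def render_page(ip: str, body: str) -> str:
    if ip in WHITELIST:
        return body                       # known good crawlers get clean pages
    # an HTML comment (or hidden element) that a copy-paste scraper will carry along
    return body + f"\n<!-- ref:{watermark_for(ip)} -->"
```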
And how do you keep someone from hiring a person in a country with low wages to use a browser to access your site and record all of the information? Set up a robots.txt file, invest in a security infrastructure to prevent DoS attacks, obfuscate your code (where it's accessible, like JavaScript), patent your inventions, and copyright your site. Let the legal people worry about someone ripping you off.

What are the cons of a web based application

I am going to write a database application for the camp I work for. I am thinking about writing it in C# with a Windows GUI, but using a browser as the application is seeming more and more appealing for various reasons. What I am wondering is why someone would not choose to write an application as a web application. Ex. The back button can cause you some trouble. Are there other things that anyone can think of?
There are plenty of cons:
Speed and responsiveness tend to be significantly worse
Complicated UI widgets (such as tree controls) are harder to do
Rendering graphics of any kind is pretty tricky, 3D graphics is even harder
You have to mess around with logins
A centralised server means clients always need network access
Security restrictions may cause you trouble
Browser incompatibilities can cause a lot of extra work
UI conventions are less well-defined on the web - users may find it harder to use
Client-side storage is limited
The question is.. do enough of those apply to your project to make web the wrong choice?
One thing that was not mentioned here is the level of complexity and knowledge required to generate a good web application. The problem being unless you are doing something very simple, there is no "Single" knowledge or technology that goes into these applications.
For example, if you were to write an application for some client-server platform, you might develop in Java or C++. For a complex web application you may have to have expertise in Java, JavaScript, HTML, Flash, CSS, Ajax, SQL, J2EE, etc. The components of a web-based application are also more numerous: a web application server, HTTP server, database, and browser are typical components, but there could be more, whereas a client-server app is typically just what it says, a client application and a server application. My experience and personal preference is not web-based; web-based is great for many things, but even though I am an IT architect for a leading company that is completely immersed in web apps as the solution for everything, the cons are still many. I do think the technology will evolve and the cons will go away over time, though.
Essentially, the real limitations are only those of the platform, i.e. the browser. If you have to account for all browsers in current use, that can be a pain due to the varying degrees of standards support in each of them.
If you have control over which browser is used, that is, everyone is on computers that you control on site and, say, you install Firefox on all of them, you can leverage the latest JavaScript and CSS standards to their fullest in your content delivery.
[edit] You could also look into options like the Adobe Integrated Runtime, or "AIR", which lets you code the front end with traditional browser-based options like XHTML/CSS/JavaScript or Flash/Flex, hook the back end up to your database online, and still provide the functionality of a traditional desktop app.
The biggest difference and drawback I see with web applications is state management. Since the web is, by nature, stateless, everything you want to maintain has to be sent back and forth between client and server with every request and response. How to store and retrieve it efficiently, with respect to page size and performance, is hard to do at times. Also, the fact that there is no real standard (at least not one that everyone adheres to) for browsers makes consistency really... fun.
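As a small illustration of state riding along with each request: Flask's session object (used here purely as an example) keeps its data in a signed cookie, so the client resends it on every round trip rather than the server holding anything open between requests:

```python
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder; used to sign the session cookie

@app.route("/add/<item>")
def add_to_cart(item):
    # nothing is kept server-side between requests; the cart travels in the cookie
    cart = session.get("cart", [])
    cart.append(item)
    session["cart"] = cart
    return f"{len(cart)} items in cart"
```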
You need network access to the server that will host the web application (if there are going to be multiple users of the application, which is typically the case).
Actually, there are more pros than cons - if you can give some details about your application, we could help a little more...
It completely depends on the requirements of your project. For the most part, there isn't much web applications cannot do these days. Admittedly, certain applications do belong on the desktop, as browsers (while currently advancing, and rapidly) are still not quite there yet. Just look at the advent of applications such as Google Docs and Gmail.
There isn't much you -cannot- do on the web. If you're creating a World of Warcraft competitor, however, the web is most certainly not the optimal solution. Again, unfortunately we'd need more insight into the application you're building for the camp. The best part about the web is that anyone with a browser can use your application.
Web applications delegate processing to a remote machine. Depending on the amount of processing, this can be a con. Consider a photo editor that's a web app.
Web applications also can't deal with a whole lot of data going back and forth to and from a client. You can watch video online... when it's compressed. It will be a while before we see any web-based video editing software.
Browser compatibility is also a hassle. You can't control the look-and-feel of the application 100%.
Vaibhav has a good point. What's your application?
A major one is downtime for migrations: users will not expect the application to be down, ever, but realistically it will have to be down for major upgrades. When doing this with a desktop application, the user (or the end user's systems admin) is in control of when upgrades happen; with an online app, they're not.
For applications with large amounts of data, performance can be a major problem, as you're storing a large number of users' data centrally, which means the IO performance will not be as good as it would be if you gave them all a laptop.
In general, scalability causes problems for a server-based app; desktop applications scale really well.
You can do an awful lot with a web-based app, but it is a lot easier to do certain things with a thick client:
Performance: You get simple access to the full power of the client's CPU.
Responsiveness: Interactivity is fast and easy.
Graphics: You can easily use graphics libraries such as DirectX and OpenGL to create fast impressive graphics.
Work with local files
Peer-to-peer
Deciding whether a web application is a good approach depends on what you are trying to achieve. However here are some more general cons of web applications:
Real integration with desktop apps (e.g. Outlook) is impossible
Drag and drop between your app and the desktop / other running apps
With a web application, there are more privacy concerns when you are storing user data on your servers. You have to make sure that you don't lose or disclose it, and your users have to be comfortable with the idea of storing that data on your servers.
Apart from that, there are many security problems, like Man-in-the-middle attacks, XSS or SQL injections.
You also need to make sure that you have enough computing power and bandwidth at hand.
"Ex. The back button can cause you some trouble."
You'll have to be specific on this. A lot of people make fundamental mistakes in their web applications and introduce bugs in how they handle transactions. If you do not use "Redirect after Post" (also known as Post-Redirect-Get, or the PRG design), then you've created a bug which appears as a problem with the back button.
A blanket statement that the back button causes trouble is unlikely to be true. A specific example would clarify your specific question on this.
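A minimal sketch of the Redirect-after-Post pattern mentioned above, again using Flask as the example framework (save_comment is a placeholder for the real write):

```python
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

def save_comment(text: str) -> None:
    """Placeholder for the real transactional write."""
    pass

@app.route("/comments", methods=["POST"])
def create_comment():
    save_comment(request.form["text"])
    # redirect so the browser's history ends on a GET, not the POST;
    # Back/refresh then re-displays the page instead of re-submitting the form
    return redirect(url_for("list_comments"))

@app.route("/comments", methods=["GET"])
def list_comments():
    return "all comments..."
```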
The back button really is not that much of an issue if you design your application correctly. You can use AJAX to manipulate parts of the current page without adding entries to the browser history (since the page itself won't change).
The biggest issue with designing web applications has to do with state and the challenges that need to be programmed around. With a desktop application, state is easy to handle: you can leave a database connection open, lock the record, and wait for the user to make the changes and commit. With a web application, you could lock the record... but then what if the user closes the browser? These things must be overcome in the design of your application.
When designing a web application, make sure that each trip to the server "stands alone" and provides a complete answer. Always re-initialize your variables before performing any work and never assume anything. One of the challenges I ran into once was pulling "pages" of grid data back to the user. In a really busy system, with record additions/modifications happening in real time, the navigation from page to page would vary greatly, sometimes even showing the user the same few records again as new additions were inserted in front of the query.
