Yahoo Pipes legality - screen-scraping

If a website's terms say not to crawl the site, is it still legal to use a tool like Yahoo Pipes or YQL to create mashups?
I don't see how it is any different from direct web scraping, except that a third party is retrieving the data.
Thanks.

Yes, you are right. If a site says not to crawl its data, then it is a violation of its terms whichever way you scrape data from the site. For example, LinkedIn's terms and conditions (http://www.linkedin.com/legal/user-agreement), point 23 under "Don't undertake the following", read:
Use manual or automated software, devices, scripts robots, other means
or processes to access, “scrape,” “crawl” or “spider” any web pages or
other services contained in the site;
which clearly states "manual or automated software, devices, scripts robots, other means", and that covers any possible way to scrape.

Related

State-of-the-art anti-virus protection in web applications

I plan to add anti-virus protection to the web application we are building. I'm concerned that even the limited set of files users can upload (PDF files, images, or even unknown binaries) may contain viruses.
Concerns:
Images shared with other users (exposed on web pages) may contain viruses.
The PDF files that users share with each other may contain viruses.
The API that I build for this web application handles the file upload and this API is the file server as well.
Are there any state-of-the-art approaches to minimize the exposure of users to malware, including techniques in the API or techniques on the client-side (browser)? More specifically, I'm interested in solutions that would scan files in the API itself (backend). The files may be stored in a database or on the file-system.
I searched GitHub for open-source tools and packages, and ran several Google searches for terms like "open source anti-virus API" and "open-source malware HTTP API", but could not find any. Broader search terms returned a huge number of unrelated results.
A related but outdated question investigates a similar problem, but I'm looking for a solution that would integrate well into a microservice architecture such as Kubernetes; moreover, I think a canonical answer from an expert would be useful.
There are definitely solutions that can help you and integrate into your web application via an API. Here are a few that I am aware of:
SophosLabs Intelix
Intelix is a threat intelligence platform that provides access via APIs through AWS Marketplace. There are three parts to the service: lookups, static analysis, and dynamic analysis. Each gives a more detailed analysis of the file, and combining the three will give your web application good protection.
VirusTotal
VirusTotal is a community service that aggregates what various anti-malware vendors say about your file. While VT is a great service, one thing to watch is that VT is focused on being a community, and files you upload are therefore shared with others.
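For illustration, here is a minimal sketch of submitting an upload to VirusTotal's public v3 REST API from a Node backend (Node 18+ for the built-in fetch/FormData; the VT_API_KEY variable is an assumption). Keep the sharing caveat above in mind before sending user files:

```typescript
// Sketch: scan an uploaded buffer with VirusTotal's v3 API, then poll the
// analysis until it completes. Error handling is omitted for brevity.
const VT_KEY = process.env.VT_API_KEY!; // assumed to hold your API key

async function scanWithVirusTotal(file: Buffer, filename: string) {
  const form = new FormData();
  form.append("file", new Blob([file]), filename);

  // Submitting the file returns an analysis id to poll.
  const upload = await fetch("https://www.virustotal.com/api/v3/files", {
    method: "POST",
    headers: { "x-apikey": VT_KEY },
    body: form,
  });
  const { data } = (await upload.json()) as any;

  for (;;) {
    const res = await fetch(
      `https://www.virustotal.com/api/v3/analyses/${data.id}`,
      { headers: { "x-apikey": VT_KEY } }
    );
    const report = (await res.json()) as any;
    if (report.data.attributes.status === "completed") {
      // e.g. { malicious: 0, suspicious: 0, undetected: 60, ... }
      return report.data.attributes.stats;
    }
    await new Promise((r) => setTimeout(r, 5000)); // analysis is asynchronous
  }
}
```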
Clam AV
Not one that I have personal experience with, but ClamAV lets you spin up a server and query it via an API. There is a tutorial / documentation here.
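If you run the daemon yourself, clamd speaks a small TCP protocol, so your backend API can stream each upload to it directly. A minimal sketch in Node, assuming a clamd instance listening on the default port 3310:

```typescript
import net from "node:net";

// Sketch: scan a buffer with clamd's INSTREAM command. Each chunk is sent
// with a 4-byte big-endian length prefix; a zero-length chunk ends the stream.
function clamdScan(file: Buffer, host = "127.0.0.1", port = 3310): Promise<string> {
  return new Promise((resolve, reject) => {
    const socket = net.connect(port, host, () => {
      socket.write("zINSTREAM\0"); // null-delimited command variant

      const size = Buffer.alloc(4);
      size.writeUInt32BE(file.length, 0);
      socket.write(size);
      socket.write(file);
      socket.write(Buffer.from([0, 0, 0, 0])); // terminator
    });

    let reply = "";
    socket.on("data", (chunk) => (reply += chunk.toString()));
    // clamd answers e.g. "stream: OK" or "stream: Eicar-Test-Signature FOUND".
    socket.on("end", () => resolve(reply.replace(/\0/g, "").trim()));
    socket.on("error", reject);
  });
}

// Usage in an upload handler: reject the file unless clamd reports OK.
// if (!(await clamdScan(uploadBuffer)).endsWith("OK")) { /* quarantine it */ }
```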
Others
If you tweak your Google search and look for sandboxes, most offer an API for a fee. A couple that come to mind: Joe Sandbox and Falcon Sandbox, which powers Hybrid Analysis.
As always, be careful of any cloud service that offers you scanning for free. Most of the free tools will share the reports and/or files within their community.

Ad Blocker w/ Segment.io

I'm considering using segment.io for several of my client-side 3rd party API needs, but I'm a little concerned about ad-blockers.
My app has no ads, but I do a lot of event-tracking for product analytics, as well as error tracking.
Segment.io offers a nice all-in-one solution, but if it's blocked, and all my eggs are in that basket, then, well, I won't have any eggs left, or however that idiom ends.
So my question is: is there a way to integrate multi-purpose event tracking (segment.io, keen.io, etc.) that isn't as susceptible to ad-blocking?
My app is mostly serverless, using a Firebase+AWS Lambda setup, so I've tried to think of some kind of back-end solution, but no big ideas so far.
ETA: I'm not looking to track ad-blocking users or violate anyone's trust. My question is about event tracking unrelated to a user's identity, and whether that's possible with an all-in-one event-tracking library that might be ad-blocked.
First, I'd typically consider such blocking to be "privacy" blocking rather than ad blocking. So instead of Adblock, it's more likely to be Ghostery or uBlock Origin.
Although most website uses of analytics are benign (improving conversion rates, catching browser exceptions, etc.), the concern many have is that analytics lets third-party sites (including Segment) track users across multiple websites. Most of these analytics companies aren't interested in that either, but better safe than sorry?
The ethics of wanting analytics about all your webapp use are far more nuanced than "privacy good, tracking bad", and I don't think this is the forum for it, so I'll give you a technical answer. Just note that your disclaimer about not wanting to "track adblocking users" is not really valid: if your aim is to gather analytics about them, that's still essentially tracking. Otherwise, just use a hosted solution and accept that maybe 10-20% of users won't provide you with analytics.
The bad news: basically every "hosted" analytics solution is, or will be, in the block lists. Not only are their API hosts blocked directly, but there are also blocks in place based on the names of the JS files you may try to include.
The good news: you can work around it if you relay events through your own API, and AWS API Gateway, which you may already be using, is perfect for this.
There are multiple steps to this.
Step 1: The analytics provider needs to offer a fully bundled/built JS file. If they require you to pull the script dynamically from their own servers, it will be blocked before it even downloads.
Step 2: Rename the bundled script so that it doesn't trigger any filename-based blocks, e.g. rename from mixpanel.umd.js to mp.js, and add it to your server.
Step 3: Create an API gateway to relay events back to the "correct" API (e.g. to api.analyticshost.com). You can actually do this with AWS API Gateway alone (no Lambda required) if you pass through the right headers and URL parameters.
Step 4: Initialise the library to use your API host rather than the default one.
The result is that (a) the browser no longer pulls the analytics script from the analytics provider's CDN and instead gets it from your server, and (b) the browser sends events to your API, which relays them to the analytics provider's.
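To make steps 2-4 concrete, here is a sketch of the same relay as a tiny self-hosted Node proxy (using the http-proxy-middleware package); an API Gateway HTTP proxy integration achieves the same thing without code. The paths, bundle name, and target host are illustrative:

```typescript
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();

// Step 2: serve the renamed analytics bundle (e.g. public/js/mp.js) from
// your own origin so filename-based block rules don't match.
app.use(express.static("public"));

// Step 3: relay /collect/* to the provider's real API host.
app.use(
  "/collect",
  createProxyMiddleware({
    target: "https://api.analyticshost.com", // provider's actual API host
    changeOrigin: true,                      // rewrite Host header for the target
    pathRewrite: { "^/collect": "" },        // /collect/v1/track -> /v1/track
  })
);

app.listen(3000);
// Step 4: initialise the client library with its API-host option pointing at
// https://yourapp.example/collect instead of the default (the option name
// varies by provider).
```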
When gathering analytics, note that Segment also provides server-side tracking libraries. These can be quite useful when you want to gather metrics for event types that might be blocked on the client. At its simplest, Segment has an HTTP source, but a number of popular languages are supported as well.
https://segment.com/docs/connections/sources/catalog/#server
The classic example is the order-completed event, which would typically happen on your server once the transaction has been committed to the database. Regardless of browser configuration, you can send this event from the server and track it.
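As a sketch, sending that event with Segment's analytics-node library might look like this (the write key, the order fields, and the saveOrder helper are placeholders):

```typescript
import Analytics from "analytics-node";

const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY!);

// Hypothetical persistence step standing in for your real order pipeline.
async function saveOrder(orderId: string, total: number): Promise<void> {
  /* ... commit the transaction to the database ... */
}

async function completeOrder(orderId: string, userId: string, total: number) {
  await saveOrder(orderId, total);

  // Fired server-to-server, so ad blockers never see it.
  analytics.track({
    userId,
    event: "Order Completed", // Segment's e-commerce spec name for this event
    properties: { orderId, total },
  });
}
```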
Be sure you respect the user's consent-management settings here, though.
A lot of valid points are already mentioned in the accepted answer; I would add a few technical considerations to minimize the impact of ad blockers on tracking tools (Segment, Google Tag Manager, etc.):
Develop for server-side tracking. Whatever runs on the server cannot be blocked by ad blockers. However, this is usually tricky and very custom; Segment gives some examples of it as well.
Use managed client-side proxy solutions like DataUnlocker. This is great and does not require many code changes.
Use self-hosted open-source solutions for proxying Google Analytics and Google Tag Manager like this or this. I believe these solutions can be extended to support Segment as well.

What the Facebook DB design probably looks like

I was just wondering how many DB queries Facebook might issue to render a user's home page. Does anybody have any idea how the Facebook DB is designed? I've heard it runs MySQL with thousands of replicas, and that there are more memcache servers than DB servers.
Is the Facebook data sharded?
If it is, does it go to every shard and search for the latest updates from my friends? In the worst case, if I have 100 friends and Facebook has 101 shards, each of my friends could be on a different shard. How might Facebook handle this?
I'd be highly grateful if somebody could provide some hints or pointers toward something like "How to Design a DB for a Social Networking Website". I'm just curious!
Facebook uses the LAMP stack. Facebook's backend services are written in a variety of programming languages, including C++, Java, Python, and Erlang, used according to requirements. On top of LAMP, Facebook uses several technologies to support its large number of requests, such as:
Memcache - a memory caching system used to speed up dynamic database-driven websites (like Facebook) by caching data and objects in RAM to reduce read time. Memcache is Facebook's primary form of caching and helps alleviate the database load; having a caching system allows Facebook to be as fast as it is at recalling your data (see the cache-aside sketch after this list).
Thrift (protocol) - It is a lightweight remote procedure call framework for scalable cross-language services development. Thrift supports C++, PHP, Python, Perl, Java, Ruby, Erlang, and others.
Cassandra (database) - It is a database management system designed to handle large amounts of data spread out across many servers.
HipHop for PHP - a source-code transformer for PHP, created to save server resources. HipHop transforms PHP source code into optimized C++ and then uses g++ to compile it to machine code.
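To make the memcache point above concrete, here is a cache-aside sketch in Node using the memjs client; the server address and the queryDatabase helper are illustrative:

```typescript
import memjs from "memjs";

const cache = memjs.Client.create("127.0.0.1:11211"); // illustrative address

// Hypothetical MySQL helper standing in for your real data-access layer.
declare function queryDatabase(sql: string, params: unknown[]): Promise<unknown>;

async function getUserProfile(userId: string): Promise<unknown> {
  const key = `profile:${userId}`;

  // 1. Try RAM first; at Facebook's scale nearly every read is served here.
  const { value } = await cache.get(key);
  if (value) return JSON.parse(value.toString());

  // 2. Cache miss: fall back to the database, then populate the cache so the
  //    next read skips MySQL entirely.
  const profile = await queryDatabase(
    "SELECT * FROM profiles WHERE id = ?",
    [userId]
  );
  await cache.set(key, JSON.stringify(profile), { expires: 300 });
  return profile;
}
```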
If we went into more detail, the answer to this question would get much longer. You can learn more from the following posts:
How Does Facebook Work?
Data Management, Facebook-style
Facebook database design?
Facebook wall's database structure
Facebook "like" data structure
At this website you can find lots of details about those big internet companies and their technical structures:
http://highscalability.com/
Adding to @Somnath Muluk's answer: Facebook also uses a few other technologies, like Hadoop.
Refer to the following links for more details:
http://www.quora.com/Facebook-Engineering/What-is-Facebooks-architecture
Facebook Architecture
Hope it helps.
Zero. On average, that is. A highly interconnected network with a large number of users, such as Facebook, can run effectively only if the pages that are shown often are served entirely from RAM. Nearly all data should already be in the memcache.

Single Sign On for a Web App

I have been trying to understand how this problem is solved for over a month now. I really need to come up with a general approach that works. I have a theory, but I'm just not sure it's the easiest (or correct) approach, and I haven't been able to find any information to support my ideas.
Here's the scenario:
1) You have a complex web application that offers secure content on a subscription basis.
2) Users are required to log in to your application with user name and password.
3) You sell to large corporations, which already have a corporate authentication technology (for example, Active Directory).
4) You would like to integrate with the corporate authentication mechanism to allow their users to log onto your Web App without having to enter their user name and password.
Now, any solution you come up with will have to provide a mechanism for:
adding new users
removing users
changing user information
allowing users to log in
Ideally, all these would happen "automagically" when the corporate customer made the corresponding changes to their own authentication.
Now, I have a theory that the way to do this (at least for Active Directory) would be for me to write a client-side app that integrates with the customer's Active Directory to track the targeted changes, and then communicate those changes to my Web App. I think that if this communication were done via Web Services offered by my web app, then it would maintain an unhackable level of security, which would obviously be a requirement for these corporate customers.
I've found some information about a Microsoft product called Active Directory Federation Service (ADFS) which may or may not be the right approach for me. It seems to be a bit bulky and have some requirements that might not work for all customers.
For other existing ID scenarios (like Athens and Shibboleth), I don't think a client application is necessary. It's probably just a matter of tying into the existing ID services.
I would appreciate any advice anyone has on anything I've mentioned here. In particular, if you can tell me if my theory is correct about providing a client-side app that communicates with server-side Web Services, or if I'm totally going in the wrong direction. Also, if you could point me at any web sites or articles that explain how to do this, I'd really appreciate it. My research has not turned up much so far.
Finally, if you could let me know of any Web applications that currently offer this service (particularly as tied to a corporate Active Directory), I would be very grateful. I am wondering if other B2B Web app's like salesforce.com, or hoovers.com offer a similar service for their corporate customers.
I hate being in the dark and would greatly appreciate any light you can shed ...
Jeremy
Shibboleth is designed to support exactly this scenario. However, it relies on your customers' companies implementing the identity provider mechanisms; at the moment, that's only really common in universities. Further, if you want user information (anything more than a pseudonymous identifier), you'd need the company to agree to release those attributes to you.
I find it hard to believe that many companies would open their corporate authentication system to you, just to provide SSO.
You might find it better to rely on OpenID or similar, and using a "remember me" cookie to reduce the need for people to enter passwords.
One basic problem with your approach is that you're considering your web app in isolation. Employees at your client's company won't just require SSO for your web app but also for some/few/many others, and extending your approach would require a bespoke implementation for each of those to enable access.
Hence the widespread adoption of OpenAthens and Shibboleth in the academic library community to leverage locally issued credentials. A typical medium or large university subscribes to products and services from more than fifty different publishers; by deploying OpenAthens or Shibboleth, they can take advantage of the SAML open standard (SAML is the protocol Shibboleth uses), which is seeing increased take-up not only in the academic sector but also in the commercial sector.
John's answer above points to another issue: a number of open standards have recently emerged, SAML and OpenID among them. Content providers are having to decide which of these to implement natively, but the standards use separate technology stacks, so the implementation and support costs can be duplicated.
Quite a few major publishers have implemented OpenAthens, as it supports Athens, SAML/Shibboleth, and OpenID on a single platform, with options to plug in other technologies too, or to write a custom module to let an internal app connect, e.g. an invoicing or entitlements system recording which clients' users are logging in.
This sector of access management is definitely moving towards open standards, so building your own method would deprive a large number of users of access to your app.
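For a flavour of what the SAML route looks like in code, here is a minimal sketch of acting as a SAML service provider in Node with the passport-saml strategy; all URLs, the certificate, and the route names are placeholders, and a real deployment would also need sessions, SP metadata, and error handling:

```typescript
import express from "express";
import passport from "passport";
import { Strategy as SamlStrategy } from "passport-saml";

// The customer's IdP (ADFS, Shibboleth, OpenAthens, ...) authenticates the
// user and posts a signed assertion back to the callback URL below.
passport.use(
  new SamlStrategy(
    {
      entryPoint: "https://idp.customer.example/sso", // customer's IdP login URL
      issuer: "https://yourapp.example/metadata",     // your SP entity ID
      callbackUrl: "https://yourapp.example/login/callback",
      cert: process.env.IDP_SIGNING_CERT!,            // IdP signing certificate
    },
    (profile: any, done: any) => done(null, { id: profile.nameID })
  )
);

const app = express();
app.use(passport.initialize());

// Kicks off the flow by redirecting the browser to the IdP.
app.get("/login", passport.authenticate("saml", { session: false }));

// The IdP posts the assertion here; passport-saml validates the signature.
app.post(
  "/login/callback",
  express.urlencoded({ extended: false }),
  passport.authenticate("saml", { session: false }),
  (req, res) => res.send(`Hello, ${(req.user as any).id}`)
);

app.listen(3000);
```

The point is that user provisioning then becomes the identity provider's problem: you never see the corporate password, only the assertion.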

What's the best way to make a mobile friendly site?

Speaking entirely in technology-free terms, what is the best way to make a mobile-friendly site? That is, I want to make a site that will work on a regular computer but also have mobile versions of the pages. Should I rewrite each page? The pages will probably have different functionality, so should I rewrite the backend code? Should it effectively be a different site with the same database?
On my site, I detect the user agent, and for known mobile browsers I serve a different stylesheet, with some larger or less necessary items left off some pages. The backend doesn't really change.
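A minimal sketch of that approach as Express middleware; the regex and stylesheet paths are illustrative (real sites often used device databases like WURFL instead of a hand-rolled pattern):

```typescript
import express from "express";

const app = express();

// Pick a stylesheet per request based on the User-Agent header.
app.use((req, res, next) => {
  const ua = req.headers["user-agent"] ?? "";
  res.locals.stylesheet = /Mobile|Android|iPhone|BlackBerry/i.test(ua)
    ? "/css/mobile.css"   // trimmed layout, larger tap targets
    : "/css/desktop.css"; // full layout
  next();
});

app.get("/", (_req, res) => {
  res.send(`<!doctype html>
<link rel="stylesheet" href="${res.locals.stylesheet}">
<h1>Same backend, different presentation</h1>`);
});

app.listen(3000);
```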
I added a mobile presentation layer to an operational site about a year ago. Based on the architecture of the site (hopefully this isn't too technology-dependent for you), I added a new set of JSPs to accommodate mobile browsers (side note: see http://wurfl.sourceforge.net/ for a great way to build mobile pages independent of browser type). Additionally, some of the back-end functionality was changed because of the limited capabilities of most mobile browsers. So, in short, the integration wasn't as painful as one would expect.
Good luck!
This is a pretty broad question, but here goes:
If the site is primarily about the content, meaning it's not so much a service you use as it's a publication you read, then I'd try to avoid publishing two sites wherever possible. Concentrate on simple presentation using mature technologies that mobile browsers can handle fairly well.
If it's essentially a software application delivered via the network, then things get trickier, because you're going to want to consider the UI of the mobile device, and how it differs from the desktop.
This should go without saying, but either way, if you have many mobile users, you should keep that in mind when you author content for the site. Formats, length, voice, etc.
In addition to the WURFL / WALL capabilities system that Todd mentioned, there are JavaServer Faces libraries available that use alternate WML render kits for mobile phones.
One way I have done it in the past was to make sure my data was well abstracted in the data tier and then use separate middle-tier models to pull what was appropriate. In my case the application was a weather application, and the display capabilities of the target devices were really limited, so we opted to show mobile users only the essentials while the website was full-featured. That was probably 10 years ago, when WAP was big. But these days, with devices getting bigger screens and better bandwidth, you may want to consume and display the exact same data with a different view model.
I never really know what type of application will need to consume the data in the future. We do a lot of apps across platforms but the domain model rarely changes. So I end up using the same middle tier objects where I can and pulling that data in different clients. A good example of this is a recent project where we had a rich internet application (widget), a full website, and a web service consuming the same data. Data abstraction in the middle-tier really shines in this environment.
At a very high level of abstraction, there are two main caveats with mobile devices: (1) their screens are small, and (2) their network connections are intermittent. This basically means that you need to present the content so that it looks fine even on a small (variable-size) screen, and preferably make it cacheable too, so that your users can browse the content while offline. There are also the problems of low bandwidth and high latency, but those are slightly less important nowadays.
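The cacheability point can be addressed in the browser itself. Here is a minimal service-worker sketch that pre-caches a few pages and falls back to the cache when the network drops; the file names are illustrative, and the worker would be registered from your pages as, say, sw.js:

```typescript
// Service worker: network-first with a cache fallback for offline browsing.
const CACHE = "offline-v1";
const PAGES = ["/", "/articles/latest", "/css/mobile.css"]; // illustrative

self.addEventListener("install", (event: any) => {
  // Pre-cache the core pages at install time.
  event.waitUntil(caches.open(CACHE).then((c) => c.addAll(PAGES)));
});

self.addEventListener("fetch", (event: any) => {
  // Try the network; on an intermittent connection, serve the cached copy.
  event.respondWith(
    fetch(event.request).catch(
      () => caches.match(event.request) as Promise<Response>
    )
  );
});
```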
This is a pretty thorough overview of how to make a site mobile, though I hope it's fair to say that anyone seeking to go mobile will have their own requirements. If you have a blog, you could just as easily make it mobile-friendly using Mippin Mobilizer; it's free, provides branding customisation tools, and with a big audience already browsing a wide mix of mobilized content, there are opportunities to generate advertising revenue around your blog.
This is because the Mippin Mobilized blog then becomes part of a much wider community of content, people, news, blogs, listings, all connecting around content, and much more at the mobile site:
http://mippin.com (on a mobile browser.)
Take a look at the Mobilizing tool because it shows off what the site can do in a second:
www.mippin.com/mobilizer
Only if you have a blog of course...
