Web data extraction and data mining: scraping vs. injection, and how to get data... like yesterday (screen scraping)

I feel like I should almost give a synopsis for these lengthy questions...
I apologize if all of these questions have been answered specifically in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year doing research into these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigabytes' worth) as quickly and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey, and have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images that are stored in gallery widgets. I also find that many of the sites I am extracting from use other methods that make mining data a pain. It would take months to extract the data using WebHarvey, which I can't complain about, given that I'd be extracting millions of rows' worth of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around the WebHarvey image limitations (i.e. only being able to extract one image within a gallery widget, or not being able to follow sub-page links on sites that embed their content in odd ways and try to get cute with their coding)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county per search, per the form's restrictions)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection... if one were so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.

You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
Once you have your scrape program completed, you can then scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine or Microsoft Azure) and duplicating the server image across as many instances as required.
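For what it's worth, here is a minimal sketch of what such a command-line scraper could look like, using Selenium WebDriver's Python bindings in headless mode rather than Java; the target URL and CSS selector are placeholders, not anything taken from the question.

```python
# Minimal headless-scraper sketch using Selenium WebDriver (Python bindings).
# The URL and CSS selector below are placeholders for illustration only.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
options.add_argument("--headless")              # no visible browser window
driver = webdriver.Firefox(options=options)

try:
    driver.get("https://example.com/listings")  # hypothetical target page
    # The browser executes the page's JavaScript, so AJAX-loaded content
    # (gallery widgets and the like) is present in the DOM before we read it.
    for row in driver.find_elements(By.CSS_SELECTOR, ".listing"):
        print(row.text)
finally:
    driver.quit()
```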

Related

Parse data from another site and display on own site

If someone wants to create website that parses data from other sites and then sifts through and organizes that data to be displayed, what are the best programming languages, front and back-end, to use in order to achieve this?
(Like searching through Craigslist and displaying the best deals of the day. Or searching through every football teams' roster to display any recent changes.)
A good lawyer. No, seriously, get one, because you can run afoul of copyright fairly easily.
Apart from that, any language that includes good support for parsers and text processing, or ideally for processing HTML (strict XML parsers won't cope with real-world HTML). Also, depending on the site, you may need a language with a built-in JavaScript/HTML renderer, because many sites these days use JS to load the data into an otherwise empty page.
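As a rough illustration of the kind of parsing involved, here is a minimal sketch in Python using requests and BeautifulSoup (my choice of language and library, not one the answer prescribes); the URL and CSS selectors are placeholders.

```python
# Sketch: fetch a page and pull out structured "deal" data with an
# HTML-tolerant parser. URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.org/deals", timeout=30).text
soup = BeautifulSoup(html, "html.parser")   # forgiving of invalid markup

deals = []
for item in soup.select(".deal"):           # placeholder CSS class
    deals.append({
        "title": item.select_one(".title").get_text(strip=True),
        "price": item.select_one(".price").get_text(strip=True),
    })

print(deals)
```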

Freebase: Is it worth it to base my company's entire database on it?

I'm with a company that is building a venue / artist database for live music and recently came across Freebase. It looks very compelling, even if the data isn't there for new, up-and-coming bands. For those of you who have worked with Freebase, I have a couple questions:
Are there downsides to integrating all of the data entry with Freebase? We are not looking to sell or privatize this information.
What are the weaknesses of Freebase, with regards to usability?
Disclosure: I work on Freebase at Google.
The music data in Freebase is one of our strongest areas and is going to continue to get broader and richer as we continue to load more datasets. For example, we import data from MusicBrainz, clean it up and match the topics against existing topics in Freebase to avoid duplicates.
In terms of downsides, you should be prepared to work with a lot of data. For example, Freebase currently has 4 musical artists named "John Smith", which may or may not be useful for your application, but you'll still need to figure out which one(s) map to the John Smith that your users are interested in. We call this "reconciliation", and it's necessary so that your app knows precisely which topics to query the API for.
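To make the reconciliation point concrete, here is a hedged sketch of an MQL read that lists every musical artist named "John Smith". The endpoint, parameters and API key handling reflect the Freebase API as it was publicly documented and should be treated as assumptions, not as something confirmed by this answer.

```python
# Hedged sketch: list all /music/artist topics named "John Smith" so an
# application can reconcile them against its own records. Endpoint and
# parameters are assumptions based on the public Freebase documentation.
import json
import requests

query = [{
    "type": "/music/artist",
    "name": "John Smith",
    "mid": None,          # stable machine id worth storing once reconciled
    "id": None,
}]

resp = requests.get(
    "https://www.googleapis.com/freebase/v1/mqlread",
    params={"query": json.dumps(query), "key": "YOUR_API_KEY"},  # placeholder key
    timeout=30,
)
for artist in resp.json().get("result", []):
    print(artist["mid"], artist["id"], artist["name"])
```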
Since you mentioned music venues I should also point out that while Freebase has a lot of data about places, we don't yet have a geosearch API so you'd need to roll your own if that's something you need.
Since anyone can edit Freebase, you should also consider using as_of_time to protect your site against vandalism.
Freebase is great for developers because you can easily jump in and clean up bad data or add missing topics. However, one area that has always been a challenge is loading large amounts of data from outside of Google. We've built OpenRefine, which allows folks to upload datasets, but these datasets must pass a QA process that takes some time to complete. These QA processes are necessary to maintain the level of quality in Freebase, but they do slow down the loading of large datasets.
I really hope that you choose to make use of Freebase music data to build your company. I know that there are already a number of music startups happily using our data.

Good (CMS-based?) platform for simple database apps

I need to implement yet another database website. Let's say roughly 5 tables, 25 columns, and (eventually) thousands to tens of thousands of rows. Easy data entry and maintenance are more important than presentation of the data to non-privileged users. It's a niche site, so performance is not a concern. We'll have no trouble finding somewhere to host it.
So: what's a good platform for this? Intuitively I feel that there ought to be some platform that allows this to be done with no code written - some web version of MS Access. Obviously I'm happy to code business rules and special logic that distinguishes this from every other database app.
I've looked at Drupal (with Views) and it looks possible, but with quite a bit of effort. I will look at Alfresco next. A CMS-y platform helps because then you can nicely integrate static content, and you get nice styling, plugins, etc.
Really good data entry (tracking changes, logging, ability to roll back, mass imports...) would be great. If authorised users could do arbitrary SQL queries (yes, I know...) that would be a big bonus. Image management support a small bonus.
Django is what you are looking for. In fact, you could probably set up what you ask without much coding at all, just configuration.
Once complete, authorised users can add 'rows' with a nice but simple GUI, or, of course, you can batch import via database commands.
I'm a Python newbie, and I've already created 2 Django-based sites. I have created more than a dozen Drupal-based sites, and Django is easier and produces significantly faster sites.
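To illustrate how little code the Django route described above can involve, here is a minimal sketch of a model plus its admin registration; the model name and fields are invented for illustration. The built-in admin then gives authorised users a simple data-entry GUI with change history.

```python
# Minimal Django sketch: models.py and admin.py combined for brevity.
# The "Entry" model and its fields are invented for illustration.
from django.contrib import admin
from django.db import models


class Entry(models.Model):
    name = models.CharField(max_length=200)
    category = models.CharField(max_length=100, blank=True)
    quantity = models.IntegerField(null=True, blank=True)
    notes = models.TextField(blank=True)

    def __str__(self):
        return self.name


@admin.register(Entry)
class EntryAdmin(admin.ModelAdmin):
    list_display = ("name", "category", "quantity")  # change-list columns
    search_fields = ("name", "category")             # free-text search box
    list_filter = ("category",)                      # sidebar filter
```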
Your need sits somewhere between two stools: a bespoke application and a CMS-based one. I'd advocate for the CMS approach, but only if you feel the need for content structure customization will grow in the future, slowly removing the need for direct SQL queries.
I am biased, having worked with eZ Publish for many years now, but it natively satisfies the requirements you expressed:
Really good data entry (tracking changes, logging, ability to roll back, mass imports...)
[...] Image management support a small bonus.
You can get a feel for the content editing experience by watching this video:
http://ez.no/Demos-Videos/eZ-Publish-Administration-Interface-Video-Tutorial
and you can download and test-drive eZ Publish Community Edition here: http://share.ez.no/latest
It is a PHP-based solution with a strong professional community (http://share.ez.no) and over 1100 add-ons available on http://projects.ez.no. The underlying libraries mostly rely on Apache Zeta Components, a high-quality, robust set of PHP5 libraries.
One last note: the content model is abstracted, meaning you would not have to create a new table every time a new type of content needs to be stored. A simple content class definition from the administration interface, and the rest is taken care of, including the editing interface for the new content type. That might remove the need for hardcore SQL queries.
Hope it helped,
Drupal can do most of what you need (I don't know of a module that will let you enter arbitrary SQL queries), but you will end up with some overhead of tables and modules you don't really need. It's up to you to decide if that's a problem or not. I don't think the overhead would hurt performance in your case.
The advantages of using Drupal would be the large community, the stability of the platform and the flexibility to add more functionality when needed. Also, the large user base ensures that most code has been tested rather well.
I highly recommend Drupal. It is very simple (internally, the codebase is small and clean), it has dozens of possibilities, and it has tremendous support. Once you start with Drupal you will never go back to anything else.
Note that I'm not connected with the Drupal staff; I've just created dozens of Drupal sites, many of them in just minutes. My last one took me 2 hours; see it here: http://iPadDevZone.com
UPDATE #1:
It really depends on the complexity of your DB schema. In the best case, you just use the CCK module (part of core now) and create your own node type ("node" is Drupal's name for a piece of content). All you do is define your node type's fields (text, image, numbers, dates, custom, etc.) through the web admin. Then, when a user creates content of this node type, he or she can enter all the fields, which are stored in separate DB table columns. This is hidden from you, however - if you wish not to know about it - it is just a web GUI. Then you choose how the node is presented, which properties are shown, and where.
Watch videos in CCK resources section in the bottom of this page: http://drupal.org/project/cck
If you need to do some programming, it is also very easy to use so-called PHP code snippets, which are entered as part of your content (node) and executed when the page is displayed.
Drupal has node revisions built into the core. You can see all the versions and roll back if you wish.
You can set permissions at quite a granular level, so you can control what your users may or may not do.
I would take a look at Symphony. I haven't used it myself, but it seems really easy to use and to customize!
http://symphony-cms.com/
Seems to me an online database system would be better than a CMS system.
So in addition to what's been posted above:
www.quickbase.com (by Intuit) - think around $150/mo
www.rollbase.com - check on price, full featured
www.rhythmdata.com - easy to set up, but don't think it's got the advanced features you're looking for.
Good luck!
B
I appreciate these answers, but most of them are really platforms that are much better at something else (e.g., Drupal really is a CMS, and has some support for custom fields - but it's not at all easy). Since this is a brand new site from scratch, it doesn't really make sense to start with something that does custom database fields as an afterthought, I think.
The closest I've found is Zoho Creator. It really is like "MS Access for Web 2.0" - and even supports importing from Access. The pricing could get expensive though. It feels like it might eventually be quite constraining. I'm still evaluating.
Are there any other products like Zoho Creator?

Managing a large collection of music

I'd like to write my own music streaming web application for my personal use, but I'm racking my brain over how to manage it. Existing music files and their locations rarely change, but they still can (fixing a filename or ID3 tags, /The Chemical Brothers instead of /Chemical Brothers). How would the community manage all of these files? I can gather a lot of information through just an ID3 reader and my file system, but it would also be nice to keep track of things like how often a track is played. Would using iTunes's .xml file be a good choice - just keeping my music current in iTunes and basing my web application's data off of it? I was thinking of keeping track of all my music by MD5'ing each file and using that as the unique identifier, but if I change the ID3 tags, will that change the MD5 value?
I suppose my real question is, how can you keep track of large amounts of music? Keep the meta info in a database? How I would connect the file and the DB entry is my real question - or should I just read from the filesystem when needed?
I missed part 2 of your question (the MD5 thing). I don't think an MD5/SHA/... solution will work well, because such hashes don't let you find duplicates in your collection (like popular tracks that appear on many different compilations). And especially with big collections, that's something you will want to do someday. (As for the direct question: yes, since MD5 hashes the raw bytes of the file, editing the ID3 tags will change the value.)
There's a technique called acoustic fingerprinting that shows a lot of promise; have a look here for a quick intro. Even if there are minor differences in recording levels (like those popular "normalized" tracks), the acoustic fingerprint should remain the same - I say should, because none of the techniques I tested is really 100% error-free. Another advantage of these acoustic fingerprints is that they can help you with tagging: a service like FreeDB will only work on complete CDs, whereas acoustic fingerprints can identify single tracks.
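As a hedged sketch of the difference, here is how you might compute both a whole-file MD5 (which changes whenever the tags change) and an acoustic fingerprint of the audio itself, using the Chromaprint/pyacoustid bindings; the library choice and file path are mine, not the answer's.

```python
# Sketch: whole-file MD5 vs. an acoustic fingerprint of the decoded audio.
# pyacoustid (with the Chromaprint library installed) is an assumed choice.
import hashlib

import acoustid  # pip install pyacoustid


def file_md5(path):
    """Hash the raw bytes; editing ID3 tags changes this digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def audio_fingerprint(path):
    """Fingerprint the decoded audio, which is unaffected by tag edits."""
    duration, fingerprint = acoustid.fingerprint_file(path)
    return duration, fingerprint


if __name__ == "__main__":
    track = "music/The Chemical Brothers/Block Rockin Beats.mp3"  # placeholder path
    print(file_md5(track))
    print(audio_fingerprint(track))
```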
For inspiration, and maybe even for a complete solution, check out ampache. I don't know what you call large, but ampache (a php application backed by a mysql db) easily handles music collections of tens of thousands of tracks.
Recently I discovered Subsonic, and the web site says "Manage 100,000+ files in your music collection without hassle", but I haven't been able to test it yet. It's written in Java and the source looks pretty neat at first sight, so maybe there's inspiration to be had there too.

Looking for an example of when screen scraping might be worthwhile

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time with how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world, and often they run software that was written in the previous century. The people who wrote it might already be retired, and since this software is very critical for these organizations, they really hate it when new code needs to be added. So screen scraping offers an easy interface to the mainframe: collect information from it and send it onwards to any process that needs it.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping too in a software project. This was a scheduling application which had to capture the output to the console of every child process it started. It's the simplest form of screen scraping, actually, and many people don't even realize that if you redirect the output of one application to the input of another, that it's still a kind of screen scraping. :)
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not offer the information available with an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl, screen scrape our way through the login and upload forms, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least it could break the app), but it works. It's been working now for over a year.
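For readers who don't use Perl, a rough Python equivalent of that flow might look like the following sketch; every URL and form field name is a placeholder, since the original post doesn't disclose them.

```python
# Sketch of the login -> upload -> download flow described above.
# All URLs and form field names are placeholders.
import requests

session = requests.Session()

# Step 1: submit the login form and keep the session cookie.
session.post(
    "https://vendor.example.com/login",
    data={"username": "me", "password": "secret"},
    timeout=60,
)

# Step 2: upload the file to be translated.
with open("input.dat", "rb") as f:
    resp = session.post(
        "https://vendor.example.com/upload",
        files={"document": f},
        timeout=300,
    )

# Step 3: save the returned file.
with open("output.dat", "wb") as out:
    out.write(resp.content)
```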
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't find an appropriate list in XML/Excel/CSV type format anywhere easily, but I did find this wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
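A quick script of that sort might look something like this sketch, which pulls an HTML table into a local SQLite database with pandas; the page URL, table index and column names are placeholders rather than what the author actually used.

```python
# Sketch: scrape a table of cities and coordinates from a Wikipedia page
# into SQLite. URL, table index, and column names are placeholders.
import sqlite3

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_cities")
cities = tables[0][["City", "Latitude", "Longitude"]]   # placeholder columns

with sqlite3.connect("cities.db") as conn:
    cities.to_sql("cities", conn, if_exists="replace", index=False)
```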
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow has no API, was to screen scrape.
The obvious case is when a webservice doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This allows me to have up-to-date information on all the products available to me from several suppliers, and allows me to flag uneconomical margins due to price changes.
