Is there any way to download a public DB to hard drive?

I'm a social science researcher, and I'm working with data from various public databases run by NGOs, governments, etc. Let's assume I have no way to ask the admins for the whole database. If I have enough patience, I can download all the data one by one, but the size of the DB makes it almost impossible to solve the problem by brute force.
So, is there any way to download a public DB with all of its components?
Here's an example:
http://www.trademap.org/tradestat/Country_SelProductCountry_TS.aspx
You can see Japanese live-animal imports (in USD) broken down by country. Is there a faster way to download all the data for every country and every product than clicking through them one by one?

Yes, there are software tools and web services for scraping. You can find them easily with Google; this is a programming site, not a software recommendations site.
Beware that using automated downloading tools may violate a site's terms of service and get you into legal trouble. Websites may also block your access if you request pages too quickly.
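For a concrete idea of what patient, polite automated downloading can look like, here is a minimal sketch in Python. The endpoint, parameter names, and codes are hypothetical placeholders, not the real Trade Map interface; the point is simply to loop over country/product combinations, pause between requests, and save each response.

```python
import csv
import time

import requests  # third-party: pip install requests

# Hypothetical endpoint and parameters -- adjust to the actual site you are
# allowed to download from, and check its terms of service first.
BASE_URL = "https://example.org/tradestat/export"
COUNTRY_CODES = ["392", "156", "840"]   # illustrative country codes
PRODUCT_CODES = ["01", "02", "03"]      # illustrative product chapters

with open("trade_data.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["country", "product", "raw_response"])
    for country in COUNTRY_CODES:
        for product in PRODUCT_CODES:
            resp = requests.get(
                BASE_URL,
                params={"country": country, "product": product},
                timeout=30,
            )
            resp.raise_for_status()
            writer.writerow([country, product, resp.text])
            time.sleep(2)  # be polite: pause so you don't hammer the server
```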

Related

School project help: Connecting database to website and granting remote access

I'm working on a project at school where we have to create a small system for an RV retailer to track customers, vehicles, employees, inventory, and so on.
We've gotten to the point where we'll need to start coding pretty soon, but I'm having trouble figuring out the logistics. For example, I know enough to build and use the website and the database, but I have no idea how to connect the two.
I know SQL fairly well, well enough to know what information to look for and where, but I don't know enough about connecting to my database to know what I'm looking for.
So what I'm looking for is a basic rundown of the different options, so I can research which would be best for our group.
I feel like there's a lot of information out there on how to do things, but I don't have the background to know why it's relevant or how and where to apply it.
I hope this makes sense. Please let me know if there's anything I can do to help clear it up.
One option for building a database-driven website is to use PHP.
PHP is a server-side scripting language used to generate dynamic content for web pages. You can connect to the database, obtain user input (for example, through HTML forms), run queries, and display the results on the page.
Essentially, you build an application with a web-based user interface. PHP is supported by the vast majority of web hosting platforms.
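Purely as an illustration of that connect-query-render flow (the answer above recommends PHP; this sketch uses Python with Flask and SQLite instead, and the database file, table, and column names are hypothetical):

```python
import sqlite3

from flask import Flask, request  # third-party: pip install flask

app = Flask(__name__)
DB_PATH = "rv_retailer.db"  # hypothetical database file

@app.route("/customers")
def customers():
    # Connect, run a parameterized query based on user input, render the results.
    name_filter = request.args.get("name", "")
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT name, phone FROM customers WHERE name LIKE ?",
        (f"%{name_filter}%",),
    ).fetchall()
    conn.close()
    items = "".join(f"<li>{name}: {phone}</li>" for name, phone in rows)
    return f"<ul>{items}</ul>"

if __name__ == "__main__":
    app.run(debug=True)
```

In PHP the equivalent pieces would be a PDO connection for the query and an HTML form for the user input, but the overall shape of the application is the same.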

Simple centralized system to store and share information at an office

I work at an office where colleagues and I generate and consume structured data that would normally be stored in a database. For instance:
- data from several countries: capital, population, currency, ...
- forecasts of the evolution of the population in each country: each year we generate a different time series
We store these data in dozens of Excel files (which is the latest version? where are they stored? are they in a shared directory?), and we produce lots of documents from them (PowerPoint files, other Excel files for calculations, ...).
I know how to install a MySQL server on Linux, and I could build a web app to generate and store the data, plus an API to consume it. But I wonder whether there is a smarter way to implement a simple centralized system for storing and sharing information at an office.
Thank you very much.
It may be better to use a cloud service here instead of building things from scratch, just to save time and effort. Here are two cloud-service options with some pros and cons of their features. If anything strikes you as useful, I would recommend looking into it more deeply.
Cloud Service 1
If storing and sharing files is the key point here, an online storage system like box.com would be a good solution. I personally like box.com better than Dropbox since it seems to have better admin capabilities for working in teams (I may just be biased here).
Pros:
there are version histories for uploaded files, and files can be locked so they cannot be downloaded
there are access stats (logs) for each file, so you know when someone viewed or downloaded files
there's an area for Box users to leave comments for each file that is uploaded
Excel/Word/PowerPoint files can be previewed in the browser before downloading them (other files as well; I personally found the preview of vector files very useful)
directories and files are immediately accessible via mobile after uploading
shared links can be generated for each file or directory for users without Box accounts to view and download
Cons:
Users will need a Box account to upload files (even for the shared links)
Users may be more used to the Dropbox UI and may find Box to have a different UX than expected (although the UI is not hard to master at all)
Cloud Service 2
If finding an SQL alternative is the key point here, using an online database platform such as kintone would be a good solution. kintone allows you to build customizable online tables (called "Apps" in kintone) using drag and drop.
Pros:
live graphs can be generated on kintone from the stored data
database tables can be created and updated really easily with just the GUI
you can define table columns (or "Fields") to store attachment files
each row (or "Record") has an area for users to leave comments
each Record has a history feature, so you know who edited what contents and when. Nothing is saved locally on your computer, so the latest data is always online (no conflicts occur)
kintone also has internal forums (or "Spaces") that can be used as an alternative for internal emails.
Apps and Spaces are instantly accessible via mobile
kintone has open REST APIs and JavaScript customization capabilities for any further UI changes or connecting with other sources
Cons:
users need a kintone account to view data or to add data into Apps, although there are third-party solutions (at a reasonable price) that work around this
it may not be as intuitive as storing data in Excel spreadsheets (but that's mainly because everyone's used to Excel)
Further questions about cloud services may be better suited to the Software Recommendations community: https://softwarerecs.stackexchange.com/
I'm personally familiar with the kintone APIs, so if you have any further questions about them (capabilities, limits, possibilities, etc.), please go ahead and post them here on Stack Overflow.
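As a rough, hedged sketch of what pulling records out of a kintone App looks like over its REST API (in Python; the subdomain, App ID, API token, and field code below are placeholders, and you should check the current kintone API documentation for exact endpoint details and limits):

```python
import requests  # third-party: pip install requests

# Placeholders -- replace with your own kintone subdomain, App ID, and API token.
SUBDOMAIN = "example"
APP_ID = 1
API_TOKEN = "YOUR_API_TOKEN"

# kintone's record-retrieval endpoint; this fetches records from one App.
url = f"https://{SUBDOMAIN}.cybozu.com/k/v1/records.json"
resp = requests.get(
    url,
    headers={"X-Cybozu-API-Token": API_TOKEN},
    params={"app": APP_ID, "query": 'Country = "Japan"'},  # hypothetical field code
    timeout=30,
)
resp.raise_for_status()
for record in resp.json()["records"]:
    print(record)
```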

Web data extraction and data mining; Scraping vs Injection and how to get data.. like yesterday

I feel like I should almost give a friggin' synopsis for this/these lengthy question(s)...
I apologize if all of these questions have been answered in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e., web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigs' worth) as quickly and efficiently as possible. I have tried web-scraping programs like Scrapy and WebHarvey, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images stored in gallery widgets. I also find that many of the sites I am extracting from use other methods that make mining the data a pain. It would take months to extract the data using WebHarvey, which I can't complain about too much, given that I'd be extracting millions of rows of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around WebHarvey's image limitations (i.e., only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites that embed their content in funny ways and try to get cute with their coding)?
Are there any ways to bypass site search-form parameters that limit the number of search results (i.e., obtaining all business listings within an entire state instead of being limited to one county per search by the form's restrictions)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection?... If one were so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
Once you have your scraping program completed, you can then scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine, or Microsoft Azure) and duplicating the server image across as many instances as required.
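As a rough illustration of the Selenium route (shown here in Python with headless Chrome rather than the Java bindings or PhantomJS mentioned above; the URL and CSS selector are placeholders):

```python
from selenium import webdriver  # third-party: pip install selenium
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless Chrome so the scraper can run on a server with no display
# (use "--headless" on older Chrome versions).
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.org/listings?page=1")  # placeholder URL
    # JavaScript-rendered content is available once the page has loaded.
    rows = driver.find_elements(By.CSS_SELECTOR, "table.results tr")  # placeholder selector
    for row in rows:
        print(row.text)
finally:
    driver.quit()
```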

Core understanding of what Salesforce is

Firstly, I apologise if this is a ridiculously simple question to answer, but it has been bothering me for a while.
I am trying to understand what Salesforce actually is, in technical terms. I have read the website's documentation and the Wikipedia page, but I am trying to understand what's behind all the fluffy terminology.
My understanding is that Salesforce is a cloud-based database which stores a very high volume of information, and that all Salesforce apps consist of scripts that query this database and model the data in different ways depending on the intended application. Is this correct?
Thanks!
Software as a Service (SaaS)
With traditional software, to get a program you need to download it, install it, configure it, and so on. If your system has a lot of users, it's very hard to configure and support every single-user installation.
Imagine that you improve the application and publish a new release, for example: you need to update every instance.
With the SaaS model you have a shared web application that does the same thing as the old downloadable one, but it's much easier to support, because ideally there is just one instance of it.
Salesforce is a company that provides its own system under the SaaS model, but not only that: it is also a platform for developing new applications.

Looking for an example of when screen scraping might be worthwhile

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time seeing how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is Stack Overflow itself - no need to scrape its data, as they've released it under a CC license. The community is already crunching statistics and creating interesting graphs from it.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason some people use screen scraping. Mainframes are still in use in the financial world, and they often run software that was written in the previous century. The people who wrote it might already be retired, and since this software is very critical for these organizations, they really hate it when new code needs to be added. So screen scraping offers an easy interface to communicate with the mainframe, collect information from it, and send it onwards to any process that needs it.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping in a software project too. It was a scheduling application that had to capture the console output of every child process it started. That's the simplest form of screen scraping, actually, and many people don't even realize that if you redirect the output of one application to the input of another, it's still a kind of screen scraping. :)
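That console-capture variant is easy to show; here is a minimal Python sketch of a parent process collecting a child's output (the child command is just an example):

```python
import subprocess

# Run a child process and capture whatever it prints to its console.
# The command here is only an example; a scheduler would substitute
# whichever child process it needs to monitor.
result = subprocess.run(
    ["ls", "-l"],
    capture_output=True,
    text=True,
    check=True,
)

# The "scraped" console output is now ordinary data the parent can parse,
# log, or forward to another process.
for line in result.stdout.splitlines():
    print("child said:", line)
```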
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not make the information available via an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl, screen scrape its way through the login and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the slightest it could break the app), but it works. It's been working now for over a year.
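The same login-upload-download dance can be scripted in other languages too; here is a hedged sketch of the idea in Python with requests. The URLs and form field names are invented for illustration and would have to match the vendor's actual forms, which is exactly why this kind of script is fragile.

```python
import requests  # third-party: pip install requests

session = requests.Session()

# 1. Log in by posting the credentials the login form expects
#    (URL and field names here are hypothetical).
session.post(
    "https://vendor.example.com/login",
    data={"username": "me", "password": "secret"},
    timeout=60,
).raise_for_status()

# 2. Upload the file through the vendor's upload form.
with open("input.dat", "rb") as fh:
    upload = session.post(
        "https://vendor.example.com/translate",
        files={"file": fh},
        timeout=300,
    )
upload.raise_for_status()

# 3. Save whatever the vendor sends back.
with open("translated.dat", "wb") as out:
    out.write(upload.content)
```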
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't easily find an appropriate list in XML/Excel/CSV format anywhere, but I did find a Wikipedia page with (roughly) the info I needed. So I wrote a quick script to scrape that page and load the data into a database.
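A modern version of that quick script might look like this in Python; the page URL, table layout, and column contents are assumptions, so the parsing would need to be adapted to whatever page you actually scrape.

```python
import sqlite3

import pandas as pd  # third-party: pip install pandas lxml

# Hypothetical page listing cities with coordinates; read_html pulls every
# HTML table on the page into a list of DataFrames.
URL = "https://en.wikipedia.org/wiki/List_of_largest_cities"
tables = pd.read_html(URL)

# Assume the first table holds the city / latitude / longitude columns we need.
cities = tables[0]

# Load the scraped rows into a local SQLite database for the app to use.
conn = sqlite3.connect("cities.db")
cities.to_sql("cities", conn, if_exists="replace", index=False)
conn.close()
```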
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow had no API at the time, was to screen scrape.
The obvious case is when a webservice doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
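As a tiny sketch of the idea in Python (with made-up data standing in for the scraped dataset), a reverse lookup is just an index built the other way around, and partial matching is why you need the whole dataset locally rather than issuing one forward query at a time:

```python
# Forward dataset as it might look after scraping: name -> phone number.
forward = {
    "Alice Example": "555-0101",
    "Bob Sample": "555-0102",
    "Carol Test": "555-0103",
}

# Build the reverse index once, over the entire scraped dataset.
reverse = {number: name for name, number in forward.items()}

def reverse_search(partial_number: str) -> list[str]:
    """Return names whose number contains the given digits (partial matching)."""
    return [name for number, name in reverse.items() if partial_number in number]

print(reverse_search("0102"))  # -> ['Bob Sample']
```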
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This gives me up-to-date information on all the products available to me from several suppliers, and it lets me flag uneconomical margins caused by price changes.
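The margin check at the end of a run like that can be as simple as comparing the freshly scraped wholesale price against your own selling price; a minimal sketch in Python with invented SKUs and numbers:

```python
# Prices as they might look after a daily scrape of a supplier's wholesale site
# (SKUs and numbers are invented for illustration).
wholesale = {"SKU-1001": 42.50, "SKU-1002": 18.00, "SKU-1003": 99.90}
selling = {"SKU-1001": 49.99, "SKU-1002": 18.50, "SKU-1003": 129.00}

MIN_MARGIN = 0.10  # flag anything earning less than a 10% margin

for sku, cost in wholesale.items():
    price = selling.get(sku)
    if price is None:
        continue  # product not listed on our own site
    margin = (price - cost) / price
    if margin < MIN_MARGIN:
        print(f"{sku}: margin {margin:.1%} is below {MIN_MARGIN:.0%} -- review pricing")
```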

Resources