I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.
For example, www.housingmaps.com and the now-closed www.chicagocrime.org.
If there is a URL that can be used for reference, that would be perfect!
For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.
For example, to extract the categories you could:
// Scrape category data. `http` is the custom screen scraping class, not a PHP built-in.
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";
if (!$h->fetch($url, 300)) {
    echo "<h2>There is a problem with the http request!</h2>";
    exit();
}
// We need all category abbreviations (data looks like: <option value="ccc">community)
preg_match_all("/<option value=\"([^\"]*)\">([^`]*?)\n/", $h->body, $categoryTemp);
// Group 1 holds the abbreviations (e.g. "ccc"); group 2 holds the display names
$catNames = $categoryTemp[2];
// Return the array of category names, or an empty array if nothing matched
if (sizeof($catNames) > 0) {
    return $catNames;
} else {
    return array();
}
An alternative to scraping (and getting blocked), using frames, or Google search is to use a data broker or data exchange service.
3taps is a beta service which provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case of this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached so that it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.
80legs is a crawling service which provides a less real-time but potentially more comprehensive option. Their data dump-style service includes crawl packages for hundreds of sites, including Amazon, Facebook, and Zillow (though not, I believe, Craigslist at the moment). Their newer effort, Datafiniti, provides a search engine over this type of data.
Another option would be to use YQL or Yahoo Pipes to gather the results.
Craiglook and HousingMaps use them to gather results.
The problem with any scraping solution for Craigslist is that they automatically block any IP address that accesses them 'too much', which usually means more than a few hundred times a day. So as soon as your tool got any kind of popularity, it would be shut down.
That's why the only craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or google (like allofcraigs.com).
What 3taps does is gather Craigslist listings from third-party sources 'in the wild', such as the Google and Bing caches.
Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.
I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.
For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data, and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from Craigslist feeds, because the feeds show most of the data but not all. If your reliability doesn't need to be 100%, then RSS is the easiest way to do it.
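For anyone wondering what the RSS route looks like in practice, here is a minimal Python sketch. The feed below is a made-up RSS 2.0 sample and `parse_listings` is a name I picked for illustration; a real feed would be downloaded first and may use a different RSS flavour with XML namespaces, in which case the tag names need adjusting.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real feed would be downloaded from the site instead.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>apartments (sample)</title>
    <item>
      <title>Sunny 2BR near park</title>
      <link>http://example.com/apa/1.html</link>
    </item>
    <item>
      <title>Studio downtown</title>
      <link>http://example.com/apa/2.html</link>
    </item>
  </channel>
</rss>"""


def parse_listings(rss_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    listings = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        listings.append((title, link))
    return listings


listings = parse_listings(SAMPLE_RSS)
for title, link in listings:
    print(title, "->", link)
```

Polling one feed per city and category at a modest interval keeps the request count low, which matters given the banning behaviour described above.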
I'm guessing screen scraping.
I don't think there is a Craigslist API yet, and I don't think they will release one.
So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want off a page.
If you see a link, access the page, scrape the new page, get the data, and show it or store it.
And so on.
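The "see a link, access the page, scrape it, repeat" loop can be sketched roughly like this in Python. Note that I've swapped cURL and regex for the standard library's `html.parser`, and `FAKE_SITE` is a made-up in-memory stand-in for real HTTP requests so the example runs on its own; `crawl` and `LinkParser` are names of my own choosing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first from start_url; return {url: html}."""
    seen = {start_url}
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched; skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Stand-in for real HTTP requests: a tiny made-up site held in memory.
FAKE_SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <p>item A</p>',
    "http://example.com/b.html": '<a href="/missing.html">gone</a>',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(sorted(pages))
```

To use it for real you would pass a `fetch` function that performs the HTTP request (and ideally sleeps between requests), then run your extraction over each stored page.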
I just made one:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.js
That produces:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.html
Must be run in rhino.
While continuing to research this area, I found an awesome site that does partly what I'm interested in:
Crazedlist
It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. The site also gives a clear example of a business need similar to mine, which is why I'm interested in this topic.
This is my first task involving detecting users' geolocations, and I am a fairly new dev.
The app uses React and the backend is Node.js.
Currently we have some functions that call an API which returns users' locations (this takes a while).
But there are two other options right now:
Geolocation API <--- this might need users' permission?
Fastly
For Fastly, I am asking:
Does it work with an app that is not server-side rendered?
For the production site, we have Fastly set up in Route 53, but I need to ask devops about the staging environment. (I got this info from others but do not know what it means.)
Can someone explain to me how Fastly works and what needs to be set up?
Basically, any information is appreciated. I do not know what I should google to find the answers.
Thanks.
If you have Fastly fronting your app, then YES you can definitely use Fastly to provide geolocation information.
Just to be clear (as you mentioned you were unfamiliar with Fastly and more generally are a "new dev"), when I say "fronting your app" I mean: when a client (e.g. a user's web browser) makes a request for https://yourapp.com/, does the request first get routed through Fastly? If it does, then Fastly will proxy the request through to your app and any data you send back through Fastly to the client will likely be cached to make future requests for all your users much quicker (this is one of the many functions Fastly provides).
Fastly has lots of products, but for your primary purposes there are two platform services Fastly offers:
Content Delivery (CDN), which is built on Varnish/VCL (if your ops team already has Fastly set up then this is likely what they have).
Compute@Edge, which is built upon WebAssembly.
I would highly recommend reading the following resources to understand more about the Fastly platform options:
Content Delivery with VCL
Content Delivery with Compute@Edge
As far as using Fastly to handle geolocation information, I'll point you to the following resources:
https://developer.fastly.com/solutions/examples/geo-ip-api-at-the-edge
https://developer.fastly.com/solutions/examples/decorating-origin-requests-with-geoip
Also search the following page for references to "geolocation" as there are quite a few 'examples' that you might be interested in:
https://developer.fastly.com/solutions/examples/
I would also suggest having a play around with https://fiddle.fastly.dev, which lets you use either VCL or any of the supported Compute@Edge languages to test out ideas without needing to have a real Fastly service set up. This will give you a chance to try out some geolocation code.
Lastly, you can also have a read through the first half of https://www.integralist.co.uk/posts/fastly-varnish/ which covers some basics about Fastly's use of Varnish/VCL (but I'd suggest reading the official references, linked above, first).
If you have any other questions, please feel free to reach out to support@fastly.com, who will be happy to help.
I want to build a web crawler that goes randomly around the internet and puts broken (HTTP status code 4xx) image links into a database.
So far I have successfully built a scraper using the node packages request and cheerio. I understand that its limitation is websites that create content dynamically, so I'm thinking of switching to puppeteer. Making this as fast as possible would be nice, but is not necessary as the server should run indefinitely.
My biggest question: Where do I start to crawl?
I want the crawler to find random webpages recursively, that likely have content and might have broken links. Can someone help to find a smart approach to this problem?
List of Domains
In general, the following services provide lists of domain names:
Alexa Top 1 Million: top-1m.csv.zip (free)
CSV file containing 1 million rows with the most visited websites according to Alexa's algorithms.
Verisign: Top-Level Domain Zone File Information (free IIRC)
You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is given free of charge for research purposes (maybe also for other reasons), but it might take several weeks until you get the approval.
whoisxmlapi.com: All Registered Domains (requires payment)
The company sells all kinds of lists containing information about domain names, registrars, IPs, etc.
premiumdrops.com: Domain Zone lists (requires payment)
Similar to the previous one, you can get lists of different domain TLDs.
Crawling Approach
In general, I would assume that the older a website, the more likely it might be that it contains broken images (but that is already a bold assumption in itself). So, you could try to crawl older websites first if you use a list that contains the date when the domain was registered. In addition, you can speed up the crawling process by using multiple instances of puppeteer.
To give you a rough idea of the crawling speed: Let's say your server can crawl 5 websites per second (which requires 10-20 parallel browser instances assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 = 2.3).
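If you want to play with those numbers, the estimate is easy to reproduce. This is just the arithmetic from the paragraph above wrapped in two small Python helpers (the function names are mine):

```python
def crawl_days(total_pages, pages_per_second):
    """Days needed to crawl total_pages at a steady rate."""
    seconds = total_pages / pages_per_second
    return seconds / (60 * 60 * 24)


def browsers_needed(pages_per_second, seconds_per_page):
    """Parallel browser instances needed to sustain the target rate."""
    return pages_per_second * seconds_per_page


print(round(crawl_days(1_000_000, 5), 1))             # 2.3
print(browsers_needed(5, 2), browsers_needed(5, 4))   # 10 20
```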
I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button. It might be useful if you could scrape it with puppeteer.
I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is currently outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDS (Centralized Zone Data Service). This website allows you to access TLD zone file information (by request) for other TLDs, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.
Preface: I have a broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab those numbers off a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
What I understand the problem to be: Some entity generates a data set (i.e. numbers) every other week and you have a need to download that data set for treatment (e.g. sorting).
Ideally, the web site maintaining the wiki would provide a Service, like a RESTful interface, to easily gather the data. If that were the case, I'd go with any language that provides easy manipulation of HTTP request & response, and makes your data manipulation easy. As a previous poster said, Java would work well.
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
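To make the "parse the HTML" option concrete, here is a rough sketch in Python using only the standard library's `html.parser` (a stand-in for tools like Jsoup, not a replacement for them). The table markup and the names `alpha`/`beta` are invented, since I don't know what the actual wiki page looks like.

```python
from html.parser import HTMLParser

# Made-up wiki markup; the real page structure is unknown.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>alpha</td><td>120</td></tr>
  <tr><td>beta</td><td>85</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_numbers(html):
    """Return {name: number} from a two-column table, skipping the header row."""
    parser = TableParser()
    parser.feed(html)
    result = {}
    for row in parser.rows:
        if len(row) == 2 and row[1].isdigit():
            result[row[0]] = int(row[1])
    return result


numbers = extract_numbers(SAMPLE_HTML)
print(numbers)
```

Running something like this on a schedule (every two weeks, in your case) and storing the resulting dictionary is usually enough for a small data set.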
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as csv and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
You could check out http://web-harvest.sourceforge.net/
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?
Captchas
Form submitted in less than a second
Hidden (by css) field gets a value submitted during form submit
Frequent page visits
Simple bots cannot scrape text from Flash, images or sound.
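Two of the checks in this list (the hidden field and the too-fast submission) are simple enough to sketch in Python. Everything named here is an assumption for illustration: the `website` honeypot field, the one-second threshold, and the idea that your framework hands you the form data plus render and submit timestamps.

```python
MIN_FILL_SECONDS = 1.0      # assumed threshold, tune for your forms
HONEYPOT_FIELD = "website"  # hypothetical field hidden from humans via CSS


def looks_like_bot(form_data, rendered_at, submitted_at):
    """Return True if the submission trips either heuristic."""
    # Humans never see the hidden field, so any value suggests automation.
    if form_data.get(HONEYPOT_FIELD, "").strip():
        return True
    # A form filled in faster than a person could type is also suspicious.
    if submitted_at - rendered_at < MIN_FILL_SECONDS:
        return True
    return False


print(looks_like_bot({"name": "Ann", "website": ""}, 100.0, 112.5))           # False
print(looks_like_bot({"name": "x", "website": "http://spam"}, 100.0, 112.5))  # True
print(looks_like_bot({"name": "x"}, 100.0, 100.2))                            # True
```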
Unfortunately, your question is similar to people asking how to block spam. There's no fixed answer, and nothing will stop a person or bot that is persistent.
However, here are some methods that can be implemented:
Check User-Agent (this could be spoofed though)
Use robots.txt (proper bots will, hopefully, respect this)
Detect IP addresses that access a lot of pages too consistently (every "x" seconds).
Check manually, or create flags in your system, to see who is visiting your site, and block certain routes the scrapers take.
Don't use a standard template on your site, use generic CSS class names, and don't leave HTML comments in your code.
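The IP-based check from the list above could look something like the following Python sketch. It is only one way to do it, under my own assumptions: a sliding window for "a lot of pages" and the standard deviation of the gaps between requests for "too consistently". The thresholds, class name, and function name are all invented.

```python
from collections import defaultdict, deque
from statistics import pstdev


class RateMonitor:
    """Flags clients that make too many requests within a sliding window."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now):
        """Record a request; return True if the client is over the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


def is_suspiciously_regular(timestamps, min_requests=5, max_jitter=0.1):
    """True if the gaps between requests are nearly identical."""
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) <= max_jitter


monitor = RateMonitor(max_requests=3, window_seconds=10)
flags = [monitor.record("203.0.113.7", t) for t in (0, 1, 2, 3, 4)]
print(flags)                                             # [False, False, False, True, True]
print(is_suspiciously_regular([0, 5, 10, 15, 20, 25]))   # True
print(is_suspiciously_regular([0, 3, 40, 42, 90, 200]))  # False
```

A scraper that adds random delays will slip past the regularity check, so treat both signals as hints rather than proof.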
You can use robots.txt to block bots that take notice of it (while still letting through other known crawlers such as Google), but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted, you could block particular user agents from accessing your website, just by returning either an empty/default screen and/or a particular server code.
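Here is a small Python sketch of both halves of that idea. The robots.txt content and the `SomeScraper` user agent are made up; `urllib.robotparser` is used only to show how a well-behaved bot would read the rules, and `status_for` is a hypothetical helper standing in for whatever your server or framework does.

```python
from urllib.robotparser import RobotFileParser

# Example rules: let Googlebot in everywhere, ask everyone else to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# How a polite bot would interpret the file:
print(parser.can_fetch("Googlebot", "/listings/1"))    # True
print(parser.can_fetch("SomeScraper", "/listings/1"))  # False

# Server side, for bots that ignore robots.txt: refuse known user agents.
BLOCKED_AGENT_SUBSTRINGS = ("somescraper",)  # hypothetical block list


def status_for(user_agent):
    """Return the HTTP status to send for a given User-Agent header."""
    ua = (user_agent or "").lower()
    if any(bad in ua for bad in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    return 200


print(status_for("SomeScraper/1.0"))  # 403
print(status_for("Mozilla/5.0"))      # 200
```

As noted elsewhere in this thread, the User-Agent header is trivial to spoof, so this only filters out bots that identify themselves honestly.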
I don't think there is a way of doing exactly what you need, because a crawler/scraper can set any header it likes when requesting a page, including User-Agent, so you won't be able to tell whether the request comes from a user running Mozilla Firefox or from a scraper/crawler.
Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/
From their site:
Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.
Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.
I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screen scraping is done by any program that grabs the display data of another program and ingests it for its own use.
Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general, a screen scraper is a program that captures output from a server program by mimicking the actions of a person sitting in front of the workstation using a browser or terminal access program. At certain key points the program interprets the output and then takes an action or extracts certain pieces of information from it.
Originally this was done with character/terminal output from mainframes, for extracting data or updating systems that were archaic or not directly accessible to the end user. In modern terms it usually means parsing the output from an HTTP request to extract data or to take some other action. With the advent of web services this sort of thing should have died away, but not all apps provide a nice API to interact with.
A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract or update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, web page scraping is the more common approach.
Here's a tiny bit of screen scraping implemented in JavaScript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
//Show My SO Reputation Score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() + '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code, paste it into the Console, and see it in action right here on this question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Typically, you have an HTML page that contains some data you want. You write a program that fetches that web page and attempts to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
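A small Python illustration of the "match a unique item close to the data" advice. The HTML snippet is invented: the `count` span appears twice, so the pattern anchors on the `price-box` id, which appears only once.

```python
import re

# Made-up page fragment: <span class="count"> occurs twice.
SAMPLE_HTML = """
<div class="sidebar"><span class="count">7</span> comments</div>
<div id="price-box">Price: <span class="count">1,250</span> USD</div>
"""

# Anchoring on id="price-box" avoids the earlier span, which a looser
# pattern such as <span class="count">(...)</span> would match first.
PRICE_RE = re.compile(r'id="price-box">Price:\s*<span class="count">([\d,]+)</span>')


def extract_price(html):
    """Return the price as an int, or None if the marker is not found."""
    match = PRICE_RE.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(extract_price(SAMPLE_HTML))  # 1250
```

Returning `None` when the anchor is missing gives you an easy way to notice that the site's layout has changed.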
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results), and so often lack the full functionality that the website provides, that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than API clients and are thus better supported. In big companies, changes to extranet portals etc. are infrequent, usually because the work was originally outsourced and now it's just maintained. I'm referring more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.
In terms of doing it, it can be as simple as using a web client to pull the page contents into a string and then using a series of regular expressions to extract the information you want.
string pageContents = new WebClient().DownloadString("http://www.stackoverflow.com"); // WebClient is in System.Net
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the html page, and pulls out the data interested either by searching for known tokens or parsing it as XML or some such.
That is a cleaner approach than regex, in theory. In practice, however, it's not quite as easy, given that most documents will need to be normalized to XHTML before you can XPath through them. In the end we found that fine-tuned regular expressions were more practical.