Crawling and scraping random websites - request

I want to build a a webcrawler that goes randomly around the internet and puts broken (http statuscode 4xx) image links into a database.
So far I successfully build a scraper using the node packages request and cheerio. I understand the limitations are websites that dynamically create content, so I'm thinking to switch to puppeteer. Making this as fast as possible would be nice, but is not necessary as the server should run indefinetely.
My biggest question: Where do I start to crawl?
I want the crawler to find random webpages recursively, that likely have content and might have broken links. Can someone help to find a smart approach to this problem?

List of Domains
In general, the following services provide lists of domain names:
Alexa Top 1 Million: top-1m.csv.zip (free)
CSV file containing 1 million rows with the most visited websites according to Alexas algorithms
Verisign: Top-Level Domain Zone File Information (free IIRC)
You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is given free of charge for research purposes (maybe also for other reasons), but it might take several weeks until you get the approval.
whoisxmlapi.com: All Registered Domains (requires payment)
The company sells all kind of lists containing information regarding domain names, registrars, IPs, etc.
premiumdrops.com: Domain Zone lists (requires payment)
Similar to the previous one, you can get lists of different domain TLDs.
Crawling Approach
In general, I would assume that the older a website, the more likely it might be that it contains broken images (but that is already a bold assumption in itself). So, you could try to crawl older websites first if you use a list that contains the date when the domain was registered. In addition, you can speed up the crawling process by using multiple instances of puppeteer.
To give you a rough idea of the crawling speed: Let's say your server can crawl 5 websites per second (which requires 10-20 parallel browser instances assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 = 2.3).

I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button, it might be useful if you could scrape it with puppeteer.

I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is currently outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDNS. This website allows you to access TLD file information (by request) for any name, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.

Related

How to collect donations within lightning network?

Weeks ago, I have set up my own ln node (Umbrel). Now I try to use the lightning network for receiving donations. Here start my questions.
How can I set up a donation address where donors can decide how much to donate?
(I have discovered "keysend" as a possibility. Unfortunately, it is not widely used among users. Isn't it?)
How can I create an invoice for open-ended deposits?
Please take for your answers into account, I'm an UX designer without serious developer skills.
On my donation site at https://donate.ln.rene-pickhardt.de/ you can see that I use an adopted version of lnme https://github.com/bumi/lnme which is a widget that allows users to request an invoice from your node for a certain amount.
The propper why of doing this would be via BOLT 12 offers which are currently being standardized and are similar to keysend not yet supported by all wallets / implementations.
Another way of achieving your goal is by using LNURL or lightning-address which are currently widely supported but not part of the spec. Similar to my own solution based on lnme you need a webserver for LNURL or lightning-address.

Host server bloating with numerous fake session files created in hundreds every minute on php(7.3) website

I'm experiencing an unusual problem with my php (7.3) website creating huge number of unwanted session files on server every minute (around 50 to 100 files) and i noticed all of them having a fixed size of 125K or 0K in cPanel's file manager, hitting iNode counts going uncontrolled into thousands in hours & hundred thousands+ in a day; where as my website really have a small traffic of less than 3K a day and google crawler on top it. I'm denying all bad bots in .htaccess.
I'm able to control situation with help of a cron command that executes every six hours cleaning all session files older than 12hours from /tmp, however this isn't an ideal solution as fake session files getting created great in number eating all my server resources like RAM, Processor & most importantly Storage getting bloated impacting overall site performance.
I opened many of such files to examine but found them not associated with any valid user as i add user id, name, email to session upon successful authentication. Even assuming a session created for every visitor (without acc/login), it shouldn't go beyond 3K on a day but sessions count going as high as 125.000+ just in a day. Couldn't figure out the glitch.
I've gone through relevant posts and made checks like adding IP & UserAgent to sessions to track suspecious server monitoring, bot crawling, overwhelming proxy activities, but with no luck! I can also confirm by watching their timestamps that there is no human or crawler activity taken place when they're being created. Can see files being created every single minute without any break throughout the day!!.
Didn't find any clue yet in order to figure out root cause behind this weird behavior and highly appreciate any sort of help to troubleshoot this! Unfortunately server team unable to help much but added clean-up cron. Pasting below content of example session files:
0K Sized> favourites|a:0:{}LAST_ACTIVITY|i:1608871384
125K Sized> favourites|a:0:{}LAST_ACTIVITY|i:1608871395;empcontact|s:0:"";encryptedToken|s:40:"b881239480a324f621948029c0c02dc45ab4262a";
Valid Ex.File1> favourites|a:0:{}LAST_ACTIVITY|i:1608870991;applicant_email|s:26:"raju.mallxxxxx#gmail.com";applicant_phone|s:11:"09701300000";applicant|1;applicant_name|s:4:Raju;
Valid Ex.File2> favourites|a:0:{}LAST_ACTIVITY|i:1608919741;applicant_email|s:26:"raju.mallxxxxx#gmail.com";applicant_phone|s:11:"09701300000";IP|s:13:"13.126.144.95";UA|s:92:"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0 X-Middleton/1";applicant|N;applicant_name|N;
We found that the issue triggered following hosting server's PHP version change from 5.6 to 7.3. However we noticed unwanted overwhelming session files not created on PHP 7.0! It's same code base we tested against three versions. Posting this as it may help others facing similar issue due to PHP version changes.

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or Java program to access it, and then I'm looking at sifting through 100's Gb's of data when all I need is a few dozen megs.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just choose randomly a single WARC (WAT or WET) file from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
take the latest crawl (eg. April 2019)
navigate to the WARC file list and download it (same for WAT or WET)
unzip the file and randomly select one line (file path)
prefix the path with https://commoncrawl.s3.amazonaws.com/ (or since spring 2022: https://data.commoncrawl.org/ - there is a description in the blog post) and download it
You're down because every WARC/WAT/WET file is a random sample by its own. Need more data: just pick more files at random.

Need ideas on retrieving data from a website

I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus & train) have good web sites for providing their respective information. And both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules The client wants to provide users the ability to search for a beginning and end location and determine, using the external website's information, how they can best get there, being provided a route with schedule times for the different modes of chosen transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external site's server (via API or some other means) and retain the info in a database, which can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven to be problematic due to the language barrier, mainly.
My client suggested what is basically "screen scraping", but that sounds like it would be complicated at best, downloading the web page(s) and filtering through the HTML for relevant/necessary data to put into the database. My worry is that the info on these mainly static sites is so static, that the data isn't even kept in a database to build the page and the web page itself is updated (hard-coded) when something changes.
I could really use some help and suggestions here. Thanks!
Screen scraping is always problematic IMO as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could then snapshot the page when you transcribe the info and run a job to periodically check whether the page has changed from the snapshot. When it does, it sends an email for you to update it.
The above method could also be used in conjunction with some sort of screen scaper which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a case of how much effort (cost) is your client willing to bear for accuracy
I have done this for the following site: http://www.buscatchers.com/ so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything went wrong during the scraping process. On the site, I use a two day window so that I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for some examples. There is some simplified source code here: http://www.buscatchers.com/about/guide. The full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.
I can tell that the data is dynamic, it's to well structured. It's not hard for someone who is familiar with xpath to scrape this site.

Question on serving Images on App Engine ( 2 Alternatives )

planning to launch a comic site which serves comic strips (images).
I have little prior experience to serving/caching images.
so these are my 2 methods i'm considering:
1. Using LinkProperty
class Comic(db.Model)
image_link = db.LinkProperty()
timestamp = db.DateTimeProperty(auto_now=True)
Advantages:
The images are get-ed from the disk space itself ( and disk space is cheap i take it?)
I can easily set up app.yaml with an expiration date to cache the content in user's browser
I can set up memcache to retrieve the entities faster (for high traffic)
2. Using BlobProperty
I used this tutorial , it worked pretty neat. http://code.google.com/appengine/articles/images.html
Side question: Can I say that using BlobProperty sort of "protects" my images from outside linkage? That means people can't just link directly to the comic strips
I have a few worries for method 2.
I can obviously memcache these entities for faster reads.
But then:
Is memcaching images a good thing? My images are large (100-200kb per image). I think memcache allows only up to 4 GB of cached data? Or is it 1 Mb per memcached entity, with unlimited entities...
What if appengine's memcache fails? -> Solution: I'd have to go back to the datastore.
How do I cache these images in the user's browser? If I was doing method no. 1, I could just easily add to my app.yaml the expiration date for the content, and pictures get cached user side.
would like to hear your thoughts.
Should I use method 1 or 2? method 1 sounds dead simple and straightforward, should I be wary of it?
[EDITED]
How do solve this dilemma?
Dilemma: The last thing I want to do is to prevent people from getting the direct link to the image and putting it up on bit.ly because the user will automatically get directed to only the image on my server
( and not the advertising/content around it if the user had accessed it from the main page itself )
You're going to be using a lot of bandwidth to transfer all these images from the server to the clients (browsers). Remember appengine has a maximum number of files you can upload, I think it is 1000 but it may have increased recently. And if you want to control access to the files I do not think you can use option #1.
Option #2 is good, but your bandwidth and storage costs are going to be high if you have a lot of content. To solve this problem people usually turn to Content Delivery Networks (CDNs). Amazon S3 and edgecast.com are two such CDNs that support token based access urls. Meaning, you can generate a token in your appengine app that that is good for either the IP address, time, geography and some other criteria and then give your cdn url with this token to the requestor. The CDN serves your images and does the access checks based on the token. This will help you control access, but remember if there is a will, there is a way and you can't 100% secure anything - but you probably get reasonably close.
So instead of storing the content in appengine, you would store it on the cdn, and use appengine to create urls with tokens pointing to the content on the cdn.
Here are some links about the signed urls. I've used both of these :
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html#signed-urls
http://www.edgecast.com/edgecast_difference.htm - look at 'Content Security'
In terms of solving your dilemma, I think that there are a couple of alternatives:
you could cause the images to be
rendered in a Flash object that would
download the images from your server
in some kind of encrypted format that
it would know how to decode. This would
involve quite a bit of up-front work.
you could have a valid-one-time link
for the image. Each time that you
generated the surrounding web page,
the link to the image would be
generated randomly, and the
image-serving code would invalidate
that link after allowing it one time. If you
have a high-traffic web-site, this would be a very
resource-intensive scheme.
Really, though, you want to consider just how much work it is worth to force people to see ads, especially when a goodly number of them will be coming to your site via Firefox, and there's almost nothing that you can do to circumvent AdBlock.
In terms of choosing between your two methods, there are a couple of things to think about. With option one, where are are storing the images as static files, you will only be able to add new images by doing an appcfg.py update. Since AppEngine application do not allow you to write to the filesystem, you will need to add new images to your development code and do a code deployment. This might be difficult from a site management perspective. Also, serving the images form memcache would likely not offer you an improvement performance over having them served as static files.
Your second option, putting the images in the datastore does protect your images from linking only to the extent that you have some power to control through logic if they are served or not. The problem that you will encounter is that making that decision is difficult. Remember that HTTP is stateless, so finding a way to distinguish a request from a link that is external to your application and one that is internal to your application is going to require trickery.
My personal feeling is that jumping through hoops to make sure that people can't see your comics with seeing ads is solving the prolbem the wrong way. If the content that you are publishing is worth protecting, people will flock to your website to enjoy it anyway. Through high volumes of traffic, you will more than make up for anyone who directly links to your image, thus circumventing a few ad serves. Don't try to outsmart your consumers. Deliver outstanding content, and you will make plenty of money.
Your method #1 isn't practical: You'd need to upload a new version of your app for each new comic strip.
Your method #2 should work fine. It doesn't automatically "protect" your images from being hotlinked - they're still served up on a URL like any other image - but you can write whatever code you want in the image serving handler to try and prevent abuse.
A third option, and a variant of #2, is to use the new Blob API. Instead of storing the image itself in the datastore, you can store the blob key, and your image handler just instructs the blobstore infrastructure what image to serve.

Resources