Download a small sample of the AWS Common Crawl to a local machine via HTTP

I'm interested in downloading the raw text of a tiny subset, tens of megabytes tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm looking at sifting through hundreds of GBs of data when all I need is a few dozen megs.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.
Thanks!

It's quite easy: just pick a single WARC (or WAT or WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
Take the latest crawl (e.g. April 2019).
Navigate to the WARC file list and download it (same for WAT or WET).
Gunzip the file and randomly select one line (a file path).
Prefix the path with https://commoncrawl.s3.amazonaws.com/ (or, since spring 2022, https://data.commoncrawl.org/ - there is a description in the blog post) and download it.
You're done, because every WARC/WAT/WET file is a random sample on its own. Need more data? Just pick more files at random.
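To make the steps concrete, here is a minimal Python sketch of the same procedure. It assumes the April 2019 crawl ID (CC-MAIN-2019-18) and the standard wet.paths.gz file list; check the blog announcement for the crawl you actually want.

```python
import gzip
import random
import requests

# The crawl ID below is the April 2019 crawl mentioned above; swap in whichever
# crawl the blog announces. The file list lives at crawl-data/<CRAWL-ID>/wet.paths.gz
# (warc.paths.gz and wat.paths.gz work the same way).
CRAWL = "CC-MAIN-2019-18"
BASE = "https://data.commoncrawl.org/"

# Fetch the list of WET file paths for that crawl.
resp = requests.get(f"{BASE}crawl-data/{CRAWL}/wet.paths.gz")
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode("utf-8").splitlines()

# Pick one WET file at random and stream it to disk.
wet_url = BASE + random.choice(paths)
with requests.get(wet_url, stream=True) as r:
    r.raise_for_status()
    with open("sample.warc.wet.gz", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```

A single WET file is on the order of 100 MB gzipped, so if that is still more than you want, requesting only the first few megabytes with an HTTP Range header usually still yields usable records, since the file is a concatenation of independently gzipped entries.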

Related

Crawling and scraping random websites

I want to build a webcrawler that goes randomly around the internet and puts broken (HTTP status code 4xx) image links into a database.
So far I have successfully built a scraper using the Node packages request and cheerio. I understand its limitation is websites that dynamically create content, so I'm thinking of switching to Puppeteer. Making this as fast as possible would be nice, but it is not necessary as the server should run indefinitely.
My biggest question: where do I start to crawl?
I want the crawler to find random webpages recursively that likely have content and might have broken links. Can someone help me find a smart approach to this problem?
List of Domains
In general, the following services provide lists of domain names:
Alexa Top 1 Million: top-1m.csv.zip (free)
A CSV file containing 1 million rows with the most visited websites according to Alexa's algorithms
Verisign: Top-Level Domain Zone File Information (free IIRC)
You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is given free of charge for research purposes (maybe also for other reasons), but it might take several weeks until you get the approval.
whoisxmlapi.com: All Registered Domains (requires payment)
The company sells all kinds of lists containing information regarding domain names, registrars, IPs, etc.
premiumdrops.com: Domain Zone lists (requires payment)
Similar to the previous one, you can get lists for different TLDs.
Crawling Approach
In general, I would assume that the older a website is, the more likely it is to contain broken images (though that is already a bold assumption in itself). So you could try to crawl older websites first if you use a list that contains the date each domain was registered. In addition, you can speed up the crawling process by using multiple instances of Puppeteer.
To give you a rough idea of the crawling speed: if your server can crawl 5 websites per second (which requires 10-20 parallel browser instances, assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 ≈ 2.3).
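As a rough illustration of the crawling loop itself, here is a minimal sketch in Python using aiohttp and BeautifulSoup instead of Puppeteer (so it will not see dynamically created content). The example domains, the concurrency, and the final print are placeholders you would replace with your own domain list and database write.

```python
import asyncio
import aiohttp
from urllib.parse import urljoin
from bs4 import BeautifulSoup

async def broken_images(session, page_url):
    """Return the <img> URLs on page_url that answer with a 4xx status."""
    broken = []
    async with session.get(page_url) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    img_urls = {urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)}
    for url in img_urls:
        try:
            async with session.head(url, allow_redirects=True) as r:
                if 400 <= r.status < 500:
                    broken.append(url)
        except aiohttp.ClientError:
            pass  # network errors are not 4xx; skip them
    return broken

async def main(domains):
    timeout = aiohttp.ClientTimeout(total=15)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [broken_images(session, f"http://{d}") for d in domains]
        for result in await asyncio.gather(*tasks, return_exceptions=True):
            if isinstance(result, list) and result:
                print(result)  # in practice: insert into your database

# Placeholder domains; in practice feed in one of the lists above, e.g. the Alexa top-1m CSV.
asyncio.run(main(["example.com", "example.org"]))
```

Throughput then depends mostly on how many requests you allow in flight at once rather than on CPU, so batching the domain list and capping concurrency is usually enough tuning.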
I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button; it might be useful if you could scrape it with Puppeteer.
I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDS. That service allows you to request TLD zone file information for any TLD, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.

Google Drive, auto delete files older than 7 days in specific folder

I'm using Google Drive to store my daily backups from my Linux machine.
But I need a script that automatically deletes the files in a specific folder after 7 days,
so that there are only 7 backups in the folder.
The file that gets backed up is called world-$(date +%d-%m-%Y).tar.gz;
the %d, %m and %Y are replaced with the day, month and year the backup was created.
So if it created one today, it would be called world-14-09-2018.tar.gz.
It gets stored inside a folder called backups.
Is there any way to auto-delete the files, so that they are not moved to the trash but deleted completely after 7 days?
I'm not really familiar with those kinds of scripts, so if anyone could help me that would be really awesome.
You'll want to use the REST API for Google Drive. There are official client libraries for several languages, listed here.
Authenticate your account via OAuth2. Depending on the client library you use, there are different tools to do this. I'm most familiar with the Python SDK, and I use Google's oauth2client. The run_flow() command is a simple way to get an OAuth2 refresh token that you can then use to authenticate API calls. Here's the full documentation for authenticating to Google Drive via OAuth2.
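For reference, a minimal authentication sketch with oauth2client and the Python client library might look like the following; the client_secrets.json and token.json file names are placeholders, and newer code would typically use google-auth instead of the now-deprecated oauth2client.

```python
import httplib2
from oauth2client import client, file, tools
from googleapiclient.discovery import build

SCOPE = "https://www.googleapis.com/auth/drive"
store = file.Storage("token.json")        # caches the refresh token after the first run
creds = store.get()
if creds is None or creds.invalid:
    # client_secrets.json is downloaded from the Google API Console (placeholder name)
    flow = client.flow_from_clientsecrets("client_secrets.json", SCOPE)
    creds = tools.run_flow(flow, store)   # opens a browser for the OAuth2 consent screen
drive = build("drive", "v3", http=creds.authorize(httplib2.Http()))
```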
Once you're authenticated, you can call the files.list endpoint. By default, this will list all files in your My Drive. You can limit the search to just the backup files, so you don't have to iterate through all your files each time, by using a search query. If you have more backups than fit on a single page (it doesn't seem like it, especially with the max pageSize of 1000), you will have to paginate your calls.
You can then filter the results by either the filename (as you indicated) or the createdTime field returned by files.list. Make sure to include createdTime in your fields by setting the fields parameter to a comma-separated list, e.g. "files(id,createdTime,name,mimeType)" or simply "*" to get every field. Get a list of all files older than 7 days, then call files.delete. You can then run this script on a cron job every night, or however you want to deploy it.
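Putting the list-and-delete step together, a sketch might look like this; BACKUP_FOLDER_ID is a hypothetical placeholder for the ID of your backups folder, and drive is the authenticated service from the snippet above.

```python
from datetime import datetime, timedelta, timezone

BACKUP_FOLDER_ID = "your-backups-folder-id"   # hypothetical placeholder

# Anything in the backups folder named like the world-*.tar.gz files and
# created more than 7 days ago gets deleted outright.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%dT%H:%M:%S")
query = (
    f"'{BACKUP_FOLDER_ID}' in parents"
    f" and name contains 'world-'"
    f" and createdTime < '{cutoff}'"
    f" and trashed = false"
)
resp = drive.files().list(q=query, fields="files(id,name,createdTime)").execute()
for f in resp.get("files", []):
    print("deleting", f["name"], f["createdTime"])
    drive.files().delete(fileId=f["id"]).execute()   # permanent delete, skips the trash
```

Note that files.delete removes the file permanently rather than moving it to the trash, which is what you asked for.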
Alternatively, you could use the unofficial Drive command line tool, which will take care of a lot of this for you.

How to upload .gz files into Google Big Query?

I want to build a roughly 90 GB .csv file on my local computer and then upload it into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues. From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500 MB each), then upload those .gz files into Google Cloud Storage, then create an empty table (in Google BigQuery / Datasets), and finally append all of those files to the created table. The issue I am having is finding some kind of tutorial or documentation on how to do this. I am new to the Google platform, so maybe this is a very easy job that can be done with one click somewhere, but all I was able to find was the video I linked above. Where can I find some help, documentation, tutorials or videos on how people do this? Do I have the correct idea of the workflow? Is there some better way (like using some downloadable GUI to upload stuff)?
See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may allow you to avoid specifying a schema manually, although this relies on sampling your input and may need to be corrected if detection fails in certain cases.
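The same load can also be done from the Python client library instead of the bq command line tool. Here is a minimal sketch, assuming the gzipped CSVs sit in a GCS bucket under names like medium-*.csv.gz, that each file has a header row, and that the project, dataset, table and bucket names are placeholders.

```python
from google.cloud import bigquery

# Placeholders: project, dataset, table and bucket names are assumptions.
client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # let BigQuery sample the files and infer a schema
    skip_leading_rows=1,    # assumes each CSV has a header row
)

# The wildcard picks up every gzipped shard without listing them individually;
# BigQuery decompresses gzipped CSV files during the load.
load_job = client.load_table_from_uri(
    "gs://my-bucket/medium-*.csv.gz", table_id, job_config=job_config
)
load_job.result()  # waits for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```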

Clarification needed about Where I Should Put (Store) My Core Data’s SQLite File?

Yes, I know. This question has already been answered in Where to store the Core Data file? and in Store coredata file outside of documents directory?.
@Kendall Helmstetter Gelner and @Matthias Bauch provided very good replies. I upvoted them.
Now my question is quite conceptual and I'll try to explain it.
From the Where You Should Put Your App’s Files section in the Apple documentation, I've read the following:
Handle support files — files your application downloads or generates and can recreate as needed — in one of two ways:
In iOS 5.0 and earlier, put support files in the /Library/Caches directory to prevent them from being backed up.
In iOS 5.0.1 and later, put support files in the /Library/Application Support directory and apply the com.apple.MobileBackup extended attribute to them. This attribute prevents the files from being backed up to iTunes or iCloud. If you have a large number of support files, you may store them in a custom subdirectory and apply the extended attribute to just the directory.
Apple says that for handling support files you can follow two different approaches depending on the installed iOS version. In my opinion (but maybe I'm wrong) a Core Data file is a support file, and so it falls into these categories.
That said, does the approach by Matthias and Kendall remain valid or not? In particular, if I create a directory, say Private, within the Library folder, does this directory continue to remain hidden from backup in both iOS versions (5.0 and 5.0.1), or do I need to follow Apple's solution? If the latter, could you provide any sample or link?
Thank you in advance.
I would say that a Core Data file is not really a support file: unless you have some way to replicate the stored data, you would want it backed up.
The support files are more things like images, or databases that are only caches for a remote web site.
So, you could continue to place your Core Data databases where you like (though it should be under Application Support).
Recent addition as of Jan 2013: Apple has started treating pre-loaded Core Data stores that you copy from a bundle into a writable area as if they were support files, even if you also write user data into the same database. The solution (from DTS) is to make sure that when you copy the databases into place you set the do-not-backup flag, and then unset it once user data is written into the database.
If your CoreData store is purely a cache of downloaded network data, continue to make sure it goes someplace like Caches or has the Do Not Backup flag set.

Elegant way to determine total size of website?

Is there an elegant way to determine the size of data downloaded from a website, bearing in mind that not all requests will go to the same domain that you originally visited and that other browsers may be polling in the background at the same time? Ideally I'd like to look at the size of each individual page, or for a Flash site the total downloaded over time.
I'm looking for some kind of browser plug-in or Fiddler script. I'm not sure Fiddler would work due to the issues pointed out above.
I want to compare sites similar to mine for total file size, and keep track of my own site as well.
Firebug and HttpFox are two Firefox plugins that can be used to determine the size of data downloaded from a website for a single page. While Firebug is a great tool for any web developer, HttpFox is a more specialized plugin for analyzing HTTP requests/responses (with their relative sizes).
You can install both and try them out; just be sure to disable one while enabling the other.
If you need a website wide measurement:
If the website is made of plain HTML and assets (like CSS, images, Flash, ...), you can check how big the folder containing the website is on the server (this assumes you can log in to the server).
You can mirror the website locally using wget, curl or some GUI-based application like SiteSucker, and check how big the folder containing the mirror is.
If you know the website is huge but you don't know how much, you can estimate its size, e.g. www.mygallery.com has 1000 galleries; each gallery has an average of 20 images; every image is stored in 2 different sizes (thumbnail and full size) at an average of n KB per image; ...
Keep in mind that if you download or estimate a dynamic website, you are dealing with what the website produces, not with the real size of the website on the server. A small PHP script can produce tons of HTML.
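If you want a programmatic rough cut for a single page, here is a minimal Python sketch that sums the size of the page's HTML plus the assets it references directly; it will not see resources loaded by JavaScript or from within CSS, so treat the number as a lower bound. The example URL is a placeholder.

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def page_weight(url):
    """Rough total transfer size of one page and its directly referenced assets."""
    page = requests.get(url)
    total = len(page.content)
    soup = BeautifulSoup(page.text, "html.parser")
    asset_urls = set()
    for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for el in soup.find_all(tag, **{attr: True}):
            asset_urls.add(urljoin(url, el[attr]))
    for asset in asset_urls:
        try:
            head = requests.head(asset, allow_redirects=True, timeout=10)
            size = int(head.headers.get("Content-Length", 0))
            if size == 0:  # some servers omit Content-Length; fall back to a GET
                size = len(requests.get(asset, timeout=10).content)
            total += size
        except requests.RequestException:
            pass  # unreachable assets simply don't count toward the total
    return total

print(page_weight("https://example.com") / 1024, "KiB")
```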
Have you tried Firebug for Firefox?
The "Net" panel in Firebug will tell you the size and fetch time of each fetched file, along with the totals.
You can download the entire site and then you will know for sure!
https://www.httrack.com/
