How to download a Kaggle dataset? - dataset

How do I download Kaggle datasets to Colab or any other place from a script or notebook?
! kaggle datasets download -d arslanali4343/world-cities-database-population-oct2022

The easy way
1. Clone the repo (it is a small package):
!git clone https://github.com/tikendraw/funcyou.git
2. Import the function:
from funcyou.dataset import download_kaggle_dataset
3. Note: keep the kaggle.json token file in the current working directory (where the notebook/script is).
4. Get the API command for the dataset you want to download by clicking the three dots next to the Download button and copying the API command.
5. Call the function and pass the API command as the argument:
download_kaggle_dataset(api_command='kaggle datasets download -d arslanali4343/world-cities-database-population-oct2022', unzip=True)

Related

Download file from github in CLI?

I'm trying to download the tokyo-night-storm.yaml colorscheme file from zellij. I thought that I could just do
wget https://github.com/zellij-org/zellij/blob/main/example/themes/tokyo-night-storm.yaml
but I got the webpage instead of the theme file. How can I retrieve just the actual file and not the webpage?
You have to click the Raw button and copy that link (the direct link).
wget https://raw.githubusercontent.com/zellij-org/zellij/main/example/themes/tokyo-night-storm.yaml
Alternatively, replace /blob/ in the URL with /raw/, that is:
wget https://github.com/zellij-org/zellij/raw/main/example/themes/tokyo-night-storm.yaml
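If you prefer to do the same from Python rather than wget, here is a minimal sketch (assuming the requests package is installed; the URL and output file name are just this example's):
import requests

# Rewrite the GitHub "blob" URL into its raw-content equivalent
blob_url = 'https://github.com/zellij-org/zellij/blob/main/example/themes/tokyo-night-storm.yaml'
raw_url = blob_url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')

# Download the actual file content and save it to disk
response = requests.get(raw_url)
response.raise_for_status()
with open('tokyo-night-storm.yaml', 'wb') as f:
    f.write(response.content)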

How to upload folders to Google Colab?

I want to run a notebook that uses many header files defined in a directory. So basically, I want to upload the entire directory to Google Colab so that I can run the notebook, but I am unable to find any such option; I can only upload files, not complete folders. Can someone tell me how to upload an entire directory to Google Colab?
I suggest you not upload them directly to Colab, since you will lose them whenever the runtime restarts (you would just need to re-upload them, but that can be an issue on slow connections).
I suggest using the google.colab package to manage files and folders in Colab. Just upload everything you need to your Google Drive, then mount it:
from google.colab import drive
drive.mount('/content/gdrive')
This way, you just need to log in to your Google account through the Google authentication API, and you can use files/folders as if they were uploaded to Colab.
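For example, once the drive is mounted you can copy a folder from Drive into the runtime's local disk; a minimal sketch, where my_headers is just a hypothetical folder name in the root of your Drive:
import shutil

# Copy a folder from the mounted Drive into the Colab runtime's local storage
# ('my_headers' is a placeholder; use your own folder name)
shutil.copytree('/content/gdrive/MyDrive/my_headers', '/content/my_headers')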
EDIT May 2022:
As pointed out in the comments, using Google Drive as storage for a large number of files to train a model is painfully slow, as described here: Google Colab is very slow compared to my PC. The better solution in this case is to zip the files, upload them to colab and then unzip them using
!unzip file.zip
More unzip options here: https://linux.die.net/man/1/unzip
You can zip them, upload the zip, and then unzip it:
!unzip file.zip
The easiest way to do this, if the folder/file is on your local drive:
Compress the folder into a ZIP file.
Upload the zipped file into Colab using the upload button in the File section. Yes, there is a File section; see the left side of the Colab screen.
Use the following code to extract the file. Note: the file path is the one copied from Colab's File section.
from zipfile import ZipFile
file_name = 'file_path'  # replace with the path copied from the File section
with ZipFile(file_name, 'r') as zip:
    zip.extractall()
print('Done')
Click Refresh in the Colab File section.
Access the files in your folder through their file paths.
Downside: the files will be deleted after the runtime is over.
You can reuse some of these steps if your file is on Google Drive; just upload the zipped file to Colab from Google Drive.
You can create a Git repository, push the files and folders to it,
and then clone the repository in Colab with the command
!git clone https://github.com/{username}/{projectname}.git
I feel this method is faster.
But if a file is larger than 100 MB, you will have to zip it or add extensions (Git LFS) to push it to GitHub.
For more information, refer to the link below.
https://help.github.com/en/github/managing-large-files/configuring-git-large-file-storage
The best way to approach this problem is simple, yet sometimes tricky.
You first need to compress the folder into a zipped file and upload it to your Google Drive.
While doing so, make sure that the folder is in the root directory of the drive and not in any subfolder. If the compressed folder/data is in another subfolder, you can easily move it to the root directory.
A compressed folder/data in another subfolder often messes up the unzipping process when you specify the file location.
Once you have done the aforementioned tasks, enter the following commands in Colab to mount your drive:
from google.colab import drive
drive.mount('/content/gdrive')
This will ask for an access token, which can be generated by clicking on the URL displayed in the output of the same cell.
!ls gdrive/MyDrive
Check the contents of the drive by executing the above command and ensure that your folder/data is displayed in the output.
!unzip gdrive/MyDrive/<File_name_without_space>.zip
eg:
!unzip gdrive/MyDrive/data_folder.zip
Executing this will unzip your folder into the Colab runtime's storage.
Congrats! You have successfully uploaded your folder/data into Colab.
Zip your files with zip -r file.zip your_folder and then:
from google.colab import files
from zipfile import ZipFile

# files.upload() returns a dict of {filename: content}; iterate over the names
uploaded = files.upload()
for file_name in uploaded.keys():
    with ZipFile(file_name, 'r') as zip:
        zip.extractall()
print('Done')
So here's what you can do:
- Upload the desired dataset folder to your Drive.
- In Colab, mount the drive; the snippet
from google.colab import drive
drive.mount('/content/gdrive')
shows up automatically and you just need to run it.
- Then check for your folder in the Files section on the left-hand side (if the folder is not visible, try refreshing; there should also be a drop-down arrow next to it where you can check all the files under the folder).
- Left-click on the folder, where you get a COPY PATH option.
- Paste the copied path at the desired location in your Colab code.
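As an illustration, the copied path can then be used like any other file path; a minimal sketch, assuming pandas is available and using a hypothetical CSV inside the mounted folder:
import pandas as pd

# Paste the path copied from the Files pane here (this one is only an example)
csv_path = '/content/gdrive/MyDrive/my_dataset/train.csv'
df = pd.read_csv(csv_path)
print(df.head())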

Using Kaggle Datasets in Google Colab

Is it possible to use any datasets available via the kaggle API in Google Colab? I see the Kaggle API is used in this Colab notebook, but it's a bit unclear to me what datasets it provides access to.
Step-by-step --
Create an API key in Kaggle. To do this, go to kaggle.com/ and open your user settings page. Next, scroll down to the API access section and click generate to download an API key. This will download a file called kaggle.json to your computer. You'll use this file in Colab to access Kaggle datasets and competitions.
Navigate to https://colab.research.google.com/.
Upload your kaggle.json file using the following snippet in a code cell:
from google.colab import files
files.upload()
Install the kaggle API using !pip install -q kaggle
Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
Now you can access datasets using the client, e.g., !kaggle datasets list.
Here's a complete example notebook of the Colab portion of this process:
https://colab.research.google.com/drive/1DofKEdQYaXmDWBzuResXWWvxhLgDeVyl
This example shows uploading the kaggle.json file, installing the Kaggle API client, and using the Kaggle client to download a dataset.
You should be able to access any dataset on Kaggle via the API. In this example, only the datasets for competitions are being listed. You can see the datasets you can access with this command:
kaggle datasets list
You can also search for datasets by adding the -s tag and then the search term you're interested in. So this would give you a list of datasets about dogs:
kaggle datasets list -s dogs
You can find more information on the API and how to use it in the documentation here.
Hope that helps! :)
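Besides the !kaggle CLI calls, the kaggle package also exposes a Python client; a minimal sketch (assuming kaggle.json is already in ~/.kaggle, and using an example dataset slug) might look like this:
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate with the token stored in ~/.kaggle/kaggle.json
api = KaggleApi()
api.authenticate()

# Download and unzip a dataset into the current directory
# (the dataset slug below is only an example)
api.dataset_download_files('arslanali4343/world-cities-database-population-oct2022',
                           path='.', unzip=True)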
Detailed approach:
1. Go to My Account in your profile.
2. Scroll down until you find the option Create New API Token; this will download a file called kaggle.json.
3. Go to Colab and upload the file kaggle.json.
4. pip install kaggle
5. Create a new folder named kaggle, copy kaggle.json into the kaggle folder, and set read-write permissions only for you (the user).
6. Go to the Kaggle website. For example, if you want to download some data, click on the three dots on the right-hand side of the screen, then click Copy API command.
7. Go to Colab and paste the API command.
8. When you do an !ls, you will see that the download is a zip file.
9. To unzip the file, use !unzip <downloaded_file>.zip.
10. Now, when you do !ls, you'll find the csv file has been extracted from the zip file.
11. To read the file, import pandas and do a simple pd.read_csv.
12. As you see, we have successfully read our file into Colab.
This downloads the kaggle dataset into google colab, where you can perform analysis and build amazing machine learning models or train neural networks.
Happy Analysis!!!
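Putting the unzip-and-read steps above together, here is a minimal sketch; the zip and CSV names are placeholders, so use the actual names shown by !ls:
import zipfile
import pandas as pd

# Extract the downloaded archive (name is a placeholder)
with zipfile.ZipFile('downloaded-dataset.zip', 'r') as z:
    z.extractall()

# Read the extracted CSV (name is a placeholder)
df = pd.read_csv('data.csv')
df.head()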
I combined the top response into this GitHub gist as a Colab implementation. You can directly copy the code and use it.
How to Import a Dataset from Kaggle in Colab
Method:
First a few things you have to do:
Sign up for Kaggle
Sign up for a competition you want to access data from (for example LANL-Earthquake-Prediction competition).
Download your credentials to access Kaggle API as kaggle.json
# Install kaggle packages
!pip install -q kaggle
!pip install -q kaggle-cli
# Colab's file access feature
from google.colab import files
# Upload `kaggle.json` file
uploaded = files.upload()
# Retrieve uploaded file
# print results
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
# Then copy kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle
Now check if it worked!
#list competitions
!kaggle competitions list -s LANL-Earthquake-Prediction
Have a look at this.
It uses the official Kaggle API behind the scenes, but automates the process so you don't have to re-download manually every time your VM is taken away. Also, another issue I faced with using the Kaggle API directly on Colab was the hassle of transferring the Kaggle API token via Google Drive. The above method automates that as well.
Disclaimer: I am one of the creators of Clouderizer.
First of all, run this command to find out where this Colab file lives and how it executes.
!ls -d $PWD/*
It will show /content/data /content/gdrive /content/models.
In other words, your working directory (pwd) is /content/, so when you do !ls, it will show data gdrive models.
FYI, ! allows you to run Linux commands inside Colab.
Colab keeps cleaning up the /content folder. Therefore, every session you use Colab, your downloaded datasets and kaggle.json file will be gone. That's why it's important to automate the process, so you can focus on writing code, not on setting up the environment every time.
Run this in a Colab code block as an example, with your own API key (open your kaggle.json file and you will find the username and key).
# Info on how to get your api key (kaggle.json) here: https://github.com/Kaggle/kaggle-api#api-credentials
!pip install kaggle
{"username":"seunghunsunmoonlee","key":""}
import json
import zipfile
import os
with open('/content/.kaggle/kaggle.json', 'w') as file:
json.dump(api_token, file)
!chmod 600 /content/.kaggle/kaggle.json
!kaggle config path -p /content
!kaggle competitions download -c dog-breed-identification
os.chdir('/content/competitions/dog-breed-identification')
for file in os.listdir():
zip_ref = zipfile.ZipFile(file, 'r')
zip_ref.extractall()
zip_ref.close()
Then run !ls again. You will see all data you need.
Hope it helps!
To download the competition data from Kaggle on Google Colab:
I'm working on Google Colab and I've been through the same problem, but I did two things.
First, you have to register your mobile number along with your country code.
Second, you have to click on the last submission on the Kaggle dataset page.
Then download the kaggle.json file from Kaggle and upload kaggle.json to Google Colab.
After that, on Google Colab, run the code given below.
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c web-traffic-time-series-forecasting
A quick guide to use Kaggle datasets inside Google Colab using Kaggle API
(1) Download the Kaggle API token.
Go to “Account”, go down the page, and find the “API” section.
Click the “Create New API Token” button.
The “kaggle.json” file will be downloaded.
(2) Mount the Google drive to the Colab notebook.
This gives the Colab notebook access to the files in your Google Drive.
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)
(3) Upload the “kaggle.json” file into the folder in google drive where you want to download the Kaggle dataset.
(4) Install Kaggle API.
!pip install kaggle
(5) Change the current working directory to where you want to download the Kaggle dataset.
%cd /content/gdrive/MyDrive/DataSets/house_price_data/
(6) Run the following code to configure the path to “kaggle.json”.
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/DataSets/house_price_data/"
(7) Download the dataset.
!kaggle competitions download -c house-prices-advanced-regression-techniques
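As an optional follow-up, you can unzip the download in place so the extracted files stay in your Drive folder; a minimal sketch, assuming the zip file name matches the competition slug:
import zipfile

# The download lands in the working directory set in step (5);
# the zip name below assumes it matches the competition slug
with zipfile.ZipFile('house-prices-advanced-regression-techniques.zip', 'r') as z:
    z.extractall()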
After steps (1)-(6) above from Bob Smith's answer, to use a dataset from a particular competition in Colab,
you can use the command:
!kaggle competitions download -c elo-merchant-category-recommendation
Here, elo-merchant-category-recommendation is the name of the competition.
The most important part comes before you download the files:
On the Kaggle webpage, in the Competition section, you must click on Late Submission or on Join Competition and accept the rules and conditions on the Kaggle competition webpage.
If not, after copying the API file and launching the dataset download, a 403 error shows up as the result.
A hacky way:
Go to the dataset page after login
Open Chrome Developer Tools, then go to Network pane
Click Download button on Kaggle
When you click it you will see many requests in the Network pane; find the request starting with archive.zip
Right-click on that request, then Copy -> Copy as cURL (bash). Now you have copied the command
On Colab, paste the command, prepend an ! to it, and then run it
This is definitely a less reliable way than the API, but still remains as an option.
I find the accepted answer to be very comprehensive, but would like to add that:
!kaggle competitions download -c dogs-vs-cats
or most other downloads still won't work. You will probably get the following error:
403 - Forbidden
which is not very verbose. It wants to say: "Please visit kaggle.com and accept the rules (e.g. for that competition)." You cannot accept them through the API! It is explicitly stated in the docs (see Public API documentation | Kaggle):
Just like participating in a Competition normally through the user interface, you must read and accept the rules in order to download data or make submissions. You cannot accept Competition rules via the API. You must do this by visiting the Kaggle website and accepting the rules there.
Yes, this could have been a comment, but I am missing enough reputation to comment.
import os
os.makedirs("/content/.kaggle/")
import json
token = {"username":"your_username_here","key":"your_kaggle_key_here"}
with open('/content/.kaggle/kaggle.json', 'a+') as file:
    json.dump(token, file)
import shutil
os.makedirs("/.kaggle/")
src="/content/.kaggle/kaggle.json"
des="/.kaggle/kaggle.json"
shutil.copy(src,des)
os.makedirs("/root/.kaggle/")
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
#https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463
!kaggle datasets download -d xhlulu/siim-covid19-resized-to-512px-png
Works for me on Colab as of 29-05-21!

How to export fossil-scm timeline to another format

I'm using Fossil SCM as my only solution for version control and tickets. So far, so good. Its self-contained and minimalist approach suits my needs. But I would like to start doing some analysis on the project's history and development, and a good source for that is the project's timeline. I could go with some HTML parsing, trying to convert the Fossil timeline output into something else, but I would like to know if there is any option to export that info in another structured format (e.g. JSON or similar). Web searches have not produced any useful findings on that issue. Any pointers to a solution?
Thanks,
Offray
Have you tried fossil json timeline branch trunk?
fossil help json
Usage: fossil json SUBCOMMAND ?OPTIONS?
In CLI mode, the -R REPO common option is supported. Due to limitations
in the argument dispatching code, any -FLAGS must come after the final
sub- (or subsub-) command.
The commands include:
anonymousPassword
artifact
branch
cap
config
diff
dir
g
login
logout
query
rebuild
report
resultCodes
stat
tag
timeline
user
version (alias: HAI)
whoami
wiki
Run 'fossil json' without any subcommand to see the full list (but be
aware that some listed might not yet be fully implemented).
Compile json when you build from source:
./configure --json
The key to getting this working is to enable JSON support in Fossil by compiling it from source. The current version has it disabled, so looking for any clue about it in the command-line help originally got me nothing. Thanks to user 2612611 for the initial clue about it. Here is the procedure I followed:
Go to https://www.fossil-scm.org/download.html and download the source tarball package.
Uncompress the previous package.
Go to the folder where you uncompressed the package (let's call it /uncompress-folder).
Run ./configure --json
Run make.
Optional: Put your newly created fossil binary in your path or where the last one was installed (something like sudo mv /uncompress-folder/fossil /usr/bin/fossil).
Open the fossil repository whose history you want to export and launch the fossil web interface (fossil ui).
Go to http://localhost:8080/json/timeline/checkin?limit=0, where http://localhost:8080 is your local machine's interface for fossil ui, and json/timeline/checkin?limit=0 is the JSON API call saying: JSON export of the timeline (/json/timeline) check-ins (/checkin) for all history (?limit=0). If instead of the 0 at the end of the URL you put another integer, you will get the last n check-ins.
From the command prompt you should be able to get the same result by running fossil json timeline checkin --limit=0 > timeline.json, which stores the output in the file timeline.json instead of using the web browser, but in a local test it didn't work.
The API is still a moving target, but you can find documentation on this excellent project at [1] and a demo interface to test the parameters at [2].
[1] https://docs.google.com/document/d/1fXViveNhDbiXgCuE7QDXQOKeFzf2qNUkBEgiUvoqFN4/view?pli=1#
[2] http://fossil.wanderinghorse.net/repos/fossil-sgb/json/
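To give an idea of how the exported timeline can be consumed, here is a minimal Python sketch that fetches the JSON from a locally running fossil ui and prints the check-in comments; the payload/timeline field names are assumptions about the response layout, so adjust them to match what your fossil version actually returns:
import json
import urllib.request

# Fetch the full check-in timeline from the local fossil ui instance
url = 'http://localhost:8080/json/timeline/checkin?limit=0'
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# 'payload' and 'timeline' are assumed field names; adjust to match
# the actual JSON your fossil version returns
for entry in data.get('payload', {}).get('timeline', []):
    print(entry.get('timestamp'), entry.get('comment'))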

How do I download the source code of a google app engine project?

This seems like it should be very easy but I don't see a link to it anywhere.
How do I download the source code of a google app engine project?
Windows
appengine-java-sdk\bin\appcfg.cmd -A <your_app_id> -V <your_app_version> download_app <output-dir>
Linux
./appengine-java-sdk/bin/appcfg.sh -A <your_app_id> -V <your_app_version> download_app <output-dir>
For completeness, using the Python implementation:
appcfg.py download_app -A $appID -V $appVersionNumber $downloadDirectory --oauth2
--oauth2 is of course optional, you can omit it and provide your email + app-specific password (or your password, and then go implement two-factor authentication right after), but it's easier, and frankly there's no reason not to.
Documentation.
App Engine actually recently added the ability for the developer who uploaded a given app version to download its source code.
As of October 2019 you can simply go to --> App Engine --> Services and in the tool dropdown select 'source' and the source code is there
Posting this since none of the methods listed above took me to the code (as of June 2021).
You could try accessing it through: Google Cloud Platform > Debugger > choosing the version of the application from the combo box at the top.
This will list the files of that version in the left pane. There is no way to download them automatically, but you can copy and paste the code.
Hope you will find this helpful.
IMHO, the best option today (Aug 2018) is:
Under the main menu, under Products, go to Tools -> Cloud Build -> Build history.
There, click the ID of the build you want (for me - the last one).
Then, in the opened window (Build details), click the "source" link, the download of your compressed code begins.
As simple as that.
HTH.
Working with App engine standard using Go, the debugger isn't available yet.
How I managed to download the source code for an existing service was to use the gcloud tool.
First: Get the version id of your service using the app engine console or running: gcloud app versions list
Second: use the version and service name and run: gcloud app versions describe <versionID> --service=<service name>
The describe command will give you the storage locations for your source files, which look like this:
cmd/main.go:
sha1Sum: e3fe5848c2640eca7ac3591490e1debc2d3a9b09
sourceUrl: https://storage.googleapis.com/<project>/<file id>
Third: you can then use the storage console, using the file id, to download the files you are interested in.
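If you would rather script the third step than click through the storage console, a rough sketch (assuming gcloud is authenticated, and that the describe output exposes the files and their sourceUrl under deployment.files, which is an assumption about the JSON layout) could be:
import json
import subprocess
import urllib.request

# Describe the version as JSON (replace the placeholders with your own values)
out = subprocess.check_output(
    ['gcloud', 'app', 'versions', 'describe', '<versionID>',
     '--service', '<service name>', '--format', 'json'])
desc = json.loads(out)

# Use an access token from gcloud to authorize the storage downloads
token = subprocess.check_output(
    ['gcloud', 'auth', 'print-access-token']).decode().strip()

# 'deployment.files' maps original file names to their storage URLs
# (assumed structure; check your own describe output)
for name, info in desc.get('deployment', {}).get('files', {}).items():
    req = urllib.request.Request(info['sourceUrl'],
                                 headers={'Authorization': 'Bearer ' + token})
    with urllib.request.urlopen(req) as resp:
        print('downloaded', name, len(resp.read()), 'bytes')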
This process is based on the Java SDK.
It works for me...
Download the Google Cloud SDK.
gcloud init
Follow through the process of logging in using your credentials.
Go to the following directory of the SDK:
C:\Program Files (x86)\Google\appengine-java-sdk-1.9.49\bin
Enter the following command to download the source code:
appcfg.sh -A [YOUR_APP_ID] -V [YOUR_APP_VERSION] download_app [OUTPUT_DIR]
Eg: appcfg.sh -A my-project-name-1234 -V 2 download_app C:\Users\india\Desktop\my project
Note: this process is based on the Java App Engine SDK, so we use appcfg.sh instead of appcfg.py.
Check if your app was uploaded with the same email ID as the one in your App Engine account. If you are not sure, then in App Engine > Control, use Clear deployment credentials, then click on any project and deploy to sign in again. Then use this:
appcfg.py download_app -A {app id from google app engine} -V {1} "{c:\path}" --oauth2_credential_file=C:\Users\{your account name}/.appcfg_oauth2_tokens
Change all {} to your needs.
Things have changed since this question was asked so I'm adding an updated answer. Note that this only applies to GAE Standard Environment
Google has deprecated appcfg.py, so the appcfg.py download_app approach from the previous responses no longer works.
gcloud, which is the SDK now in use (it replaced appcfg), does not have the functionality to download your source code.
When you deploy your app via gcloud app deploy, it copies your source code to a bucket. The default bucket is staging.<project_name>.appspot.com. Your files will stay in this bucket for a maximum of 15 days before they are deleted. You can modify the rule so that the files are retained for longer or less time.
The file names in the bucket are encoded so you can't figure out what each file is unless you open it (i.e. download it). Google has a mapping of the encoded names to the original file names. To get this mapping, you run the gcloud app versions describe command and it will list the file names and their encoded names. To download the files, you have to manually click each url one by one. So essentially, you have to download each file manually and then use the mapping to rename them (or open the file, check the content and then rename them). Also note that downloading the files manually will not maintain the folder structure in which they were uploaded.
If you do not wish to go through all of the above hassles (imagine having to manually open each url for each file if you have a small to mid-sized project which has hundreds of files), our App - https://nocommandline.com - now supports downloading source code from the default bucket - staging.<project_name>.appspot.com (so far as your files are still there which means any deployment i.e update not older than 15 days from your current date unless you previously increased the deletion age on your staging bucket's lifecycle page).
In simple terms, you enter your project name, the version number and our App will take care of retrieving the original file name to encoded name mapping, automatically downloading the files and renaming them to the original names, while maintaining the folder structure. For more information, refer to https://nocommandline.com/help/#faq_download_source_code_from_gae.
Log in to the console.developers.google.com
Select the project you want to download the code from (Google App Engine Standard Environment).
Go to the App Engine Dashboard. Under Summary is Debug and Source. Click on Source.
Select each file one at a time and copy it (highlight the code, copy, and paste it into your local editor).
Select the next file....
You need to use svn to check out the files.
If you are on Windows, you can use TortoiseSVN as your GUI.
Here are tutorials on how to do it; here is the related question.

Resources