How to copy notebooks between different SageMaker instances? - amazon-sagemaker

My search didn't yield anything useful, so I was wondering: is there any easy way to copy notebooks from one instance to another on SageMaker? Other than, of course, manually downloading the notebooks from one instance and uploading them to the other!

The recommended way to do this (as of 12/16/2018) would be to use the newly launched Git integration for SageMaker Notebook Instances.
Create a Git repository for your notebooks
Commit and push changes from Notebook Instance #1 to your Git repo
Start Notebook Instance #2 using the same Git repo
This way your notebooks are persisted in the Git repo rather than on the instance, and the Git repo can be shared by multiple instances.
https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-notebooks-now-support-git-integration-for-increased-persistence-collaboration-and-reproducibility/
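For example, a rough sketch of attaching the same repository to a second instance with the AWS CLI (the repository URL, repository and instance names, role ARN, and instance type below are placeholders, not values from the question):

# Register the Git repository with SageMaker once (a private repo would also need a SecretArn for credentials)
aws sagemaker create-code-repository \
    --code-repository-name my-notebooks-repo \
    --git-config RepositoryUrl=https://github.com/your-org/your-notebooks.git

# Launch Notebook Instance #2 with that repository checked out under /home/ec2-user/SageMaker
aws sagemaker create-notebook-instance \
    --notebook-instance-name notebook-instance-2 \
    --instance-type ml.t3.medium \
    --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
    --default-code-repository my-notebooks-repo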

Thank you for using Amazon SageMaker.
Unfortunately, the way suggested above is the only way of sharing notebooks between Notebook Instances.
Let us know if there is any other way we can be of assistance.

I might be late to the party, but I recently had to deal with copying the contents of a SageMaker notebook to a different notebook. This problem may become urgent for users of SageMaker notebooks on the Amazon Linux 1 (AL1) platform. Such notebooks have the 'notebook-al1-v1' platform identifier. Amazon announced that "Amazon SageMaker Notebook Instance is ending its standard support on Amazon Linux AMI (AL1)" on April 22, 2022.
Now, what are the steps needed to copy files from one SageMaker instance to a different one? Or, to put it differently, how do you synchronize the EBS volume attached to one instance (let's call it A) to instance B?
Amazon published a step-by-step explanation of how to do it. The idea is first to synchronize the EBS volume of notebook A to a specially created AWS S3 bucket, and second to synchronize the contents of that S3 bucket to notebook B. I followed the instructions in the post, and it didn't work for me. I later discovered that the script 'on_start.sh' in 'migrate-ebs-data-backup' provided in Amazon's solution has some issues with S3 bucket creation.
Personally, I found that rather than doing what the Amazon post recommends, it is much easier to create the ad hoc bucket (let's call it 'ebs-data-backup') manually via the console and then:
launch the Terminal of notebook A and enter the following, which will synchronize the contents of notebook A to the bucket 'ebs-data-backup':
$ cd /home/ec2-user/SageMaker
$ BUCKET_NAME=ebs-data-backup
$ NOTEBOOK_NAME=notebook-A    # replace with the name of your notebook instance
$ TIMESTAMP=$(date +%F-%H-%M-%S)
$ SNAPSHOT=${NOTEBOOK_NAME}_${TIMESTAMP}
$ aws s3 sync --exclude "lost+found/*" /home/ec2-user/SageMaker/ s3://${BUCKET_NAME}/${SNAPSHOT}/
Start notebook B, launch Terminal in notebook B and enter the following code:
$ cd /home/ec2-user/SageMaker
$ aws s3 sync s3://${BUCKET_NAME}/${SNAPSHOT}/ /home/ec2-user/SageMaker/
In notebook B, set BUCKET_NAME and SNAPSHOT to the same values used in notebook A (they are not carried over, since this is a separate shell).
That should be it. Hope it helps.

An easy way to transfer can be to tar (zip-like) the files on one instance and download the archive, then upload it to the new instance and untar it; see the sketch below.
tar --help
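For example, a minimal sketch of the tar/untar round trip, assuming the default /home/ec2-user/SageMaker directory (the archive name is just an example):

# On the source instance: pack the notebooks into one archive, then download it via the Jupyter file browser
cd /home/ec2-user/SageMaker
tar czf my-notebooks.tar.gz *.ipynb

# On the target instance: upload my-notebooks.tar.gz via the Jupyter file browser, then unpack it
cd /home/ec2-user/SageMaker
tar xzf my-notebooks.tar.gz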

Related

Brewing up custom ML models on AWS SageMaker

I am new to SageMaker and I am trying to use my own scikit-learn algorithm. For this I use Docker.
I am trying to do the same task as described in this GitHub notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
My question is: should I manually create the directory /opt/ml (I work on Windows)?
Can you please explain?
Thank you
You don't need to create /opt/ml, SageMaker will do it for you when it launches your training job.
The contents of the /opt/ml directory are determined by the parameters you pass to the CreateTrainingJob API call. The scikit example notebook you linked to describes this (look at the Running your container sections). You can find more info about this in the Create a Training Job section of the main SageMaker documentation.
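For illustration, a hedged sketch of such a CreateTrainingJob call with the AWS CLI (the job name, ECR image URI, role ARN, and S3 paths are placeholders); SageMaker then materializes the 'training' channel under /opt/ml/input/data/training inside the container and expects model artifacts in /opt/ml/model:

aws sagemaker create-training-job \
    --training-job-name scikit-byo-demo \
    --algorithm-specification TrainingImage=123456789012.dkr.ecr.us-east-1.amazonaws.com/scikit-byo:latest,TrainingInputMode=File \
    --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
    --input-data-config '[{"ChannelName": "training", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://your-bucket/train/", "S3DataDistributionType": "FullyReplicated"}}}]' \
    --output-data-config S3OutputPath=s3://your-bucket/output/ \
    --resource-config InstanceType=ml.m5.large,InstanceCount=1,VolumeSizeInGB=10 \
    --stopping-condition MaxRuntimeInSeconds=3600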

An incremental storage / database

I am looking for a method to store only the incremental values of documents in a storage. It can be a database or a file system but the main requirements are:
Fast (adding documents, saving new revisions and retrieving all should be handled as fast as possible)
Efficient (it should use the least amount of storage while keeping it fast enough)
Able to handle a lot of files (billions of files/documents)
At first I was using SVN, and now my best choice seems to be Git. It has all the things I want. However, it has a few issues.
I have to keep a copy of the last version of each document in the repository, which means a lot of files sitting in the storage folder.
It seems like overkill to use a full version control system just for its storage capability. I'm not sure whether this has any disadvantages, though.
I think the ideal solution would be a database that has something like version control, or basically Git's core functionality, at its core.
Is there such a solution? Is it possible for a single developer to create such a tool without months or years of research and effort?
What would you recommend and why?
Git does meet your requirements. The database or incremental storage is simply the .git folder of a git repo.
The remote repo is the place that stores only the compressed object data (addressed by checksums) for the different versions. And it is a bare repo, which means there is no working directory (no checked-out copies of the last version). So you can treat the remote repo as the database.
1. Create a remote repo:
You can create the remote repo locally or host it on GitHub, Bitbucket, etc. For a remote repo hosted on GitHub or Bitbucket, you just need to sign up, create a repository, and then clone a working copy of it. So here I will only show how to create a remote repo locally:
# In an empty folder, such as D:\repo
git init --bare
Now you have an empty remote repo in D:\repo.
2. Making changes to the remote repo/database:
To work with the git repo, you need a working copy (local repo). Clone a local repo from the remote and make/commit changes there. When you want to store the changes in the remote repo (database), just push them to the remote.
# In another directory, such as D:\local
git clone D:/repo
cd repo
# Add/create files you want to store in git repo (D:\local\repo)
git add .
git commit -m 'message'
git push
Now the changes you make will be stored in the remote repo.
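To get revisions back out later, a minimal sketch (the document path is just an example):

# In the working copy: list the revisions that touched a document
git log --oneline -- docs/report.txt
# Print the document as it was two commits ago, without touching the working copy
git show HEAD~2:docs/report.txt
# Fetch changes that others have pushed to the remote repo/database
git pull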

Capistrano get git commit sha1

I am writing a task for Capistrano 3 and I need to get the current commit SHA1. How can I read that? Is there a variable for it?
I have seen fetch(:sha1) in some files, but this isn't working for me.
I am deploying into a Docker container, and I need to tag the image with the current SHA1 (and ideally skip the deployment if there is already an image corresponding to the current SHA1).
Capistrano creates a file in the deployed folder containing the git revision. Looking at the task that creates that file, we can see how it obtains the revision: https://github.com/capistrano/capistrano/blob/master/lib/capistrano/tasks/deploy.rake#L224
So, it is obtaining the revision from fetch(:current_revision).
In the git specific tasks, we can see where it is set: https://github.com/capistrano/capistrano/blob/master/lib/capistrano/scm/tasks/git.rake#L62
As a side note, Capistrano is probably not the best tool for what you are trying to do. Capistrano is useful for repeated deployments to the same server. Docker essentially is building a deployable container already containing the code. See the answers here for more detail: https://stackoverflow.com/a/39459945/3042016
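If the main goal is just to tag the image with the commit and skip the build when that tag already exists, a plain shell sketch outside Capistrano may be enough (the registry and image name are placeholders; docker manifest inspect queries the registry for an existing tag):

# Resolve the commit to deploy (run inside the project checkout)
SHA1=$(git rev-parse HEAD)
# Build and push only if the registry does not already have this tag
if ! docker manifest inspect registry.example.com/myapp:${SHA1} > /dev/null 2>&1; then
    docker build -t registry.example.com/myapp:${SHA1} .
    docker push registry.example.com/myapp:${SHA1}
fi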
Capistrano 3 uses a plugin system for the version manager in use (git, svn, etc.).
The current_revision is delegated to the version manager plugin, and I don't understand how to access it...
In the meantime, a dirty solution would be:
set :current_revision, (lambda do
  `git rev-list --max-count=1 #{fetch(:branch)}`
end)
But I'm waiting for a good solution that would instead manage to invoke the right task from the SCM plugin.

Deploying AngularJs + Sinatra to AWS

I have an AngularJS site consuming an API written in Sinatra.
I'm simply trying to deploy these 2 components together on an AWS EC2 instance.
How would one go about doing that? What tools do you recommend? What structure do you think is most suitable?
Cheers
This is based upon my experience with the HashiCorp line of tools.
Manual: Launch an Ubuntu image, gem install sinatra, and deploy your code. Take a snapshot for safekeeping. This one-off approach is good for a development box to iron out the configuration process. Write down the commands you run and any options you may need.
Automated: Use the Packer EC2 Builder and Shell Provisioner to automate your commands from the previous manual approach. This will give you a configured AMI that can be launched.
You can apply different methods of getting to an AMI using different toolsets. However, in the end, you want a single immutable image that can be deployed repeatedly.
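As a rough sketch, the shell provisioner script for the automated approach might look something like this (Ubuntu base AMI assumed; the repository URL and install paths are placeholders):

#!/bin/bash
# provision.sh - run by the Packer shell provisioner on a base Ubuntu AMI
set -e
sudo apt-get update -y
sudo apt-get install -y ruby-full build-essential nginx git
sudo gem install sinatra puma
# Fetch the application code (AngularJS front end + Sinatra API); the URL is a placeholder
git clone https://github.com/your-org/your-app.git /tmp/your-app
sudo mv /tmp/your-app /var/www/your-app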

How to download an ephemeral file from Heroku Cedar

I have a rails project hosted on Heroku Cedar that does the following:
crawls daily newsfeeds and stores them in the database
lets me manually judge the feeds and classify them into categories
uses the judgments to build a classifier that automatically classifies new incoming feeds
iteratively improves the classification with additional judgments
The problem is that the classifier requires writing to a file. However, when I run the scripts on Heroku Cedar, it creates an ephemeral file that isn't permanent.
My questions are:
Is there a way to download the ephemeral file I created by running a script on Heroku?
What's a better way to handle situation like this?
In short: no. You want to store any generated data in some sort of persistent file or data store. You should look at pushing these files to S3 or similar.
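For example, after the classifier writes its model file, a small sketch of pushing it to S3 with the AWS CLI (the bucket name and paths are placeholders; in a Rails app you would more likely use the aws-sdk gem, but the idea is the same):

# Upload the generated model file to a persistent bucket
aws s3 cp /tmp/classifier-model.dat s3://your-persistent-bucket/models/classifier-model.dat
# Later, pull it back down before classifying new feeds
aws s3 cp s3://your-persistent-bucket/models/classifier-model.dat /tmp/classifier-model.dat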
