Where do I store my model's training data, artifacts, etc? - amazon-sagemaker

I'm trying to build and push a custom ML model with Docker to Amazon SageMaker. I know things are supposed to follow the general structure of being in /opt/ml, but there's no such bucket in Amazon S3. Am I supposed to create this directory within my container before I build and push the image to AWS? I just have no idea where to put my training data, etc.

SageMaker automates the deployment of the Docker image with your code using a channel -> local-folder convention. Everything that you define as a channel in your input data configuration is copied to the local Docker file system under the /opt/ml/input/data/ folder, using the name of the channel as the name of the sub-folder. For example, SageMaker copies the data for these channels:
{
    "train": {
        "ContentType": "trainingContentType",
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    },
    "evaluation": {
        "ContentType": "evalContentType",
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    },
    "validation": {
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    }
}
to:
/opt/ml/input/data/train
/opt/ml/input/data/evaluation
/opt/ml/input/data/validation
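
As a rough illustration of the convention, here is a minimal Python sketch of how training code inside the container might locate the files for one channel, assuming File input mode. The helper name list_channel_files and its base_dir parameter are hypothetical (they are not part of any SageMaker API); base_dir only exists so the function can be tried outside a container.

```python
import os

# Default location where File-mode channel data appears inside the container.
DEFAULT_INPUT_DIR = "/opt/ml/input/data"


def list_channel_files(channel, base_dir=DEFAULT_INPUT_DIR):
    """Return the sorted file paths found under <base_dir>/<channel>.

    Both the function name and `base_dir` are illustrative assumptions,
    not SageMaker API; an unknown channel yields an empty list.
    """
    channel_dir = os.path.join(base_dir, channel)
    if not os.path.isdir(channel_dir):
        return []
    return sorted(
        os.path.join(channel_dir, name)
        for name in os.listdir(channel_dir)
        if os.path.isfile(os.path.join(channel_dir, name))
    )
```

Inside a training container, list_channel_files("train") would then list whatever was copied from S3 for the "train" channel.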

When creating your custom model on AWS SageMaker, you can store the Docker container with your inference code on ECR, while keeping your model artifacts on S3. You can then specify the S3 path to those artifacts when creating the model (when using Boto3's create_model, for example). This may simplify your solution, because you don't have to re-upload your Docker container every time you need to change your artifacts (though you will need to re-create your model on SageMaker).
The same goes for your data sets. SageMaker's Batch Transform feature allows you to feed any of your data sets stored on S3 directly into your model without needing to keep them in your Docker container. This really helps if you want to run your model on many different data sets without needing to re-upload your image.
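
A minimal sketch of the create_model idea described above, assuming Boto3 is installed and AWS credentials are configured. The helper names are hypothetical, and every value passed in (model name, ECR image URI, S3 path, role ARN) is a placeholder you would replace with your own; the artifacts on S3 are expected to be a single .tar.gz archive.

```python
def build_create_model_request(model_name, image_uri, model_data_url, role_arn):
    """Build the keyword arguments for SageMaker's create_model call.

    The function name is an illustrative assumption; the dictionary keys are
    the request fields of the create_model operation.
    """
    return {
        "ModelName": model_name,
        "PrimaryContainer": {
            "Image": image_uri,              # inference image stored in ECR
            "ModelDataUrl": model_data_url,  # model artifacts (.tar.gz) on S3
        },
        "ExecutionRoleArn": role_arn,
    }


def create_model(request):
    """Send the request with Boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("sagemaker")
    return client.create_model(**request)
```

Changing the artifacts then only means uploading a new archive to S3 and calling create_model again with a new ModelDataUrl; the image in ECR stays untouched.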

Related

Loading TensorFlow.js model from File Server

I am trying to load a TensorFlow.js model over HTTP. TensorFlow.js requires me to store the 'model.json' and 'weights.bin' files in the same folder, but I can only pass 'model.json' as a parameter; it refers to the binary file by itself. That is how it works as far as I know.
For now, in the local environment, I am loading the model from localhost (http://127.0.0.1:8080) and it works fine.
However, the actual application accepts HTTPS only. So I tried storing the model and weights in the same S3 bucket and serving them via Lambda, but it seems that only 'model.json' is retrieved. I am also thinking of using an EC2 instance running a Python Flask server, but the result seems to be the same: only model.json is retrieved, not the binary files.
Is there any way to retrieve 'model.json' so that it correctly refers to the weight file? Is there any way to host a file server remotely over HTTPS?
TFJS downloads the model JSON, parses it, and uses whatever paths are specified in the JSON. You can edit that file and set any URL you want for the weights.
Alternatively, you can use lower-level methods to load the weights manually (in case you want a custom loader, etc.), but leave that for later, until you're more comfortable with TFJS.

Access to the container file system

I've pushed a Node application to CF and I want to access the file system inside the application container. I didn't find many docs about the application container (warden) file system or how I should access it from inside the application. In Node.js you would typically use the fs module...
You can use the file system to get to your application files the same way you would when running the application locally. You can use fs.
The file you want to read should be packaged with your application. If "foo.txt" is at the root of your application:
var fs = require("fs");
fs.readFile("foo.txt", "utf8", function(error, data) {
    if (error) throw error; // e.g. the file was not packaged with the app
    console.log(data);
});
Here is an example of using fs in a cloud foundry environment:
http://www.ibm.com/developerworks/cloud/library/cl-bluemix-nodejs-app/
Keep in mind, it is not a good idea to write data to the file system as it is ephemeral, and not shared across instances. Use a database service such as Mongo to save your data.

HTML5 Database Use without Server

Is it possible to use a local database file with HTML5 without using a server? I would like to create a small application that depends on information from a small database. I do not want to host a server just to pull information. Is it possible to create a database file and pull information from the local files?
Depends on the following:
The type of application you want to build:
- Normal website with some data being pulled from a local storage;
- Special purpose hosted website / application with data generated by the user;
- Special purpose local application with a dedicated platform (a particular browser) and with access to the browser's non-web API -- in order to access the browser's own persistent storage methods (file storage, SQLite etc.);
- Special purpose local application with a dedicated environment -- in order to deploy the application with a local web server and database;
Available options:
- Indexed DB
- Web Storage
- XML files used for storing data and XSLT stylesheets for translating the data into HTML;
Indexed DB and Web Storage are available in some browsers, but you need to make sure the targeted browsers support them. Their features aren't quite as complete and flexible as those of SQL RDBMSs, but they may fit the bill if your application doesn't need all that flexibility.
XML files can contain the data you want to be shown to the user and they can be updated manually (not by the user) or dynamically (by a server script).
For dynamic updating the content of the XML is kept in JavaScript and manipulated / altered (using the XML DOM) and when the session is over the XML content is sent to the server to entirely replace the previous XML file. This works OK if the individual users have a file each and they never write to each other's files.
Reading local files:
Normal file access is prohibited (for security reasons) to all local (JavaScript) code, which means that "having" a file locally implies either downloading it from a known source (a server) or asking the user to offer access to a local file.
Asking the user to offer access to a local file implies offering the user a "file input", like for uploads but without actually uploading the file.
After a file has been selected, using the File API to read it should be fairly simple.
This workflow would involve the user "giving" you the database on every page refresh -- but since it's a one page thing it would mean giving you the data on every session as long as your script does not refresh the page.
You can use localStorage, or you can run a server from your own computer. You can use Wamp or Xampp, which use Apache and MySQL.
What I'm looking for is a little more robust than a cookie. I am making a web application for a friend that will be one page, with a list of names on the page. The person wants to be able to add names to the list, but does not want to use a web server. They just want the files locally on a computer: a folder called test-app, with index.html, and possibly a database file that can be stored in the web browser, or a way to save information to the web browser for repeated use.

Manually add entity to empty Google App Engine DataStore

From the tutorial, which I confirmed by creating a simple project, the index.yaml file is auto-generated when a query is run. What I further observe is that until then the admin console (http://localhost:8080/_ah/admin/datastore) does not show the data-store.
My problem is this: I have a project for which data/entities are to be added manually through the datastore admin console. The website is only used to display/retrieve data, not to add data to the data-store.
How do I get my data-store to appear on the console so I can add data?
Yes, I tried retrieving from the empty data-store through the browser just so I could get the index.yaml to populate, etc. But that does not work.
The easiest way is probably just to create a small python script inside your project folder and create your entities in script. Assign it to a URL handler that you'll use once, then disable.
You can even do it from the python shell. It's very useful for debugging, but you'll need to set it up once.
http://alex.cloudware.it/2012/02/your-app-engine-app-in-python-shell.html
In order to do the same on production, use the remote_api:
https://developers.google.com/appengine/articles/remote_api
This is a very strange question.
The automatic creation of index.yaml only happens locally, and is simply to help you create that file and upload it to AppEngine. There is no automatic creation or update of that file once it's on the server: and as the documentation explains, no queries can be run unless the relevant index already exists in index.yaml.
Since you need indexes to run queries, you must create that file locally - either manually, or by running the relevant queries against your development datastore - then upload it along with your app.
However, this has nothing at all to do with whether the datastore viewer appears in the admin. Online, it will always show, but only entity kinds that actually have an instance in the store will be shown. The datastore viewer knows nothing about your models, it only knows about kinds that exist in the datastore.
On your development server you can use the interactive console to create/instantiate/save an entity, which should cause the entity class to appear in the datastore interface, like so:
from google.appengine.ext import ndb

class YourEntityModel(ndb.Model):
    pass

YourEntityModel().put()

How do you upload data in bulk to Google App Engine Datastore?

I have about 4000 records that I need to upload to Datastore. They are currently in CSV format. I'd appreciate it if someone would point me to or explain how to upload data in bulk to GAE.
You can use the bulkloader.py tool:
The bulkloader.py tool included with the Python SDK can upload data to your application's datastore. With just a little bit of set-up, you can create new datastore entities from CSV files.
I don't have the perfect solution, but I suggest you have a go with the App Engine Console. App Engine Console is a free plugin that lets you run an interactive Python interpreter in your production environment. It's helpful for one-off data manipulation (such as initial data imports) for several reasons:
It's the good old read-eval-print interpreter. You can do things one at a time instead of having to write the perfect import code all at once and running it in batch.
You have interactive access to your own data model, so you can read/update/delete objects from the data store.
You have interactive access to the URL Fetch API, so you can pull data down piece by piece.
I suggest something like the following:
1. Get your data model working in your development environment.
2. Split your CSV records into chunks of under 1,000. Publish them somewhere like Amazon S3 or any other URL.
3. Install App Engine Console in your project and push it up to production.
4. Log in to the console. (Only admins can use the console so you should be safe. You can even configure it to return HTTP 404 to "cloak" from unauthorized users.)
5. For each chunk of your CSV:
   5.1. Use URLFetch to pull down a chunk of data.
   5.2. Use the built-in csv module to chop up your data until you have a list of useful data structures (most likely a list of lists or something like that).
   5.3. Write a for loop, iterating through each data structure in the list:
        - Create a data object with all correct properties
        - put() it into the data store
You should find that after one iteration through #5, then you can either copy and paste, or else write simple functions to speed up your import task. Also, with fetching and processing your data in steps 5.1 and 5.2, you can take your time until you are sure that you have it perfect.
(Note, App Engine Console currently works best with Firefox.)
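
The splitting and parsing steps from the answer above can be sketched with the standard library alone. This is only an illustration in Python 3 (the surrounding answers use Python 2 idioms); the function names parse_csv and split_into_chunks and the default chunk size of 999 are assumptions chosen to match "chunks of under 1,000", and the URLFetch and put() parts are left out because they only run inside App Engine.

```python
import csv
import io


def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def split_into_chunks(rows, max_size=999):
    """Split rows into consecutive chunks of at most `max_size` rows each."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [rows[i:i + max_size] for i in range(0, len(rows), max_size)]
```

With 4000 records and the default size, this yields five chunks: four of 999 rows and one of 4 rows.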
You can do this by using the Remote API and operations on multiple entities. I will show an example with NDB in Python, where our Test.csv contains the following values separated by semicolons:
1;2;3;4
5;6;7;8
First we need to import modules:
import csv
from TestData import TestData
from google.appengine.ext import ndb
from google.appengine.ext.remote_api import remote_api_stub
Then we need to create remote api stub:
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func, 'your-app-id.appspot.com')
For more information on using remote api have a look at this answer.
Then comes the main code, which basically does the following things:
1. Opens the Test.csv file.
2. Sets the delimiter. We are using a semicolon.
3. Creates a list of entities, using one of two options:
   - map and reduce functions;
   - a list comprehension.
4. Batch puts the whole list of entities.
Main code:
# Open csv file for reading.
with open('Test.csv', 'rb') as csv_file:
    # Set delimiter.
    reader = csv.reader(csv_file, delimiter=';')
    # Option 1: reduce the 2D list into a 1D list, then map every element into an entity.
    test_data_list = map(lambda number: TestData(number=int(number)),
                         reduce(lambda acc, row: acc + row, reader))
    # Option 2: use a list comprehension instead. Use only one of the two options,
    # because the reader can only be iterated once.
    # test_data_list = [TestData(number=int(number)) for row in reader for number in row]
# Batch put whole list into HRD.
ndb.put_multi(test_data_list)
The put_multi operation also takes care of batching an appropriate number of entities into each HTTP POST request.
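
One detail worth keeping in mind with the main code: csv.reader is a one-pass iterator, so the map/reduce version and the list comprehension are alternatives, and whichever runs second over the same reader sees no rows. A small Python 3 sketch of that behaviour, using the same sample data held in memory instead of Test.csv and plain integers instead of TestData entities:

```python
import csv
import io

# Same sample data as Test.csv, held in memory for the demonstration.
sample = io.StringIO("1;2;3;4\n5;6;7;8\n")
reader = csv.reader(sample, delimiter=";")

# The first pass consumes every row of the reader.
first_pass = [int(number) for row in reader for number in row]

# The second pass finds the reader exhausted and produces an empty list.
second_pass = [int(number) for row in reader for number in row]
```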
Have a look at this documentation for more information:
CSV File Reading and Writing
Using the Remote API in a Local Client
Operations on Multiple Keys or Entities
NDB functions
In later versions of the App Engine SDK, you can upload data using appcfg.py.
See appcfg.py.
