How to add additional files in Sagemaker Pipeline Processing Step - amazon-sagemaker

I want to have additional files which can be imported in preprocess.py file
but i am not able to import these directly.
My directory looks like this:
Want to import from helper_functions directory into preprocess.
I tried to add this in setup.py file but it did not work.
package_data={"pipelines.ha_forecast.helper_functions": ["*.py"]},
One thing which kind of worked was to add this folder in input like this:
inputs = [
ProcessingInput(source=f'{project_name}/{module_name}/helper_functions',
destination="/opt/ml/processing/input/code/helper_functions"),
]
But this was putting the required files in some other directory which I was not able to import anymore.
What is standard way of doing this?

You have to specify the source_dir. Within your script then you can import the modules as you normally do.
source_dir (str or PipelineVariable) – Path (absolute, relative or an
S3 URI) to a directory with any other training source code
dependencies aside from the entry point file (default: None). If
source_dir is an S3 URI, it must point to a tar.gz file. Structure
within this directory are preserved when training on Amazon SageMaker.
Look at the documentation in general for Processing (you have to use FrameworkProcessor and not the specific ones like SKLearnProcessor).
P.S.: The answer is similar to that of the question "How to install additional packages in sagemaker pipeline".
Within the specified folder, there must be the script (in your case preprocess.py), any other files/modules that may be needed, and also eventually the requirements.txt file.
The structure of the folder then will be:
BASE_DIR/
|─ helper_functions/
| |─ your_utils.py
|─ requirements.txt
|─ preprocess.py
Within your preprocess.py, you will call the scripts in a simple way with:
from helper_functions.your_utils import your_class, your_func
So, your code becomes:
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput
BASE_DIR = your_script_dir_path
sklearn_processor = FrameworkProcessor(
estimator_cls=SKLearn,
framework_version=framework_version,
instance_type=processing_instance_type,
instance_count=processing_instance_count,
base_job_name=base_job_name,
sagemaker_session=pipeline_session,
role=role
)
step_args = sklearn_processor.run(
inputs=[your_inputs],
outputs=[your_outputs],
code="preprocess.py",
source_dir=BASE_DIR,
arguments=[your_arguments],
)
step_process = ProcessingStep(
name="ProcessingName",
step_args=step_args
)
It's a good practice to keep the folders for the various steps separate for each and don't create overlaps.

Related

How to install additional packages in sagemaker pipeline

I want to add dependency packages in my sagemaker pipeline which will be used in Preprocess step.
I have tried to add it in required_packages in setup.py file but it's not working.
I think setup.py file is no use of at all.
required_packages = ["sagemaker==2.93.0", "matplotlib"]
Preprocessing steps:
sklearn_processor = SKLearnProcessor(
framework_version="0.23-1",
instance_type=processing_instance_type,
instance_count=processing_instance_count,
base_job_name=f"{base_job_prefix}/job-name",
sagemaker_session=pipeline_session,
role=role,
)
step_args = sklearn_processor.run(
outputs=[
ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
],
code=os.path.join(BASE_DIR, "preprocess.py"),
arguments=["--input-data", input_data],
)
step_process = ProcessingStep(
name="PreprocessSidData",
step_args=step_args,
)
Pipeline definition:
pipeline = Pipeline(
name=pipeline_name,
parameters=[
processing_instance_type,
processing_instance_count,
training_instance_type,
model_approval_status,
input_data,
],
steps=[step_process],
sagemaker_session=pipeline_session,
)
For each job within the pipeline you should have separate requirements (so you install only the stuff you need in each step and have full control over it).
To do this, you need to use the source_dir parameter:
source_dir (str or PipelineVariable) – Path (absolute, relative or an
S3 URI) to a directory with any other training source code
dependencies aside from the entry point file (default: None). If
source_dir is an S3 URI, it must point to a tar.gz file. Structure
within this directory are preserved when training on Amazon SageMaker.
Look at the documentation in general for Processing (you have to use FrameworkProcessor).
Within the specified folder, there must be the script (in your case preprocess.py), any other files/modules that may be needed, and the requirements.txt file.
The structure of the folder then will be:
BASE_DIR/
|- requirements.txt
|- preprocess.py
It is the common requirements file, nothing different. And it will be used automatically at the start of the instance, without any instruction needed.
So, your code becomes:
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor = FrameworkProcessor(
estimator_cls=SKLearn,
framework_version='0.23-1',
instance_type=processing_instance_type,
instance_count=processing_instance_count,
base_job_name=f"{base_job_prefix}/job-name",
sagemaker_session=pipeline_session,
role=role
)
step_args = sklearn_processor.run(
outputs=[
ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
],
code="preprocess.py",
source_dir=BASE_DIR,
arguments=["--input-data", input_data],
)
step_process = ProcessingStep(
name="PreprocessSidData",
step_args=step_args
)
Note that I changed both the code parameter and the source_dir. It's a good practice to keep the folders for the various steps separate so you have a requirements.txt for each and don't create overlaps.

Image Extractor by AI Habitat produces a configuration error when importing Matterport dataset

I need help understanding the error message, which is along the lines of changing the file name to json because the configuration fails. I have a long error message but pasted the part that is mostly repeated throughout the message:
/Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.stage_config.json
I0412 19:04:17.735939 42397184 AttributesManagerBase.h:296] AttributesManager::createFromJsonOrDefaultInternal (Stage) : Proposing JSON name : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.stage_config.json from original name : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply | This file does not exist.
I0412 19:04:17.736085 42397184 AbstractObjectAttributesManagerBase.h:182] AbstractObjectAttributesManager::createObject (Stage) : Done making attributes with handle : /Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply
I0412 19:04:17.736093 42397184 AbstractObjectAttributesManagerBase.h:189] File (/Users/kyra/Documents/GitHub/habitat-sim/matterport/scans/house1/8194nk5LbLH 13/poisson_meshes/8194nk5LbLH_10.ply) exists but is not a recognized config filename extension, so new default Stage attributes created and registered.
I0412 19:04:17.736124 42397184 SceneDatasetAttributes.cpp:46]
What I did: Ran image extractor after activating Conda env. I modified the image extractor to change the file path to point to a .ply file in the matterport dataset.
Setup: 1)Facebook's AI Habitat-sim built from source,
2)MacBook Air M1,
3)Conda environment with the dependencies (using pip install -r requirements.txt) but habitat-sim is not installed by Conda,
4)Matterport3D dataset (downloaded one house).
Thank you.

unable to access file in resources/static folder spring-boot

I have a file in a location:
/resources/static/fcm-admin
It's absolute path: /home/jitu/project-name/src/main/resources/static/fcm-admin
I have tried to access this file in the following ways
val file = ResourceUtils.getFile("classpath:fcm-admin")
It gives me an error
java.io.FileNotFoundException: class path resource [fcm-admin] cannot be resolved to an absolute file path because it does not exist
I have tried to access the file in various ways but it is not working. I just want to the access file fcm-admin without giving the full absolute path. Anything will be helpful
EDIT:
So I'm able to access the file on local with the below code -
val file = ResourceUtils.getFile("classpath:static/fcm-admin")
But I'm not able to access it on the production server. And I'm getting below exception
class path resource [static/fcm-admin] cannot be resolved to an absolute file path because it does not reside in the file system: jar:file:/var/app/current/application.jar!/BOOT-INF/classes!/static/apple-app-site-association
You forgot to add static in the path
val file = ResourceUtils.getFile("classpath:static/fcm-admin")
EDIT because of comment
Load your file from Classpath:
val file = this.javaClass.classLoader.getResource("/static/fcm-admin").file;
When you load a resource using the class loader it will be start in the root of your classpath.
Local server can work with ClassPathResource but will fail on production
To solve error on production, change your code to
import org.springframework.core.io.ResourceLoader;
import java.io.InputStream;
import org.springframework.core.io.InputStreamSource;
import org.springframework.core.io.ByteArrayResource;
import org.apache.commons.io.IOUtils;
#Autowired
private ResourceLoader resourceLoader;
InputStream logoFileStrem = resourceLoader.getResource("classpath:static/images/image.png").getInputStream();
InputStreamSource byteArrayResource = new ByteArrayResource(org.apache.commons.io.IOUtils.toByteArray(logoFileStrem));
I spent so many time, lot of code work on dev but not on prod.
I used SpringBoot with war in production.
To load a file in resources/static the only solution who work in dev and in prod :
val inputStream = Thread.currentThread().contextClassLoader.getResourceAsStream("static/pathToTourFile/nameofFile.extension")
val texte :String = inputStream.bufferedReader().use(BufferedReader::readText)

Is it good to use package.json file in react component file?

I have to use version mentioned in package.json file in front-end(react js) file.
{
"name": "asdfg",
"version": "3.5.2", // want to use this
"description": "description",
"scripts": {}
//etc etc etc
......
}
Send package.json [version] to Angularjs Front end for display purposes
I'd gone through above post and found two ways for the same. but none of them I was asked to implement.
#1. During build process
#2. By creating endpoint
So I want to know the approach below is valid/good or not ?
react-front-end-file.js
import packageJson from '../package.json'; // imported
...
...
// Usage which gives me version - 3.5.2
<div className='app-version'>{packageJson.version}</div>
Let me know if this approach is fine.
The below 2 approaches seems to have either dependency or add an extra implentation which might not be needed
During build process - ( has dependency on module bundler like webpack etc.)
By creating endpoint - ( needs an extra code at server just to get version )
Instead, As package.json is a file which takes json object in it so you can use it to import json and use its any keys mentioned in that file ( version in your case but only constraint here is, you should have access to package.json file after application deployment, so dont forget to move file in deployment environment )
So your approach seems to be fine.
I would do something like this:
In your module bundler, require your package json file and define a global variable and use it wherever you want
e.g. I do something like this in webpack:
const packageJson = require('./package.json')
const plugins = [
new webpack.DefinePlugin(
{
'__APPVERSION__': JSON.stringify(packageJson.version)
}
)
]
React Component:
<div className='app-version'>{__APPVERSION__}</div>

Querying Google AppEngine's Datastore using PHP (through Quercus) and low-level API isn't working

When I query Google AppEngine's datastore using PHP(through Quercus) and the low-level data-access API for an entity, I get an error that the entity doesn't exist, even though I've put it in the datastore previously.
The specific error is "com.caucho.quercus.QuercusException: com.google.appengine.api.datastore.DatastoreService.get: No entity was found matching the key: Test(value1)"
Here's the relevant code -
<?php
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
$testkey = KeyFactory::createKey("Test", "value1");
$ent = new Entity($testkey);
$ent->setProperty("field1", "value2");
$ent->setProperty("field2", "value3");
$dataService = DatastoreServiceFactory::getDatastoreService();
$dataService->put($ent);
echo "Data entered";
try
{
$ent = $dataService->get($testkey);
echo "Data queried - the results are \n";
echo "Field1 has value ".$ent->getProperty("field1")."\n";
echo "Field2 has value ".$ent->getProperty("field2")."\n";
}
catch(EntityNotFoundException $e)
{
echo("<br/>Entity test not found.");
echo("<br/>Stack Trace is:\n");
echo($e);
}
And here's the detailed stack-trace - link.
This same code runs fine in Java (of course after changing the syntax). I wonder what's wrong.
Thanks.
I have found the solution to my problem. It was caused by missing dependencies and I solved it by using the prepackaged PHP Wordpress application available here.
One thing is to be noted. The package overlooked a minor issue in that all files other than the src/ directory need to be in a war/ directory which stays alongside the src/ directory (this as per appengine conventions as mentioned on its documentation). So I organized the files thus myself, put the above PHP file in the war/ directory, and it's working fine on the appengine.

Resources