Model output folder in SageMaker

I am trying to run a training job using the SageMaker SDK.
I set base_job_name to base-job-name and model_dir to s3://my-bucket/model-output/. The trained model, however, ends up at s3://my-bucket/model-output/base-job-name-2020-10-12-21-30-42-748/output.
Can I do something to remove the date-time part from the base-job-name folder? It is perfectly fine if files get overwritten as a result.
I couldn't locate any property in the documentation that would let me set this.
This is how I am creating the estimator:
estimator = TensorFlow(
    base_job_name='base-job-name',
    entry_point='model.py',
    source_dir=source_dir,
    output_path='s3://my-bucket/model-output/',
    model_dir='s3://my-bucket/model-output/',
    instance_type='ml.m5.large',
    instance_count=1,
    role=my_role,
    framework_version='2.2.0',
    py_version='py37',
    subnets=subnets,
    security_group_ids=security_group_ids,
    sagemaker_session=sagemaker_sess,
    tags=tags
)

You are not able to remove the date-time stamp from the output path. The reason is that if you create the estimator and then call its .fit() method more than once, you would otherwise overwrite the output model data, event data, etc. from the previous run.
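If you still want a fixed, timestamp-free location, one possible workaround (not part of the original answer; the destination bucket and key below are assumptions based on the question) is to copy the model artifact to a stable key after training finishes:

import boto3
from urllib.parse import urlparse

# After estimator.fit(...) has completed, estimator.model_data holds the S3 URI
# of the timestamped model.tar.gz produced by the training job.
src = urlparse(estimator.model_data)

s3 = boto3.client('s3')
s3.copy_object(
    Bucket='my-bucket',                         # assumed destination bucket
    Key='model-output/latest/model.tar.gz',     # fixed, timestamp-free key (hypothetical)
    CopySource={'Bucket': src.netloc, 'Key': src.path.lstrip('/')}
)

This keeps the job outputs intact while always exposing the latest artifact under one predictable key.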

Related

Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?
The docs suggest that "you need to specify a checkpoint output path through hyperparameters", but how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)
# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers.
(The example enables Spot instances, but you can use on-demand).
In hyperparameters you set: 'output_dir':'/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
You add code to train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists; if so, resume training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
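For completeness, here is a minimal sketch of the estimator-side configuration described above (the checkpoint S3 URI is a placeholder, not from the original post):

hyperparameters = {
    # ... your existing hyperparameters ...
    'output_dir': '/opt/ml/checkpoints',   # where the Trainer writes checkpoints
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    # Re-use the same URI across jobs so a new job picks up the last checkpoint.
    checkpoint_s3_uri='s3://my-bucket/checkpoints/my-model',
    checkpoint_local_path='/opt/ml/checkpoints',
)

With this in place, running fit() again against the same checkpoint_s3_uri lets train.py resume from the last checkpoint as shown above instead of starting over.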

Create timer (count down / count up) in Odoo

I want to create a timer in an Odoo view that supports run/pause, something like a soccer match clock, in min:sec format.
I tried the code below but it generated an error:
import time
from threading import Thread

@api.one
def timer_th(self):
    timer_thread = Thread(target=self.timer)
    timer_thread.start()

def timer(self):
    while self.current_time <= self.duration:
        time.sleep(1)
        self.current_time += 1
It gave me an "AttributeError: environments" error.
When I used the code without the thread it worked, but the GUI wasn't responsive.
If you want to schedule a piece of code you can use the ir.cron model.
It is meant for automating actions, but I don't know if you can do the start/pause thing with it.
More details in the docs:
http://odoo-development.readthedocs.io/en/latest/odoo/models/ir.cron.html
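As a rough illustration only (the model name my.match, the method _tick_timers and the field values are hypothetical, and the exact ir.cron fields vary between Odoo versions), a scheduled action could be created from Python like this:

# Hypothetical sketch: register a cron job that ticks all running timers once a minute.
self.env['ir.cron'].create({
    'name': 'Tick match timers',
    'model_id': self.env['ir.model'].search([('model', '=', 'my.match')], limit=1).id,
    'state': 'code',
    'code': 'model._tick_timers()',   # hypothetical method that advances current_time
    'interval_number': 1,
    'interval_type': 'minutes',
})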

boost log every hour

I'm using Boost.Log and I want to implement a basic logging policy: a new error log at the beginning of each hour (if any errors occur), named like "file_%Y%m%d%H.log".
I have 2 problems with this boost library:
1. How to rotate file at the beginning of each hour?
This isn't possible with the rotation_at_time_interval parameter because it creates a new file relative to the first record written to the file, so the hour in the file name doesn't follow that rule. Is it possible to have multiple rotation_at_time_point settings for one sink, or is there some other solution?
2. When a file exceeds some size I want it to start a new file, and in that case it should append an index to the file name. With the rotation_size parameter and %N in the file name, N keeps incrementing for as long as the application is running. I want N to be reset at the beginning of each hour, just as my file name changes. Does anybody have any idea how to do that with Boost.Log?
This is a basic principle of log file creation in industry. I really don't understand how this can't be done with a library that is dedicated to creating log files.
The library itself doesn't provide a way to rotate the file at the beginning of every hour, but I had the same problem, so I used a function wrapper that returns true at the beginning of every hour.
I find this approach better for me, because I can control the efficiency of the code.
Based on the example from boost.org:
namespace sinks = boost::log::sinks;
namespace keywords = boost::log::keywords;

// Custom predicate: return true at the beginning of every hour to trigger rotation.
bool is_it_time_to_rotate();

void init_logging()
{
    boost::shared_ptr< sinks::text_file_backend > backend =
        boost::make_shared< sinks::text_file_backend >(
            keywords::file_name = "file_%5N.log",
            keywords::time_based_rotation = &is_it_time_to_rotate
        );
}
As for the second question, I don't really understand it well.

Importing Excel file with dynamic name into SQL table via SSIS?

I've done a few searches here, and while some issues are similar, they don't seem to be exactly what I need.
What I'm trying to do is import an Excel file into a SQL table via SSIS, but the problem is that I will never know the exact filename. We get files at no steady interval, and the file usually has a date/month in the name. For instance, our current file is "Census Data - May 2013.xls". We will only ever load ONE file at a time, so I don't need to loop through a directory for multiple Excel files.
My concept is that I can take this file, copy it to a "Loading" directory, and load it from there. At the start of the package, I will first clear out the loading directory, then scan the original directory for an Excel file, copy it to the loading directory and then load it into SQL. I suppose I may have to store the file names somewhere so I don't copy the same file into the loading directory in subsequent months, but I'm not really sure of the best way to handle that.
I've pretty much got everything down except the part that scans the directory for the Excel file and copies it to the loading directory. I've taken the majority of my info from this page, which (again) is close to what I want to do but not quite exactly the solution I need.
Can anyone get me over the finish line? I can't seem to get the Excel Connection Manager right (this is my first time using variables), and I can't figure out how to get the file into the Loading directory.
Problem statement
How do I dynamically identify a file name?
You will require some mechanism to inspect the contents of a folder and see what exists. Specifically, you are looking for an Excel file in your "Loading" directory. You know the file extension and that is it.
Resolution A
Use a ForEach File Enumerator.
Configure the Enumerator with an Expression on FileSpec of *.xls or *.xlsx depending on which flavor of Excel you're dealing with.
Add another Expression on Directory to be your Loading directory.
I typically create SSIS Variables named FolderInput and FileMask and assign those in the Enumerator.
Now when you run your package, the Enumerator is going to look in Directory and find all the files that match the FileSpec.
Something needs to be done with what is found. You need to use the file name that the Enumerator returns. That's done through the Variable Mappings tab. I created a third Variable called CurrentFileName and assigned it the result of the enumerator.
If you put a Script Task inside the ForEach Enumerator, you should be able to see that the value in the "Locals" window for @[User::CurrentFileName] has updated from whatever the design-time value was to the "real" file name.
Resolution B
Use a Script Task.
You will still need to create a Variable to hold the current file name and it probably won't hurt to also have the FolderInput and FileMask Variables available. Set the former as ReadWrite and the latter as ReadOnly variables.
Choose the .NET language of your choice. I'm using C#. The method System.IO.Directory.EnumerateFiles does the directory scan:
using System;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_fe2ea536a97842b1a760b271f190721e
{
    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            string folderInput = Dts.Variables["User::FolderInput"].Value.ToString();
            string fileMask = Dts.Variables["User::FileMask"].Value.ToString();

            try
            {
                var files = Directory.EnumerateFiles(folderInput, fileMask, SearchOption.AllDirectories);
                foreach (string currentFile in files)
                {
                    Dts.Variables["User::CurrentFileName"].Value = currentFile;
                    break;
                }
            }
            catch (Exception e)
            {
                Dts.Events.FireError(0, "Script overkill", e.ToString(), string.Empty, 0);
            }

            Dts.TaskResult = (int)ScriptResults.Success;
        }

        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
    }
}
Decision tree
Given the two resolutions to the above problem, how do you choose? Normally people say "it depends", but the only case where it would really depend is if the process should stop or error out when more than one file exists in the Loading folder. That's a case where the ForEach enumerator would be more cumbersome than a Script Task. Otherwise, as I stated in my original response, that adds cost to your project for development, testing, and maintenance for no appreciable gain.
Bits and bobs
Further addressing nuances in the question: configuring Excel - you'll need to be more specific about what isn't working. Both Siva's SO answer and the linked blogspot article show how to use the value of the Variable I call CurrentFileName to ensure the Excel file is pointing to the "right" file.
You will need to set DelayValidation to True on both the Connection Manager and the Data Flow, as the design-time value of the Variable will not be valid when the package begins execution. See this answer for a longer explanation, but again, Siva called that out in their SO answer.

NDB query giving different results on local versus production environment

I am banging my head against a wall over this and hoping you can tell me the very simple thing I have overlooked in my sleep-deprived/noob state.
Very simply, I am doing a query and the type of object returned on my local machine is different from what gets returned once I deploy the application.
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
On my local machine the above produces a MatchRealTimeStatsModel object, so I can run the following two lines without a problem:
logging.info(match) # outputs a MatchRealTimeStatsModel object
logging.info(match.match) # outputs a dictionary from json data
When the above two lines are run on Google's machines, however, I get the following:
logging.info(match) # outputs a dictionary from json data
logging.info(match.match) # AttributeError: 'dict' object has no attribute 'match'
Any suggestions as to what might be causing this? I cleared the data store and did everything I could think of to clean the GAE environment.
Edit #1: Adding MatchRealTimeStatsModel code:
class MatchRealTimeStatsModel(ndb.Model):
    match = ndb.JsonProperty()

    @classmethod
    def queryMatch(cls, ancestor_key):
        return cls.query(ancestor=ancestor_key).fetch()
And here is the actual call:
ancestor_key = ndb.Key('MatchRealTimeStatsModel', matchUniqueUrl)
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
Perhaps you are using a different version of your code locally than in prod? Try resetting your copy of the source code in both places.
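To narrow it down, a small diagnostic like the following (my own suggestion, not from the original answer) can confirm what each environment is actually returning:

import logging

# Log the concrete type of the first query result in both environments;
# locally it should be MatchRealTimeStatsModel, so seeing a plain dict in prod
# would point at stale or mismatched deployed code rather than the query itself.
results = MatchRealTimeStatsModel.queryMatch(ancestor_key)
if results:
    logging.info('first result type: %s', type(results[0]).__name__)
else:
    logging.info('query returned no results')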
