Image-classification-transfer-learning SageMaker issue

I'm getting an error like this while trying image classification with SageMaker:
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'ml.t2.medium' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]

The ml.t2.medium instance type is not available on SageMaker Training as of this writing.
You can refer to https://aws.amazon.com/sagemaker/pricing/ to see the instance types supported for each SageMaker component in your region.
You should also check if the algorithm you are running has additional hardware constraints. For instance, the Image Classification Algorithm doc calls out that it requires GPU instances for training:
For image classification, we support the following GPU instances for training: ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge and ml.p3.16xlarge. We recommend using GPU instances with more memory for training with large batch sizes. However, both CPU (such as C4) and GPU (such as P2 and P3) instances can be used for the inference. You can also run the algorithm on multi-GPU and multi-machine settings for distributed training.
Both P2 and P3 instances are supported in the image classification algorithm.
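For illustration, here is a minimal sketch of launching the built-in image-classification algorithm on a supported GPU instance type (the bucket paths are placeholders, and it assumes version 2 of the SageMaker Python SDK):

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# look up the built-in image-classification container for the current region
training_image = sagemaker.image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(
    image_uri=training_image,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p2.xlarge",  # a GPU type from the enum in the error message
    output_path="s3://my-bucket/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})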

Related

Metrics for any step of Sagemaker pipeline (not just TrainingStep)

My understanding is that in order to compare different trials of a pipeline, the metrics can only be obtained from the TrainingStep, using the metric_definitions argument for an Estimator.
In my pipeline, I extract metrics in the evaluation step that follows the training. Is it possible to record metrics there that are then tracked for each trial?
SageMaker suggests using Property Files and JsonGet for any step that needs them. This approach is designed for conditional steps within the pipeline, but it also works for simply persisting results somewhere.
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)

step_eval = ProcessingStep(
    # ...
    property_files=[evaluation_report]
)
and in your processor script:
import json

report_dict = {}  # your report
evaluation_path = "/opt/ml/processing/evaluation/evaluation.json"
with open(evaluation_path, "w") as f:
    f.write(json.dumps(report_dict))
You can read this file in the pipeline steps directly with JsonGet.
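For example, a minimal sketch of pulling a value out of that property file in a later step (the JSON path is an assumption about your report's layout; recent SDK versions take a step_name argument):

from sagemaker.workflow.functions import JsonGet

accuracy = JsonGet(
    step_name=step_eval.name,           # the ProcessingStep defined above
    property_file=evaluation_report,
    json_path="metrics.accuracy.value"  # hypothetical path into evaluation.json
)
# the value can then drive a ConditionStep or be attached to model metrics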

Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters" --> how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers.
(The example enables Spot instances, but you can use on-demand instances as well.)
In hyperparameters you set: 'output_dir':'/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
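Putting the two together, a minimal sketch of the estimator configuration (the S3 URI is a placeholder):

# 'output_dir' points the Trainer at the directory SageMaker syncs to S3
hyperparameters['output_dir'] = '/opt/ml/checkpoints'

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    # reuse this URI across jobs so later jobs see earlier checkpoints
    checkpoint_s3_uri='s3://my-bucket/my-model/checkpoints',  # placeholder
)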
You add code to train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint

# check whether a checkpoint already exists; if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
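With this in place, launching a new training job with the same checkpoint_s3_uri downloads the previously saved checkpoints into /opt/ml/checkpoints before train.py starts, so the run resumes from the last checkpoint instead of reinitialising.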

cuBLAS matrix-matrix multiplication gives INTERNAL ERROR when applied to matrices with one very long dimension on multiple GPUs

What I tried to do was simply apply cublasDgemm (matrix-matrix multiplication) to several matrices whose elements are of type double (8 bytes), all of which have one very large dimension. In my case, the matrices are 12755046 by 46. Simply put, A[46,12755046] * B_i[12755046,46] = C_i[46,46], where i = 1,2,3,....
The machine has 128 GB of memory and two RTX 2080 Ti cards (11 GB of GPU memory each), so my original strategy was to distribute the B_i across the two GPUs. However, I always get INTERNAL ERROR when I execute my code on two GPUs.
So I worked around this problem by trying three things:
1. use one GPU only. No error.
2. downsize the matrix size but keep using two GPUs. No error.
3. use cublasXt which implicitly uses two GPUs. No error.
Though the problem is worked around, I am still interested in why my original plan did not work for matrices with one large dimension. I am guessing this could be due to some internal limitation in cuBLAS, or some configuration I missed?
I attached my simplified code here to illustrate my original plan:
const int nGPUs = 2;  // two GPUs in this machine
double *A, *B[2], *C[2];
cudaMallocManaged(&A,    46*12755046*sizeof(double));
cudaMallocManaged(&B[0], 46*12755046*sizeof(double));
cudaMallocManaged(&B[1], 46*12755046*sizeof(double));
cudaMallocManaged(&C[0], 46*46*sizeof(double));  // each C_i is only 46 x 46
cudaMallocManaged(&C[1], 46*46*sizeof(double));
givevalueto(A);
givevalueto(B[0]);
givevalueto(B[1]);

double alpha = 1.0;
double beta = 0.0;

cublasHandle_t handle[nGPUs];
int iGPU;
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cublasCreate(&handle[iGPU]);
}

// one Dgemm per GPU: C[iGPU] = alpha * A * B[iGPU] + beta * C[iGPU]
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU);
    cublasDgemm(handle[iGPU], CUBLAS_OP_N, CUBLAS_OP_N,
                46, 46, 12755046,
                &alpha, A, 46, B[iGPU], 12755046, &beta, C[iGPU], 46);
}

for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU);
    cudaDeviceSynchronize();
}

for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaFree(B[iGPU]);
}
A cuBLAS handle is tied to the device that was active when the handle was created.
From the documentation for cublasCreate:
The CUBLAS library context is tied to the current CUDA device.
See also the description of the cublas context:
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate().
You can fix your code with:
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU); // add this line
    cublasCreate(&handle[iGPU]);
}
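The same pairing rule is why the compute loop sets the device before each cublasDgemm call: any call made through a handle must be issued while that handle's device is current.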

C (Windows) - GPU usage (load %)

According to many sources on the Internet, it's possible to get GPU usage (load) using D3DKMTQueryStatistics.
How to query GPU Usage in DirectX?
I've succeeded in getting memory information using code from here with slight modifications:
http://processhacker.sourceforge.net/forums/viewtopic.php?t=325#p1338
However, I didn't find a member of the D3DKMT_QUERYSTATISTICS structure that would carry information about GPU usage.
Look at the EtpUpdateNodeInformation function in gpumon.c. It queries per-process statistics for each GPU node. There can be several processing nodes per graphics card:
queryStatistics.Type = D3DKMT_QUERYSTATISTICS_PROCESS_NODE;
...
totalRunningTime += queryStatistics.QueryResult.ProcessNodeInformation.RunningTime.QuadPart;
...
PhUpdateDelta(&Block->GpuRunningTimeDelta, totalRunningTime);
...
block->GpuNodeUsage = (FLOAT)(block->GpuRunningTimeDelta.Delta / (elapsedTime * EtGpuNodeBitMapBitsSet));
It gathers the process running time across nodes and divides the delta by the actual elapsed time span.
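In Python terms, the arithmetic amounts to this (a sketch with made-up sample values; the names mirror the gpumon.c variables, not any D3DKMT API):

# running time accumulated over all GPU nodes at two sample points (same units)
prev_total_running_time = 1_000_000
curr_total_running_time = 1_450_000
elapsed_time = 1_000_000   # wall-clock time between the two samples
num_nodes = 1              # EtGpuNodeBitMapBitsSet: number of active GPU nodes

gpu_usage = (curr_total_running_time - prev_total_running_time) / (elapsed_time * num_nodes)
print(f"GPU usage: {gpu_usage:.0%}")  # -> 45%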

SystemVerilog: How to create an interface which is an array of simpler interfaces?

I'm attempting to create an interface that is an array of a simpler interface. In VHDL I could simply define two types, a record and an array of records. But how to do this in SystemVerilog? Here's what I've tried:
`define MAX_TC 15
...
interface scorecard_if;
    score_if if_score [`MAX_TC];
endinterface

interface score_if;
    integer tc;
    integer pass;
    integer fail;
    bit flag_match;
    real bandwidth;
endinterface
But I get an error from Aldec Active-HDL:
Error: VCP2571 TestBench/m3_test_load_tb_interfaces.sv : (53, 34):
Instantiations must have brackets (): if_score.
I also tried
interface scorecard_if;
    score_if [`MAX_TC] if_score;
endinterface

and

interface scorecard_if;
    score_if [`MAX_TC];
endinterface
but both of those just resulted in "Unexpected token" syntax errors.
Is it even possible to do this? There are two workarounds I can think of if there isn't. First, I could define all the individual elements of score_if as unpacked arrays:
interface score_if;
    integer tc [1:`MAX_TC];
    integer pass [1:`MAX_TC];
    integer fail [1:`MAX_TC];
    bit flag_match [1:`MAX_TC];
    real bandwidth [1:`MAX_TC];
endinterface
This compiles, but it's ugly in that I can no longer refer to a single score as a group.
I might also be able to instantiate an array of score_if (using the original code), but I really want to instantiate scorecard_if inside a generate loop that would allow me to instantiate a variable number of scorecard_if interfaces based on a parameter.
Just to explain a bit of what I'm trying to do: score_if is supposed to be a record of the score for a given test case, and scorecard_if an array covering all of the test cases. But my testbench has multiple independent stimulus generators, monitors and scorecards to deal with multiple independent modules inside the DUT, where the multiplicity is a parameter.
Part 1: Declaring an array of interfaces
Add parentheses to the end of the interface instantiation. According to IEEE Std 1800-2012, all instantiations of hierarchical instances need the parentheses for the port list, even if the port list is blank. Some tools allow dropping the parentheses if the interface doesn't have any ports in its declaration and the instantiation is simple, but this is not part of the standard. Best practice is to use parentheses for all hierarchical instantiations.
Solution:
score_if if_score [`MAX_TC] ();
Syntax Citations:
§ 25.3 Interface syntax & § A.4.1.2 Interface instantiation
interface_instantiation ::= // from A.4.1.2
interface_identifier [ parameter_value_assignment ] hierarchical_instance { , hierarchical_instance } ;
§ A.4.1.1 Module instantiation
hierarchical_instance ::= name_of_instance ( [ list_of_port_connections ] )
Part 2: Accessing elements of that array
Hierarchical references must be constant. Arrayed hierarchical instances cannot be accessed with dynamic indexes. This has been a rule/limitation since at least IEEE Std 1364. Read more about it in IEEE Std 1800-2012 § 23.6 Hierarchical names; the syntax rule is:
hierarchical_identifier ::= [ $root . ] { identifier constant_bit_select . } identifier
You could use a generate for-loop, as it does a static unroll at compile/elaboration time. The limitation is that you cannot use your display message or accumulate your fail count in the loop. You could use the generate loop to copy the data to a local array and sum that, but that defeats your intention.
An interface is normally a bundle of nets used to connect modules, together with class-based testbenches or shared bus protocols. You are using it as a nested scorecard. A typedef struct would likely be better suited to your purpose. A struct is a data type and does not have the hierarchical-reference limitation that modules and interfaces have. It looks like you were already trying that route in your previous question; not sure why you switched to nesting interfaces.
It looks like you are trying to create a fairly complex test environment. If so, I suggest learning UVM before spending too much time reinventing an advanced testbench architecture. Start with UVM 1.1d, as 1.2 isn't mainstream yet.
This also works:
1. define a "container" interface:
interface environment_if (input serial_clk);
    serial_if eng_if[`NUM_OF_ENGINES](serial_clk);
    virtual serial_if eng_virtual_if[`NUM_OF_ENGINES];
endinterface
2. in the testbench, instantiate env_if, connect serial_if with a generate loop, connect the virtual interfaces to the non-virtual ones, and pass the virtual interfaces to the verification environment:
module testbench;
    ....
    environment_if env_if(serial_clk);
    .....
    dut i_dut(...);

    genvar eng_idx;
    generate
        for(eng_idx = 0; eng_idx < `NUM_OF_ENGINES; eng_idx++) begin
            assign env_if.eng_if[eng_idx].serial_bit = i_dut.engine[eng_idx].serial_bit;
        end
    endgenerate
    ......

    initial begin
        env_if.eng_virtual_if = env_if.eng_if[0:`NUM_OF_ENGINES-1];
        // now possible to iterate over eng_virtual_if[]
        for(int eng_idx = 0; eng_idx < `NUM_OF_ENGINES; eng_idx++)
            uvm_config_db#(virtual serial_if)::set(null, "uvm_test_top.env", "tx_vif", env_if.eng_virtual_if[eng_idx]);
    end
endmodule
