GPU usage (load %) in C (Windows)

According to many sources on the Internet, it's possible to get the GPU usage (load) using D3DKMTQueryStatistics.
How to query GPU Usage in DirectX?
I've succeeded in getting memory information using code from here, with slight modifications:
http://processhacker.sourceforge.net/forums/viewtopic.php?t=325#p1338
However, I couldn't find a member of the D3DKMT_QUERYSTATISTICS structure that carries information about GPU usage.

Look at the EtpUpdateNodeInformation function in gpumon.c. It queries process statistics per GPU node; there can be several processing nodes per graphics card:
queryStatistics.Type = D3DKMT_QUERYSTATISTICS_PROCESS_NODE
...
totalRunningTime += queryStatistics.QueryResult.ProcessNodeInformation.RunningTime.QuadPart
...
PhUpdateDelta(&Block->GpuRunningTimeDelta, totalRunningTime);
...
block->GpuNodeUsage = (FLOAT)(block->GpuRunningTimeDelta.Delta / (elapsedTime * EtGpuNodeBitMapBitsSet));
It gathers the per-process running time and divides it by the elapsed wall-clock time.
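For illustration, here is a small, hypothetical C sketch of that calculation. The structure and variable names are made up for this example (the real ones are in Process Hacker's gpumon.c); it only shows how the accumulated running time turns into a load percentage.
#include <stdint.h>

typedef struct {
    uint64_t previousRunningTime;   /* running time at the last sample, in 100-ns units */
} GpuNodeSample;

/* runningTimeNow   : sum of RunningTime.QuadPart over the queried nodes
 * elapsedTime100ns : wall-clock time since the last sample, in 100-ns units
 * activeNodeCount  : number of GPU nodes included in the sum              */
static double GpuUsagePercent(GpuNodeSample *s,
                              uint64_t runningTimeNow,
                              uint64_t elapsedTime100ns,
                              uint32_t activeNodeCount)
{
    uint64_t delta = runningTimeNow - s->previousRunningTime;
    s->previousRunningTime = runningTimeNow;

    if (elapsedTime100ns == 0 || activeNodeCount == 0)
        return 0.0;

    /* same idea as gpumon.c: running-time delta divided by the elapsed
     * time, spread over the number of nodes, expressed as a percentage */
    return 100.0 * (double)delta / ((double)elapsedTime100ns * activeNodeCount);
}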

Related

cuBLAS matrix-matrix multiplication gives INTERNAL ERROR when applied to matrices with one very long dimension on multiple GPUs

What I tried to do was simply apply cublasDgemm (matrix-matrix multiplication) to several matrices whose elements are of type "double" (8 bytes) and which all have one dimension that is very large. In my case, the matrix sizes are 12755046 by 46. Put simply, A[46,12755046]*B_i[12755046,46] = C_i[46,46], where i = 1,2,3,....
The machine has 128 GB of memory and two GTX2080Ti cards (11 GB of GPU memory each), so my original strategy was to distribute the B_i across the GPUs. However, I always get INTERNAL ERROR when I execute my code on two GPUs.
I got around the problem by trying three things:
1. use one GPU only. No error.
2. downsize the matrices but keep using two GPUs. No error.
3. use cublasXt, which implicitly uses two GPUs. No error.
Though I have a workaround, I am still interested in why my original plan did not work for matrices with one very large dimension. I am guessing this could be due to some internal limitation in cuBLAS, or that I missed some configuration?
I attached my simplified code here to illustrate my original plan:
I attached my simplified code here to illustrate my original plan:
const int nGPUs = 2;   // two GPUs, as described above
double *A, *B[2], *C[2];
cudaMallocManaged(&A, 46*12755046*sizeof(double));
cudaMallocManaged(&B[0], 46*12755046*sizeof(double));
cudaMallocManaged(&B[1], 46*12755046*sizeof(double));
cudaMallocManaged(&C[0], 46*12755046*sizeof(double));
cudaMallocManaged(&C[1], 46*12755046*sizeof(double));
givevalueto(A);
givevalueto(B[0]);
givevalueto(B[1]);

double alpha = 1.0;
double beta = 0.0;
cublasHandle_t handle[nGPUs];
int iGPU;

// create one cuBLAS handle per GPU
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cublasCreate(&handle[iGPU]);
}

// launch one GEMM per GPU
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cublasDgemm(handle[iGPU],CUBLAS_OP_N,CUBLAS_OP_N,46,46,12755046,&alpha,A,46,B[iGPU],12755046,&beta,C[iGPU],46);
}

// wait for both GPUs to finish
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cudaDeviceSynchronize();
}

for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaFree(B[iGPU]);
}
A cuBLAS handle is tied to the device that was active when the handle was created.
From the documentation for cublasCreate:
The CUBLAS library context is tied to the current CUDA device.
See also the description of the cublas context:
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate().
You can fix your code with:
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU); // add this line
    cublasCreate(&handle[iGPU]);
}
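For reference, here is a minimal, self-contained sketch of the corrected pattern. The dimensions, managed allocations, and two-GPU setup are taken from the question; the fill step is left as a comment, and the handle/memory cleanup is my addition, so treat this as a sketch rather than a drop-in program.
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NGPUS 2
#define M     46
#define K     12755046

int main(void)
{
    double *A, *B[NGPUS], *C[NGPUS];
    cublasHandle_t handle[NGPUS];
    double alpha = 1.0, beta = 0.0;

    cudaMallocManaged(&A, (size_t)M * K * sizeof(double));
    for (int i = 0; i < NGPUS; i++) {
        cudaMallocManaged(&B[i], (size_t)K * M * sizeof(double));
        cudaMallocManaged(&C[i], (size_t)M * M * sizeof(double));
    }
    /* ... fill A and B[i] here ... */

    /* create each handle while its device is current */
    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cublasCreate(&handle[i]);
    }

    /* launch one GEMM per GPU: C[i] = A * B[i] */
    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cublasDgemm(handle[i], CUBLAS_OP_N, CUBLAS_OP_N,
                    M, M, K, &alpha, A, M, B[i], K, &beta, C[i], M);
    }

    /* wait for both devices, then clean up */
    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
        cublasDestroy(handle[i]);
        cudaFree(B[i]);
        cudaFree(C[i]);
    }
    cudaFree(A);
    return 0;
}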

How to update weights when using mini batches?

I am trying to implement mini-batch training in my neural network, instead of the "online" stochastic method of updating the weights after every training sample.
I have developed a somewhat novice neural network in C in which I can adjust the number of neurons in each layer, the activation functions, etc. This is to help me understand neural networks. I have trained the network on the MNIST data set, but it takes around 200 epochs to get down to an error rate of 20% on the training set, which seems very poor to me. I am currently using online stochastic gradient descent to train the network.
What I would like to try instead is mini-batches. I understand the concept that I must accumulate and average the error from each training sample before I propagate the error back. My problem comes in when I want to calculate the changes I must make to the weights. To explain this better, consider a very simple perceptron model: one input, one hidden layer, one output. To calculate the change I need to make to the weight between the input and the hidden unit, I will use the following equation:
∂C/∂w1 = ∂C/∂O * ∂O/∂h * ∂h/∂w1
If you work out the partial derivatives you get:
∂C/∂w1 = (Output - Expected Answer)(w2)(input)
Now this formula says that you need to multiply the back-propagated error by the input. For online stochastic training that makes sense, because you use one input per weight update. For mini-batch training you use many inputs, so which input does the error get multiplied by?
I hope you can assist me with this.
void propogateBack(void){
    //calculate ∂C/∂G
    for (count = 0; count < network.outputs; count++){
        network.g_error[count] = derive_cost((training.answer[training_current]) - (network.g[count]));
    }

    //calculate ∂G/∂O
    for (count = 0; count < network.outputs; count++){
        network.o_error[count] = derive_activation(network.g[count]) * (network.g_error[count]);
    }

    //calculate ∂O/∂S3
    for (count = 0; count < network.h3_neurons; count++){
        network.s3_error[count] = 0;
        for (count2 = 0; count2 < network.outputs; count2++){
            network.s3_error[count] += (network.w4[count2][count]) * (network.o_error[count2]);
        }
    }

    //calculate ∂S3/∂H3
    for (count = 0; count < network.h3_neurons; count++){
        network.h3_error[count] = (derive_activation(network.s3[count])) * (network.s3_error[count]);
    }

    //calculate ∂H3/∂S2
    for (count = 0; count < network.h2_neurons; count++){
        network.s2_error[count] = 0;
        for (count2 = 0; count2 < network.h3_neurons; count2++){
            network.s2_error[count] += (network.w3[count2][count]) * (network.h3_error[count2]);
        }
    }

    //calculate ∂S2/∂H2
    for (count = 0; count < network.h2_neurons; count++){
        network.h2_error[count] = (derive_activation(network.s2[count])) * (network.s2_error[count]);
    }

    //calculate ∂H2/∂S1
    for (count = 0; count < network.h1_neurons; count++){
        network.s1_error[count] = 0;
        for (count2 = 0; count2 < network.h2_neurons; count2++){
            network.s1_error[count] += (network.w2[count2][count]) * (network.h2_error[count2]);
        }
    }

    //calculate ∂S1/∂H1
    for (count = 0; count < network.h1_neurons; count++){
        network.h1_error[count] = (derive_activation(network.s1[count])) * (network.s1_error[count]);
    }
}
void updateWeights(void){
    //////////////////w1
    for (count = 0; count < network.h1_neurons; count++){
        for (count2 = 0; count2 < network.inputs; count2++){
            network.w1[count][count2] -= learning_rate * (network.h1_error[count] * network.input[count2]);
        }
    }

    //////////////////w2
    for (count = 0; count < network.h2_neurons; count++){
        for (count2 = 0; count2 < network.h1_neurons; count2++){
            network.w2[count][count2] -= learning_rate * (network.h2_error[count] * network.s1[count2]);
        }
    }

    //////////////////w3
    for (count = 0; count < network.h3_neurons; count++){
        for (count2 = 0; count2 < network.h2_neurons; count2++){
            network.w3[count][count2] -= learning_rate * (network.h3_error[count] * network.s2[count2]);
        }
    }

    //////////////////w4
    for (count = 0; count < network.outputs; count++){
        for (count2 = 0; count2 < network.h3_neurons; count2++){
            network.w4[count][count2] -= learning_rate * (network.o_error[count] * network.s3[count2]);
        }
    }
}
The code I have attached shows how I do the online stochastic updates. As you can see in the updateWeights() function, the weight updates are based on the input values (dependent on the sample fed in) and the hidden unit values (also dependent on the input sample fed in). So when I have the mini-batch average gradient that I am propagating back, how will I update the weights? Which input values do I use?
OK, so I figured it out. When using mini-batches you should not accumulate and average the error at the output of the network. Each training example's error gets propagated back as it normally would, except that instead of updating the weights you accumulate the changes you would have made to each weight. When you have looped through the mini-batch, you then average the accumulated changes and update the weights accordingly.
I was under the impression that when using mini-batches you do not have to propagate any error back until you have looped through the mini-batch. I was wrong; you still need to do that. The only difference is that you only update the weights once you have looped through your mini-batch.
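To make that concrete, here is a minimal, hypothetical C sketch of the accumulate-then-average scheme. The names (forwardAndBackward, the flat weight and gradient arrays) are stand-ins for illustration and are not the asker's actual functions.
#include <stdlib.h>

/* assumed to exist elsewhere: runs a full forward + backward pass for one
 * training sample and writes the per-weight gradient into sampleGrad       */
void forwardAndBackward(size_t sampleIndex, double *sampleGrad, size_t nWeights);

void trainOneMiniBatch(double *weights, size_t nWeights,
                       size_t firstSample, size_t batchSize,
                       double learningRate)
{
    double *grad = calloc(nWeights, sizeof(double));         /* accumulated gradient  */
    double *sampleGrad = malloc(nWeights * sizeof(double));   /* one sample's gradient */

    for (size_t b = 0; b < batchSize; b++) {
        forwardAndBackward(firstSample + b, sampleGrad, nWeights);
        for (size_t w = 0; w < nWeights; w++)
            grad[w] += sampleGrad[w];          /* accumulate, do not update yet */
    }

    /* one weight update per mini-batch, using the averaged gradient */
    for (size_t w = 0; w < nWeights; w++)
        weights[w] -= learningRate * grad[w] / (double)batchSize;

    free(grad);
    free(sampleGrad);
}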
For mini-batch training you use many inputs, so which input does the error get multiplied by?
"Many inputs" this is a proportion of the dataset size N, which typically segments your data into sizes which are not too large to fit into memory. DL needs Big Data and the full batch cannot fit into most computer systems to process in one go and therefore the mini-batch is necessary.
The error which gets backpropagated is the sum or average error calculated for the data samples in your current mini-batch $X^{{t}}$ which is of size M where $M<N$, $J^{{t}} = 1/m \sum_1^M ( f(x_m^{t})-y_m^{t} )^2$. This is the sum of the squared distances to the target across samples in the batch 't'. This is the forward step and then the backwards propagation of this error is made using the chain rule through the 'neurons' of the network; using this single value of the error for the whole batch. The update of the parameters is based upon this value for this mini-batch.
There are variations to how this scheme is implemented but if you consider your idea of using "many inputs" in the calculation of the parameter update using multiple input samples from the batch, we are averaging over multiple gradients to smooth over the gradient in comparison to stochastic gradient descent.

Image-classification-transfer-learning SageMaker issue

I'm getting an error like this while trying image classification with SageMaker:
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'ml.t2.medium' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
The ml.t2.medium instance type is not available on SageMaker Training as of this writing.
You can refer to https://aws.amazon.com/sagemaker/pricing/ to see the instance types supported by each SageMaker component in the region you are using.
You should also check if the algorithm you are running has additional hardware constraints. For instance, the Image Classification Algorithm doc calls out that it requires GPU instances for training:
For image classification, we support the following GPU instances for training: ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge and ml.p3.16xlarge. We recommend using GPU instances with more memory for training with large batch sizes. However, both CPU (such as C4) and GPU (such as P2 and P3) instances can be used for the inference. You can also run the algorithm on multi-GPU and multi-machine settings for distributed training.
Both P2 and P3 instances are supported in the image classification algorithm.

cudaEvent and gettimeofday report drastically different times

I'm trying to time a loop by using either gettimeofday or cudaEventRecord. However, they report very different results. Here's the pseudo code:
// get time here (start)
while (..)
{
. ..
}
// get time here (stop)
// calculate time
// time = (stop.tv_usec-start.tv_usec)*1.0e-3 + (stop.tv_sec - start.tv_sec); or
// cudaEventElapsedTime(&time,start,stop);
I do not use both of them at the same time but use each separately, and the results are not the same. I also called cudaEventSynchronize(stop) when using cudaEvent. Thanks.
I see a problem in the measuring units. I am not much of a CUDA programmer, but I can tell you about the gettimeofday function: it expresses the time in seconds and microseconds, so the right pseudocode line would be:
// time = (stop.tv_usec-start.tv_usec)*1.0e-6 + (stop.tv_sec - start.tv_sec);
There is a CUDA-specific solution given here: Timing CUDA operations.
I hope this helped.
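For illustration, here is a small, hypothetical sketch that times the same region with both methods, using the corrected 1.0e-6 factor; the kernel and sizes are made up for this example.
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    struct timeval hStart, hStop;
    cudaEvent_t dStart, dStop;
    cudaEventCreate(&dStart);
    cudaEventCreate(&dStop);

    gettimeofday(&hStart, NULL);
    cudaEventRecord(dStart, 0);

    for (int i = 0; i < 100; i++)
        dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);

    cudaEventRecord(dStop, 0);
    cudaEventSynchronize(dStop);    /* wait until the GPU work has finished  */
    gettimeofday(&hStop, NULL);     /* only now does the host timer see it   */

    double hostMs = ((hStop.tv_sec - hStart.tv_sec) +
                     (hStop.tv_usec - hStart.tv_usec) * 1.0e-6) * 1000.0;
    float devMs = 0.0f;
    cudaEventElapsedTime(&devMs, dStart, dStop);

    printf("gettimeofday: %.3f ms, cudaEvent: %.3f ms\n", hostMs, devMs);

    cudaEventDestroy(dStart);
    cudaEventDestroy(dStop);
    cudaFree(d_x);
    return 0;
}
With both timers bracketing the same synchronized region and the correct microsecond conversion, the two numbers should agree closely; large discrepancies usually mean the GPU work had not finished when the host timestamp was taken, or the units were converted incorrectly.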

Testing an algorithm's speed. How?

I'm currently testing different algorithms which determine whether an integer is a perfect square or not. During my research I found this question on SO:
Fastest way to determine if an integer's square root is an integer
I'm comparatively new to programming. When testing the different algorithms presented in that question, I found that this one
bool istQuadratSimple(int64 x)
{
    int32 tst = (int32)sqrt(x);
    return tst*tst == x;
}
actually runs faster than the one provided by A. Rex in the question I posted. I used an NSTimer object for this testing and printed my results with NSLog.
My question now is: how is speed testing done in a professional way? How can I achieve results comparable to the ones provided in the question I posted above?
The problem with calling just this function in a loop is that everything will be in the cache (both the data and the instructions). You wouldn't measure anything sensible; I wouldn't do that.
Given how small this function is, I would try to look at the generated assembly code of this function and the other one, and reason based on that (the number of instructions and the cost of the individual instructions, for example).
Unfortunately, this only works in trivial or near-trivial cases. For example, if the assembly code is identical then you know there is no difference and you don't need to measure anything. Or if one is like the other plus additional instructions, in which case you know the longer one takes longer to execute. And then there are the not-so-clear cases... :(
(See the update below.)
You can get the assembly with the -S flag from both clang and gcc (clang's -S -emit-llvm emits LLVM IR rather than native assembly).
Hope this helps.
UPDATE: Response to Prateek's question in the comment "is there any way to determine the speed of one particular algorithm?"
Yes, it is possible, but it gets horribly complicated really quickly. Long story short, ignoring the complexity of modern processors and simply accumulating some predefined cost per instruction can lead to very inaccurate results (the estimate can be off by a factor of 100, due to the cache and the pipeline, among other things). If you try to take the complexity of modern processors into account (the hierarchical cache, the pipeline, etc.), things get very difficult. See for example Worst Case Execution Time Prediction.
Unless you are in a clear situation (a trivial or near-trivial case), for example when the generated assembly codes are identical, or when one is like the other plus a few instructions, it is also hard to compare algorithms based on their generated assembly.
However, here a simple function of two lines is shown, and for that, looking at the assembly could help. Hence my answer.
I am not sure if there is any professional way of checking the speed (if there is, let me know as well). For the method that you pointed to in your question, I would probably do something like this in Java.
package Programs;

import java.math.BigDecimal;
import java.math.RoundingMode;

public class SquareRootInteger {
    public static boolean isPerfectSquare(long n) {
        if (n < 0)
            return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }

    public static void main(String[] args) {
        long iterator = 1;
        int precision = 10;
        long startTime = System.nanoTime(); // Getting system time before calling the isPerfectSquare method repeatedly
        while (iterator < 1000000000) {
            isPerfectSquare(iterator);
            iterator++;
        }
        long endTime = System.nanoTime(); // Getting system time after the 1000000000 executions of isPerfectSquare method
        long duration = endTime - startTime;
        BigDecimal dur = new BigDecimal(duration);
        BigDecimal iter = new BigDecimal(iterator);
        System.out.println("Speed "
                + dur.divide(iter, precision, RoundingMode.HALF_UP).toString()
                + " nano secs"); // Getting average time taken for one execution of the method.
    }
}
You can check your method in a similar fashion and see which one outperforms the other.
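For completeness, a comparable loop-timing sketch in C (hypothetical; it uses POSIX clock_gettime and standard int64_t/int32_t instead of the question's int64/int32 typedefs) could look like this. The volatile sink keeps the compiler from optimising the loop away, and the caching caveat from the other answer still applies.
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static bool istQuadratSimple(int64_t x)
{
    int32_t tst = (int32_t)sqrt((double)x);
    return (int64_t)tst * tst == x;
}

int main(void)
{
    const int64_t iterations = 100000000;    /* 1e8 calls */
    volatile int sink = 0;                   /* keeps the results "used" */
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int64_t i = 1; i <= iterations; i++)
        sink += istQuadratSimple(i);
    clock_gettime(CLOCK_MONOTONIC, &stop);

    double seconds = (stop.tv_sec - start.tv_sec) +
                     (stop.tv_nsec - start.tv_nsec) * 1.0e-9;
    printf("%.2f ns per call (sink=%d)\n",
           seconds * 1.0e9 / (double)iterations, sink);
    return 0;
}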
Record the time value before your massive calculation and the value after it. The difference is the execution time.
Write a shell script that runs the program, and run 'time ./xxx.sh' to get its running time.
