MPU Access Permissions NXP IMXRT1060 (ARM)

Can someone help me?
I'm having trouble configuring the MPU on an ARM MCU.
I want to configure two regions in RAM with restricted unprivileged permissions (RW for privileged code, RO for unprivileged), so that code in the 1st region can write within its own region but not into the 2nd region.
I tried this without success:
/* Region 6 setting: Memory with Normal type, not shareable, outer/inner write back */
MPU->RBAR = ARM_MPU_RBAR(6, 0x20000000U);
MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_URO, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_64KB);
/* Region 7 setting: Memory with Normal type, not shareable, outer/inner write back */
MPU->RBAR = ARM_MPU_RBAR(7, 0x20010000U);
MPU->RASR = ARM_MPU_RASR(0, ARM_MPU_AP_URO, 0, 0, 1, 1, 0, ARM_MPU_REGION_SIZE_64KB);
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
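For reference, here is the same pair of calls with the CMSIS-Core argument names spelled out (a sketch assuming the CMSIS 5 signature ARM_MPU_RASR(DisableExec, AccessPermission, TypeExtField, IsShareable, IsCacheable, IsBufferable, SubRegionDisable, Size); ARM_MPU_AP_URO means privileged read/write, unprivileged read-only):
/* Region 6: 64 KB at 0x20000000, privileged RW / unprivileged RO */
MPU->RBAR = ARM_MPU_RBAR(6U, 0x20000000U);
MPU->RASR = ARM_MPU_RASR(0U,              /* DisableExec: 0 = execution allowed */
                         ARM_MPU_AP_URO,  /* AccessPermission: priv RW, unpriv RO */
                         0U,              /* TypeExtField (TEX) */
                         0U,              /* IsShareable */
                         1U,              /* IsCacheable */
                         1U,              /* IsBufferable: C=1, B=1 -> write-back */
                         0U,              /* SubRegionDisable */
                         ARM_MPU_REGION_SIZE_64KB);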
Any idea?
Thanks.
Morgan

Related

Arduino Binary Array is too Large

I have a three-dimensional array of binary numbers, which I use as a dictionary and pass through an LED array. The dictionary covers 27 letters, and each letter covers 30x30 pixels (where each pixel is a 0 or a 1).
I was using the Intel Edison - and the code worked well - but I ditched the Edison after having trouble connecting it to my PC (despite replacing it once). I switched to the Arduino Uno, but am now receiving an error that the array is too large.
Right now I have the array set as boolean. Is there any way to reduce the memory demands of the array by storing it as bits instead? The array consists of just zeros and ones.
Here's a snip of the code:
boolean PHDict[27][30][30] = {
/* A */ {{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, /* this is one column of thirty, that show "A" as a letter */
You could write it as
#include <stdint.h>
//...
uint32_t PHdict[27][30] = {
{ 0x00004000, ... },
....
};
where each entry contains 30 bits packed into a 32-bit number.
The size is under 4 KB (27 × 30 × 4 = 3240 bytes).
You would need a bit of code to unpack the bits when reading the array, and a way to generate the packed values (i.e., a program which runs on your "host" computer and generates the initialized array for the source code), as sketched below.
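A minimal unpacking helper might look like this (a sketch; it assumes bit 0 of each packed word is the first pixel of the column):
// Returns pixel (row) of column (col) for the given letter.
bool get_pixel(uint8_t letter, uint8_t col, uint8_t row)
{
    return (PHdict[letter][col] >> row) & 1u;
}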
For the AVR processor, there's also a way to tell the compiler you want the array stored in program memory (flash) instead of data memory (RAM). If you leave it in data memory, the compiler has to put the initialization data in flash anyway and copy it over before the program starts, so it's a good idea to explicitly store it in program memory. See https://gcc.gnu.org/onlinedocs/gcc/AVR-Variable-Attributes.html#AVR-Variable-Attributes
In fact, depending on the amount of flash memory in the processor, moving the array to program memory may be enough to solve the problem by itself, without packing the bits.
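A sketch of the program-memory version (pgm_read_dword comes from avr/pgmspace.h; the array contents are placeholders):
#include <avr/pgmspace.h>
#include <stdint.h>

// Store the packed dictionary in flash instead of RAM.
const uint32_t PHdict[27][30] PROGMEM = {
    { 0x00004000, /* ... remaining columns ... */ },
    /* ... remaining letters ... */
};

// Entries must be read with an explicit program-memory access:
uint32_t column = pgm_read_dword(&PHdict[letter][col]);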

Read Kernel Memory from user mode WITHOUT driver

I'm writing a program which enumerates hooks created by SetWindowsHookEx(). Here is the process:
1. Use GetProcAddress() to obtain gSharedInfo exported by User32.dll (works, verified)
2. Read user-mode memory at gSharedInfo + 8; the result should be a pointer to the first handle entry (works, verified)
3. Read user-mode memory at [gSharedInfo] + 8; the result should be the count of handles to enumerate (works, verified)
4. Read data from the address obtained in step 2, repeated count times
5. Check whether HANDLEENTRY.bType is 5 (which means it's an HHOOK); if so, print its information (the HANDLEENTRY layout I'm assuming is sketched below)
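For reference, here is the HANDLEENTRY layout I am working from (this structure is undocumented; the declaration below follows commonly published reverse-engineered/ReactOS headers, so treat the exact fields as an assumption):
typedef struct _HANDLEENTRY
{
    PVOID phead;   /* pointer to the object header (PHEAD) in kernel space */
    PVOID pOwner;  /* owning thread/process information */
    BYTE  bType;   /* object type; 5 means TYPE_HOOK (an HHOOK) */
    BYTE  bFlags;
    WORD  wUniq;
} HANDLEENTRY, *PHANDLEENTRY;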
The problem is, although steps 1-3 only touch user-mode memory, step 4 requires the program to read kernel memory. After some research I found that ZwSystemDebugControl can be used to access kernel memory from user mode. So I wrote the following function:
BOOL GetKernelMemory(PVOID pKernelAddr, PBYTE pBuffer, ULONG uLength)
{
    MEMORY_CHUNKS mc;
    ULONG uReaded = 0;
    mc.Address = (UINT)pKernelAddr; // kernel memory address - input
    mc.pData = (UINT)pBuffer;       // user-mode memory address - output
    mc.Length = (UINT)uLength;      // length
    ULONG st = -1;
    ZWSYSTEMDEBUGCONTROL ZwSystemDebugControl = (ZWSYSTEMDEBUGCONTROL)GetProcAddress(
        GetModuleHandleA("ntdll.dll"), "NtSystemDebugControl");
    st = ZwSystemDebugControl(SysDbgCopyMemoryChunks_0, &mc, sizeof(MEMORY_CHUNKS), 0, 0, &uReaded);
    return st == 0;
}
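For reference, the MEMORY_CHUNKS layout assumed above (also undocumented; the field names mirror the ones used in the snippet, so treat this declaration as an assumption):
typedef struct _MEMORY_CHUNKS
{
    ULONG Address; /* kernel address to read from */
    ULONG pData;   /* user-mode buffer that receives the data */
    ULONG Length;  /* number of bytes to copy */
} MEMORY_CHUNKS, *PMEMORY_CHUNKS;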
But the function above didn't work. uReaded is always 0 and st is always 0xC0000002. How do I resolve this error?
my full program:
http://pastebin.com/xzYfGdC5
Microsoft stopped implementing the memory-access classes of the NtSystemDebugControl syscall after Windows XP; on later versions the call fails with STATUS_NOT_IMPLEMENTED (0xC0000002), which is exactly the status you are seeing.
The Meltdown vulnerability makes it possible to read kernel memory from user mode on most Intel CPUs, at a speed of approximately 500 kB/s. This works on most unpatched OSes.

Why is clEnqueueMapBuffer returning random data?

I am trying to learn about clEnqueueMapBuffer in OpenCL by writing a kernel which finds the square of values in an input buffer, but only returns two items at a time in the output buffer using clEnqueueMapBuffer. As I understand it, this function returns a pointer in the host's memory which points to the buffer memory on the device. clEnqueueUnmapMemObject must then unmap this buffer to allow the kernels to continue their computations. Now, when I call clEnqueueMapBuffer, it returns random data.
Here is my kernel
__kernel void testStream(
    __global int *input_vector,
    __global int *output_vector,
    __global int *mem_flag) // informs the host when the workload is finished
{
    mem_flag[0] = 1;
}
and my source
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <CL/opencl.h>
#include "utils.h"
int main(void)
{
    unsigned int n = 24;
    int BUFF_SIZE = 2;
    // Input and output vectors
    int num_bytes = sizeof(int) * n;
    int *output_buffer = (int *) malloc(num_bytes);
    int output_buffer_offset = 0;
    int *mapped_data = NULL;
    // use mapped_flag for determining if the job on the device is finished
    int *mapped_flag = NULL;
    int *host_in = (int *) malloc(num_bytes);
    int *host_out = (int *) malloc(num_bytes);
    // Declare cl variables
    cl_mem device_in;
    cl_mem device_out;
    cl_mem device_out_flag;
    // Declare cl boilerplate
    cl_platform_id platform = NULL;
    cl_device_id device = NULL;
    cl_command_queue queue = NULL;
    cl_context context = NULL;
    cl_program program = NULL;
    cl_kernel kernel = NULL;
    // Located in utils.c -- the source is irrelevant here
    char *kernel_source = read_kernel("kernels/test.cl");
    // Initialize host_in
    int i;
    for (i = 0; i < n; i++) {
        host_in[i] = i + 1;
    }
    // Set up opencl
    cl_int error;
    error = clGetPlatformIDs(1, &platform, NULL);
    printf("clGetPlatformIDs: %d\n", (int) error);
    error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    printf("clGetDeviceIDs: %d\n", (int) error);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
    printf("clCreateContext: %d\n", (int) error);
    queue = clCreateCommandQueue(context, device, 0, &error);
    printf("clCreateCommandQueue: %d\n", (int) error);
    program = clCreateProgramWithSource(context, 1,
        (const char **)&kernel_source, NULL, &error);
    printf("clCreateProgramWithSource: %d\n", (int) error);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "testStream", &error);
    printf("clCreateKernel: %d\n", (int) error);
    // Create the buffers
    device_in = clCreateBuffer(context, CL_MEM_READ_ONLY,
        num_bytes, NULL, NULL);
    device_out = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY,
        sizeof(int) * BUFF_SIZE, NULL, NULL);
    device_out_flag = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY,
        sizeof(int) * 2, NULL, NULL);
    // Write the input buffer
    clEnqueueWriteBuffer(
        queue, device_in, CL_FALSE, 0, num_bytes,
        host_in, 0, NULL, NULL);
    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_in);
    error = clSetKernelArg(kernel, 1, sizeof(cl_mem), &device_out);
    error = clSetKernelArg(kernel, 2, sizeof(cl_mem), &device_out_flag);
    // Execute the kernel over the entire range of data
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
        (const size_t *) &n, NULL, 0, NULL, NULL);
    // Map and unmap until the flag is set to true
    int break_flag = 0;
    while (1) {
        // Map the buffers
        mapped_data = (int *) clEnqueueMapBuffer(
            queue, device_out, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int) * BUFF_SIZE, 0, NULL, NULL, &error);
        mapped_flag = (int *) clEnqueueMapBuffer(
            queue, device_out_flag, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int), 0, NULL, NULL, &error);
        // Extract the data out of the buffer
        printf("mapped_flag[0] = %d\n", mapped_flag[0]);
        // Set the break_flag
        break_flag = mapped_flag[0];
        // Unmap the buffers
        error = clEnqueueUnmapMemObject(queue, device_out, mapped_data, 0,
            NULL, NULL);
        error = clEnqueueUnmapMemObject(queue, device_out_flag, mapped_flag,
            0, NULL, NULL);
        if (break_flag == 1) { break; }
        usleep(1000 * 1000);
    }
    return 0;
}
When I run the program, I get output similar to
clGetPlatformIDs: 0
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
clCreateProgramWithSource: 0
clCreateKernel: 0
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
Why is this happening?
Edit
I am running this code on an HP dm1z with Fedora 19 64-bit, kernel 3.13.7-100.fc19.x86_64. Here is the output from clinfo:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 AMD-APP (1214.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Board name: AMD Radeon HD 6310 Graphics
Device Topology: PCI[ B#0, D#1, F#0 ]
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 492Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 201326592
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x00007fd434852fc0
Name: Loveland
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 1214.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1214.3)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_image2d_from_buffer_read_only
Also, it may be worth noting that when I began playing with OpenCL, I ran a test program to calculate the inner product, but it gave weird results. Initially I thought it was an error in my program and forgot about it, but is it possible that the OpenCL implementation is faulty? If it helps, the OpenGL implementation has multiple errors, causing blocks of random data to show up on my desktop background, but that could also be a Linux problem.
You are passing NULL as the global work size to your clEnqueueNDRangeKernel call:
// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, NULL, NULL, 0, NULL, NULL);
If you were checking the error code returned by this call (which you always should), you would get the error code corresponding to CL_INVALID_GLOBAL_WORK_SIZE back. You always need to specify a global work size, so your call should look something like this:
// Execute the kernel over the entire range of data
size_t global[1] = {1};
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);
// check error == CL_SUCCESS!
Your calls to map and unmap buffers are fine; I've tested this code with the above fix and it works for me.
Your updated code to fix the above problem looks like this:
unsigned int n = 24;
...
// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
(const size_t *) &n, NULL, 0, NULL, NULL);
This is not a safe way of passing the global work size parameter to the kernel. As an example, the unsigned int n variable might occupy 32 bits, whereas a size_t could be 64 bits. This means that when you pass the address of n and cast to a const size_t*, the implementation will read a 64-bit value, which will encompass the 32 bits of n plus 32 other bits that have some arbitrary value. You should either assign n to a size_t variable before passing it to clEnqueueNDRangeKernel, or just change it to be a size_t itself.
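A minimal fix (sketch) is to widen the value first:
size_t global_size = n; /* widen unsigned int to size_t */
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
/* check error == CL_SUCCESS */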
This may or may not be related to the problems you are having. You could be accidentally launching a huge number of work-items for example, which might explain why the code appears to block on the CPU.
Here are a few remarks that come to mind:
I have the feeling that somehow you think calling clEnqueueMapBuffer will interrupt the execution of the kernel. I don't think this is correct: once a command is launched for execution, it runs until it is completed (or fails). Several commands can be launched concurrently, but trying to read some data while a kernel is still processing it results in undefined behavior. Besides, the way you create your command queue wouldn't let you run several commands at the same time; you need to use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when creating the queue to allow that.
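For example (a sketch against the OpenCL 1.2 clCreateCommandQueue API already used in the question):
cl_command_queue queue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, /* allow commands to run concurrently */
    &error);
/* Dependencies between commands must then be expressed explicitly through events. */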
I don't know with which global size you call your kernel after jprice's post, but unless you execute the kernel with only one work-item, you'll have trouble with this kind of statement: mem_flag[0] = 1; since all the work-items will write to the same location. (I'm guessing you posted only a portion of your kernel. Check whether you have other statements like that; actually, it would be useful if you posted the entire kernel code.)
Since you always map and unmap the same portion of the buffers and always check the first element (of mapped_flag), and since the kernel has already completed its computation at that moment (see the first point), it is at least normal that you always read the same value.
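Instead of polling in a loop, a simpler pattern (sketch) is to block on the kernel's completion event before mapping; here global is the properly-sized global work size from jprice's answer:
cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, &kernel_done);
clWaitForEvents(1, &kernel_done); /* returns once the kernel has finished */
/* Mapping device_out with CL_MAP_READ now sees the kernel's results. */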

Color Depth PIXELFORMATDESCRIPTOR

I'm wondering what values to change in a PIXELFORMATDESCRIPTOR object to change the color depth.
According to the OpenGL wiki, this is how you'd create a PIXELFORMATDESCRIPTOR object for an OpenGL context:
PIXELFORMATDESCRIPTOR pfd =
{
sizeof(PIXELFORMATDESCRIPTOR),
1,
PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER, //Flags
PFD_TYPE_RGBA, //The kind of framebuffer. RGBA or palette.
32, //Colordepth of the framebuffer.
0, 0, 0, 0, 0, 0,
0,
0,
0,
0, 0, 0, 0,
24, //Number of bits for the depthbuffer
8, //Number of bits for the stencilbuffer
0, //Number of Aux buffers in the framebuffer.
PFD_MAIN_PLANE,
0,
0, 0, 0
};
But it has several fields affecting the color depth.
Which ones do I need to change to adjust the color depth?
The first number (32 in your particular example) specifies the number of color bitplanes available to the framebuffer. The other numbers define the number of bitplanes to use for each component. It's perfectly possible to fit a 5-6-5 pixel format into a 32-bitplane framebuffer, which is a valid choice.
When you pass a PIXELFORMATDESCRIPTOR to ChoosePixelFormat, the values are taken as minimum values. However, the algorithm used by ChoosePixelFormat may not deliver an optimal result for your application. It can then be better to enumerate all available pixel formats and choose from them using a custom set of rules.
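Enumeration could look like this (a sketch; hdc is assumed to be a valid device context, and the selection rules are placeholders):
/* With a NULL descriptor, DescribePixelFormat returns the number of formats. */
int count = DescribePixelFormat(hdc, 1, sizeof(PIXELFORMATDESCRIPTOR), NULL);
for (int i = 1; i <= count; i++)
{
    PIXELFORMATDESCRIPTOR check;
    DescribePixelFormat(hdc, i, sizeof(check), &check);
    if ((check.dwFlags & PFD_SUPPORT_OPENGL) &&
        check.cColorBits == 32 && check.cDepthBits >= 24)
    {
        /* candidate: pixel format i matches the custom rules */
    }
}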

OpenCL Kernel Executes Slower Than Single Thread

All, I wrote a very simple OpenCL kernel which transforms an RGB image to gray scale using simple averaging.
Some background:
The image is stored in mapped memory, as a 24 bit, non padded memory block
The output array is stored in pinned memory (mapped with clEnqueueMapBuffer) and is 8 bpp
There are two buffers allocated on the device (clCreateBuffer), one is specifically read (which we clWriteBuffer into before the kernel starts) and the other is specifically write (which we clReadBuffer after the kernel finishes)
I am running this on a 1280x960 image. A serial version of the algorithm averages 60 ms; the OpenCL kernel averages 200 ms! I'm doing something wrong, but I have no idea how to proceed or what to optimize. (Timing my reads/writes without a kernel call, the algorithm runs in 15 ms.)
I am attaching the kernel setup (sizes and arguments) as well as the kernel
EDIT: So I wrote an even dumber kernel that performs no global memory accesses inside it, and it still took 150 ms... This is still ridiculously slow. I thought maybe I was messing up global memory reads, that they have to be 4-byte aligned or something? Nope...
Edit 2: Removing all the parameters from my kernel gave me a significant speedup... I'm confused: I thought that since I use clEnqueueWriteBuffer, the kernel should be doing no memory transfer from host to device or device to host...
Edit 3: Figured it out, but I still don't understand why. If anyone could explain it, I would be glad to award them the correct answer. The problem was passing the custom structs by value. It looks like I'll need to allocate a global memory location for them and pass their cl_mems.
Kernel Call:
//Copy input to device
result = clEnqueueWriteBuffer(handles->queue, d_input_data, CL_TRUE, 0, h_input.widthStep*h_input.height, (void *)input->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to input buffer on device!")) return 0;
//Set kernel arguments
result = clSetKernelArg(handles->current_kernel, 0, sizeof(OpenCLImage), (void *)&h_input);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 1, sizeof(cl_mem), (void *)&d_input_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input data.")) return 0;
result = clSetKernelArg(handles->current_kernel, 2, sizeof(OpenCLImage), (void *)&h_output);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 3, sizeof(cl_mem), (void *)&d_output_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output data.")) return 0;
//Determine run parameters
global_work_size[0] = input->width;//(unsigned int)((input->width / (float)local_work_size[0]) + 0.5);
global_work_size[1] = input->height;//(unsigned int)((input->height/ (float)local_work_size[1]) + 0.5);
printf("Global Work Group Size: %d %d\n", global_work_size[0], global_work_size[1]);
//Call kernel
result = clEnqueueNDRangeKernel(handles->queue, handles->current_kernel, 2, 0, global_work_size, local_work_size, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to run kernel!")) return 0;
result = clFinish(handles->queue);
if(check_result(result, "opencl_rgb_to_gray", "Failed to finish!")) return 0;
//Copy output
result = clEnqueueReadBuffer(handles->queue, d_output_data, CL_TRUE, 0, h_output.widthStep*h_output.height, (void *)output->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to output buffer on device!")) return 0;
Kernel:
typedef struct OpenCLImage_t
{
    int width;
    int widthStep;
    int height;
    int channels;
} OpenCLImage;

__kernel void opencl_rgb_kernel(OpenCLImage input, __global unsigned char *input_data, OpenCLImage output, __global unsigned char *output_data)
{
    int pixel_x = get_global_id(0);
    int pixel_y = get_global_id(1);
    unsigned char *cur_in_pixel, *cur_out_pixel;
    float avg = 0;
    cur_in_pixel = (unsigned char *)(input_data + pixel_y * input.widthStep + pixel_x * input.channels);
    cur_out_pixel = (unsigned char *)(output_data + pixel_y * output.widthStep + pixel_x * output.channels);
    avg += cur_in_pixel[0];
    avg += cur_in_pixel[1];
    avg += cur_in_pixel[2];
    avg /= 3.0f;
    if (avg > 255.0)
        avg = 255.0;
    else if (avg < 0)
        avg = 0;
    *cur_out_pixel = avg;
}
The overhead of copying the by-value struct to every thread that gets created might be the reason for the extra time, whereas with global memory a single reference is enough. Only the SDK implementer would be able to answer exactly. :)
You may want to try a local_work_size like [64, 1, 1] in order to coalesce your memory accesses (note that 64 is a divisor of 1280).
As previously said, you should use a profiler to get more information. Are you using an NVIDIA card? Then download CUDA 4 (not 5), as it contains an OpenCL profiler.
Your performance must be far from optimal. Change the local work size and the global work size, and try to process two or four pixels per thread. Can you change the way pixels are stored prior to your processing? If so, split your struct into three arrays in order to coalesce memory accesses more effectively.
You can hide your memory transfers behind GPU work; this is easier to do with a profiler at hand.
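Regarding Edit 3, a sketch of passing the structs through a buffer instead of by value (handles->context is an assumption about the handles struct; the other names mirror the question's code):
/* Host side: wrap the struct in a small read-only buffer */
cl_mem d_input_info = clCreateBuffer(handles->context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(OpenCLImage), &h_input, &result);
result = clSetKernelArg(handles->current_kernel, 0, sizeof(cl_mem), &d_input_info);

/* Kernel side: receive a pointer instead of a by-value struct */
/* __kernel void opencl_rgb_kernel(__constant OpenCLImage *input, ...) */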
