I recently wrote a little curses game, and since all it needs to work is some timer mechanism and a curses implementation, the idea of building it for DOS came kind of naturally. Curses is provided by PDCurses on DOS.
Timing is already different between POSIX and Win32, so I have defined this interface:
#ifndef CSNAKE_TICKER_H
#define CSNAKE_TICKER_H
void ticker_init(void);
void ticker_done(void);
void ticker_start(int msec);
void ticker_stop(void);
void ticker_wait(void);
#endif
The game calls ticker_init() and ticker_done() once, ticker_start() with a millisecond interval as soon as it needs ticks and ticker_wait() in its main loop to wait for the next tick.
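For reference, the POSIX implementation looks roughly like the following. This is only a sketch from memory, assuming a SIGALRM-based setup around setitimer() and sigwait(); the full source is linked below.

/* sketch: SIGALRM-based ticker, assumed close to the original POSIX version */
#include <signal.h>
#include <sys/time.h>

static sigset_t alarmset;

void ticker_init(void)
{
    /* block SIGALRM so it can be consumed synchronously via sigwait() */
    sigemptyset(&alarmset);
    sigaddset(&alarmset, SIGALRM);
    sigprocmask(SIG_BLOCK, &alarmset, 0);
}

void ticker_done(void)
{
}

void ticker_start(int msec)
{
    /* periodic timer delivering SIGALRM every msec milliseconds */
    struct itimerval tv;
    tv.it_interval.tv_sec = msec / 1000;
    tv.it_interval.tv_usec = (msec % 1000) * 1000;
    tv.it_value = tv.it_interval;
    setitimer(ITIMER_REAL, &tv, 0);
}

void ticker_stop(void)
{
    struct itimerval tv = {0};
    setitimer(ITIMER_REAL, &tv, 0); /* disarm the timer */
}

void ticker_wait(void)
{
    int sig;
    sigwait(&alarmset, &sig); /* sleeps until the next SIGALRM */
}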
Using the same implementation on DOS as on POSIX platforms, based on setitimer(), didn't work. One reason was that the C library coming with djgpp doesn't implement sigwait(). So I created a new implementation of my interface for DOS:
#undef __STRICT_ANSI__
#include <time.h>
uclock_t tick;
uclock_t nextTick;
uclock_t tickTime;
void
ticker_init(void)
{
}
void
ticker_done(void)
{
}
void
ticker_start(int msec)
{
tickTime = msec * UCLOCKS_PER_SEC / 1000;
tick = uclock();
nextTick = tick + tickTime;
}
void
ticker_stop(void)
{
}
void
ticker_wait(void)
{
while ((tick = uclock()) < nextTick);
nextTick = tick + tickTime;
}
This works like a charm in DOSBox (I don't have a real DOS system right now). But my concern is: is busy-waiting really the best I can do on this platform? I'd like a solution that allows the CPU to at least save some energy.
For reference, here's the whole source.
Ok, I think I can finally answer my own question (thanks Wyzard for the helpful comment!)
The obvious solution, as there doesn't seem to be any library call doing this, is putting a hlt in inline assembly. Unfortunately, this crashed my program. Looking for the reason, it turns out the default DPMI server used runs the program in ring 3, and hlt is reserved for ring 0. So to use it, you have to modify the loader stub to load a DPMI server that runs your program in ring 0; see below.
Browsing through the docs, I came across __dpmi_yield(). If we are running in a multitasking environment (Win 3.x or 9x, ...), there will already be a DPMI server provided by the operating system, and of course, in that case we want to give up our time slice while waiting instead of trying the privileged hlt.
So, putting it all together, the source for DOS now looks like this:
#undef __STRICT_ANSI__
#include <time.h>
#include <dpmi.h>
#include <errno.h>
static uclock_t nextTick;
static uclock_t tickTime;
static int haveYield;
void
ticker_init(void)
{
errno = 0;
__dpmi_yield();
haveYield = errno ? 0 : 1;
}
void
ticker_done(void)
{
}
void
ticker_start(int msec)
{
tickTime = msec * UCLOCKS_PER_SEC / 1000;
nextTick = uclock() + tickTime;
}
void
ticker_stop(void)
{
}
void
ticker_wait(void)
{
if (haveYield)
{
while (uclock() < nextTick) __dpmi_yield();
}
else
{
while (uclock() < nextTick) __asm__ volatile ("hlt");
}
nextTick += tickTime;
}
In order for this to work on plain DOS, the loader stub in the compiled executable must be modified like this:
<path to>/stubedit bin/csnake.exe dpmi=CWSDPR0.EXE
CWSDPR0.EXE is a DPMI server that runs all code in ring 0.
Still to be tested: whether yielding messes with the timing when running under Win 3.x / 9x. Maybe the time slices are too long; I will have to check that. Update: it works great in Windows 95 with the code above.
The usage of the hlt instruction breaks compatibility with DOSBox 0.74 in a weird way: the program seems to hang forever when trying to do a blocking getch() through PDCurses. This doesn't happen, however, on a real MS-DOS 6.22 in VirtualBox. Update: this is a bug in DOSBox 0.74 that is fixed in the current SVN tree.
Given those findings, I assume this is the best way to wait "nicely" in a DOS program.
Update: It's possible to do even better by checking all available methods and picking the best one. I found a DOS idle call that should be considered as well. The strategy:
If yield is supported, use this (we are running in a multitasking environment)
If idle is supported, use this. Optionally, if we're in ring-0, do a hlt each time before calling idle, because idle is documented to return immediately when no other program is ready to run.
Otherwise, in ring-0 just use plain hlt instructions.
Busy-waiting as a last resort.
Here's a little example program (DJGPP) that tests for all possibilities:
#include <stdio.h>
#include <dpmi.h>
#include <errno.h>
static unsigned int ring;
static int
haveDosidle(void)
{
__dpmi_regs regs;
regs.x.ax = 0x1680;
__dpmi_int(0x28, &regs);
return regs.h.al ? 0 : 1;
}
int main()
{
puts("checking idle methods:");
fputs("yield (int 0x2f 0x1680): ", stdout);
errno = 0;
__dpmi_yield();
if (errno)
{
puts("not supported.");
}
else
{
puts("supported.");
}
fputs("idle (int 0x28 0x1680): ", stdout);
if (!haveDosidle())
{
puts("not supported.");
}
else
{
puts("supported.");
}
fputs("ring-0 HLT instruction: ", stdout);
__asm__ ("mov %%cs, %0\n\t"
"and $3, %0" : "=r" (ring));
if (ring)
{
printf("not supported. (running in ring-%u)\n", ring);
}
else
{
puts("supported. (running in ring-0)");
}
}
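Putting the strategy together, ticker_wait() could then look roughly like this. This is only a sketch: haveYield is assumed to be set as in the earlier ticker_init(), haveIdle is assumed to hold the result of haveDosidle(), and ring is assumed to be probed once at startup as in the test program above.

/* haveYield, haveIdle, ring: assumed to be probed once at startup,
   e.g. via the detection code shown above */
static void
idle_once(void)
{
    __dpmi_regs regs;

    if (haveYield)
    {
        __dpmi_yield();                      /* multitasker: give up our time slice */
        return;
    }
    if (haveIdle)
    {
        if (!ring) __asm__ volatile ("hlt"); /* ring-0: halt until the next interrupt */
        regs.x.ax = 0x1680;
        __dpmi_int(0x28, &regs);             /* DOS idle call; may return immediately
                                                when no other program is ready to run */
        return;
    }
    if (!ring) __asm__ volatile ("hlt");     /* plain ring-0 halt */
    /* otherwise: busy-waiting as a last resort, just poll again */
}

void
ticker_wait(void)
{
    while (uclock() < nextTick) idle_once();
    nextTick += tickTime;
}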
The code in my github repo reflects the changes.
It's a kind of training task, because nowadays these methods (I guess) don't work anymore.
Win XP and the MinGW compiler are used, with no special compiler options (just gcc with a single source file).
First of all, saving the address used to exit from the program, and replacing it with a jump to some Hook function:
// Our system uses 4 bytes for addresses.
typedef unsigned long int DWORD;
// Holds the saved exit address of the program.
DWORD addr_ret;
// Forward declaration of the hook (defined below).
void Hook();
// An entry point.
int main()
{
// Used to access the stack around this frame directly.
DWORD m[1];
// Save the program's exit address (deliberately reading past the end of m).
addr_ret = (DWORD) m[4];
// Replace the exit address with a jump to our Hook function.
m[4] = (DWORD) Hook;
// Status code of the program's execution.
return 0;
}
The goal of this code is to get access at the system's privilege level: when we return (should return) to the system, we instead redirect our program to one of our own functions. The code of this function:
// Declaration of the label (jump target) used below.
jmp_buf label;
void Hook()
{
printf ("Test\n");
// Try to restore the stack by jumping directly into the FixStack function,
// without a proper call setting up its stack frame (we'll see it later).
longjmp(label, 1);
// Make sure we never continue past the jump above; otherwise we would be
// stuck in this infinite loop.
while(1) {}
}
}
And finally, here is a function which (in my opinion) should fix the stack pointer, the ESP register:
void FixStack()
{
// The target of the longjmp above.
setjmp(label);
// Replace this function's return address with the saved exit address,
// so that returning from here exits the whole program.
DWORD m[1];
m[2] = addr_ret;
}
Of course, the program needs these includes:
#include <stdio.h>
#include <setjmp.h>
The whole logic of the program works correctly on my system, but I cannot restore my stack (ESP), so the program returns an incorrect return code.
Before the solution described above, I didn't use jumps or the FixStack function; these lines were in the Hook function instead of the jump and the while loop:
DWORD m[1];
m[2] = addr_ret;
But with this variant I was getting an incorrect value in the ESP register just before the exit from the program (it was 8 bytes bigger than the register's value on entry to the program). So I decided to compensate for these 8 bytes somehow (avoiding any asm code inside the C program). That is the purpose of the jump into the FixStack function and of its particular way of returning (to pop some values off the stack). But, as stated, it doesn't return a correct exit status, checked with this command:
echo %ErrorLevel%
So my question is rather broad: from recommendations on debugging utilities (I have only used OllyDbg) to possible solutions for the described Hook implementation.
OK, I finally got my program to work as intended. The compiled program (I use MinGW on Win XP) now runs without any errors and with the correct return code.
Maybe it will be helpful for someone:
#include <stdio.h>
#include <setjmp.h>
typedef unsigned long int DWORD;
DWORD addr_ret;
int FixStack()
{
DWORD m[1];
m[2] = addr_ret;
// This return is strictly necessary for correct operation!
return 0;
}
void Hook()
{
printf("Test\n");
FixStack();
}
int main()
{
DWORD m[1];
addr_ret = (DWORD) m[4];
m[4] = (DWORD) Hook;
}
Of course, it seems you've realized that this will only work in a very specific build environment. It most definitely won't work on a 64-bit target (because the addresses aren't DWORD-ish).
Is there any reason why you don't want to use the facilities provided by the C standard library to do exactly this? (Or something very similar to this.)
#include <stdio.h>
#include <stdlib.h>
void Hook()
{
printf("Test\n");
}
int main()
{
atexit(Hook);
}
I'm having some fun with context switching. I've copied the example code from http://pubs.opengroup.org/onlinepubs/009695399/functions/makecontext.html into a file, and I defined the macro _XOPEN_SOURCE for OS X.
#define _XOPEN_SOURCE
#include <stdio.h>
#include <ucontext.h>
static ucontext_t ctx[3];
static void
f1 (void)
{
puts("start f1");
swapcontext(&ctx[1], &ctx[2]);
puts("finish f1");
}
static void
f2 (void)
{
puts("start f2");
swapcontext(&ctx[2], &ctx[1]);
puts("finish f2");
}
int
main (void)
{
char st1[8192];
char st2[8192];
getcontext(&ctx[1]);
ctx[1].uc_stack.ss_sp = st1;
ctx[1].uc_stack.ss_size = sizeof st1;
ctx[1].uc_link = &ctx[0];
makecontext(&ctx[1], f1, 0);
getcontext(&ctx[2]);
ctx[2].uc_stack.ss_sp = st2;
ctx[2].uc_stack.ss_size = sizeof st2;
ctx[2].uc_link = &ctx[1];
makecontext(&ctx[2], f2, 0);
swapcontext(&ctx[0], &ctx[2]);
return 0;
}
I build it with
gcc -o context context.c -g
which whinges at me about getcontext, makecontext, and swapcontext being deprecated. Meh.
When I run it it just hangs. It doesn't seem to crash. It just hangs.
I tried using gdb, but once I step into swapcontext, it just goes blank. It doesn't jump into f1. I keep hitting enter, and it just moves the cursor to a new line on the console.
Any idea what's happening? Something to do with running on the Mac, or with the deprecated functions?
Thanks
It looks like your code is just copy/pasted from the ucontext documentation, which must make it frustrating that it's not working...
As far as I can tell, your stacks are just too small. I couldn't get it to work with any less than 32KiB for your stacks.
Try making these changes:
#define STACK_SIZE (1<<15) // 32KiB
// . . .
char st1[STACK_SIZE];
char st2[STACK_SIZE];
Yup, fixed it. Why did it fix it, though?
Well, let's dig into the problem a bit more. First, let's find out what's actually going on.
When I run it it just hangs. It doesn't seem to crash. It just hangs.
If you use some debugger-fu (be sure to use lldb; gdb just doesn't work right on OS X), then you will find that when the app is "hanging", it's actually spinning in a weird loop in your main function, illustrated by the arrow in the comments below.
int
main (void)
{
char st1[8192];
char st2[8192];
getcontext(&ctx[1]);
ctx[1].uc_stack.ss_sp = st1;
ctx[1].uc_stack.ss_size = sizeof st1;
ctx[1].uc_link = &ctx[0];
makecontext(&ctx[1], f1, 0);
getcontext(&ctx[2]);// <---------------------+ back to here
ctx[2].uc_stack.ss_sp = st2;// |
ctx[2].uc_stack.ss_size = sizeof st2;// |
ctx[2].uc_link = &ctx[1];// |
makecontext(&ctx[2], f2, 0); // |
// |
puts("about to swap...");// |
// |
swapcontext(&ctx[0], &ctx[2]);// ------------+ jumps from here
return 0;
}
Note that I added an extra puts call above in the middle of the loop. If you add that line and compile/run again, then instead of the program just hanging you'll see it start spewing out the string "about to swap..." ad infinitum.
Obviously something screwy is going on based on the given stack size, so let's just look for everywhere that ss_size is referenced...
(Note: The authoritative source code for the Apple ucontext implementation is at https://opensource.apple.com/source/, but there's a GitHub mirror that I'll use since it's nicer for searching and linking.)
If we take a look at makecontext.c, we see something like this:
if (ucp->uc_stack.ss_size < MINSIGSTKSZ) {
// fail without an error code since makecontext is a void function
return;
}
Well, that's nice! What is MINSIGSTKSZ? Well, let's take a look in signal.h:
#define MINSIGSTKSZ 32768 /* (32K)minimum allowable stack */
#define SIGSTKSZ 131072 /* (128K)recommended stack size */
Apparently these values are actually part of the POSIX standard. Although I don't see anything in the ucontext documentation that references these values, I guess it's kind of implied since ucontext preserves the current signal mask.
Anyway, this explains the screwy behavior we're seeing. Since the makecontext call is failing due to the stack size being too small, the call to getcontext(&ctx[2]) is what is setting up the contents of ctx[2], so the call to swapcontext(&ctx[0], &ctx[2]) just ends up swapping back to that line again, creating the infinite loop...
Interestingly, MINSIGSTKSZ is 32768 bytes on OS X, but only 2048 bytes on my Linux box, which explains why it worked on Linux but not on OS X.
Based on all of that, it looks like a safer option is to use the recommended stack size from sys/signal.h:
char st1[SIGSTKSZ];
char st2[SIGSTKSZ];
That, or switch to something that isn't deprecated. You might take a look at Boost.Context if you're not averse to C++.
I want to measure the performance of different devices, viz. CPUs and GPUs.
This is my kernel code:
__kernel void dataParallel(__global int* A)
{
sleep(10);
A[0]=2;
A[1]=3;
A[2]=5;
int pnp;//pnp=probable next prime
int pprime;//previous prime
int i,j;
for(i=3;i<10;i++)
{
j=0;
pprime=A[i-1];
pnp=pprime+2;
while((j<i) && A[j]<=sqrt((float)pnp))
{
if(pnp%A[j]==0)
{
pnp+=2;
j=0;
}
j++;
}
A[i]=pnp;
}
}
However, the sleep() function doesn't work. I am getting the following error in the build log:
<kernel>:4:2: warning: implicit declaration of function 'sleep' is invalid in C99
sleep(10);
builtins: link error: Linking globals named '__gpu_suld_1d_i8_trap': symbol multiply defined!
Is there any other way to implement the function? Also, is there a way to record the time taken to execute this code snippet?
P.S. I have included #include <unistd.h> in my host code.
You don't need to use sleep in your kernel to measure the execution time.
There are two ways to measure it:
1. Use OpenCL's inherent profiling (event timestamps via clGetEventProfilingInfo; see the cl API docs, and the sketch at the end of this answer).
2. Get timestamps in your host code and compare them before and after execution.
Example of the second approach:
double start = getTimeInMS();
//The kernel starts here
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &tasksize, &local_size_in, 0, NULL, NULL);
//wait for kernel execution
clFinish(command_queue);
cout << "kernel execution time " << (getTimeInMS() - start) << endl;
where getTimeInMS() is a function that returns milliseconds as a double (Windows-specific; substitute another implementation if you don't use Windows):
static inline double getTimeInMS() {
SYSTEMTIME st;
GetLocalTime(&st);
/* only uses the seconds-of-minute and milliseconds fields, so this wraps
   every minute; good enough for short measurements only */
return (double)st.wSecond * 1000.0 + (double)st.wMilliseconds;
}
Also you want to:
#include <windows.h>
For Mac it would be (could work on Linux as well, not sure):
#include <sys/time.h>
static inline double getTimeInMS() {
struct timeval starttime;
gettimeofday(&starttime, 0x0);
return (double)starttime.tv_sec * 1000.0 + (double)starttime.tv_usec / 1000.0;
}
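For completeness, here is roughly what method 1 (OpenCL's built-in profiling) looks like. This is a sketch: it assumes context, device, kernel, and tasksize are already set up elsewhere, and note that the queue must be created with CL_QUEUE_PROFILING_ENABLE.

#include <CL/cl.h>
#include <stdio.h>

/* sketch: assumes context/device/kernel/tasksize are set up elsewhere */
void profileKernel(cl_context context, cl_device_id device,
                   cl_kernel kernel, size_t tasksize)
{
    /* profiling only works on queues created with CL_QUEUE_PROFILING_ENABLE */
    cl_command_queue queue =
        clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_event ev;
    cl_ulong t_start, t_end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &tasksize, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    /* the timestamps are in nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t_start, &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t_end, &t_end, NULL);
    printf("kernel execution time: %f ms\n", (double)(t_end - t_start) / 1e6);

    clReleaseEvent(ev);
    clReleaseCommandQueue(queue);
}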
I'm writing code for an embedded device (no OS so no system calls or anything) and I need to have a delay but the compiler doesn't supply time.h. What other options do I have?
Depending on the clock of your system, you may implement delays using the NOP (no operation) assembler instruction. You can calculate the time one NOP takes from the MIPS rating of your system; for example, if one NOP takes 1 µs, you could implement something like:
void delay(int ms)
{
int i;
for (i = 0; i < ms * 1000; i++)
{
asm volatile ("nop"); /* one NOP, ~1 us in this example */
}
}
Depends on the device. Can you enable a stable timer interrupt? You might only be able to busy-wait and wait for a timer interrupt. How accurate this is likely to be (and how accurate it needs to be) is unclear.
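If a timer interrupt is available, a common pattern is a tick counter incremented by the ISR. This is only a sketch: how timer_isr() gets attached to the interrupt vector, and the 1 ms tick period, are hardware-specific assumptions here.

/* Incremented by a hardware timer ISR, assumed here to fire every 1 ms;
   attaching timer_isr() to the interrupt vector is hardware-specific. */
static volatile unsigned long ticks;

void timer_isr(void)
{
    ticks++;
}

void delay_ms(unsigned long ms)
{
    unsigned long start = ticks;
    /* the subtraction is wrap-around safe with unsigned arithmetic */
    while (ticks - start < ms)
    {
        /* optionally enter the CPU's low-power wait-for-interrupt state here */
    }
}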
For short fixed-time delays, a do-nothing loop will meet the need, but it will of course need calibration.
void Delay_ms(unsigned d /* ms */) {
while (d-- > 0) {
volatile unsigned i; /* volatile keeps the busy loop from being optimized away */
i = 2800; // Calibrate this value
// Recommend checking the generated asm of the following loop for tightness.
while (--i);
/* add multiple _nop_() here should you want precise calibration */
}
}
I would like to extract a rather limited set of information about NVIDIA GPUs without linking against the CUDA libraries. The only information needed is the compute capability and name of the GPU; more than this could be useful, but is not required. The code should be written in C (or C++). The information would be used at configure time (when the CUDA toolkit is not available) and at run time (when the executed binary is not compiled with CUDA support) to suggest to the user that a supported GPU is present in the system.
As far as I understand, this is possible through the driver API, but I am not very familiar with the technical details of what this would require. So my questions are:
What are the exact steps to fulfill at least the minimum requirement (see above);
Is there such open-source code available?
Note that my first step would be to have some code for Linux, but ultimately I'd need platform-independent code. Considering the platform availability of CUDA, a complete solution would involve code for x86/AMD64 on Linux, Mac OS, and Windows (at least for now; the list could soon be extended with ARM).
Edit
What I meant by "it's possible through the driver API" is that one should be able to load libcuda.so dynamically and query the device properties through the driver API. I'm not sure about the details, though.
Unfortunately NVML doesn't provide information about device compute capability.
What you need to do is:
Load CUDA library manually (application is not linked against libcuda)
If the library doesn't exist then CUDA driver is not installed
Find pointers to necessary functions in the library
Use driver API to query information about available GPUs
I hope this code will be helpful. I've tested it under Linux but with minor modifications it should also compile under Windows.
#include <cuda.h>
#include <stdio.h>
#ifdef WINDOWS
#include <Windows.h>
#else
#include <dlfcn.h>
#endif
void * loadCudaLibrary() {
#ifdef WINDOWS
return LoadLibraryA("nvcuda.dll");
#else
return dlopen ("libcuda.so", RTLD_NOW);
#endif
}
void (*getProcAddress(void * lib, const char *name))(void){
#ifdef WINDOWS
return (void (*)(void)) GetProcAddress(lib, name);
#else
return (void (*)(void)) dlsym(lib,(const char *)name);
#endif
}
int freeLibrary(void *lib)
{
#ifdef WINDOWS
return FreeLibrary((HMODULE)lib);
#else
return dlclose(lib);
#endif
}
typedef CUresult CUDAAPI (*cuInit_pt)(unsigned int Flags);
typedef CUresult CUDAAPI (*cuDeviceGetCount_pt)(int *count);
typedef CUresult CUDAAPI (*cuDeviceComputeCapability_pt)(int *major, int *minor, CUdevice dev);
int main() {
void * cuLib;
cuInit_pt my_cuInit = NULL;
cuDeviceGetCount_pt my_cuDeviceGetCount = NULL;
cuDeviceComputeCapability_pt my_cuDeviceComputeCapability = NULL;
if ((cuLib = loadCudaLibrary()) == NULL)
return 1; // cuda library is not present in the system
if ((my_cuInit = (cuInit_pt) getProcAddress(cuLib, "cuInit")) == NULL)
return 1; // sth is wrong with the library
if ((my_cuDeviceGetCount = (cuDeviceGetCount_pt) getProcAddress(cuLib, "cuDeviceGetCount")) == NULL)
return 1; // sth is wrong with the library
if ((my_cuDeviceComputeCapability = (cuDeviceComputeCapability_pt) getProcAddress(cuLib, "cuDeviceComputeCapability")) == NULL)
return 1; // sth is wrong with the library
{
int count, i;
if (CUDA_SUCCESS != my_cuInit(0))
return 1; // failed to initialize
if (CUDA_SUCCESS != my_cuDeviceGetCount(&count))
return 1; // failed
for (i = 0; i < count; i++)
{
int major, minor;
if (CUDA_SUCCESS != my_cuDeviceComputeCapability(&major, &minor, i))
return 1; // failed
printf("dev %d CUDA compute capability major %d minor %d\n", i, major, minor);
}
}
freeLibrary(cuLib);
return 0;
}
Test on Linux:
$ gcc -ldl main.c
$ ./a.out
dev 0 CUDA compute capability major 2 minor 0
dev 1 CUDA compute capability major 2 minor 0
Test on Linux with no CUDA driver:
$ ./a.out
$ echo $?
1
Cheers
Surely these people know the answer:
http://www.ozone3d.net/gpu_caps_viewer
But as far as I know, it can only be done with an installation of CUDA or OpenCL.
I think one way could be using OpenGL directly; maybe that is what you were talking about with the driver API. But I can only give you this example (CUDA required):
http://www.naic.edu/~phil/hardware/nvidia/doc/src/deviceQuery/deviceQuery.cpp
First, I think NVIDIA NVML is the API you are looking for. Second, there is an open-source project based on NVML called PAPI NVML.
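For the device-name part, a minimal NVML sketch might look like this (assuming nvml.h is available and the program is linked against libnvidia-ml; as noted above, older NVML versions do not expose the compute capability):

#include <stdio.h>
#include <nvml.h>

/* sketch: requires nvml.h and linking with -lnvidia-ml */
int main()
{
    unsigned int count, i;
    nvmlDevice_t dev;
    char name[NVML_DEVICE_NAME_BUFFER_SIZE];

    if (nvmlInit() != NVML_SUCCESS)
        return 1; /* no NVIDIA driver / NVML not available */
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
        count = 0;
    for (i = 0; i < count; i++)
    {
        if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
            nvmlDeviceGetName(dev, name, sizeof name) == NVML_SUCCESS)
            printf("dev %u: %s\n", i, name);
    }
    nvmlShutdown();
    return 0;
}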
I solved this problem by using and linking statically against the CUDA 6.0 SDK. This produces an application that also works well on machines without NVIDIA cards, or on machines where the SDK is not installed; in that case it will indicate that there are zero CUDA-capable devices.
There is an example called deviceQuery among the samples included with the CUDA SDK; I used snippets from it to write the following code, which decides whether CUDA-capable devices are present and, if so, which one has the highest compute capability:
#include <cuda_runtime.h>
struct GpuCap
{
bool QueryFailed; // True on error
int DeviceCount; // Number of CUDA devices found
int StrongestDeviceId; // ID of best CUDA device
int ComputeCapabilityMajor; // Major compute capability (of best device)
int ComputeCapabilityMinor; // Minor compute capability
};
GpuCap GetCapabilities()
{
GpuCap gpu;
gpu.QueryFailed = false;
gpu.StrongestDeviceId = -1;
gpu.ComputeCapabilityMajor = -1;
gpu.ComputeCapabilityMinor = -1;
cudaError_t error_id = cudaGetDeviceCount(&gpu.DeviceCount);
if (error_id != cudaSuccess)
{
gpu.QueryFailed = true;
gpu.DeviceCount = 0;
return gpu;
}
if (gpu.DeviceCount == 0)
return gpu; // there are no available devices that support CUDA
// Find best device
for (int dev = 0; dev < gpu.DeviceCount; ++dev)
{
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// Keep the device with the highest (major, minor) compute capability.
if (deviceProp.major > gpu.ComputeCapabilityMajor ||
(deviceProp.major == gpu.ComputeCapabilityMajor &&
deviceProp.minor > gpu.ComputeCapabilityMinor))
{
gpu.StrongestDeviceId = dev;
gpu.ComputeCapabilityMajor = deviceProp.major;
gpu.ComputeCapabilityMinor = deviceProp.minor;
}
}
return gpu;
}
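A hypothetical caller could then do something like:

#include <stdio.h>

// hypothetical caller of GetCapabilities() above
int main()
{
    GpuCap gpu = GetCapabilities();
    if (gpu.QueryFailed || gpu.DeviceCount == 0)
        printf("No CUDA-capable device found.\n");
    else
        printf("Best device: %d (compute capability %d.%d)\n",
               gpu.StrongestDeviceId,
               gpu.ComputeCapabilityMajor, gpu.ComputeCapabilityMinor);
    return 0;
}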