I want to insert some time measurement into my code. On x64 I use __rdtscp. Is there something similar for the mac m1/m2? Specifically something that isn't a system call and high resolution.
Just use clock_gettime(CLOCK_MONOTONIC,...)
It is a VDSO function. That means that the kernel injects code into the userspace program that "does the right thing", so the userspace program can access the time stamp counter without doing a syscall.
On x86, it will [usually] invoke rdtsc [or read an HPET], and adjust the counter value to represent nanoseconds.
On arm, the TSC is a control register, accessible only in kernel mode. But, higher end arm arches allow this to be mapped for R/O access by userspace. The kernel enables the mapping. Then, the VDSO snippet will know how to access the values via the mapping.
Calls to clock_gettime are fast. So fast that it's not worth trying to access the counter register directly.
Also, it's not terribly meaningful to access the counter directly, because we still have to convert it to some standard unit (e.g. nanoseconds). The VDSO snippet will do this.
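For illustration, here is a minimal sketch of timing a region of code with clock_gettime (assuming a POSIX system; nothing here beyond that is library-specific):

#include <stdio.h>
#include <time.h>

/* Difference between two timespecs, in nanoseconds. */
static long long elapsed_ns(struct timespec a, struct timespec b)
{
    return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
         + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code under measurement ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("elapsed: %lld ns\n", elapsed_ns(t0, t1));
    return 0;
}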
UPDATE:
Is it a VDSO call on macOS, too? – fuz
My direct experience with arm was on an nVidia Jetson [under linux].
But, AFAIK, macOS provides [has to provide] clock_gettime.
On older kernels, it may have to issue a syscall equivalent.
But, since the architecture provides the means for userspace to do the direct access on a given OS/kernel, there is every reason to believe the VDSO method is available under macOS as well. In fact, it is: https://www.unix.com/man-page/osx/7/vdso/
The way to see the specific mechanism is to build a program that uses clock_gettime and [using gdb] single step it a bit. Then, it is possible to have gdb disassemble the clock_gettime code.
We have to use gdb [vs. objdump and/or readelf] for the disassembly because the snippet is loaded/injected by the kernel dynamically, so it's not easily accessible with static analysis.
Further, the injected code can be processor model specific. The kernel probes the CPU arch and its features during boot. It crafts the snippet based on the features it finds.
Using gdb is how I examined clock_gettime [about 3 years ago for a commercial product], to verify that it would access the H/W without a syscall and that it provided the correct nanosecond values. In that particular case, I also looked at the arch specific sections in the kernel source code.
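For example, the inspection goes roughly like this (commands from memory, so treat them as a sketch rather than a recipe; "timetest" is just a placeholder name for a small program that calls clock_gettime in a loop):

gdb ./timetest
(gdb) break clock_gettime
(gdb) run
(gdb) info proc mappings     # look for the [vdso] region
(gdb) x/20i $pc              # disassemble at the breakpoint
(gdb) stepi                  # single step to see it read the counter (or fall back to a syscall)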
Related
From this post, I learned:
syscall is the default way of entering kernel mode on x86-64.
In practice, recent kernels are implementing a VDSO
Then I looked up the manual, at http://man7.org/linux/man-pages/man2/syscall.2.html :
The first table lists the instruction used to transition to kernel mode (which might not be the fastest or best way to transition to the kernel, so you might have to refer to vdso(7)), the register used to indicate the system call number, the register used to return the system call result, and the register used to signal an error...
But I lack some essential knowledge to understand the statements.
Is it true that vdso(7) is the implementation of syscall(2), or does syscall(2) invoke vdso(7) to complete the system call?
If neither is true, what is the relationship between vdso(7) and syscall(2)?
The vDSO (vdso(7)) is not the implementation of syscall(2).
Without the vDSO, a user-space application must trap into the kernel for every system call, which costs a user-to-kernel context switch.
With the vDSO, a few calls can complete entirely in user space, without that switch.
The kernel automatically maps the vDSO into the address space of all user-space applications.
Read more carefully the man pages syscalls(2), vdso(7) and the wikipages on system calls and VDSO. Read also the operating system wikipage and Operating Systems: Three Easy Pieces (freely downloadable).
System calls are fundamental: they are the only way a user-space application can interact with the operating system kernel and use the services it provides. So every program uses some system calls (unless it crashes and is terminated by some signal(7)). System calls require a user-to-kernel transition (e.g. through a SYSCALL or SYSENTER machine instruction on x86), which is somewhat "costly" (e.g. could take a microsecond).
VDSO is only a clever optimization (to avoid the cost of a genuine system call, for very few functions like clock_gettime(2) which also still exist as genuine system calls), a bit like some shared library magically provided by the kernel without any real file. Some programs (e.g. statically linked ones, or those not using libc like BONES or probably busybox) don't use it.
You can avoid VDSO (or not use it), and earlier kernels did not have it. But you cannot avoid doing system calls, and programs usually do a lot of them.
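As a small illustration (a sketch, assuming Linux with glibc), the same clock can be read through the vDSO fast path and through a forced genuine system call:

#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct timespec ts;

    /* Usually resolved through the vDSO: no kernel transition. */
    clock_gettime(CLOCK_MONOTONIC, &ts);

    /* Forces a genuine system call, bypassing the vDSO. */
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);

    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}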
Play also with strace(1) to understand the (many) system calls done by an application or a running process.
I'd like to use hardware performance counters, specifically on x86 CPUs, to obtain cache misses or branch mispredictions. Performance counters are heavily used in advanced profilers like Intel VTune. Please don't confuse these with the performance counters of the Windows operating system.
In order to use these counters in a C/C++ program, one may use PAPI: http://icl.cs.utk.edu/papi/
This allows you to easily use performance counters, but only on Linux. PAPI once supported Windows, but no longer does.
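For reference, typical PAPI usage on Linux looks roughly like this (a sketch using the low-level API, with error handling omitted; I have not tried this on Windows):

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L1_DCM);   /* L1 data cache misses  */
    PAPI_add_event(evset, PAPI_BR_MSP);   /* branch mispredictions */

    PAPI_start(evset);
    /* ... code under measurement ... */
    PAPI_stop(evset, counts);

    printf("L1 misses: %lld, branch mispredictions: %lld\n",
           counts[0], counts[1]);
    return 0;
}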
Is there anyone who recently tried PAPI or other APIs to use hardware performance counters on Windows?
You can use the RDPMC instruction or the __readpmc MSVC compiler intrinsic, which is the same thing.
However, Windows prohibits user-mode applications from executing this instruction by setting CR4.PCE to 0. Presumably, this is done because the meaning of each counter is determined by MSR registers, which are only accessible in kernel mode. In other words, unless you're in a kernel-mode module (e.g. a device driver), you are going to get a "privileged instruction" trap if you attempt to execute this instruction.
If you're writing a user-mode application, your only option is (as @Christopher mentioned in the comments) to write a kernel module which executes this instruction for you (you'll incur the user->kernel call penalty) and to enable test signing on your machine so your presumably self-signed "driver" can be loaded. This means you can't easily distribute the app, but it will work for in-house tuning.
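For completeness, a sketch of what the kernel-mode side boils down to (assuming MSVC; which event counter 0 tracks depends entirely on how the MSRs were programmed beforehand):

#include <intrin.h>

/* Reads programmable performance counter 0. Executes without trapping
   only where CR4.PCE allows it, i.e. in kernel mode on Windows. */
unsigned __int64 read_pmc0(void)
{
    return __readpmc(0);
}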
What about this HCP Reference? Does it not provide what you want?
I'm using an ARM Cortex-R4 for my system. It has a Memory Protection Unit instead of a Memory Management Unit. Effectively, this means that there's dedicated hardware for memory protection but that there's a one-to-one mapping between physical and virtual addresses. I'm a little confused about which Linux I should go for - the standard Linux kernel with the MMU disabled, or uClinux.
On ARM's evaluation board, I have run the standard kernel compiled with MMU disabled. I used the cramfs filesystem which is available on the official ARM website. After the kernel boots up, I'm in the shell, but I couldn't do much experimentation as I found that, most of the time, the shell stops responding (particularly when I press "tab" for auto-completion).
So I'm still not sure whether the MMU-less kernel should run smoothly if I use the correct filesystem. Also, which distro (buildroot?) should I use for the no-VM Linux?
Any idea or suggestion is welcome.
It's been more than 2 years since I asked this question. Now is the time I should write what I found for myself.
uClinux was a project forked from the Linux kernel long ago with the aim of developing a kernel for MMU-less systems. However, after a while it was merged back into the mainline Linux kernel. So today there is no separately maintained uClinux distribution.
So, if you disable the MMU in the mainline kernel configuration, you'll get an MMU-less version. In fact, there are now configuration options in the kernel itself whereby a user can specify the memory layout and the access permissions.
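As a rough illustration, the relevant part of an ARM no-MMU .config looks something like this (the symbol names may differ between kernel versions, and the base/size values here are made-up board parameters, so check them against your tree):

# CONFIG_MMU is not set
CONFIG_SET_MEM_PARAM=y
CONFIG_DRAM_BASE=0x08000000
CONFIG_DRAM_SIZE=0x01000000
CONFIG_FLASH_MEM_BASE=0x00000000
CONFIG_FLASH_SIZE=0x00400000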
Cheers!
uClinux is a Linux distribution which uses the Linux kernel with the MMU "turned off" and adds some applications and libraries on top of it. You won't choose one or the other, as they work best one on top of the other.
If you got to a point where you have a shell running, you've managed to boot Linux sans MMU on your board but ran into a bug.
I believe uClinux was built for exactly this kind of thing [MMU-less systems]
http://www.uclinux.org/description/
I learned from download.savannah.gnu.org/.../ProgrammingGroundUp-1-0-booksize.pdf
that programs interrupt the kernel, and that is how things are done. What I want to know is how you do that in C (if it's possible).
There is no platform-independent way (obviously)! On x86 platforms, system calls are typically implemented by placing the system-call number in the eax register and triggering int 80h in assembler, which causes a switch to kernel mode. The kernel then executes the relevant code based on what it sees in eax.
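A minimal sketch of doing that from C with GCC inline assembly, assuming 32-bit Linux on x86 (the syscall numbers are the x86-32 ones: 4 = write, 1 = exit):

int main(void)
{
    static const char msg[] = "hello via int 0x80\n";
    long ret;

    /* write(1, msg, len): syscall number in eax, args in ebx, ecx, edx. */
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "a" (4), "b" (1), "c" (msg), "d" (sizeof msg - 1)
                      : "memory");

    /* exit(0) */
    __asm__ volatile ("int $0x80" : : "a" (1), "b" (0));
    return 0;
}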
User processes usually request kernel services by calling system call wrapper functions from Standard C Library. You can do it manually with syscall(2).
The user program's interaction with the kernel is going to be very platform-specific, so it usually happens behind the scenes in the various library routines. So one just calls printf, write, select, or other library routines, which allow the programmer to write code without worrying about the details of the kernel interface, device drivers, and so forth.
And the way it usually works is that when one of those library routines needs the kernel to do something on its behalf, it performs a low-level system call that yields its control of the CPU to the kernel. It's the user program, not the kernel, that is the one being interrupted.
If you're using glibc (which you probably are if you are using gcc and Linux), then there is a syscall function in unistd.h that you can use. It has different implementations for different architectures and operating systems, but the implementation is done in assembly (possibly inline assembly). syscall has a man page, so:
man syscall
will give you some info.
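A quick usage sketch (assuming Linux):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Same effect as getpid(), but through the generic wrapper. */
    long pid = syscall(SYS_getpid);

    /* Same effect as write(1, "hi\n", 3). */
    syscall(SYS_write, 1, "hi\n", 3);

    printf("pid = %ld\n", pid);
    return 0;
}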
If you are just curious about how all of this works, you should know that it has changed in Linux on x86 in recent years. Originally, interrupt 0x80 was used by Linux as the normal system call entry point on x86. This worked well enough, but as processors gained more advanced pipelining (starting an instruction before previous instructions have completed), interrupts became slower relative to regular code, which has sped up (some tests have shown they slowed down in absolute terms as well). The reason is that even when the int instruction is used to trigger an interrupt, it works mostly the same as a hardware-triggered interrupt, which occurs unpredictably and therefore does not play nicely with instruction pipelining (pipelining works better when code paths are predictable).
To help with this, newer x86 processors have instructions specifically intended for making system calls, but Intel and AMD use different instructions for this (sysenter and syscall, respectively). Additionally, the Intel sysenter instruction clobbers a general-purpose register that Linux has used on x86_32 to pass a parameter to the kernel. This means that programs have to know which of the 3 possible system call mechanisms to use, as well as possibly different ways of passing arguments to the kernel. To get around all of this, newer kernels map a special page of memory into programs (this page is called vsyscall, and if you cat /proc/self/maps you will see an entry for it) that contains code for the system call mechanism the kernel has determined should be used on the system, and newer versions of glibc can implement their system call entry using the code in this page.
The point of all of this is that it isn't as simple as it used to be, but if you are just playing around on x86_32 then you should be able to use the int 80h instruction, because it remains supported for backwards compatibility even on systems that use one of the newer mechanisms.
In C, you don't really do it directly, but you'll end up doing this indirectly any time you use library functions that end up invoking system calls. File access, network access, etc, are typical examples of this.
Those functions will all end up "trapping" to the kernel, which will handle the request.
I have an open-source Atari 2600 emulator (Z26), and I'd like to add support for cartridges containing an embedded ARM processor (NXP 21xx family). The idea would be to simulate the 6507 until it tries to read or write a byte of memory (which it will do every 841ns). If the 6507 performs a write, put the address and data on some of the ARM's I/O ports and let the ARM code run 20 cycles, confirm that the ARM is floating its data bus, and let the ARM run for another 38 cycles. If the 6507 performs a read, put the address on the ARM's I/O ports, let the ARM run 38 cycles, grab the data from the ARM's I/O port (hopefully the ARM software will have put it there), and let the ARM run another 20 cycles.
The ARM7 seems pretty straightforward to implement; I don't need to simulate a whole lot of hardware features. Any thoughts?
Edit
What I have in mind would be a routine that takes as parameters a struct holding the machine state and a pointer to a memory access routine. When called, the routine would emulate the ARM's instruction engine, generating appropriate reads, writes, and code fetches. I could then write the memory access routine to treat appropriate areas as flash (with roughly-approximated wait states), RAM, I/O ports, and timer registers. Some other areas would be marked as don't-care, and accesses to any other areas would flag an error and stop the emulator.
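Roughly, the interface I'm picturing would look like this (all names here are placeholders I made up, not taken from Z26 or any existing core):

#include <stdint.h>

typedef struct arm_state arm_state_t;

/* Callback supplied by the host (Z26) for every data read/write and
   code fetch; it decides whether the address is flash, RAM, I/O,
   a timer register, don't-care, or an error. */
typedef uint32_t (*arm_mem_fn)(arm_state_t *st, uint32_t addr,
                               uint32_t value, int is_write, int is_fetch);

struct arm_state {
    uint32_t r[16];      /* r0-r12, sp, lr, pc */
    uint32_t cpsr;       /* status flags */
    uint64_t cycles;     /* cycles consumed so far */
    arm_mem_fn mem;      /* host memory-access routine */
    void *host;          /* opaque back-pointer for the host */
};

/* Run the instruction engine for at least 'budget' cycles and return
   the number actually consumed, so the host can interleave the ARM
   with the 6507's 20/38-cycle windows. */
int arm_run(arm_state_t *st, int budget);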
Perhaps QEMU uses such a thing internally. Since the ARM emulation would be integrated into an already-existing emulation engine (which I didn't write and don't fully understand--the only parts of Z26 I've patched have been the memory read/write logic) I would need something with a fairly small footprint.
Any idea how QEMU works inside? Any idea what the GPL licence would require if I just use 2% of the code in QEMU--whether I'd have to bundle the code for the whole thing, or just the part that I use, or what?
Try QEMU.
With some work, you can make my emulator do what you want. It was written for ARM920, and the Thumb instruction set isn't done yet. Neither is the MMU/cache interface. Also, it's slow because it is an interpreter. On the bright side, it's all written in C99.
http://code.google.com/p/gp2xemu/
I haven't worked on it for a while (the svn trunk is 2 years old), but if you're going to use the code, I'll be glad to help you out with the missing features. It is licensed under MIT, so it's much the same as the permissive BSD license.