I am writing a small C application that uses some threads for processing data. I want to find out the number of processors on a given machine without calling system() in combination with a small script.
The only way I can think of is to parse /proc/cpuinfo. Any other useful suggestions?
As others have mentioned in comments, this answer is useful:
numCPU = sysconf( _SC_NPROCESSORS_ONLN );
Leaving as a solution for folks that might skip over comments...
Why not use sys/sysinfo.h?
#include <sys/sysinfo.h>
#include <stdio.h>
int main(void) {
    printf("You have %d processors.\n", get_nprocs());
    return 0;
}
Way more information can be found on the man page
$ man 3 get_nprocs
machine:/sys/devices/system/cpu$ ls
cpu0 cpu3 cpu6 kernel_max perf_counters sched_mc_power_savings
cpu1 cpu4 cpu7 offline possible
cpu2 cpu5 cpuidle online present
If you have a machine with sysfs, take a look in /sys/devices/system/cpu.
Make sure you're asking for what you want -- CPUs, cores, hyperthreads, etc.
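The following is the code I used to query the logical processor count; it might help you.
// Querying the logical processor count field using the cpuid instruction (MSVC-style inline asm)
__asm
{
mov eax,01h // leaf 01h returns feature information; EBX[23:16] holds the logical processor count field
cpuid
mov t,ebx
}
(t>>16)&0xff then contains the maximum number of addressable logical processor IDs in the package. This is an upper bound rather than an exact core count, and it is only valid when the HTT flag (EDX bit 28) is set.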
I guess this could help you
http://lists.gnu.org/archive/html/autoconf/2002-08/msg00126.html
#include <stdio.h>
void getPSN(char *PSN)
{
    unsigned int varEAX, varEBX, varECX, varEDX;
    char str[9];
    //%eax=1 gives most significant 32 bits in eax
    __asm__ __volatile__ ("cpuid": "=a" (varEAX), "=b" (varEBX), "=c" (varECX), "=d" (varEDX) : "a" (1));
    sprintf(str, "%08X", varEAX); //i.e. XXXX-XXXX-xxxx-xxxx-xxxx-xxxx
    //advance PSN past what was written; passing PSN as both source and destination of sprintf is undefined
    PSN += sprintf(PSN, "%.4s-%.4s", str, str + 4);
    //%eax=3 gives least significant 64 bits in edx and ecx [if PN is enabled]
    __asm__ __volatile__ ("cpuid": "=a" (varEAX), "=b" (varEBX), "=c" (varECX), "=d" (varEDX) : "a" (3));
    sprintf(str, "%08X", varEDX); //i.e. xxxx-xxxx-XXXX-XXXX-xxxx-xxxx
    PSN += sprintf(PSN, "-%.4s-%.4s", str, str + 4);
    sprintf(str, "%08X", varECX); //i.e. xxxx-xxxx-xxxx-xxxx-XXXX-XXXX
    sprintf(PSN, "-%.4s-%.4s", str, str + 4);
}
int main()
{
char PSN[30]; //24 Hex digits, 5 '-' separators, and a '\0'
getPSN(PSN);
printf("%s\n", PSN); //compare with: lshw | grep serial:
return 0;
}
Here's a minimal example of how to get physical cores and virtual threads:
#include <stdio.h>
...
unsigned int thread_count, core_count;
FILE *cpu_info = fopen("/proc/cpuinfo", "r");
while (!fscanf(cpu_info, "siblings\t: %u", &thread_count))
fscanf(cpu_info, "%*[^s]");
while (!fscanf(cpu_info, "cpu cores\t: %u", &core_count))
fscanf(cpu_info, "%*[^c]");
fclose(cpu_info);
It doesn't rely on _SC_NPROCESSORS_ONLN, which is an extension rather than part of POSIX, but note that /proc/cpuinfo is itself Linux-specific, so this is not portable beyond Linux.
Note that you don't need to check for EOF in this example as fscanf will return EOF if reached. This will cause the loop to stop safely.
Also, this example doesn't contain error checking to see if fopen failed. This should be done however you see fit.
This fscanf technique was derived from here: https://stackoverflow.com/a/43483850
I know how to get the number of logical cores in C.
sysconf(_SC_NPROCESSORS_CONF);
This will return 4 on my i3 processor. But actually there are only 2 cores in an i3.
How can I get physical core count?
This is a C solution using libcpuid.
cores.c:
#include <stdio.h>
#include <libcpuid.h>
int main(void)
{
struct cpu_raw_data_t raw;
struct cpu_id_t data;
cpuid_get_raw_data(&raw);
cpu_identify(&raw, &data);
printf("No. of Physical Core(s) : %d\n", data.num_cores);
return 0;
}
This is a C++ solution using Boost.
cores.cpp:
// use boost to get number of cores on the processor
// compile with : g++ -o cores cores.cpp -lboost_system -lboost_thread
#include <iostream>
#include <boost/thread.hpp>
int main ()
{
std::cout << "No. of Physical Core(s) : " << boost::thread::physical_concurrency() << std::endl;
std::cout << "No. of Logical Core(s) : " << boost::thread::hardware_concurrency() << std::endl;
return 0;
}
On my desktop (i5 2310) it returns:
No. of Physical Core(s) : 4
No. of Logical Core(s) : 4
While on my laptop (i5 480M):
No. of Physical Core(s) : 2
No. of Logical Core(s) : 4
This means that my laptop processor has Hyper-Threading technology.
Without any lib:
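#include <stdio.h>
int main(void)
{
    // CPUID leaf 0Bh (extended topology), sub-leaf 1 passed in ecx
    unsigned int eax=11,ebx=0,ecx=1,edx=0;
    asm volatile("cpuid"
        : "=a" (eax),
          "=b" (ebx),
          "=c" (ecx),
          "=d" (edx)
        : "0" (eax), "2" (ecx)
        : );
    // Per Intel's documentation, EAX[4:0] is the x2APIC ID shift width for this level rather than a core count,
    // EBX[15:0] is the number of logical processors at this level, and EDX is the current x2APIC ID.
    printf("Cores: %u\nThreads: %u\nActual thread: %u\n",eax,ebx,edx);
    return 0;
}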
Output:
Cores: 4
Threads: 8
Actual thread: 1
You might simply read and parse /proc/cpuinfo pseudo-file (see proc(5) for details; open that pseudo-file as a text file and read it sequentially line by line; try cat /proc/cpuinfo in a terminal).
The advantage is that you are just parsing a (Linux-specific) text [pseudo-]file, without needing any external libraries (as in Gengisdave's answer). The disadvantage is that you need to parse it, which is not a big deal: read lines with fgets in a loop (the flags: line can be much longer than 80 bytes), then use sscanf and test the scanned item count.
The ht presence in flags: line means that your CPU has hyper-threading. The number of CPU threads is given by the number of processor: lines. The actual number of physical cores is given by cpu cores: (all this using a 4.1 kernel on my machine).
I am not sure you are right in wanting to understand how many physical cores you have. Hyper-threading may actually be useful. You need to benchmark.
And you probably should make the number of working threads (e.g. the size of your thread pool) in your application be user-configurable. Even on a 4 core hyper-threaded processor, I might want to have no more than 3 running threads (because I want to use the other threads for something else).
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned int lcores = 0, tsibs = 0;
char buff[32];
char path[64];
for (lcores = 0;;lcores++) {
FILE *cpu;
snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list", lcores);
cpu = fopen(path, "r");
if (!cpu) break;
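        while (fscanf(cpu, "%31[0-9]", buff) == 1) { // bounded read; stops on EOF or a non-digit
            tsibs++;
            if (fgetc(cpu) != ',') break; // note: a range such as "0-1" is counted as one entry
        }
        fclose(cpu);
    }
    if (lcores == 0 || tsibs < lcores) return 1; // nothing readable: avoid dividing by zero
    printf("physical cores %u\n", lcores / (tsibs / lcores));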
}
thread_siblings_list has a comma delimited list of cores which are "thread siblings" with the current core.
Divide the number of logical cores by the number of siblings to get the siblings per core. Divide the number of logical cores by the siblings per core to get the number of physical cores.
I'm trying my hand at assembly in order to use vector operations, which I've never really used before, and I'm admittedly having a bit of trouble grasping some of the syntax.
The relevant code is below.
uint16_t asdf[4];
asdf[0] = 1;
asdf[1] = 2;
asdf[2] = 3;
asdf[3] = 4;
uint16_t other = 3;
__asm__("movq %0, %%mm0"
:
: "m" (asdf));
__asm__("pcmpeqw %0, %%mm0"
:
: "r" (other));
__asm__("movq %%mm0, %0" : "=m" (asdf));
printf("%u %u %u %u\n", asdf[0], asdf[1], asdf[2], asdf[3]);
In this simple example, I'm trying to do a 16-bit compare of "3" to each element in the array. I would hope that the output would be "0 0 65535 0". But it won't even assemble.
The first assembly instruction gives me the following error:
error: memory input 0 is not directly addressable
The second instruction gives me a different error:
Error: suffix or operands invalid for `pcmpeqw'
Any help would be appreciated.
You can't use registers directly in gcc asm statements and expect them to match up with anything in other asm statements -- the optimizer moves things around. Instead, you need to declare variables of the appropriate type and use constraints to force those variables into the right kind of register for the instruction(s) you are using.
The relevant constraints for MMX/SSE are x for xmm registers and y for mmx registers. For your example, you can do:
#include <stdint.h>
#include <stdio.h>
typedef union xmmreg {
uint8_t b[16];
uint16_t w[8];
uint32_t d[4];
uint64_t q[2];
} xmmreg;
int main() {
xmmreg v1, v2;
v1.w[0] = 1;
v1.w[1] = 2;
v1.w[2] = 3;
v1.w[3] = 4;
v2.w[0] = v2.w[1] = v2.w[2] = v2.w[3] = 3;
asm("pcmpeqw %1,%0" : "+x"(v1) : "x"(v2));
printf("%u %u %u %u\n", v1.w[0], v1.w[1], v1.w[2], v1.w[3]);
}
Note that you need to explicitly replicate the 3 across all the relevant elements of the second vector.
From intel reference manual:
PCMPEQW mm, mm/m64 Compare packed words in mm/m64 and mm for equality.
PCMPEQW xmm1, xmm2/m128 Compare packed words in xmm2/m128 and xmm1 for equality.
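Your pcmpeqw uses an "r" (general-purpose register) operand, which is wrong. Only mm registers and m64 memory operands are accepted (or xmm registers and m128 operands for the SSE2 form).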
valter
The code above failed while the compiler was expanding the asm(); it never even got as far as assembling anything. In this case, you are trying to use the zeroth argument (%0), but you didn't give any.
Check out the GCC Inline assembler HOWTO, or read the relevant chapter of your local GCC documentation.
He's right, the optimizer is changing register contents. Switching to intrinsics and using volatile to keep things a little more in place might help.
I'm trying to write some self modifying code in C and ARM. I previously asked a similar question about MIPS and am now trying to port over the project to ARM.
My system := Raspbian on raspberry pi, ARMv6, GCC
There are a few things I am unsure of:
Does ARM require a D-cache write-back/I-cache invalidate (cache flush)? If so, how can we do this?
Also I tried an example
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int inc(int x){ //increments x
uint16_t *ret = malloc(2 * sizeof(uint16_t));
*(ret + 0) = 0x3001; //add r0 1 := r0 += 1
*(ret + 1) = 0x4770; //bx lr := jump back to inc()
int(*f)(int) = (int (*)(int)) ret;
return (*f)(x);
}
int main(){
printf("%d",inc(6)); //expect '7' to be printed
exit(0);}
but I keep getting a segmentation fault. I'm using the aapcs calling convention, which I've been given to understand is the default for all ARM
I'd be much obliged if someone pointed me in the right direction
Bonus question (meaning, it doesn't really have to be answered, but would be cool to know) - I "come from a MIPS background", how the heck do ARM programmers do without a 0 register? (as in, a register hardcoded to the value 0)
Read Caches and Self-Modifying Code on blogs.arm.com. Article includes an example as well which does what you are describing.
To answer your question, here is a quote from the article:
... the ARM architecture is often considered to be a Modified Harvard Architecture. ...
The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to ARM. On ARM, you can write instructions into memory, but because the D-cache and I-cache are not coherent, the newly-written instructions might be masked by the existing contents of the I-cache, causing the processor to execute old (or possibly invalid) instructions.
See __clear_cache for how to invalidate cache(s).
I hope you are also aware of ARM/Thumb instruction sets, if you are planning to push your instructions into memory.
Ok, so this works on my raspberry Pi.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>
int inc(int x){ //increments x
uint32_t *ret = mmap(NULL,
2 * sizeof(uint32_t), // Space for 2 instructions (mmap rounds this up to a whole page).
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_PRIVATE | MAP_ANONYMOUS,
-1,0);
if (ret == MAP_FAILED) {
printf("Could not mmap a memory buffer with the proper permissions.\n");
return -1;
}
*(ret + 0) = 0xE2800001; //add r0 r0 #1 := r0 += 1
*(ret + 1) = 0xE12FFF1E; //bx lr := jump back to inc()
__clear_cache((char*) ret, (char*) (ret+2));
int(*f)(int) = (int (*)(int)) ret;
return (*f)(x);
}
int main(){
printf("%d\n",inc(6)); //expect '7' to be printed
exit(0);}
There are a couple of problems.
You don't flush your D-cache and I-cache, so most of the time the I-cache will fetch stale data from L2. Under Linux there is a library call/syscall that does this for you: use either __clear_cache(begin, end) or __builtin___clear_cache(begin, end).
You emit Thumb code, but you don't take care of how your code gets called. The easiest fix would be to use some asm code to do the actual blx call and OR the address with 1, as this bit sets the mode the processor runs in. Because your malloc'ed address will always be aligned to a word boundary, bit 0 is clear and you end up calling Thumb code in ARM mode.
I intend to write my own JIT-interpreter as part of a course on VMs. I have a lot of knowledge about high-level languages, compilers and interpreters, but little or no knowledge about x86 assembly (or C for that matter).
Actually I don't know how a JIT works, but here is my take on it: Read in the program in some intermediate language. Compile that to x86 instructions. Ensure that the last instruction returns to somewhere sane back in the VM code. Store the instructions somewhere in memory. Do an unconditional jump to the first instruction. Voila!
So, with that in mind, I have the following small C program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main() {
int *m = malloc(sizeof(int));
*m = 0x90; // NOP instruction code
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "eax");
return 42;
}
Okay, so my intention is for this program to store the NOP instruction somewhere in memory, jump to that location and then probably crash (because I haven't setup any way for the program to return back to main).
Question: Am I on the right path?
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
Question: Other issues I should beware of?
PS: My goal is to gain understanding, not necessarily do everything the right way.
Thanks for all the feedback. The following code seems to be the place to start and works on my Linux box:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
unsigned char *m;
int main() {
unsigned int pagesize = getpagesize();
printf("pagesize: %u\n", pagesize);
m = malloc(1023+pagesize+1);
if(m==NULL) return(1);
printf("%p\n", m);
m = (unsigned char *)(((long)m + pagesize-1) & ~(pagesize-1));
printf("%p\n", m);
if(mprotect(m, 1024, PROT_READ|PROT_EXEC|PROT_WRITE)) {
printf("mprotect fail...\n");
return 0;
}
m[0] = 0xc9; //leave
m[1] = 0xc3; //ret
m[2] = 0x90; //nop
printf("%p\n", m);
asm("jmp *%0"
: /* outputs: */ /* none */
: /* inputs: */ "d" (m)
: /* clobbers: */ "ebx");
return 21;
}
Question: Am I on the right path?
I would say yes.
Question: Could you show me a modified program that manages to find its way back to somewhere inside main?
I haven't got any code for you, but a better way to get to the generated code and back is to use a pair of call/ret instructions, as they will manage the return address automatically.
Question: Other issues I should beware of?
Yes - as a security measure, many operating systems would prevent you from executing code on the heap without making special arrangements. Those special arrangements typically amount to you having to mark the relevant memory page(s) as executable.
On Linux this is done using mprotect() with PROT_EXEC.
If your generated code follows the proper calling convention, then you can declare a pointer-to-function type and invoke the function this way:
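#include <sys/mman.h>
typedef void (*generated_function)(void);
// malloc'ed memory is usually not executable, so map an executable page instead
// (check the result against MAP_FAILED in real code)
void *func = mmap(NULL, 1024, PROT_READ | PROT_WRITE | PROT_EXEC,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
unsigned char *o = (unsigned char *)func;
generated_function func_exec = (generated_function)func; // the function pointer itself, not a pointer to it
*o++ = 0x90; // NOP
*o++ = 0xc3; // RET (near return; 0xcb would be a far return)
func_exec();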
The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it works to some extent: the result is bigger than 0, but it varies because the CPU frequency is not fixed (we tried to fix it, but for some reason we were not able to). He has a different machine which is running Linux (Ubuntu) as host, and it also reports 0. This rules out the virtual machine as the cause, which is what I suspected at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit carry arithmetic yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
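uint64_t rdtsc() {
    uint64_t ret;
#if defined(__x86_64__)
    // In 64-bit mode "=A" does not mean the edx:eax pair, so combine the halves by hand
    asm volatile ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax"
        : "=a"(ret)
        : /* no input */
        : "%rdx"
    );
#else
    asm volatile ("rdtsc"
        : "=A"(ret)
    );
#endif
    return ret;
}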
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
1. Passed directly to the real hardware (PV guest)
2. Count cycles while the VM is executing on the host processor (Windows / etc)
Note the "while the VM is executing on the host processor" in #2. The same phenomenon would apply to Xen as well, if I recall correctly. In essence, you can expect the code to work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
Hmm, I'm not positive, but I suspect the problem may be in this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm not sure you can safely carry out such huge multiplications in an "unsigned"; isn't that often a 32-bit number? The fact that you couldn't safely multiply by 2^32 directly and had to append an extra "* 4" to the 2^30 already hints at this possibility. You might need to convert each of hi and lo to a double (instead of a single cast at the very end) and do the multiplication using the two doubles.