When I connect a NOR flash to an ARM processor, do I connect A0 to A0 (and D0 to D0), as on a PowerPC, or A0 to A23 (and D0 to D7/D15), as on an x86? Sorry for this hardware question, but I could not find any reference design that documents ARM bit numbering (or whether it is reversed). Thanks in advance.
On a Raspberry Pi 3B, which has an ARM Cortex-A53 processor (ARMv8-A architecture, here running 32-bit code), I am reading the cycle counter register from the PMU (Performance Monitor Unit):
#include <stdint.h>

/* Read PMCCNTR, the PMU cycle counter, through the CP15 coprocessor interface. */
uint32_t cycle_counter_read(void)
{
    uint32_t cc = 0;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc));
    return cc;
}
I'd like to understand which ARM processors this code will run successfully on, other than the one in the Raspberry Pi 3B. Which ARM info is important for this:
processor name (Cortex-A53 only?)
architecture (ARMv8 only?)
AArch32 and/or AArch64?
the combination of the above?
something else?
See also: List of ARM architectures.
I'd like to understand which ARM processors this code will run successfully on, other than the one in the Raspberry Pi 3B. Which ARM info is important for this
Well, mrc p15 is moving into a register from coprocessor 15.
So, the important info is just whether the processor has a PMU on coprocessor #15.
I think the feature is optional (although strongly recommended), so it's always possible you'll come across a custom ARM chip that doesn't have a PMU. Hopefully it's always on coprocessor #15 at least, when it is present.
Probably the best you can do is find a list of chips known to have one. On ARMv8 you can check ID_AA64DFR0_EL1, the AArch64 Debug Feature Register 0, but I don't know of a reliable mechanism before that.
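As a rough illustration (my assumption, not part of the original answer): on an ARMv8 core in AArch64 state, with enough privilege to read the ID registers (EL1, or a kernel that traps and emulates the access), the PMUVer field of ID_AA64DFR0_EL1 tells you whether a PMUv3 is implemented:

#include <stdint.h>

/* AArch64-only sketch: PMUVer is bits [11:8] of ID_AA64DFR0_EL1.
 * 0b0000 means no PMU; 0b1111 means an implementation-defined PMU. */
static inline int pmu_present(void)
{
    uint64_t dfr0;
    __asm__ volatile ("mrs %0, ID_AA64DFR0_EL1" : "=r" (dfr0));
    uint64_t pmuver = (dfr0 >> 8) & 0xF;
    return pmuver != 0 && pmuver != 0xF;
}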
Searching for "arm pmu" found, for example this list in the Linux kernel documentation:
ARM cores often have a PMU for counting cpu and cache events like cache misses
and hits.
...
- compatible : should be one of
"apm,potenza-pmu"
"arm,armv8-pmuv3"
"arm,cortex-a73-pmu"
"arm,cortex-a72-pmu"
"arm,cortex-a57-pmu"
"arm,cortex-a53-pmu"
"arm,cortex-a35-pmu"
"arm,cortex-a17-pmu"
"arm,cortex-a15-pmu"
"arm,cortex-a12-pmu"
"arm,cortex-a9-pmu"
"arm,cortex-a8-pmu"
"arm,cortex-a7-pmu"
"arm,cortex-a5-pmu"
"arm,arm11mpcore-pmu"
"arm,arm1176-pmu"
"arm,arm1136-pmu"
"brcm,vulcan-pmu"
"cavium,thunder-pmu"
"qcom,scorpion-pmu"
"qcom,scorpion-mp-pmu"
"qcom,krait-pmu"
(This question was originally about the CVTSI2SD instruction and the fact that I thought it didn't work on the Pentium M CPU, but in fact it's because I'm using a custom OS and I need to manually enable SSE.)
I have a Pentium M CPU and a custom OS which so far used no SSE instructions, but I now need to use them.
Trying to execute any SSE instruction results in interrupt 6, invalid opcode (which under Linux would cause a SIGILL, but this isn't Linux), referred to in the Intel Architectures Software Developer's Manual (which I will call the IASDM from now on) as #UD - Invalid Opcode (UnDefined Opcode).
Edit: Peter Cordes actually identified the right cause, and pointed me to the solution, which I summarize below:
If you're running an ancient OS that doesn't support saving XMM regs on context switches, the SSE-enabling bit in one of the machine control registers won't be set.
Indeed, the IASDM mentions this:
If an operating system did not provide adequate system level support for SSE, executing an SSE or SSE2 instruction can also generate #UD.
Peter Cordes pointed me to the SSE OSDev wiki, which describes how to enable SSE by writing to both CR0 and CR4 control registers:
clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]
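Put together, those four steps look roughly like this (a minimal sketch using GCC inline assembly; it assumes you are already at ring 0 on bare metal, and is my own illustration rather than code from the wiki):

static void enable_sse(void)
{
    unsigned long cr0, cr4;

    __asm__ volatile ("mov %%cr0, %0" : "=r" (cr0));
    cr0 &= ~(1UL << 2);   /* clear CR0.EM: SSE/x87 instructions are not emulated */
    cr0 |=  (1UL << 1);   /* set CR0.MP: monitor coprocessor */
    __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0));

    __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
    cr4 |= (1UL << 9);    /* CR4.OSFXSR: OS supports FXSAVE/FXRSTOR and SSE */
    cr4 |= (1UL << 10);   /* CR4.OSXMMEXCPT: OS handles unmasked SIMD FP exceptions */
    __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4));
}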
Note that, in order to be able to write to these registers, you need to be at privilege level 0 if you are in protected mode. The answer to this question explains how to test for that: in protected mode (that is, when bit 0, PE, of CR0 is set to 1), you can test bits 0 and 1 of the CS selector, which should both be 0, as in the sketch below.
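An illustrative check of that condition (a hedged sketch of my own, assuming protected mode and GCC inline assembly):

static int in_ring0(void)
{
    unsigned short cs;
    __asm__ volatile ("mov %%cs, %0" : "=r" (cs));
    return (cs & 3) == 0;   /* the low two bits of CS are the current privilege level */
}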
Finally, the custom OS must properly handle XMM registers during context switches, by saving and restoring them when necessary.
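For instance (my sketch, not part of the original summary), FXSAVE/FXRSTOR can be used for that once CR4.OSFXSR is set; they need a 512-byte, 16-byte-aligned save area per task:

struct fpu_state {
    unsigned char area[512] __attribute__((aligned(16)));  /* FXSAVE/FXRSTOR layout */
};

static void fpu_save(struct fpu_state *s)
{
    __asm__ volatile ("fxsave %0" : "=m" (*s));
}

static void fpu_restore(const struct fpu_state *s)
{
    __asm__ volatile ("fxrstor %0" : : "m" (*s));
}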
If you're running an ancient or custom OS that doesn't support saving XMM regs on context switches, it won't have set the SSE-enabling bits in the machine control registers. In that case all instructions that touch xmm regs will fault.
Took me a sec to find, but http://wiki.osdev.org/SSE explains how to alter CR0 and CR4 to allow SSE instructions to run on bare metal without #UD.
My first thought on your old version of the question was
that you might have compiled your program with -mavx, -march=sandybridge or equivalent, causing the compiler to emit the VEX-encoded version of everything.
CVTSI2SD xmm1, xmm2/m32 ; SSE2
VCVTSI2SD xmm1, xmm2, xmm3/m32 ; AVX
See https://stackoverflow.com/tags/x86/info for links, including to Intel's insn set ref manual.
Most real-world kernels are built with options that stop the compiler from using SSE or x87 instructions on its own, for example gcc -mgeneral-regs-only. Or in older GCC, -mno-sse -mno-mmx and avoid any use of float or double types to avoid x87. This is so kernels only have to save/restore integer registers on interrupts and system calls, only doing the SIMD/FP state on a full context switch to a different user-space task. Before that option existed and was used, Linux kernel code that used double could silently corrupt user-space state!
If you have a freestanding program that isn't trying to context-switch between user-space tasks, go ahead and let the compiler use SSE / AVX.
Related: Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) has some details about how to check for support for AVX and AVX512 (which also introduce new architectural state, so the OS has to set a bit or the HW will fault). It's coming at it from the other angle, but the links should indicate how to activate / disable AVX support.
I suggest that you consult Intel's manual when you have such questions.
It's clearly stated in the manual that CVTSI2SD is an SSE2 instruction.
I have a doubt about the meaning of ARM Cortex-A15.
My understanding is that it is one processor (CPU) with 15 cores which uses the ARMv7 architecture. Please correct me if this understanding is not correct.
Do multicore and coprocessor mean the same thing or something different? Can you help me understand the difference, if there is one?
A15 is just an ARM processor (CPU) model number. It comes with 1-4 cores and is based on the ARMv7-A architecture.
Co-processors are compute units that help the ARM core (or any other processor, for that matter) do its operations more efficiently.
Multi-core means there is more than one core in the CPU.
E.g. Versatile Express is a two-cluster system with a dual-core (2x) Cortex-A15 and a triple-core (3x) Cortex-A7.
In this compiler output, I'm trying to understand how machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of the bytes at 4004d2-4004e0? From looking at the opcode list, it seems that the 66 .. codes are multi-byte expansions. I feel I could probably get a better answer here than by trying to grok the opcode list for a few hours.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;

void recurse(void);

int main(void) {
    recurse();
}

void recurse(void) {
    i++;
    recurse();
}
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in main() without calling the recurse() function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.
0x84 is a ModRM byte which in this case codes for an addressing mode that uses 5 more bytes.
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
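Putting that together for the 14 bytes in the listing above (my byte-by-byte annotation, following the rules just described):

66 66 66 66 66    five operand-size override prefixes (redundant copies are legal)
2e                CS segment override, ignored as a segment in 64-bit mode (hence the %cs: in the mnemonic)
0f 1f             two-byte NOP opcode that takes a ModRM byte; with the 66 prefix it disassembles as nopw
84                ModRM: mod=10, reg=000, rm=100, so a SIB byte and a 32-bit displacement follow
00                SIB: scale=1, index=rax, base=rax, i.e. (%rax,%rax,1)
00 00 00 00       32-bit displacement of 0, giving 0x0(%rax,%rax,1)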
From the section on the AMD K8 decoders in Agner Fog's microarch pdf:
Each of the instruction decoders can handle three prefixes per clock
cycle. This means that three instructions with three prefixes each can
be decoded in the same clock cycle. An instruction with 4 - 6 prefixes
takes an extra clock cycle to decode.
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is
-falign-functions=16. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
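For illustration (assuming GNU assembler syntax, and a hypothetical label name), the padding comes from a directive like this in the compiler's asm output; the assembler fills the gap up to the boundary with the longest NOPs it can:

        .p2align 4              # align to 2^4 = 16 bytes, padding with NOPs
        .globl  next_function   # hypothetical next function in the file
next_function: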
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents:
http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I believe that the nopw is junk - i is never read in your program, and there is thus no need to increment it.
I realize that this question is impossible to answer absolutely, but I'm only after ballpark figures:
Given a reasonably sized C program (thousands of lines of code), on average, how many ASM instructions would be generated? In other words, what's a realistic C-to-ASM instruction ratio? Feel free to make assumptions, such as 'with current x86 architectures'.
I tried to Google about this, but I couldn't find anything.
Addendum: noticing how much confusion this question brought, I feel some need for an explanation. What I wanted to know is, in practical terms, what "3 GHz" means. I am fully aware that the throughput per hertz varies tremendously depending on the architecture, your hardware, caches, bus speeds, and the position of the moon.
I am not after a precise and scientific answer, but rather an empirical answer that could be put into fathomable scales.
This isn't a trivial question to answer (as I came to notice), and this was my best effort at it. I know that the number of resulting lines of ASM per line of C varies depending on what you are doing. i++ is not in the same neighborhood as sqrt(23.1) - I know this. Additionally, no matter what ASM I get out of the C, the ASM is decoded into various sequences of micro-ops within the processor, which, again, depends on whether you are running AMD, Intel or something else, and on their respective generations. I'm aware of this as well.
The ballpark answers I've got so far are what I have been after: a large enough project averages about 2 lines of x86 ASM per line of ANSI C. Today's processors would probably average about one instruction per clock cycle once the pipelines are filled, given a big enough sample.
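Taken at face value (a back-of-the-envelope figure, not a measurement), that gives an upper bound of roughly 3×10⁹ cycles/s × ~1 instruction per cycle ÷ ~2 instructions per C line ≈ 1.5 billion C lines per second on a 3 GHz core, before cache misses, I/O and everything else discussed in the answers eats into it.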
There is no possible answer. Statements like int a; might require zero asm lines, while statements like a = call_is_inlined(); might require 20+ asm lines.
You can see for yourself by compiling a C program and then running objdump -Sd ./a.out. It will display asm and C code intermixed, so you can see how many asm lines are generated for one C line. Example:
test.c
int get_int(int c);
int main(void) {
int a = 1, b = 2;
return get_int(a) + b;
}
$ gcc -c -g test.c
$ objdump -Sd ./test.o
00000000 <main>:
int get_int(int c);
int main(void) { /* here, the prologue creates the frame for main */
0: 8d 4c 24 04 lea 0x4(%esp),%ecx
4: 83 e4 f0 and $0xfffffff0,%esp
7: ff 71 fc pushl -0x4(%ecx)
a: 55 push %ebp
b: 89 e5 mov %esp,%ebp
d: 51 push %ecx
e: 83 ec 14 sub $0x14,%esp
int a = 1, b = 2; /* setting up space for locals */
11: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%ebp)
18: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%ebp)
return get_int(a) + b;
1f: 8b 45 f4 mov -0xc(%ebp),%eax
22: 89 04 24 mov %eax,(%esp)
25: e8 fc ff ff ff call 26 <main+0x26>
2a: 03 45 f8 add -0x8(%ebp),%eax
} /* the epilogue runs, returning to the previous frame */
2d: 83 c4 14 add $0x14,%esp
30: 59 pop %ecx
31: 5d pop %ebp
32: 8d 61 fc lea -0x4(%ecx),%esp
35: c3 ret
I'm not sure what you mean by "C-instruction" - maybe statement or line? Of course this will vary greatly due to a number of factors, but after looking at a few sample programs of my own, many of them are close to the 2:1 mark (2 assembly instructions per LOC). I don't know what this means or how it might be useful.
You can figure this out yourself for any particular program and implementation combination by asking the compiler to generate only the assembly (gcc -S for example) or by using a disassembler on an already compiled executable (but you would need the source code to compare it to anyway).
Edit
Just to expand on this based on your clarification of what you are trying to accomplish (understanding how many lines of code a modern processor can execute in a second):
While a modern processor may run at 3 billion cycles per second that doesn't mean that it can execute 3 billion instructions per second. Here are some things to consider:
Many instructions take multiple cycles to execute (division or floating point operations can take dozens of cycles to execute).
Most programs spend the vast majority of their time waiting for things like memory accesses, disk accesses, etc.
Many other factors including OS overhead (scheduling, system calls, etc.) are also limiting factors.
But in general yes, processors are incredibly fast and can accomplish amazing things in a short period of time.
That varies tremendously! I wouldn't believe anyone who tried to offer a rough conversion.
Statements like i++; can translate to a single INC AX.
Statements for function calls containing many parameters can take dozens of instructions as the stack is set up for the call.
Then add in the compiler optimizations that will assemble your code in a manner different from how you wrote it, eliminating instructions.
Also, some instructions run better on machine-word boundaries, so NOPs will be peppered throughout your code.
I don't think you can conclude anything useful whatsoever about the performance of real applications from what you're trying to do here, unless 'not precise' means 'within several orders of magnitude'.
You're just overgeneralising, and you're dismissing caching etc. as though it's secondary, whereas it may well be totally dominant.
If your application is large enough to have trended to some average instructions-per-loc, then it will also be large enough to have I/O or at the very least significant RAM access issues to factor in.
Depending on your environment, you could use the Visual Studio option /FAs.
more here
I am not sure there is really a useful answer to this. For sure, you will have to pick the architecture (as you suggested).
What I would do: take a reasonably sized C program, give gcc the -S option, and check for yourself. It will generate the assembler source code and you can calculate the ratio for that program yourself.
RISC or CISC? What's an instruction in C, anyway?
Which is to repeat the above points: you really have no idea until you get very specific about the type of code you're working with.
You might try reviewing the academic literature regarding assembly optimization and the hardware/software cross-talk that has happened over the last 30-40 years. That's where you're going to find some kind of real data about what you're interested in. (Although I warn you, you might wind up seeing C->PDP data instead of C->IA-32 data.)
You wrote in one of the comments that you want to know what 3GHz means.
Even the frequency of the CPU does not tell you much on its own. Modern PC CPUs interleave and schedule instructions heavily; they fetch and prefetch, they cache memory and instructions, and often that cache is invalidated and thrown away. The best picture of processing power is gained by running real-world performance benchmarks.