Arm Cortex-A8 program flow prediction

I am examining the program flow prediction abilities of the ARM Cortex-A8. To do this, I implemented the code below:
char SecretDispatcher[256 * 512];
int counter = 0;
//evicting SecretDispatcher from cache
...
while(counter < (512 * 9 + 1))
{
//evict counter from cache
...
if(counter < (512 * 9))
{
asm volatile ("LDR %0, [%1]\n\t"
: "=r" (value)
: "r" (SecretDispatcher + index)
);
}
}
//measuring access time to SecretDispatcher[9 * 512]; I expect this memory cell to be in the cache, but it isn't
In the code above, the if statement executes with a true condition 8 times to train the CPU's branch predictor. On the 9th iteration I expect the CPU to access SecretDispatcher[9 * 512] speculatively, even though the condition is false. This is just a simple Spectre-v1 PoC attack. I implemented it successfully on an Intel x86 processor with the same logic, and I expected it to work on the Cortex-A8 as well, since Arm has stated that this processor is vulnerable to this attack.
Is there anything I am missing? Is there anything I need to do to enable program flow prediction on the ARM Cortex-A8?

Related

Inconsistent values of ARM PMU cycles counter

I'm trying to measure the performance of my code in the Linux kernel with the PMU.
First of all I wanted to test the PMU, so I created a simple loop of a couple of operations in the kernel. I placed it under a spin lock with interrupts disabled so that my test code can't be preempted. Then I printed the cycle counter to check how many CPU cycles the loop takes. But I see a very different value on each print: 100, 500, 1000, 200, ...
My question is: why do I see such different values every time?
PS: in contrast to the cycle counter, the PMU's instruction counter is stable and I see the same value every time.
I also tried using the ARM timer, but it shows varying values too, similar to the PMU's cycle counter.
Here is how I use the ARM timer to measure performance:
unsigned long long ticks_start, ticks_end;
int i = 0, j;
unsigned long flags;
spin_lock_irqsave(&lock, flags);
while (i++ < 100) {
j = 0;
asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_start));
while (j++ < 10000) {
asm volatile ("nop");
}
asm volatile("mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
printk("ticks %d are: %llu\n", i, ticks_end - ticks_start);
}
spin_unlock_irqrestore(&lock, flags);
and the output on a real device (Cortex-A57) is:
...
ticks 31 are: 2287
ticks 32 are: 2287
ticks 33 are: 2287
ticks 34 are: 1984
ticks 35 are: 457
ticks 36 are: 1604
ticks 37 are: 2287
...

For using things like timers and PMU on Arm, you should be inserting an isb instruction before the read of the PMU register. The processor is allowed by the architecture to speculatively read the register early, or late since it is not dependent on your inner loop of nops.
So try this:
asm volatile("isb; mrs %0, CNTPCT_EL0" : "=r" (ticks_end));
The isb flushes the pipeline, so the mrs instruction cannot proceed until the earlier instructions have completed. It is possible that the CPU is also thermally throttling. That should not affect a measurement taken in cycles with the PMU cycle counter, but it would affect a measurement of elapsed time taken with the generic timer.
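A minimal sketch of a reusable read helper along these lines. The names `read_counter_serialized` and `time_nop_loop` are hypothetical. The sketch reads `CNTVCT_EL0` (the virtual counter, which Linux normally leaves readable from user space) instead of `CNTPCT_EL0` so that it can be tried outside the kernel, and on non-AArch64 builds it falls back to the standard `clock()` purely so that it compiles everywhere.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a timestamp with an isb in front of it so the
 * counter read cannot be performed ahead of earlier instructions. */
static inline uint64_t read_counter_serialized(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    /* The "memory" clobber also keeps the compiler from moving loads and
     * stores across the read. */
    __asm__ volatile("isb\n\tmrs %0, CNTVCT_EL0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    /* Fallback for other targets (assumption: coarse, but standard C). */
    return (uint64_t)clock();
#endif
}

/* Time a busy loop of `iterations` nops and return the elapsed ticks. */
static inline uint64_t time_nop_loop(unsigned iterations)
{
    uint64_t start = read_counter_serialized();
    for (unsigned j = 0; j < iterations; j++)
        __asm__ volatile("nop");
    uint64_t end = read_counter_serialized();
    return end - start;
}
```

Using the same helper for both the start and the end reading keeps the two reads symmetric, so the isb cost appears on both sides of the measured region.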

ARM: Start/Wakeup/Bringup the other CPU cores/APs and pass execution start address?

I've been banging my head with this for the last 3-4 days and I can't find a DECENT explanatory documentation (from ARM or unofficial) to help me.
I've got an ODROID-XU board (big.LITTLE 2 x Cortex-A15 + 2 x Cortex-A7) board and I'm trying to understand a bit more about the ARM architecture. In my "experimenting" code I've now arrived at the stage where I want to WAKE UP THE OTHER CORES FROM THEIR WFI (wait-for-interrupt) state.
The missing information I'm still trying to find is:
1. When getting the base address of the memory-mapped GIC I understand that I need to read CBAR; But no piece of documentation explains how the bits in CBAR (the 2 PERIPHBASE values) should be arranged to get to the final GIC base address
2. When sending an SGI through the GICD_SGIR register, what interrupt ID between 0 and 15 should I choose? Does it matter?
3. When sending an SGI through the GICD_SGIR register, how can I tell the other cores WHERE TO START EXECUTION FROM?
4. How does the fact that my code is loaded by the U-BOOT bootloader affect this context?
The Cortex-A Series Programmer's Guide v3.0 (found here: link) states the following in section 22.5.2 (SMP boot in Linux, page 271):
While the primary core is booting, the secondary cores will be held in a standby state, using the
WFI instruction. It (the primary core) will provide a startup address to the secondary cores and wake them using an
Inter-Processor Interrupt(IPI), meaning an SGI signalled through the GIC
How does Linux do that? The documentation doesn't give any other details regarding "It will provide a startup address to the secondary cores".
My frustration is growing and I'd be very grateful for answers.
Thank you very much in advance!
EXTRA DETAILS
Documentation I use:
ARMv7-A&R Architecture Reference Manual
Cortex-A15 TRM (Technical Reference Manual)
Cortex-A15 MPCore TRM
Cortex-A Series Programmer's Guide v3.0
GICv2 Architecture Specification
What I've done by now:
U-Boot loads my code at 0x40008000; I've set up Translation Tables (TTBs), written TTBR0 and TTBCR accordingly, mapped 0x40008000 to 0x8000_0000 (2GB), and enabled the MMU
Set-up exception handlers of my own
I've got Printf functionality over the serial (UART2 on ODROID-XU)
All the above seems to work properly.
What I'm trying to do now:
Get the GIC base address => at the moment I read CBAR, simply AND (&) its value with 0xFFFF8000, and use the result as the GIC base address, although I'm almost sure this isn't right
Enable the GIC distributor (at offset 0x1000 from the GIC base address?) by writing the value 0x1 to GICD_CTLR
Construct an SGI with the following params: Group = 0, ID = 0, TargetListFilter = "All CPUs Except Me" and send it (write it) through the GICD_SGIR GIC register
Since I haven't passed any execution start address for the other cores, nothing happens after all this
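For reference, a small sketch of how the CBAR value could be turned into a base address. It assumes the Cortex-A15 layout in which CBAR bits [31:15] hold PERIPHBASE[31:15] and bits [7:0] hold PERIPHBASE[39:32]; that layout should be checked against the TRM for the exact core. The function name and the two offset macros are hypothetical, and the 0x1000 / 0x2000 offsets are the usual GICv2 distributor and CPU interface positions, assumed here rather than confirmed for this board.

```c
#include <stdint.h>

/* Assumed offsets of the GIC blocks from PERIPHBASE. */
#define GICD_OFFSET 0x1000u  /* distributor */
#define GICC_OFFSET 0x2000u  /* CPU interface */

/* Assumed CBAR layout (Cortex-A15): bits [31:15] = PERIPHBASE[31:15],
 * bits [7:0] = PERIPHBASE[39:32], everything in between reserved. */
static uint64_t periphbase_from_cbar(uint32_t cbar)
{
    uint64_t low  = cbar & 0xFFFF8000u;             /* PERIPHBASE[31:15] */
    uint64_t high = (uint64_t)(cbar & 0xFFu) << 32; /* PERIPHBASE[39:32] */
    return high | low;
}
```

Under that assumption, masking with 0xFFFF8000 already yields the right address whenever the peripherals sit below 4GB, because the low byte of CBAR is then zero.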
....UPDATE....
I've started looking at the Linux kernel and QEMU source codes in search for an answer. Here's what I found out (please correct me if I'm wrong):
When powering up the board ALL THE CORES start executing from the reset vector
A software (firmware) component executes WFI on the secondary cores and some other code that will act as a protocol between these secondary cores and the primary core, when the latter wants to wake them up again
For example, the protocol used on the EnergyCore ECX-1000 (Highbank) board is as follows:
**(1)** the secondary cores enter WFI and when
**(2)** the primary core sends an SGI to wake them up
**(3)** they check if the value at address (0x40 + 0x10 * coreid) is non-null;
**(4)** if it is non-null, they use it as an address to jump to (execute a BX)
**(5)** otherwise, they re-enter standby state, by re-executing WFI
**(6)** So, if I had an EnergyCore ECX-1000 board, I should write (0x40 + 0x10 * coreid) with the address I want each of the cores to jump to and send an SGI
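The protocol above can be sketched in plain C as a simulation. Everything here is illustrative: `fake_sysram`, `mailbox_offset`, `primary_publish` and `secondary_poll` are hypothetical names, the "memory" is an ordinary array rather than the board's real address space, and the WFI and SGI steps are only represented by comments.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4

/* Stand-in for the low memory that the firmware polls (simulation only). */
static uint32_t fake_sysram[(0x40 + 0x10 * NUM_CORES) / sizeof(uint32_t)];

/* Byte offset of the jump-address slot for a core: 0x40 + 0x10 * coreid. */
static size_t mailbox_offset(unsigned coreid)
{
    return 0x40u + 0x10u * coreid;
}

/* Primary core, step (6): publish the entry address. On hardware this is
 * followed by a barrier and the SGI that wakes the target core. */
static void primary_publish(unsigned coreid, uint32_t entry)
{
    fake_sysram[mailbox_offset(coreid) / sizeof(uint32_t)] = entry;
}

/* Secondary core, steps (3) to (5): returns the address to jump to, or 0
 * if the core should go back to WFI. */
static uint32_t secondary_poll(unsigned coreid)
{
    return fake_sysram[mailbox_offset(coreid) / sizeof(uint32_t)];
}
```

Writing the slot before sending the interrupt matters: a core that wakes up and finds zero simply goes back to sleep.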
Questions:
1. What is the software component that does this? Is it the BL1 binary I've written on the SD Card, or is it U-BOOT?
2. From what I understand, this software protocol differs from board to board. Is it so, or does it only depend on the underlying processor?
3. Where can I find information about this protocol for a pick-one ARM board? - can I find it on the official ARM website or on the board webpage?

Ok, I'm back baby. Here are the conclusions:
The software component that puts the CPUs to sleep is the bootloader (in my case U-Boot)
Linux somehow knows how the bootloader does this (hardcoded in the Linux kernel for each board) and knows how to wake them up again
For my ODROID-XU board the sources describing this process are UBOOT ODROID-v2012.07 and the linux kernel found here: LINUX ODROIDXU-3.4.y (it would have been better if I looked into kernel version from the branch odroid-3.12.y since the former doesn't start all of the 8 processors, just 4 of them but the latter does).
Anyway, here's the source code I've come up with, I'll post the relevant source files from the above source code trees that helped me writing this code afterwards:
typedef unsigned int DWORD;
typedef unsigned char BOOLEAN;
#define FAILURE (0)
#define SUCCESS (1)
#define NR_EXTRA_CPUS (3) // actually 7, but this kernel version can't wake them up all -> check kernel version 3.12 if you need this
// Hardcoded in the kernel and in U-Boot; here I've put the physical addresses for ease
// In my code (and in the linux kernel) these addresses are actually virtual
// (thus the 'VA' part in S5P_VA_...); note: mapped with memory type DEVICE
#define S5P_VA_CHIPID (0x10000000)
#define S5P_VA_SYSRAM_NS (0x02073000)
#define S5P_VA_PMU (0x10040000)
#define EXYNOS_SWRESET ((DWORD) S5P_VA_PMU + 0x0400)
// Other hardcoded values
#define EXYNOS5410_REV_1_0 (0x10)
#define EXYNOS_CORE_LOCAL_PWR_EN (0x3)
BOOLEAN BootAllSecondaryCPUs(void* CPUExecutionAddress){
// 1. Get bootBase (the address where we need to write the address where the woken CPUs will jump to)
// and powerBase (we also need to power up the cpus before waking them up (?))
DWORD bootBase, powerBase, powerOffset, clusterID, i;
asm volatile ("mrc p15, 0, %0, c0, c0, 5" : "=r" (clusterID));
clusterID = (clusterID >> 8);
powerOffset = 0;
if( (*(DWORD*)S5P_VA_CHIPID & 0xFF) < EXYNOS5410_REV_1_0 )
{
if( (clusterID & 0x1) == 0 ) powerOffset = 4;
}
else if( (clusterID & 0x1) != 0 ) powerOffset = 4;
bootBase = S5P_VA_SYSRAM_NS + 0x1C;
powerBase = (S5P_VA_PMU + 0x2000) + (powerOffset * 0x80);
// 2. Power up each CPU, write bootBase and send a SEV (they are in WFE [wait-for-event] standby state)
for (i = 1; i <= NR_EXTRA_CPUS; i++)
{
// 2.1 Power up this CPU
powerBase += 0x80;
DWORD powerStatus = *(DWORD*)( (DWORD) powerBase + 0x4);
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) == 0)
{
*(DWORD*) powerBase = EXYNOS_CORE_LOCAL_PWR_EN;
for (DWORD retry = 0; retry < 10; retry++) // 10 millis timeout; separate counter so the outer loop's i is not clobbered
{
powerStatus = *(DWORD*)((DWORD) powerBase + 0x4);
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) == EXYNOS_CORE_LOCAL_PWR_EN)
break;
DelayMilliseconds(1); // not implemented here, if you need this, post a comment request
}
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) != EXYNOS_CORE_LOCAL_PWR_EN)
return FAILURE;
}
if ( (clusterID & 0x0F) != 0 )
{
if ( *(DWORD*)(S5P_VA_PMU + 0x0908) == 0 )
do { DelayMicroseconds(10); } // not implemented here, if you need this, post a comment request
while (*(DWORD*)(S5P_VA_PMU + 0x0908) == 0);
*(DWORD*) EXYNOS_SWRESET = (DWORD)(((1 << 20) | (1 << 8)) << i);
}
// 2.2 Write bootBase and execute a SEV to finally wake up the CPUs
asm volatile ("dmb" : : : "memory");
*(DWORD*) bootBase = (DWORD) CPUExecutionAddress;
asm volatile ("isb");
asm volatile ("\n dsb\n sev\n nop\n");
}
return SUCCESS;
}
This successfully wakes 3 of 7 of the secondary CPUs.
And now for that short list of relevant source files in u-boot and the linux kernel:
UBOOT: lowlevel_init.S - notice lines 363-369, how the secondary CPUs wait in a WFE for the value at _hotplug_addr to be non-zeroed and to jump to it; _hotplug_addr is actually bootBase in the above code; also lines 282-285 tell us that _hotplug_addr is to be relocated at CONFIG_PHY_IRAM_NS_BASE + _hotplug_addr - nscode_base (_hotplug_addr - nscode_base is 0x1C and CONFIG_PHY_IRAM_NS_BASE is 0x02073000, thus the above hardcodings in the linux kernel)
LINUX KERNEL: generic - smp.c (look at function __cpu_up), platform specific (odroid-xu): platsmp.c (function boot_secondary, called by generic __cpu_up; also look at platform_smp_prepare_cpus [at the bottom] => that's the function that actually sets the boot base and power base values)

For clarity and future reference, there's a subtle piece of information missing here thanks to the lack of proper documentation of the Exynos boot protocol (n.b. this question should really be marked "Exynos 5" rather than "Cortex-A15" - it's a SoC-specific thing and what ARM says is only a general recommendation). From cold boot, the secondary cores aren't in WFI, they're still powered off.
The simpler minimal solution (based on what Linux's hotplug does), which I worked out in the process of writing a boot shim to get a hypervisor running on the XU, takes two steps:
First write the entry point address to the Exynos holding pen (0x02073000 + 0x1c)
Then poke the power controller to switch on the relevant core(s): That way, they drop out of the secure boot path into the holding pen to find the entry point waiting for them, skipping the WFI loop and obviating the need to even touch the GIC at all.
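A minimal sketch of those two steps, assuming the register meanings used in the code earlier in this thread (the value 0x3 is `EXYNOS_CORE_LOCAL_PWR_EN` from that code). The function name is hypothetical, and both registers are passed in as pointers so the sketch can be exercised against plain variables instead of real hardware.

```c
#include <stdint.h>

#define CORE_LOCAL_PWR_EN 0x3u  /* power-enable value from the earlier code */

/* Hypothetical helper. On the board, `holding_pen` would point at
 * 0x02073000 + 0x1c and `core_config` at the power controller's
 * configuration register for the target core. */
static void boot_secondary_minimal(volatile uint32_t *holding_pen,
                                   volatile uint32_t *core_config,
                                   uint32_t entry_point)
{
    /* Step 1: publish the entry point before the core can look for it. */
    *holding_pen = entry_point;

    /* On hardware a barrier (for example dsb) belongs here so that the
     * write is visible before the core powers up. */

    /* Step 2: switch the core on. */
    *core_config = CORE_LOCAL_PWR_EN;
}
```

The order is the whole point: the core must find the entry address already in place when it reaches the holding pen.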
Unless you're planning a full-on CPU hotplug implementation you can skip checking the cluster ID - if we're booting, we're on cluster 0 and nowhere else (the check for pre-production chips with backwards cluster registers should be unnecessary on the Odroid too - certainly was for me).
From my investigation, firing up the A7s is a little more involved. Judging from the Exynos big.LITTLE switcher driver, it seems you need to poke a separate set of power controller registers to enable cluster 1 first (and you may need to mess around with the CCI too, especially to have the MMUs and caches on) - I didn't get further since by that point it was more "having fun" than "doing real work"...
As an aside, Samsung's mainline patch for CPU hotplug on the 5410 makes the core power control stuff rather clearer than the mess in their downstream code, IMO.

QEMU uses PSCI
The ARM Power State Coordination Interface (PSCI) is documented at: https://developer.arm.com/docs/den0022/latest/arm-power-state-coordination-interface-platform-design-document and controls things such as powering on and off of cores.
TL;DR this is the aarch64 snippet to wake up CPU 1 on QEMU v3.0.0 ARMv8 aarch64:
/* PSCI function identifier: CPU_ON. */
ldr w0, =0xc4000003
/* Argument 1: target_cpu */
mov x1, 1
/* Argument 2: entry_point_address */
ldr x2, =cpu1_entry_address
/* Argument 3: context_id */
mov x3, 0
/* Unused hvc args: the Linux kernel zeroes them,
* but I don't think it is required.
*/
hvc 0
and for ARMv7:
ldr r0, =0x84000003
mov r1, #1
ldr r2, =cpu1_entry_address
mov r3, #0
hvc 0
A full runnable example with a spinlock is available on the ARM section of this answer: What does multicore assembly language look like?
The hvc instruction then gets handled by an EL2 handler, see also: the ARM section of: What are Ring 0 and Ring 3 in the context of operating systems?
Linux kernel
In Linux v4.19, the PSCI function identifiers are passed to the Linux kernel through the device tree. QEMU, for example, auto-generates an entry of the form:
psci {
method = "hvc";
compatible = "arm,psci-0.2", "arm,psci";
cpu_on = <0xc4000003>;
migrate = <0xc4000005>;
cpu_suspend = <0xc4000001>;
cpu_off = <0x84000002>;
};
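The identifiers in that node are not arbitrary: they follow the SMC Calling Convention, where bit 31 marks a fast call, bit 30 selects the 64-bit variant, bits 29:24 name the service owner (4 for standard secure services, which is where PSCI lives) and the low 16 bits carry the function number. A small sketch that rebuilds them; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* SMC Calling Convention fields for a fast call to the standard secure
 * service owner. */
#define SMCCC_FAST_CALL  (1u << 31)
#define SMCCC_64BIT      (1u << 30)
#define SMCCC_OWNER_STD  (4u << 24)

/* Build a PSCI function ID from its function number.
 * use_64bit selects the SMC64 variant of the call. */
static uint32_t psci_fn_id(int use_64bit, uint32_t fn_number)
{
    uint32_t id = SMCCC_FAST_CALL | SMCCC_OWNER_STD | (fn_number & 0xFFFFu);
    if (use_64bit)
        id |= SMCCC_64BIT;
    return id;
}
```

For example, `psci_fn_id(1, 3)` gives 0xc4000003 (the 64-bit CPU_ON used in the aarch64 snippet) and `psci_fn_id(0, 3)` gives 0x84000003 (the ARMv7 one).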
The hvc instruction is called from: https://github.com/torvalds/linux/blob/v4.19/drivers/firmware/psci.c#L178
static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point)
which ends up going to: https://github.com/torvalds/linux/blob/v4.19/arch/arm64/kernel/smccc-call.S#L51

Go to www.arm.com and download the evaluation copy of the DS-5 development suite. Once it is installed, there will be a startup_Cortex-A15MPCore directory under the examples. Look at startup.s.

Why is TSC larger in VirtualBox when I enable "TiedToExecution"?

Background (my understanding of how rdtsc is virtualized): I am experimenting with TSC values in VirtualBox. My current understanding of how VirtualBox emulates rdtsc is that in virtual mode, any call to rdtsc will be offset by a predetermined result, which is a value set in another register. This value would be rdtsc on the host when the virtual machine started.
An advantage to this strategy is that rdtsc will advance with wall clock time in an expected manner, but the disadvantage is that a process may perceive rdtsc to take longer than expected. For instance, in simple code like this:
x = rdtsc();
y = rdtsc();
z = y - x;
print z
executed on the guest, z may be larger than expected because of the wall-clock-time cost associated with trapping rdtsc. It would be even worse if the host OS swapped off the VirtualBox process in between these two calls.
From reading the VirtualBox manual (Change TSC Mode), I read there is an alternative virtualization technique which is supposed to directly simulate TSC. As I understand it, the offset value will only take into account time that the guest OS actually uses the CPU. The advantage is that with respect to cycles available, TSC will behave exactly as if it was on a host machine. The downside is that TSC will drift away from wall-clock-time as there are "missing cycles" that the guest OS is not aware of.
My goal: I am trying to set VirtualBox to do the 2nd option. I want to emulate the short-term behavior of rdtsc as if it were running in hardware as precisely as possible, and I don't care if it doesn't match wall-clock-time. I am fully aware that this is not "reliable" on SMP; it's for experimenting not for enterprise software.
What I did: First I wrote a simple test program that calls rdtsc repeatedly, then prints the results:
#include <stdio.h>
#include <stdint.h>
static __inline__ uint64_t rdtsc(void)
{
uint32_t lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
int main()
{
int i;
uint64_t val[8];
/* unrolled so that no loop overhead sits between consecutive reads */
val[0] = rdtsc();
val[1] = rdtsc();
val[2] = rdtsc();
val[3] = rdtsc();
val[4] = rdtsc();
val[5] = rdtsc();
val[6] = rdtsc();
val[7] = rdtsc();
for (i = 0; i < 8; i++) {
printf("rdtsc (%2d): %llX", i, (unsigned long long)val[i]);
if (i > 0) {
printf("\t\t (+%llX)", (unsigned long long)(val[i] - val[i - 1]));
}
printf("\n");
}
return 0;
}
I tried this program on my host machine. Then, I ran it in my VirtualBox machine. The deltas between rdtsc were essentially identical -- the only difference was the value itself on my host was about 30T more. Example output:
rdtsc ( 0): 334F2252A1824
rdtsc ( 1): 334F2252A1836 (+12)
rdtsc ( 2): 334F2252A1853 (+1D)
rdtsc ( 3): 334F2252A1865 (+12)
rdtsc ( 4): 334F2252A1877 (+12)
rdtsc ( 5): 334F2252A1889 (+12)
rdtsc ( 6): 334F2252A18A6 (+1D)
rdtsc ( 7): 334F2252A18B8 (+12)
Then, I changed the TSCTiedToExecution flag in VirtualBox, which I thought was supposed to ignore wall-clock-time in favor of more precise virtual cycle counting. I got this from the manual page I mentioned above:
./VBoxManage setextradata "HelloWorld" "VBoxInternal/TM/TSCTiedToExecution" 1
However this gave me unexpected results. The virtual program now returned:
rdtsc ( 0): F2252A1824
rdtsc ( 1): F2252A1836 (+B12)
rdtsc ( 2): F2252A1853 (+B1D)
rdtsc ( 3): F2252A1865 (+AFF)
rdtsc ( 4): F2252A1877 (+B13)
rdtsc ( 5): F2252A1889 (+AF2)
rdtsc ( 6): F2252A18A6 (+B1D)
rdtsc ( 7): F2252A18B8 (+B0C)
With TSCTiedToExecution on, rdtsc seems to be taking about 1100 cycles to execute....
Question: First, I am wondering why did I get this behavior? It seems like almost the opposite of what I would expect, and it certainly does not match with my understanding of how this is implemented.
Second, I am wondering how can I accomplish my original goal of having TSC advance for each virtual cycle as if it was on hardware?
My Setup: I am running on an 8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz. VirtualBox has VMX and nested paging enabled. I compiled it from source, version: 4.1.2_OSE r38459.
Thanks in advance.
P.S. I started a bounty on this, but still no answers...

If you want to see something even worse, disable "VBoxInternal/TM/TSCTiedToExecution" and run your test program again. The following code
ULONGLONG x1 = Cpu::Rdtsc();
ULONGLONG x2 = Cpu::Rdtsc();
DbgPrintUlong('D', x2 - x1, 30, 23);
running on VirtualBox with "VBoxInternal/TM/TSCTiedToExecution" disabled shows that x2 - x1 took about 200 000 cycles. In contrast, on a machine with "VBoxInternal/TM/TSCTiedToExecution" enabled it took only 3 000 cycles. I think this reduction is what the following passage from the VirtualBox manual means: "In special circumstances it may be useful however to make the TSC (time stamp counter) in the guest reflect the time actually spent executing the guest."
So I think we won't have better TSC emulation in VirtualBox for a long time.
The only thing I can advise is to move to VMware Workstation. It has much better TSC emulation.

Can this be atomically executed?

I would like to know whether it is possible to ensure that the marked line is executed atomically, given that it could be executed by both the ISR and the Main context. I'm working on an ARM9 (LPC313x) and using RealView 4 (armcc).
foo() {
..
stack_var = ++volatile_var; // line
..
}
I'm looking for any routine like _atomic_ for C166, direct assembly code, etc. I would prefer not to have to disable the interrupts.
Thank you very much.

No, I don't think you can ever expect ++volatile_var to be atomic, even without the assignment. Use a proper atomic primitive for that. If your compiler doesn't provide such an extension, you can easily find short inline assembler for it on the web. The assembler instructions are called ldrex and strex (load and store exclusive) on ARM, I think.
Edit: it seems that the specific processor type that is asked for in the question does not implement these instructions.
Edit: The following should work with gcc, for another compiler one probably has to adapt the __asm__ parts.
inline
size_t arm_ldrex(size_t volatile*ptr) {
size_t ret;
__asm__ volatile ("ldrex %0,[%1]\t@ load exclusive\n"
: "=&r" (ret)
: "r" (ptr)
: "cc", "memory"
);
return ret;
}
inline
_Bool arm_strex(size_t volatile*ptr, size_t val) {
size_t error;
__asm__ volatile ("strex %0,%1,[%2]\t@ store exclusive\n"
: "=&r" (error)
: "r" (val), "r" (ptr)
: "cc", "memory"
);
return !error;
}
inline
size_t atomic_add_fetch(size_t volatile *object, size_t operand) {
for (;;) {
size_t oldval = arm_ldrex(object);
size_t newval = oldval + operand;
if (arm_strex(object, newval)) return newval;
}
}
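Where the toolchain supports C11 `<stdatomic.h>`, the same add-and-fetch can be written without any inline assembler. This is only a sketch under that assumption: the RealView 4 compiler from the question predates C11, and on a core without ldrex/strex it depends on the toolchain whether the operation is lock-free or available at all. The names `shared_counter` and `increment_and_fetch` are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Counter shared between the ISR and the main context. */
static atomic_size_t shared_counter;

/* Atomically add 1 and return the new value, like ++volatile_var. */
static size_t increment_and_fetch(void)
{
    /* atomic_fetch_add returns the value held before the addition. */
    return atomic_fetch_add(&shared_counter, 1) + 1;
}
```

`atomic_is_lock_free(&shared_counter)` can be used to check whether the implementation falls back to a lock, which would not be safe to take from an ISR.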

From a quick look, the C166 _atomic_ macro seems to utilize an instruction that effectively masks interrupts for the duration of a specified number of instructions.
There is nothing directly corresponding to that in the ARM architecture.
You could of course use the swp instruction (or __swp intrinsic in the RealView toolchain) to implement a lock around the critical section. ldrex/strex mentioned in another answer do not exist in ARM architecture version 5, which includes the ARM9 processors.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/CJAHDCHB.html and http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Chdbbbai.html respectively.
A simplistic lock implementation around this (using the RealView toolchain) would be:
{
/* Loop until lock acquired */
while (__swp(LOCKED, &lockvar) == LOCKED);
..
/* Critical section */
..
lockvar = UNLOCKED;
}
However, this will lead to deadlock in the ISR context when the Main thread is holding the lock.
I think masking interrupts around the operation is likely to be the least hairy solution, although if your Main context is executing in User mode it will require a system call to implement.

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?

Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
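A minimal sketch of the CPUID-then-RDTSC pattern described above, written as GNU C inline asm. The function name is hypothetical, and the non-x86 branch falls back to the standard `clock()` only so that the sketch compiles on other targets.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: execute CPUID as a serializing instruction, then
 * read the time stamp counter. */
static uint64_t rdtsc_serialized(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;
    /* CPUID leaf 0; the outputs are ignored, only the serializing side
     * effect matters. volatile keeps the compiler from dropping it. */
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback for other targets (assumption: coarse, but standard C). */
    return (uint64_t)clock();
#endif
}
```

Note that CPUID is expensive (and traps to the hypervisor in a virtual machine), so the cost of the serialization itself ends up in short measurements.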

I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit carry math yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm either; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
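#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
#if defined(__x86_64__)
/* in 64-bit code "=A" does not mean the edx:eax pair, so merge the halves by hand */
asm volatile ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax"
: "=a"(ret)
: /* no input */
: "%rdx"
);
#else
/* in 32-bit code "=A" is the edx:eax register pair */
asm volatile ("rdtsc"
: "=A"(ret)
);
#endif
return ret;
}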
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.

As for VMware, take a look at the timekeeping spec (PDF Link), as well as this thread. Depending on the guest OS, TSC instructions are either:
1. Passed directly to the real hardware (PV guest)
2. Made to count cycles while the VM is executing on the host processor (Windows / etc)
Note the wording in #2: while the VM is executing on the host processor. The same phenomenon would apply to Xen as well, if I recall correctly. In essence, you can expect the code to work as intended on a paravirtualized guest. If the TSC is emulated, it's entirely unreasonable to expect hardware-like consistency.

You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.

hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious if you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? ...just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 at the end already hints at this possibility... you might need to convert each sub-component hi and lo to a double (instead of a single one at the very end) and do the multiplication using the two doubles
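A quick way to test this suspicion is to evaluate the expression in isolation. In the sketch below (function names hypothetical, `unsigned` assumed to be 32 bits wide) the cast binds to hi before anything else, so each multiplication is carried out in double and the result for hi = 1 is 2^32 rather than an overflowed integer. The second function shows the integer alternative of widening before shifting.

```c
#include <stdint.h>

/* Same expression as in the question: (double) applies to hi first, so
 * both multiplications are done in double, not in unsigned. */
static double combine_as_double(unsigned hi, unsigned lo)
{
    return (double) hi * (1 << 30) * 4 + lo;
}

/* Integer alternative: widen to 64 bits before shifting. */
static uint64_t combine_as_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}
```

If both functions agree for the values being printed, the multiplication is unlikely to be the source of the zero.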
