I read the spinlock function code in the linux kernel. There are two functions related to spinlock. See the code below:
static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
{
short inc = 0x0100;
asm volatile (
LOCK_PREFIX "xaddw %w0, %1\n"
"1:\t"
"cmpb %h0, %b0\n\t"
"je 2f\n\t"
"rep ; nop\n\t"
"movb %1, %b0\n\t"
/* don't need lfence here, because loads are in-order */
"jmp 1b\n"
"2:"
: "+Q" (inc), "+m" (lock->slock)
:
: "memory", "cc");
}
static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
{
int inc = 0x00010000;
int tmp;
asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
"movzwl %w0, %2\n\t"
"shrl $16, %0\n\t"
"1:\t"
"cmpl %0, %2\n\t"
"je 2f\n\t"
"rep ; nop\n\t"
"movzwl %1, %2\n\t"
/* don't need lfence here, because loads are in-order */
"jmp 1b\n"
"2:"
: "+r" (inc), "+m" (lock->slock), "=&r" (tmp)
:
: "memory", "cc");
}
I have two question:
1.What's the difference between the two functions above?
2.What can I do to monitor the spinlock waiting time(the time it takes to first try the lock and finally get the lock)?does the variable inc means the spinlock waiting time?
Let me first explain how the spinlock code works. We have variables
uint16_t inc = 0x0100,
lock->slock; // I'll just call this "slock"
In the assembler code, inc is referred to as %0 and slock as %1. Moreover, %b0 denotes the lower 8 bit, i.e. inc % 0x100, and %h0 is inc / 0x100.
Now:
lock xaddw %w0, %1 ;; "inc := slock" and "slock := inc + slock"
;; simultaneously (atomic exchange and increment)
1:
cmpb %h0, %b0 ;; "if (inc / 256 == inc % 256)"
je 2f ;; " goto 2;"
rep ; nop ;; "yield();"
movb %1, %b0 ;; "inc = slock;"
jmp 1b ;; "goto 1;"
2:
Comparing the upper and lower byte of inc succeeds if inc is zero. Since inc has the value of the original lock, this happens if the lock is unlocked. In that case, the lock will already have been incremented to non-zero by the atomic exchange-and-increment, so it is now locked.
Otherwise, i.e. if the lock had already been locked, we pause a little, then update inc to the current value of the lock, and try again.
(I believe there's actually a possiblity for an overflow, if 28 threads simultaneously attempt to get the spinlock. In that case, slock is updated to 0x0100, 0x0200, ... 0xFF00, 0x0000, and would then appear to be unlocked. Maybe that's why the second version of the code uses a 16-bit wide counter, which would require 216 simultaneous attempts.)
Now let's insert a counter:
uint32_t spincounter = 0;
asm volatile( /* code below */
: "+Q" (inc), "+m" (lock->slock)
: "=r" (spincounter)
: "memory", "cc");
Now spincounter may be referred to as %2. We just need to increment the counter each time:
1:
inc %2
cmpb %h0, %b0
;; etc etc
I haven't tested this, but that's the general idea.
Related
I'm using inline assembly in GCC. I want to rotate a variable content 2 bits to the left (I moved the variable to the rax register and then rotate it 2 times). I wrote the code below but I faced segmentation fault(core dumped) error.
I would be grateful if you could help me.
uint64_t X = 13835058055282163712U;
asm volatile(
"movq %0 , %%rax\n"
"rol %%rax\n"
"rol %%rax\n"
:"=r"(X)
:"r"(X)
);
printf("%" PRIu64 "\n" , X);
The key to understanding inline asm is to understand that each asm statement has two parts:
The text of the actual assembler stuff, in which the compiler will make textual substitutions, but does not understand.
This is the AssemblerTemplate in the documentation (everything up to the first : in the __asm__()).
A description of what the assembler stuff does, in terms that the compiler does understand.
This the : OutputOperands : InputOperands : Clobbers in the documentation.
This must tell the compiler how the assembler fits in with all the code which the compiler is generating around it. The code generation is busy allocating registers to hold values, deciding what order to do things in, moving things out of loops, eliminating unused fragments of code, discarding values it no longer needs, and so on.
The actual assembler is a black box which takes the inputs described here, produces the outputs described and as a side effect may 'clobber' some registers and/or memory. This must be a complete description of what the assembler does... otherwise the compiler-generated asm around your template will clash with it and rely on false assumptions.
Armed with this information the compiler can decide what registers the assembler can use, and you should let it do that.
So, your fragment:
asm volatile(
"movq %0 , %%rax\n"
"rol %%rax\n"
"rol %%rax\n"
:"=r"(X)
:"r"(X)
);
has a few "issues":
you may have chosen %rax for the result on the basis that the asm() is like a function, and may be expected to return a result in %rax -- but that isn't so.
you went ahead and used %rax, which the compiler may (well) already have allocated to something else... so you are, effectively, 'clobbering' %rax but you have failed to tell the compiler about it !
you specified =r(X) (OutputOperand) which tells the compiler to expect an output in some register and that output will be the new value of the variable X. The %0 in the AssemblerTemplate will be replaced by the register selected for the output. Sadly, your assembly treats %0 as the input :-( And the output is, in fact, in %rax -- which, as above, the compiler is unaware of.
you also specified r(X) (InputOperand) which tells the compiler to arrange for the current value of the variable X to be placed in some register for the assembler to use. This would be %1 in the AssemblerTemplate. Sadly, your assembly does not use this input.
Even though the output and input operands both refer to X, the compiler may not make %0 the same register as %1. (This allows it to use the asm block as a non-destructive operation that leaves the original value of an input unmodified. If this isn't how your template works, don't write it that way.
generally you don't need the volatile when all the inputs and outputs are properly described by constraints. One of the fine things the compiler will do is discard an asm() if (all of) the output(s) are not used... volatile tells the compiler not to do that (and tells it a number of other things... see the manual).
Apart from that, things were wonderful. The following is safe, and avoids a mov instruction:
asm("rol %0\n"
"rol %0\n" : "+r"(X));
where "+r"(X) says that there is one combined input and output register required, taking the old value of X and returning a new one.
Now, if you don't want to replace X, then assuming Y is to be the result, you could:
asm("mov %1, %0\n"
"rol %0\n"
"rol %0\n" : "=r"(Y) : "r"(X));
But it's better to leave it up to the compiler to decide whether it needs to mov or whether it can just let an input be destroyed.
There are a couple of rules about InputOperands which are worth mentioning:
The assembler must not overwrite any of the InputOperands -- the compiler is tracking which values it has in which registers, and is expecting InputOperands to be preserved.
The compiler expects all InputOperands to be read before any OutputOperand is written. This is important when the compiler knows that a given InputOperand is not used again after the asm(), and it can therefore allocate the InputOperand's register to an OutputOperand. There is a thing called earlyclobber (=&r(foo)) to deal with this little wrinkle.
In the above, if you don't in fact use X again the compiler could allocate %0 and %1 to the same register! But the (redundant) mov will still be assembled -- remembering that the compiler really doesn't understand the AssemblerTemplate. So, you are in general better off shuffling values around in the C, not the asm(). See https://gcc.gnu.org/wiki/DontUseInlineAsm and Best practices for circular shift (rotate) operations in C++
So here are four variations on a theme, and the code generated (gcc -O2):
// (1) uses both X and Y in the printf() -- does mov %1, %0 in asm()
void Never_Inline footle(void) Dump of assembler code for function footle:
{ mov $0x492782,%edi # address of format string
unsigned long X, Y ; xor %eax,%eax
mov $0x63,%esi # X = 99
X = 99 ; rol %rsi # 1st asm
__asm__("\t rol %0\n" rol %rsi
"\t rol %0\n" : "+r"(X) mov %rsi,%rdx # 2nd asm, compiler using it as a copy-and-rotate
) ; rol %rdx
rol %rdx
__asm__("\t mov %1, %0\n" jmpq 0x4010a0 <printf#plt> # tailcall printf
"\t rol %0\n"
"\t rol %0\n" : "=r"(Y) : "r"(X)
) ;
printf("%lx %lx\n", X, Y) ;
}
// (2) uses both X and Y in the printf() -- does Y = X in 'C'
void Never_Inline footle(void) Dump of assembler code for function footle:
{ mov $0x492782,%edi
unsigned long X, Y ; xor %eax,%eax
mov $0x63,%esi
X = 99 ; rol %rsi # 1st asm
__asm__("\t rol %0\n" rol %rsi
"\t rol %0\n" : "+r"(X) mov %rsi,%rdx # compiler-generated mov
) ; rol %rdx # 2nd asm
rol %rdx
Y = X ; jmpq 0x4010a0 <printf#plt>
__asm__("\t rol %0\n"
"\t rol %0\n" : "+r"(Y)
) ;
printf("%lx %lx\n", X, Y) ;
}
// (3) uses only Y in the printf() -- does mov %1, %0 in asm()
void Never_Inline footle(void) Dump of assembler code for function footle:
{ mov $0x492782,%edi
unsigned long X, Y ; xor %eax,%eax
mov $0x63,%esi
X = 99 ; rol %rsi
__asm__("\t rol %0\n" rol %rsi
"\t rol %0\n" : "+r"(X) mov %rsi,%rsi # redundant instruction because of mov in the asm template
) ; rol %rsi
rol %rsi
__asm__("\t mov %1, %0\n" jmpq 0x4010a0 <printf#plt>
"\t rol %0\n"
"\t rol %0\n" : "=r"(Y) : "r"(X)
) ;
printf("%lx\n", Y) ;
}
// (4) uses only Y in the printf() -- does Y = X in 'C'
void Never_Inline footle(void) Dump of assembler code for function footle:
{ mov $0x492782,%edi
unsigned long X, Y ; xor %eax,%eax
mov $0x63,%esi
X = 99 ; rol %rsi
__asm__("\t rol %0\n" rol %rsi
"\t rol %0\n" : "+r"(X) rol %rsi # no wasted mov, compiler picked %0=%1=%rsi
) ; rol %rsi
jmpq 0x4010a0 <printf#plt>
Y = X ;
__asm__("\t rol %0\n"
"\t rol %0\n" : "+r"(Y)
) ;
printf("%lx\n", Y) ;
}
which, hopefully, demonstrates the compiler busily allocating values to registers, tracking which values it needs to hold on to, minimizing register/register moves, and generally being clever.
So the trick is to work with the compiler, understanding that the :OutputOperands:InputOperands:Clobbers is where you are describing what the assembler is doing.
In one of the exercises in my assembler course we have to create a program that will change all uppercase letters in char* to lowercase.
char *a = "AAA BB CCC";
char *b;
asm(
".intel_syntax noprefix;"
"xor ecx, ecx;"
"xor eax, eax;"
"lea eax, [%[a]];" // getting address of the first char into eax
"loop:"
"inc ecx;" // loop variable
"cmp ecx, 10;" // 10 cycles for very character in *a
"jne lowercase;"
"jmp end;"
"lowercase:"
"mov bl, [eax];" // copy value stored under eax address to bl
"add bl, 32;" // make it lowercase in ASCII
"mov [eax], bl;" // copy it back to address stored in eax
"inc eax;" // move pointer to the next char
"jmp loop;"
"end:"
"lea %[b], [eax];"
".att_syntax prefix;"
: [b] "=r" (b)
: [a] "r" (a)
: "eax", "ebx", "ecx"
);
My idea was to get address of the first char, increase it by 32 (making it lowercase in ASCII), save it on the same address, than move pointer to the next char and so on. Getting first char and adding 32 to it works fine, its this line that throws "Segmentation fault":
"mov [eax], bl;"
From what I understand it should change value under address held in eax to value of bl. I feel like I'm missing something obvious, but I'm pretty new to this and everything I found so far on the net leads me to believe that this is how I should do it.
In the linux kernel code, when a spinlock is locked, the spin_lock function will spinning. The code of spin_lock is below:
static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
{
int inc = 0x00010000;
int tmp;
asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
"movzwl %w0, %2\n\t"
"shrl $16, %0\n\t"
"1:\t"
"cmpl %0, %2\n\t"
"je 2f\n\t"
"rep ; nop\n\t"
"movzwl %1, %2\n\t"
/* don't need lfence here, because loads are in-order */
"jmp 1b\n"
"2:"
: "+r" (inc), "+m" (lock->slock), "=&r" (tmp)
:
: "memory", "cc");
}
My question is:
How can I add a time counter to monitor the spinning time of the lock?Please give me some advice.
You can use rdtsc time stamp counter to measure the interval ,you can view the below links http://www.xml.com/ldd/chapter/book/ch06.html
http://wiki.osdev.org/Inline_Assembly/Examples
I would like to ask some help form You! I have a project with a lot of C source. Most of them compiled with gcc, but some compiled with Intel compiler. This later codes have a lot of inline asm codes in Microsoft's MASM format. I would like to compile the whole project with gcc and modify as few code as I can. So I wrote a perl script which converts the intel format inline asm to GAS format. (BTW: I compile to 32 bit on a 64 bit Linux machine).
My problem that I have to specify for gcc that in the inline asm("...") which C variables are passed to the code adding the :: [var1] "m" var1, [var2] "m" var2, ... line at the end.
Is it a way to avoid this explicit specification?
My tryings:
The dummy test C code is simply replaces 4 characters the destination char array with the elements of the source char array (I know this is not the best way to do it. It is just a stupid example).
In the original function there is no explicit specification, but it can be compiled with Intel compiler (shame on me, but I did not test this, but it should work with Intel compiler as I made it based on the real code). LOOP label is used a lot of times, even in the same C source file.
#include <stdio.h>
void cp(char *pSrc, char *pDst) {
__asm
{
mov esi, pSrc
mov edi, pDst
mov edx, 4
LOOP:
mov al, [esi]
mov [edi], al
inc esi
inc edi
dec edx
jnz LOOP
};
}
int main() {
char src[] = "abcd";
char dst[] = "ABCD";
cp(src, dst);
printf("SRC: '%s', DST: '%s'\n", src, dst);
return 0;
}
Result is: SRC: 'abcd', DST: 'abcd'
The working converted cp code is (compiled with gcc).
GAS (AT&T) format (compiled: gcc -ggdb3 -std=gnu99 -m32 -o asm asm.c)
void cp(char *pSrc, char *pDst) {
asm(
"mov %[pSrc], %%esi\n\t"
"mov %[pDst], %%edi\n\t"
"mov $4, %%edx\n\t"
"LOOP%=:\n\t"
"mov (%%esi), %%al\n\t"
"mov %%al, (%%edi)\n\t"
"inc %%esi\n\t"
"inc %%edi\n\t"
"dec %%edx\n\t"
"jnz LOOP%=\n\t"
: [pDst] "=m" (pDst)
: [pSrc] "m" (pSrc)
: "esi", "edi", "edx", "al"
);
}
Intel format (compiled: gcc -ggdb3 -std=gnu99 -m32 -masm=intel -o asm asm.c)
void cp(char *pSrc, char *pDst) {
asm(".intel_syntax noprefix\n\t");
asm(
"mov esi, %[pSrc]\n\t"
"mov edi, %[pDst]\n\t"
"mov edx, 4\n\t"
"LOOP%=:\n\t"
"mov al, [esi]\n\t"
"mov [edi], al\n\t"
"inc esi\n\t"
"inc edi\n\t"
"dec edx\n\t"
"jnz LOOP%=\n\t"
: [pDst] "=m" (pDst)
: [pSrc] "m" (pSrc)
: "esi", "edi", "edx", "al"
);
asm(".intel_syntax prefix");
}
Both codes are working, but it needs some code modification (inserting the '%' characters, collecting the variables, modifying the jump labels and jump functions).
I also tried this version:
void cp(char *pSrc, char *pDst) {
asm(".intel_syntax noprefix\n\t");
asm(
"mov esi, pSrc\n\t"
"mov edi, pDst\n\t"
"mov edx, 4\n\t"
"LOOP:\n\t"
"mov al, [esi]\n\t"
"mov [edi], al\n\t"
"inc esi\n\t"
"inc edi\n\t"
"dec edx\n\t"
"jnz LOOP\n\t"
);
asm(".intel_syntax prefix");
}
But it drops
gcc -ggdb3 -std=gnu99 -masm=intel -m32 -o ./asm.exe ./asm.c
/tmp/cc2F9i0u.o: In function `cp':
/home/TAG_VR_20130311/vr/vr/slicecodec/src/./asm.c:41: undefined reference to `pSrc'
/home/TAG_VR_20130311/vr/vr/slicecodec/src/./asm.c:41: undefined reference to `pDst'
collect2: ld returned 1 exit status
Is there a way to avoid the definition of the input arguments and avoid the modifications of local labels?
ADDITION
I tried to use global variable. For this g constrain has to be used instead of m.
char pGlob[] = "qwer";
void cp(char *pDst) {
asm(".intel_syntax noprefix\n\t"
"mov esi, %[pGlob]\n\t"
"mov edi, %[pDst]\n\t"
"mov edx, 4\n\t"
"LOOP%=:\n\t"
"mov al, [esi]\n\t"
"mov [edi], al\n\t"
"inc esi\n\t"
"inc edi\n\t"
"dec edx\n\t"
"jnz LOOP%=\n\t"
".intel_syntax prefix" : [pDst] "=m" (pDst) : [pGlob] "g" (pGlob)
: "esi", "edi", "edx", "al");
}
ADDITION#2
I tried
"lea esi, pGlob\n\t" // OK
"lea esi, %[_pGlob]\n\t" // BAD
//"lea esi, pGlob_not_defined\n\t" // BAD
//gcc failed with: undefined reference to `pGlob_not_defined'
compiled to
lea esi, pGlob
lea esi, OFFSET FLAT:pGlob // BAD
//compilation fails with: Error: suffix or operands invalid for `lea'
It seems that only function local variables has to be defined. Global variable may be added to the trailer, but not really necessary. Both are working:
"mov esi, pGlob\n\t" // OK
"mov esi, %[_pGlob]\n\t" // OK
compiled to
mov esi, OFFSET FLAT:pGlob
mov esi, OFFSET FLAT:pGlob
I defined a function local variable. It has to be defined in the constraint part:
void cp(char *pDst) {
char pLoc[] = "yxcv";
asm(".intel_syntax noprefix\n\t"
...
//"mov esi, pLoc\n\t" // BAD
"mov esi, %[_pLoc]\n\t" // OK, 'm' BAD
...
".intel_syntax prefix" : [_pDst] "=m" (pDst) : [_pLoc] "g" (pLoc)
: "esi", "edi", "edx", "al");
Unfortunately it should be determined what are the global and what are the local variables. It is not easy as the asm code can be defined in C macros and even the surrounding function is not definite. I think only the precompiler has enough information to this. Maybe the code has to be precompiled with gcc -E ....
I realized that not defining the output in the constraint part, some code can be eliminated by the optimizer.
TIA!
Yes, you do need to specify the registers explicitly. GCC will not do that for you. And you can't (generally) put C variable names inside the ASM string.
Your final code block looks perfectly good, to me, but in GCC you don't need to choose which registers to use yourself. You should also use the volatile keyword to prevent the compiler thinking the code doesn't do anything given that it has no outputs.
Try this:
char pGlob[] = "qwer";
void cp(char *pDst) {
asm volatile (".intel_syntax noprefix\n\t"
"mov edx, 4\n\t"
"LOOP%=:\n\t"
"mov al, [%[pGlob]]\n\t"
"mov [%[pDst]], al\n\t"
"inc %[pGlob]\n\t"
"inc %[pDst]\n\t"
"dec edx\n\t"
"jnz LOOP%=\n\t"
".intel_syntax prefix" :: [pGlob] "g" (pGlob), [pDst] "g" (pDst) : "edx");
}
This way the compiler deals with loading the variables and chooses the registers for you (thereby eliminating a pointless copy from one register to another). Ideally you'd also eliminate the explicit use of edx, but it's not really necessary here.
Of course, in this silly example I would just recode the whole thing in C and let the compiler do its job.
I am trying to reverse engineer a c code, but this part of assembly I cant really understand. I know it is part of the SSE extension. However, somethings are really different than what I am used to in x86 instructions.
static int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
int ret;
__asm__ volatile(
"pxor %%xmm6, %%xmm6 \n\t"
ASMALIGN(4)
"1: \n\t"
"movdqu (%1), %%xmm0 \n\t"
"movdqu (%1, %3), %%xmm1 \n\t"
"psadbw (%2), %%xmm0 \n\t"
"psadbw (%2, %3), %%xmm1 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"paddw %%xmm1, %%xmm6 \n\t"
"lea (%1,%3,2), %1 \n\t"
"lea (%2,%3,2), %2 \n\t"
"sub $2, %0 \n\t"
" jg 1b \n\t"
: "+r" (h), "+r" (blk1), "+r" (blk2)
: "r" ((x86_reg)stride)
);
__asm__ volatile(
"movhlps %%xmm6, %%xmm0 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"movd %%xmm6, %0 \n\t"
: "=r"(ret)
);
return ret;
}
What are the %1, %2, and %3? what does (%1,%2,%3) mean? Also what does "+r", "-r", "=r" mean?
You'll want to have a look at this GCC Inline Asssembly HOWTO.
The percent sign numbers are the instruction operands.
The inline assembler works similar to a macro preprocessor. Operands with exactly one leading percent are replaced by the the input parameters in the order as they appear in the parameter list, in this case:
%0 h output, register, r/w
%1 blk1 output, register, r/w
%2 blk2 output, register, r/w
%3 (x86_reg)stride input, register, read only
The parameters are normal C expressions. They can be further specified by "constraints", in this case "r" means the value should be in a register, opposed to "m" which is a memory operand. The constraint modifier "=r" makes this a write-only operand, "+r" is a read-write operand and "r" and normal read operand.
After the first colon the output operands appear, after the second the input operands and after the optional third the clobbered registers.
So the instruction sequence calculates the sum of the absolute differences in each byte of blk1 and blk2. This happens in 16 byte blocks, so if stride is 16, the blocks are contiguous, otherwise there are holes. Each instruction appears twice because some minimal loop unrolling is done, the h parameter is the number of 32 byte blocks to process. The second asm block seems to be useless, as the psadbw instruction sums up only in the low 16 bit of the destination register. (Did you omit some code?)