How to load multiple char values in armv7 assembly program? - arm

I am loading multiple char values in an ARMv7 program using the vldm instruction, but all four values load into a single s register; I need to expand these values into the vector register q0.
Please help me. This is my C code:
#include <stdio.h>
#include <stdlib.h>

void sum(void) {
    int sum = 0;
    char *p = malloc(sizeof(char) * 16); /* the loop below writes 16 bytes, so allocate 16, not 10 */
    for (int i = 0; i < 16; ++i) {
        p[i] = i;
        sum += i;
    }
    printf("sum = %d\n", sum);
    free(p);
}

Here is a typical textbook example of loading/storing multiple values between memory and the vector register banks, with general-purpose registers holding the source and destination addresses.
VLDM r1!, {d0-d7}
VSTM r0!, {d0-d7}
If you are using gdb, you can get a better view of a particular bank or group of registers:
(gdb) p $q0
{u8 = {0 <repeats 16 times>}, u16 = {0, 0, 0, 0, 0, 0, 0, 0}, u32 = {0, 0, 0, 0}, u64 = {0, 0}, f32 = {0, 0, 0, 0}, f64 = {0, 0}}

Related

Best way to mask a single bit in AVX2?

For example, given an input ymm vector x and a bit index i, I want an output vector with only the i-th bit kept and everything else zeroed.
With AVX-512 k (mask) registers I could write the following, but AVX2 and below don't have k registers, so what do you think is the best way to do it?
__m512i m512i_maskBit(__m512i x, unsigned i) {
__mmask8 m = _cvtu32_mask8(1u << i / 64);
__m512i vm = _mm512_maskz_set1_epi64(m, 1ull << i % 64);
return _mm512_and_si512(x, vm);
}
Here is an approach using variable shifts (just creating the mask):
__m256i create_mask(unsigned i) {
    __m256i ii = _mm256_set1_epi32(i);
    ii = _mm256_sub_epi32(ii, _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224));
    __m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), ii);
    return mask;
}
_mm256_sllv_epi32 (vpsllvd) was introduced with AVX2; it shifts each 32-bit element by a per-element, variable number of bits. If the (unsigned) shift amount is greater than 31 (which also covers shift amounts that are negative when interpreted as signed), the corresponding result element is 0.
Godbolt link with small test code: https://godbolt.org/z/a5xfqTcGs
How about the simplest approach:
__m256i m256i_create_mask(unsigned i) {
    // Get the required bit into every byte of the vector
    __m256i vm = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(1u << (i & 7u)));
    // Mask off the bytes that don't match the byte index
    __m256i vi = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(i >> 3u));
    __m256i vm1 = _mm256_cmpeq_epi8(vi,
        _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                         16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31));
    return _mm256_and_si256(vm, vm1);
}
Here’s another approach. Not sure it’s necessarily better, it depends on CPU model and surrounding code, but it might be.
#include <array>
#include <cstdint>
#include <immintrin.h>

// A buffer used to load a vector with `1` in exactly one 32-bit lane
alignas( 64 ) static const std::array<int, 16> s_oneBuffer =
{
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0
};

__m256i maskSingleBit( __m256i x, uint32_t bitIndex )
{
    // Load `1` into a single 32-bit lane of the vector.
    // The buffer is aligned to 64 bytes and contained in a single cache line, so there is no unaligned-load penalty.
    __m256i one = _mm256_loadu_si256( ( const __m256i* )( ( s_oneBuffer.data() + 8 ) - ( bitIndex / 32 ) ) );
    // Left-shift to move the `1` into the correct bit position within its lane
    __m128i shift = _mm_cvtsi32_si128( bitIndex % 32 );
    __m256i bit = _mm256_sll_epi32( one, shift );
    // Bitwise AND with the input value
    return _mm256_and_si256( x, bit );
}

exact control flow with int assembler instruction in C, and the resulting segfault

Consider this totally stupid code:
int main() { __asm__("int $0x2"); }
This causes a segfault when run. Vector 2 is the NMI entry in Intel's IDT (Section 6.3.1 here).
I am curious why this segfaults, though. What exactly is the control flow that eventually causes the segfault?
Also pasting section 6.3.3 of the manual here:
6.3.3 Software-Generated Interrupts
The INT n instruction permits interrupts to be generated from within software by supplying an interrupt vector
number as an operand. For example, the INT 35 instruction forces an implicit call to the interrupt handler for interrupt 35.
Any of the interrupt vectors from 0 to 255 can be used as a parameter in this instruction. If the processor’s
predefined NMI vector is used, however, the response of the processor will not be the same as it would be from an
NMI interrupt generated in the normal manner. If vector number 2 (the NMI vector) is used in this instruction, the
NMI interrupt handler is called, but the processor’s NMI-handling hardware is not activated.
Interrupts generated in software with the INT n instruction cannot be masked by the IF flag in the EFLAGS register.
The gate in the IDT contains a descriptor privilege level (DPL), which is the largest (i.e. least privileged) caller privilege level (CPL) permitted to invoke that entry. A real NMI, which is caused by an electrical signal to the CPU, acts with an effective CPL of 0. In this way, the kernel does not have to differentiate between real signals and fake ones.
System services invoked via an int xx have a numerically larger DPL so that the instruction is permitted to open the gate. Depending upon your kernel, it is possible that vectors 3 (breakpoint), 4 (overflow), and 5 (bounds) also work from user mode to facilitate debugging, via the dedicated int3, into, and bound opcodes respectively.
You have found a kernel bug. Your program is trying to perform a CPU operation (int 2) that is forbidden to user-space programs, not an invalid memory access. Therefore, it should have been sent a SIGILL (Illegal instruction) signal, not a SIGSEGV signal.
The reason for the bug is probably that this particular forbidden operation is reported to the operating system with a "#GP fault" instead of a "#UD fault" (in the terms used by the x86 architecture manual). #GP faults are also used to report invalid memory accesses, and whoever wrote the code to map that to a signal didn't bother making a distinction between "actual invalid memory access" and "improper use of int reported with #GP". I observe this bug as well, on both Linux and NetBSD, so it must be an easy mistake to make.
When you're debugging a problem involving signals, it is often helpful to establish a signal handler for the troublesome signal, using sigaction with SA_SIGINFO in the flags. When you set SA_SIGINFO, the handler receives two additional arguments that provide detailed information about the signal. You don't have to use those arguments in the signal handler; instead what you do is run the program under a debugger, allow the signal to be delivered, and then inspect the details in the debugger. Here's a modification to your program that does that:
#include <signal.h>
#include <unistd.h>
#include <ucontext.h>

void handler(int s, siginfo_t *si, void *uc)
{
    pause();
}

int main(void)
{
    struct sigaction sa;
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, 0);
    sigaction(SIGFPE, &sa, 0);
    sigaction(SIGILL, &sa, 0);
    sigaction(SIGSEGV, &sa, 0);
    sigaction(SIGSYS, &sa, 0);
    sigaction(SIGTRAP, &sa, 0);
    asm("int $0x2");
}
(The uc argument is a pointer to a ucontext_t, but that type is declared in <ucontext.h>, not <signal.h>, so the spec says you must define the handler to take a third argument of type void * and then cast it if you want to use it.)
I set up the handler for all of the signals corresponding to fatal, synchronous CPU exceptions, because why not. The pause is just to make execution stop indefinitely inside the handler, so I can hit control-C to break into the debugger and the signal frame will be available.
Here's what I get on Linux:
(gdb) bt
#0 0x00007ffff7eb4af4 in __libc_pause ()
at ../sysdeps/unix/sysv/linux/pause.c:29
#1 0x000055555555516d in handler (s=11, si=0x7fffffffd830, uc=0x7fffffffd700)
at test.c:5
#2 <signal handler called>
#3 main () at test.c:14
(gdb) frame 1
#1 0x000055555555516d in handler (s=11, si=0x7fffffffd830, uc=0x7fffffffd700)
at test.c:5
5 pause();
(gdb) p *si
$1 = {si_signo = 11, si_errno = 0, si_code = 128, __pad0 = 0, _sifields = {
_pad = {0 <repeats 28 times>}, _kill = {si_pid = 0, si_uid = 0}, _timer = {
si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 0, si_sigval = {
sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0,
si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x0,
si_addr_lsb = 0, _bounds = {_addr_bnd = {_lower = 0x0, _upper = 0x0},
_pkey = 0}}, _sigpoll = {si_band = 0, si_fd = 0}, _sigsys = {
_call_addr = 0x0, _syscall = 0, _arch = 0}}}
(gdb) p *(ucontext_t *)uc
$2 = {uc_flags = 7, uc_link = 0x0, uc_stack = {ss_sp = 0x0, ss_flags = 0,
ss_size = 0}, uc_mcontext = {gregs = {0, 0, 8, 582, 93824992235632,
140737488346656, 0, 0, 11, 140737488345936, 140737488346432, 0, 0, 0,
140737352200658, 140737488346272, 93824992235964, 66050,
12103423998558259, 18, 13, 0, 0}, fpregs = 0x7fffffffd8c0,
__reserved1 = {0, 1, 140737354129808, 140737488345320, 140737353799024,
140737354129808, 8455580781, 140737354130672}}, uc_sigmask = {__val = {
0, 11, 128, 0 <repeats 13 times>}}, __fpregs_mem = {cwd = 0, swd = 0,
ftw = 0, fop = 0, rip = 140737488346656, rdp = 0, mxcsr = 895,
mxcr_mask = 0, _st = {{significand = {0, 0, 0, 0}, exponent = 0,
__glibc_reserved1 = {0, 0, 0}}, {significand = {8064, 0, 65535, 0},
exponent = 0, __glibc_reserved1 = {0, 0, 0}}, {significand = {0, 0, 0,
0}, exponent = 0, __glibc_reserved1 = {0, 0, 0}}, {significand = {0,
0, 0, 0}, exponent = 0, __glibc_reserved1 = {0, 0, 0}}, {
significand = {0, 0, 0, 0}, exponent = 0, __glibc_reserved1 = {0, 0,
0}}, {significand = {0, 0, 0, 0}, exponent = 0, __glibc_reserved1 = {
0, 0, 0}}, {significand = {0, 0, 0, 0}, exponent = 0,
__glibc_reserved1 = {0, 0, 0}}, {significand = {0, 0, 0, 0},
exponent = 0, __glibc_reserved1 = {0, 0, 0}}}, _xmm = {{element = {0,
0, 0, 0}} <repeats 16 times>}, __glibc_reserved1 = {
0 <repeats 18 times>, 1179670611, 836, 7, 0, 832, 0}}, __ssp = {0, 0, 0,
3}}
The siginfo_t structure is basically useless; it has si_code == 128, which means "this signal was generated by the kernel but we're not going to tell you anything else about it," and all the other fields are zero. I consider this to be another kernel bug.
The ucontext_t structure is more useful; in particular
(gdb) p/x ((ucontext_t *)uc)->uc_mcontext.gregs[REG_RIP]
$3 = 0x5555555551bc
This is the address of the instruction that caused the signal. If I disassemble main...
(gdb) disas main
...
0x00005555555551b7: callq 0x555555555030 <sigaction@plt>
0x00005555555551bc: int $0x2
0x00005555555551be: mov $0x0,%eax
0x00005555555551c3: leaveq
0x00005555555551c4: retq
... I see that the instruction that caused the signal is indeed the int $0x2.
On NetBSD I get something slightly different:
(gdb) p *si
$1 = { si_pad = "[garbage]", _info = {
_signo = 11, _code = 2, _errno = 0, _pad = 0, _reason = {
_rt = {_pid = -146410395, _uid = 32639, _value = {sival_int = 4,
sival_ptr = 0x4}}, _child = {_pid = -146410395, _uid = 32639,
_status = 4, _utime = 0, _stime = 0}, _fault = {
_addr = 0x7f7ff745f465 <__sigemptyset14>, _trap = 4, _trap2 = 0,
_trap3 = 0}, _poll = {_band = 140187586131045, _fd = 4}}}}
This siginfo_t has actually been filled out. si_code 2 for a SIGSEGV is SEGV_ACCERR ("Invalid permissions for mapped object") which is not nonsense. There is not enough information in the headers or the manpages for me to understand what _trap = 4 means, or why _addr is pointing to an address somewhere inside the C library, and I don't feel like source-diving the NetBSD kernel today. ;-)
Also for reasons I don't feel like investigating today, gdb on NetBSD doesn't have access to the definition of ucontext_t (even though I explicitly included ucontext.h) so I had to dump it out raw:
(gdb) p *(ucontext_t *)uc
No symbol "ucontext_t" in current context.
(gdb) x/40xg uc
0x7f7fffffd7b0: 0x00000000000a000d 0x0000000000000000
0x7f7fffffd7c0: 0x0000000000000000 0x0000000000000000
0x7f7fffffd7d0: 0x0000000000000000 0x0000000000000000
0x7f7fffffd7e0: 0x0000000000000000 0x0000000000000005
0x7f7fffffd7f0: 0x00007f7fffffdb50 0x0000000000000000
0x7f7fffffd800: 0x00007f7ff7483a0a 0x0000000000000002
0x7f7fffffd810: 0x000000000000000d 0x00007f7ff749f340
0x7f7fffffd820: 0x0000000000000246 0x00007f7fffffdb90
0x7f7fffffd830: 0x00007f7ffffffdea 0x00007f7ff511a4c0
0x7f7fffffd840: 0x00007f7ffffffdea 0x00007f7fffffdb70
0x7f7fffffd850: 0x00007f7fffffffe0 0x0000000000000000
0x7f7fffffd860: 0x0000000000000000 0x0000000000000000
0x7f7fffffd870: 0x000000000000003f 0x00007f7ff748003f
0x7f7fffffd880: 0x0000000000000004 0x0000000000000012
0x7f7fffffd890: 0x0000000000400af5 0x000000000000e033 <---
0x7f7fffffd8a0: 0x0000000000010246 0x00007f7fffffdb50
0x7f7fffffd8b0: 0x000000000000e02b 0x00007f7ff7ffd0c0
0x7f7fffffd8c0: 0x000000000000037f 0x0000000000000000
0x7f7fffffd8d0: 0x0000000000000000 0x0000ffbf00001f80
0x7f7fffffd8e0: 0x0000000000000000 0x0000000000000000
(gdb) disas main
Dump of assembler code for function main:
...
0x0000000000400af0 <+166>: callq 0x400810 <__sigaction14@plt>
0x0000000000400af5 <+171>: int $0x2
0x0000000000400af7 <+173>: leaveq
0x0000000000400af8 <+174>: retq
The only address within the memory region pointed to by uc that bears any correspondence with the text of the program is 0x0000000000400af5, which is, again, the address of the int instruction.

Arduino Binary Array is too Large

I have a three-dimensional array of binary values that I use as a dictionary and display on an LED array. The dictionary covers 27 letters, and each letter is a 30x30-pixel bitmap (where each pixel is a 0 or a 1).
I was using the Intel Edison - and the code worked well - but I ditched the Edison after having trouble connecting it to my PC (despite replacing it once). I switched to the Arduino Uno, but am now receiving an error that the array is too large.
Right now I have the array declared as boolean. Is there any way to reduce the memory demands of the array by storing it as bits instead? The array consists of only zeros and ones.
Here's a snip of the code:
boolean PHDict[27][30][30] = {
/* A */ {{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, /* this is one column of thirty, that show "A" as a letter */
You could write it as
#include <stdint.h>
//...
uint32_t PHdict[27][30] = {
    { 0x00004000, ... },
    ....
};
where each entry packs one 30-pixel column into the low 30 bits of a 32-bit number.
The size is then 27 x 30 x 4 = 3240 bytes, under 4 KB (versus 24300 bytes for the boolean version).
You would need a bit of code to unpack the bits when reading the array, and a way to generate the packed values (i.e., a program that runs on your "host" computer and emits the initialized array as source code).
For the AVR processor, there is also a way to tell the compiler you want the array stored in program memory (flash) instead of data memory (RAM). If the array lives in RAM, the compiler has to keep the initialization data in flash anyway and copy it over before the program starts, so it's a good idea to explicitly store it in flash. See https://gcc.gnu.org/onlinedocs/gcc/AVR-Variable-Attributes.html#AVR-Variable-Attributes
In fact, depending on the amount of flash memory in the processor, changing it to PM may be sufficient to solve the problem, without needing to pack the bits.

0xFFFF flags in SSE

I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example:
__m128i regComp = _mm_cmpgt_epi16(regA, regB);
For the sake of argument, lets assume that regComp was loaded with { 0, 0xFFFF, 0, 0xFFFF }. I would like to convert this into say { 0, 80, 0, 80 }.
What I had in mind was to create an array of integers initialized to 80, load it into a register regC, do a _mm_and_si128 between regC and regComp, and store the result in regD. However, this does not do the trick, which led me to think that I do not understand the all-ones flags in SSE registers. Could someone answer the question with a brief explanation of why my solution does not work?
short valA[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
short valB[16] = { 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10 };
short ones[16] = { 1 };
short final[16];
__m128i vA, vB, vOnes, vRes, vRes2;

vOnes = _mm_load_si128((__m128i *)&ones[0]);
for (i = 0; i < 16; i += 8) {
    vA = _mm_load_si128((__m128i *)&valA[i]);
    vB = _mm_load_si128((__m128i *)&valB[i]);
    vRes = _mm_cmpgt_epi16(vA, vB);
    vRes2 = _mm_and_si128(vRes, vOnes);
    _mm_storeu_si128((__m128i *)&final[i], vRes2);
}
You only set the first element of array ones to 1 (the rest of the array is initialised to 0).
I suggest you get rid of the array ones altogether and then change this line:
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
to:
vOnes = _mm_set1_epi16(1);
Probably a better solution though, if you just want to convert SIMD TRUE (0xffff) results to 1, would be to use a shift:
for (i = 0; i < 16; i += 8) {
    vA = _mm_loadu_si128((__m128i *)&pA[i]);
    vB = _mm_loadu_si128((__m128i *)&pB[i]);
    vRes = _mm_cmpgt_epi16(vA, vB);  // generate 0xffff/0x0000 results
    vRes = _mm_srli_epi16(vRes, 15); // convert to 1/0 results
    _mm_storeu_si128((__m128i *)&final[i], vRes);
}
Try this for loading 1:
vOnes = _mm_set1_epi16(1);
This is shorter than creating a constant array.
Be careful: in C++, providing fewer initializers than the array size zero-initializes the remaining elements. That was your error, not the SSE part.
Don't forget the debugger, modern ones display SSE variables properly.

How to view a pointer like an array in GDB?

Suppose int a[100] is defined. Typing print a makes gdb automatically display it as an array: 1, 2, 3, 4.... However, if a is passed to a function as a parameter, gdb treats it as an ordinary int pointer, and print a displays (int *) 0x7fffffffdaa0. What should I do if I want to view a as an array?
See here. In short you should do:
p *array@len
Alternatively, use *(T (*)[N])p, where T is the element type, N is the number of elements, and p is the pointer.
Use the x command.
(gdb) x/100w a
How to view or print any number of bytes from any array in any printf-style format using the gdb debugger
As @Ivaylo Strandjev says here, the general syntax is:
print *my_array@len
# OR the shorter version:
p *my_array@len
Example to print the first 10 bytes from my_array:
print *my_array@10
[Recommended!] Custom printf-style print formatting: however, if the commands above look like garbage because gdb tries to interpret the values as chars, you can force different formatting options like this:
print/x *my_array@10 = hex
print/d *my_array@10 = signed integer
print/u *my_array@10 = unsigned integer
print/<format> *my_array@10 = print according to the general printf()-style format string, <format>
Here are some real examples from my debugger, printing 16 bytes from a uint8_t array named byteArray. Notice how ugly the first one is, with just p *byteArray@16:
(gdb) p *byteArray@16
$4 = "\000\001\002\003\004\005\006\a\370\371\372\373\374\375\376\377"
(gdb) print/x *byteArray@16
$5 = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff}
(gdb) print/d *byteArray@16
$6 = {0, 1, 2, 3, 4, 5, 6, 7, -8, -7, -6, -5, -4, -3, -2, -1}
(gdb) print/u *byteArray@16
$7 = {0, 1, 2, 3, 4, 5, 6, 7, 248, 249, 250, 251, 252, 253, 254, 255}
In my case, the best version, with the representation I want to see, is the last one, where I print the array as unsigned integers using print/u, since it is a uint8_t unsigned integer array after all:
(gdb) print/u *byteArray@16
$7 = {0, 1, 2, 3, 4, 5, 6, 7, 248, 249, 250, 251, 252, 253, 254, 255}
(int[100])*pointer worked for me, thanks to the suggestion in the comments by @Ruslan.
