I am using the openwifi project and am trying to extract data from the FPGA (Zedboard) to the ZYNQ. To do this, I added an AXI DMA in scatter-gather mode to my block diagram with the following configuration:
The addresses are:
As far as I know, everything should be correctly configured and connected in Vivado. Because of the additional AXI DMA, I had to add it to the linux device tree:
dma@80420000 {
#dma-cells = <0x01>;
clock-names = "s_axi_lite_aclk\0m_axi_sg_aclk\0m_axi_mm2s_aclk\0m_axi_s2mm_aclk";
clocks = <0x02 0x11 0x02 0x11 0x02 0x11 0x02 0x11>;
compatible = "xlnx,axi-dma-1.00.a";
interrupt-names = "mm2s_introut\0s2mm_introut";
interrupt-parent = <0x01>;
interrupts = <0x00 0x27 0x04 0x00 0x28 0x04>;
reg = <0x80420000 0x10000>;
xlnx,addrwidth = <0x20>;
xlnx,include-sg ;
xlnx,sg-length-width = <0x0e>;
phandle = <0x0c>;
dma-channel@80420000 {
compatible = "xlnx,axi-dma-mm2s-channel";
dma-channels = <0x1>;
interrupts = <0x00 0x44 0x04>;
xlnx,datawidth = <0x40>;
xlnx,device-id = <0x2>;
};
dma-channel@80420030 {
compatible = "xlnx,axi-dma-s2mm-channel";
dma-channels = <0x1>;
interrupts = <0x00 0x45 0x04>;
xlnx,datawidth = <0x40>;
xlnx,device-id = <0x2>;
};
};
Two AXI DMAs were already configured by openwifi (dma@80410000 and dma@80400000). The complete modified device tree can be found here. The original tree by openwifi is here (due to their length I can't post them on SO).
My problem is that when I try to load the xilinx_dma driver into Linux, I get the following error:
Unhandled fault: imprecise external abort (0x406) at 0xb6f38000
pgd = df614000
[b6f38000] *pgd=1627f831, *pte=1ece359f, *ppte=1ece3e7e
Internal error: : 406 [#1] PREEMPT SMP ARM
Modules linked in: xilinx_dma(+) ad9361_drv mac80211 cfg80211 ipv6
CPU: 0 PID: 2093 Comm: insmod Not tainted 4.14.0-gb6e379910a11-dirty #2
Hardware name: Xilinx Zynq Platform
task: df7ff280 task.stack: d622c000
PC is at xilinx_dma_chan_reset+0x58/0x27c [xilinx_dma]
LR is at xilinx_dma_chan_reset+0x44/0x27c [xilinx_dma]
pc : [<bf17d600>] lr : [<bf17d5ec>] psr: 60060013
sp : d622dce0 ip : 00000000 fp : 00000001
r10: dfbb5094 r9 : 00000000 r8 : 00000000
r7 : 00000000 r6 : 00000000 r5 : 00000000 r4 : df493410
r3 : dec0de1c r2 : 00000000 r1 : dfb89374 r0 : 00000058
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 18c5387d Table: 1f61404a DAC: 00000051
Process insmod (pid: 2093, stack limit = 0xd622c210)
Stack: (0xd622dce0 to 0xd622e000)
dce0: df493410 d6023010 00000000 00000000 00000000 00000000 dfbb5094 bf17dff4
dd00: bf180dc0 df493410 bf180ca0 bf180cb4 014080c0 d6023018 d6023020 dfbb4d78
dd20: df572610 00000000 bf180c70 bf180c80 df576a50 cf73edc0 00000020 0000000e
dd40: 00000001 00000040 00000001 df572610 ffffffed bf182014 fffffdfb 00000000
dd60: 00000000 00000007 00000028 c03b2a20 df572610 c0c79fa8 c0c79fac bf182014
dd80: 00000000 c03b1240 df572610 bf182014 df572644 00000000 cf73d924 00000001
dda0: 00000000 c03b1394 00000000 bf182014 c03b12f0 c03af7a0 df491f58 df570ab4
ddc0: bf182014 d6ae7000 c0c1f458 c03b07a0 bf1810f8 00000000 bf185000 bf182014
dde0: 00000000 bf185000 d4b80600 c03b1c3c ffffe000 00000000 bf185000 c0101a20
de00: 00000000 00000000 c0c54234 c0c54220 c0994394 00000000 014000c0 e30a4000
de20: dfda5600 c0c54380 00000000 0000f730 c0c54380 c01af084 00000001 a0030013
de40: d4b80600 e3057000 bf1820c0 00000001 cf73d900 d4b80600 cf73d924 c018895c
de60: bf1820c0 cf73d924 00000001 d622df50 00000001 cf73d900 bf1820c0 c01875f0
de80: bf1820cc 00007fff bf1820c0 c0184b7c bf182108 00000000 bf1821c0 c0805304
dea0: bf1821f0 c0a0d880 bf18227c bf1821d4 c0998804 00000000 bf181104 bf000001
dec0: 0004bad4 00000000 d622df48 bf180024 00000001 00000000 00000000 00000000
dee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
df00: 00000000 00000000 7fffffff 00000000 00000003 0046e398 0000017b c0107b24
df20: d622c000 00000000 00000000 c0188008 7fffffff 00000000 00000003 00000000
df40: 00000000 e3057000 0004bad4 00000000 e305c1b7 e3057000 0004bad4 e30a223c
df60: e30a2020 e308d180 00008000 00008310 00000000 00000000 00000000 00002bb0
df80: 00000034 00000035 00000020 0000001b 00000018 00000000 be8646c4 be864844
dfa0: 45194a00 c0107960 be8646c4 be864844 00000003 0046e398 00000000 00000000
dfc0: be8646c4 be864844 45194a00 0000017b 0047adc8 00000000 0046ab60 00000000
dfe0: be864670 be864660 00465177 b6f312b2 80030030 00000003 00000000 00000000
[<bf17d600>] (xilinx_dma_chan_reset [xilinx_dma]) from [<bf17dff4>] (xilinx_dma_probe+0x424/0xb04 [xilinx_dma])
[<bf17dff4>] (xilinx_dma_probe [xilinx_dma]) from [<c03b2a20>] (platform_drv_probe+0x50/0xac)
[<c03b2a20>] (platform_drv_probe) from [<c03b1240>] (driver_probe_device+0x238/0x2e8)
[<c03b1240>] (driver_probe_device) from [<c03b1394>] (__driver_attach+0xa4/0xa8)
[<c03b1394>] (__driver_attach) from [<c03af7a0>] (bus_for_each_dev+0x4c/0x9c)
[<c03af7a0>] (bus_for_each_dev) from [<c03b07a0>] (bus_add_driver+0x188/0x20c)
[<c03b07a0>] (bus_add_driver) from [<c03b1c3c>] (driver_register+0x78/0xf4)
[<c03b1c3c>] (driver_register) from [<c0101a20>] (do_one_initcall+0x44/0x168)
[<c0101a20>] (do_one_initcall) from [<c018895c>] (do_init_module+0x60/0x1f0)
[<c018895c>] (do_init_module) from [<c01875f0>] (load_module+0x1b8c/0x23a0)
[<c01875f0>] (load_module) from [<c0188008>] (SyS_finit_module+0x9c/0xb4)
[<c0188008>] (SyS_finit_module) from [<c0107960>] (ret_fast_syscall+0x0/0x48)
Code: e5933000 e0833005 e5933000 f57ff04f (e5943000)
---[ end trace e3cdb0330d71dd39 ]---
I identified that the error occurs in this function inside xilinx_dma.c:
/* IO accessors */
static inline u32 dma_read(struct xilinx_dma_chan *chan, u32 reg)
{
return ioread32(chan->xdev->regs + reg); // Here
}
So the ioread32 doesn't work. My question is: Why is this and what can I do about it?
If I remove the new AXI DMA from the device tree or change its compatible property to something random it works fine. So I suppose something has to be wrongly configured in the AXI DMA I added.
Here is my simple code:
#include <stdio.h>
int main()
{
const char *str="jigneshparmar";
printf("address of str data:%p , address of str variable:%p\n",(void*)str,(void*)&str );
getchar();
return 0;
}
Here the string constant "jigneshparmar" is stored in the read-only data section, which I checked with the size command.
Here is the output of size:
gcc datasec.c -o datasec
size -A datasec
datasec :
section size addr
.interp 28 792
.note.gnu.property 32 824
.note.gnu.build-id 36 856
.note.ABI-tag 32 892
.gnu.hash 36 928
.dynsym 168 968
.dynstr 133 1136
.gnu.version 14 1270
.gnu.version_r 32 1288
.rela.dyn 192 1320
.rela.plt 24 1512
.init 27 4096
.plt 32 4128
.plt.got 16 4160
.plt.sec 16 4176
.text 405 4192
.fini 13 4600
.rodata 18 8192
.eh_frame_hdr 68 8212
.eh_frame 264 8280
.init_array 8 15800
.fini_array 8 15808
.dynamic 496 15816
.got 72 16312
.data 16 16384
.bss 8 16400
.comment 42 0
Total 2236
The size of .rodata increases when I make the string constant longer.
However, the address of the string data that I printed belongs to the code section:
./datasec
address of str data:0x55f5301f3004 ,address of str variable:0x7ffd0a2b1940
The address 0x55f5301f3004 lies in the code section.
cat /proc/4018/maps
555da289d000-555da289e000 r--p 00000000 103:02 13109134 /root/Desktop/lsp-prac/datasec
555da289e000-555da289f000 r-xp 00001000 103:02 13109134 /root/Desktop/lsp-prac/datasec
555da289f000-555da28a0000 r--p 00002000 103:02 13109134 /root/Desktop/lsp-prac/datasec
555da28a0000-555da28a1000 r--p 00002000 103:02 13109134 /root/Desktop/lsp-prac/datasec
555da28a1000-555da28a2000 rw-p 00003000 103:02 13109134 /root/Desktop/lsp-prac/datasec
555da416c000-555da418d000 rw-p 00000000 00:00 0 [heap]
7f5485c38000-7f5485c5d000 r--p 00000000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485c5d000-7f5485dd5000 r-xp 00025000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485dd5000-7f5485e1f000 r--p 0019d000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485e1f000-7f5485e20000 ---p 001e7000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485e20000-7f5485e23000 r--p 001e7000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485e23000-7f5485e26000 rw-p 001ea000 103:02 9963657 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7f5485e26000-7f5485e2c000 rw-p 00000000 00:00 0
7f5485e41000-7f5485e42000 r--p 00000000 103:02 9963653 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f5485e42000-7f5485e65000 r-xp 00001000 103:02 9963653 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f5485e65000-7f5485e6d000 r--p 00024000 103:02 9963653 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f5485e6e000-7f5485e6f000 r--p 0002c000 103:02 9963653 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f5485e6f000-7f5485e70000 rw-p 0002d000 103:02 9963653 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f5485e70000-7f5485e71000 rw-p 00000000 00:00 0
7ffda6e9c000-7ffda6ebd000 rw-p 00000000 00:00 0 [stack]
7ffda6ed9000-7ffda6edd000 r--p 00000000 00:00 0 [vvar]
7ffda6edd000-7ffda6edf000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
How is this possible?
Thanks in advance.
The Linux kernel "loads" ELF executables by mapping them into memory. This occurs at page granularity, and the offset field (the field after permissions) in /proc/PID/maps describes the offset into the file for each mapped region.
String literals are stored in ELF files in the .rodata section, and executable code in the .text section.
The Linux kernel does not use the section headers to determine what to map. Instead, the ELF file format has a set of program headers that you can see with e.g. readelf -l binary. The relevant ones here are the LOAD ones; these specify what the Linux kernel maps into memory.
For example, here are the two LOAD program headers from GNU Coreutils 8.28 cat on x86-64 (readelf -l /bin/cat):
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000000079d0 0x00000000000079d0 R E 0x200000
LOAD 0x0000000000007a70 0x0000000000207a70 0x0000000000207a70
0x0000000000000650 0x00000000000007f0 RW 0x200000
When the Linux kernel executes an ELF file, it maps the LOAD program headers into memory. See how there is just a read-and-execute segment for file offsets 0x0-0x79d0, and a read-write segment for file offsets 0x7a70-0x80c0 (memory addresses 0x207a70-0x208260), and no "read-only" segment at all?
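The same LOAD program headers can also be read directly from the file, without readelf. The following is only a minimal sketch under stated assumptions: the function name print_load_headers is made up for this illustration, it handles 64-bit ELF files only, and it assumes the file uses the byte order of the machine running the code.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every PT_LOAD program header of a 64-bit ELF file.
   Returns the number of LOAD headers found, or -1 on error.
   Assumes the file uses the host byte order. */
int print_load_headers(const char *path)
{
    FILE *in = fopen(path, "rb");
    if (!in)
        return -1;

    /* The ELF header sits at the very start of the file. */
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, in) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64) {
        fclose(in);
        return -1;
    }

    int count = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        /* Program header i starts at e_phoff + i * e_phentsize. */
        long pos = (long)(eh.e_phoff + (unsigned long)i * eh.e_phentsize);
        if (fseek(in, pos, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, in) != 1) {
            fclose(in);
            return -1;
        }
        if (ph.p_type != PT_LOAD)
            continue;
        printf("LOAD offset 0x%llx vaddr 0x%llx filesz 0x%llx memsz 0x%llx %c%c%c\n",
               (unsigned long long)ph.p_offset,
               (unsigned long long)ph.p_vaddr,
               (unsigned long long)ph.p_filesz,
               (unsigned long long)ph.p_memsz,
               (ph.p_flags & PF_R) ? 'R' : '-',
               (ph.p_flags & PF_W) ? 'W' : '-',
               (ph.p_flags & PF_X) ? 'E' : '-');
        count++;
    }
    fclose(in);
    return count;
}
```

Calling print_load_headers("/bin/cat") from a small main() should list the same LOAD entries that readelf -l reports, with the flags shown in the same R/W/E style.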
Using objdump -d -s /bin/cat, we see (relevant snippets only):
0000000000001ad0 <.text>:
1ad0: 53 push %rbx
1ad1: 48 8d 35 6c 41 00 00 lea 0x416c(%rip),%rsi # 5c44 <_IO_stdin_used@@Base+0x4>
1ad8: ba 05 00 00 00 mov $0x5,%edx
1add: 31 ff xor %edi,%edi
1adf: e8 3c fd ff ff callq 1820 <dcgettext@plt>
1ae4: 48 89 c3 mov %rax,%rbx
1ae7: e8 a4 fc ff ff callq 1790 <__errno_location@plt>
[snipped lots of disassembly]
5c16: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
5c1d: 00 00 00
5c20: 31 d2 xor %edx,%edx
5c22: 31 f6 xor %esi,%esi
5c24: e9 17 be ff ff jmpq 1a40 <__cxa_atexit@plt>
and
Contents of section .rodata:
5c40 01000200 77726974 65206572 726f7200 ....write error.
5c50 63617400 5b007465 73742069 6e766f63 cat.[.test invoc
5c60 6174696f 6e004d75 6c74692d 63616c6c ation.Multi-call
5c70 20696e76 6f636174 696f6e00 73686132 invocation.sha2
5c80 32347375 6d007368 61322075 74696c69 24sum.sha2 utili
5c90 74696573 00736861 32353673 756d0073 ties.sha256sum.s
5ca0 68613338 3473756d 00736861 35313273 ha384sum.sha512s
5cb0 756d000a 2573206f 6e6c696e 65206865 um..%s online he
5cc0 6c703a20 3c25733e 0a00474e 5520636f lp: <%s>..GNU co
5cd0 72657574 696c7300 656e5f00 2f757372 reutils.en_./usr
5ce0 2f736861 72652f6c 6f63616c 65005269 /share/locale.Ri
5cf0 63686172 64204d2e 20537461 6c6c6d61 chard M. Stallma
5d00 6e00546f 72626a6f 726e2047 72616e6c n.Torbjorn Granl
5d10 756e6400 62656e73 74757641 45540073 und.benstuvAET.s
5d20 74616e64 61726420 6f757470 75740025 tandard output.%
and
Contents of section .init_array:
207a70 10280000 00000000 .(......
Contents of section .fini_array:
207a78 d0270000 00000000 .'......
Contents of section .data.rel.ro:
207a80 7a5d0000 00000000 00000000 00000000 z]..............
207a90 00000000 00000000 62000000 00000000 ........b.......
207aa0 8a5d0000 00000000 00000000 00000000 .]..............
[...]
207c00 e75c0000 00000000 27630000 00000000 .\......'c......
207c10 00000000 00000000 ........
Contents of section .dynamic:
207c18 01000000 00000000 01000000 00000000 ................
207c28 0c000000 00000000 20170000 00000000 ........ .......
207c38 0d000000 00000000 2c5c0000 00000000 ........,\......
[...]
207de8 00000000 00000000 00000000 00000000 ................
207df8 00000000 00000000 00000000 00000000 ................
Contents of section .got:
207e08 187c2000 00000000 00000000 00000000 .| .............
207e18 00000000 00000000 56170000 00000000 ........V.......
[...]
207fe8 00000000 00000000 00000000 00000000 ................
207ff8 00000000 00000000 ........
Contents of section .data:
208000 00000000 00000000 08802000 00000000 .......... .....
208010 20202020 20202020 20202020 20202020
208020 20300900 00000000 21802000 00000000 0......!. .....
208030 1c802000 00000000 7b620000 00000000 .. .....{b......
See how both .text and .rodata belong to the same LOAD program header? That's why they're mapped the same way. The .data section is a separate LOAD program header, and is therefore mapped separately, with different permissions.
You see, the linker script used on x86-64 on most Linux systems (including Ubuntu 18.04.5, where the above is from) combines the .text and .rodata sections into a single LOAD program header; and since this is what controls how ELF executables are loaded (mapped) into memory by the Linux kernel, they get mapped into the same memory region, with the same permissions (r-xp).
Consider the following example program, maps.c:
// SPDX-License-Identifier: CC0-1.0
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
struct map_entry {
struct map_entry *next;
uintptr_t addr;
uintptr_t ends;
char line[];
};
struct map_entry *map = NULL;
static struct map_entry *map_find(const void *const ptr)
{
const uintptr_t addr = (uintptr_t)ptr;
struct map_entry *curr = map;
while (curr)
if (addr >= curr->addr && addr <= curr->ends)
return curr;
else
curr = curr->next;
return NULL;
}
static void map_init(void)
{
char *line = NULL;
size_t size = 0;
ssize_t len;
FILE *in;
/* Already mapped? */
if (map)
return;
struct map_entry *root = NULL;
in = fopen("/proc/self/maps", "r");
if (!in) {
fprintf(stderr, "Cannot read /proc/self/maps: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
while (1) {
len = getline(&line, &size, in);
if (len < 0)
break;
/* Remove newline at end. */
while (len > 0 && line[len-1] == '\n')
line[--len] = '\0';
char *ptr = line;
char *end = line;
unsigned long long val;
/* Parse start address. */
errno = 0;
val = strtoull(ptr, &end, 16);
if (errno) {
fprintf(stderr, "/proc/self/maps: %s: %s.\n", line, strerror(errno));
exit(EXIT_FAILURE);
}
if (end == ptr || *end != '-') {
fprintf(stderr, "/proc/self/maps: %s: Error parsing line.\n", line);
exit(EXIT_FAILURE);
}
ptr = ++end;
const uintptr_t addr = val;
/* Parse end address (actually one plus end address). */
errno = 0;
val = strtoull(ptr, &end, 16);
if (errno) {
fprintf(stderr, "/proc/self/maps: %s: %s.\n", line, strerror(errno));
exit(EXIT_FAILURE);
}
if (end == ptr || *end != ' ') {
fprintf(stderr, "/proc/self/maps: %s: Error parsing line.\n", line);
exit(EXIT_FAILURE);
}
const uintptr_t ends = val;
/* Allocate a new map entry for this one. */
struct map_entry *ent = malloc(sizeof (struct map_entry) + len + 1);
if (!ent) {
fprintf(stderr, "/proc/self/maps: Out of memory.\n");
exit(EXIT_FAILURE);
}
/* Copy line, including the end-of-string '\0'. */
memcpy(ent->line, line, len + 1);
ent->addr = addr;
ent->ends = ends - 1;
/* Prepend to root list. */
ent->next = root;
root = ent;
}
/* Discard line buffer, since it is no longer needed. */
free(line); /* Note: free(NULL) is safe, and does nothing. */
if (ferror(in) || !feof(in)) {
fprintf(stderr, "/proc/self/maps: Read error.\n");
exit(EXIT_FAILURE);
}
if (fclose(in)) {
fprintf(stderr, "/proc/self/maps: Error closing file.\n");
exit(EXIT_FAILURE);
}
/* Reverse the list. Since we prepended each entry, it is in reverse order. */
while (root) {
struct map_entry *curr = root;
root = root->next;
/* Prepend to map list. */
curr->next = map;
map = curr;
}
}
const char *const literal1 = "String literal 1";
const char array1[] = "String array 1";
int main(void)
{
const char *const literal2 = "String literal 2";
const char array2[] = "String array 2";
struct map_entry *ent;
map_init();
ent = map_find(&literal1);
if (ent)
printf("Variable 'literal1' has address %p:\n\t%s\n", (void *)&literal1, ent->line);
ent = map_find(literal1);
if (ent)
printf("Variable 'literal1' points to address %p:\n\t%s\n", (void *)literal1, ent->line);
ent = map_find(&array1);
if (ent)
printf("Variable 'array1' has address %p:\n\t%s\n", (void *)&array1, ent->line);
ent = map_find(array1);
if (ent)
printf("Variable 'array1' points to address %p:\n\t%s\n", (void *)array1, ent->line);
ent = map_find(&literal2);
if (ent)
printf("Variable 'literal2' has address %p:\n\t%s\n", (void *)&literal2, ent->line);
ent = map_find(literal2);
if (ent)
printf("Variable 'literal2' points to address %p:\n\t%s\n", (void *)literal2, ent->line);
ent = map_find(&array2);
if (ent)
printf("Variable 'array2' has address %p:\n\t%s\n", (void *)&array2, ent->line);
ent = map_find(array2);
if (ent)
printf("Variable 'array2' points to address %p:\n\t%s\n", (void *)array2, ent->line);
return EXIT_SUCCESS;
}
Compile it using gcc -Wall -Wextra -O2 maps.c -o maps, and run ./maps. Its output is
Variable 'literal1' has address 0x5651567add48:
5651567ad000-5651567ae000 r--p 00001000 fd:03 6953200 /home/glaerbo/kildekode/maps/maps
Variable 'literal1' points to address 0x5651565ad21f:
5651565ac000-5651565ae000 r-xp 00000000 fd:03 6953200 /home/glaerbo/kildekode/maps/maps
Variable 'array1' has address 0x5651565ad448:
5651565ac000-5651565ae000 r-xp 00000000 fd:03 6953200 /home/glaerbo/kildekode/maps/maps
Variable 'array1' points to address 0x5651565ad448:
5651565ac000-5651565ae000 r-xp 00000000 fd:03 6953200 /home/glaerbo/kildekode/maps/maps
Variable 'literal2' has address 0x7fff34c15dd8:
7fff34bf7000-7fff34c18000 rw-p 00000000 00:00 0 [stack]
Variable 'literal2' points to address 0x5651565ad1c4:
5651565ac000-5651565ae000 r-xp 00000000 fd:03 6953200 /home/glaerbo/kildekode/maps/maps
Variable 'array2' has address 0x7fff34c15df9:
7fff34bf7000-7fff34c18000 rw-p 00000000 00:00 0 [stack]
Variable 'array2' points to address 0x7fff34c15df9:
7fff34bf7000-7fff34c18000 rw-p 00000000 00:00 0 [stack]
which shows how literal1 (const char *const literal1 = "...";) resides in a memory region that is mapped r--p, but points to memory that is mapped r-xp.
("Hey, I thought you said the kernel only maps the LOAD program headers?" Yes; that particular mapping was not created by the kernel, but by the dynamic linker. I did not say that only the kernel maps ELF executables into memory; I explained how the kernel maps the minimum necessary LOAD program headers into memory and hands off the execution to that code. For dynamically linked C programs using standard C libraries, that code maps the rest of the program sections and any prerequisite dynamic libraries.)
Note, however, that the immutable char array array1 is completely in the r-xp mapped memory region, as is the string literal that literal2 refers to.
Because array2 and literal2 are declared in the main() function, the variables themselves reside in the [stack] memory region.
One of two things is happening:
You're not printing the address of the string literal correctly;
You're not looking at your mapping correctly.
As I mentioned in my comment, %p expects its corresponding argument to have type void *, and a call to printf is one of the few (perhaps the only) places in C where you have to explicitly cast a pointer to void *, so it's possible the address for the string literal isn't being formatted correctly.
Otherwise, you're not looking at your mapping correctly.
I took your code and built it on my system. When I run it I get the output
address of str data:0x400580 , address of str variable:0x7fff22b9e938
You can use the objdump utility to look at the sections of your executable file - to look at the contents of .rodata, do the following:
objdump -s -j .rodata file
When I do that on the code I built, I get
Contents of section .rodata:
400570 01000200 00000000 00000000 00000000 ................
400580 6a69676e 65736870 61726d61 72000000 jigneshparmar...
400590 61646472 65737320 6f662073 74722064 address of str d
4005a0 6174613a 2570202c 20616464 72657373 ata:%p , address
4005b0 206f6620 73747220 76617269 61626c65 of str variable
4005c0 3a25700a 00 :%p..
which matches the output from the program.
Say I have 2 binary inputs named IN and MASK. Actual field size could be 32 to 256 bits depending on what instruction set is used to accomplish the task. Both inputs change every call.
Inputs:
IN = ...1100010010010100...
MASK = ...0001111010111011...
Output:
OUT = ...0001111010111000...
edit: another example result from some comment discussion
IN = ...11111110011010110...
MASK = ...01011011001111110...
Output:
OUT = ...01011011001111110...
I want to get the contiguous adjacent 1 bits of MASK that a 1 bit of IN is within. (Is there a general term for this kind of operation? Maybe I'm not phrasing my searches properly.) I'm trying to find a way to do this that is a bit faster. I'm open to using any x86 or x86 SIMD extensions that can get this done in minimum cpu cycles. A wider data type SIMD is preferred as it will allow me to process more data at once.
The best naive solution I've come up with is the following pseudocode, which manually shifts left until there are no more matching bits, then repeats shifting right:
// (using the variables above)
testL = testR = OUT = (IN & MASK);
LoopL:
testL = (testL << 1) & MASK;
if (testL != 0) {
OUT = OUT | testL;
goto LoopL;
}
LoopR:
testR = (testR >> 1) & MASK;
if (testR != 0) {
OUT = OUT | testR;
goto LoopR;
}
return OUT;
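For reference, the pseudocode above can be written as a plain C function operating on one 64-bit word. This is only a sketch of the naive approach; the name select_runs_naive is made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical naive reference: starting from the bits of IN that fall inside
   MASK, spread left and then right until no further MASK bits are reached. */
uint64_t select_runs_naive(uint64_t in, uint64_t mask)
{
    const uint64_t seed = in & mask;
    uint64_t out = seed;

    /* Spread towards the most significant bit. */
    for (uint64_t t = (seed << 1) & mask; t != 0; t = (t << 1) & mask)
        out |= t;

    /* Spread towards the least significant bit. */
    for (uint64_t t = (seed >> 1) & mask; t != 0; t = (t >> 1) & mask)
        out |= t;

    return out;
}
```

With the first example, select_runs_naive(0xC494, 0x1EBB) yields 0x1EB8, which is the bit pattern 0001111010111000 shown as OUT. The number of loop iterations depends on the length of the longest selected run, so the running time is data dependent.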
I guess @fuz's comment was on the right track.
The following example shows how the SSE and AVX2 code below works.
The algorithm starts with IN_reduced = IN & MASK because we are not interested
in IN bits at positions where MASK is 0.
IN = . . . 0 0 0 0 . . . . p q r s . . .
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = IN & MASK = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
If any of the p q r s bits is 1, then IN_reduced + MASK has a carry bit 1
at position X, which is immediately to the left of the
requested contiguous bits.
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
IN_reduced + MASK = . . 0 1 1 1 1 . . . 1 . . . . . .
X
(IN_reduced + MASK) >>1 = . . . 0 1 1 1 1 . . . 1 . . . . . .
With >> 1 this carry bit 1 is shifted to the same column as bit p
(the first bit of the contiguous bits).
Now, (IN_reduced + MASK) >>1 is actually an average of IN_reduced and MASK.
In order to avoid possible overflow of addition we use the following
average: avg(a, b) = (a & b) + ((a ^ b) >> 1) (See @Harold's comment,
see also here and here.)
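As a scalar illustration of the averaging formula above, here is a small sketch for 64-bit unsigned integers. The function name avg_floor_u64 is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: floor((a + b) / 2) without forming the sum a + b,
   which could overflow 64 bits. */
uint64_t avg_floor_u64(uint64_t a, uint64_t b)
{
    /* Bits set in both operands (a & b) contribute their full value;
       bits set in exactly one operand (a ^ b) contribute half. */
    return (a & b) + ((a ^ b) >> 1);
}
```

For example, avg_floor_u64(UINT64_MAX, UINT64_MAX) returns UINT64_MAX, whereas (a + b) >> 1 computed in 64 bits would lose the top carry bit, which is exactly the bit the algorithm needs.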
With average = avg(IN_reduced, MASK) we get
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
average = . . . 0 1 1 1 1 . . . 1 . . . . . .
MASK >> 1 = . . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 .
leading_bits = (~(MASK>>1))&average = . . . 0 0 0 0 0 . . . 1 0 0 0 0 . .
We can isolate the leading carry bits with
leading_bits = (~(MASK>>1) ) & average because MASK>>1 is zero at the positions
of the carry bits
that we are interested in.
With normal addition the carry propagates from right to left. Here we use a
reverse addition: with a carry from left to right.
Reverse adding MASK and leading_bits:
rev_added = bit_swap(bit_swap(MASK) + bit_swap(leading_bits)),
where bit_swap reverses the bit order. This zeros the bits at
the wanted positions.
With OUT = (~rev_added) & MASK we get the result.
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
leading_bits = . . . 0 0 0 0 0 . . . 1 0 0 0 0 . .
rev_added (MASK,leading_bits) = . . . 1 1 1 1 0 . . . 0 0 0 0 1 . .
OUT = ~rev_added & MASK = . . 0 0 0 0 0 0 . . . 1 1 1 1 0 . .
The algorithm was not thoroughly tested, but the output looks ok.
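The steps above can also be sketched in scalar C for a single 64-bit word, which may help when checking the SIMD versions. This is an illustration only: the names bitreverse64 and select_runs_u64 are made up here, and bitreverse64 plays the role of bit_swap in the description.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the order of the 64 bits of x
   by swapping progressively larger groups of bits. */
uint64_t bitreverse64(uint64_t x)
{
    x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
    x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Hypothetical scalar version of the branchless algorithm described above. */
uint64_t select_runs_u64(uint64_t in, uint64_t mask)
{
    uint64_t in_reduced = in & mask;

    /* Overflow-free average of in_reduced and mask; since in_reduced is a
       subset of mask, (in_reduced & mask) is just in_reduced. */
    uint64_t average = in_reduced + ((in_reduced ^ mask) >> 1);

    /* Keep only the carry bits, which land on the top bit of each hit run. */
    uint64_t leading_bits = ~(mask >> 1) & average;

    /* Add with the carry travelling from left to right. */
    uint64_t rev_added = bitreverse64(bitreverse64(mask) + bitreverse64(leading_bits));

    return ~rev_added & mask;
}
```

With the inputs from the question, select_runs_u64(0xC494, 0x1EBB) gives 0x1EB8 and select_runs_u64(0x1FCD6, 0xB67E) gives 0xB67E. The scalar bit reversal used here is a generic swap-based routine, not the vpshufb lookup used in the SIMD code.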
The code block below contains two separate programs:
the upper half is the SSE code,
and the lower half is the AVX2 code
(they share one block to avoid
bloating the answer with two large code blocks).
The SSE algorithm works with 2 x 64-bit elements and the AVX2 version works with 4 x 64-bit elements.
With gcc 9.1, the algorithm compiles to about 29 instructions,
aside from 4 vmovdqa-s for loading some constants, which are likely
hoisted out of the loop in a real world application (after inlining).
These 29 instructions are a good mix of 9 shuffles (vpshufb) that execute
on port 5 (p5) on Intel Skylake, and many other instructions that often may
execute on p0, p1 or p5.
Therefore, a performance of about 3 instructions per cycle might be possible.
In that case the throughput would be about 1 function call (inlined)
per 10 cycles. In the AVX2 case this means 4 uint64_t OUT results per
about 10 cycles.
Note that the performance is independent of the data(!), which is a great
benefit of this answer I think. The solution is branchless, and loopless, and
cannot suffer from failing branch prediction.
/* gcc -O3 -m64 -Wall -march=skylake select_bits.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int print_sse_128_bin(__m128i x);
__m128i bit_128_k(unsigned int k);
__m128i mm_bitreverse_epi64(__m128i x);
__m128i mm_revadd_epi64(__m128i x, __m128i y);
/* Select specific pieces of contiguous bits from `MASK` based on selector `IN` */
__m128i mm_select_bits_epi64(__m128i IN, __m128i MASK){
__m128i IN_reduced = _mm_and_si128(IN, MASK);
/* Compute the average of IN_reduced and MASK with avg(a,b)=(a&b)+((a^b)>>1) */
/* (IN_reduced & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* ((IN & MASK) & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* IN_reduced + ((IN_reduced ^ MASK) >>1) */
__m128i tmp = _mm_xor_si128(IN_reduced, MASK);
__m128i tmp_div2 = _mm_srli_epi64(tmp, 1);
__m128i average = _mm_add_epi64(IN_reduced, tmp_div2); /* Overflow-free average of IN_reduced and MASK */
__m128i MASK_div2 = _mm_srli_epi64(MASK, 1);
__m128i leading_bits = _mm_andnot_si128(MASK_div2, average);
__m128i rev_added = mm_revadd_epi64(MASK, leading_bits);
__m128i OUT = _mm_andnot_si128(rev_added, MASK);
/* Uncomment the next lines to check the arithmetic */ /*
printf("IN ");print_sse_128_bin(IN );
printf("MASK ");print_sse_128_bin(MASK );
printf("IN_reduced ");print_sse_128_bin(IN_reduced );
printf("tmp ");print_sse_128_bin(tmp );
printf("tmp_div2 ");print_sse_128_bin(tmp_div2 );
printf("average ");print_sse_128_bin(average );
printf("MASK_div2 ");print_sse_128_bin(MASK_div2 );
printf("leading_bits ");print_sse_128_bin(leading_bits );
printf("rev_added ");print_sse_128_bin(rev_added );
printf("OUT ");print_sse_128_bin(OUT );
printf("\n");*/
return OUT;
}
int main(){
__m128i IN = _mm_set_epi64x(0b11111110011010110, 0b1100010010010100);
__m128i MASK = _mm_set_epi64x(0b01011011001111110, 0b0001111010111011);
__m128i OUT;
printf("Example 1 \n");
OUT = mm_select_bits_epi64(IN, MASK);
printf("IN ");print_sse_128_bin(IN);
printf("MASK ");print_sse_128_bin(MASK);
printf("OUT ");print_sse_128_bin(OUT);
printf("\n\n");
/* 0b7654321076543210765432107654321076543210765432107654321076543210 */
IN = _mm_set_epi64x(0b1000001001001010000010000000100000010000000000100000000111100011,
0b11111110011010111);
MASK = _mm_set_epi64x(0b1110011110101110111111000000000111011111101101111100011111000001,
0b01011011001111111);
printf("Example 2 \n");
OUT = mm_select_bits_epi64(IN, MASK);
printf("IN ");print_sse_128_bin(IN);
printf("MASK ");print_sse_128_bin(MASK);
printf("OUT ");print_sse_128_bin(OUT);
printf("\n\n");
return 0;
}
int print_sse_128_bin(__m128i x){
for (int i = 127; i >= 0; i--){
printf("%1u", _mm_testnzc_si128(bit_128_k(i), x));
if (((i & 7) == 0) && (i > 0)) printf(" ");
}
printf("\n");
return 0;
}
/* From my answer here https://stackoverflow.com/a/39595704/2439725, adapted to 128-bit */
inline __m128i bit_128_k(unsigned int k){
__m128i indices = _mm_set_epi32(96, 64, 32, 0);
__m128i one = _mm_set1_epi32(1);
__m128i kvec = _mm_set1_epi32(k);
__m128i shiftcounts = _mm_sub_epi32(kvec, indices);
__m128i kbit = _mm_sllv_epi32(one, shiftcounts);
return kbit;
}
/* Copied from Harold's answer https://stackoverflow.com/a/46318399/2439725 */
/* Adapted to epi64 and __m128i: bit reverse two 64 bit elements */
inline __m128i mm_bitreverse_epi64(__m128i x){
__m128i shufbytes = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
__m128i luthigh = _mm_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
__m128i lutlow = _mm_slli_epi16(luthigh, 4);
__m128i lowmask = _mm_set1_epi8(15);
__m128i rbytes = _mm_shuffle_epi8(x, shufbytes);
__m128i high = _mm_shuffle_epi8(lutlow, _mm_and_si128(rbytes, lowmask));
__m128i low = _mm_shuffle_epi8(luthigh, _mm_and_si128(_mm_srli_epi16(rbytes, 4), lowmask));
return _mm_or_si128(low, high);
}
/* Add in the reverse direction: With a carry from left to */
/* right, instead of right to left */
inline __m128i mm_revadd_epi64(__m128i x, __m128i y){
x = mm_bitreverse_epi64(x);
y = mm_bitreverse_epi64(y);
__m128i sum = _mm_add_epi64(x, y);
return mm_bitreverse_epi64(sum);
}
/* End of SSE code */
/************* AVX2 code starts here ********************************************/
/* gcc -O3 -m64 -Wall -march=skylake select_bits256.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int print_avx_256_bin(__m256i x);
__m256i bit_256_k(unsigned int k);
__m256i mm256_bitreverse_epi64(__m256i x);
__m256i mm256_revadd_epi64(__m256i x, __m256i y);
/* Select specific pieces of contiguous bits from `MASK` based on selector `IN` */
__m256i mm256_select_bits_epi64(__m256i IN, __m256i MASK){
__m256i IN_reduced = _mm256_and_si256(IN, MASK);
/* Compute the average of IN_reduced and MASK with avg(a,b)=(a&b)+((a^b)>>1) */
/* (IN_reduced & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* ((IN & MASK) & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* IN_reduced + ((IN_reduced ^ MASK) >>1) */
__m256i tmp = _mm256_xor_si256(IN_reduced, MASK);
__m256i tmp_div2 = _mm256_srli_epi64(tmp, 1);
__m256i average = _mm256_add_epi64(IN_reduced, tmp_div2); /* Overflow-free average of IN_reduced and MASK */
__m256i MASK_div2 = _mm256_srli_epi64(MASK, 1);
__m256i leading_bits = _mm256_andnot_si256(MASK_div2, average);
__m256i rev_added = mm256_revadd_epi64(MASK, leading_bits);
__m256i OUT = _mm256_andnot_si256(rev_added, MASK);
/* Uncomment the next lines to check the arithmetic */ /*
printf("IN ");print_avx_256_bin(IN );
printf("MASK ");print_avx_256_bin(MASK );
printf("IN_reduced ");print_avx_256_bin(IN_reduced );
printf("tmp ");print_avx_256_bin(tmp );
printf("tmp_div2 ");print_avx_256_bin(tmp_div2 );
printf("average ");print_avx_256_bin(average );
printf("MASK_div2 ");print_avx_256_bin(MASK_div2 );
printf("leading_bits ");print_avx_256_bin(leading_bits );
printf("rev_added ");print_avx_256_bin(rev_added );
printf("OUT ");print_avx_256_bin(OUT );
printf("\n");*/
return OUT;
}
int main(){
__m256i IN = _mm256_set_epi64x(0b11111110011010110,
0b1100010010010100,
0b1000001001001010000010000000100000010000000000100000000111100011,
0b11111110011010111
);
__m256i MASK = _mm256_set_epi64x(0b01011011001111110,
0b0001111010111011,
0b1110011110101110111111000000000111011111101101111100011111000001,
0b01011011001111111);
__m256i OUT;
printf("Example \n");
OUT = mm256_select_bits_epi64(IN, MASK);
printf("IN ");print_avx_256_bin(IN);
printf("MASK ");print_avx_256_bin(MASK);
printf("OUT ");print_avx_256_bin(OUT);
printf("\n");
return 0;
}
int print_avx_256_bin(__m256i x){
for (int i=255;i>=0;i--){
printf("%1u",_mm256_testnzc_si256(bit_256_k(i),x));
if (((i&7) ==0)&&(i>0)) printf(" ");
}
printf("\n");
return 0;
}
/* From my answer here https://stackoverflow.com/a/39595704/2439725 */
inline __m256i bit_256_k(unsigned int k){
__m256i indices = _mm256_set_epi32(224,192,160,128,96,64,32,0);
__m256i one = _mm256_set1_epi32(1);
__m256i kvec = _mm256_set1_epi32(k);
__m256i shiftcounts = _mm256_sub_epi32(kvec, indices);
__m256i kbit = _mm256_sllv_epi32(one, shiftcounts);
return kbit;
}
/* Copied from Harold's answer https://stackoverflow.com/a/46318399/2439725 */
/* Adapted to epi64: bit reverse four 64 bit elements */
inline __m256i mm256_bitreverse_epi64(__m256i x){
__m256i shufbytes = _mm256_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
__m256i luthigh = _mm256_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
__m256i lutlow = _mm256_slli_epi16(luthigh, 4);
__m256i lowmask = _mm256_set1_epi8(15);
__m256i rbytes = _mm256_shuffle_epi8(x, shufbytes);
__m256i high = _mm256_shuffle_epi8(lutlow, _mm256_and_si256(rbytes, lowmask));
__m256i low = _mm256_shuffle_epi8(luthigh, _mm256_and_si256(_mm256_srli_epi16(rbytes, 4), lowmask));
return _mm256_or_si256(low, high);
}
/* Add in the reverse direction: With a carry from left to */
/* right, instead of right to left */
inline __m256i mm256_revadd_epi64(__m256i x, __m256i y){
x = mm256_bitreverse_epi64(x);
y = mm256_bitreverse_epi64(y);
__m256i sum = _mm256_add_epi64(x, y);
return mm256_bitreverse_epi64(sum);
}
Output of the SSE code with an uncommented debugging section:
Example 1
IN 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010110 00000000 00000000 00000000 00000000 00000000 00000000 11000100 10010100
MASK 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111011
IN_reduced 00000000 00000000 00000000 00000000 00000000 00000000 10110100 01010110 00000000 00000000 00000000 00000000 00000000 00000000 00000100 10010000
tmp 00000000 00000000 00000000 00000000 00000000 00000000 00000010 00101000 00000000 00000000 00000000 00000000 00000000 00000000 00011010 00101011
tmp_div2 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00010100 00000000 00000000 00000000 00000000 00000000 00000000 00001101 00010101
average 00000000 00000000 00000000 00000000 00000000 00000000 10110101 01101010 00000000 00000000 00000000 00000000 00000000 00000000 00010001 10100101
MASK_div2 00000000 00000000 00000000 00000000 00000000 00000000 01011011 00111111 00000000 00000000 00000000 00000000 00000000 00000000 00001111 01011101
leading_bits 00000000 00000000 00000000 00000000 00000000 00000000 10100100 01000000 00000000 00000000 00000000 00000000 00000000 00000000 00010000 10100000
rev_added 00000000 00000000 00000000 00000000 00000000 00000000 01001001 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000001 01000111
OUT 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111000
IN 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010110 00000000 00000000 00000000 00000000 00000000 00000000 11000100 10010100
MASK 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111011
OUT 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111000
Example 2
IN 10000010 01001010 00001000 00001000 00010000 00000010 00000001 11100011 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010111
MASK 11100111 10101110 11111100 00000001 11011111 10110111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
IN_reduced 10000010 00001010 00001000 00000000 00010000 00000010 00000001 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110100 01010111
tmp 01100101 10100100 11110100 00000001 11001111 10110101 11000110 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000010 00101000
tmp_div2 00110010 11010010 01111010 00000000 11100111 11011010 11100011 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00010100
average 10110100 11011100 10000010 00000000 11110111 11011100 11100100 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110101 01101011
MASK_div2 01110011 11010111 01111110 00000000 11101111 11011011 11100011 11100000 00000000 00000000 00000000 00000000 00000000 00000000 01011011 00111111
leading_bits 10000100 00001000 10000000 00000000 00010000 00000100 00000100 00000001 00000000 00000000 00000000 00000000 00000000 00000000 10100100 01000000
rev_added 00010000 01100001 00000010 00000001 11000000 01110000 00100000 00100000 00000000 00000000 00000000 00000000 00000000 00000000 01001001 00000000
OUT 11100111 10001110 11111100 00000000 00011111 10000111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
IN 10000010 01001010 00001000 00001000 00010000 00000010 00000001 11100011 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010111
MASK 11100111 10101110 11111100 00000001 11011111 10110111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
OUT 11100111 10001110 11111100 00000000 00011111 10000111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
The following approach needs only a single loop, with the number of iterations equal to the number of 'groups' found.
I don't know whether it will be more efficient than your approach; there are 6 arithmetic/bitwise operations in each iteration.
In pseudo code (C-like):
OUT = 0;
a = MASK;
while (a)
{
e = a & ~(a + (a & (-a)));
if (e & IN) OUT |= e;
a ^= e;
}
Here's how it works, step by step, using 11010111 as an example mask:
OUT = 0
a = MASK 11010111
c = a & (-a) 00000001 keeps rightmost one only
d = a + c 11011000 clears rightmost group (and sets the bit to its immediate left)
e = a & ~d 00000111 keeps rightmost group only
if (e & IN) OUT |= e; adds group to OUT
a = a ^ e 11010000 clears rightmost group, so we can proceed with the next group
c = a & (-a) 00010000
d = a + c 11100000
e = a & ~d 00010000
if (e & IN) OUT |= e;
a = a ^ e 11000000
c = a & (-a) 01000000
d = a + c 00000000 (ignoring carry when adding)
e = a & ~d 11000000
if (e & IN) OUT |= e;
a = a ^ e 00000000 done
As pointed out by @PeterCordes, some operations could be optimized using x86 BMI1 instructions:
c = a & (-a): blsi
e = a & ~d: andn
This approach is good for processor architectures that do not support bitwise reversal. On architectures that do have a dedicated instruction to reverse the order of bits in an integer, wim's answer is more efficient.