I am doing some microcontroller programming where I have to load the firmware of a DSP chip at run time. The DSP chip requires that register addresses be written with the opposite endianness, so the address 1024 becomes 0x04, 0x00. I have the address in a 2-element uint8_t array, with the most significant byte at position 0 and the least significant byte at position 1. However, I need to run through a loop where I increment each register address by one every iteration. The microcontroller has a different endianness, so I can't simply cast the array to uint16_t* and increment.
How would I go about incrementing the address?
I would use a normal int counter, and then convert to the correct endianness before sending it to the DSP. You can use macros in the byteorder or endian family. This will be easier to debug and more portable.
What is it you are looking for from us?
1) swap before sending
2) increment the lower byte, add the carry to the upper byte (asm makes this easy)
3) endian swap and increment (x = (upper<<8) | lower; x++); options 2 and 3 are sketched below
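A minimal C sketch of options 2 and 3, assuming the big-endian address lives in a two-byte array (the name addr and the helper names are just illustrative):

#include <stdint.h>

/* Option 2: increment the big-endian byte pair in place, propagating the carry. */
static void increment_be16(uint8_t addr[2]) {
    if (++addr[1] == 0)   /* low byte wrapped from 0xFF to 0x00 */
        ++addr[0];        /* carry into the high byte */
}

/* Option 3: reassemble into a native integer, increment, split back. */
static void increment_be16_v2(uint8_t addr[2]) {
    uint16_t x = (uint16_t)((addr[0] << 8) | addr[1]);
    x++;
    addr[0] = (uint8_t)(x >> 8);
    addr[1] = (uint8_t)(x & 0xFF);
}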
Related
My watchdog timer has a default value of 0x0fffff and I want to write a 2-byte variable (u2 compare) into it. What happens when I assign the value simply like this:
wdt_register = compare;
What happens to the most significant byte of the register?
Register definition: it's a 3-byte register made up of H, M and L 8-bit registers. The 4 most significant bits of H are not used, so it's effectively a 20-bit register. The datasheet names the whole thing WDTCR_20.
My question is: what happens when I assign a value to the register using this line (just an example of a 2-byte value written to a 3-byte register):
WDTCR_20 = 0x1234;
Your WDT is a so-called special function register. In hardware, it may end up being three bytes, or it could be four bytes, some of which are fixed/read-only/unused. Your compiler's implementation of the write is itself implementation-dependent if the SFR is declared in a particular way that makes the compiler emit SFR-specific write instructions.
This effectively makes the result of the assignment implementation-dependent; the high eight bits might be discarded, might set some other microarchitectural flags, or might cause a trap/crash if they aren't set to a specific (likely all-zeros) value. It depends on the processor's datasheet (and since you didn't mention a processor/toolchain, we can't say exactly).
For example, the datasheet of the AVR-based ATmega328P shows an example of such a register:
In this case, the one-byte register is actually only three bits, effectively (bits 7..3 are fixed to zero on read and ignored on write, and could very well have no physical flip-flop or SRAM cell associated with them).
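Back to the assignment in the question: if you want the result to be well-defined regardless of what the hardware does with the unused bits, one defensive sketch is to write the three byte-wide registers explicitly. The names WDTCR_H, WDTCR_M and WDTCR_L below are hypothetical (check your device header for the real names, and the datasheet for any required write sequence):

/* Hypothetical per-byte register names; only the low 4 bits of H are implemented. */
uint32_t v = compare;                    /* 16-bit value, zero-extended */
WDTCR_L = (uint8_t)(v & 0xFF);
WDTCR_M = (uint8_t)((v >> 8) & 0xFF);
WDTCR_H = (uint8_t)((v >> 16) & 0x0F);   /* zero here for a 16-bit `compare` */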
Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
Example: the bitmask is 10101010 and the value is 121; the intrinsic should write 121 at positions 0, 2, 4 and 6.
With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to an array, using _mm512_mask_storeu_epi8 (void* mem_addr, __mmask64 k, __m512i a), which is vmovdqu8. (That needs AVX512BW; with only AVX512F you can only use 32- or 64-bit element sizes.)
#include <immintrin.h>
#include <stdint.h>
void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    // Integer types are implicitly convertible to/from __mmask types;
    // the compiler emits the KMOV instruction for you.
    _mm512_mask_storeu_epi8(array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a constant from the mask and store that, or create a 0 / -1 vector using AVX512BW or DQ __m512i _mm512_movm_epi8(__mmask64 ). Other element sizes are available. But using a masked store makes it possible to safely use it when the array size isn't a multiple of the vector width, because the unmodified elements aren't read / rewritten or anything; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
Without AVX512, you asked for "an intrinsic" (singular), so here's the closest single-instruction building block.
There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the without pdep section below)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
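A scalar sketch of that chunked approach (requires BMI2 for _pdep_u64; x86 is little-endian, so bit 0 of each chunk lands in the lowest-addressed element; all names here are illustrative):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Expand a large bitmap into a 0/1 byte array, 8 elements per iteration. */
void bitmap_to_bytes(uint8_t *array, const uint8_t *bitmap, size_t bitmap_bytes) {
    for (size_t i = 0; i < bitmap_bytes; i++) {
        uint64_t chunk = _pdep_u64(bitmap[i], 0x0101010101010101);  /* 8 bits -> 8 bytes of 0/1 */
        memcpy(array + 8 * i, &chunk, 8);                           /* one 8-byte store, no aliasing UB */
    }
}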
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.
// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
    uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101);  // expand bits to bytes
    __m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
    return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. This usually works fine, because char* can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
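A scalar sketch of that trick for one 8-element chunk (BMI2, little-endian x86; the names are illustrative):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Write `value` into the 8 byte elements selected by the low 8 bits of `mask8`,
   and 0 into the unselected ones, with a single 8-byte store. */
static void store8_selected(uint8_t *dst, unsigned mask8, uint8_t value) {
    uint64_t ones  = _pdep_u64(mask8, 0x0101010101010101);  /* 0x00 / 0x01 bytes   */
    uint64_t bmask = ones * 0xFF;                            /* 0x00 / 0xFF bytes   */
    uint64_t vbrd  = 0x0101010101010101ULL * value;          /* value in every byte */
    uint64_t out   = bmask & vbrd;
    memcpy(dst, &out, 8);
}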
If you want to leave elements unmodified instead of setting them to zero or value=121, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb with set1(121), using the byte-mask as a control vector.
vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).
If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.
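A hedged sketch of that dword version (AVX2 + BMI2; the function and variable names are just illustrative):

#include <immintrin.h>
#include <stdint.h>

/* Store `value` into the 8 dword elements selected by the low 8 bits of `mask8`;
   vpmaskmovd leaves the unselected elements untouched. */
static void maskstore8_dwords(int32_t *dst, unsigned mask8, int32_t value) {
    uint64_t bytes = _pdep_u64(mask8, 0x0101010101010101) * 0xFF;  /* 0x00 / 0xFF per byte */
    __m128i  bvec  = _mm_cvtsi64_si128((long long)bytes);
    __m256i  mask  = _mm256_cvtepi8_epi32(bvec);  /* sign-extend 0x00/0xFF to 0 / -1 dwords */
    _mm256_maskstore_epi32(dst, mask, _mm256_set1_epi32(value));
}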
Without pdep
pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right bit of the bitmap is in the corresponding byte), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value, e.g. with _mm_blendv_epi8 or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of it to the right elements.
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.
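A sketch of that load/blend/store for one 16-bit chunk (SSSE3 pshufb + SSE4.1 pblendvb; names are illustrative, and dst must have 16 valid bytes at this position):

#include <immintrin.h>
#include <stdint.h>

/* Blend `value` into the 16 byte elements selected by `bitmap`,
   leaving the other elements unmodified (non-atomic read-modify-write). */
static void blend16_from_bitmap(uint8_t *dst, uint16_t bitmap, uint8_t value) {
    __m128i old   = _mm_loadu_si128((const __m128i *)dst);
    __m128i vval  = _mm_set1_epi8((char)value);
    /* Send bitmap byte 0 to elements 0..7 and byte 1 to elements 8..15. */
    __m128i bcast = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap),
                                     _mm_set_epi8(1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0));
    /* One distinct bit per element within its byte. */
    __m128i bits  = _mm_set_epi8((char)0x80,0x40,0x20,0x10,8,4,2,1,
                                 (char)0x80,0x40,0x20,0x10,8,4,2,1);
    __m128i sel   = _mm_cmpeq_epi8(_mm_and_si128(bcast, bits), bits);  /* 0xFF where bit set */
    _mm_storeu_si128((__m128i *)dst, _mm_blendv_epi8(old, vval, sel));
}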
I know the naming convention which says that if there are n*2 registers or variables that are semantically connected, you should name them like the following:
REGH REGL
In the case of 2*2 registers it would be:
REGHH REGHL REGLH REGLL
The last two letters stand for high-high, high-low, low-high and low-low. Is there any convention which declares the same thing for 3 registers? Like:
REGH REGM REGL
In this case the last letters stand for high, middle and low. 6 bytes would look like this:
REGHH REGHM REGHL REGLH REGLM REGLL
I hope you understand what I mean. Is there any convention for this case?
The Atmel AVR Microcontroller, 1st ed. [P. 173; 6.10.1]
For a register larger than 16 bits, the bytes are numbered from the least significant byte. For example, the 32-bit ADC calibration register is named CAL. The four bytes are named CAL0, CAL1, CAL2, CAL3 (from the least to the most significant byte).
So in an 8-bit system we shouldn't even do:
REGHH REGHL REGLH REGLL
but:
REG3 REG2 REG1 REG0
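As a small illustration of that numbering (hypothetical register names, with REG0 as the least significant byte, as in the quote), composing a 24-bit value would look like:

#include <stdint.h>

/* Hypothetical 8-bit byte registers of one 24-bit quantity, numbered from the LSB. */
extern volatile uint8_t REG0, REG1, REG2;

static uint32_t read_reg24(void) {
    return ((uint32_t)REG2 << 16) | ((uint32_t)REG1 << 8) | REG0;
}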
On a 16-bit DOS machine there are macros like FP_SEG and FP_OFF for converting a pointer to a linear address, but since these no longer exist on 32-bit compilers, what other functions can do the same on a 32-bit machine?
They're luckily not needed as 32-bit mode is unsegmented and hence the addresses are always linear (a simplification, but let's keep it simple).
EDIT: The first version was confusing, so let's try again.
In 16-bit segmented mode (I'm exclusively referring to legacy DOS programs here; it'll probably be similar for other 16-bit x86 OSes), addresses are given in a 32-bit format consisting of a 16-bit segment and a 16-bit offset. These are combined into a 20-bit linear address (this is where the infamous 640K barrier comes from: 2**20 = 1MB, and 384K are reserved for the system and the BIOS, leaving ~640K for user programs) by multiplying the segment by 16 = 0x10 (equivalent to shifting left by 4) and adding the offset. I.e.: linear = segment*0x10 + offset.
This means that 2**12 different segment:offset pointers refer to the same linear address, so in general there is no way to recover the original 32-bit segment:offset value from a linear address.
In old DOS programs that used far (segmented) pointers, as opposed to near pointers, which only contained an offset and implicitly used the ds segment register, the far pointers were usually treated as 32-bit unsigned integer values where the 16 most significant bits were the segment and the 16 least significant bits the offset. This gives the following macro definitions for FP_SEG and FP_OFF (using the types from stdint.h):
#define FP_SEG(x) (uint16_t)((uint32_t)(x) >> 16) /* grab 16 most significant bits */
#define FP_OFF(x) (uint16_t)((uint32_t)(x)) /* grab 16 least significant bits */
To convert a 20-bit linear address to a segmented address you have many options (2**12). One way could be:
#define LIN_SEG(x) (uint16_t)(((uint32_t)(x)&0xf0000)>>4)
#define LIN_OFF(x) (uint16_t)((uint32_t)(x))
Finally a quick example of how it all works together:
Segmented address: a = 0xA000:0x0123
As 32-bit far pointer b = 0xA0000123
20-bit linear address: c = 0xA0123
FP_SEG(b) == 0xA000
FP_OFF(b) == 0x0123
LIN_SEG(c) == 0xA000
LIN_OFF(c) == 0x0123
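A tiny self-check of those macros with the example values (plain C, assuming the macro definitions above are in scope):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t b = 0xA0000123u;  /* far pointer 0xA000:0x0123 as a 32-bit value */
    uint32_t c = 0xA0123u;     /* the corresponding 20-bit linear address     */
    assert(FP_SEG(b) == 0xA000 && FP_OFF(b) == 0x0123);
    assert(LIN_SEG(c) == 0xA000 && LIN_OFF(c) == 0x0123);
    assert((uint32_t)LIN_SEG(c) * 0x10 + LIN_OFF(c) == c);  /* linear = segment*16 + offset */
    return 0;
}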
I need to allocate an array like uint64_t[1e9] to count something, and I know the counts are between (0, 2^39).
So I want to calloc 5*1e9 bytes for the array.
Then I found that, if I want to keep the uint64_t values meaningful, it is hard to get around the byte order.
There seem to be two ways.
One is to check the endianness first, so that we can memcpy the 5 bytes to either the beginning or the end of the whole 8 bytes.
The other is to use 5 bit-shifts and then OR them together.
I think the former should be faster.
So, under GCC or libc or a GNU system, is there any header file that indicates whether the current system is little-endian or big-endian? I know x86_64 is little-endian, but I don't want to write unportable code.
Of course any other ideas are welcome.
Update:
I need to use the array to count many strings using D-left hashing. I plan to use 21 bits for the key and 18 bits for counting.
When you say "faster"... how often is this code executed? Five shifts by 8 plus an OR probably cost less than 100 ns, so that code has to run on the order of ten million times before it adds up to 1 (one) second.
If the code is executed fewer times than that and you need more than 1 second to implement an endian-clean solution, you're wasting everyone's time.
That said, the solution to figure out the endianness is simple:
int a = 1;
char *ptr = (char *)&a;          /* look at the first byte of the int in memory */
bool littleEndian = (*ptr == 1); /* needs <stdbool.h> in C */
Now all you need is a big-endian machine and a couple of test cases to make sure your memcpy solution works. Note that you need to call memcpy five times in one of the two cases to reorder the bytes.
Or you could simply shift and or five times...
EDIT I guess I misunderstood your question a bit. You're saying that you want to use the lowest 5 bytes (=40 bits) of the uint64_t as a counter, yes?
So the operation will be executed many, many times. Again, memcpy is utterly useless. Let's take the number 0x12345678 (32bit). In memory, that looks like so:
0x12 0x34 0x56 0x78 big endian
0x78 0x56 0x34 0x12 little endian
As you can see, the bytes are swapped. So to convert between the two, you must either use bit-shifting or byte swapping. memcpy doesn't work.
But that doesn't actually matter since the CPU will do the decoding for you. All you have to do is to shift the bits in the right place.
key = item & 0x1FFFFF
count = item >> 21
to read and
item = count << 21 | key
to write. Now you just need to build the 40-bit item from the five stored bytes and you're done:
item = ((uint64_t)hash[0] << 32) | ((uint64_t)hash[1] << 24) | ....
EDIT 2
It seems you have an array of 40-bit ints and you want to read/write that array.
I have two solutions: Using memcpy should work as long as the data isn't copied between CPUs of different endianness (read: when you save/load the data to/from disk). But the function call might be too slow for such a huge array.
The other solution is to use two arrays:
int lower[];
uint8_t upper[];
that is: Save the bits 33-40 in a second array. To read/write the values, one shift+or is necessary.
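A minimal sketch of that two-array layout (names are illustrative, and I use uint32_t for the low halves; 4 + 1 bytes per counter keeps the 5-bytes-per-entry budget):

#include <stdint.h>
#include <stdlib.h>

static uint32_t *lower;   /* bits  0..31 of each counter */
static uint8_t  *upper;   /* bits 32..39 of each counter */

static int alloc_counters(size_t n) {
    lower = calloc(n, sizeof *lower);
    upper = calloc(n, sizeof *upper);
    return lower != NULL && upper != NULL;
}

static uint64_t get40(size_t i) {
    return ((uint64_t)upper[i] << 32) | lower[i];
}

static void set40(size_t i, uint64_t v) {
    lower[i] = (uint32_t)v;
    upper[i] = (uint8_t)(v >> 32);
}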
If you treat numbers as numbers, and not as an array of bytes, your code will be endianness-agnostic. Hence, I would go for the shift-and-or solution.
Having said that, I didn't really catch what you are trying to do. Do you really need one billion entries, each five bytes long? If the data you are sampling is sparse, you might get away with allocating far less memory.
Well, I just found that the kernel headers come with <asm/byteorder.h>.
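On glibc systems there's also <endian.h>, which can be used for a compile-time check, something like this (the macro name defined here is just illustrative):

#include <endian.h>   /* glibc; the kernel's <asm/byteorder.h> is similar */

#if __BYTE_ORDER == __LITTLE_ENDIAN
#  define HOST_IS_LITTLE_ENDIAN 1
#else
#  define HOST_IS_LITTLE_ENDIAN 0
#endif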
Inlining memcpy as a manual byte-copy loop (something like while (i < x + 3) { *i++ = *j++; }) may still be slower, since cache accesses are slower than registers.
Another way to do the memcpy is through a union:
union dat {
    uint64_t a;
    char b[8];
} d;
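A hedged usage sketch (assuming a little-endian target, where the low five bytes of a are b[0..4], and assuming the union type is declared at file scope; on big-endian they would be b[3..7] instead):

#include <stdint.h>
#include <string.h>

/* Store the low 40 bits of `value` into a packed 5-bytes-per-entry array. */
static void store40_le(char *packed, size_t index, uint64_t value) {
    union dat d;
    d.a = value;
    memcpy(packed + 5 * index, d.b, 5);
}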