What data structure for an array of bit flags?

I'm porting some imperative code to Haskell. My goal is to analyze an executable, so each byte of the text section gets assigned a number of flags, all of which fit in a byte (6 bits, to be precise).
In a language like C, I would just allocate an array of bytes, zero them and update them as I go. How would I do this efficiently in Haskell?
In other words: I'm looking for a ByteString with bit-wise access and constant-time updates as I disassemble more of the text section.
Edit: Of course, any other kind of data structure would do if it were similarly efficient.

The implementation of unboxed arrays of Bools in the array package is a packed bit array. You can do mutable updates on such arrays in the ST monad (this gives essentially the same runtime behaviour as in C).
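A minimal sketch of that, assuming a 1024-byte text section (the size and the flag positions are made-up placeholders):

import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray)

-- Allocate a packed bit array, zeroed, then set bits imperatively
-- in ST as the disassembly proceeds.
flags :: UArray Int Bool
flags = runSTUArray $ do
  arr <- newArray (0, 1023) False
  writeArray arr 0 True    -- e.g. mark byte 0 as an instruction start
  writeArray arr 5 True
  return arr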

You can use a vector of any data type that's an instance of the Bits typeclass, for example Word64 for 64 bits per element of the vector. Use an unboxed vector if you want the array to be contiguous in memory.
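A minimal sketch of that idea (assuming the vector package; packFlags and its arguments are illustrative), packing 64 flags per Word64:

import Control.Monad.ST (runST)
import Data.Bits (setBit)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM

-- Flag i lives in word (i `div` 64) at bit (i `mod` 64).
packFlags :: Int -> [Int] -> V.Vector Word64
packFlags n indices = runST $ do
  v <- VM.replicate ((n + 63) `div` 64) 0
  mapM_ (\i -> VM.modify v (`setBit` (i `mod` 64)) (i `div` 64)) indices
  V.freeze v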


Fixed size arrays in Haskell

I was trying to write a library for linear algebra operations in Haskell. In order to be able to define safe operations for matrices and vectors I wanted to encode their dimensions in their types. After some research I found that using DataKinds one is able to do that, similar to the way it's done here. For example:
data Vector (n :: Nat) a
dot :: Num a => Vector n a -> Vector n a -> a
In the aforementioned article, as well as in some libraries, the size of the vectors is a phantom type and the vector type itself is a wrapper around an Array. While trying to figure out whether there is an array type with its size at the type level in the standard library, I started wondering about the underlying representation of arrays. From what I could gather from this commentary on GHC memory layout, arrays need to store their size on the heap, so a 3-dimensional vector would need to take up one more word than necessary. Of course we could use the following definition:
data Vector3 a = Vector3 a a a
which might be fine if we only care about 3D geometry, but it doesn't allow for vectors of arbitrary size, and it also makes indexing awkward.
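For concreteness, a minimal sketch of the phantom-size wrapper approach mentioned above (assuming the vector package for the underlying storage):

{-# LANGUAGE DataKinds, KindSignatures #-}
import GHC.TypeLits (Nat)
import qualified Data.Vector as V

-- The length n exists only at the type level; the runtime value is an
-- ordinary boxed vector, which still stores its size on the heap.
newtype Vector (n :: Nat) a = Vector (V.Vector a)

dot :: Num a => Vector n a -> Vector n a -> a
dot (Vector xs) (Vector ys) = V.sum (V.zipWith (*) xs ys)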
So, my question is this. Wouldn't it be useful, and a potential memory optimization, to have an array type with statically known size in the standard library? As far as I understand, the only thing it would need is a different info table, which would store the size, instead of the size being stored at each heap object. Also, the compiler could choose between Array and SmallArray automatically.
Wouldn't it be useful and a potential memory optimization to have an array type with statically known size in the standard library?
Sure. I suspect if you wrote up your use case carefully and implemented this, GHC HQ would accept a patch. You might want to do the writeup first and double-check that they're into it to avoid wasting time on a patch they won't accept, though; I certainly don't speak for them.
Also, the compiler could choose between Array and SmallArray automatically.
I'm not an expert here, but I kinda doubt this. Usually supporting polymorphism means you need a uniform representation.

Is it possible to do bitwise operations on multiple successive array elements?

I am working on an old implementation of AES I coded a few years ago, and I wanted to modify my ShiftRows function which is very inefficient.
At the moment my ShiftRows basically just swaps the values of successive array elements (each represented by one byte) n times to effect a cyclic permutation.
I wondered if it was possible to take my array of elements and cast it to a single variable, so I could do the permutation using the bit-shift operators.
The rows are 4 unsigned char, so 4 bytes each.
In the following code only the first byte (corresponding to 'a') seems to be affected by the bitshift.
char array[4][4] = {"abcd", "efgh", "ijkl", "mnop"};
int32_t somevar;
somevar = (int32_t)*array[0] >> 16;
It's been a long time since I last practiced C, so I am probably making some stupid errors.
First, if your primary goal is a fast AES implementation, rather than either practicing C or a fast-but-portable AES implementation (that is, portability is primary and efficiency is secondary), then you would need to write in assembly language, not C, or at least use compiler features for specific targets that let you write near-assembly code. For example, Intel processors have AES-assist instructions, and GCC has built-in functions for them.
Second, if you are going to do this in C, your primary job, ideally, is to express the desired operations clearly to the compiler. By this, I mean you want the operations to be transparent to the compiler so that its optimizer can work. Using various techniques to reinterpret data (from char to int, for example) can block the compiler’s ability to optimize. (Or they might not, depending on compiler quality and the specific code you write.)
If you are aiming for portable code, it is likely better to simply write the character motions you want (just write simple assignment statements that move array elements). Good compilers can translate these efficiently, even combining multiple byte-move operations into single word-move operations if the hardware supports it.
When you are writing “fancy” code to try to optimize, it is important to be aware of rules of standard C, properties of the compiler(s) you are working with, and the hardware you are targeting.
For example, you have char array[4][4]. This declares an array with no particular alignment. The compiler might put this array anywhere, with any alignment—it is not necessarily aligned to a multiple of four bytes, for example. If you then take a pointer to the first row of this array and convert it to a pointer to an int, then an instruction to load an int may fail on some processors because they require int objects to be aligned to multiples of four bytes. On other processors, the load may work but be slower than an aligned load.
One solution for this is not to declare a bare array and not to convert pointers. Instead, you would declare a union, one member of which might be an array of four uint32_t and the other of which might be an array of four arrays of four uint8_t. The presence of the uint32_t array in the union would compel the compiler to align it suitably for the hardware. Additionally, reinterpreting data through unions is allowed in C, whereas reinterpreting data through converted pointers is not proper C code. (Even if the alignment requirements are satisfied, reinterpreting data through pointers generally violates aliasing rules.)
On another note, it is generally preferable to use unsigned types when working with bits as is done in cryptographic code. Instead of char and int32_t, you may be better off with uint8_t and uint32_t.
Regarding your specific code:
somevar = (int32_t)*array[0] >> 16;
array[0] is the first row of array. By the rules of C, it is automatically converted to a pointer to its first element, so it becomes &array[0][0]. Then *array[0] is *&array[0][0], which is array[0][0], which is the first char in the first row of the array. So the expression so far is just the value of the first char. Then the cast (int32_t) converts the type of the expression to int32_t. This does not change the value, so the result is simply the value of the first char in the first row.
What you were likely thinking of was either * (uint32_t *) &array[0] or * (uint32_t *) array[0]. These take either the address of the first row (the former expression) or the address of the first element of the first row (the latter expression) (these denote the same location but have different types) and convert it to a pointer to a uint32_t. Then the * is intended to fetch the uint32_t at that address. That violates C rules and should be avoided.
Instead, you can use:
union
{
    uint32_t words[4];
    uint8_t  bytes[4][4];
} block;
Then you can access individual bytes with block.bytes[i][j] or words of four bytes with block.words[i]. Whether this is a good idea or not depends on context and goals.
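As an illustration of how the union enables the bit-shift approach you asked about, here is a minimal sketch that performs one row's cyclic byte permutation with a single 32-bit rotate. Note that the direction of the byte movement depends on endianness; the rotate below shifts the row's bytes left by one position only on a little-endian machine:

#include <stdint.h>
#include <stdio.h>

union block
{
    uint32_t words[4];
    uint8_t  bytes[4][4];
};

/* Rotate a 32-bit word right by n bits (0 < n < 32). On a little-endian
   machine, rotating right by 8 cyclically shifts the row's four bytes
   left by one position. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

int main(void)
{
    union block b = { .bytes = { "abcd", "efgh", "ijkl", "mnop" } };
    b.words[1] = rotr32(b.words[1], 8);
    printf("%.4s\n", (char *)b.bytes[1]);   /* prints "fghe" on little-endian */
    return 0;
}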

Are there real vectors (one-dimensional arrays) in Perl?

I know that traditional "lists" in Perl are implemented internally as real doubly-linked lists, so indexed access to list elements is slow. This is the cost of the dynamic nature of lists, which can be sliced, expanded, and shrunk.
But for performance reasons it would be very good to be able to malloc() a chunk of memory and create a vector of static size with a predefined element size. For example, a fixed-size doubly-linked list could be represented as a sequence of elements of size 4 (prev_v_index) + 4 (next_v_index) + 8 (data_ptr aka REF) = 16 bytes. Then we could access every element of this vector as is usually done in compiled languages like C: elem_ptr = vector_ptr + (index * elem_size). Access to elements would be very fast, with some architecture-specific alignment (8 bytes for x86_64).
Maybe there is already some XS module for manipulating fixed vectors in Perl 5?
Perl's arrays (@array variables or [...] references) do use a contiguous memory region. They are not linked lists. However, these arrays only hold pointers to the scalar values, not the values themselves. This is a necessary restriction of the Perl data model.
If you know C++, a Perl array can be thought of as similar to a std::vector<Scalar*>, except that Perl's arrays can push and pop at the front and the back.
To resize a Perl array, you can assign the last index. E.g. to pre-allocate 50 elements:
my @array;
$#array = 50 - 1;
If you need compact data storage within Perl, then you will have to use strings. Given a fixed-size record, you can get and set one record with substr, and pack/unpack the data from and to Perl data structures.
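A minimal sketch of that technique, using the 16-byte record layout from the question (the 'llq' pack template and the helper names are illustrative):

use strict;
use warnings;

my $RECORD_SIZE = 16;                      # 4 + 4 + 8 bytes, as in the question
my $buffer = "\0" x ($RECORD_SIZE * 50);   # pre-allocate 50 zeroed records

# Write record $i: two 32-bit indices and one 64-bit value.
sub set_record {
    my ($i, $prev, $next, $data) = @_;
    substr($buffer, $i * $RECORD_SIZE, $RECORD_SIZE) = pack 'llq', $prev, $next, $data;
}

# Read record $i back into Perl values.
sub get_record {
    my ($i) = @_;
    return unpack 'llq', substr($buffer, $i * $RECORD_SIZE, $RECORD_SIZE);
}

set_record(3, 2, 4, 42);
my ($prev, $next, $data) = get_record(3);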
You can use the vec function to use a string as a vector. For example, you could pack Boolean values into individual bits.
vec EXPR,OFFSET,BITS
Treats the string in EXPR as a bit vector made up of elements of
width BITS and returns the value of the element specified by
OFFSET as an unsigned integer. BITS therefore specifies the
number of bits that are reserved for each element in the bit
vector. This must be a power of two from 1 to 32 (or 64, if your
platform supports that).
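For example, a quick sketch packing Boolean flags into single bits of a string:

my $flags = '';            # empty string used as a bit vector
vec($flags, 7, 1) = 1;     # set bit 7; the string grows as needed
print vec($flags, 7, 1);   # prints 1
print vec($flags, 3, 1);   # prints 0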
That said, your concern about array access being "slow" is unwarranted, and your beliefs about perl's internals are incorrect. Array performance is likely to be fast enough. Don't try to "optimize" around it until you've profiled your code and proven that it's a bottleneck.

Why is the array of complex numbers declared row-wise in FFTW?

The fftw manual states (page 4):
The data is an array of type fftw_complex, which is by default a double[2] composed of the real (in[i][0]) and imaginary (in[i][1]) parts of a complex number.
For doing an FFT of a time series of n samples, the size of the matrix becomes n rows and 2 columns. If we wish to do element-by-element manipulation, isn't accessing in[i][0] for different values of i slow for large values of n, since C stores 2D arrays in row-major format?
The real and imaginary parts of each element are stored consecutively in memory, with R0 (the real part of element 0) at the lowest address:
R0,I0 | R1,I1 | ... | Rn-1,In-1
That means it's possible to copy an element i into place while accessing a single cache line (usually 64 bytes today), as the real and imaginary parts are adjacent. If you stored the 2D array in Fortran (column-major) order and wanted to assign to one element, you would immediately access memory on two different cache lines, as the two parts would be stored n * sizeof(double) locations apart in memory.
Now if your processing were, for some reason, operating on just the real parts in one thread and the imaginary parts separately in another, then yes, it would be more efficient to store them in column-major order, or even as separate parallel arrays. In general, though, data is stored close together because it is used together.
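As a concrete sketch (assuming FFTW's default fftw_complex, a double[2]), an element-by-element operation touches two adjacent doubles per iteration, so the walk over i is purely sequential:

#include <fftw3.h>

/* Scale n complex samples in place. in[i][0] (real) and in[i][1]
   (imaginary) are adjacent doubles, so each iteration stays on one
   or two cache lines and i advances sequentially through memory. */
void scale(fftw_complex *in, int n, double k)
{
    for (int i = 0; i < n; i++) {
        in[i][0] *= k;
        in[i][1] *= k;
    }
}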
All arrays in C are really one-dimensional arrays over contiguous bytes, unless you store an array of pointers to arrays, which is usually done for things like strings of varying lengths.
Sometimes in matrix calculations it's actually faster to transpose one array first, because of the access patterns of matrix multiplication. It's complex, but if you want the real nitty-gritty details, search for Ulrich Drepper's article about memory at LWN.net ("What Every Programmer Should Know About Memory"), which shows an example that benefits from this technique (section 5, IIRC).
Scientific numeric libraries have very often worked in column-major order, because Fortran compatibility was more important than using the array in a natural way. Most languages prefer row-major order, as it's generally more desirable, for instance when you store fixed-length strings in a table.

The fastest way to copy a 32-bit array into a 16-bit array?

What is the best way to copy a 32-bit array into a 16-bit array?
I know that memcpy uses hardware instructions, but is there a standard function to copy arrays while changing the size of each element?
I use gcc for armv7 (cortex A8).
uint32_t tab32[500];
uint16_t tab16[500];
for (int i = 0; i < 500; i++)
    tab16[i] = tab32[i];
On the ARM Cortex-A8 with the NEON instruction set, the fastest methods use interleaved read/write instructions:
vld2.16 {d0,d1}, [r0]!
vst1.16 {d0}, [r1]!
or saturating narrow instructions (vqmovn) to convert a vector of 32-bit integers to a vector of 16-bit integers.
Both of these methods are available in C using GCC's NEON intrinsics. It's also possible that GCC can autovectorize carefully written C code to use nothing but these particular instructions. This would basically require a one-to-one correspondence between all the side effects of these instructions and the C code.
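A sketch of the plain (truncating) variant using GCC's NEON intrinsics; vmovn_u32 keeps the low 16 bits of each 32-bit lane, and n is assumed to be a multiple of 4 to keep the example short:

#include <arm_neon.h>
#include <stdint.h>

void narrow_copy(const uint32_t *in, uint16_t *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        uint32x4_t w = vld1q_u32(in + i);   /* load 4 x 32-bit values   */
        uint16x4_t h = vmovn_u32(w);        /* truncate each to 16 bits */
        vst1_u16(out + i, h);               /* store 4 x 16-bit values  */
    }
}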
There is no standard function that does this, mostly because it would be very specific to your application.
If you know that the integers in tab32 will be small enough to fit in a uint16_t, the code in your question is probably the best you can get (the compiler will do the rest if it can optimize something).
Well, if you don't need to modify the data, you can use a pointer to uint16_t pointed at the 32-bit array. This assumes that the raw memory makes sense as an array of 16-bit unsigned integers (and note that which half of each word you see depends on endianness).
Using memcpy would be the fastest way, in my opinion. memcpy is optimized separately for each architecture, so you should be good.
On the other hand, since registers are 32-bit on ARM and 16-bit values are zero- or sign-extended to 32 bits in the back end anyway, I think it would be more efficient to leave the data as a 32-bit array and not copy it into a 16-bit array (you should actually measure to make the correct decision).
There is one method which can save space and (hopefully) improve performance: store the incoming values in an int array where each int holds two 16-bit values.
For example, int[4] would look like this:
--------------------------------------------------------------------
|     32bit     ||     32bit     ||     32bit     ||     32bit     |
--------------------------------------------------------------------
| 16bit | 16bit || 16bit | 16bit || 16bit | 16bit || 16bit | 16bit |
--------------------------------------------------------------------
There will be a little preprocessing required, like reading the values as chars (bytes) and then using a (char *) cast on the int array to store two values in one slot.
This last approach is not guaranteed to give better performance unless all of the algorithms you are going to apply to the array work seamlessly with this layout of elements. You may have to modify the algorithms a little to work with this data structure. For example, some bit-manipulation algorithms (and, or, etc.) can be applied to it without much work.
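A sketch of such accessors (the names are illustrative); index i selects the low or high half of word i/2:

#include <stdint.h>

/* Read 16-bit value i out of the packed 32-bit array. */
static inline uint16_t get16(const uint32_t *packed, int i)
{
    return (uint16_t)(packed[i / 2] >> ((i % 2) * 16));
}

/* Overwrite 16-bit value i without disturbing its neighbour. */
static inline void set16(uint32_t *packed, int i, uint16_t v)
{
    int shift = (i % 2) * 16;
    packed[i / 2] = (packed[i / 2] & ~(0xFFFFu << shift))
                  | ((uint32_t)v << shift);
}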
