Converting from Source-based Indices to Destination-based Indices

I'm using AVX2 instructions in some C code.
The VPERMD instruction takes two 8-integer vectors a and idx and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because the move is indexed by the source position.
However, I have my calculated indices in destination based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7.
How can I convert from source-based form to destination-based form? An example test case is:
{2 1 0 5 3 4 6 7} source-based form.
{2 1 0 4 5 3 6 7} destination-based equivalent
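In scalar code the conversion is just inverting the permutation, e.g. (a minimal sketch):
int idx[8] = {2, 1, 0, 5, 3, 4, 6, 7}; // source-based
int inv[8];                            // destination-based
for (int i = 0; i < 8; i++)
    inv[idx[i]] = i; // inv = {2, 1, 0, 4, 5, 3, 6, 7}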
For this conversion, I'm staying in ymm registers, which means destination-based solutions (scattering each element to its computed position) don't work. Even inserting each element separately is out: the insert instructions only operate on constant indexes, so you can't place elements at computed positions.

I guess you're implicitly saying that you can't modify your code to calculate source-based indices in the first place? I can't think of anything you can do with x86 SIMD, other than AVX512 scatter instructions that take dst-based indices. (But those are not very fast on current CPUs, even compared to gather loads. https://uops.info/)
Storing to memory, inverting, and reloading a vector might actually be best. (Or transferring to integer registers directly, not through memory, maybe after a vextracti128 / packusdw so you only need two 64-bit transfers from vector to integer regs: movq and pextrq).
But anyway, then use them as indices to store a counter into an array in memory, and reload that as a vector. This is still slow and ugly, and includes a store-forwarding failure delay. So it's probably worth your while to change your index-generating code to generate source-based shuffle vectors if at all possible.
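As a concrete sketch of that memory round trip (the function name invert_idx is mine; AVX2, 8 x 32-bit lanes):
#include <immintrin.h>
#include <stdint.h>

__m256i invert_idx(__m256i dst_idx) {
    uint32_t idx[8], inv[8];
    _mm256_storeu_si256((__m256i *)idx, dst_idx);
    for (int i = 0; i < 8; i++)
        inv[idx[i]] = (uint32_t)i; // store a counter to each destination index
    return _mm256_loadu_si256((const __m256i *)inv); // reload; expect a store-forwarding stall
}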

To benchmark the solution, I modified the code from my other answer to compare the performance of the scatter instruction (USE_SCATTER defined) with a store and sequential permute (USE_SCATTER undefined). I had to copy the result back to the permutation pattern perm in order to prevent the compiler from optimizing the loop body away:
#ifdef TEST_SCATTER
#define REPEATS 1000000001
#define USE_SCATTER
__m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
__m512i perm = _mm512_set_epi32(7,9,3,0,5,8,13,11,4,2,15,1,12,6,10,14);
uint32_t outA[16] __attribute__ ((aligned(64)));
uint32_t id[16], in[16];
_mm512_storeu_si512(id, ident);
for (int i = 0; i < 16; i++) printf("%2d ", id[i]); puts("");
_mm512_storeu_si512(in, perm);
for (int i = 0; i < 16; i++) printf("%2d ", in[i]); puts("");
#ifdef USE_SCATTER
puts("scatter");
for (long t = 0; t < REPEATS; t++) {
_mm512_i32scatter_epi32(outA, perm, ident, 4);
perm = _mm512_load_si512(outA);
}
#else
puts("store & permute");
uint32_t permA[16] __attribute__ ((aligned(64)));
for (long t = 0; t < REPEATS; t++) {
_mm512_store_si512(permA, perm);
for (int i = 0; i < 16; i++) outA[permA[i]] = i;
perm = _mm512_load_si512(outA);
}
#endif
for (int i = 0; i < 16; i++) printf("%2d ", outA[i]); puts("");
#endif
Here's the output for the two cases (using the builtin time command of tcsh, the u output is user-space time in seconds):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14 10 6 12 1 15 2 4 11 13 8 5 0 3 9 7
store & permute
12 4 6 13 7 11 2 15 10 14 1 8 3 9 0 5
10.765u 0.001s 0:11.22 95.9% 0+0k 0+0io 0pf+0w
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14 10 6 12 1 15 2 4 11 13 8 5 0 3 9 7
scatter
12 4 6 13 7 11 2 15 10 14 1 8 3 9 0 5
10.740u 0.000s 0:11.19 95.9% 0+0k 40+0io 0pf+0w
The runtime is about the same (Intel(R) Xeon(R) W-2125 CPU @ 4.00GHz, clang++-6.0, -O3 -funroll-loops -march=native). I checked the assembly code generated. With USE_SCATTER defined, the compiler generates vpscatterdd instructions; without it, it generates complex code using vpextrd, vpextrq, and vextracti32x4.
Edit: I was worried that the compiler may have found a specific solution for the fixed permutation pattern I used. So I replaced it with a randomly generated pattern from std::random_shuffle(), but the time measurements are about the same.
Edit: Following the comment by Peter Cordes, I wrote a modified benchmark that hopefully measures something like throughput:
#define REPEATS 1000000
#define ARRAYSIZE 1000
#define USE_SCATTER
std::srand(unsigned(std::time(0)));
// build array with random permutations
uint32_t permA[ARRAYSIZE][16] __attribute__ ((aligned(64)));
for (int i = 0; i < ARRAYSIZE; i++)
_mm512_store_si512(permA[i], randPermZMM());
// vector register
__m512i perm;
#ifdef USE_SCATTER
puts("scatter");
__m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
for (long t = 0; t < REPEATS; t++)
for (long i = 0; i < ARRAYSIZE; i++) {
perm = _mm512_load_si512(permA[i]);
_mm512_i32scatter_epi32(permA[i], perm, ident, 4);
}
#else
uint32_t permAsingle[16] __attribute__ ((aligned(64)));
puts("store & permute");
for (long t = 0; t < REPEATS; t++)
for (long i = 0; i < ARRAYSIZE; i++) {
perm = _mm512_load_si512(permA[i]);
_mm512_store_si512(permAsingle, perm);
uint32_t *permAVec = permA[i];
for (int e = 0; e < 16; e++)
permAVec[permAsingle[e]] = e;
}
#endif
FILE *f = fopen("testperm.dat", "w");
fwrite(permA, ARRAYSIZE, 64, f);
fclose(f);
I use an array of permutation patterns which are modified sequentially without dependencies.
These are the results:
scatter
4.241u 0.002s 0:04.26 99.5% 0+0k 80+128io 0pf+0w
store & permute
5.956u 0.002s 0:05.97 99.6% 0+0k 80+128io 0pf+0w
So throughput is better when using the scatter instruction.

I had the same problem, but in the opposite direction: destination indices were easy to compute, but source indices were required for the application of SIMD permute instructions. Here's a solution for AVX-512 using a scatter instruction as suggested by Peter Cordes; it should also apply to the opposite direction:
__m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
__m512i perm = _mm512_set_epi32(7,9,3,0,5,8,13,11,4,2,15,1,12,6,10,14);
uint32_t id[16], in[16], out[16];
_mm512_storeu_si512(id, ident);
for (int i = 0; i < 16; i++) printf("%2d ", id[i]); puts("");
_mm512_storeu_si512(in, perm);
for (int i = 0; i < 16; i++) printf("%2d ", in[i]); puts("");
_mm512_i32scatter_epi32(out, perm, ident, 4);
for (int i = 0; i < 16; i++) printf("%2d ", out[i]); puts("");
An identity mapping ident is distributed to the out array according to the index pattern perm. The idea is basically the same as the one described for inverting a permutation. Here's the output:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14 10 6 12 1 15 2 4 11 13 8 5 0 3 9 7
12 4 6 13 7 11 2 15 10 14 1 8 3 9 0 5
Note that I have permutations in the mathematical sense (no duplicates). With duplicates, the out store needs to be initialized since some elements could remain unwritten.
I also see no easy way to accomplish this within registers. I thought about cycling through the given permutation by repeatedly applying a permute instruction. As soon as the identity pattern is reached, the one before it is the inverse permutation (this goes back to the idea by EOF on unzip operations). However, the cycles can be long. The maximum number of applications that may be required is given by Landau's function, which for 16 elements is 140, see this table. I could show that it is possible to shorten this to a maximum of 16 if the individual permutation subcycles are frozen as soon as they coincide with the identity elements. This shortens the average from 28 to 9 permute instructions in a test on random permutation patterns. However, it is still not an efficient solution (much slower than the scatter instruction in the throughput benchmark described in my other answer).
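For illustration, the cycling approach would be something like this (a sketch; invert_by_cycling is my name for it, and it assumes perm is a true permutation):
__m512i invert_by_cycling(__m512i perm) {
    const __m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
    __m512i prev = perm;                                // perm^1
    __m512i cur = _mm512_permutexvar_epi32(perm, perm); // perm^2
    while (_mm512_cmpneq_epi32_mask(cur, ident)) {      // loop until perm^k == identity
        prev = cur;
        cur = _mm512_permutexvar_epi32(perm, cur);      // next power of perm
    }
    return prev;                                        // perm^(k-1) is the inverse
}
(This is without the subcycle-freezing optimization, so it may take up to 140 iterations.)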

Related

Number of possible sequences of 16 symbols given some restrictions

How many possible sequences can be formed that obey the following rules:
Each sequence is formed from the symbols 0-9a-f.
Each sequence is exactly 16 symbols long.
0123456789abcdef ok
0123456789abcde XXX
0123456789abcdeff XXX
Symbols may be repeated, but no more than 4 times.
00abcdefabcdef00 ok
00abcde0abcdef00 XXX
A symbol may not appear three times in a row.
00abcdefabcdef12 ok
000bcdefabcdef12 XXX
There can be at most two pairs.
00abcdefabcdef11 ok
00abcde88edcba11 XXX
Also, how long would it take to generate all of them?
In combinatorics, counting is usually pretty straightforward, and can be accomplished much more rapidly than exhaustive generation of each alternative (or worse, exhaustive generation of a superset of possibilities, in order to filter them). One common technique is to reduce a given problem to combinations of a small(ish) number of disjoint sub-problems, where it is possible to see how many times each subproblem contributes to the total. This sort of analysis can often result in dynamic programming solutions, or, as below, in a memoised recursive solution.
Because combinatoric results are usually enormous numbers, brute-force generation of every possibility, even if it can be done extremely rapidly for each sequence, is impractical in all but the most trivial of cases. For this particular question, for example, I made a rough back-of-the-envelope estimate in a comment (since deleted):
There are 18446744073709551616 possible 64-bit (16 hex-digit) numbers, which is a very large number, about 18 billion billion. So if I could generate and test one of them per second, it would take me 18 billion seconds, or about 571 years. So with access to a cluster of 1000 96-core servers, I could do it all in about 54 hours, just a bit over two days. Amazon will sell me one 96-core server for just under a dollar an hour (spot prices), so a thousand of them for 54 hours would cost a little under 50,000 US dollars. Perhaps that's within reason. (But that's just to generate.)
Undoubtedly, the original question is part of an exploration of the possibility of trying every possible sequence by way of cracking a password, and it's not really necessary to produce a precise count of the number of possible passwords to demonstrate the impracticality of that approach (or its practicality for organizations which have a budget sufficient to pay for the necessary computing resources). As the above estimate shows, a password with 64 bits of entropy is not really that secure if what it is protecting is sufficiently valuable. Take that into account when generating a password for things you treasure.
Still, it can be interesting to compute precise combinatoric counts, if for no reason other than the intellectual challenge.
The following is mostly a proof-of-concept; I wrote it in Python because Python offers some facilities which would have been time-consuming to reproduce and debug in C: hash tables with tuple keys and arbitrary precision integer arithmetic. It could be rewritten in C (or, more easily, in C++), and the Python code could most certainly be improved, but given that it only takes 70 milliseconds to compute the count requested in the original question, the effort seems unnecessary.
This program carefully groups the possible sequences into different partitions and caches the results in a memoisation table. For the case of sequences of length 16, as in the OP, the cache ends up with 2540 entries, which means that the core computation is only done 2540 times:
# The basis of the categorization are symbol usage vectors, which count the
# number of symbols used (that is, present in a prefix of the sequence)
# `i` times, for `i` ranging from 1 to the maximum number of symbol uses
# (4 in the case of this question). I tried to generalise the code for different
# parameters (length of the sequence, number of distinct symbols, maximum
# use count, maximum number of pairs). Increasing any of these parameters will,
# of course, increase the number of cases that need to be checked and thus slow
# the program down, but it seems to work for some quite large values.
# Because constantly adjusting the index was driving me crazy, I ended up
# using 1-based indexing for the usage vectors; the element with index 0 always
# has the value 0. This creates several inefficiencies but the practical
# consequences are insignificant.
### Functions to manipulate usage vectors
def noprev(used, prevcnt):
"""Decrements the use count of the previous symbol"""
return used[:prevcnt] + (used[prevcnt] - 1,) + used[prevcnt + 1:]
def bump1(used, count):
"""Registers that one symbol (with supplied count) is used once more."""
return ( used[:count]
+ (used[count] - 1, used[count + 1] + 1)
+ used[count + 2:]
)
def bump2(used, count):
"""Registers that one symbol (with supplied count) is used twice more."""
return ( used[:count]
+ (used[count] - 1, used[count + 1], used[count + 2] + 1)
+ used[count + 3:]
)
def add1(used):
"""Registers a new symbol used once."""
return (0, used[1] + 1) + used[2:]
def add2(used):
"""Registers a new symbol used twice."""
return (0, used[1], used[2] + 1) + used[3:]
def count(NSlots, NSyms, MaxUses, MaxPairs):
"""Counts the number of sequences of length NSlots over an alphabet
of NSyms symbols where no symbol is used more than MaxUses times,
no symbol appears three times in a row, and there are no more than
MaxPairs pairs of symbols.
"""
cache = {}
# Canonical description of the problem, used as a cache key
# pairs: the number of pairs in the prefix
# prevcnt: the use count of the last symbol in the prefix
# used: for i in [1, NSyms], the number of symbols used i times
# Note: used[0] is always 0. This problem is naturally 1-based
def helper(pairs, prevcnt, used):
key = (pairs, prevcnt, used)
if key not in cache:
# avail_slots: Number of remaining slots.
avail_slots = NSlots - sum(i * count for i, count in enumerate(used))
if avail_slots == 0:
total = 1
else:
# avail_syms: Number of unused symbols.
avail_syms = NSyms - sum(used)
# We can't use the previous symbol (which means we need
# to decrease the number of symbols with prevcnt uses).
adjusted = noprev(used, prevcnt)[:-1]
# First, add single repeats of already used symbols
total = sum(count * helper(pairs, i + 1, bump1(used, i))
for i, count in enumerate(adjusted)
if count)
# Then, a single instance of a new symbol
if avail_syms:
total += avail_syms * helper(pairs, 1, add1(used))
# If we can add pairs, add the already not-too-used symbols
if pairs and avail_slots > 1:
total += sum(count * helper(pairs - 1, i + 2, bump2(used, i))
for i, count in enumerate(adjusted[:-1])
if count)
# And a pair of a new symbol
if avail_syms:
total += avail_syms * helper(pairs - 1, 2, add2(used))
cache[key] = total
return cache[key]
rv = helper(MaxPairs, MaxUses, (0,)*(MaxUses + 1))
# print("Cache size: ", len(cache))
return rv
# From the command line, run this with the command:
# python3 SLOTS SYMBOLS USE_MAX PAIR_MAX
# There are defaults for all four arguments.
if __name__ == "__main__":
from sys import argv
NSlots, NSyms, MaxUses, MaxPairs = 16, 16, 4, 2
if len(argv) > 1: NSlots = int(argv[1])
if len(argv) > 2: NSyms = int(argv[2])
if len(argv) > 3: MaxUses = int(argv[3])
if len(argv) > 4: MaxPairs = int(argv[4])
print (NSlots, NSyms, MaxUses, MaxPairs,
count(NSlots, NSyms, MaxUses, MaxPairs))
Here's the result of using this program to compute the counts for all valid sequence lengths (a sequence longer than 64 is impossible given the constraints), taking less than 11 seconds in total:
$ time for i in $(seq 1 65); do python3 -m count $i 16 4; done
1 16 4 2 16
2 16 4 2 256
3 16 4 2 4080
4 16 4 2 65040
5 16 4 2 1036800
6 16 4 2 16524000
7 16 4 2 263239200
8 16 4 2 4190907600
9 16 4 2 66663777600
10 16 4 2 1059231378240
11 16 4 2 16807277588640
12 16 4 2 266248909553760
13 16 4 2 4209520662285120
14 16 4 2 66404063202640800
15 16 4 2 1044790948722393600
16 16 4 2 16390235567479693920
17 16 4 2 256273126082439298560
18 16 4 2 3992239682632407024000
19 16 4 2 61937222586063601795200
20 16 4 2 956591119531904748877440
21 16 4 2 14701107045788393912922240
22 16 4 2 224710650516510785696509440
23 16 4 2 3414592455661342007436384000
24 16 4 2 51555824538229409502827923200
25 16 4 2 773058043102197617863741843200
26 16 4 2 11505435580713064249590793862400
27 16 4 2 169863574496121086821681298457600
28 16 4 2 2486228772352331019060452730124800
29 16 4 2 36053699633157440642183732148192000
30 16 4 2 517650511567565591598163978874476800
31 16 4 2 7353538304042081751756339918288153600
32 16 4 2 103277843408210067510518893242552998400
33 16 4 2 1432943471827935940003777587852746035200
34 16 4 2 19624658467616639408457675812975159808000
35 16 4 2 265060115658802288611235565334010714521600
36 16 4 2 3527358829586230228770473319879741669580800
37 16 4 2 46204536626522631728453996238126656113459200
38 16 4 2 595094456544732751483475986076977832633088000
39 16 4 2 7527596027223722410480884495557694054538752000
40 16 4 2 93402951052248340658328049006200193398898022400
41 16 4 2 1135325942092947647158944525526875233118233702400
42 16 4 2 13499233156243746249781875272736634831519281254400
43 16 4 2 156762894800798673690487714464110515978059412992000
44 16 4 2 1774908625866508837753023260462716016827409668608000
45 16 4 2 19556269668280714729769444926596793510048970792448000
46 16 4 2 209250137714454234944952304185555699000268936613376000
47 16 4 2 2169234173368534856955926000562793170629056490849280000
48 16 4 2 21730999613085754709596718971411286413365188258316288000
49 16 4 2 209756078324313353775088590268126891517374425535395840000
50 16 4 2 1944321975918071063760157244341119456021429461885104128000
51 16 4 2 17242033559634684233385212588199122289377881249323872256000
52 16 4 2 145634772367323301463634877324516598329621152347129008128000
53 16 4 2 1165639372591494145461717861856832014651221024450263064576000
54 16 4 2 8786993110693628054377356115257445564685015517718871715840000
55 16 4 2 61931677369820445021334706794916410630936084274106426433536000
56 16 4 2 404473662028342481432803610109490421866960104314699801413632000
57 16 4 2 2420518371006088374060249179329765722052271121139667645435904000
58 16 4 2 13083579933158945327317577444119759305888865127012932088217600000
59 16 4 2 62671365871027968962625027691561817997506140958876900738150400000
60 16 4 2 259105543035583039429766038662433668998456660566416258886520832000
61 16 4 2 889428267668414961089138119575550372014240808053275769482575872000
62 16 4 2 2382172342138755521077314116848435721862984634708789861244239872000
63 16 4 2 4437213293644311557816587990199342976125765663655136187709235200000
64 16 4 2 4325017367677880742663367673632369189388101830634256108595793920000
65 16 4 2 0
real 0m10.924s
user 0m10.538s
sys 0m0.388s
This program counts 16,390,235,567,479,693,920 passwords.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
enum { RLength = 16 }; // Required length of password.
enum { NChars = 16 }; // Number of characters in alphabet.
typedef struct
{
/* N[i] counts how many instances of i are left to use, as constrained
by rule 3.
*/
unsigned N[NChars];
/* NPairs counts how many more pairs are allowed, as constrained by
rule 5.
*/
unsigned NPairs;
/* Used counts how many characters have been distinguished by choosing
them as a representative. Symmetry remains unbroken for NChars - Used
characters.
*/
unsigned Used;
} Supply;
/* Count the number of passwords that can be formed starting with a string
(in String) of length Length, with state S.
*/
static uint64_t Count(int Length, Supply *S, char *String)
{
/* If we filled the string, we have one password that obeys the rules.
Return that. Otherwise, consider suffixing more characters.
*/
if (Length == RLength)
return 1;
// Initialize a count of the number of passwords beginning with String.
uint64_t C = 0;
// Consider suffixing each character distinguished so far.
for (unsigned Char = 0; Char < S->Used; ++Char)
{
/* If it would violate rule 3, limiting how many times the character
is used, do not suffix this character.
*/
if (S->N[Char] == 0) continue;
// Does the new character form a pair with the previous character?
unsigned IsPair = String[Length-1] == Char;
if (IsPair)
{
/* If it would violate rule 4, a character may not appear three
times in a row, do not suffix this character.
*/
if (String[Length-2] == Char) continue;
/* If it would violate rule 5, limiting how many times pairs may
appear, do not suffix this character.
*/
if (S->NPairs == 0) continue;
/* If it forms a pair, and our limit is not reached, count the
pair.
*/
--S->NPairs;
}
// Count the character.
--S->N[Char];
// Suffix the character.
String[Length] = Char;
// Add as many passwords as we can form by suffixing more characters.
C += Count(Length+1, S, String);
// Undo our changes to S.
++S->N[Char];
S->NPairs += IsPair;
}
/* Besides all the distinguished characters, select a representative from
the pool (we use the next unused character in numerical order), count
the passwords we can form from it, and multiply by the number of
characters that were in the pool.
*/
if (S->Used < NChars)
{
/* A new character cannot violate rule 3 (it has not been used 4 times
yet), rule 4 (it has not appeared three times in a row), or rule 5
(it does not form a pair that could exceed the pair limit). So we know,
without any tests, that we can suffix it.
*/
// Use the next unused character as a representative.
unsigned Char = S->Used;
/* By symmetry, we could use any of the remaining NChars - S->Used
characters here, so the total number of passwords that can be
formed from the current state is that number times the number that
can be formed by suffixing this particular representative.
*/
unsigned Multiplier = NChars - S->Used;
// Record another character is being distinguished.
++S->Used;
// Decrement the count for this character and suffix it to the string.
--S->N[Char];
String[Length] = Char;
// Add as many passwords as can be formed by suffixing a new character.
C += Multiplier * Count(Length+1, S, String);
// Undo our changes to S.
++S->N[Char];
--S->Used;
}
// Return the computed count.
return C;
}
int main(void)
{
/* Initialize our "supply" of characters. There are no distinguished
characters, two pairs may be used, and each character may be used at
most 4 times.
*/
Supply S = { .Used = 0, .NPairs = 2 };
for (unsigned Char = 0; Char < NChars; ++Char)
S.N[Char] = 4;
/* Prepare space for string of RLength characters preceded by a sentinel
(-1). The sentinel permits us to test for a repeated character without
worrying about whether the indexing goes outside array bounds.
*/
char String[RLength+1] = { -1 };
printf("There are %" PRIu64 " possible passwords.\n",
Count(0, &S, String+1));
}
The number of possibilities is fixed. You could either come up with an algorithm to generate valid combinations, or you could just iterate over the entire problem space and check each combination using a simple function that tests the validity of the combination (see the sketch below).
How long it takes depends on the computer and the efficiency of the implementation. You could easily make it a multithreaded application.
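Such a validity check might look like this (a sketch in C; the symbols 0-9a-f are assumed mapped to values 0..15):
#include <stdbool.h>

bool valid(const unsigned char s[16]) {
    int uses[16] = {0};
    int pairs = 0;
    for (int i = 0; i < 16; i++) {
        if (++uses[s[i]] > 4)
            return false; // rule: no symbol more than 4 times
        if (i >= 2 && s[i] == s[i-1] && s[i] == s[i-2])
            return false; // rule: no symbol three times in a row
        if (i >= 1 && s[i] == s[i-1] && ++pairs > 2)
            return false; // rule: at most two pairs
    }
    return true;
}
Though, as the estimate above shows, iterating over all 16^16 sequences with such a check is impractical.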

MPI IO - MPI_File_seek - How to calculate correct offset size for shared file

Update. Now the junk is removed from the end of the shared file, but there is still some "junk" in the middle of the file, where process 0 ends writing and process 1 starts writing:
10 4 16 16 0 2 2 3 1 3 4 2 4 5 1 0 4 6 2 8 5 3 8 10 4 9 5 4 ^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#10 4 16 16 1 2 6 0 3 5 2 2 2 8 1 5 6 5 6 6 4 8 9 7 6 2 1 3 6 4 10 2 5 7 7 6 10 6 5 9 9 10 6 7 5 8
However, if I count the gibberish, I get 40. When I try to do:
offset = (length-40)*my_rank;
it works, but it is not a very scalable and robust solution. Therefore I need to compute this number for a more general solution. Does anybody see how this can be done? Here is my current function:
#define MAX_BUFF 50
int write_parallel(Context *context, int num_procs, int my_rank, MPI_Status status){
int written_chars = 0;
int written_chars_accumulator = 0;
int n = context->n;
void * charbuffer = malloc(n*MAX_BUFF);
if (charbuffer == NULL) {
exit(1);
}
MPI_File file;
MPI_Offset offset;
MPI_File_open(MPI_COMM_WORLD,"test_write.txt",
MPI_MODE_CREATE|MPI_MODE_WRONLY,
MPI_INFO_NULL, &file);
written_chars = snprintf((char *)charbuffer, n*MAX_BUFF, "%d %d %d %d\n", n, context->BOX_SIDE, context->MAX_X, context->MAX_Y);
if (written_chars < 0){ exit(1); }
written_chars_accumulator += written_chars;
int i,j;
for(i=0;i<n;i++){
if(context->allNBfrom[i]>0){
written_chars = snprintf((char *)charbuffer+written_chars_accumulator, (n*MAX_BUFF - written_chars_accumulator), "%d %d %d ", i, context->x[i], context->y[i]);
if (written_chars < 0){ exit(1); }
written_chars_accumulator += written_chars;
for(j=0;j<context->allNBfrom[i];j++){
written_chars = snprintf((char *)charbuffer+written_chars_accumulator, (n*MAX_BUFF - written_chars_accumulator), "%d ", context->delaunayEdges[i][j]);
if (written_chars < 0){ exit(1); }
written_chars_accumulator += written_chars;
}
written_chars = snprintf((char *)charbuffer+written_chars_accumulator, (n*MAX_BUFF - written_chars_accumulator), "\n");
if (written_chars < 0){ exit(1); }
written_chars_accumulator += written_chars;
}
}
int length = strlen((char*)charbuffer);
offset = (length-40)*my_rank; //Why is this correct? The constant 40 needs to be computed in some way...
//printf("proc=%d:\n%s",my_rank,charbuffer);
MPI_File_seek(file,offset,MPI_SEEK_SET);
MPI_File_write(file,charbuffer,length,MPI_CHAR,&status);
MPI_File_close(&file);
return 0;
}
Here is my current result with this solution, which is also correct: 10 4 16 16 0 2 2 3 1 3 4 2 4 5 1 0 4 6 2 8 5 3 8 10 4 9 5 4 10 4 16 16 1 2 6 0 3 5 2 2 2 8 1 5 6 5 6 6 4 8 9 7 6 2 1 3 6 4 10 2 5 7 7 6 10 6 5 9 9 10 6 7 5 8
But it will not scale, because I don't know how to compute the number of gibberish elements. Does anybody have a clue?
If I understand your code, your goal is to remove the NULL chars in between your text blocks. In this parallel writing approach there is no way to solve this without violating the safe boundaries of your buffers. None of the processes knows in advance how long the output of the other processes is going to be. This makes it hard (or impossible) to have dynamic ranges for the write offset.
If you shift your offset, then you will be writing in an area not reserved for that process, and the program could overwrite data.
In my opinion there are two solutions to your problem of removing the nulls from the file:
Write separate files and concatenate them after the program is finished.
Post-process your output file with a program that reads/copies chars from your output file and skips NULL bytes (and saves the result as a new file); see the sketch below.
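For example, option 2 could be as simple as (a sketch; the file names are just placeholders):
#include <stdio.h>

int main(void) {
    FILE *in = fopen("test_write.txt", "rb");
    FILE *out = fopen("test_write_clean.txt", "wb");
    if (!in || !out) return 1;
    int c;
    while ((c = fgetc(in)) != EOF)
        if (c != '\0') fputc(c, out); // drop the NULL padding bytes
    fclose(in);
    fclose(out);
    return 0;
}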
offset = (length-40)*my_rank; //Why is this correct? the constant = 40 needs to be computet in some way...
The way you compute this is with MPI_Scan. As others have noted, you need to know how much data each process will contribute.
I'm pretty sure I've answered this before. Adapted to your code, where each process has computed a 'length' of some string:
length = strlen(charbuffer);
MPI_Scan(&length, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -=length; /* MPI_Scan is inclusive, but that
also means it has a defined value on rank 0 */
MPI_File_write_at_all(fh, new_offset, charbuffer, length, MPI_CHAR, &status);
The important feature of MPI_Scan is that it runs through your ranks and applies an operation (in this case, SUM) over all the preceding ranks. After the call, rank 0 holds just its own value; rank 1 holds the sum of itself and rank 0; rank 2 holds the sum of itself, rank 1, and rank 0; and so on.
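Put together, the write path might look like this (an untested sketch; I kept your file name and buffer):
#include <mpi.h>
#include <string.h>

void write_shared(const char *charbuffer, MPI_Comm comm) {
    MPI_File fh;
    long long length = (long long)strlen(charbuffer);
    long long new_offset = 0;
    MPI_File_open(comm, "test_write.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    // Inclusive prefix sum of the per-rank lengths...
    MPI_Scan(&length, &new_offset, 1, MPI_LONG_LONG_INT, MPI_SUM, comm);
    // ...made exclusive: new_offset is now the bytes written by lower ranks.
    new_offset -= length;
    MPI_File_write_at_all(fh, (MPI_Offset)new_offset, charbuffer,
                          (int)length, MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}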

Incorrect output with counting and tallying bits in C

This is only my 2nd programming class. There are 30 rooms. We have to see what is in each room and tally it. I already used the for loop to go through the 30 rooms, and I have tried to use a bit counter to see what is in each room. I am not getting the correct output after I redirect the sample input. When I printf("%d", itemCnt[loc]);, my output is 774778414trolls
When I printf("%d", itemCnt[0]);, my output is 0trolls. I'm just trying to get one output right so I can figure out how to get the rest of the 8 outputs. From the sample output, the first number is supposed to be 6, followed by 6, 1, 4, 4 ... and so on. Below are sample inputs/outputs and what I have so far in code.
Sample input:
1 20 ##
2 21 #A
3 22 ##
4 23 #1
5 22 ##
6 22 ##
7 22 ##
8 22 ##
9 23 #Z Here be trolls - not!
10 23 #+
12 23 ##
13 24 ##
11 22 ##
14 22 #2
15 21 #1
16 20 ##
17 19 ##
18 20 ##
19 19 ##
20 18 ##
21 17 #*
22 16 #*
23 15 #%
0 14 #7
0 gold_bar
1 silver_bar
2 diamond
3 copper_ring
4 jumpy_troll
5 air
6 angry_troll
7 plutonium_troll
Sample Output:
6 gold_bar
6 silver_bar
1 diamond
4 copper_ring
4 jumpy_troll
8 air
15 angry_troll
0 plutonium_troll
code
int main()
{
// contains x and y coordinate
int first, second;
char third[100];
char fourth[100];
char Map[30][30];
// map initialization
for(int x=0; x<30; x++){
for(int y=0; y<30; y++){
Map[x][y] = '.';
}
}
while(scanf("%d %d %s",&first, &second, third) != -1) {
// Condition 1: a zero coordinate
if (first==0 || second==0) exit(0);
// Condition 2: coordinate out of range
if (first<0 || first>30 || second<0 || second>30){
printf("Error: out of range 0-30!\n");
exit(1);
}
Map[second-1][first-1] = third[1];
fgets(fourth, 100, stdin);
// bit counter
int itemCnt[8] = {0}; // array to hold count of items, index is item type
unsigned char test; // holds contents of room.
int loc;
for(loc = 0; loc < 8; loc++) // loop over every bit and see if it is set
{
unsigned char bitPos = 1 << loc; // generate a bit-mask
if((test & bitPos) == bitPos)
++itemCnt[loc];
}
// print the map
for(int h=0; h<30; h++){
for(int v=0; v<30; v++){
printf("%c", Map[h][v]);
}
printf("\n");
}
// print values
printf("%d", itemCnt[0]);
}
return 0;
}
test is not initialized. It looks like you intended to assign 'third[1]' to test.
Also, 774778414 = 0x2E2E2E2E in hex, and 0x2E is the numeric value of ASCII '.', your initial value for map locations. (Tip: when you see wild decimals like that, try Google. I entered, "774778414 in hex" without the quotes.)
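Concretely, the minimal fix is something like (a sketch):
unsigned char test = (unsigned char)third[1]; // contents of the room, taken from the input line
for (int loc = 0; loc < 8; loc++) // loop over every bit and see if it is set
{
    unsigned char bitPos = 1 << loc; // generate a bit-mask
    if ((test & bitPos) == bitPos)
        ++itemCnt[loc];
}
(You will also want itemCnt declared before the while loop, so the tallies accumulate across rooms instead of being reset on every line.)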
I would also suggest breaking down the code into two functions: the first reads from stdin to populate Map (like you do), and the second reads from stdin to populate 8 C strings to describe your objects. It's important to note, the first loop should not go until end of input, because your posted input continues with descriptions, not strictly 3 fields like the beginning.

C array permutations with macros

Is it possible to generate a specific permutation of an array with a macro in C?
i.e. If I have an array X with elements:
0 1 2 3 4 5
x = ["0","1","1","0","1","0"]
I was thinking there may be some macro foo for something like this:
#define S_2Permute(x) = [x[5], x[3], x[4], x[2], x[1]]
where I redefine the order of the array, so the element in the original position 5 is now in position 0.
Any ideas?
EXAMPLE USE
I am starting to create an implementation of the DES encryption algorithm. DES requires several permutation/expansions where I would have to re-order all of the elements in the array, sometimes shrinking the array and sometimes expanding it. I was hoping to just be able to define a macro to permute the arrays for me.
EDIT2
Well in DES the first step is something called the initial permutation. So initially I have some 64-bit key, which for this example can be 0-15 hex:
0123456789ABCDEF
which expands to:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
The IP (initial permutation) would permute this string so that every element in the array would be in a new position:
IP =
58 50 42 34 26 18 10 2
60 52 44 36 28 20 12 4
62 54 46 38 30 22 14 6
64 56 48 40 32 24 16 8
57 49 41 33 25 17 9 1
59 51 43 35 27 19 11 3
61 53 45 37 29 21 13 5
63 55 47 39 31 23 15 7
So the new 1st element in the bitstring would be the 58th element(bit) from the original bitstring.
So I would have all of these bits stored in an array of characters:
x = [0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,1,1,0,0,
1,1,1,1,0,0,0,1,0,0,1,1,0,1,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,1,0,1,1,1,1]
and then just call
IP_PERMUTE(x);
And macro magic will have moved all of the bits into the new correct positions.
Absolutely - you're almost there with your example. Try this:
#define S_2Permute(x) {x[5], x[3], x[4], x[2], x[1]}
Then later:
int x[] = {1,2,3,4,5,6};
int y[] = S_2Permute(x); // y is now {6,4,5,3,2}
Two things to remember:
1) in C, arrays are numbered from 0, so it's possible you meant:
#define S_2Permute(x) {x[4], x[2], x[3], x[1], x[0]}
2) If you're using gcc, you can compile with -E to see the output from the preprocessor (very good for debugging macro expansions).
However, I don't think I'd actually do it this way - I'd say the code will be easier to read (and potentially less error prone) if you generate the permutations programmatically - and I doubt that it'll be a large performance hit.
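For instance, the IP table from the question can be applied with a small table-driven function instead of a macro (a sketch; permute() is my name for it, and DES tables are conventionally 1-based):
static const int IP[64] = {
    58,50,42,34,26,18,10, 2, 60,52,44,36,28,20,12, 4,
    62,54,46,38,30,22,14, 6, 64,56,48,40,32,24,16, 8,
    57,49,41,33,25,17, 9, 1, 59,51,43,35,27,19,11, 3,
    61,53,45,37,29,21,13, 5, 63,55,47,39,31,23,15, 7
};

void permute(const char *src, char *dst, const int *table, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[table[i] - 1]; // table entries are 1-based bit positions
}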
Since you say you're having trouble compiling this, here's a test program that works for me in gcc 4.6.1:
#include <stdio.h>
#define S_2Permute(x) {x[5], x[3], x[4], x[2], x[1]}
int main(void) {
int x[] = {1,2,3,4,5,6};
int y[] = S_2Permute(x);
for(int i = 0; i < 5; i++) {
printf("%d,",y[i]);
}
printf("\n");
}
I compiled with gcc test.c -std=c99 -Wall
I'm new so apologies if it's not ok to offer a different means of solution but have you considered using an inline function instead of a macro?
I love single lines of code that do a lot as much as the next guy, but it makes more sense to me to do it this way:
//I would have an array that defined how I wanted to swap the positions, I'll assume 5 elements
short reordering[5] = {4,1,3,2,0};
inline void permuteArray(char array[]) {
char swap = array[reordering[0]];
array[reordering[0]] = array[reordering[1]];
array[reordering[1]] = array[reordering[2]];
array[reordering[2]] = array[reordering[3]];
array[reordering[3]] = array[reordering[4]];
array[reordering[4]] = swap;
}
This may not be as pretty or efficient as a macro, but it could save you some headaches managing and maintaining your code (and it could always be swapped for the macro version Timothy suggests).
I am doing something very similar (note: this is C#). The variable that comes in is a ulong, so I convert it to a bit array, rearrange all the bits, and then turn it back into a ulong. This is my code:
public override ulong Permutation(ulong input, int[] permuation)
{
byte[] test = BitConverter.GetBytes(input);
BitArray test2 = new BitArray(test);
BitArray final = new BitArray(test);
ulong x = 0;
ulong y = 1;
for (int i = 0; i < permuation.Length; i++)
{
final[i] = test2[(permuation[i]-1)];
}
for (int i = 0; i < final.Length; i++)
{
if (final[i] == true)
{
x += (1 * y);
}
else
{
x += (0 * y);
}
y = y * 2;
}
return x;
}

Fast method to copy memory with translation - ARGB to BGR

Overview
I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel, Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel, Blue, Green, and Red.
So the brute force method is:
// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB orig[IMAGESIZE];
BGR dest[IMAGESIZE];
for(x = 0; x < IMAGESIZE; x++)
{
dest[x].Red = orig[x].Red;
dest[x].Green = orig[x].Green;
dest[x].Blue = orig[x].Blue;
}
However, I need more speed than is provided by a loop and three byte copies. I'm hoping there might be a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32 bit machine.
Additional info
Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely into 32 bit boundaries.
I have access to OpenCL - and while that requires moving the entire buffer into the GPU memory, then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient may make this a worthwhile exploration.
While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.
Running on Intel processors (Core 2 and above) and thus there are streaming and data processing commands I'm aware exist, but don't know about - perhaps pointers on where to look for specialized data handling instructions would be good.
This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.
Pseudo-code is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.
I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.
Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on memory pointed-to by pointer inputs in registers. Apparently this worked ok for the benchmark. I fixed the constraints to be properly safe for all callers. This should not affect benchmark numbers, only make sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).
In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.
Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.
--- #PeterCordes
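(For reference, that portable-intrinsics version might look like the sketch below: the same shuffle mask as swap3/swap2_2 further down, unaligned loads and stores, and it still writes 4 bytes past the end of dest.)
#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h> // SSSE3

void swap_portable(const uint8_t *orig, uint8_t *dest, size_t imagesize) {
    const __m128i mask = _mm_setr_epi8(3,2,1, 7,6,5, 11,10,9, 15,14,13,
                                       -1,-1,-1,-1); // -1 lanes produce zeros
    const uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __m128i px = _mm_loadu_si128((const __m128i *)orig);
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(px, mask));
    }
}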
The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).
void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
unsigned x;
for(x = 0; x < imageSize; x++) {
*((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
// warning: strict-aliasing UB. Use memcpy for unaligned loads/stores
}
}
The second method performs the same operation, but uses an inline assembly loop instead of a C loop.
void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
"0:\n\t" // do {
"movl (%1),%%eax\n\t"
"bswapl %%eax\n\t"
"movl %%eax,(%0)\n\t" // copy a dword byte-reversed
"add $4,%1\n\t" // orig += 4 bytes
"add $3,%0\n\t" // dest += 3 bytes
"dec %2\n\t"
"jnz 0b" // }while(--imageSize)
: "+r" (dest), "+r" (orig), "+r" (imageSize)
: // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
: "flags", "eax", "memory"
);
}
The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; on later CPUs it runs the same as movdqu, so there's no downside to using it.)
typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 16, dest += 12) {
__builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
}
}
Finally, the fourth version is the inline assembly equivalent of the third.
void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
asm volatile (
"lddqu %3,%%xmm1\n\t"
"0:\n\t"
"lddqu (%1),%%xmm0\n\t"
"pshufb %%xmm1,%%xmm0\n\t"
"movdqu %%xmm0,(%0)\n\t"
"add $16,%1\n\t"
"add $12,%0\n\t"
"sub $4,%2\n\t"
"jnz 0b"
: "+r" (dest), "+r" (orig), "+r" (imagesize)
: "m" (mask) // whole array as a memory operand. "x" would get the compiler to load it
: "flags", "xmm0", "xmm1", "memory"
);
}
(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)
On my 2010 MacBook Pro, 2.4 GHz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3: 9.3163 milliseconds
Version 4: 9.3584 milliseconds
As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel macs, which didn't support SSSE3.
Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.
Average | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1: 11.13120 ms | 0.81076 ms (7.3%)
Version 2: 11.27092 ms | 0.66209 ms (5.9%)
Version 3: 9.29184 ms | 0.27851 ms (3.0%)
Version 4: 9.40948 ms | 0.32702 ms (3.5%)
Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
The obvious, using pshufb.
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
assert((uintptr_t)orig % 16 == 0);
assert(imagesize % 4 == 0);
__m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 16, dest += 12) {
_mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
}
}
Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.
EDIT2: More detail about the underlying code structure:
With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3 byte pixel is only alignable to 16-bytes for every 16 pixels, we batch up 16 pixels at a time using a combination of shuffles and masks and ors of 16 input pixels at a time.
From LSB to MSB, the inputs look like this, ignoring the specific components:
s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333
and the outputs look like this:
d[0]: 000 000 000 000 111 1
d[1]: 11 111 111 222 222 22
d[2]: 2 222 333 333 333 333
So to generate those outputs, you need to do the following (I will specify the actual transformations later):
d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_1_low(s[2]), f_1_high(s[3]))
Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:
combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000
But the actual transformations (f_<x>_low, f_<x>_high) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination for brevity):
d[0]=
s[0][0].Blue s[0][0].Green s[0][0].Red
s[0][1].Blue s[0][1].Green s[0][1].Red
s[0][2].Blue s[0][2].Green s[0][2].Red
s[0][3].Blue s[0][3].Green s[0][3].Red
s[1][0].Blue s[1][0].Green s[1][0].Red
s[1][1].Blue
If you translate the above into byte offsets from source to dest, you get:
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7
(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)
Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):
f_0_low= 3 2 1 7 6 5 11 10 9 15 14 13 X X X X
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
f_1_high= X X X X X X X X 3 2 1 7 6 5 11 10
f_2_low= 9 15 14 13 X X X X X X X X X X X X
f_2_high= X X X X 3 2 1 7 6 5 11 10 9 15 14 13
We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.
EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't necessarily want to flush the cache. Plus it should be aligned anyway so you get free perf!
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig and dest are 16-byte aligned (the stream stores require an aligned dest)
// imagesize is a multiple of 16 (each iteration handles 16 pixels / 64 bytes)
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
assert((uintptr_t)orig % 16 == 0);
assert(imagesize % 16 == 0);
__m128i shuf0 = _mm_set_epi8(
-128, -128, -128, -128, // top 4 bytes are not used
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
__m128i shuf1 = _mm_set_epi8(
7, 1, 2, 3, // top 4 bytes go to the first pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel
__m128i shuf2 = _mm_set_epi8(
10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9); // bottom 4 go to third pixel
__m128i shuf3 = _mm_set_epi8(
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
-128, -128, -128, -128); // unused
__m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
__m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
__m128i mask2 = _mm_set_epi32(0, 0, 0, -1);
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 64, dest += 48) {
__m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
__m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
__m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
__m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
_mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b))); // (a & mask0) | (b & ~mask0)
_mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
_mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
}
}
I am coming a little late to the party, seeing that the community has already decided for just a poseur's pshufb answer, but distributing 2000 reputation is so extremely generous that I have to give it a try.
Here's my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code showing a 4x speedup if you both do the bit-twiddling like me AND activate compiler optimization (register optimization, loop unrolling):
#include "stdlib.h"
#include "stdio.h"
#include "time.h"
#define UInt8 unsigned char
#define IMAGESIZE (1920*1080)
int main() {
time_t t0, t1;
int frames;
int frame;
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
if(!orig) {printf("nomem1");}
BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
if(!dest) {printf("nomem2");}
printf("to start original hit a key\n");
getch();
t0 = time(0);
frames = 1200;
for(frame = 0; frame<frames; frame++) {
int x; for(x = 0; x < IMAGESIZE; x++) {
dest[x].Red = orig[x].Red;
dest[x].Green = orig[x].Green;
dest[x].Blue = orig[x].Blue;
}
}
t1 = time(0);
printf("finished original of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook the original took 16 sec
// (8 sec with compiler optimization -O3) so at 60 FPS
// (instead of the 1200) this would be faster than realtime
// (if you disregard any other rendering you have to do).
// However if you either want to do other/more processing
// OR want faster than realtime processing for e.g. a video-conversion
// program then this would have to be a lot faster still.
printf("to start alternative hit a key\n");
getch();
t0 = time(0);
frames = 1200;
unsigned int* reader;
unsigned int* end;
unsigned int cur; // your question guarantees 32 bit cpu
unsigned int next;
unsigned int temp;
unsigned int* writer;
for(frame = 0; frame<frames; frame++) {
reader = (void*)orig;
end = reader + IMAGESIZE; // one past the last 32-bit input pixel
writer = (void*)dest;
next = *reader;
reader++;
while(reader<end) {
cur = next;
next = *reader;
// in the following the numbers are of course the bitmasks for
// 0-7 bits, 8-15 bits and 16-23 bits out of the 32
temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255);
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280);
*writer = temp;
reader++;
writer++;
cur = next;
next = *reader;
temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680);
*writer = temp;
reader++;
writer++;
}
}
t1 = time(0);
printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);
// on my core 2 subnotebook this alternative took 10 sec
// (4 sec with compiler optimization -O3)
}
The results are these (on my core 2 subnotebook):
F:\>gcc b.c -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds
F:\>gcc b.c -O3 -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds
You want to use a Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device with 512 Kbytes of moves.
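For reference, the classic device is (a sketch; count must be positive):
void duff_copy(char *to, const char *from, int count) {
    int n = (count + 7) / 8; // number of times through the loop, rounded up
    switch (count % 8) {     // jump into the middle of the unrolled body
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}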
In combination with one of the fast conversion functions here, given access to Core 2s, it might be wise to split the translation into threads which each work on, say, a fourth of the data, as in this pseudocode:
void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
thread threads[] = {
create_thread(bgrFromArgb, dest, src, n/4),
create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
}
join_threads(threads);
}
This assembly function should do it; however, I don't know whether you want to keep the old data or not: this function overwrites it.
The code is for MinGW GCC with Intel assembly flavour; you will have to modify it to suit your compiler/assembler.
extern "C" {
int convertARGBtoBGR(uint buffer, uint size);
__asm(
".globl _convertARGBtoBGR\n"
"_convertARGBtoBGR:\n"
" push ebp\n"
" mov ebp, esp\n"
" sub esp, 4\n"
" mov esi, [ebp + 8]\n"
" mov edi, esi\n"
" mov ecx, [ebp + 12]\n"
" cld\n"
" convertARGBtoBGR_loop:\n"
" lodsd ; load value from [esi] (4byte) to eax, increment esi by 4\n"
" bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
" stosd ; store 4 bytes to [edi], increment edi by 4\n"
" sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
" loop convertARGBtoBGR_loop\n"
" leave\n"
" ret\n"
);
}
You should call it like so:
convertARGBtoBGR( &buffer, IMAGESIZE );
This function accesses memory only twice per pixel/packet (1 read, 1 write), compared to your brute force method which had (at least, assuming it wasn't optimized into registers) 3 read and 3 write operations. The method is the same, but the implementation makes it more efficient.
You can do it in chunks of 4 pixels, moving 32 bits with unsigned long pointers. Just think that with 4 32-bit pixels you can construct, by shifting and OR/AND, 3 words representing 4 24-bit pixels, like this:
//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR 32 bits writing (4 pixels)
Shifting operations are always done in 1 instruction cycle in all modern 32/64-bit processors (barrel shifting technique), so it's the fastest way of constructing those 3 words for writing; bitwise AND and OR are also blazing fast.
Like this:
//assuming we have 4 ARGB1 ... ARGB4 pixels and 3 32-bit words, W1, W2 and W3, to write
// and *dest is an unsigned long pointer for the destination
W1 = ((ARGB1 & 0x000000ff) << 24) | ((ARGB1 & 0x0000ff00) << 8) | ((ARGB1 & 0x00ff0000) >> 8) | (ARGB2 & 0x000000ff);
*dest++ = W1;
and so on.... with next pixels in a loop.
You'll need some adjusting with images that are not multiple of 4, but I bet this is the fastest approach of all, without using assembler.
And btw, forget about using structs and indexed access; those are the SLOWEST ways of all for moving data. Just take a look at a disassembly listing of a compiled C++ program and you'll agree with me.
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:
union uARGB
{
ARGB argb;
UInt32 x;
};
union uBGRA
{
struct
{
BGR bgr;
UInt8 Alpha;
} bgra;
UInt32 x;
};
and then for your code kernel, with whatever loop unrolling is appropriate:
inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
uARGB* puargb = (uARGB*)pargb;
uBGRA ubgra;
ubgra.x = __byte_reverse_32(puargb->x);
*pbgr = ubgra.bgra.bgr;
}
where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word (for example, __builtin_bswap32() in GCC and Clang).
To summarize the underlying approach:
view ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure
Although you can use some tricks based on CPU features, this kind of operation can be done faster with a GPU.
It seems that you use C/C++, so your alternatives for GPU programming (on the Windows platform) may be:
DirectCompute ( DirectX 11 ) See this video
Microsoft Research Project Accelerator Check this link
Cuda
"google" GPU programming ...
In short, use the GPU for this kind of array operation to make the calculations faster; GPUs are designed for it.
I haven't seen anyone showing an example of how to do it on the GPU.
A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.
I started out with the smooth_opengl3.c example from the freeglut distribution.
The data is copied as YUV into the texture and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all macs nowadays and it will be significantly faster than all the CPU approaches.
Note that I have no experience on how you get the data back. In theory glReadPixels should read the data back but I never measured its performance.
OpenCL might be the easier approach, but then I will only start developing for that when I have a notebook that supports it.
(defparameter *vertex-shader*
"void main(){
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
gl_FrontColor = gl_Color;
gl_TexCoord[0] = gl_MultiTexCoord0;
}
")
(progn
(defparameter *fragment-shader*
"uniform sampler2D textureImage;
void main()
{
vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
float v=q.z;
if(int(gl_FragCoord.x)%2 == 0)
v=q.x;
float x=0; // 1./255.;
v-=.278431;
v*=1.7;
if(v>=(1.0-x))
gl_FragColor = vec4(255,0,0,255);
else if (v<=x)
gl_FragColor = vec4(0,0,255,255);
else
gl_FragColor = vec4(v,v,v,255);
}
")
