memmove vs. copying individual array elements - C

In CLRS chapter 2 there is an exercise that asks whether the worst-case running time of insertion sort can be improved to O(n lg n). I looked at this question and concluded that it cannot be done.
The worst-case complexity cannot be improved, but would the real running time be better using memmove compared to moving the array elements individually?
Code for individually moving elements
void insertion_sort(int arr[], int length)
{
    /*
     * Sorts into increasing order.
     * For decreasing order, change the comparison in the for loop.
     */
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            arr[k + 1] = arr[k];
        }
        arr[k + 1] = temp;
    }
}
Code for moving elements using memmove
#include <string.h>  /* memmove */

void insertion_sort(int arr[], int length)
{
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            ;
        }
        if (k != j - 1) {
            /* shift arr[k+1 .. j-1] one slot to the right:
               that is j - k - 1 elements, not j - k - 2 */
            memmove(&arr[k + 2], &arr[k + 1], sizeof(int) * (j - k - 1));
        }
        arr[k + 1] = temp;
    }
}
I couldn't get the second version to run perfectly at first (the element count passed to memmove is easy to get wrong by one), but this is an example of what I am thinking of doing.
Would there be any visible speed improvements by using memmove?

The implementation behind memmove() might be more optimized in your C library. Some architectures have instructions for moving whole blocks of memory at once very efficiently. The theoretical running-time complexity won't be improved, but it may still run faster in real life.

memmove is typically tuned to make maximum use of the available system resources (uniquely for each implementation, of course).
Here is a short quote from Expert C Programming: Deep C Secrets on the difference between using a loop and using memcpy (it is preceded by two code snippets, one copying a source into a destination with a for loop, the other with memcpy):
In this particular case both the source and destination use the same
cache line, causing every memory reference to miss the cache and
stall the processor while it waited for regular memory to deliver.
The library memcpy() routine is especially tuned for high performance.
It unrolls the loop to read for one cache line and then write, which
avoids the problem. Using the smart copy, we were able to get a huge
performance improvement. This also shows the folly of drawing
conclusions from simple-minded benchmark programs.
This dates back to 1994, but it still illustrates how much better optimized the standard library functions are compared to anything you roll yourself. The loop case took around 7 seconds to run versus 1 second for the memcpy.
While memmove will be only slightly slower than memcpy, due to the assumptions it needs to make about the source and destination (in memcpy they cannot overlap), it should still be far superior to any standard loop.
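For instance, the overlap distinction is easy to see with a small sketch (hypothetical buffer; shifting a region one slot to the right is exactly what insertion sort does):

int buf[5] = {1, 2, 3, 4, 5};
memmove(&buf[1], &buf[0], 4 * sizeof(int));  /* buf is now 1 1 2 3 4 */
/* memcpy(&buf[1], &buf[0], 4 * sizeof(int)) would be undefined here,
   because the source and destination ranges overlap */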
Note that this does not affect complexity (as another poster has pointed out). Complexity does not depend on having a bigger cache or an unrolled loop :)
As requested here are the code snippets (slightly changed):
#include <string.h>

#define DUMBCOPY for (i = 0; i < 65536; i++) destination[i] = source[i]
#define SMARTCOPY memcpy(destination, source, 65536)

int main()
{
    char source[65536], destination[65536];
    int i, j;
    for (j = 0; j < 100; j++)
        DUMBCOPY; /* or put SMARTCOPY here instead */
    return 0;
}
On my machine (32 bit, Linux Mint, GCC 4.6.3) I got the following times:
Using SMARTCOPY:
$ time ./a.out
real 0m0.002s
user 0m0.000s
sys 0m0.000s
Using DUMBCOPY:
$ time ./a.out
real 0m0.050s
user 0m0.036s
sys 0m0.000s

It all depends on your compiler and other implementation details. It is true that memmove can be implemented in some tricky, super-optimized way. But at the same time, a smart compiler might be able to figure out what your per-element copying code is doing and optimize it the same (or a very similar) way. Try it and see for yourself.
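If you want to try it, a minimal timing harness could look like this (a sketch: the array size, the fixed seed and which insertion_sort variant you link in are assumptions to adjust):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void insertion_sort(int arr[], int length);  /* either variant from above */

#define N 20000

int main(void)
{
    static int a[N];
    srand(42);  /* fixed seed so both variants sort identical input */
    for (int i = 0; i < N; i++)
        a[i] = rand();

    clock_t start = clock();
    insertion_sort(a, N);  /* swap in the variant under test */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("sorted %d ints in %f seconds\n", N, secs);
    return 0;
}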

You cannot beat memcpy with a plain C implementation, because it is typically written in assembly with good algorithms.
If you write assembly code with a specific CPU in mind and develop a good algorithm that takes the cache into account, you may have a chance.
Standard library functions are so well optimized that it is almost always better to use them.

Related

Function to multiply 3x3 matrices gives wrong answer for middle column only

While teaching myself C, I thought it would be good practice to write a function that multiplies two 3x3 matrices and then make it more general. The function seems to calculate the correct result for the first and last columns but not the middle one. In addition, each value down the middle column is off by 3 more than the last.
For example:
[1 2 3]   [23  4  6]
[4 5 6] * [ 2 35  0]
[7 8 9]   [14  2 43]
The answer I receive is:
[ 69 80 135]
[190 273 282]
[303 326 429]
The actual answer should be:
[ 69 83 135]
[190 279 282]
[303 335 429]
Isolating the middle columns for clarity:
Received   Expected
[ 80]      [ 83]
[273]      [279]
[326]      [335]
My code is as follows:
#include <stdio.h>

typedef struct mat_3x3
{
    double values[3][3];
} mat_3x3;

void SetMatrix(mat_3x3 *matrix, double vals[3][3])
{
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            (matrix->values)[i][j] = vals[i][j];
        }
    }
    putchar('\n');
}

mat_3x3 MatrixMultiply(mat_3x3 *m1, mat_3x3 *m2)
{
    mat_3x3 result;
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            double temp = 0;
            for (int k = 0; k < 3; k++)
            {
                temp += ((m1->values)[i][k] * (m2->values)[k][j]);
            }
            (result.values)[i][j] = temp;
        }
    }
    return result;
}

void PrintMatrix(mat_3x3 *matrix)
{
    putchar('\n');
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            printf("%lf ", (matrix->values)[i][j]);
        }
        putchar('\n');
    }
    putchar('\n');
}

int main()
{
    mat_3x3 m1;
    mat_3x3 *pm1 = &m1;
    mat_3x3 m2;
    mat_3x3 *pm2 = &m2;
    double vals[3][3] = {
        {1, 2, 3},
        {4, 7, 6},
        {7, 8, 9}
    };
    double vals2[3][3] = {
        {23, 4, 6},
        {2, 35, 0},
        {14, 2, 43}
    };
    SetMatrix(pm1, vals);
    SetMatrix(pm2, vals2);
    printf("\nm1:");
    PrintMatrix(pm1);
    printf("\nm2:");
    PrintMatrix(pm2);
    mat_3x3 m3 = MatrixMultiply(pm1, pm2);
    mat_3x3 *pm3 = &m3;
    printf("\nm3 = m1 * m2");
    PrintMatrix(pm3);
}
I have been working on this for a while, comparing it against other simple examples, and can't find the problem, so help would be appreciated!
Also, if I've done anything atrocious syntax-wise etc., I'm open to any criticism on how it's written as well.
In practice, when coding in C, you should take care of the following issues:
refer to a good C reference website and read a good C programming book, such as Modern C
floating-point numbers are not mathematical real numbers; see floating-point-gui.de for much more. For example, addition is associative in math, but not on a computer using IEEE-754.
we all write bugs (e.g. buffer overflows or undefined behavior), so you need to learn how to use a debugger. I recommend GDB, but you need to learn how to use it and spend a few hours reading its documentation. Tools like valgrind are also useful (to hunt memory leaks) as soon as you use C dynamic memory allocation.
recent compilers can be helpful. I recommend GCC. You should invoke it with all warnings and debug info, e.g. gcc -Wall -Wextra -g. Be sure to spend some time reading your compiler's documentation. You might later consider using static program analysis tools such as Frama-C, the Clang analyzer, or (for precision analysis) Fluctuat or CADNA.
consider having a matrix abstract data type like here. You could then easily generalize your code to "arbitrary" N*M matrices.
later, for benchmarking purposes, you will want to use an optimizing compiler. If you use GCC, you could compile your code using gcc -Wall -Wextra -g -O3, but then you could get surprising optimizations; see e.g. this draft report.
in some cases, you may need arbitrary-precision arithmetic. Consider then using specialized libraries such as GMPlib.
most computers today are multi-core. You might want to use Pthreads or MPI to take advantage of that with concurrent programming.
many open source libraries exist for scientific computation. Look at least for inspiration on github and gitlab, and see also this list. You could be interested in GNU GSL and study its source code, since it is free software (and later improve it).
If you want to do serious scientific computation, you might consider switching (for expressiveness) to functional languages such as OCaml. If you care about doing a lot of iterative computing (like in finite element methods), you might switch to OpenCL or OpenACC.
Be aware that scientific computation is a very difficult field.
Expect to spend a decade learning it.
I'm open to any criticism on how it's written as well.
mat_3x3 MatrixMultiply(mat_3x3 *m1, mat_3x3 *m2)
is unusual. Why don't you return a pointer (to a fresh memory zone obtained with malloc and correctly initialized)? That is likely to be faster (a pointer is usually 8 bytes, while a 3x3 matrix takes 72 bytes to be copied) and enables you to code things like MatrixMultiply(MatrixMultiply(M1, M2), MatrixAdd(M2, M3)). Of course, garbage collection (read the GC handbook; consider using Boehm GC) then becomes an issue. If you used OCaml, the system GC would be very helpful.
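As a hedged sketch of that suggestion (MatrixMultiplyNew is a hypothetical name, and error handling is kept minimal), the pointer-returning variant might look like:

#include <stdlib.h>

mat_3x3 *MatrixMultiplyNew(const mat_3x3 *m1, const mat_3x3 *m2)
{
    mat_3x3 *result = malloc(sizeof *result);
    if (result == NULL)
        return NULL;  /* the caller must check for allocation failure */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double temp = 0;
            for (int k = 0; k < 3; k++)
                temp += m1->values[i][k] * m2->values[k][j];
            result->values[i][j] = temp;
        }
    return result;
}

The caller then owns the result and must free it (or delegate that to a collector such as Boehm GC, as mentioned above).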

Use the maximum CPU to solve the permutations of 10 in the shortest possible time

I am using this program, written in C, to determine the permutations of size 10 of a regular alphabet.
When I run the program it only uses 36% of my 3GHz CPU, leaving 50% free. It also only uses 7MB of my 8GB of RAM.
I would like to use at least 70-80% of my computer's capacity, not just this misery. This limitation is making the procedure very time-consuming, and I don't know when (how many days) I will have the complete output. I need help resolving this issue in the shortest possible time, whether by improving the source code or through other possibilities.
Any help is welcome, even if the solution means using another language instead of C, as long as it gives me better performance in executing the program.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static int count = 0;

void print_permutations(char arr[], char prefix[], int n, int k) {
    int i, j, l = strlen(prefix);
    char newprefix[l + 2];
    if (k == 0) {
        printf("%d %s\n", ++count, prefix);
        return;
    }
    for (i = 0; i < n; i++) {
        // Concatenation of currentPrefix + arr[i] = newPrefix
        for (j = 0; j < l; j++)
            newprefix[j] = prefix[j];
        newprefix[l] = arr[i];
        newprefix[l + 1] = '\0';
        print_permutations(arr, newprefix, n, k - 1);
    }
}

int main() {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    print_permutations(arr, "", n, k);
    system("pause");
    return 0;
}
There are fundamental problems with your approach:
What are you trying to achieve?
If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed, as it enumerates all combinations of 10 letters from the alphabet. Your program will produce 26^10 combinations, a huge number: 141167095653376, more than 141 thousand billion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 petabytes, unlikely to fit on your storage space. Writing this at a top speed of 100MB/s would take about six months.
The number of permutations, that is, arrangements of 10 distinct letters from the 26-letter alphabet, is not quite as large: 26! / 16!, which is still huge: 19275223968000, 7 times less than the previous result. That is still more than 212 terabytes of storage and more than three weeks at 100MB/s.
Storing these permutations is therefore impractical. You could change your program to just count the permutations and measure how long that takes, checking that the count matches the expected value. The first step, of course, is to correct your program to produce the correct set.
Test on smaller sets to verify correctness
Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.
Once you have correctness, only then focus on performance
Selecting different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent producing the output. Once you stop producing output, you might see that strlen() consumes a significant fraction of the execution time, which is wasteful since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, removing the need to copy it at each recursive step.
Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.
Memory use is no measure of performance
Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. 7MB is actually much more than your program's actual needs: this count includes the full stack space assigned to the program, of which only a tiny fraction will be used.
As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.
Here is a modified program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long count;

void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
    if (pos == k) {
        prefix[k] = '\0';
        ++count;
        //printf("%llu %s\n", count, prefix);
        //puts(prefix);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            prefix[pos] = arr[i];
            print_permutations(arr, n, used, prefix, pos + 1, k);
            used[i] = 0;
        }
    }
}

int main(int argc, char *argv[]) {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    char used[27] = { 0 };
    char perm[27];
    unsigned long long expected_count;
    clock_t start, elapsed;

    if (argc >= 2)
        k = strtol(argv[1], NULL, 0);
    if (argc >= 3)
        n = strtol(argv[2], NULL, 0);
    start = clock();
    print_permutations(arr, n, used, perm, 0, k);
    elapsed = clock() - start;
    expected_count = 1;
    for (int i = n; i > n - k; i--)
        expected_count *= i;
    printf("%llu permutations, expected %llu, %.0f permutations per second\n",
           count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
    return 0;
}
Without output, this program enumerates 140 million combinations per second on my slow laptop; at that rate it would take 1.5 days to enumerate the 19275223968000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual-core, hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global count variable; a sketch of this follows.
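Here is a hedged sketch of that multi-threaded counting step, assuming POSIX threads (compile with -pthread): each thread fixes a different first letter and owns a private counter, and the counters are combined only after joining, so the global count variable disappears. Names and structure are illustrative, not part of the answer above.

#include <pthread.h>
#include <stdio.h>

#define N 26  /* alphabet size */
#define K 10  /* permutation length */

struct job {
    int first;                /* index of the fixed first letter */
    unsigned long long count; /* per-thread counter, nothing shared */
};

/* Since we only count, the actual letters never need to be materialized. */
static void count_permutations(char used[], int pos, int k,
                               unsigned long long *count)
{
    if (pos == k) {
        ++*count;
        return;
    }
    for (int i = 0; i < N; i++) {
        if (!used[i]) {
            used[i] = 1;
            count_permutations(used, pos + 1, k, count);
            used[i] = 0;
        }
    }
}

static void *worker(void *p)
{
    struct job *job = p;
    char used[N] = { 0 };
    used[job->first] = 1;  /* this thread's fixed first letter */
    count_permutations(used, 1, K, &job->count);
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    struct job jobs[N] = {{ 0 }};
    unsigned long long total = 0;

    for (int i = 0; i < N; i++) {
        jobs[i].first = i;
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < N; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].count;  /* combine only after joining */
    }
    printf("%llu permutations\n", total);
    return 0;
}

With 26 threads on a 4-way machine the scheduler does the load balancing; a tighter version would use one thread per core and a work queue.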
There are multiple reasons for your bad experience:
Your metric:
Your metric is fundamentally flawed. Peak CPU% is an imprecise measurement of "how much work my CPU does", which normally isn't what you're actually interested in. You can inflate this number by doing more work (like starting another thread that doesn't contribute to the output at all).
Your proper metric is items per second: how many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4) and measure how long it takes.
Your problem: your problem is hard. Printing or writing down all 26^10 ~1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be 1.4 petabytes, which is most likely more than your hard drive will accept. Also, even at 100% CPU usage and one thousand cycles per word, it'd take 1.5 years. 1000 cycles per word is optimistic: you most likely won't be faster than that while still printing your results, as printf alone usually takes around 1000 cycles to complete.
Your output: writing to stdout is slow compared to writing to a file, see https://stackoverflow.com/a/14574238/4838547.
Your program: There are issues with your program that could hurt its performance. However, they are dominated by the other problems stated here. With my setup, this program spends 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
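If the output has to stay, one cheap mitigation (a sketch; the buffer size is an arbitrary choice) is to switch stdout to full buffering before printing anything:

#include <stdio.h>

static char buf[1 << 20];  /* 1 MB of user-space buffering */

int main(void)
{
    /* must be called before the first output to the stream */
    setvbuf(stdout, buf, _IOFBF, sizeof buf);
    /* ... generate and print permutations as before ... */
    return 0;
}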

Make this for-loop more efficient?

Edit: this code will be run with optimizations off.
Full transparency: this is a homework assignment.
I'm having some trouble figuring out how to optimize this code...
My instructor went over unrolling and splitting, but neither seems to greatly reduce the time needed to execute the code. Any help would be appreciated!
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
Assuming you mean the same number of additions to sum at runtime (rather than the same number of additions in the source code), unrolling could give you something like:
for (j = 0; j + 5 < ARRAY_SIZE; j += 5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for (; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
Alternatively, since you're adding the same values each time through the outer loop, you don't need to process it N_TIMES times, just do this:
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    sum *= N_TIMES;
    break;
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
This requires the initial value of sum to be zero, which is likely, but there's actually nothing in your question that mandates it, so I include it as a pre-condition for this method.
Except by cheating*, this inner loop is essentially non-optimizable, because you must fetch all the array elements and perform all the additions anyway.
The body of the loop performs:
1. a conditional branch on j;
2. a fetch of array[j];
3. the accumulation to a scalar variable;
4. the incrementation of j.
As said, 2. to 4. are inescapable. Then all you can do is reduce the number of conditional branches by loop unrolling (this turns the conditional branch into an unconditional one, at the expense of the number of iterations becoming fixed).
It is no surprise that you don't see a big difference: modern processors are "loop aware", meaning that branch prediction is well tuned to such loops, so that the cost of the branches is pretty low.
Cheating:
As others said, you can completely bypass the outer loop. This is just exploiting a flaw in the exercise statement.
As optimizations must be turned off, using inline assembly, pragmas, vector instructions or intrinsics should be banned as well (not to mention automatic parallelization).
There is a possibility to pack two ints in a long long. If the sums don't overflow, you will perform two additions at a time. But is this legal? (A sketch follows this list.)
One might think of an access pattern that favors cache utilization, but here there is no hope: the array is fully traversed on every loop and there is no possibility of reusing the values fetched.
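For the packing idea, a hedged sketch follows. Big assumptions: the elements are non-negative and small enough that the low 32-bit half never carries into the high half, sizeof(long long) == 2 * sizeof(int), and memcpy is used to sidestep the strict-aliasing question raised above:

#include <string.h>  /* memcpy */

long long packed = 0;
int j;
for (j = 0; j + 2 <= ARRAY_SIZE; j += 2) {
    long long two;
    memcpy(&two, &array[j], sizeof two);  /* reinterpret two adjacent ints */
    packed += two;                        /* two additions in one */
}
/* both halves are extracted and added, so endianness does not matter */
sum += (int)(packed & 0xFFFFFFFF) + (int)(packed >> 32);
for (; j < ARRAY_SIZE; j++)  /* odd leftover element, if any */
    sum += array[j];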
First of all, unless you are explicitly compiling with -O0, your compiler has likely already optimized this loop much further than you could possibly expect.
That includes unrolling and, on top of unrolling, also vectorization and more. Trying to optimize this by hand is something you should never, absolutely never do. At most you will successfully make the code harder to read and understand, while most likely not even matching the compiler in terms of performance.
As to why there is no measurable gain? Possibly because you are already hitting a bottleneck, even with the "non-optimized" version. For ARRAY_SIZE greater than your processor's cache, even the compiler-optimized version is already limited by memory bandwidth.
But for completeness, let's assume you have not hit that bottleneck, that you have actually turned optimizations almost off (so no more than -O1), and optimize for that.
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    int tmpSum[4] = {0, 0, 0, 0};
    // Stop while 4 full elements remain, so the unrolled loop never
    // reads past the end of the array.
    for (j = 0; j + 4 <= ARRAY_SIZE; j += 4) {
        tmpSum[0] += array[j+0];
        tmpSum[1] += array[j+1];
        tmpSum[2] += array[j+2];
        tmpSum[3] += array[j+3];
    }
    sum += tmpSum[0] + tmpSum[1] + tmpSum[2] + tmpSum[3];
    // Handle the leftovers when ARRAY_SIZE is not a multiple of 4.
    for (; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
There is pretty much only one factor left which could still reduce performance for a smaller array.
It is not the loop overhead, so plain unrolling would have been pointless on a modern processor. Don't even bother, you won't beat the branch prediction.
But the latency between two instructions, from when a value is written by one instruction until it may be read again by the next, still applies. In this case, sum is constantly written and read all over again, and even if it is cached in a register, this delay still applies and the processor's pipeline has to wait.
The way around that is to have multiple independent additions going on simultaneously, and finally combine the results. This is, by the way, also an optimization which most modern compilers know how to perform.
On top of that, you could now also express the first loop with vector instructions - once again, something the compiler would have done. At that point you run into instruction latency again, so you will likely have to introduce one more set of temporaries, so that you have two independent addition streams, each using vector instructions; a sketch follows.
Why the requirement of at least -O1? Because otherwise the compiler won't even place tmpSum in a register, or will try to express e.g. array[j+0] as a sequence of instructions for performing the addition first, rather than just using a single instruction for that. Hardly possible to optimize in that case, without using inline assembly directly.
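A hedged sketch of that vector step, using GCC/Clang vector extensions (an assumption; SSE intrinsics would do as well), with memcpy loads to avoid alignment trouble and ARRAY_SIZE assumed to be a multiple of 8:

#include <string.h>  /* memcpy */

typedef int v4si __attribute__((vector_size(16)));  /* 4 ints per vector */

v4si acc0 = {0, 0, 0, 0}, acc1 = {0, 0, 0, 0};
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
    v4si a, b;
    memcpy(&a, &array[j], sizeof a);      /* unaligned-safe loads */
    memcpy(&b, &array[j + 4], sizeof b);
    acc0 += a;  /* first independent stream */
    acc1 += b;  /* second independent stream */
}
acc0 += acc1;
sum += acc0[0] + acc0[1] + acc0[2] + acc0[3];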
Or if you just feel like (legit) cheating:
const int N_TIMES = 1000;
const int ARRAY_SIZE = 1024;
const int array[1024] = {1};
int sum = 0;

__attribute__((optimize("O3")))
__attribute__((optimize("unroll-loops")))
int fastSum(const int array[]) {
    int j;
    int tmpSum = 0;  /* must be initialized before accumulating */
    for (j = 0; j < ARRAY_SIZE; j++) {
        tmpSum += array[j];
    }
    return tmpSum;
}

int main() {
    int i;
    for (i = 0; i < N_TIMES; i++) {
        // You can change anything between this comment ...
        sum += fastSum(array);
        // ... and this one. But your inner loop must do the *same
        // number of additions as this one does.
    }
    return sum;
}
The compiler will then apply pretty much all the optimizations described above.

Effect of cache size on code

I want to study the effect of the cache size on code. For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.
How can I measure this?
I tried to run this C program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define L1_CACHE_SIZE 32    // Kbytes, 8192 integers
#define L2_CACHE_SIZE 256   // Kbytes, 65536 integers
#define L3_CACHE_SIZE 4096  // Kbytes

#define ARRAYSIZE 32000
#define ITERATIONS 250

int arr[ARRAYSIZE];

/*************** TIME MEASUREMENTS ***************/
double microsecs() {
    struct timeval t;
    if (gettimeofday(&t, NULL) < 0)
        return 0.0;
    return (t.tv_usec + t.tv_sec * 1000000.0);
}

void init_array() {
    int i;
    for (i = 0; i < ARRAYSIZE; i++) {
        arr[i] = (rand() % 100);
    }
}

int operation() {
    int i, j;
    int sum = 0;
    for (j = 0; j < ITERATIONS; j++) {
        for (i = 0; i < ARRAYSIZE; i++) {
            sum =+ arr[i];
        }
    }
    return sum;
}

void main() {
    init_array();
    double t1 = microsecs();
    int result = operation();
    double t2 = microsecs();
    double t = t2 - t1;
    printf("CPU time %f milliseconds\n", t/1000);
    printf("Result: %d\n", result);
}
varying the values of ARRAYSIZE and ITERATIONS (keeping the product, and hence the number of instructions, constant) in order to check whether the program runs faster when the array fits in the cache, but I always get the same CPU time.
Can anyone say what I am doing wrong?
What you really want to do is build a "memory mountain." A memory mountain helps you visualize how memory accesses affect program performance. Specifically, it measures read throughput vs spatial locality and temporal locality. Good spatial locality means that consecutive memory accesses are near each other and good temporal locality means that a certain memory location is accessed multiple times in a short amount of program time. Here is a link that briefly mentions cache performance and memory mountains. The 3rd edition of the textbook mentioned in that link is a very good reference, specifically chapter 6, for learning about memory and cache performance. (In fact, I'm currently using that section as a reference as I answer this question.)
Another link shows a test function that you could use to measure cache performance, which I have copied here:
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;
}
The stride determines the spatial locality - how far apart consecutive memory accesses are - while the total amount of data read determines the temporal locality - how soon the same data is accessed again.
The idea is that this function estimates the number of cycles it took to run. To get throughput, you take (size / stride) / (cycles / MHz), where size is the size of the array in bytes, cycles is the result of this function, and MHz is the clock speed of your processor. You'd want to call the function once before taking any measurements to "warm up" the cache, then run the loop and take measurements.
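To make that arithmetic concrete (illustrative numbers, not measurements): reading a 262144-byte array at stride 1 in 131072 cycles on a 2000 MHz processor gives (262144 / 1) / (131072 / 2000) ≈ 4000 MB/s.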
I found a GitHub repository that you could use to build a 3D memory mountain on your own machine. I encourage you to try it on multiple machines with different processors and compare differences.
There's a typo in your code: =+ instead of +=.
The arr array is linked into the BSS [uninitialized data] section. The default value for variables in this section is zero, and all pages in the section are initially mapped read-only to a single zero page. This is Linux/Unix-centric, but probably applies to most modern OSes.
So, regardless of the array size, you're only fetching from a single page, which will get cached, and that's why you get the same results.
You'll need to break the "zero page mapping" by writing something to all of arr before doing your tests; that is, do something like memset first. This will cause the OS to create a linear page mapping for arr using its COW (copy-on-write) mechanism.
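Concretely, that could be as simple as this before the timed section (a sketch; any written value works):

#include <string.h>

memset(arr, 1, sizeof arr);  /* write every page of arr to force real mappings */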

Efficient Rice decoding implementation

I made a naïve implementation of a Rice decoder (and encoder):
void rice_decode(int k) {
    int i = 0;
    int j = 0;
    int x = 0;
    while (i < size - k) {
        int q = 0;
        while (get(i) == 0) {
            q++;
            i++;
        }
        x = q << k;
        i++;
        for (j = 0; j < k; j++) {
            x += get(i + j) << j;
        }
        i += k;
        printf("%i\n", x);
        x = 0;
    }
}
with size the size of the input bitset, get(i) a primitive returning the i-th bit of the bitset, and k the Rice parameter. As I am concerned with performance, I also made a more elaborate implementation with precomputation, which is faster. However, when I turn on the -O3 flag in gcc, the naïve implementation actually outperforms the latter.
My question is: do you know of any existing efficient implementation of a Rice encoder/decoder (I am more concerned with decoding) that fares better than this (the ones I could find are either slower or comparable)? Alternatively, do you have any clever ideas that could make decoding faster, other than precomputation?
Rice coding can be viewed as a variant of variable-length codes. Consequently, in the past I've used table-based techniques to generate automata/state machines that decode fixed Huffman codes and Rice codes quickly. A quick web search for fast variable-length codes or fast huffman yields many applicable results, some of them table-based.
There are some textbook bithacks applicable here.
while (get(i) == 0) { q++; i++; } finds the position of the least significant set bit in the stream.
That can be replaced with -data & data, which isolates a single set bit. That bit can be converted to the index q with some hash + LUT (e.g. the classic one involving a modulus by 37), or using SSE4's crc32 instruction; I'd bet one can simply do LUT[crc32(-data & data) & 63].
The next loop, for (j = 0; j < k; j++) x += get(i + j) << j;, should instead be replaced with x += data & ((1 << k) - 1);, since one simply takes k bits from the stream and treats them as an unsigned integer.
Finally, one shifts out the consumed bits with data >>= (q + 1 + k); (the +1 accounts for the stop bit, matching the i++ in the code above) and reads enough bytes in from the input stream.
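Putting those pieces together, here is a hedged sketch of a word-at-a-time decoder, using GCC/Clang's __builtin_ctzll in place of the hash + LUT trick (an assumption, not the answer's exact method). It assumes bits arrive LSB-first in a 64-bit buffer, that each symbol fits in the buffer (q + 1 + k <= 64), and that refill() is a stub standing in for real input handling:

#include <stdint.h>

static uint64_t data;     /* bit buffer; the least significant bit is next */
static int avail;         /* number of valid bits currently in the buffer */

static void refill(void); /* assumed: tops the buffer up from the input */

int rice_decode_symbol(int k)
{
    if (avail < 64)
        refill();
    int q = __builtin_ctzll(data);  /* zero run of the unary part;
                                       undefined if data == 0, so the
                                       buffer must contain the stop bit */
    data >>= q + 1;                 /* drop the unary part and stop bit */
    int x = (q << k) | (int)(data & ((1ull << k) - 1));
    data >>= k;                     /* drop the k raw bits */
    avail -= q + 1 + k;
    return x;
}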
