I have to do floating point maths on arrays of ints, fast and with low latency, for audio apps (multi channel). My code works but I wonder if there are more efficient ways of processing in places. I get buffers of around 120 frames of interleaved 32 bit audio integers with 16 or 24 channels, which I then have to convert to arrays of floats / doubles for processing (eg biquad filters). Currently I iterate through the arrays and cast each integer to an element of a float array. Then I process these and cast them back to ints for the write buffer, which I pass back to the lib function (I'm on Linux using snd_pcm_readi and snd_pcm_writei). There's lots of copying and it seems wasteful.
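To illustrate, a stripped-down version of the kind of loop I mean (the names and the 2^23 scaling for 24-bit samples are just for illustration, not my exact code):

    #include <stddef.h>
    #include <stdint.h>

    /* Convert a block of 32-bit integer samples to floats and back.
     * The 1/2^23 scale assumes 24-bit samples; adjust for the real format. */
    void int_to_float(const int32_t *in, float *out, size_t n)
    {
        size_t i;
        for (i = 0; i < n; ++i)
            out[i] = (float)in[i] * (1.0f / 8388608.0f);   /* divide by 2^23 */
    }

    void float_to_int(const float *in, int32_t *out, size_t n)
    {
        size_t i;
        for (i = 0; i < n; ++i)
            out[i] = (int32_t)(in[i] * 8388608.0f);        /* multiply by 2^23 */
    }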
The quicker I can do it, the lower my latency and the better the overall performance, as it's for live sound use.
I have read about SSE and other extensions which can be compiled in with gcc options, and some references allude to being able to pass arrays for streamlined conversion etc, and I wonder if these might help with the above. Or maybe I should not bother casting to floats at all: change my processing functions to use ints, keep track of overflows, maybe use 64 bit ints instead, and/or create a separate array for exponents. That seems pretty esoteric to me, but I guess it's not that hard to implement and only needs to be coded once etc. I have asked a separate question, 'Is FPM required for audio DSP maths, or can it be done in 32/64 bit integer maths before rounding back to 24 bits for output?', which is part of the same topic but I thought I should split it into a different question.
If your code is license-compatible, you can use the Vector-Optimized Library of Kernels (VOLK) from the GNU Radio project, which has a kernel that does that conversion:
volk_32i_s32f_convert_32f
converts the input 32 bit integer data into floating point data, and divides each floating point output data point by the scalar value
It comes with heavily optimized implementations for SSE2, for both aligned and unaligned data.
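A sketch of what a call might look like (the signature is from memory, so check volk/volk_32i_s32f_convert_32f.h; VOLK also provides volk_malloc for properly aligned buffers, and the 2^23 scale is an assumption about your sample format):

    #include <volk/volk.h>   /* link with -lvolk */
    #include <stdint.h>

    /* Convert n interleaved 32-bit integer samples to floats in one call;
     * the kernel divides by the scalar, so 2^23 maps 24-bit ints to ~[-1, 1]. */
    void convert_block(const int32_t *in, float *out, unsigned int n)
    {
        volk_32i_s32f_convert_32f(out, in, 8388608.0f, n);
    }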
EDIT: By the way, why not just do your signal processing in GNU Radio itself?
Related
I am making a 2D shooter game, and thus I have to stuff lots of bullets into an array, including their position and where they are going.
So I have two issues. One is memory use, especially avoiding array layouts that place things out of alignment and result in lots of padding, which makes the speed of calculations suck.
The second is speed of calculation.
First this means choosing between integers and floats... For now I am going with integers (if someone thinks floating point is better, please say so).
Then, this also means choosing a variant of that type (8 bits? 16 bits? C's confusing default? The CPU word size? Single precision? Double precision?)
Thus the question is: what type in C is fastest on modern processors (i.e. common x86, ARM and other popular processors; don't worry about Z80 or 36-bit processors), and what type is most reasonable when taking both speed AND memory use into account?
Also, do signed and unsigned differ in speed?
EDIT because of close votes: Yes, it might be premature optimization, but I am asking not only about CPU use but also about memory use (which might vary significantly). Also, I am doing the project to exercise my C skills; it has been some years since I coded in C, and I thought I would have some fun, find limits and stretch them, and also learn the new standards (last time I used C it was still C89).
Finally, the major motivation of asking this question was just hacker curiosity when I found out some new interesting types (like int_fast*_t) existed in newer standards.
But if you still think this is not worth asking, then I can delete the question and go peruse the standards and some books, learn by myself. Then if others one day have the same curiosity, it is not my problem.
I would say an int should be the most comfortable for your CPU. But the C standard does have:
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. The typedef name uint_fastN_t designates the fastest unsigned integer type with a width of at least N.
So in theory you could say things like: "I need it to be at least 16 bits so I shall use int_fast16_t". In practice that might translate to a plain int.
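A trivial example (on typical desktop targets this usually ends up being a plain int or even something wider):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int_fast16_t x = 1000;   /* at least 16 bits, whatever the platform finds fastest */
        printf("int_fast16_t is %zu bytes on this platform, x = %d\n",
               sizeof x, (int)x);
        return 0;
    }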
I suspect it is premature to think about these before you actually hit a performance issue that you can try to work around. I think it is better to solve problems when they occur than to try to think of an elusive super-solution that could solve all future possible issues.
Single precision floating point add and multiply are as fast as 32 bit integer arithmetic on all modern processors (x86, ARM, MIPS), i.e. one result per clock cycle. Calculating positions and velocities in space is a lot easier with floating point arithmetic, so use floats. Single precision floats are 32 bits, the same size as the most efficient integer type on 32 bit CPUs.
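As a sketch (the field names are illustrative), four floats also pack into 16 bytes with no padding, which takes care of the alignment worry from the question:

    /* A bullet as four single-precision floats: 16 bytes, no padding. */
    typedef struct {
        float x, y;     /* position */
        float vx, vy;   /* velocity, units per tick */
    } Bullet;

    void update_bullets(Bullet *b, int count, float dt)
    {
        int i;
        for (i = 0; i < count; ++i) {
            b[i].x += b[i].vx * dt;
            b[i].y += b[i].vy * dt;
        }
    }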
I wrote an ANSI C compiler for a friend's custom 16-bit stack-based CPU several years ago but I never got around to implementing all the data types. Now I would like to finish the job, so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle the 16-bit integer data types since they are native to the CPU and therefore I have all the math routines (i.e. +, -, *, /, %) done for them. However, since his CPU does not handle floating point I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (both integer and floats/doubles). I'm pretty sure this has been done and redone many times, and since I'm not particularly looking forward to reinventing the wheel I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (the library must be absolutely huge; I'm not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings, which would be wasteful for obvious reasons. For example:
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, and I like the API... but I would rather pass in an array than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass in two arrays of size two, where each array contains two 16-bit values that actually represent a 32-bit long, and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!
Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on it, you simply pad it with zeros (or ones if it is signed and negative) to make it 16-bit. Then you proceed with the normal 16-bit operation.
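For example, an 8-bit signed add on a 16-bit machine reduces to the following (an illustrative sketch in C; the widening is exactly what C's integer promotion rules specify):

    #include <stdint.h>

    /* 8-bit signed add done with native 16-bit arithmetic:
     * sign-extend both operands, add, truncate back to 8 bits. */
    int8_t add8(int8_t a, int8_t b)
    {
        int16_t wa = a;   /* sign extension = "pad with ones if negative" */
        int16_t wb = b;
        return (int8_t)(wa + wb);
    }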
32-bit arithmetic
Note: as far as the standard is concerned, you don't really need to have 32-bit integers.
This could be a bit tricky, but it is still not worth using a library for. For each operation, you would need to take a look at how you learned to do them in elementary school in base 10, and then do the same in base 2^16 for two-digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base 10 math (and hence the algorithms), you would need to implement them in the assembly of your CPU.
This basically means loading the most significant 16 bits into one register and the least significant into another register. Then follow the algorithm for each operation and perform it. You would most likely need to get help from the carry, overflow and other flags.
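As an illustration of the addition case in C (on the real CPU you would use its carry flag in assembly; here the carry is detected manually):

    #include <stdint.h>

    /* A 32-bit unsigned value as two base-2^16 "digits". */
    typedef struct { uint16_t hi, lo; } u32_t;

    u32_t add32(u32_t a, u32_t b)
    {
        u32_t r;
        uint16_t carry;
        r.lo  = (uint16_t)(a.lo + b.lo);
        carry = (uint16_t)(r.lo < a.lo);      /* low half wrapped around => carry out */
        r.hi  = (uint16_t)(a.hi + b.hi + carry);
        return r;
    }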
Floating point arithmetic
Note: as far as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software-emulated floating point. You may find this GCC wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.
Or are floating point and integer operations the same speed? And if not, how much faster is the integer version?
You can find information about instruction-specific scheduling for Advanced SIMD instructions for the Cortex-A8 (they don't publish it for newer cores since the timing business has become quite complicated).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating point instructions take two cycles, while instructions executed on the integer ALU take one cycle. On the other hand, multiplication of a long long (8-byte integer) takes four cycles (from the same source), while multiplication of a double takes two cycles.
In general it seems you shouldn't worry about float versus integer; carefully choosing the data type (float vs double, int vs long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
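For instance, on SSE2 a 128-bit register holds four single-precision floats or eight 16-bit integers, so the narrower integer type does twice the work per instruction (a minimal intrinsics sketch, not taken from any particular codebase):

    #include <emmintrin.h>   /* SSE2 */

    /* One 128-bit operation each: 8 x 16-bit integer adds vs 4 x float adds. */
    __m128i add_8_lanes_i16(__m128i a, __m128i b) { return _mm_add_epi16(a, b); }
    __m128  add_4_lanes_f32(__m128  a, __m128  b) { return _mm_add_ps(a, b); }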
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64 bit numbers back to 32 bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. If you are using Q15, you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
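To make the Q15 path concrete, here is a scalar model of the per-lane operation that VQDMULH performs (just an illustration; the instruction does this across a whole vector at once):

    #include <stdint.h>

    /* Q15 multiply: multiply, double, keep the high 16 bits, saturating the
     * one overflow case (-1.0 * -1.0). Assumes arithmetic right shift. */
    int16_t q15_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * b;   /* Q30 product */
        if (p == 0x40000000)          /* -32768 * -32768 would overflow Q31 */
            return INT16_MAX;
        return (int16_t)(p >> 15);    /* same as doubling then taking the high half */
    }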
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.
I'm implementing some cryptographic algorithm in C which involves an 80 bits key.
A particular operation involves rotate-shifting the key by x bits.
I've tried the long double type, which if I'm not wrong is 80 bits, but that doesn't work with the bitshift operator.
The only alternative I can come up with is to use a 10 element char array with some complicated looping and if-else.
My question is whether there's some simple and efficient way of carrying this out.
Thanks.
There is something a bit messed up here. If I understand you correctly, you are using a "soft" CPU on the FPGA.
Traditionally, people use the FPGA to make their own shift registers through VHDL/Verilog. These kinds of algorithms are fairly painless to implement and very fast. Back at university I did this for a cryptography project.
Moreover, the paper you mentioned talks about a 128 bit key. Wouldn't that be significantly easier to implement?
Sadly you need a bignum library. While C's native data types can give you an 80 bit float, that doesn't actually do what you want.
It is possible to link in something like GMP, or to use less desirable approaches like a 10-element char array or two numbers, a long and a short (64 bit and 16 bit integers).
Neither is particularly pretty, but they do work, and if you're planning on using this for anything but a class, GMP is the way to go. Otherwise you could end up with a whole mess of timing attacks, which you could code around, but it could get really nasty, real quick.
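If you go the two-integer route, the rotate itself is not too bad; a rough sketch (purely illustrative, and not written to be constant-time):

    #include <stdint.h>

    /* 80-bit key as 16 high bits + 64 low bits. */
    typedef struct {
        uint16_t hi;   /* bits 79..64 */
        uint64_t lo;   /* bits 63..0  */
    } key80;

    /* Rotate left by one: the bit leaving the top of hi re-enters at the
     * bottom of lo, and the bit leaving the top of lo enters hi. */
    static key80 rotl1(key80 k)
    {
        key80 r;
        unsigned bit79 = (k.hi >> 15) & 1u;
        unsigned bit63 = (unsigned)(k.lo >> 63);
        r.hi = (uint16_t)((k.hi << 1) | bit63);
        r.lo = (k.lo << 1) | bit79;
        return r;
    }

    key80 rotl80(key80 k, unsigned n)
    {
        for (n %= 80; n > 0; --n)
            k = rotl1(k);
        return k;
    }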
I am writing functions that serialize/deserialize a large data structure for efficient reloading later on. There is a particular set of decimal numbers for which precision is not a huge deal, and I would like to store them in 4 bytes of binary data.
For most, reading the bytes into a buffer and using memcpy to place them into a float is sufficient, and is the most common solution I've found. However, this is not portable, as floats on the systems this software is meant for are not guaranteed to be 4 bytes in size.
What I would like is something very portable (which is one of the reasons I'm limited to C89). I'm not wedded to 4 byte storage, but it is an attractive option to me. I am pretty wholly against storing the numbers as strings. I'm familiar with endianness issues, and such things are already taken into account.
What I am looking for, therefore, is a system-independent way to store and retrieve floating point numbers in a small amount of binary data (preferably around 4 bytes). I, in my foolishness, imagined this would be the easiest part of this task, since it seems like such a common problem, but popular search engines and various reference books have provided no material assistance.
You could store them in 32 bit IEEE float format (or a very close approximation to it; for instance you might want to restrict denorms and NaNs). Then have each platform adjust as necessary to coerce its own float type to that format and back.
Of course there will be some loss of accuracy, but that's inevitable anyway if you're transferring float values of different precisions from one system to another.
It should be possible to write portable code to find the closest IEEE value to a native float value, and vice-versa, if that's required. You wouldn't really want to use it, though, because it would probably be far less efficient than code that takes advantage of knowing the float format. In the common case where the platform uses an IEEE representation it's a no-op or a simple narrowing/widening conversion. Even in the worst case you're likely to encounter, as long as it's a binary fraction you basically just have to extract the sign, exponent and significand bits and do the right thing with them (discard bits from the significand if it's too big, adjust the bias and possibly the width of the exponent, do the right thing with underflow and overflow).
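A sketch of that portable extraction using frexp/ldexp, so nothing is assumed about the native float layout (simplified: NaN, infinity, denormals and careful rounding are ignored; unsigned long is used because it is guaranteed to be at least 32 bits in C89):

    #include <math.h>

    /* Pack a value into an IEEE-754-single-like 32-bit pattern using only
     * portable arithmetic, and unpack it again. */
    unsigned long pack_f32(double x)
    {
        unsigned long sign = 0, mant, e_biased;
        int e;
        double m;

        if (x == 0.0)
            return 0;
        if (x < 0.0) { sign = 1; x = -x; }
        m = frexp(x, &e);                                      /* x = m * 2^e, 0.5 <= m < 1 */
        mant = (unsigned long)((m * 2.0 - 1.0) * 8388608.0);   /* 23 fraction bits */
        e_biased = (unsigned long)(e - 1 + 127);
        return (sign << 31) | (e_biased << 23) | (mant & 0x7FFFFFUL);
    }

    double unpack_f32(unsigned long bits)
    {
        int e = (int)((bits >> 23) & 0xFF) - 127;
        double m = 1.0 + (double)(bits & 0x7FFFFFUL) / 8388608.0;
        return (bits >> 31) ? -ldexp(m, e) : ldexp(m, e);
    }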
If you want to avoid losing accuracy in the case where the file is saved and then reloaded on the same system (but that system doesn't use 32bit IEEE), you could look at storing some data indicating the format in the file (size of each value, number of bits of significand and exponent), then store each value at native precision, so that it only gets rounded if it's ever loaded onto a less-precise system. I don't know whether ASN.1 has a standard to encode floating-point values along these lines, but it's the kind of complicated trickery I'd expect from it.
Check this out: http://steve.hollasch.net/cgindex/coding/portfloat.html
They give a routine which is portable and doesn't add too much overhead.