pmeerw's blog
16 Sep 2011
I added ARM NEON SIMD support to kiss FFT. Beware, this primarily enables 2 and 4
parallel FFTs, it not necessarily speeds up a single transform (well, in fact it does
)
Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
| float | float (RunFast) |
float32x2_t | float32x4_t |
| 1.62 | 1.22 | 0.66 | 0.98 |
posted at: 15:33 | path: /programming | permanent link
The following code (from math_runfast.c) improves
kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).
void enable_runfast() {
#ifdef __arm__
static const unsigned int x = 0x04086060;
static const unsigned int y = 0x03000000;
int r;
asm volatile (
"fmrx %0, fpscr \n\t" //r0 = FPSCR
"and %0, %0, %1 \n\t" //r0 = r0 & 0x04086060
"orr %0, %0, %2 \n\t" //r0 = r0 | 0x03000000
"fmxr fpscr, %0 \n\t" //FPSCR = r0
: "=r"(r)
: "r"(x), "r"(y) );
#endif
}
In RunFast mode the VFP11 coprocessor, there are no user exception traps, rounding behaviour is slightly different (no negative zeros) and NaNs are handled differently.
Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937
posted at: 13:13 | path: /programming | permanent link
This is how I use ARM NEON intrinsics to speed up division and square root operations...
#include "arm_neon.h"
// approximative quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x) {
float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}
// approximative quadword float square root
static inline float32x4_t sqrtv(float32x4_t x) {
return x * invsqrtv(x);
}
// approximative quadword float inverse
static inline float32x4_t invv(float32x4_t x) {
float32x4_t reciprocal = vrecpeq_f32(x);
reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
return reciprocal;
}
// approximative quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
float32x4_t reciprocal = vrecpeq_f32(y);
reciprocal = vrecpsq_f32(y, reciprocal) * reciprocal;
return x * invv(y);
}
// accumulate four quadword floats
static inline float accumv(float32x4_t x) {
static const float32x2_t f0 = vdup_n_f32(0.0f);
return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}
posted at: 10:39 | path: /programming | permanent link