pmeerw's blog

16 Sep 2011

Fri, 16 Sep 2011


I added ARM NEON SIMD support to kiss FFT. Beware, this primarily enables 2 and 4 parallel FFTs, it not necessarily speeds up a single transform (well, in fact it does :-))

Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
float float
float32x2_t float32x4_t
1.62 1.22 0.66 0.98
Note: float32x2_t and float32x4_t, respectively, compute two and four FFTs in parallel!

posted at: 15:33 | path: /programming | permanent link

ARM floating point performance & RunFast

The following code (from math_runfast.c) improves kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).

void enable_runfast() {
#ifdef __arm__
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    int r;
    asm volatile (
        "fmrx   %0, fpscr                       \n\t"   //r0 = FPSCR
        "and    %0, %0, %1                      \n\t"   //r0 = r0 & 0x04086060
        "orr    %0, %0, %2                      \n\t"   //r0 = r0 | 0x03000000
        "fmxr   fpscr, %0                       \n\t"   //FPSCR = r0
        : "=r"(r)
        : "r"(x), "r"(y) );

In RunFast mode the VFP11 coprocessor, there are no user exception traps, rounding behaviour is slightly different (no negative zeros) and NaNs are handled differently.

Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego:

posted at: 13:13 | path: /programming | permanent link

How to use ARM NEON sqrt and reciprocal approximation

This is how I use ARM NEON intrinsics to speed up division and square root operations...

#include "arm_neon.h"

// approximative quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x) {
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
// approximative quadword float square root
static inline float32x4_t sqrtv(float32x4_t x) {
    return x * invsqrtv(x);
// approximative quadword float inverse
static inline float32x4_t invv(float32x4_t x) {
    float32x4_t reciprocal = vrecpeq_f32(x);
    reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
    return reciprocal;
// approximative quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
    float32x4_t reciprocal = vrecpeq_f32(y);
    reciprocal = vrecpsq_f32(y, reciprocal) * reciprocal;
    return x * invv(y);

// accumulate four quadword floats
static inline float accumv(float32x4_t x) {
    static const float32x2_t f0 = vdup_n_f32(0.0f);
    return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);

posted at: 10:39 | path: /programming | permanent link

Made with PyBlosxom