pmeerw's blog

Sep 2011

Mon, 19 Sep 2011

FFT performance on ARM Cortex-A8

All measurements are in seconds for 10000 repetitions, run on a BeagleBoard-xM clocked at 900 MHz (RunFast not enabled). Compiled with -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.

complex-to-complex

length (N)  ooura   djb     kiss    libav   fftw2   fftw3   fftw3/neon  fftw3/new
2048        10.22   11.56   14.2    1.0     10.92   16.16   2.82        2.87
1024        4.5     5.2     5.61    0.46    5.11    7.22    1.16        1.16
512         2.07    2.3     2.98    0.2     2.59    2.89    0.36        0.34
256         0.88    1.0     1.12    0.08    1.01    1.12    0.12        0.11

real-to-complex

length (N)  ooura   djb     kiss    libav   fftw2   fftw3   fftw3/neon  fftw3/new
2048        5.37    -       6.91    0.7     4.71    7.37    7.38        7.38
1024        2.49    -       3.45    0.32    2.19    3.14    3.13        3.13
512         1.09    -       1.43    0.2     1.09    1.2     1.2         1.2
256         0.49    -       0.72    0.08    0.41    0.46    0.46        0.47

oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 (version 2.1.5) is GPL licensed and available at http://www.fftw.org/.
fftw3 (version 3.2.2) is GPL licensed.
fftw3/neon is based on fftw 3.2.2 and has ARM/NEON patches added.
fftw3/new (version 3.3.1-beta) is GPL licensed and has ARM/NEON support.
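
For reference, a measurement loop of the kind described above could look roughly like the sketch below. This is not the benchmark code behind the tables; time_repetitions and dummy_transform are hypothetical names, and the dummy workload only stands in for whichever library transform is being timed.

```c
#include <stddef.h>
#include <time.h>

/* Signature of the routine under test (hypothetical; real FFT libraries differ). */
typedef void (*transform_fn)(const float *in, float *out, size_t n);

/* Counts invocations so the harness itself can be sanity-checked. */
static unsigned long calls = 0;

/* Stand-in workload: doubles every sample. Replace with a wrapper around the FFT to measure. */
static void dummy_transform(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
    calls++;
}

/* Returns the CPU time in seconds consumed by `reps` calls of `fn`. */
static double time_repetitions(transform_fn fn, const float *in, float *out,
                               size_t n, unsigned reps) {
    clock_t start = clock();
    for (unsigned i = 0; i < reps; i++)
        fn(in, out, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_repetitions(dummy_transform, in, out, n, 10000) gives a number comparable in kind to the table entries. Any plan or twiddle-table setup belongs outside the timed loop so that only the transform itself is measured.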

posted at: 14:00 | path: /programming | permanent link

Fri, 16 Sep 2011

KissFFT and ARM NEON

I added ARM NEON SIMD support to kiss FFT. Beware: this primarily enables computing 2 or 4 FFTs in parallel; it does not necessarily speed up a single transform (well, in fact it does :-))

Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
float   float (RunFast)   float32x2_t   float32x4_t
1.62    1.22              0.66          0.98
Note: float32x2_t and float32x4_t, respectively, compute two and four FFTs in parallel!
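
To feed four independent signals to such a lane-parallel FFT, the samples have to be interleaved so that each vector element holds one sample from each signal. The sketch below shows that layout in plain C; interleave4 and deinterleave4 are hypothetical helper names, not part of kiss FFT, and I assume the packed buffer is later loaded four floats at a time (for example with vld1q_f32).

```c
#include <stddef.h>

/* Pack four equal-length signals so that packed[4*i + k] == sig[k][i].
 * Each group of four consecutive floats then maps onto one float32x4_t. */
static void interleave4(const float *const sig[4], float *packed, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < 4; k++)
            packed[4 * i + k] = sig[k][i];
}

/* Inverse operation: split lane k of every vector back into signal k. */
static void deinterleave4(const float *packed, float *const sig[4], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < 4; k++)
            sig[k][i] = packed[4 * i + k];
}
```

The same idea with two lanes applies to float32x2_t. The transform output comes back in the same interleaved layout, so it needs the inverse step before the four results can be used separately.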

posted at: 15:33 | path: /programming | permanent link

ARM floating point performance & RunFast

The following code (from math_runfast.c) improves kiss FFT's real-to-complex transform (N=256) runtime from 1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).

void enable_runfast(void) {
#ifdef __arm__
    static const unsigned int x = 0x04086060;   // bits to keep (clears trap enables and rounding mode)
    static const unsigned int y = 0x03000000;   // bits to set: DN (bit 25) and FZ (bit 24)
    unsigned int r;
    asm volatile (
        "fmrx   %0, fpscr                       \n\t"   // r = FPSCR
        "and    %0, %0, %1                      \n\t"   // r = r & 0x04086060
        "orr    %0, %0, %2                      \n\t"   // r = r | 0x03000000
        "fmxr   fpscr, %0                       \n\t"   // FPSCR = r
        : "=&r"(r)      // early clobber: r is written before x and y are read
        : "r"(x), "r"(y) );
#endif
}

In the VFP11 coprocessor's RunFast mode there are no user exception traps, rounding behaviour is slightly different (no negative zeros), and NaNs are handled differently.

Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937
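
To see what the two constants in enable_runfast() do, the bit manipulation can be modelled in portable C. The sketch below is my reading of the ARMv7 FPSCR layout (FZ = bit 24, DN = bit 25, trap enables in bits 8 to 12 and bit 15); runfast_fpscr and is_runfast are hypothetical helpers that only work on plain integers and never touch the real register.

```c
#include <stdbool.h>
#include <stdint.h>

#define FPSCR_DN            (1u << 25)      /* default NaN mode */
#define FPSCR_FZ            (1u << 24)      /* flush-to-zero mode */
#define FPSCR_TRAP_ENABLES  0x00009F00u     /* IDE, IXE, UFE, OFE, DZE, IOE */

/* Same AND/ORR sequence as the inline assembly, applied to an ordinary integer. */
static uint32_t runfast_fpscr(uint32_t fpscr) {
    return (fpscr & 0x04086060u) | 0x03000000u;
}

/* True if the value has FZ and DN set and every exception trap disabled. */
static bool is_runfast(uint32_t fpscr) {
    return (fpscr & FPSCR_FZ) != 0
        && (fpscr & FPSCR_DN) != 0
        && (fpscr & FPSCR_TRAP_ENABLES) == 0;
}
```

Whatever the register held before, the result has flush-to-zero and default NaN switched on and all trap enable bits cleared, which matches the behaviour described above.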

posted at: 13:13 | path: /programming | permanent link

How to use ARM NEON sqrt and reciprocal approximation

This is how I use ARM NEON intrinsics to speed up division and square root operations...

#include <arm_neon.h>

// approximative quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x) {
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    
    return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}
        
// approximative quadword float square root
static inline float32x4_t sqrtv(float32x4_t x) {
    return x * invsqrtv(x);
}
            
// approximative quadword float inverse
static inline float32x4_t invv(float32x4_t x) {
    float32x4_t reciprocal = vrecpeq_f32(x);
    reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
                                
    return reciprocal;
}
                                    
// approximative quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
    float32x4_t reciprocal = vrecpeq_f32(y);
    reciprocal = vrecpsq_f32(y, reciprocal) * reciprocal;

    return x * reciprocal;
}

// sum the four lanes of a quadword float
static inline float accumv(float32x4_t x) {
    // not static: a static initializer must be a constant expression in C
    const float32x2_t f0 = vdup_n_f32(0.0f);
    return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}
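
The estimate instructions only deliver a rough approximation (roughly 8 bits, as far as I know); the vrecps/vrsqrts step instructions perform one Newton-Raphson refinement each. A scalar model in plain C makes the arithmetic explicit. The function names below are hypothetical, and the starting estimate is passed in by hand instead of coming from vrecpe/vrsqrte.

```c
/* Scalar model of vrecps: returns 2 - a*b. */
static float recps_step(float a, float b) {
    return 2.0f - a * b;
}

/* Scalar model of vrsqrts: returns (3 - a*b) / 2. */
static float rsqrts_step(float a, float b) {
    return (3.0f - a * b) * 0.5f;
}

/* One Newton-Raphson step for 1/x, starting from the estimate e. */
static float refine_recip(float x, float e) {
    return recps_step(x, e) * e;
}

/* One Newton-Raphson step for 1/sqrt(x), starting from the estimate e. */
static float refine_rsqrt(float x, float e) {
    return rsqrts_step(x * e, e) * e;
}
```

For x = 4 and a starting estimate of 0.24, one reciprocal step gives about 0.2496, so the error shrinks from 0.01 to 0.0004. Each step roughly doubles the number of correct bits, and a second step can be chained where more accuracy is needed.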

posted at: 10:39 | path: /programming | permanent link

Mon, 05 Sep 2011

Got a new monitor: Benq GL2440H

Just spent €150 on a new 24" TFT monitor (1920x1080 pixels) with VGA, DVI and HDMI input. So far I am quite happy... My tiny netbook manages to output a nice picture using the VGA input -- I am surprised.

It has LED backlight and is supposed to consume around 35 W (maybe less in eco mode).

posted at: 19:26 | path: / | permanent link

Sun, 04 Sep 2011

Reviving old hardware: Tevion MD-9458 scanner with Linux

Getting a Tevion MD-9458 USB flat-bed scanner (manufactured September 2001) to work with Ubuntu/Linux:

More information is on the SANE gt68xx-backend page.

posted at: 15:20 | path: /configuration | permanent link
