Sep 2011
All measurements are in seconds for 10000 repetitions, run on a BeagleBoard-xM clocked at 900 MHz (no RunFast) and compiled with -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.
length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
2048 | 10.22 | 11.56 | 14.2 | 1.0 | 10.92 | 16.16 | 2.82 | 2.87 |
1024 | 4.5 | 5.2 | 5.61 | 0.46 | 5.11 | 7.22 | 1.16 | 1.16 |
512 | 2.07 | 2.3 | 2.98 | 0.2 | 2.59 | 2.89 | 0.36 | 0.34 |
256 | 0.88 | 1.0 | 1.12 | 0.08 | 1.01 | 1.12 | 0.12 | 0.11 |
length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
2048 | 5.37 | - | 6.91 | 0.7 | 4.71 | 7.37 | 7.38 | 7.38 |
1024 | 2.49 | - | 3.45 | 0.32 | 2.19 | 3.14 | 3.13 | 3.13 |
512 | 1.09 | - | 1.43 | 0.2 | 1.09 | 1.2 | 1.2 | 1.2 |
256 | 0.49 | - | 0.72 | 0.08 | 0.41 | 0.46 | 0.46 | 0.47 |
oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/.
fftw3 is GPL licensed (version 3.2.2).
fftw3/neon is based on fftw 3.2.2 and has ARM/NEON patches added.
fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.
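The numbers above are wall-clock style totals over 10000 repetitions. A minimal sketch of such a timing loop is shown here; `time_repetitions` and `count_call` are hypothetical names, and `clock()` is an assumption, since the post does not say which timer was used.

```c
#include <time.h>

/* Hypothetical stand-in for one FFT call: it only counts invocations. */
static void count_call(void *arg)
{
    int *count = (int *)arg;
    ++*count;
}

/* Run fn(arg) reps times and return the elapsed CPU time in seconds. */
static double time_repetitions(void (*fn)(void *), void *arg, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; ++i)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To benchmark a real library, `count_call` would be replaced by a wrapper that executes one transform of length N on a preallocated buffer, so that setup cost stays outside the timed loop.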
posted at: 14:00 | path: /programming | permanent link
I added ARM NEON SIMD support to kiss FFT. Beware: this primarily enables 2 or 4 FFTs to run in parallel; it does not necessarily speed up a single transform (well, in fact it does).
Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
float | float (RunFast) | float32x2_t | float32x4_t |
1.62 | 1.22 | 0.66 | 0.98 |
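Running 2 or 4 FFTs in parallel means that each "sample" is a vector holding one value from each input signal, so every arithmetic step of the transform works on all signals at once. The sketch below illustrates that data layout in plain C; `vec4`, `pack_signals` and `butterfly` are hypothetical names for illustration and are not taken from the kiss FFT patch.

```c
#define LANES 4  /* float32x4_t holds four floats; float32x2_t would hold two */

/* Plain-C stand-in for a NEON vector: one sample from each of LANES signals. */
typedef struct { float lane[LANES]; } vec4;

/* Interleave LANES separate signals of length n into one vector-valued signal. */
static void pack_signals(const float *const in[LANES], vec4 *out, int n)
{
    for (int i = 0; i < n; ++i)
        for (int l = 0; l < LANES; ++l)
            out[i].lane[l] = in[l][i];
}

/* Sum/difference butterfly on vector samples: every lane runs the same
   arithmetic, so one call computes LANES independent butterflies. */
static void butterfly(vec4 *a, vec4 *b)
{
    for (int l = 0; l < LANES; ++l) {
        float s = a->lane[l] + b->lane[l];
        float d = a->lane[l] - b->lane[l];
        a->lane[l] = s;
        b->lane[l] = d;
    }
}
```

With real NEON types the inner lane loops collapse into single vector instructions, which is why the gain shows up mainly when several transforms are batched.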
posted at: 15:33 | path: /programming | permanent link
The following code (from math_runfast.c) improves
kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).
void enable_runfast()
{
#ifdef __arm__
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    int r;
    asm volatile (
        "fmrx %0, fpscr   \n\t"  // r = FPSCR
        "and  %0, %0, %1  \n\t"  // r = r & 0x04086060
        "orr  %0, %0, %2  \n\t"  // r = r | 0x03000000
        "fmxr fpscr, %0   \n\t"  // FPSCR = r
        : "=&r"(r)               // early clobber: r is written before x and y are read
        : "r"(x), "r"(y)
    );
#endif
}
In RunFast mode of the VFP11 coprocessor, there are no user exception traps, rounding behaviour is slightly different (no negative zeros), and NaNs are handled differently.
Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937
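The effect of the two constants on FPSCR can be checked without ARM hardware by replaying the same AND/OR arithmetic in portable C. In this sketch `runfast_fpscr` and the macro names are hypothetical; the bit positions in the comments (FZ at bit 24, DN at bit 25, trap enables at bits 8 to 12 and 15) follow the ARM architecture manual's FPSCR layout as I understand it.

```c
#include <stdint.h>

#define RUNFAST_KEEP_MASK 0x04086060u  /* bits that survive the AND */
#define RUNFAST_SET_BITS  0x03000000u  /* bit 24 (flush-to-zero) and bit 25 (default NaN) */

/* Compute the FPSCR value that the inline assembly would write back. */
static uint32_t runfast_fpscr(uint32_t fpscr)
{
    return (fpscr & RUNFAST_KEEP_MASK) | RUNFAST_SET_BITS;
}
```

For example, `runfast_fpscr(0)` yields `0x03000000`, and any trap-enable bits set in the input (mask `0x00009F00`) come out cleared.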
posted at: 13:13 | path: /programming | permanent link
This is how I use ARM NEON intrinsics to speed up division and square root operations...
#include <arm_neon.h>

// approximate quadword float inverse square root
static inline float32x4_t invsqrtv(float32x4_t x)
{
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    // one Newton-Raphson refinement step
    return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}

// approximate quadword float square root
static inline float32x4_t sqrtv(float32x4_t x)
{
    return x * invsqrtv(x);
}

// approximate quadword float inverse
static inline float32x4_t invv(float32x4_t x)
{
    float32x4_t reciprocal = vrecpeq_f32(x);
    // one Newton-Raphson refinement step
    reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
    return reciprocal;
}

// approximate quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y)
{
    return x * invv(y);
}

// sum the four lanes of a quadword float vector
static inline float accumv(float32x4_t x)
{
    // not static: a function call is not a constant initializer in C
    const float32x2_t f0 = vdup_n_f32(0.0f);
    return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}
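The intrinsics above pair a coarse hardware estimate with one Newton-Raphson step: `vrecpsq_f32(x, r)` computes 2 - x*r and `vrsqrtsq_f32(a, b)` computes (3 - a*b)/2. A scalar sketch of the same two steps, with hypothetical function names, shows how a rough estimate is refined:

```c
/* One Newton-Raphson step towards 1/x, starting from estimate r.
   Mirrors vrecpsq_f32(x, r) * r. */
static float recip_step(float x, float r)
{
    return r * (2.0f - x * r);
}

/* One Newton-Raphson step towards 1/sqrt(x), starting from estimate e.
   Mirrors vrsqrtsq_f32(x * e, e) * e. */
static float rsqrt_step(float x, float e)
{
    return e * ((3.0f - (x * e) * e) * 0.5f);
}
```

Starting from 0.24 as an estimate of 1/4, one step gives about 0.2496; starting from 0.48 as an estimate of 1/sqrt(4), one step gives about 0.4988. Each step roughly squares the relative error, so a second step can be added where more precision is needed.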
posted at: 10:39 | path: /programming | permanent link
Just spent €150 on a new 24" TFT monitor (1920x1080 pixels) with VGA, DVI and HDMI input. So far I am quite happy... My tiny netbook manages to output a nice picture using the VGA input -- I am surprised.
It has LED backlight and is supposed to consume around 35 W (maybe less in eco mode).
posted at: 19:26 | path: / | permanent link
Getting a Tevion MD-9458 USB flat-bed scanner (manufactured September 2001) to work with Ubuntu/Linux:
In /etc/sane.d/gt68xx.conf:
# Medion/Lifetec/Tevion/Cytron MD 9458:
override "artec-ultima-2000"
vendor "Medion"
model "MD 9458"
firmware "eplus2k.usb"
The firmware file eplus2k.usb goes in /usr/share/sane/gt68xx/.
posted at: 15:20 | path: /configuration | permanent link