pmeerw's blog
Sep 2011
All measurements in seconds for 10000 repetitions. Run on Beagleboard-XM clocked at 900 MHz (no RunFast). Compiled using -O2 -march=armv7-a -ffast-math -fPIC -mfloat-abi=softfp -mfpu=neon.
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 10.22 | 11.56 | 14.2 | 1.0 | 10.92 | 16.16 | 2.82 | 2.87 |
| 1024 | 4.5 | 5.2 | 5.61 | 0.46 | 5.11 | 7.22 | 1.16 | 1.16 |
| 512 | 2.07 | 2.3 | 2.98 | 0.2 | 2.59 | 2.89 | 0.36 | 0.34 |
| 256 | 0.88 | 1.0 | 1.12 | 0.08 | 1.01 | 1.12 | 0.12 | 0.11 |
| length (N) | ooura | djb | kiss | libav | fftw2 | fftw3 | fftw3/neon | fftw3/new |
| 2048 | 5.37 | - | 6.91 | 0.7 | 4.71 | 7.37 | 7.38 | 7.38 |
| 1024 | 2.49 | - | 3.45 | 0.32 | 2.19 | 3.14 | 3.13 | 3.13 |
| 512 | 1.09 | - | 1.43 | 0.2 | 1.09 | 1.2 | 1.2 | 1.2 |
| 256 | 0.49 | - | 0.72 | 0.08 | 0.41 | 0.46 | 0.46 | 0.47 |
oourafft (as of 2006/12/28) is free and available at http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html.
djbfft is available at http://cr.yp.to/djbfft.html.
kissfft is under BSD license and available at http://sourceforge.net/projects/kissfft/.
fftw2 is GPL licensed (version 2.1.5), available at http://www.fftw.org/.
fftw3 is GPL licensed (version 3.2.2).
fftw3/neon is based on fftw 3.2.2 and has ARM/NEON patches added.
fftw3/new is GPL licensed (version 3.3.1-beta) and has ARM/NEON support.
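To compare entries more easily, the totals in the tables can be turned into a per-transform time and a speedup ratio. The following is only a small sketch of that arithmetic; the helper names `usec_per_transform` and `speedup` are made up for illustration and are not part of any of the libraries listed.

```c
#include <assert.h>

/* Convert a table entry (seconds for `reps` transforms) into
 * microseconds per single transform. Hypothetical helper. */
static double usec_per_transform(double total_seconds, unsigned reps) {
    assert(reps > 0);
    return total_seconds * 1e6 / reps;
}

/* Ratio of two table entries measured under the same conditions;
 * values above 1 mean the candidate is faster. Hypothetical helper. */
static double speedup(double baseline_seconds, double candidate_seconds) {
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

For N=2048 in the first table, libav's 1.0 s for 10000 repetitions works out to 100 microseconds per transform, and fftw3/neon (2.82 s) is roughly 5.7 times faster than plain fftw3 (16.16 s).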
posted at: 14:00 | path: /programming | permanent link
I added ARM NEON SIMD support to kiss FFT. Beware: this primarily enables 2 and 4
parallel FFTs; it does not necessarily speed up a single transform (well, in fact it does).
Runtime for real-to-complex transform (N=256, forward and inverse transform, 10000 repetitions) in seconds:
| float | float (RunFast) | float32x2_t | float32x4_t |
| 1.62 | 1.22 | 0.66 | 0.98 |
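"Parallel FFTs" here presumably means that each vector lane carries a sample from a different, independent signal, as in kiss FFT's existing SIMD mode where the scalar type is replaced by a vector type. Under that assumption, the input has to be interleaved lane by lane before the transform. A minimal portable sketch of the packing step follows; `pack_lanes` and `LANES` are illustrative names, not kiss FFT API.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 }; /* 4 for float32x4_t; would be 2 for float32x2_t */

/* Interleave LANES independent signals of length n so that sample i of
 * signal k ends up at out[i * LANES + k], i.e. in lane k of vector i.
 * Hypothetical helper; assumes out has room for n * LANES floats. */
static void pack_lanes(const float *const signals[LANES], size_t n, float *out) {
    assert(signals != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < LANES; k++)
            out[i * LANES + k] = signals[k][i];
}
```

With this layout one pass over the butterflies processes all lanes at once, which is why the gain shows up mainly when several transforms of the same length are needed.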
posted at: 15:33 | path: /programming | permanent link
The following code (from math_runfast.c) improves
kiss FFT's real-to-complex transform (N=256) runtime from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions).
void enable_runfast(void) {
#ifdef __arm__
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    unsigned int r;
    asm volatile (
        "fmrx %0, fpscr \n\t"  // r = FPSCR
        "and %0, %0, %1 \n\t"  // r = r & 0x04086060
        "orr %0, %0, %2 \n\t"  // r = r | 0x03000000
        "fmxr fpscr, %0 \n\t"  // FPSCR = r
        : "=&r"(r)             // early clobber: r is written before x and y are read
        : "r"(x), "r"(y) );
#endif
}
In RunFast mode, the VFP coprocessor takes no user exception traps, rounding behaviour is slightly different (no negative zeros), and NaNs are handled differently.
Ideal speedup on Cortex-A8 for RunFast is reportedly 40%. There is a patch for eglibc on meego: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937
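What the two constants do to FPSCR can be checked in plain C without touching the hardware. The sketch below mirrors the and/orr sequence; the bit names follow my reading of the ARMv7 FPSCR layout (FZ is bit 24, DN is bit 25, rounding mode is bits 23:22, trap enables are bits 15 and 12:8), and the macro and function names are made up for illustration.

```c
#include <stdint.h>

#define FPSCR_KEEP_MASK 0x04086060u /* bits preserved by the AND step */
#define FPSCR_FZ        (1u << 24)  /* flush-to-zero */
#define FPSCR_DN        (1u << 25)  /* default NaN */

/* Compute the FPSCR value the inline assembly would write back,
 * given the value it read. Hypothetical helper for illustration. */
static uint32_t runfast_fpscr(uint32_t fpscr) {
    /* The AND clears rounding mode (back to round-to-nearest), vector
     * length/stride, trap enables and status flags; the OR sets FZ and DN. */
    return (fpscr & FPSCR_KEEP_MASK) | FPSCR_FZ | FPSCR_DN;
}
```

Note that `FPSCR_FZ | FPSCR_DN` equals 0x03000000, the second constant in the assembly. Whatever the previous state was, trap enables end up cleared and flush-to-zero plus default-NaN end up set.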
posted at: 13:13 | path: /programming | permanent link
This is how I use ARM NEON intrinsics to speed up division and square root operations...
#include <arm_neon.h>
// approximate quadword float inverse square root (estimate plus one Newton-Raphson step)
static inline float32x4_t invsqrtv(float32x4_t x) {
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    return vrsqrtsq_f32(x * sqrt_reciprocal, sqrt_reciprocal) * sqrt_reciprocal;
}
// approximate quadword float square root (yields NaN for x == 0, since 0 * inf is NaN)
static inline float32x4_t sqrtv(float32x4_t x) {
    return x * invsqrtv(x);
}
// approximate quadword float inverse (estimate plus one Newton-Raphson step)
static inline float32x4_t invv(float32x4_t x) {
    float32x4_t reciprocal = vrecpeq_f32(x);
    reciprocal = vrecpsq_f32(x, reciprocal) * reciprocal;
    return reciprocal;
}
// approximate quadword float division
static inline float32x4_t divv(float32x4_t x, float32x4_t y) {
    return x * invv(y);
}
// sum the four lanes of a quadword float
static inline float accumv(float32x4_t x) {
    // not static: a function call is not a constant initializer in C
    const float32x2_t f0 = vdup_n_f32(0.0f);
    return vget_lane_f32(vpadd_f32(f0, vget_high_f32(x) + vget_low_f32(x)), 1);
}
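The refinement after each estimate instruction is a Newton-Raphson step. A scalar version makes the arithmetic easier to follow; this is just a portable sketch with made-up function names, where the bracketed factors correspond to what the VRECPS and VRSQRTS step instructions compute.

```c
/* One Newton-Raphson step toward 1/x, starting from estimate r.
 * The factor (2 - x*r) is the quantity the VRECPS step returns. */
static float recip_step(float x, float r) {
    return r * (2.0f - x * r);
}

/* One Newton-Raphson step toward 1/sqrt(x), starting from estimate e.
 * The factor (3 - x*e*e)/2 is the quantity the VRSQRTS step returns. */
static float rsqrt_step(float x, float e) {
    return e * (3.0f - x * e * e) * 0.5f;
}
```

For example, with x = 3 and a rough estimate of 0.3, one call to `recip_step` gives 0.33; with x = 4 and an estimate of 0.45, one call to `rsqrt_step` gives about 0.4928 (the exact value is 0.5). For the reciprocal, a relative error e becomes e squared after one step, so the number of correct bits roughly doubles each time. Since the hardware estimates are only coarse, a single step as used above does not reach full single precision; a second step would be needed where that matters.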
posted at: 10:39 | path: /programming | permanent link
Just spent €150 on a new 24" TFT monitor (1920x1080 pixels) with VGA, DVI and HDMI input. So far I am quite happy... My tiny netbook manages to output a nice picture using the VGA input -- I am surprised.
It has LED backlight and is supposed to consume around 35 W (maybe less in eco mode).
posted at: 19:26 | path: / | permanent link
Getting a Tevion MD-9458 USB flat-bed scanner (manufactured September 2001) to work with Ubuntu/Linux:
/etc/sane.d/gt68xx.conf:
# Medion/Lifetec/Tevion/Cytron MD 9458:
override "artec-ultima-2000"
vendor "Medion"
model "MD 9458"
firmware "eplus2k.usb"
Copy the firmware file eplus2k.usb to /usr/share/sane/gt68xx/.
posted at: 15:20 | path: /configuration | permanent link