pmeerw's blog

Fri, 16 Sep 2011

ARM floating point performance & RunFast

The following code (from math_runfast.c) reduces the runtime of kiss FFT's real-to-complex transform (N=256) from
1.62 to 1.22 seconds (forward and inverse transform, 10000 repetitions), i.e. about 25% less time.

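void enable_runfast(void) {
#ifdef __arm__
    // bits to keep; clears trap enables, rounding mode, vector length/stride
    static const unsigned int x = 0x04086060;
    // bits to set: DN (bit 25, default NaN) and FZ (bit 24, flush-to-zero)
    static const unsigned int y = 0x03000000;
    unsigned int r;
    asm volatile (
        "fmrx   %0, fpscr                       \n\t"   //r = FPSCR
        "and    %0, %0, %1                      \n\t"   //r = r & 0x04086060
        "orr    %0, %0, %2                      \n\t"   //r = r | 0x03000000
        "fmxr   fpscr, %0                       \n\t"   //FPSCR = r
        : "=&r"(r)      // early clobber: r is written before x and y are read
        : "r"(x), "r"(y) );
#endif
}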
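
To see what the two constants do without ARM hardware at hand, the same AND/ORR pair can be applied to a plain integer. This is only a sketch: the helper names (runfast_fpscr, is_runfast, check_runfast_mask) are made up for illustration, and the bit positions are the FPSCR fields as I understand them from the ARM VFP documentation.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR field positions (assumed from the ARM VFP documentation). */
#define FPSCR_DN            (1u << 25)   /* default NaN mode */
#define FPSCR_FZ            (1u << 24)   /* flush-to-zero mode */
#define FPSCR_RMODE_MASK    (3u << 22)   /* rounding mode; 0 = round to nearest */
#define FPSCR_TRAP_ENABLES  0x00009F00u  /* IDE (bit 15), IXE..IOE (bits 12..8) */

/* Hypothetical helper: the same AND/ORR pair as the inline assembly,
   applied to an ordinary integer instead of the real register. */
static uint32_t runfast_fpscr(uint32_t fpscr)
{
    return (fpscr & 0x04086060u) | 0x03000000u;
}

/* Hypothetical predicate: FZ and DN set, no traps enabled,
   round-to-nearest selected. */
static int is_runfast(uint32_t fpscr)
{
    return (fpscr & FPSCR_FZ) && (fpscr & FPSCR_DN)
        && (fpscr & FPSCR_TRAP_ENABLES) == 0
        && (fpscr & FPSCR_RMODE_MASK) == 0;
}

/* Whatever the old register value was, the result is a RunFast setting. */
static void check_runfast_mask(void)
{
    assert(is_runfast(runfast_fpscr(0x00000000u)));
    assert(is_runfast(runfast_fpscr(0xFFFFFFFFu)));
    assert(runfast_fpscr(0x00000000u) == 0x03000000u);
}
```

Calling check_runfast_mask() from a test program should pass on any platform, since nothing here touches the actual FPSCR.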

In the RunFast mode of the VFP11 coprocessor there are no user exception traps, subnormal values are flushed to zero (so rounding behaviour is slightly different, with no negative zeros), and NaNs are handled differently (default NaN mode).
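
The flush-to-zero part can be imitated in software to get a feel for what changes numerically. The function below is a hypothetical emulation, not what the hardware does internally; it returns +0.0 for any subnormal input, following the "no negative zeros" remark above, and I have not verified the sign handling on every VFP variant.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical software emulation of flush-to-zero:
   subnormal values become +0.0, everything else passes through. */
static float flush_to_zero(float x)
{
    return (fpclassify(x) == FP_SUBNORMAL) ? 0.0f : x;
}
```

For example, flush_to_zero(FLT_MIN / 2.0f) gives 0.0f, while FLT_MIN itself (the smallest normal float) is returned unchanged. Code that relies on gradual underflow will see different results in RunFast mode.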

The ideal RunFast speedup on Cortex-A8 is reportedly 40%. There is a patch for eglibc in MeeGo: http://permalink.gmane.org/gmane.comp.handhelds.meego.devel/7937

posted at: 13:13 | path: /programming
