pmeerw's blog

29 Oct 2012

Mon, 29 Oct 2012

Adventures in fast 16-bit audio sample to float conversion

How to efficiently convert an audio sample in 16-bit signed integer format to a 32-bit float value on an ARM NEON CPU? And how to achieve bit-exact results?

There are several ways to do it in different projects:
s16 -> float float -> s16
PulseAudio flt = sample / (float) 0x7fff; sample = lrintf(clip_flt(flt) * 0x7fff)
libavresample flt = sample / (float) (1<<15); sample = (s16) clip_s16(lrintf(flt * (1 << 15)));
RtAudio flt = (sample + 0.5f) * (1 / 32767.5f); sample = (s16) (flt * 32767.5f - 0.5f);
clip_s16() saturates a 16-bit short integer (-32768..32767); clip_flt() returns a float -1.0..1.0.

Observations regarding PulseAudio:

  1. flt_to_s16(s16_to_flt(x)) != x for x == -32768
  2. x / (float) 0x7fff != x * (1.0f / 0x7fff) on the other hand x / (float) (1<<15) == x * (1.0f / (1<<15)) and the second form would allow to avoid division in favour of multiplication of the inverse; the problem with the first form is a slight deviation for certain input values
  3. CPUs saturation instructions cannot be directly used; float_to_s16() in PulseAudio never produces -32768

lrintf() rounds according to the current rounding mode which by default is round-toward-nearest integer, toward-even for tie breaking). For example, this means:

12.3 -> 12
12.5 -> 12 (!)
12.7 -> 13
13.3 -> 13
13.5 -> 14 (!)
13.7 -> 14
So .5 values are rounded to an even value.

Efficient float_to_s16() on ARM NEON

I'm just showing the initialization and the code in the inner loop, processing 4 samples at a time.
static void float_to_s16(const float *src, int16_t *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q2, %[two23]            \n\t"
    "vdup.f32   q3, %[scale]            \n\t"
    "vdup.u32   q4, %[mask]             \n\t"
    "vdup.f32   q5, %[mone]             \n\t"

    "vld1.32    {q0}, [%[src]]!         \n\t" /* load x */
    "vmaxq.f32  q0, q0, q5              \n\t" /* clip at -1.0 */
    "vmul.f32   q0, q0, q3              \n\t" /* scale */
    "vand.u32   q1, q0, q4              \n\t" /* get sign bit */
    "vorr.u32   q1, q1, q2              \n\t" /* put sign on 2^23 */
    "vadd.f32   q0, q1, q0              \n\t" /* sgn(x)*2^23 + x ... */
    "vsub.f32   q0, q0, q1              \n\t" /* ... - sgn(x)*2^23 */
    "vcvt.s32.f32 q0, q0                \n\t" /* convert to int */
    "vqmovn.s32  d0, q0                 \n\t" /* saturate and narrow */
    "vst1.16    {d0}, [%[dst]]!         \n\t"
                                                                       
    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [scale] "r" (32767.0f), [two23] "r" (8.3886080000e+06f), [mask] "r" (0x80000000), [mone] "r" (-1.0f) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5" /* clobber list */                                        
  );
}
Observations: So for lrintf() rounding we have: one extra vand, vor, vadd, vsub -- not bad!

Efficient s16_to_float() on ARM NEON

On we turn to the inverse operation in a similar manner. Goal is to obtain bit-extact results for the operation sample / (float) 0x7fff; without actually performing the costly division. I am not saying that this makes much sense, but hey, we can :-)

First we observe that a discrepancy (i.e. sample / (float) 0x7fff != sample * (1.0f / 0x7fff)) occurs when the binary representation of the input value converted to float ends in 0x4000 (that is, q0 & 0xffff == 0x4000after the vcvt instruction). There are 1536 such problematic values over all possible inputs.

static void s16_to_float(const int16_t *src, float *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q1, %[invscale]         \n\t"
    "vdup.u16   q3, %[mask]             \n\t"
    "vdup.u32   q4, %[one]              \n\t"

    "vld1.16    {d0}, [%[src]]!         \n\t" /* load x */
    "vmovl.s16  q0, d0                  \n\t" /* s16 -> s32 */
    "vcvt.f32.s32 q0, q0                \n\t" /* s32 -> float */
                  
    "vceq.u16   q2, q0, q3              \n\t" /* check for defect */
    "vand.u32   q2, q2, q4              \n\t" /* prepare 1 if defect */
                                  
    "vmul.f32   q0, q0, q1              \n\t" /* multiply by invscale */
    "vadd.u32   q0, q0, q2              \n\t" /* correct if defect */
    "vst1.32    {q0}, [%[dst]]!         \n\t"

    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [invscale] "r" (invscale), [mask] "r" (0x4000), [one] "r" (1) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4" /* clobber list */
  );
}
Observations: So for the exact conversion / division we have: one extra vceq, vand, vadd -- not bad!

posted at: 11:17 | path: /programming | permanent link

Made with PyBlosxom