pmeerw's blog /2012/Oct/29

How to efficiently convert an audio sample in 16-bit signed integer format to a 32-bit float value on an ARM NEON CPU? And how to achieve bit-exact results?

There are several ways to do it in different projects:

s16 -> float float -> s16

PulseAudio flt = sample / (float) 0x7fff; sample = lrintf(clip_flt(flt) * 0x7fff)

libavresample flt = sample / (float) (1<<15); sample = (s16) clip_s16(lrintf(flt * (1 << 15)));

RtAudio flt = (sample + 0.5f) * (1 / 32767.5f); sample = (s16) (flt * 32767.5f - 0.5f);

clip_s16() saturates a 16-bit short integer (-32768..32767); clip_flt() returns a float -1.0..1.0.

Observations regarding PulseAudio:

flt_to_s16(s16_to_flt(x)) != x for x == -32768
x / (float) 0x7fff != x * (1.0f / 0x7fff) on the other hand x / (float) (1<<15) == x * (1.0f / (1<<15)) and the second form would allow to avoid division in favour of multiplication of the inverse; the problem with the first form is a slight deviation for certain input values
CPUs saturation instructions cannot be directly used; float_to_s16() in PulseAudio never produces -32768

lrintf() rounds according to the current rounding mode which by default is round-toward-nearest integer, toward-even for tie breaking). For example, this means:

12.3 -> 12
12.5 -> 12 (!)
12.7 -> 13
13.3 -> 13
13.5 -> 14 (!)
13.7 -> 14

So .5 values are rounded to an even value.

Efficient float_to_s16() on ARM NEON

I'm just showing the initialization and the code in the inner loop, processing 4 samples at a time.

static void float_to_s16(const float *src, int16_t *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q2, %[two23]            \n\t"
    "vdup.f32   q3, %[scale]            \n\t"
    "vdup.u32   q4, %[mask]             \n\t"
    "vdup.f32   q5, %[mone]             \n\t"

    "vld1.32    {q0}, [%[src]]!         \n\t" /* load x */
    "vmaxq.f32  q0, q0, q5              \n\t" /* clip at -1.0 */
    "vmul.f32   q0, q0, q3              \n\t" /* scale */
    "vand.u32   q1, q0, q4              \n\t" /* get sign bit */
    "vorr.u32   q1, q1, q2              \n\t" /* put sign on 2^23 */
    "vadd.f32   q0, q1, q0              \n\t" /* sgn(x)*2^23 + x ... */
    "vsub.f32   q0, q0, q1              \n\t" /* ... - sgn(x)*2^23 */
    "vcvt.s32.f32 q0, q0                \n\t" /* convert to int */
    "vqmovn.s32  d0, q0                 \n\t" /* saturate and narrow */
    "vst1.16    {d0}, [%[dst]]!         \n\t"
                                                                       
    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [scale] "r" (32767.0f), [two23] "r" (8.3886080000e+06f), [mask] "r" (0x80000000), [mone] "r" (-1.0f) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5" /* clobber list */                                        
  );
}

Observations:

the vmaxq instruction is needed to match PulseAudio's clipping sematics (clip -1.0f to -32767 instead of -32768); otherwise, vqmovn takes care of narrowing a 32-bit signed integer to 16-bit and saturation.
the lrintf() rounding sematic (nearest integer, even tie breaking) is achieved by some a floating point trickery I have taken from the GNU C library implementation (s_lrintf()): the floating-point value of 2^23 is added and subtracted, depending on the sign of the input, in order to round. This can be done efficiently by extracting the sign bit from the input (vand) and or-ing the sign to the two23 value.

So for lrintf() rounding we have: one extra vand, vor, vadd, vsub -- not bad!

Efficient s16_to_float() on ARM NEON

On we turn to the inverse operation in a similar manner. Goal is to obtain bit-extact results for the operation sample / (float) 0x7fff; without actually performing the costly division. I am not saying that this makes much sense, but hey, we can

First we observe that a discrepancy (i.e. sample / (float) 0x7fff != sample * (1.0f / 0x7fff)) occurs when the binary representation of the input value converted to float ends in 0x4000 (that is, q0 & 0xffff == 0x4000after the vcvt instruction). There are 1536 such problematic values over all possible inputs.

static void s16_to_float(const int16_t *src, float *dst) {
  __asm__ __volatile__ (
    "vdup.f32   q1, %[invscale]         \n\t"
    "vdup.u16   q3, %[mask]             \n\t"
    "vdup.u32   q4, %[one]              \n\t"

    "vld1.16    {d0}, [%[src]]!         \n\t" /* load x */
    "vmovl.s16  q0, d0                  \n\t" /* s16 -> s32 */
    "vcvt.f32.s32 q0, q0                \n\t" /* s32 -> float */
                  
    "vceq.u16   q2, q0, q3              \n\t" /* check for defect */
    "vand.u32   q2, q2, q4              \n\t" /* prepare 1 if defect */
                                  
    "vmul.f32   q0, q0, q1              \n\t" /* multiply by invscale */
    "vadd.u32   q0, q0, q2              \n\t" /* correct if defect */
    "vst1.32    {q0}, [%[dst]]!         \n\t"

    : [dst] "+r" (dst), [src] "+r" (src) /* output operands (or input operands that get modified) */
    : [invscale] "r" (invscale), [mask] "r" (0x4000), [one] "r" (1) /* input operands */
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4" /* clobber list */
  );
}

Observations:

So vceq check for the problem condition; it sets each matching 16-bit word to 0xffff if is 0x4000.
vand just keeps the LSB in each 32-bit word, hence a 1 indicated the problem condition for each float.
Finally, vadd adds the correction bit to the multiplication result -- making multiplication by the inverse of 0x7fff identical to divsion by 0x7fff.

So for the exact conversion / division we have: one extra vceq, vand, vadd -- not bad!

posted at: 11:17 | path: /programming | permanent link

<	October 2012					>
Su	Mo	Tu	We	Th	Fr	Sa
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Mon, 29 Oct 2012

Adventures in fast 16-bit audio sample to float conversion

Efficient float_to_s16() on ARM NEON

Efficient s16_to_float() on ARM NEON

	`s16 -> float`	`float -> s16`
PulseAudio	`flt = sample / (float) 0x7fff;`	`sample = lrintf(clip_flt(flt) * 0x7fff)`
libavresample	`flt = sample / (float) (1<<15);`	`sample = (s16) clip_s16(lrintf(flt * (1 << 15)));`
RtAudio	`flt = (sample + 0.5f) * (1 / 32767.5f);`	`sample = (s16) (flt * 32767.5f - 0.5f);`