
is this API appropriate, especially for real time use #3

Open
mattetti opened this issue Jan 4, 2017 · 92 comments

Comments

@mattetti
Member

mattetti commented Jan 4, 2017

This discussion is a follow-up from this initial proposal. The two main concerns that were raised are:

  • Is this API appropriate for real-time use (especially with regard to allocation and memory size)?
  • Is the interface too big or otherwise inadequate?

@egonelbre @kisielk @nigeltao @taruti all brought up good points, and Egon is working on a counter-proposal focusing on smaller interfaces and compatibility with types commonly found in the wild (int16, float32).

As mentioned in the original proposal, I'd like this organization to act as a special interest group of people interested in doing more and better audio in Go. I have to admit my focus hasn't been real-time audio, and I very much appreciate the feedback provided. We all know this is a challenging problem which usually results in a lot of libraries doing things in very different ways. However, I do want to believe that we, as a community and with the support of the core team, can come up with a solid API for all Go audio projects.

@mattetti
Member Author

mattetti commented Jan 4, 2017

@egonelbre would you mind squashing your commits for the proposal, or maybe sending a PR? GitHub really makes it hard to comment on different parts of the code coming from different commits :(

@taruti

taruti commented Jan 4, 2017

Typically when using audio my needs have been:

  1. Read from an input source (typically system I/O + a []int16 or []float32 slice)
  2. Filter, downsample & convert to a preferred internal format (typically []float32)
  3. Do all internal processing with that type (typically []float32)
  4. Maintain as little latency as possible by keeping CPU and memory allocation (and, with that, GC) in check

@egonelbre

@mattetti sure no problem.

> Say you are designing a sample-based synthesizer (eg: Akai MPC) and your project has an audio pool it is working with. You'll want to be storing those samples in memory in the native format of your DSP path so you don't have to waste time doing conversions every time you are reading from your audio pool.

@kisielk sure, if you have such a sample-based synth you probably need to track which notes are playing, etc. anyway, so you would have a Synth node that produces float32/float64, i.e. you pay the conversion per synth, not per sample. It's not as good as no conversion, but it just means you have one less effect overall for the same performance.

@egonelbre

@mattetti Here you go: egonelbre/exp@81ba19e

@kisielk

kisielk commented Jan 4, 2017

Yes, but the "synth" is not going to be limited to one sample; usually you have some number of channels, say 8-16, and each one can choose any part of any sample to play at any time. In my opinion, processing audio in float64 is pretty niche, relegated to some high-precision or high-quality filters which aren't commonly used. Even in that case, the data can be converted to float64 for processing just within that filter block; there's little reason to store it in anything but float32 otherwise. Even then, most DSP is performed using float32 even on powerful architectures like x86, the reason being that you can do twice as much with SIMD instructions in that case.

Of course I'm totally fine with having float64 as an option for a buffer type when appropriate, but I believe that float32 should be on par. I feel like it would certainly be the primary format for any real-time applications. Even for batch processing you are likely to see performance gains from using it.

@egonelbre

egonelbre commented Jan 4, 2017

@kisielk Yes, also, for my own needs float32 would be completely sufficient.

Forums seemed to agree that in most cases float64 isn't a significant improvement. However, if one of the intended targets is writing audio plugins, then many plugin APIs include a float64 version (e.g. VST3) and DAWs have an option to switch between float32 and float64.

I agree that if only one should be chosen, then float32 seems more suitable. (Although I don't think I have the full knowledge of audio processing to say that definitively.) The only argument for float64 is that the math package works on float64, so using only float32 means there is a need for a math32 package.

@mattetti
Member Author

mattetti commented Jan 4, 2017

I agree that float32 is usually plenty, but as mentioned my problem is that the Go math package is float64 only. Are we willing to reimplement the math functions we need? It might make sense if we start doing asm optimizations, but that's quite a lot of work.

@kisielk

kisielk commented Jan 4, 2017

Again, I don't think it's a binary choice, I just think that both should have equal support within the API. And yes, if I was using Go for realtime processing of audio I would definitely want a 32-bit version of the math package. I don't think the math package needs to dictate any limitations on any potential audio API.

@mattetti
Member Author

mattetti commented Jan 4, 2017

@kisielk sounds fair, just to be clear, would you be interested in using Go for realtime processing or at least giving it a try? You obviously do that for a living using C++ so your expertise would be invaluable.

@egonelbre

egonelbre commented Jan 4, 2017

> Are we willing to reimplement the math functions we need?

How many math functions are needed in practice? Initially the package could be a wrapper around math to make it more convenient, and then start optimizing the bottlenecks. I never needed more than sin/cos/exp/abs/rand; but I've never done anything complicated either.
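
For illustration, such a wrapper could start out as thin forwarding functions (a hypothetical math32 package that just delegates to the standard math package):

// A hypothetical math32 package that simply delegates to the standard
// math package; optimized implementations could replace these later.
package math32

import "math"

func Sin(x float32) float32 { return float32(math.Sin(float64(x))) }
func Cos(x float32) float32 { return float32(math.Cos(float64(x))) }
func Exp(x float32) float32 { return float32(math.Exp(float64(x))) }
func Abs(x float32) float32 { return float32(math.Abs(float64(x))) }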

I suspect some of the first bottlenecks and candidates for "asm optimized" code will be []int16 -> []float32 conversion, buffer multiplication, and/or adding two buffers together.

@kisielk

kisielk commented Jan 4, 2017

@mattetti that is something I'm definitely interested in. I'm not exactly a DSP expert, but I work enough with it day to day to be fairly familiar with the domain.

@egonelbre Gain is also a big one that benefits from optimization. (edit: maybe that's what you meant by buffer multiplication, or did you mean convolution?)

@egonelbre

@kisielk yeah, I meant gain :), my brain's language unit seems to be severely malfunctioning today.

@taruti

taruti commented Jan 4, 2017

A math package (trigonometric, logarithmic, etc.) with float32 support and SIMD optimization for any data type are two different things. In many cases just mult/add/sub/div are needed, and for those the math package is not needed.

I think that math32 and SIMD are best kept separate from this proposal.

If we are thinking of performance, then converting buffers without needing to allocate can be important. For example, have one input buffer and one output buffer for the conversion, instead of allocating a new output buffer each time.
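
A sketch of that reuse pattern (the function name is made up for illustration):

// ConvertInt16ToFloat32 fills dst from src without allocating, scaling
// samples to the [-1, 1) range. It returns the number of samples written,
// the minimum of len(dst) and len(src).
func ConvertInt16ToFloat32(dst []float32, src []int16) int {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	for i := 0; i < n; i++ {
		dst[i] = float32(src[i]) / 32768
	}
	return n
}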

@kisielk

kisielk commented Jan 4, 2017

@taruti +:100:

@kisielk

kisielk commented Jan 4, 2017

Speaking of conversion between buffers, I think it's important that the API has a way to facilitate conversion between buffers of different data types and sizes without allocation (e.g. 2 channels to 1, etc.). The actual conversion method would be determined by the application, but at least the API should be able to help facilitate this without too much additional complexity.
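
For instance, a 2-to-1 channel downmix with a caller-provided destination could be as simple as this (hypothetical helper, interleaved stereo assumed):

// DownmixStereoToMono averages interleaved stereo frames from src into dst
// and returns the number of mono samples written; dst is provided by the
// caller so no allocation happens here.
func DownmixStereoToMono(dst, src []float32) int {
	frames := len(src) / 2
	if len(dst) < frames {
		frames = len(dst)
	}
	for i := 0; i < frames; i++ {
		dst[i] = (src[2*i] + src[2*i+1]) * 0.5
	}
	return frames
}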

@mattetti
Member Author

mattetti commented Jan 4, 2017

Alright, here is my suggestion. I'll add you guys to the organization and we can figure out an API for real-time processing and from there see how it works for offline use. Ideally I would love to end up with:

  • a generic audio API (what we are discussing here)
  • a list of codecs (I started with wav and aiff, they still need work and refinement but they work)
  • a set of transforms (gain, dynamics, EQ, LFOs)
  • analyzers (FFT and things like chromagrams, key, onset detectors...)
  • generators

@rakyll and I also discussed adding wrappers to things like CoreAudio on Mac so we could have an end-to-end experience without having to rely on things like portaudio. This is outside the scope of what I have in mind, but I figured I should mention it.

I like designing APIs against real usage, so maybe a good first step is to define an example we would like to build and from there define the components we need. Thoughts?

@kisielk

kisielk commented Jan 4, 2017

That sounds like a good idea to me. However I would propose we limit the scope of the core audio package to the first two points (and perhaps a couple of very general utilities from point 3). I feel like the rest would be better suited for other packages. My main reasoning behind this is that I feel like the first two items can be achieved (relatively) objectively and there can be one canonical implementation. As you go down the list it becomes increasingly application-dependent.

@mattetti
Member Author

mattetti commented Jan 4, 2017

I think the audio API should be in its own package and each of those things in separate packages. For instance I have the wav and aiff packages isolated. That's another reason why having a GitHub organization is nice.

@kisielk

kisielk commented Jan 4, 2017

Just noticed that when looking at the org page. Looks good to me 👍

@nigeltao

nigeltao commented Jan 5, 2017

There's the original proposal. @egonelbre has an alternative proposal. Here are a couple more (conflicting) API ideas for a Buffer type. I'm not saying that either of them is any good, but there might be a useful core in there somewhere. See also another API design in the github.com/azul3d/engine/audio package.

Reader/Writer-ish:

type Buffer interface {
	Format() Format

	// The ReadFrames and WriteFrames methods are roughly analogous to bulk
	// versions of the Image.At and Image.Set methods from the standard
	// library's image and image/draw packages.

	// ReadFrames converts that part of the buffer's data in the range [offset
	// : offset + n] to float32 samples in dst[:n], and returns n, the minimum
	// of length and the number of samples that dst can hold.
	//
	// offset, length and n count frames, not samples (slice elements). For
	// example, stereo audio might have two samples per frame. To convert
	// between a frame count and a sample count, multiply or divide by
	// Format().SamplesPerFrame().
	//
	// The offset is relative to the start of the buffer, which is not
	// necessarily the start of any underlying audio clip.
	//
	// The n returned is analogous to the built-in copy function, where
	// copy(dst, src) returns the minimum of len(dst) and len(src), except that
	// the methods here count frames, not samples (slice elements).
	//
	// Unlike the io.Reader interface, ReadFrames should read (i.e. convert) as
	// many frames as possible, rather than returning short. The conversion
	// presumably does not require any further I/O.
	//
	// TODO: make this return (int, error) instead of int, and split this into
	// audio.Reader and audio.Writer interfaces, analogous to io.Reader and
	// io.Writer, so that you could write "mp3.Decoder(anIOReader)" to get an
	// audio.Reader?
	ReadFrames(dst []float32, offset, length int) (n int)

	// WriteFrames is like ReadFrames except that it converts from src to this
	// Buffer, instead of converting from this Buffer to dst.
	WriteFrames(src []float32, offset, length int) (n int)
}

type BufferI16 struct {
	Fmt  Format
	Data []int16
}

type BufferF32 struct {
	Fmt  Format
	Data []float32
}
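
(For illustration only, a rough sketch of how BufferI16 might satisfy ReadFrames; it assumes Format has a SamplesPerFrame method, as the comments above imply, and is not part of the proposal itself.)

func (b *BufferI16) Format() Format { return b.Fmt }

// ReadFrames converts int16 samples to float32 in the [-1, 1) range.
// offset and length count frames; n is the number of frames converted.
func (b *BufferI16) ReadFrames(dst []float32, offset, length int) (n int) {
	spf := b.Fmt.SamplesPerFrame()
	src := b.Data[offset*spf:]
	n = length
	if max := len(src) / spf; n > max {
		n = max
	}
	if max := len(dst) / spf; n > max {
		n = max
	}
	for i := 0; i < n*spf; i++ {
		dst[i] = float32(src[i]) / 32768
	}
	return n
}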

Have Buffer be a concrete type, not an interface type:

type Buffer struct {
	Format Format

	DataType DataType

	// The DataType field selects which slice field to use.
	U8  []uint8
	I16 []int16
	F32 []float32
	F64 []float64
}

type DataType uint8

const (
	DataTypeUnknown DataType = iota
	DataTypeU8_U8
	DataTypeU8_I16BE
	DataTypeU8_I16LE
	DataTypeU8_F32BE
	DataTypeU8_F32LE
	DataTypeI16
	DataTypeF32
	DataTypeF64
)

@mattetti
Member Author

mattetti commented Jan 5, 2017

In addition, here is another comment from @nigeltao about the math library:

> As for a math32 library, I'm not sure if it's necessary. It's slow to call (64-bit) math.Sin inside your inner loop. Instead, I'd expect to pre-compute a global sine table, such as "var sineTable = [4096]float32{ etc }". Compute that table at "go generate" time, and you don't need the math package (or a math32 package) at run time.

I really like this idea, which can also apply to log. It might come at an extra memory cost, but I am personally OK with that.
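
A minimal sketch of the idea (computing the table at init time here instead of via go generate, purely for illustration):

import "math"

// sineTable holds one period of a sine wave. In practice the literal would
// be emitted by a go:generate step; computing it in init keeps this sketch
// short.
var sineTable [4096]float32

func init() {
	for i := range sineTable {
		sineTable[i] = float32(math.Sin(2 * math.Pi * float64(i) / float64(len(sineTable))))
	}
}

// Sin32 returns an approximation of sin(2*pi*phase) for phase in [0, 1).
func Sin32(phase float64) float32 {
	return sineTable[int(phase*float64(len(sineTable)))&(len(sineTable)-1)]
}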

Let's try to summarize the pros and cons of those different approaches and discuss what we value and the direction we want to take. I am now convinced that my initial proposal, while fitting my needs, doesn't work well in other scenarios and shouldn't be left as is.

@nigeltao

nigeltao commented Jan 5, 2017

A broader point, re the proposal to add packages to the Go standard library or under golang.org/x, is that I think it is too early to say what the 'right' API should be just by looking at an interface definition. As rsc said on golang/go#18497 (comment): "The right way to start is to create a package somewhere else (github.com/go-audio is great) and get people to use it. Once you have experience with the API being good, then it might make sense to promote to a subrepo or eventually the standard library (the same basic path context followed)." Emphasis added.

The right way might actually involve letting a hundred API flowers bloom, and trying a few different APIs before making a push for any particular flower.

I'd certainly like to see more experience with how audio codecs fit into any API proposal: how does the Buffer type (whatever it is) interact with sources (which can block on I/O, e.g. playing an mp3 stream over the network) and sinks (which you don't want to glitch)?

WAV and AIFF are a good start, but handling some sort of compressed audio would be even better. A full-blown mp3 decoder is a lot of work, but as far as kicking API tyres goes, it might suffice to write a decoder for a toy audio codec where "c3d1e3c1e2c2e4" decodes to "play a C sine wave for 3 seconds, D for 1 second, E for 3 seconds, etc.", i.e. to play a really ugly version of "doe a deer".

@nigeltao

nigeltao commented Jan 5, 2017

Back on API design brainstorming and codecs, there might be some more inspiration in the golang.org/x/text/encoding/... and golang.org/x/text/transform packages, which let you e.g. convert between character encodings like Shift JIS, Windows 1252 and UTF-8.

Text encodings are far simpler than audio codecs, though, so it might not end up being relevant.

@kisielk

kisielk commented Jan 5, 2017

Some more API inspiration, from C++:

https://www.juce.com/doc/classAudioBuffer
https://www.juce.com/doc/classAudioProcessor

JUCE is one of the most-used audio processing libraries out there.

@kisielk

kisielk commented Jan 5, 2017

Obviously the API isn't very go-like since it's C++ (and has a fair amount of pre-C++11 legacy, though is gradually being modernized) but it's worth taking a look at how they put things together.

@mattetti
Member Author

mattetti commented Jan 5, 2017

JUCE uses overloading quite heavily and, as mentioned, isn't very Go-like (it's also a framework more than a suite of libraries, but it is well written and very popular). My hope is that we can come up with a more modern and accessible API instead of a "port"; I would really want audio in Go to be much easier for new developers. On a side note, I did port over some parts of JUCE such as https://www.juce.com/doc/classValueTree for better interop with audio plugins.

@kisielk

kisielk commented Jan 5, 2017

I'm not suggesting porting it, but I think the concepts in the library are pretty well thought out and cover most of what you would want to do with audio processing. It's worth getting familiar with. I don't think the use of overloading really matters, it's pretty easy to do that in other ways with Go.

@mattetti
Member Author

mattetti commented Jan 5, 2017

@nigeltao I agree with rsc and to be honest my goal was more to get momentum than to get the proposal accepted. I'm very happy to have found a group of motivated people who are interested in tackling the same issue.

I'll open a couple issues to discuss code styling and "core values" of this project.

@egonelbre

@mattetti I just converted my experiment to use split channels; the full difference can be seen here: egonelbre/exp@8c77c79?diff=split#diff-b1e66adfee4cfc554526b30559e7e612

@mattetti
Member Author

I'm back at looking at what we can do to design a generic buffer API. @egonelbre I don't think isolating channels is the way to go; we can more than likely implement an API similar to your channel(), passing an iteration function that will be fed every sample in a channel, for instance. From everything I've seen so far, samples are always interleaved (that said, I'm sure there are counter-examples).

I did hit an issue today when I needed to feed a PCM data chunk as int32 or float32 and my current API was only providing int. So I'm going to explore the image and text packages to see if there is a good, flexible solution there. I looked at azul3d, which is quite well done, but I'm not a fan of their buffer/slice implementation: https://github.com/azul3d/engine/blob/master/audio/slice.go

@mattetti
Member Author

Taking notes:

Text transformer & chain interfaces: https://godoc.org/golang.org/x/text/transform#Transformer (similar to what an audio transformer interface could be but passing a buffer)

Regarding the draw package, it starts from an Image interface:

// Image is a finite rectangular grid of color.Color values taken from a color
// model.
type Image interface {
	// ColorModel returns the Image's color model.
	ColorModel() color.Model
	// Bounds returns the domain for which At can return non-zero color.
	// The bounds do not necessarily contain the point (0, 0).
	Bounds() Rectangle
	// At returns the color of the pixel at (x, y).
	// At(Bounds().Min.X, Bounds().Min.Y) returns the upper-left pixel of the grid.
	// At(Bounds().Max.X-1, Bounds().Max.Y-1) returns the lower-right one.
	At(x, y int) color.Color
}

The interface is implemented by many concrete types such as:

// NRGBA is an in-memory image whose At method returns color.NRGBA values.
type NRGBA struct {
	// Pix holds the image's pixels, in R, G, B, A order. The pixel at
	// (x, y) starts at Pix[(y-Rect.Min.Y)*Stride + (x-Rect.Min.X)*4].
	Pix []uint8
	// Stride is the Pix stride (in bytes) between vertically adjacent pixels.
	Stride int
	// Rect is the image's bounds.
	Rect Rectangle
}

func (p *NRGBA) ColorModel() color.Model { return color.NRGBAModel }

func (p *NRGBA) Bounds() Rectangle { return p.Rect }

func (p *NRGBA) At(x, y int) color.Color {
	return p.NRGBAAt(x, y)
}

or

// Paletted is an in-memory image of uint8 indices into a given palette.
type Paletted struct {
	// Pix holds the image's pixels, as palette indices. The pixel at
	// (x, y) starts at Pix[(y-Rect.Min.Y)*Stride + (x-Rect.Min.X)*1].
	Pix []uint8
	// Stride is the Pix stride (in bytes) between vertically adjacent pixels.
	Stride int
	// Rect is the image's bounds.
	Rect Rectangle
	// Palette is the image's palette.
	Palette color.Palette
}

func (p *Paletted) ColorModel() color.Model { return p.Palette }

func (p *Paletted) Bounds() Rectangle { return p.Rect }

func (p *Paletted) At(x, y int) color.Color {
	if len(p.Palette) == 0 {
		return nil
	}
	if !(Point{x, y}.In(p.Rect)) {
		return p.Palette[0]
	}
	i := p.PixOffset(x, y)
	return p.Palette[p.Pix[i]]
}

A GIF image is implemented as:

type GIF struct {
	Image     []*image.Paletted
        //...
}

When drawing using the generic Image interface, a type switch is used to optimize the flow:
https://golang.org/src/image/draw/draw.go?s=2824:2903#L114

But there is always a slow fallback.
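
Translated to audio, the same pattern could look roughly like this (Buffer, BufferF32 and BufferI16 refer to the hypothetical types sketched earlier in this thread; the slow fallback is elided):

// Gain applies a gain factor to any Buffer, with fast paths for the concrete
// types we know about and a generic fallback for everything else.
func Gain(buf Buffer, gain float32) {
	switch b := buf.(type) {
	case *BufferF32:
		for i := range b.Data {
			b.Data[i] *= gain
		}
	case *BufferI16:
		for i := range b.Data {
			b.Data[i] = int16(float32(b.Data[i]) * gain)
		}
	default:
		// Slow fallback through the generic interface,
		// e.g. ReadFrames into a scratch []float32 and WriteFrames back.
	}
}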

@kisielk

kisielk commented Feb 15, 2017

@mattetti that's a good find.

Maybe the audio interface could have functions that return/set values for a particular (channel, sample) pair as either float64 or int. Then the underlying data could be in a more optimized form, and functions that need the highest performance can use a type switch and operate on the data directly.

@kisielk

kisielk commented Feb 15, 2017

I'm thinking something like:

type Buffer interface {
	Size() Format                     // returns number of channels and samples, not sure of the naming
	ReadInt(ch, sample int) int       // Maybe read into a passed-in buffer instead?
	ReadFloat(ch, sample int) float64 // ditto
}

@egonelbre

egonelbre commented Feb 15, 2017

> From everything I've seen so far, samples are always interleaved (that said I'm sure there are counter examples).

There seem to be two uses:

  1. reading & writing... buffer usually interleaved. e.g. most audio formats and devices.
  2. processing... buffer usually split. e.g. VST, AU (V2), WebAudio, AAX, RTAS, JUCE

With callbacks and call per sample, the overhead is an issue; if it could be inlined and better optimized, it would make some things much nicer.

For case 1, you don't always have random access, or it is expensive for compressed streams. So building a buffer with random, sample-based access doesn't make sense... I don't think random access is necessary there; you want to read or write a chunk of samples. Interleaved when you want immediate output or basic processing; deinterleaved for more complicated things.

For case 2, you want deinterleaved buffers to make processing simpler. To ensure that processing nodes can communicate, you want at most three different buffer formats that work with it.

The case against not doing swizzling when reading/writing is performance... but when you want performance, in those cases you probably need to use the native format anyway, which might be uint16... but then there might also be issues with sample rate or mono/stereo conversions.
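
For concreteness, the two layouts and a conversion between them could look like this (stereo assumed; helper name made up):

// Interleaved stereo:   [L0, R0, L1, R1, ...]
// Deinterleaved stereo: left = [L0, L1, ...], right = [R0, R1, ...]

// Deinterleave splits an interleaved stereo buffer into two caller-provided
// per-channel buffers; left and right must each hold len(src)/2 samples.
func Deinterleave(left, right, src []float32) {
	for i := 0; i < len(src)/2; i++ {
		left[i] = src[2*i]
		right[i] = src[2*i+1]
	}
}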

@mattetti
Member Author

mattetti commented Feb 15, 2017

I was apparently wrong:

> The buffer contains data in the following format: non-interleaved IEEE754 32-bit linear PCM with a nominal range between -1 and +1, that is, 32bits floating point buffer, with each samples between -1.0 and 1.0. If the AudioBuffer has multiple channels, they are stored in separate buffer.

But the API clearly exposes a way to get the channel count: AudioBuffer.numberOfChannels. You can also decode a stereo track into a buffer, so I was super confused at first. OK, I'm not confused anymore: the documentation is misleading. Web Audio stores the per-channel PCM data in separate internal buffers accessible through getChannelData(channel), confirming what you were saying.

// Stereo
var channels = 2;

// Create an empty two second stereo buffer at the
// sample rate of the AudioContext
var frameCount = audioCtx.sampleRate * 2.0;
var myArrayBuffer = audioCtx.createBuffer(channels, frameCount, audioCtx.sampleRate);

button.onclick = function() {
  // Fill the buffer with white noise;
  // just random values between -1.0 and 1.0
  for (var channel = 0; channel < channels; channel++) {
    // This gives us the actual array that contains the data
    var nowBuffering = myArrayBuffer.getChannelData(channel);
    for (var i = 0; i < frameCount; i++) {
      // Math.random() is in [0; 1.0]
      // audio needs to be in [-1.0; 1.0]
      nowBuffering[i] = Math.random() * 2 - 1;
    }
  }

  // Get an AudioBufferSourceNode.
  // This is the AudioNode to use when we want to play an AudioBuffer
  var source = audioCtx.createBufferSource();

  // set the buffer in the AudioBufferSourceNode
  source.buffer = myArrayBuffer;

  // connect the AudioBufferSourceNode to the
  // destination so we can hear the sound
  source.connect(audioCtx.destination);

  // start the source playing
  source.start();

}

@mattetti
Member Author

mattetti commented Mar 1, 2017

Quick update: this interleaved vs. non-interleaved issue got me stuck. I instead opted to do a lot of work on my wav and aiff decoders, making sure the buffered approach worked and was documented/had examples. I spent a decent amount of time improving the 24-bit audio support for the codecs (my implementation was buggy but hard to verify; tests were added).

At this point, I still think Nigel's image-package approach is the most interesting, but I don't have the bandwidth to build a full implementation. Egonelbre's implementation shows some of the challenges we are facing depending on the design decisions we take.

I'll focus on my current edge cases and real world implementation to see how a generic API would best benefit my own usage.

On a different note, @brettbuddin wrote a very interesting synth in Go and I think he would be a good addition to this discussion: https://github.com/brettbuddin/eolian

@brettbuddin

Took me a bit to get caught up on the state of the discussion.

One of the early decisions I made with Eolian was to focus on a single channel of audio. This is mostly because I didn't want to deal with interleaving in processing and I didn't have a decent implementation for keeping channels separate at the time. I've wanted to implement 1-to-2 (and back) conversion modules for some time now, but have been stuck in a similar mode of trying to decide how not to disrupt the current mono structure too drastically.

@egonelbre

While implementing different things I realized there is one important case where int16 is preferable -- ARM -- or mobile devices in general. Looking at pure op stats there's around a 2x performance difference between using float32 and int16. Of course, I'm not sure how directly these measurements translate into audio code, but it is something to be considered.

@egonelbre

Yup, interleaved is faster to process... I was concerned about whether using buffer[i+1] etc. would introduce a bounds check in the tight loop. But it seems it does not: https://play.golang.org/p/EkNPEjU3bS

BenchmarkInterleaved-8                  10000000               144 ns/op
BenchmarkInterleavedVariable-8           5000000               295 ns/op
BenchmarkInterleaved2-8                 10000000               151 ns/op
BenchmarkDeinterleaved-8                10000000               219 ns/op

It seems that the optimizer can deal with it nicely. Disabling bounds checks had no effect on the results. However, using a "variable" number of channels has quite an impact.
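
The fixed-channel vs. variable-channel difference presumably comes down to loops along these lines (simplified sketch, not the exact benchmark code):

// Fixed stride: the channel count is a constant the compiler can exploit.
func gainStereo(buf []float32, gain float32) {
	for i := 0; i+1 < len(buf); i += 2 {
		buf[i] *= gain
		buf[i+1] *= gain
	}
}

// Variable channel count: the inner loop bound is only known at run time.
func gainN(buf []float32, channels int, gain float32) {
	for i := 0; i+channels <= len(buf); i += channels {
		for c := 0; c < channels; c++ {
			buf[i+c] *= gain
		}
	}
}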

Still, it is more convenient to write code for deinterleaved :D. Maybe there is a nice way to automatically generate code, based on a single-channel example, for different types and channel counts?

Superpowered has chosen to support only stereo which makes things much easier and as far as I understand it only supports float32.

@brettbuddin
Copy link

brettbuddin commented Mar 30, 2017

Is there a case where you wouldn't be aware of how many channels of audio there are in your stream? The variable method doesn't seem all that valuable to me.

Edit: Nevermind. I misread the example. Disregard this question.

@faiface
Copy link

faiface commented Apr 29, 2017

(Reposting from Pixel gitter).

Just a quick point here from me (got some work right now, will come with more ideas later). Looking at go-audio, I don't really like the core Buffer interface at all. It's got way too many methods, none of which enables reading data from it without converting it to one of those 3 standard formats. Which makes it kind of pointless to use any other format.

Which is very bad, because a playback library might want to use a different format.

Which would require extensive conversions for every piece of audio data, maybe even 2 conversions for each piece, and that's a lot of mess.

@faiface
Copy link

faiface commented Jul 14, 2017

Hi guys,

as you may have noticed on Reddit, a month ago we started working on an audio package in Pixel and we want to make it a separate library.

Throughout the month, we learned numerous things. One of the most important things we learned is: buffers in the high-level API are not a very good idea for real-time. Why? Management, swapping them, queueing them, unqueueing them, and so on, cost significant CPU resources. OpenAL uses buffers and we weren't able to get less than 1/20s latency, which is quite high.

Also, buffers don't work very well. They don't compose, it's hard to build nice abstractions around them. They're simply data.

In Pixel audio, we chose a different approach. We built the whole library (not that it's finished) around this abstraction:

type Streamer interface {
    Stream(samples [][2]float64) (n int, ok bool)
    Err() error
}

The Streamer interface is very similar to io.Reader, except that it's optimized for audio. For a full explanation and documentation of the rest of the API, read here.

The thing is, this abstraction is extremely flexible and composes very well, just like io.Reader. Also, it's suitable for real-time audio processing (you can try for yourself, the library works).
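
For example, a source is just a type implementing Stream; a white-noise generator might look like this (illustrative sketch only, not code from the package):

import "math/rand"

// noise is a Streamer that produces white noise forever, showing how a
// source fits the interface.
type noise struct{}

func (noise) Stream(samples [][2]float64) (n int, ok bool) {
	for i := range samples {
		samples[i][0] = rand.Float64()*2 - 1
		samples[i][1] = rand.Float64()*2 - 1
	}
	return len(samples), true
}

func (noise) Err() error { return nil }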

I'd like to suggest replacing current go-audio with the audio package from Pixel. I know this is a big suggestion, but I think it would be worth it for go-audio. In case you don't want to do that, I'd at least suggest you change your abstractions away from buffers and more towards this streaming approach.

Thanks!

If you have any questions, please ask!

@mattetti
Member Author

I wasn't aware of this new audio package. I'll definitely check it out. One thing that surprised me is the fact that your streamer interface seems to be locked to stereo. Is that correct?

@faiface

faiface commented Jul 14, 2017

Yes, that is correct. This is no problem for mono; mono can always be easily turned into stereo. It is only a problem for more channels than stereo offers. I'm open to solutions for this. So far the assumption has been "stereo is enough", but that might not be true.

@faiface

faiface commented Jul 14, 2017

The idea of replacing go-audio was a bit too rushed; upon second thought, it's probably not a good idea. However, I still suggest that go-audio changes its core interfaces away from buffers and towards streaming.

@mattetti
Member Author

We definitely need to have support for surround/multichannel audio formats. Let me take a look next week and write down my thoughts and see if we can go from there.

@egonelbre

@faiface I'm unclear about the comment that "buffers in the high-level API are not a very good idea for real-time". I mean, your design still uses buffers, i.e. the samples [][2]float64 parameter.

I do agree that the end user of things like players, effects, decoding, streaming, and encoding shouldn't have to worry about buffers. But internal handling of buffers is unfortunate, yet necessary. E.g. when you get a 16-track audio sequencer, you will most likely need some way to generate/mix the tracks separately, because one might cause ducking on another track.

Note that go-audio/audio is not the only attempt at an audio lib, e.g. https://github.com/loov/audio.

I will try to go over the package in more detail later, but some preliminary comments based on seeing the API:

  1. There can be N channels.
  2. There can be multiple SampleRate-s in your input/processing/output.
  3. There can be multiple sample types e.g. int16, float32, float64 come to mind.

I do agree that as a simple game audio package it will be sufficient... And not handling the diversity can be a good trade-off.

@faiface

faiface commented Jul 14, 2017

@egonelbre Of course, internally there usually are some buffers; however, this API minimizes their use. This is in contrast to OpenAL, which requires you to create and fill a buffer if you want to play anything. This results in additional copying between buffers and so on. Maybe that can be implemented efficiently; however, OpenAL does not implement it efficiently enough. Note that when calling streamer.Stream(buffer), no additional buffer needs to be created. This is not the case for the OpenAL way of handling buffers. We switched to an ALSA backend for Linux and that enabled us to have millisecond(-ish) latency. Data is only copied to temporary on-stack buffers when we need to do mixing; other than that, data is copied directly to the lowest level.

Regarding channels, yeah, that's a trade-off.

Regarding sample rates: yes, there can be multiple sample rates. The reason we adopted the approach of a "unified sample rate" is that it simplifies signal processing code in major ways. For example, if the Stream method took an additional format argument, it would always need to handle conversions between formats. This would not only result in a lot more code, it would also result in worse performance.

However, unified sample rate is not a problem, IMHO. Audio sources, such as files, can always be resampled on-the-fly, or the sample rate can be adjusted according to them. The unified sample rate is only important for the intermediate audio processing stage.

The same holds for different sample types. Any sample type can be converted to float64. If the audio source contains int16 samples, it's no problem. They simply get converted to float64 when streaming.

@egonelbre
Copy link

Yup, I agree that within a single pipeline you will most likely use a single sample rate. However, a server that does audio processing in multiple threads might need different sample rates. But I'm also not completely clear how big of a problem this is in the real world.

With regards to float64. On ARM there's a (potential) performance hit that you take due to processing float64 instead of float32 or int16.

I understand very well that handling all these cases complicate the implementation. (To the extent that for a game, I will probably just use stereo float32, myself.)

With regard to performance, I would like to specialize on all the different parameters. Effectively, to have a type Buffer_SampleRate44100_Stereo struct and get all the effects/filters implemented automatically for all the variations and with SIMD (as much as possible) -- but I still don't have a good idea how it would look in practice. This might be an unrealistic goal, but definitely something to think about.

I do have some thoughts: 1. code-gen and the Nile stream-processing language for defining effects; 2. optimization passes similar to the Go compiler's SSA that generate SIMD code; 3. a single interfaced package that is backed by multiple implementations; etc.

But, generally, every decision has trade-offs -- whether you care about the "trade-off" is dependent on the domain and your use-cases -- and it's completely fine for some domain not to care about some of these trade-offs.

@faiface

faiface commented Jul 15, 2017

I believe that you and I agree that having to implement each filter/effect/compositor for each sampling format (mono, stereo, int8, int16, float32, ...) is awful. So yeah, one way probably is code generation, although I'm not sure how feasible this is.

The question I think should be asked is: is it worth supporting all the formats within the audio processing stage? I decided to answer this question with: no. Of course, it's necessary to support all the formats for decoding and encoding audio files. But I think it ends there.

Let me show you. Here's the implementation of the "gain effect" (a non-sophisticated volume slider): https://github.com/faiface/beep/blob/master/effects.go#L8. Now, the important thing is: this is all I had to write. And it works with everything. I can use it to adjust the volume of music, sound effects, individual sounds, and I can change the volume in real-time. If I were to support all the different formats in the audio processing stage, I can't really see how I would achieve this level of convenience. And convenience like this makes it possible to implement all of the complicated effects, such as 3D sound and Doppler effect things, in very few lines of code.
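
For illustration, a gain decorator against the Streamer interface can be roughly this small (a sketch in the spirit of the linked code, not a verbatim copy of it):

// gain wraps any Streamer and scales its samples; because the wrapper is
// itself a Streamer, it composes with everything else in the pipeline.
type gain struct {
	wrapped Streamer
	factor  float64
}

func (g *gain) Stream(samples [][2]float64) (n int, ok bool) {
	n, ok = g.wrapped.Stream(samples)
	for i := 0; i < n; i++ {
		samples[i][0] *= g.factor
		samples[i][1] *= g.factor
	}
	return n, ok
}

func (g *gain) Err() error { return g.wrapped.Err() }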

Beep is currently less than 1K LOC (not counting https://github.com/hajimehoshi/oto, which implements the actual playback backend) and already supports loading WAV files, mixing, sequencing, controlling playback (pausing, tracking time), playing only parts of an audio file or any streamer, and makes it easy to create your own streamers and effects.

I'm sorry if I sounded like I wanted to destroy go-audio before :). I eventually came to the conclusion that it's best to keep Beep and go-audio separate. I just want to point out one way of doing audio and show its benefits. And you guys can take inspiration from it, or not. No problem there.

EDIT: And I don't see why a server with multiple threads could possibly need to use different sample rates anywhere.

@mattetti
Member Author

mattetti commented Jul 15, 2017 via email

@egonelbre

I completely understand the reasoning for not supporting the different options I described. I think it's important to examine the potential design-space as fully as possible.

Note: the Gain function you have there can produce a clicking noise, i.e. try switching between gain 0 and 1 every ~100ms. Not sure where the gain value comes from, how it's modified, and how big your buffers are... so it might not happen in practice. And, oh yes, I know all the pain of writing the processor code for multiple formats/options: https://github.com/loov/audio/blob/master/example/internal/effect/gain.go#L20. It avoids some of the aliasing, but can still have problems with it. Also, it doesn't do control-signal input. But I digress.
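
One common fix is to ramp the gain across the buffer instead of applying the new value instantly (a sketch, not code from either library):

// applyGainRamp interpolates linearly from prevGain to targetGain across one
// buffer, avoiding the click caused by an instantaneous jump in level.
func applyGainRamp(buf []float32, prevGain, targetGain float32) {
	n := len(buf)
	for i := 0; i < n; i++ {
		t := float32(i) / float32(n)
		buf[i] *= prevGain + (targetGain-prevGain)*t
	}
}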

Do note, this is also the reason we keep go-audio/audio and loov/audio separate -- so we can experiment independently and come up with unique solutions. (e.g. I noticed this in the wave decoder: https://github.com/faiface/beep/blob/master/wav/decode.go#L125 -- look at a similar issue I posted against go-audio/wav: go-audio/wav#5 (comment))

Eventually we can converge on some of the issues... or maybe we will create multiple packages for different domains, but there is still code that could potentially be shared between all the implementations (e.g. ASM/SIMD code on []float32/[]float64 arrays for different platforms, or converting a MULAW byte stream to []float64).

> EDIT: And I don't see why a server with multiple threads could possibly need to use different sample rates anywhere.

A contrived example: imagine a resampling service where you submit a wave file and specify the resulting sample rate, and later download the result.

@faiface

faiface commented Jul 15, 2017

One of the reasons why I concluded that beep is not a good fit for go-audio is that I deliberately make compromises to enable simplicity and remove the pain of audio programming, but that gets in the way of universality, so we're aiming at slightly different goals.

Regarding clicking in Gain, I don't think it's Gain's responsibility to fix that. You know, if you start playing PCM data in the middle of a wave (at value 0.4, for example), a click is what's supposed to happen. I think it's the user's responsibility to adjust the gain value smoothly, e.g. by lerping. And the buffer size can be as small as 1/200s (only on Linux at the moment, but we're working on getting the latency down on other platforms too).

The decoding design, that's interesting, although I think WAVE only supports uint8, int16 and float32, right? So I'm not sure it's worth it, but I'll think about it.

And the resampling server: if you take a look at our roadmap, two of the things to be done are a Buffer struct and a Resample decorator. Don't be confused: the Buffer struct is more like a bytes.Buffer for samples and less like an OpenAL buffer. So the sample-rate conversion will be done something like this:

buf := beep.NewBuffer(fmt2)
buf.Collect(beep.Resample(source, fmt1.SampleRate, fmt2.SampleRate))

or even directly to the file (in which case the file is never fully in memory, note that wav.Encode is not implemented yet)

wav.Encode(fw, fmt2, beep.Resample(source, fmt1.SampleRate, fmt2.SampleRate))

beep.SampleRate only takes place when it's important, and that will be documented.

@egonelbre

@faiface Some WAVE samples here: https://en.wikipedia.org/wiki/WAV#WAV_file_audio_coding_formats_compared. And a list of things libsndfile supports: http://www.mega-nerd.com/libsndfile/. Although you can get pretty far by just supporting PCM u8, s16, s24, s32, s64 and IEEE_FLOAT 32, 64, mulaw and alaw.

@faiface

faiface commented Jul 15, 2017

@egonelbre With WAVE, only PCM support is planned so far, as PCM seems to be close to 100% of the existing usage. Currently supported are u8 and s16, but float32 support will be added shortly. I believe these cover an overwhelming majority of what's out there.

@wsc1

wsc1 commented Oct 4, 2018

@faiface Audacity uses float32 WAV.

@wsc1

wsc1 commented Oct 4, 2018

Hi all,

There is an issue about preemption in the Go runtime which can influence the reliability of audio I/O.

Also, zc/sio is a work in progress to deal with audio callback APIs like CoreAudio, AAudio, and JACK, which operate on a foreign, usually real-time, thread.
The goal is to make the path to the callback free of syscalls (cgo isn't).
