First steps on the road to Holofunkiness
Dang, it’s been quiet around here lately. Too quiet. One might think I had no intention of ever blogging again. Fortunately for us all, the worm has turned and it’s time to up the stakes considerably, as follows:
I mentioned in a blog post some time ago that I had a pet hacking concept called Holofunk. That’s what I’ll mostly be blogging about for the rest of the year.
There has been a lot of competition for my time — I’ve got two awesome kids, three and six, which is an explanation right there; and I spent the first half of the year working on a hush-hush side project with my mentor. Now that project has wound down and Holofunk’s time has finally come.
One thing I know about my blogging style is that it works much better if I blog about a project I’m actively working on. Back in the day (e.g. 2007, still the high point for blog volume here), I was contributing to the GWT open source project, and posting like mad. Since joining Microsoft in 2008, though, I’ve done no open source hacking to speak of. That’s about to change.
Holofunk is a return to the days of public code, since I’ll be licensing the whole thing under the Microsoft Public License (that being the friendliest one legally, as well as quite compatible with my goals here). So now I can hack and talk about it again, and that’s what I intend to do. The rest of 2011 is my timeframe for delivering a reasonably credible version of Holofunk 1.0. Feel free to hassle me about it if I slack off! It never hurts motivation to have people interested.
So What The Pants Is Holofunk Anyway?
My post from last year gave it a good shot, but I think some videos will help a great deal to explain what the heck I’m thinking here. Plus it livens up this hitherto pure wall-of-text blog considerably.
First, a video from Beardyman, who is basically my muse on this project. This video is him performing live, recording himself and self-looping with two Korg KAOSS pads, while being recorded from multiple cameras. The audio is all done live. Then a friend of his edited the video (only) such that the multiple overlaid video images parallel the audio looping that he’s doing. In other words, the pictures reflect the sounds. Check it:
OK. So that’s “live looping” — looping yourself as you sing. (Beardyman is possibly the best beatboxer in the world, so he’s got a massive advantage in this artform, but hey, amateurs can play too!)
Now. Here’s a totally sweet video of a dude who’s done a whole big bunch of gesture recognition as a frontend to Ableton Live, which is pretty much the #1 electronic music software product out there:
You can see plenty of other people are all over this general “gestural performance” space! In fact, given my limited hacking bandwidth, it’s entirely possible someone else will develop something almost exactly like what I have in mind and totally beat me to it. That would be fine — if I can play with their thing, then great! But working on it myself has already been very educational and promises to get much more so.
Here’s one more Kinect-controlled Ableton phenomenon. This one is a lot more ambient in nature, and this guy is even using a Wiimote as well. He includes views of the Ableton interface:
So those are some of my inspirations here.
My concept for Holofunk, in a nutshell, is this: use a Kinect and a Wiimote to allow Beardyman-like live looping of your own singing/beatboxing, with a gestural UI to actually grab and manipulate the sounds you’ve just recorded. Imagine that dude in the second video had a microphone and was singing and recording himself even while he was dancing, and that his gestures let him manipulate the sounds he’d just made, potentially sounding a lot like that Beardyman video. That’s the idea: direct Kinect/Wiimote manipulation of the sounds and loops you’re making in realtime. If it still makes no sense, well, thanks for making the effort, and hopefully I’ll have some videos once I have something working!
Ideas Are Cheap, Champ
One thing I’ve deeply learned since starting at Microsoft is that big ideas are a dime a dozen, and without execution you’re just a bag of hot wind. So by brainstorming in public like this I run a dire risk of sounding like (or actually being) a mere poser. Let me first make very clear that all the projects above, that already actually work, are awesome and inspiring, and that I will be lucky if I can make anything half as cool as any one of them.
That said, I am going to soldier on with sharing my handwavey concepts and preliminary investigations, since it’s what I got so far. By critiquing these other projects in the context of mine, I’m only trying to be clear about what I’m thinking; I’m not claiming to have a “better idea”, just (what I think is) a different idea. And as I said, everyone else is free to jump on this concept; this is open source brainstorming right here!
The general thing I want to have, that none of the projects above have quite nailed, is a clear relationship between your gestures, your singing, the overall sound space, and the visuals. I want Holofunk to make visual and tangible sense. Loops should be separately grabbable and manipulable objects that pulse in rhythm with the system’s “metronome”, and that have colors based on their realtime frequency. (So a bass line would be a throbbing red circle and a hi-hat would be a pulsing blue ring.) It should be possible for people watching to see the sounds you are making, as you make them, and to follow what you’re doing as you add new loops and tweak existing ones. This “visual approachability” goal will hopefully also make it much easier to actually use Holofunk, not just watch it.
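The frequency-to-color idea can be sketched in a few lines. This is a minimal illustration in Python (chosen for brevity; the project itself is planned in C#), and everything in it is an assumption rather than a settled design: the `frequency_to_rgb` name, the 20 Hz to 20 kHz range, and the choice of a logarithmic scale that runs from red at the bottom to violet at the top.

```python
import colorsys
import math

LOW_HZ = 20.0      # assumed lower edge of the audible range
HIGH_HZ = 20000.0  # assumed upper edge of the audible range
MAX_HUE = 0.75     # on the HSV wheel, hue 0.0 is red and 0.75 is violet


def frequency_to_rgb(freq_hz):
    """Map a dominant frequency to an (r, g, b) tuple of floats in 0..1."""
    # Clamp so out-of-range input still produces a valid color.
    clamped = min(max(freq_hz, LOW_HZ), HIGH_HZ)
    # Log scale, so each octave covers an equal slice of the spectrum.
    position = math.log(clamped / LOW_HZ) / math.log(HIGH_HZ / LOW_HZ)
    return colorsys.hsv_to_rgb(position * MAX_HUE, 1.0, 1.0)


bass = frequency_to_rgb(60.0)      # reddish: red channel dominates
hi_hat = frequency_to_rgb(8000.0)  # bluish: blue channel dominates
```

A real version would need to pick the "dominant" frequency of a loop from its spectrum first; that step is left out here.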
For an example of how this kind of thing can go off the rails, check out this video of Pixeljunk Lifelike, from a press demo at the E3 video gaming conference:
This is cool, but too abstract, as this review of the demo makes clear:
Then a man got up and began waving a Move controller, and we heard sounds. The screen showed a slowly moving kaleidoscope. I couldn’t tell how his movements impacted the music I was hearing or the images I was seeing. This went on for over 20 minutes and it felt like a lifetime.
Beardyman is also notoriously challenged to communicate what the hell he is actually doing on stage. He admits as much in this clip from him performing on Conan O’Brien (at 1:20):
My ultimate dream for Holofunk is to make it so awesomely tight that Beardyman himself could perform with it and people could more easily understand what the hell is going on as his piece trips them out visually as well as aurally. That’s the ultimate goal here: make the audible visible, and even tangible. Holofunk.
(Now, realistically there’s no way Beardyman would actually do better with a single Wiimote than with four full KAOSS pads — he’s just got a lot more control power there. Still, let’s call it an aspirational goal.)
Ableton Might Not Cut It
I knew jack about realtime audio processing when I started researching all this last year. I actually started out by getting a copy of Ableton Live myself, since I figured that it already did all the sound processing I could possibly want, and more. People hacking it with Kinect are all over the net, too, and it’s got a very flexible external API. I fooled around with it at home, recording some tracks myself.
But the more I played with it, the more I started questioning whether it would ultimately be the right thing. Ableton was originally engineered on the “virtual synthesizer & patch kit” paradigm. It’s a track-based, instrument-based application, in which you assemble a project from loops and effects that are laid out like pluggable gadgets.
The problem is that the kind of live looping I have in mind for this project is going to have to be very fluid. Starting a new track could happen at the click of a button. Adding effects and warps is going to be very dynamic. Literally every Ableton-based performance I have seen is structured around creating a set of tracks and effects, and then manipulating the parameters of that set in realtime. Putting Kinect on top of Ableton seems to basically turn your body into a very flexible twiddler of the various knobs built into your Ableton set. The “Kin Hackt” video above shows the Ableton UI “under the hood”, but even the much more dynamic and involving “dancing DJ” above is still fundamentally manipulating a pre-recorded set of tracks (though he’s recording and looping his gestural manipulations of those tracks).
I was pretty sure that while I could get a long way with Ableton, I’d ultimately hit a wall when it came to really getting to slice up a realtime microphone track into a million little loops. So I was finding myself itching to just start writing some code, building callbacks, handling fast Fourier transforms, and just generally getting my hands directly on the samples and controlling all the audio myself. Perhaps it’s just programmer hubris, but I ultimately decided it was too risky to climb the full Ableton/Live/MAX learning curve only to perhaps finally discover it wouldn’t be flexible enough.
The second video above calls itself “live looping with Kinect and Ableton Live 8”, and it is live looping in that he’s obviously recording his own movements, such that the gestures he makes shape one of the tracks in his Ableton set, and he then loops the shaped track. Perhaps it would be trivial to add a microphone to the experience and loop a realtime-recorded track. Looks like I’ll be looking that dude up! But on my current path I’ll be building the sound processing in C# directly.
Latency Is Death: The Path To ASIO
When first firing up Ableton, with an M-Audio Fast Track Pro USB interface, I found things laggy. I would sing or beatbox into the microphone, and I would hear it back from Ableton after a noticeable delay. Just as a long-distance phone call can lead to people tripping over each other, even small amounts of latency are seriously annoying for music-making.
So latency is death. It turns out that Windows’ own sound APIs are not engineered for low latency, as they have a lot of intermediate buffering. The most common solution out there is ASIO, a sound standard from steinberg.net. There is a project named ASIO4ALL which puts out what amounts to a universal USB ASIO driver, enabling you to get low-latency sound input from USB devices generally. Installing ASIO4ALL immediately fixed the latency issues with Ableton. So it’s clear that given that I’m developing on Windows, ASIO is the way to go for low-latency sound input and output.
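To see why buffering matters so much: a buffer has to fill before it can be handed along, so each one adds (frames in the buffer) / (sample rate) seconds of delay, and stacked buffers add up. Here is a rough worked example in Python. The numbers are purely illustrative assumptions, not measurements of Windows, ASIO4ALL, or any particular driver, and `buffer_latency_ms` is a made-up helper name.

```python
def buffer_latency_ms(frames_per_buffer, sample_rate_hz, buffers_in_chain=1):
    """Delay in milliseconds added by a chain of equally sized audio buffers."""
    return 1000.0 * frames_per_buffer * buffers_in_chain / sample_rate_hz


# One small buffer at CD sample rate: roughly 12 ms.
small = buffer_latency_ms(512, 44100)

# Three large buffers stacked in a row (hypothetical): roughly 279 ms,
# which is well into "long-distance phone call" territory.
stacked = buffer_latency_ms(4096, 44100, buffers_in_chain=3)
```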
On the latency front, it’s also worth mentioning this awesome article on latency reduction from Gamasutra. I will be following that advice to a T.
.NET? Are you crazy?
I’m going to be writing this thing in C# on Windows and .NET. The most obvious reason for this is I work for Microsoft and like Microsoft products. The less obvious reason is that I find C# a real pleasure to program in, and very efficient when used properly.
My boss is fond of pointing out that pointers are essentially death to performance, in that object references generally directly imply garbage collector pressure and cache thrashing, both of which are terrible. But in C#, with struct types, you can represent things much more tightly if you want. You can also avoid famous problems like allocating lambdas in hot paths.
In the particular case of Holofunk, the most critical thing to get right is the buffer management. I will need to make sure I know how much memory fragmentation I’m getting and how many buffers ahead I should allocate. My hunch is I’ll wind up allocating in 1MB chunks from .NET, and having a sub-allocator chop those up into smaller buffers I can reference with some BufferRef struct.
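Here is a minimal sketch of that chunk-plus-sub-allocator hunch. It is written in Python for brevity, so it only shows the bookkeeping; the real version would be C#, with `BufferRef` as a struct. The `ChunkAllocator` class and the fields on `BufferRef` are hypothetical, and this simple "bump" scheme never frees anything, which a real implementation would have to address.

```python
from collections import namedtuple

CHUNK_BYTES = 1024 * 1024  # the 1MB chunk size guessed at above

# Hypothetical stand-in for the BufferRef struct: it holds no audio data,
# only the coordinates of a slice inside one of the big chunks.
BufferRef = namedtuple("BufferRef", ["chunk_index", "offset", "length"])


class ChunkAllocator:
    """Carves small buffers sequentially out of large preallocated chunks."""

    def __init__(self, chunk_bytes=CHUNK_BYTES):
        self.chunk_bytes = chunk_bytes
        self.chunks = [bytearray(chunk_bytes)]
        self.next_offset = 0  # first free byte in the newest chunk

    def allocate(self, length):
        if length > self.chunk_bytes:
            raise ValueError("buffer larger than a chunk")
        if self.next_offset + length > self.chunk_bytes:
            # Not enough room left: start a fresh chunk. The unused tail of
            # the old chunk is the fragmentation cost of this scheme.
            self.chunks.append(bytearray(self.chunk_bytes))
            self.next_offset = 0
        ref = BufferRef(len(self.chunks) - 1, self.next_offset, length)
        self.next_offset += length
        return ref

    def view(self, ref):
        """Writable window onto the bytes a BufferRef points at (no copy)."""
        chunk = self.chunks[ref.chunk_index]
        return memoryview(chunk)[ref.offset:ref.offset + ref.length]
```

The point of the design is that the garbage collector only ever sees a handful of big arrays, while the hot path passes around small value-like references.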
Anyway the point is that I know there are performance ratholes in .NET, but my day job has given me extensive experience at perf tuning C# programs generally, so I am not too concerned about it right now.
And, of course, Microsoft tools are pretty darn good compared to some of the competition. Holofunk will be an XNA app for Windows, giving me pretty much the run of the machine with a straightforward graphics API that can scale up as far as I’m likely to need. I’ve taken the classic “adapt the sample” approach to getting my XNA project off the ground, and I’m developing some minimal retained scene graph and state machine libraries.
What about Kinect?
Microsoft just released the Windows Kinect SDK beta, which is dead simple to use — maybe a page of code to get full skeletal data at 15 to 20 frames per second in C# (on my Core 2 Quad Q9300 PC from three years ago). So that’s the plan there.
It doesn’t support partial skeletal tracking, or hand recognition, or a variety of other things, and it has a fairly restrictive noncommercial license. But none of those are at all showstoppers for me, and the simplicity and out-of-the-box it-just-works factor are high enough to get me on board.
Why a Wiimote? And how?
I’ve mentioned “Wiimote” a few times. The main reason is simple: low-latency gesturing.
It’s no secret that Kinect has substantial latency — at least a tenth of a second or so, and probably more. What is latency? Death. So having Kinect be the only gestural input seems doomed to serious input lag for a music-making system. Moreover, finger recognition for Kinect is not available with the Microsoft SDK. I could be using one of the other open source robot-vision-based Kinect SDKs (there’s one from MIT that can do finger pose recognition), but that would still have large latency, and would require the Kinect to be closer to the user. I want this to be an arm-sweeping interface that you use while standing and dancing, not a shoulders-up interface that you have to remain mostly still to use.
I can’t see how to do a low-latency direct manipulation interface without some kind of low-latency clicking ability. That’s what the Wiimote provides: the ability to grab (with the trigger) and click (with the thumb), and a bunch of other button options thrown into the bargain.
A sketch of the interaction design (I am not an interaction designer, can you tell?) is something like this:
- Initial screen: a white sphere in the center of a black field, overlaid with a simple line drawing of your skeleton. Hands are circles.
- Sing into microphone: sphere changes colors as you sing.
  - The central sphere represents the sound coming from your microphone.
  - (First color scheme to try: map frequencies to color spectrum, and map animated spectrum to circle, with red/low in center and violet/high around rim.)
- Reach out at screen with Wiimote hand: see skeleton track.
- Move Wiimote hand over white sphere: hand circle glows, white sphere glows.
- Pull Wiimote trigger: white sphere clones itself; cloned sphere sticks to Wiimote hand.
  - The cloned sphere is a loop which you are now recording.
- Sing into microphone while holding trigger: cloned sphere and central sphere both color-animate the sound.
- Release Wiimote trigger: cloned sphere detaches from Wiimote hand and starts looping.
  - Letting go of the trigger ends the loop and starts it playing by itself. The new sphere is now an independent track floating in space, represented by an animated rainbow circle.
That’s the core interaction. And the key is that the system has to respond quickly to trigger presses. You really want to be able to flick the trigger quickly to make separate consecutive loops, and less latency in that critical gesture is going to make life much simpler.
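The trigger logic above boils down to a tiny state machine: pulling the trigger over the microphone sphere starts a recording, and releasing it turns that recording into a loop. A minimal sketch in Python follows (the project itself will be C#). The `LoopController` and `Loopie` classes and their method names are hypothetical, and there is no real audio or Wiimote input here, just the state transitions.

```python
class Loopie:
    """One recorded loop: its samples plus whether it is still being recorded."""

    def __init__(self):
        self.samples = []
        self.state = "recording"


class LoopController:
    def __init__(self):
        self.held = None    # the loopie stuck to the Wiimote hand, if any
        self.loopies = []   # finished loops, now playing on their own

    def on_trigger(self, pressed, hand_over_mic_sphere):
        """Call whenever the trigger changes state."""
        if pressed and self.held is None and hand_over_mic_sphere:
            # Trigger pulled over the central sphere: clone it and record.
            self.held = Loopie()
        elif not pressed and self.held is not None:
            # Trigger released: the clone detaches and starts looping.
            self.held.state = "looping"
            self.loopies.append(self.held)
            self.held = None

    def on_audio(self, samples):
        """Call for each block of microphone samples."""
        if self.held is not None:
            self.held.samples.extend(samples)


controller = LoopController()
controller.on_trigger(pressed=True, hand_over_mic_sphere=True)
controller.on_audio([0.1, 0.2])
controller.on_trigger(pressed=False, hand_over_mic_sphere=False)
```

Because the transitions only depend on the trigger events, quick flicks produce separate consecutive loops as long as those events arrive promptly.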
So a Wiimote it is. Fortunately there is a .NET library for connecting a Wiimote to a PC via Bluetooth. It was written by the redoubtable Brian Peek, who, as it happens, also worked on some of the samples in the Windows Kinect SDK. This project would not be nearly as feasible without his libraries! I got a Rocketfish Micro Bluetooth Adapter at Best Buy, and the thing is shockingly tiny. With a bit of finagling (it seems to need me to reconnect the Wiimote from scratch on each boot), I was able to rope it into my XNA testbed.
You don’t really want to write a whole DSP library from scratch, do you?
Good God, no. Without Ableton Live, I need something to handle the audio. It has to play well with C#, and with ASIO. After a lot of looking around, multiple parties wound up recommending the BASS audio toolkit.
In my fairly minimal experimentation to date, BASS has Just Worked. It was able to connect to the ASIO4ALL driver and get sound from my microphone with low latency, while linked into my XNA app. So far it’s been very straightforward, and it looks like the right level of API, where I can manage my own buffering and let the library call me whenever I need to do something. It also supports all the audio effects I’m likely to need, and — should I want to actually include prerecorded samples — it can handle track acquisition from anywhere.
It also has a non-commercial license, but again, that’s fine for this project.
The Fun Begins… Now
So… that’s what I have. I feel like a model builder with parts from a new kit spread out all over the floor, and only a couple of the first pieces glued together. But I’m confident I have all the pieces.
Another thing I want to get right is recording: Holofunk should capture its performances so you can play them back. This means not only the sounds, but the visuals. So I need an architecture that supports both free-form direct manipulation, and careful time-accurate recording of both the visuals and the sounds.
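One possible way to keep visuals and sound on the same timeline is to stamp every recorded position with the running audio sample count, so playback can ask "where was this object at sample N?". The sketch below is in Python for brevity and is only an assumption about how this might work: the `PositionTrack` class, its methods, and the choice of the audio sample count as the shared clock are all hypothetical.

```python
import bisect


class PositionTrack:
    """Positions of one on-screen object, stamped with audio sample counts."""

    def __init__(self):
        self.times = []      # audio sample counts, in non-decreasing order
        self.positions = []  # position recorded at the matching time

    def record(self, sample_time, position):
        if self.times and sample_time < self.times[-1]:
            raise ValueError("positions must be recorded in time order")
        self.times.append(sample_time)
        self.positions.append(position)

    def position_at(self, sample_time):
        """Most recent position at or before sample_time, or None if too early."""
        index = bisect.bisect_right(self.times, sample_time) - 1
        return self.positions[index] if index >= 0 else None


track = PositionTrack()
track.record(0, (0.0, 0.0))
track.record(441, (1.0, 2.0))   # 441 samples is 10 ms at 44.1 kHz
track.record(882, (3.0, 4.0))
where = track.position_at(500)  # falls between the second and third entries
```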
Over the next six months I will be steadily chipping away at this thing. Here’s a rough order of business:
- Get Kinect skeleton data into my XNA app
- Render minimal skeleton via scene graph based on Kinect data
- Integrate Wiimote data to allow hand gesturing
- Define “sound sphere” class (I think I might call them “loopies”)
- Support grabbing, manipulating loopies (interaction / graphics only, no sound yet)
- Performance recording:
  - Define core buffer management
  - Implement microphone recording
  - Implement buffer splitting from microphone recording
  - Define “Performance” class representing an evolving performance
  - Define recording mechanism for streams of positional data (to record positions of Loopies)
- Holofunk comes to life
  - Couple direct manipulation UI to recording infrastructure
  - Result: can grab to make a new loopie, can let it go to start it playing
If I can get to that point by the end of the year, I’ll be happy. If I can get further, I’ll be very happy. Further means:
- Ability to click loopies to select them
- Press on loopies to move them around spatially
- Some other gesture (Wii cross pad?) to apply an effect to a loopie
- Push up and wave your Wiimote arm, and it bends pitch up and down
- Push right, and it applies a frequency filter, banded by your arm position (dubstep heaven)
- Push down, and it lets you scratch back and forth in time (latency may be too high for this though)
- Hold the trigger while doing such gestures, and the effect gets recorded
- This lets you record effects on existing loopies
- Segment the screen into quarters; provide affordances for muting/unmuting a quarter of the screen, merging all loopies in that quarter, etc.
- This would let you do group operations on sets of sounds
AND THEN SOME. The possibilities are pretty clearly limitless.
My most sappily optimistic ambition here is that this all becomes a new performance medium, a new way of making music, and that many people find it approachable and enjoyable. Let’s see what happens. Thanks for reading… and stay tuned!