Towards (even more) practical Faust

I’ve recently completed my final project for the MUSIC 420B class at Stanford University. Before taking this class I attended a workshop at CCRMA given by Yann Orlarey, the author of the Faust musical signal processing language. I was fascinated by the ease and speed with which one could create effects and sound synthesizers using Faust. Having also missed composing electronic music, I took the class because it offered a chance to play more with Faust and music production software. My initial goal was to demonstrate practical usage of Faust for music production in combination with software like Logic Pro or Ableton Live, and to write a whole piece using Faust-generated sounds. VST was, in my opinion, the way to go because the tools I wanted to use on the Mac supported it. Although it was possible to create VSTi plug-ins with Faust, they lacked some features that are expected from most synthesizers, polyphony being the most noticeable. I decided to take this on and fill the gap. The DSSI plug-in architecture supported by Faust already handled polyphony, so I could get an idea of how to implement it for VSTi plug-ins.

The following diagram describes the general design of the VSTi architecture. The VST host interacts with the plug-in through the AudioEffectEx interface. The Faust class implements this interface and supports polyphony by using multiple instances of the Voice class. Each Voice instance contains an instance of the mydsp class, which is produced by the Faust compiler and implements the signal processing and synthesis part.

Faust VSTi architecture design

Support for portamento slide was added by storing the last played voice in a dedicated member of the Faust class. In addition, the architecture recognizes the “pitchbend” control as one that has to be updated according to MIDI pitch-bend events. The following short loop demonstrates how a whole piece can be produced using Faust instruments. All the instruments are Faust VSTi-s except for the percussion, which wasn’t fully implemented yet:

That’s how the bastard-synth VST used in this loop looks in MuTools’ MULab. Here we can see the controls recognized by the Faust architecture: freq, gain, gate, pitchbend and prevfreq.
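To make the voice handling more concrete, here is a minimal Python sketch of how a polyphonic wrapper might hand MIDI notes to voices and feed the freq, gain, gate and prevfreq controls. The real architecture is written in C++ and wraps Faust-generated mydsp objects; the class names, the voice-stealing rule and the velocity-to-gain mapping below are assumptions made for illustration only.

```python
def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


class Voice:
    """Stand-in for one voice; the real Voice wraps a Faust-generated mydsp."""

    def __init__(self):
        self.note = None      # MIDI note currently held, None when the voice is free
        self.freq = 0.0       # "freq" control
        self.gain = 0.0       # "gain" control
        self.gate = 0         # "gate" control: 1 while the key is held
        self.prevfreq = 0.0   # "prevfreq" control, used for portamento


class PolySynth:
    """Hypothetical polyphonic wrapper that distributes notes over voices."""

    def __init__(self, num_voices=4):
        self.voices = [Voice() for _ in range(num_voices)]
        self.last_voice = None  # last played voice, remembered for portamento

    def note_on(self, note, velocity):
        # Take the first free voice; steal voice 0 if all are busy (assumption).
        voice = next((v for v in self.voices if v.note is None), self.voices[0])
        freq = midi_to_hz(note)
        # Slide from the previously played frequency, if there was one.
        voice.prevfreq = self.last_voice.freq if self.last_voice else freq
        voice.note, voice.freq = note, freq
        voice.gain = velocity / 127.0
        voice.gate = 1
        self.last_voice = voice
        return voice

    def note_off(self, note):
        # Release every voice that is holding this note.
        for voice in self.voices:
            if voice.note == note:
                voice.gate = 0
                voice.note = None
```

Playing A4 and then A5 on this sketch puts the two notes on different voices, and the second voice starts with prevfreq set to 440 Hz, which is the information a portamento slide needs.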

A more detailed project summary is in the following paper: http://stanford.edu/~yanm2/files/mus420b.pdf. The project code is part of the Faust source and can be checked out from http://git.code.sf.net/p/faudiostream/code.

Matrix poetry

Beyond corn fields and golden rye
Where sun goes down and sirens cry
Illusion overtakes the mind
It’s nearly positive, or so it seems
But eigenvalues do not lie!

LED T-Shirt

During the last couple of months I’ve been working on a fun side-project with my friend Shlomoh Oseary. For a long time I wanted to make a T-shirt with an equalizer display on it that will light up in correspondence with surrounding sounds and music, and once I had a buddy excited about this idea too we started working.

We decided to use dedicated e-textile components. The Arduino Lilypad with its 8 MHz ATmega processor seemed suitable for the task. Next we had to figure out how we would drive the LEDs. The naive approach of connecting each LED to ground and to one of the Lilypad’s outputs would limit the number of LEDs to the number of available outputs. After searching a bit we found that what we wanted was an LED matrix. The principle of an LED matrix is that all the LEDs in the same row or column are connected. In our case the minus legs of all the LEDs in the same column are shorted together, and the plus legs of all the LEDs in the same row are shorted together. To light up an LED we feed positive voltage to its row and short its column to ground. To light up multiple LEDs, our LED matrix driver code loops over all the rows and columns and repeatedly lights each LED that should be on for a fraction of a second, which creates the impression that those LEDs are constantly on.
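The scanning idea can be simulated in a few lines of Python. This is only a sketch of the principle, not the Arduino driver from the project: the 3x4 matrix size and the function names are made up for the example, and the pin states are plain booleans instead of real outputs.

```python
ROWS, COLS = 3, 4  # hypothetical matrix size, not the shirt's real dimensions


def scan_once(frame):
    """Yield the pin states for one pass over the matrix.

    frame[r][c] is True when the LED at row r, column c should appear lit.
    Each yielded pair is (row_pins, col_pins): row_pins[r] is True when row r
    is driven high, col_pins[c] is True when column c is shorted to ground.
    Only one LED is driven at any moment.
    """
    for r in range(ROWS):
        for c in range(COLS):
            if frame[r][c]:
                row_pins = [i == r for i in range(ROWS)]
                col_pins = [j == c for j in range(COLS)]
                yield row_pins, col_pins


def lit_leds(row_pins, col_pins):
    """Return the LEDs that conduct for a given combination of pin states."""
    return [(r, c) for r in range(ROWS) for c in range(COLS)
            if row_pins[r] and col_pins[c]]


# Two LEDs that should look permanently on: (row 0, col 1) and (row 2, col 3).
frame = [[False, True, False, False],
         [False, False, False, False],
         [False, False, False, True]]
steps = list(scan_once(frame))
```

With this wiring a 3x4 matrix needs ROWS + COLS = 7 pins instead of the 12 that one pin per LED would take, and repeating scan_once fast enough makes both LEDs appear steadily lit.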

Testing the microphone and the FFT calculation

Each column of the LED matrix represents a frequency range, with lower frequencies on the right. The more energy is sensed in a certain range, the more LEDs in its column are turned on. To find the energy for each frequency range we used an FFT over a window of 128 samples. The sampling frequency was chosen to be 4000 Hz, which according to the Nyquist theorem covers tones up to 2000 Hz. A predefined threshold (which we still need to calibrate) is subtracted from the calculated energy to filter out small fluctuations, and the outcome is mapped to the number of rows of the LED matrix to represent an energy level.
We used an existing FFT implementation for Arduino from http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1286718155.
There is still a final touch missing from the algorithm: applying a low-pass filter to remove frequencies higher than 2000 Hz from the recorded signal prior to the FFT calculation.
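The mapping from a window of samples to column heights can be sketched in plain Python. This is an illustration under assumptions, not the code running on the Lilypad: it uses a slow textbook DFT instead of the Arduino FFT library, and the matrix size (8 columns, 5 rows), the threshold and the full-scale energy are invented values. The sampling rate and window length are the ones from the post, which give a resolution of 4000 / 128 = 31.25 Hz per bin.

```python
import cmath
import math

SAMPLE_RATE = 4000   # Hz, from the post
WINDOW = 128         # samples per FFT window, from the post
NUM_COLS = 8         # hypothetical number of LED columns
NUM_ROWS = 5         # hypothetical number of LED rows
THRESHOLD = 10.0     # hypothetical noise threshold, would need calibration
MAX_ENERGY = 100.0   # hypothetical energy that lights a whole column


def dft_magnitudes(samples):
    """Magnitudes of the DFT bins below the Nyquist frequency (naive O(n^2) DFT)."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n // 2)]


def column_levels(samples):
    """Map one window of samples to the number of lit LEDs per column."""
    mags = dft_magnitudes(samples)
    bins_per_col = len(mags) // NUM_COLS
    levels = []
    for col in range(NUM_COLS):
        band = mags[col * bins_per_col:(col + 1) * bins_per_col]
        energy = sum(m * m for m in band) / len(band)
        energy = max(0.0, energy - THRESHOLD)          # drop small fluctuations
        levels.append(min(NUM_ROWS, int(energy / MAX_ENERGY * NUM_ROWS)))
    return levels


# A 500 Hz test tone falls exactly on bin 500 / 31.25 = 16.
tone = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(WINDOW)]
levels = column_levels(tone)
```

Here column 0 holds the lowest frequencies, so a display with the low frequencies on the right would draw the list in reverse order.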

Connecting the electret microphone and the power supply to the Lilypad.

LED T-shirt @ work

When beauty and electronics meet… (Julia Shteingart modeling)

Code

The project’s code (except for the FFT implementation, which can be downloaded using the link above, and the TimerOne library, which can be downloaded from the Arduino site) is available through SVN under

https://bitbucket.org/ymcrcat/led-t-shirt/

Credits

To Shlomoh’s mom for sewing.

3D on Android (Geekcon 2011)

I’ve recently participated in Geekcon 2011. It is similar to Garage Geeks, except that people actually build stuff during the two and a half days they stay there.
My friends Ofer Gadish and Gil Shai from Acceloweb and I worked on displaying stereoscopic 3-D images on an Android device. Those were exciting three days of sleeping very little, coding a lot, soldering, drinking beer and having a lot of fun.

When we initially discussed the idea we thought about using 3-D glasses controlled by Bluetooth. We realized, however, that in the short time we had we would probably not be able to study the control protocol and also figure out how to directly control the Bluetooth transmitter of the mobile device, if it is at all possible to do so at such a low level from a user application.

Instead we chose to control the glasses through the audio jack output of the mobile phone. We found another pair of glasses, controlled using quite an old VESA standard. The glasses come with a 3-pin mini-DIN receptacle. The idea is very simple: high voltage means logical “1”, which opens the left eye, and low voltage means logical “0”, which opens the right eye.

To supply ground, +5V and an accurate square-wave synchronization signal to the glasses, we did some soldering and connected the mini-DIN to an Arduino, which in turn received the output from the mobile’s audio jack.
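As a rough illustration of what the phone has to send, the following Python sketch generates audio samples for such a synchronization signal: positive samples while the left eye should be open, negative samples while the right eye should be open. The function name, the 44100 Hz sample rate and the 16-bit amplitude are assumptions; the default of 60 eye switches per second is the switching rate this post aims for. The actual Android code and the Arduino sketch are not reproduced here.

```python
def sync_samples(num_samples, sample_rate=44100, switch_rate_hz=60, amplitude=32767):
    """Return square-wave samples that alternate between the two eyes.

    Each eye gets a slot of sample_rate / switch_rate_hz samples. Even slots
    are positive (left eye open), odd slots are negative (right eye open).
    """
    samples = []
    for n in range(num_samples):
        eye_slot = (n * switch_rate_hz) // sample_rate  # index of the current slot
        samples.append(amplitude if eye_slot % 2 == 0 else -amplitude)
    return samples


# Two slots: 735 samples for the left eye followed by 735 for the right eye.
one_cycle = sync_samples(1470)
```

A signal like this can only tell the receiver when to switch. In the setup described above, turning it into clean logic levels for the glasses is the job of the Arduino.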

The Android software was a bit of a mess. Reaching a switching rate of 60 Hz wasn’t simple given the slow performance, which I don’t know exactly what to attribute to: either a slow refresh rate of the display or the technique we used to draw the images (although we accessed the canvas directly, bypassing any higher-level APIs for displaying a picture). On Saturday afternoon we had it running, with some glitches occurring every couple of seconds, but giving some feeling of 3-D depth. Or was it our exhausted imagination after not sleeping too much during this crazy and awesome weekend?

Eusipco 2011

At the end of August I attended the Eusipco 2011 conference in Barcelona, Spain. I presented my work on speaker identification using diffusion maps, a manifold learning and dimensionality reduction method developed in recent years by Ronald Coifman and Stephane Lafon. The paper can be found here: “Speaker Identification Using Diffusion Maps”.
In this paper we propose a data-driven approach for speaker identification without assuming any particular speaker model. The goal of the speaker identification task is to determine which one of a group of known speakers best matches a given voice sample. Here we focus on text-independent speaker identification, i.e. no assumption is made regarding the spoken text. Our approach is based on a recently developed manifold learning technique named diffusion maps. Diffusion maps enable embedding of the recording into a new space, which is likely to capture the intrinsic structure of the speech. The algorithm was tested and compared to common identification algorithms, and our experiments showed that the proposed algorithm obtains improved results when few labeled samples are available.

Playlists generation for AudioGalaxy

I use Audiogalaxy and thought it would be nice to create M3U playlists for all my music folders. The result is this simple Python script:

import os

AUDIOGALAXY_PLAYLIST_DIR = r"C:\Users\yan\Music\Audiogalaxy Playlists"
MUSIC_ROOT_FOLDER = "D:\\Music\\"
MUSIC_EXTENSIONS = [".mp3"]
PLAYLIST_EXT = ".m3u"


class Playlist:
    def __init__(self, dir):
        # Name the playlist after the folder's path relative to the music root,
        # with path separators replaced by underscores.
        rel = os.path.relpath(dir, MUSIC_ROOT_FOLDER)
        if rel == os.curdir:
            self.__name = "Songs in root"
        else:
            self.__name = rel.replace(os.path.sep, "_")
        # Collect the music files (not subfolders) that sit directly in this folder.
        self.__songs = []
        for file in os.listdir(dir):
            fullname = os.path.join(dir, file)
            if (os.path.splitext(file)[1].lower() in MUSIC_EXTENSIONS
                    and not os.path.isdir(fullname)):
                self.__songs.append(fullname)

    def printme(self):
        print("%s - %d songs" % (self.__name, self.countSongs()))
        #for song in self.__songs:
        #    print("\t%s" % (song,))

    def countSongs(self):
        return len(self.__songs)

    def save(self, dir):
        # Folders without songs get no playlist.
        if 0 == self.countSongs():
            return
        with open(os.path.join(dir, self.__name + PLAYLIST_EXT), "w") as file:
            file.write("#EXTM3U\n")
            for song in self.__songs:
                file.write(song + "\n")


def handle_dir(target_dir, dirname, fnames):
    p = Playlist(dirname)
    p.printme()
    p.save(target_dir)


def main():
    # os.walk is used because os.path.walk no longer exists in Python 3.
    for dirname, _subdirs, fnames in os.walk(MUSIC_ROOT_FOLDER):
        handle_dir(AUDIOGALAXY_PLAYLIST_DIR, dirname, fnames)


if __name__ == '__main__':
    main()


Just set MUSIC_ROOT_FOLDER to your music files location and AUDIOGALAXY_PLAYLIST_DIR accordingly and run the script.

Pango parking Android application

I’ve published my first Android application! Well, no need to get too excited… It’s a simple proxy for the Pango cellular parking service. The application uses SMS commands supported by Pango to activate and deactivate parking. For now, only the default parking city and area are supported.
I’m planning to add a combo box for choosing the city and parking area, and later perhaps add support for automatic location detection using GPS.

Enjoy.

https://market.android.com/details?id=com.pango.mobile

VoiceBrowsing Toolbar for IE

Long after writing the first version of the VoiceBrowsing Toolbar for Internet Explorer, it’s about time to mention it here. Surprised at not finding a comfortable browser plugin that enables navigation using voice commands, I decided to implement one. To make things simple I relied on the Microsoft Speech API, accessible in Windows Vista and above through the .NET framework.
I started by writing a .NET voice recognition engine that provides a registration and notification interface. Given a dictionary of phrases to be recognized, it invokes a callback function provided by the user. A small Internet Explorer toolbar makes it usable for navigating to websites or performing common operations such as browsing back, forward or going to the homepage. The plugin is configurable and lets the user specify keywords (trigger phrases) and the URL that she would like to go to once that keyword is recognized by the speech recognition engine.
The plugin is currently available only for Internet Explorer, starting with version 6, on Windows Vista and above. In the future I plan to implement Firefox and Chrome extensions based on the same speech technology.
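The registration and notification idea can be illustrated with a small Python sketch. It is not the .NET engine itself and does not use the Microsoft Speech API: the class and method names are hypothetical, the recognizer is replaced by a direct call with the recognized text, and the example.com URLs are placeholders.

```python
class PhraseDispatcher:
    """Hypothetical stand-in for the engine's registration/notification interface."""

    def __init__(self):
        self._callbacks = {}  # trigger phrase (lower case) -> callback

    def register(self, phrase, callback):
        """Register a callback to be invoked when the phrase is recognized."""
        self._callbacks[phrase.lower()] = callback

    def on_recognized(self, phrase):
        """Called with recognized text; invokes the matching callback, if any."""
        callback = self._callbacks.get(phrase.lower())
        if callback is None:
            return None
        return callback(phrase)


# Map trigger phrases to URLs, the way the toolbar's configuration does.
visited = []
dispatcher = PhraseDispatcher()
for keyword, url in {"news": "http://example.com/news",
                     "mail": "http://example.com/mail"}.items():
    dispatcher.register(keyword, lambda phrase, url=url: visited.append(url))

dispatcher.on_recognized("News")      # navigates to the news URL
dispatcher.on_recognized("weather")   # not registered, so nothing happens
```

Keeping the phrase table separate from the recognizer is what would let the same engine serve a toolbar in another browser.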

Comments and suggestions for further development would be highly appreciated.

Melecon 2010

Last week I attended the Melecon 2010 conference, held in Valletta, Malta. I presented my work on content insertion into H.264 compressed video. It is covered in this article:
“Fast H.264 Picture-in-Picture (PIP) Transcoder with B-Slices and Direct Mode Support”.
H.264, an ITU standard for video coding, has become increasingly popular, offering solutions for many applications requiring video compression. In some of these applications there is a need to insert content into an already compressed video. This operation incurs a high computational cost if a naive approach is taken. Therefore, a concept of reusing encoding information, called “Guided Encoding”, was developed in the Signal and Image Processing Lab at the Technion. In this project, we extended this technique and applied “Guided Encoding” to the Main Profile of H.264 to support features such as bi-directional prediction, weighted prediction and Direct encoding mode. The result is a set of recommendations and algorithmic pointers, as well as an implementation of the proposed solution within the H.264 reference software. Evaluation of our solution has shown a significant improvement in run-time compared to the naive approach.

Vibrato detection in audio signals

In a research project I’m currently working on we tackle the problem of discriminating between speech and singing. One of the indicators of singing, as opposed to speech, is the presence of vibrato applied by the singer, or more accurately a pitch vibrato. The pitch vibrato is an oscillation of the base pitch with a frequency between 4 and 8 Hz. Therefore, in order to identify the vibrato effect we need to detect this oscillation.
The first step is detecting the pitch. We split the audio into frames of 256 samples each and perform pitch detection using the autocorrelation method. Now we have a vector of values indicating the pitch for each frame. We compute the DFT of this pitch vector and examine the range of 4-8 Hz. A peak (local maximum) in this range indicates an oscillation of the base pitch. For more robustness we simply calculate the energy of this range:

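$f_{min}=4\,\mathrm{Hz} \\ f_{max}=8\,\mathrm{Hz} \\ E_{vibrato}=\frac{1}{N}\sum_{k:\,f_{min}\le f_k\le f_{max}}|P_k^d|^2$

Here $P_k^d$ is the k-th DFT coefficient of the pitch vector and $f_k$ is the frequency that coefficient corresponds to.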

Now we may either use the calculated energy as a measure of vibrato or compare it to a certain threshold and tell whether the pitch vibrates.
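The energy measure can be tried out with a short Python sketch. It is not the MATLAB code from the research: it uses a naive DFT, it takes N to be the length of the pitch vector, and the example assumes audio sampled at 16000 Hz cut into non-overlapping 256-sample frames, which gives 62.5 pitch values per second. The function name and the test signal are invented for the illustration.

```python
import cmath
import math


def vibrato_energy(pitch, frame_rate, f_min=4.0, f_max=8.0):
    """Energy of the pitch vector's DFT between f_min and f_max, divided by N.

    pitch is one pitch estimate per frame and frame_rate is the number of
    frames per second. N is taken to be len(pitch), which is an assumption.
    """
    n = len(pitch)
    energy = 0.0
    for k in range(n // 2 + 1):
        f_k = k * frame_rate / n              # frequency of DFT bin k in Hz
        if f_min <= f_k <= f_max:
            coeff = sum(p * cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, p in enumerate(pitch))
            energy += abs(coeff) ** 2
    return energy / n


FRAME_RATE = 16000 / 256   # 62.5 frames per second under the assumptions above
NUM_FRAMES = 125           # two seconds of pitch estimates

# A 220 Hz note with a 6 Hz vibrato of +/- 5 Hz, and the same note held steady.
sung = [220 + 5 * math.sin(2 * math.pi * 6 * i / FRAME_RATE) for i in range(NUM_FRAMES)]
steady = [220.0] * NUM_FRAMES

e_sung = vibrato_energy(sung, FRAME_RATE)
e_steady = vibrato_energy(steady, FRAME_RATE)
```

The vibrating pitch yields a large energy in the 4-8 Hz range while the steady pitch yields practically none, so a threshold anywhere between the two values separates the cases in this toy example.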

The MIR Toolbox made this task very easy to perform by offering useful functions for audio analysis, segmentation to frames and much more.