Gyrophone Attack: Q&A part 2

A journalist from the Danish Consumer Council recently contacted me regarding our Gyrophone attack on mobile devices (published in 2014). I'm posting the questions and my answers to them. Note that I haven't been keeping up with the most recent developments around gyroscope access on mobile devices, so I encourage you to verify the current state of affairs.

Q: Can the gyroscopes in Android phones still be read at up to 200 hertz, which is within the spectrum of the human voice?
A: There's no one Android phone. Android is an operating system with many versions currently in use. Access depends not only on the operating system but on the hardware capabilities as well. As far as I know, gyroscope measurements are still accessible to applications without special permissions.
I'm not aware of protective measures taken by software or hardware vendors to limit the available sampling rate. However, as we already noted in the paper, the Chrome browser, and other browsers based on the WebKit framework, impose a software limitation on the sampling rate available from JavaScript, bringing it down to 25 Hz. So at least Chrome protects its users from malicious access to gyroscopes.
What's important to note is that most of the energy in human speech lies above the gyroscope's sampling frequency, and access to it is due to an effect called "aliasing". Low-pass filtering can mitigate the attack. It seems, for instance, that Samsung Galaxy devices use hardware that applies some low-pass filtering (I'm not sure at what cutoff frequency), and our phrase-recognition attack did not perform as well on those phones as it did on Nexus 4 devices. As we did not test many different models, it remains to be studied how well the attack performs on them.
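To illustrate the aliasing effect, here is a minimal Python sketch (not from the paper) showing how a tone above the Nyquist frequency of a 200 Hz sensor folds back into the measurable band:

```python
import numpy as np

fs_sensor = 200.0   # assumed gyroscope sampling rate (Hz)
f_tone = 280.0      # a speech-band tone above the Nyquist frequency (100 Hz)

# Sample the tone at the sensor rate.
n = np.arange(512)
samples = np.sin(2 * np.pi * f_tone * n / fs_sensor)

# Locate the strongest frequency in the sampled signal.
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs_sensor)
print("apparent frequency:", freqs[np.argmax(spectrum)])  # ~80 Hz, the alias of 280 Hz

# A low-pass (anti-aliasing) filter applied *before* sampling would suppress
# the 280 Hz tone entirely, which is why hardware filtering mitigates the attack.
```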

Q: In 2014 the technique was not refined enough to pick up more than a fraction of the words spoken near a phone. What about now? Have gyroscopes in smartphones evolved enough to pick up entire sentences and conversations? From how far away can the gyroscopes pick up conversations?
A: The statement "the technique was not refined enough to pick up more than a fraction of the words" is inaccurate. In our experiments we purposely trained the algorithms on small sets of phrases to recognize. Since we did not conduct experiments with larger dictionaries, it remains to be studied how well the attack performs for larger sets of phrases.
The task (as many other machine learning tasks) definitely becomes harder as the dictionary grows. Our aim was to show a proof of concept, but the full potential of the attack can only be understood by conducting more experiments.
The distance at which this can work depends on the loudness of the signal. If the signal comes from a more distant source but is loud enough, the attack could still work. In our experimental setup, the source of the sound was quite close to the device and fairly loud. What can amplify the attack is a reverberant surface that responds to sound waves and conducts them well.
To put things in perspective, I believe this attack won't work well with most gyroscopes when the speakers are several meters away from the device. However, this is a very general claim, and it all depends on the particular hardware model and its characteristics.

Q: Does the user of an Android phone need to give permission for recording before gyroscopes pick up sound and words? Does Google require the user to give permission for recording?
A: Android phones notify the user when an application requires access to the microphone. However (and that's the point of our work), they don't notify the user when an application accesses the gyroscope, which is what enables stealthy eavesdropping. I'm not sure what you mean by Google, since this is a property of the Android operating system (which is mostly maintained by them), and it's important not to confuse the two. There are many Android distributions that come without pre-installed Google applications, and Google doesn't have access to the data on those phones, so it's important to phrase this accurately. Since the Android OS is mostly maintained by Google, it might be natural to expect them to address such issues; however, since it is an open-source system, technically inclined users can compile their own version of it and mitigate the attack.

Q: Can the user choose not to be recorded (through gyroscopes)?
A: As far as I know, the user of a standard Android distribution doesn’t have an option to block access to the gyroscopes.

Q: What does Google do with the sound data it picks up via gyroscopes?
A: I don't have any evidence that Google collects such data or does anything with it. And since Google has default access to Android devices anyway, we should worry about the general lesson from this attack rather than about Google in particular. More importantly, any Android application can collect gyroscope data. So rather than worrying about Google (which has a reputation to maintain), we should be worried about malicious third parties that have the same access to our data.

Q: Do you know if it is stored somewhere, or whether Google uses it for voice recognition?
A: Again, there's no evidence that Google records gyroscope data, stores it, or uses it anywhere other than on the phone itself. The point is that our work shows new implications of the potential to access this data.

PowerSpy: Location Tracking using Mobile Device Power Analysis

Our phones are always within reach and their location is mostly the same as our location. In effect, tracking the location of a phone is practically the same as tracking the location of its owner. Since users generally prefer that their location not be tracked by arbitrary 3rd parties, all mobile platforms consider the device’s location as sensitive information and go to considerable lengths to protect it: applications need explicit user permission to access the phone’s GPS and even reading coarse location data based on cellular and WiFi connectivity requires explicit user permission.

We showed that, despite these restrictions, applications can covertly learn the phone’s location. They can do so using a seemingly benign sensor: the phone’s power meter that measures the phone’s power consumption over a period of time. Our work is based on the observation that the phone’s location significantly affects the power consumed by the phone’s cellular radio. The power consumption is affected both by the distance to the cellular base station to which the phone is currently attached (free-space path loss) and by obstacles, such as buildings and trees, between them (shadowing). The closer the phone is to the base station and the fewer obstacles between them the less power the phone consumes.
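As a rough illustration of this dependence (a textbook log-distance path-loss model with shadowing, not a formula taken from our paper), the received signal strength can be written as

\[
P_{\mathrm{rx}}(d) = P_{\mathrm{tx}} - PL(d_0) - 10\,n\,\log_{10}\!\frac{d}{d_0} - X_\sigma
\]

where \(d\) is the distance to the base station, \(n\) is the path-loss exponent of the environment, and \(X_\sigma\) is a shadowing term caused by obstacles; the weaker the received signal, the harder the cellular radio has to work and the more power the phone draws.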

The strength of the cellular signal is a major factor affecting the power used by the cellular radio. Moreover, the cellular radio is one of the most dominant power consumers on the phone.
Suppose an attacker measures in advance the power profile consumed by a phone as it moves along a set of known routes or in a predetermined area such as a city. We show that this enables the attacker to infer the target phone’s location over those routes or areas by simply analyzing the target phone’s power consumption over a period of time. This can be done with no knowledge of the base stations to which the phone is attached.

A major technical challenge is that power is consumed simultaneously by many components and applications on the phone in addition to the cellular radio. A user may launch applications, listen to music, turn the screen on and off, receive a phone call, and so on. All these activities affect the phone’s power consumption and result in a very noisy approximation of the cellular radio’s power usage. Moreover, the cellular radio’s power consumption itself depends on the phone’s activity, as well as the distance to the base-station: during a voice call or data transmission the cellular radio consumes more power than when it is idle. All of these factors contribute to the phone’s power consumption variability and add noise to the attacker’s view: the power meter only provides aggregate power usage and cannot be used to measure the power used by an individual component such as the cellular radio.

Nevertheless, using machine learning, we show that the phone’s aggregate power consumption over time completely reveals the phone’s location and movement. Intuitively, the reason why all this noise does not mislead our algorithms is that the noise is not correlated with the phone’s location. Therefore, a sufficiently long power measurement (several minutes) enables the learning algorithm to “see” through the noise. We refer to power consumption measurements as time-series and use methods for comparing time-series to obtain classification and pattern matching algorithms for power consumption profiles.
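To illustrate the time-series matching idea, here is a minimal Python sketch that compares a measured power trace against pre-recorded route profiles using dynamic time warping, one standard way of comparing time-series; the classifiers used in the paper are more involved, and the names below are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D power traces."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify_route(measured, reference_profiles):
    """Return the name of the pre-measured route whose power profile is closest."""
    return min(reference_profiles,
               key=lambda name: dtw_distance(measured, reference_profiles[name]))

# Hypothetical usage: reference_profiles maps route names to power traces
# collected in advance; 'measured' is the victim phone's aggregate power trace.
# route = classify_route(measured, reference_profiles)
```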

The PowerSpy project was joint work with Gabi Nakibly, Aaron Schulman, Gunaa Arumugam Veerapandian and Prof. Dan Boneh from Stanford University, as part of a series of works on sensor exploitation on mobile devices.

Gyrophone: Recognizing Speech from Gyroscope Signals

My advisor Dan Boneh, colleague Gabi Nakibly and I have recently published a paper “Gyrophone: Recognizing Speech from Gyroscope Signals”.
It was presented at the 23rd USENIX Security conference in San Diego, and at
BlackHat Europe 2014 in Amsterdam.

To get a quick idea of what this research is about, the following video should do:

We show that the MEMS gyroscopes found on modern smart phones are sufficiently sensitive to measure acoustic signals in the vicinity of the phone. The resulting signals contain only very low-frequency information (< 200 Hz). Nevertheless we show, using signal processing and machine learning, that this information is sufficient to identify speaker information and even parse speech. Since iOS and Android require no special permissions to access the gyro, our results show that apps and active web content that cannot access the microphone can nevertheless eavesdrop on speech in the vicinity of the phone.
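As a rough sketch of what such a recognition pipeline can look like (the features and classifiers used in the paper differ in detail; the function and variable names below are illustrative), one can compute spectral features from the low-rate gyroscope signal and train an off-the-shelf classifier on labeled recordings:

```python
import numpy as np
from sklearn.svm import SVC

FS = 200  # assumed gyroscope sampling rate (Hz)

def gyro_features(signal, frame_len=64, hop=32):
    """Average magnitude spectrum of short frames of one gyroscope axis."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]
    return np.mean(spectra, axis=0)  # fixed-size feature vector per recording

# Hypothetical training data: gyroscope recordings and the word spoken in each.
# X = np.array([gyro_features(rec) for rec in recordings])
# y = np.array(labels)
# clf = SVC().fit(X, y)
# predicted_word = clf.predict([gyro_features(new_recording)])
```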
This research attracted quite a bit of media attention; the first piece to be published was an article on Wired.com, "The Gyroscopes in Your Phone Could Let Apps Eavesdrop on Conversations". They interviewed us directly, and that article is probably the most technically accurate (Engadget and many others followed and cited this original article).

Here's our BlackHat Europe talk that explains this work in more detail:

A French journalist also approached us with some questions, which I answered in some detail by email. So, to clarify certain points regarding this work, I'm pasting the Q&A here:

Q: What are the best results you got in terms of recovering sound? What are the limits of your work so far? What sorts of sounds couldn't you recover? Could you recover a complete human conversation, for example?
A: To be precise, we currently do not recover the original sound in a way that would be understandable to a human. Rather, we try to tell what the original word was (in our case, digits) based on the gyroscope measurements. The fact that the recording is not comprehensible to a human ear doesn't mean a machine cannot understand it, and that's exactly what we do.
We managed to reach a recognition accuracy of 65% for a specific speaker, for a set of 11 different words, with a single mobile device, and 77% combining measurements from two devices. While that is far from full speech recognition, the important point is that we can still identify potentially sensitive information in a conversation this way.
We also outline a direction for potential reconstruction of the original sound using multiple phones, but that requires further research, and we don't yet claim whether or not it is possible with smartphone devices. Another, no less important, result is the ability to identify the gender of the speaker, or to identify a particular speaker among a group of possible users of the mobile device.

Q: Why has nobody worked on and proposed this approach before? Is it because the technical tools (algorithms) weren't available? I mean, what really accounts for the performance? What was the most difficult part? The algorithm? What are the advantages in comparison to traditional microphones?
A: I'm not completely sure nobody has, but definitely no prior work demonstrated the capabilities to this extent. The fact itself, that gyroscopes are susceptible to acoustic noise, was known. Manufacturers were aware of it, but they didn't look at it from a security point of view; rather, they saw it as an effect that might just add noise to the gyro measurements. We think there hasn't been enough awareness of the possibility of sensing speech frequencies and of its security implications. In particular, in smartphones, access to the gyro doesn't require any permission from the user, which makes it a good candidate for leaking information and, as such, an interesting problem to look into. That is also the advantage compared to a regular microphone.
The hardest part, apart from the idea itself, was to adapt speech-processing algorithms to work with the gyroscope signals and obtain results despite the low sampling frequency and the noise.

Q: What are the applications for Gyrophone that you imagine for the future? Spying?
A: The applications could definitely include eavesdropping on specific words in a conversation, or learning who is near the mobile device at a certain moment.

Q: What are the next steps in your work? I mean, what are you working on now to progress in this direction? Do you plan to publish new results soon?
A: The next steps in this direction would be to better study the limits of this attack: What physical range is possible? Can the recognition accuracy be improved?
Is there a way to synchronize two or more mobile devices to potentially recover sound? Currently we don't plan to publish new results for this attack; rather, we are exploring more ways to leak sensitive information from mobile phones by unexpected means.

Q: Do you imagine that in the future such a system could be used by everybody? To do what?
A: It is not that easy to make such a system work for practical attacks on a large scale, although more research effort in this direction might yield more surprising results.

Q: How could we avoid this spying risk?
A: The attack is not that hard to prevent: by limiting the sampling rate available to applications, requiring specific permissions, or filtering out the high frequencies.
Our hope is that the general issue of side-channel attacks will be addressed by mobile device manufacturers in a way that makes such attacks impossible.

The project page http://crypto.stanford.edu/gyrophone provides access to our code and dataset, as well as a link to the published paper.

Linux Audio Conference 2014

I recently attended LAC'14, the 12th Linux Audio Conference, held this time at the ZKM (Zentrum für Kunst und Medien) in Karlsruhe, Germany. The conference is free and serves as a gathering opportunity for developers of Linux audio tools, experimental electronic music composers and open-source contributors.
I presented a contribution to the compiler of the Faust musical signal processing language. The main maintainers of the Faust project are currently Yann Orlarey and Stéphane Letz.
The joint work with Prof. Julius O. Smith (CCRMA at Stanford University) and Andrew Best (Blamsoft Inc.) added support for various useful features to the Faust VST architecture.
The title was "Extending the Faust VST Architecture with Polyphony, Portamento and Pitch Bend".
I was initially introduced to Faust during a short workshop Yann gave at CCRMA in 2013.
Later on, while taking the Software for Sound Synthesis class at CCRMA, I was trying to use Faust in combination with the MuLab DAW, which runs on Windows and Mac OS X and supports VSTs as instrument plug-ins. Noticing the lack of some common features, I decided to turn this into my class project. The motivation was to enable using Faust to create plug-ins for as many free and commercial DAWs and production tools as possible, and to make the produced plug-ins functional enough to be useful for actual music production.
It was also exciting to meet open-source contributors to projects such as Ardour, QTractor and other great tools. Currently I'm continuing to work on improving Faust's VST architecture, debugging it and making it compatible with more production tools out there (like Ardour). The next goal is solving some issues with using Faust VSTs in Ableton Live.

Towards (even more) practical Faust

I recently completed my final project for the MUSIC 420B class at Stanford University. Prior to taking this class I happened to attend a workshop by Yann Orlarey, the author of the Faust musical signal processing language, held at CCRMA. I was fascinated by the ease and speed with which one could create effects and sound synthesizers using Faust. Also, having missed electronic music composition by that time, I took the class, which offered the possibility to play more with Faust and music production software. My initial goal was to demonstrate practical usage of Faust for music production in combination with software like Logic Pro or Ableton Live and to write a whole piece using Faust-generated sounds. VST, in my opinion, was the way to go because the tools I wanted to use on Mac supported it. Although it was possible to create VSTi plug-ins with Faust, they lacked some features expected from most synthesizers, polyphony being the most noticeable of them. I decided to take this on and fill the gap. The DSSI plug-in architecture supported by Faust already had support for polyphony, so I could get an idea of how to implement it for VSTis.

The following diagram describes the general design of the VSTi architecture. The VST host interacts with the plug-in through the AudioEffectX interface. The Faust class implements this interface and supports polyphony using multiple instances of the Voice class. Each Voice instance contains an instance of the mydsp class, which is produced by the Faust compiler and implements the signal processing/synthesis part.

Faust VSTi architecture design
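As a language-agnostic sketch of the voice-allocation idea (the real implementation is C++ code built on the VST SDK and the Faust-generated mydsp class; the names below are illustrative), polyphony boils down to keeping a pool of voices and assigning incoming notes to free ones:

```python
class Voice:
    def __init__(self):
        self.note = None          # MIDI note currently assigned, or None if free

    def note_on(self, note):
        self.note = note          # in the real plug-in this sets freq/gain/gate on mydsp

    def note_off(self):
        self.note = None

class PolyphonicSynth:
    def __init__(self, num_voices=8):
        self.voices = [Voice() for _ in range(num_voices)]

    def note_on(self, note):
        # Assign the note to the first free voice; steal voice 0 if none is free.
        voice = next((v for v in self.voices if v.note is None), self.voices[0])
        voice.note_on(note)

    def note_off(self, note):
        for v in self.voices:
            if v.note == note:
                v.note_off()
        # Audio rendering would then mix the output of all active voices,
        # each voice calling its own mydsp compute routine.
```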

Support for a portamento slide was added by storing the last played voice in a dedicated member of the Faust class. In addition, the architecture recognizes the "pitchbend" control as one that has to be updated according to MIDI pitch-bend events. The following short loop demonstrates how a whole piece can be produced using Faust instruments. All the instruments are Faust VSTis, except for the percussion, which wasn't fully implemented yet:

That's how the bastard-synth VST used in this loop looks in MuTools' MuLab. Here we can see the controls recognized by the Faust architecture: freq, gain, gate, pitchbend and prevfreq.
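For example, a common way to turn the 14-bit MIDI pitch-bend value into a frequency multiplier over a ±2 semitone range looks like this (a convention, not necessarily the exact mapping used by the architecture):

```python
def pitchbend_to_ratio(bend, semitone_range=2):
    """Map a 14-bit MIDI pitch-bend value (0..16383, center 8192) to a frequency ratio."""
    semitones = (bend - 8192) / 8192.0 * semitone_range
    return 2.0 ** (semitones / 12.0)

# pitchbend_to_ratio(8192) -> 1.0 (no bend); pitchbend_to_ratio(16383) -> ~1.122 (+2 semitones)
```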


A more detailed project summary is in the following paper: http://stanford.edu/~yanm2/files/mus420b.pdf. The project code is part of the Faust source and can be checked out from http://git.code.sf.net/p/faudiostream/code.

LED T-Shirt

During the last couple of months I've been working on a fun side project with my friend Shlomoh Oseary. For a long time I had wanted to make a T-shirt with an equalizer display on it that would light up in correspondence with surrounding sounds and music, and once I had a buddy excited about this idea too, we started working.

We decided to use dedicated e-textile components. The Arduino Lilypad, with its 8 MHz ATmega processor, seemed suitable for the task. Next we had to figure out how we would drive the LEDs. The naive approach of connecting each LED to ground and to one of the Lilypad's outputs would severely limit the number of LEDs we could drive. After searching a bit we found that what we wanted was an LED matrix. The principle of an LED matrix is that all the LEDs in the same row or column are connected. In our case, all the minus legs of the LEDs in the same column are shorted together, and all the plus legs of the LEDs in the same row are shorted together. To light up an LED we feed positive voltage to the corresponding row and short the corresponding column to ground. To light up multiple LEDs, our LED matrix driver code loops over all the rows and columns and repeatedly lights each LED that should be on for a fraction of a second, achieving the effect of those LEDs being constantly on.
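Here is a runnable Python simulation of the scanning idea (the real driver is an Arduino sketch on the Lilypad, where the prints below become digitalWrite calls on the matrix pins):

```python
import time

def scan_once(frame, on_time=0.001):
    """Light each required LED briefly, one row at a time."""
    for row_idx, row in enumerate(frame):
        lit_columns = [c for c, lit in enumerate(row) if lit]
        if not lit_columns:
            continue
        # Feed positive voltage to this row and ground the selected columns.
        print(f"row {row_idx} high, columns {lit_columns} grounded")
        time.sleep(on_time)   # each row stays lit only a fraction of a second

# Example frame: a 4x4 matrix with a lit diagonal; calling scan_once() in a tight
# loop makes the diagonal appear constantly on.
frame = [[r == c for c in range(4)] for r in range(4)]
scan_once(frame)
```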

Testing the microphone and the FFT calculation

Each column of the LED matrix represents a frequency range, with lower frequencies on the right. The more energy is sensed in a certain range, the more LEDs in that column are turned on. To find the energy for each frequency range we compute an FFT over a window of 128 samples. The sampling frequency was chosen to be 4000 Hz, providing, according to the Nyquist theorem, coverage for tones up to 2000 Hz. A predefined threshold (which we need to calibrate) is subtracted from the calculated energy to filter out small fluctuations, and the outcome is mapped to the number of rows of the LED matrix to represent an energy level.
We used an existing FFT implementation for Arduino from http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1286718155.
There is still a final touch missing from the algorithm: applying a low-pass filter to remove frequencies higher than 2000 Hz from the recorded signal prior to the FFT calculation.
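A rough Python sketch of this bin-to-column mapping (the real code runs on the Arduino; the threshold and per-window scaling below are illustrative):

```python
import numpy as np

FS = 4000          # sampling frequency (Hz)
WINDOW = 128       # samples per FFT window
NUM_COLS = 8       # columns of the LED matrix (lower frequencies on the right)
NUM_ROWS = 8       # rows available to display an energy level
THRESHOLD = 1.0    # illustrative noise floor, needs calibration

def column_levels(samples):
    """Map one 128-sample window to the number of lit LEDs per matrix column."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(WINDOW)))[1:]  # drop DC
    bands = np.array_split(spectrum, NUM_COLS)                        # one band per column
    energy = np.array([np.mean(b ** 2) for b in bands])
    levels = np.clip(energy - THRESHOLD, 0, None)
    # Scale to the number of rows; a real sketch would use a fixed or adaptive scale.
    if levels.max() > 0:
        levels = levels / levels.max() * NUM_ROWS
    return levels.astype(int)[::-1]   # reverse so lower frequencies land on the right
```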

Connecting the electret microphone and the power supply to the Lilypad.

LED T-shirt @ work

When beauty and electronics meet… (Julia Shteingart modeling)

Code

The project's code (except for the FFT implementation, which can be downloaded using the link above, and the TimerOne library, which can be downloaded from the Arduino site) is available through SVN at

https://bitbucket.org/ymcrcat/led-t-shirt/

Credits

To Shlomoh’s mom for sewing.

Eusipco 2011

At the end of August I attended the Eusipco 2011 conference in Barcelona, Spain. I presented my work on speaker identification using diffusion maps, a manifold learning and dimensionality reduction method developed in recent years by Ronald Coifman and Stephane Lafon. The paper can be found here: "Speaker Identification Using Diffusion Maps".
In this paper we propose a data-driven approach to speaker identification that does not assume any particular speaker model. The goal in the speaker identification task is to determine which one of a group of known speakers best matches a given voice sample. Here we focus on text-independent speaker identification, i.e., no assumption is made regarding the spoken text. Our approach is based on a recently developed manifold learning technique named diffusion maps. Diffusion maps embed the recording into a new space, which is likely to capture the intrinsic structure of the speech. The algorithm was tested and compared to common identification algorithms, and our experiments showed that the proposed algorithm obtains improved results when few labeled samples are available.
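For readers unfamiliar with the method, here is a minimal numpy sketch of the basic diffusion-maps construction itself (the feature extraction and classification steps used for speaker identification are not shown):

```python
import numpy as np

def diffusion_maps(X, epsilon, n_components=2, t=1):
    """Embed the rows of X (feature vectors) with a basic diffusion-maps construction."""
    # Pairwise squared Euclidean distances and a Gaussian affinity kernel.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / epsilon)
    # Row-normalize to obtain a Markov transition matrix.
    P = W / W.sum(axis=1, keepdims=True)
    # Eigen-decomposition; the top non-trivial eigenvectors give the embedding.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return eigvecs[:, 1:n_components + 1] * (eigvals[1:n_components + 1] ** t)
```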

Melecon 2010

Last week I attended the Melecon 2010 conference, held in Valletta, Malta. I presented my work on content insertion into H.264 compressed video. It is covered in this article:
"Fast H.264 Picture-in-Picture (PIP) Transcoder with B-Slices and Direct Mode Support".
H.264, an ITU standard for video coding, has become increasingly popular, offering solutions for many applications requiring video compression. In some of these applications there is a need to insert content into already compressed video. This operation incurs a high computational cost if a naive approach is taken. Therefore, a concept of reusing encoding information, called "Guided Encoding", was developed in the Signal and Image Processing Lab at the Technion. In this project, we extended this technique and applied Guided Encoding to the Main Profile of H.264 to support features such as bi-directional prediction, weighted prediction and the Direct encoding mode. The result is a set of recommendations and algorithmic pointers, as well as an implementation of the proposed solution within the H.264 reference software. Evaluation of our solution has shown a significant improvement in run time compared to the naive approach.

Vibrato detection in audio signals

In research I'm currently working on, we tackle the problem of discriminating between speech and singing. One of the indicators of singing, as opposed to speech, is the presence of vibrato applied by the singer, more precisely a pitch vibrato. Pitch vibrato is an oscillation of the base pitch with a frequency between 4 and 8 Hz. Therefore, in order to identify the vibrato effect we need to detect this oscillation.
The first step is detecting the pitch. We dissect the audio into frames of 256 samples each and perform pitch detection using the autocorrelation method. Now we have a vector of values indicating the pitch for each frame. We compute the DFT of this pitch vector and examine the range of 4-8 Hz. A peak (local maximum) in this range indicates an oscillation of the base pitch. For more robustness we simply calculate the energy of this range:

\[
f_{\min} = 4\ \mathrm{Hz}, \qquad f_{\max} = 8\ \mathrm{Hz}, \qquad
E_{\mathrm{vibrato}} = \frac{1}{N}\sum_{k:\, f_{\min} \le f_k \le f_{\max}} \left|P^d_k\right|^2
\]

Now we may either use the calculated energy directly as a measure of vibrato, or compare it to a certain threshold and decide whether the pitch vibrates.

The MIR Toolbox made this task very easy to perform by offering useful functions for audio analysis, segmentation into frames and much more.
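A minimal Python version of this pipeline (our actual implementation uses the MIR Toolbox in MATLAB; the frame length and pitch-range parameters below are illustrative) might look like this:

```python
import numpy as np

def frame_pitch(frame, fs, f_lo=70.0, f_hi=500.0):
    """Estimate the pitch of one frame with the autocorrelation method."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    lag = lo + np.argmax(ac[lo:hi])      # strongest autocorrelation peak in the pitch range
    return fs / lag

def vibrato_energy(signal, fs, frame_len=256, f_min=4.0, f_max=8.0):
    """Energy of the 4-8 Hz component of the frame-by-frame pitch contour."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    pitch = np.array([frame_pitch(f, fs) for f in frames])
    pitch = pitch - np.mean(pitch)                           # remove the mean pitch before the DFT
    P = np.fft.rfft(pitch)
    freqs = np.fft.rfftfreq(len(pitch), d=frame_len / fs)    # pitch contour sample spacing
    band = (freqs >= f_min) & (freqs <= f_max)
    return np.sum(np.abs(P[band]) ** 2) / len(pitch)
```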