Friday, March 02, 2012
BROOKE GLADSTONE: Anonymity online has been hotly contested since the birth of the World Wide Web. Last month, a team of researchers released a paper which demonstrated that it is possible to identify someone simply by their writing style.
Arvind Narayanan is a computer science researcher at Stanford and one of the authors of the paper. Welcome to the show.
ARVIND NARAYANAN: Thank you for having me.
BROOKE GLADSTONE: Explain how you conducted the experiment.
ARVIND NARAYANAN: Certainly. We collected a huge set of blog posts, comprising about 100,000 different authors and we looked at each author’s writing style and we tried to see if we could compress that down to unique fingerprints.
BROOKE GLADSTONE: Using the word “since” rather than “because” is a potential marker.
ARVIND NARAYANAN: Indeed. These are roughly interchangeable but different people might use them in different proportion. Our markers range from very simple ones, such as the frequency of various punctuation characters, to full-blown grammatical analysis of sentences. The magic comes when we put hundreds or thousands of these markers together. And that’s what we did.
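To illustrate the kind of markers described above, here is a minimal sketch in Python. It is not the researchers' actual feature set: the word list, the punctuation set, and the function name `style_features` are all assumptions chosen for illustration. It computes punctuation frequencies and the relative frequency of a few common words such as "since" and "because".

```python
import string
from collections import Counter

# Hypothetical word list for illustration; the interview does not give
# the study's real marker set.
FUNCTION_WORDS = ["since", "because", "however", "although", "the", "of"]

# Hypothetical set of punctuation characters to track.
PUNCTUATION = ",;:!?-"


def style_features(text):
    """Return a dict of simple, topic-independent style markers for one text."""
    # Split on whitespace, strip surrounding punctuation, and lowercase.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]

    total_chars = max(len(text), 1)   # avoid division by zero on empty input
    total_words = max(len(words), 1)

    char_counts = Counter(text)
    word_counts = Counter(words)

    features = {}
    # Frequency of each punctuation character per character of text.
    for p in PUNCTUATION:
        features["punct_" + p] = char_counts[p] / total_chars
    # Frequency of each chosen word per word of text.
    for w in FUNCTION_WORDS:
        features["word_" + w] = word_counts[w] / total_words
    return features


sample = "Since it rained, we stayed in; because, frankly, we had no umbrella."
features = style_features(sample)
```

A handful of markers like these would not single anyone out. The point made in the interview is that hundreds or thousands of them are combined.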
BROOKE GLADSTONE: And you were able to correctly identify authors 20 percent of the time? That doesn’t seem that great to me.
ARVIND NARAYANAN: In the most stringent case, where we ask our algorithm to always make a guess as to who the author might be, giving it only a small amount of text, less than a thousand words, we achieved a rate of 20 percent. However, when you give it more text, it goes up to 35 to 40 percent, and when you allow it to make guesses only when it’s relatively confident of the answer, we’re able to get an accuracy of about 80 percent.
BROOKE GLADSTONE: You’re asking an algorithm to express confidence?
ARVIND NARAYANAN: That’s exactly right. We developed several methods by which an algorithm could not only produce an answer which represents who an unknown author might be, but it can also estimate its own probability of correctness.
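The idea of guessing only when confident can be sketched as follows. This is an illustration under stated assumptions, not the method from the paper: it compares style vectors with cosine similarity and uses the gap between the best and second-best match as a crude stand-in for confidence, rather than a calibrated probability. The names `guess_author` and `min_margin`, and the toy data, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_author(unknown, known, min_margin=0.05):
    """Return (author, margin); author is None when the algorithm abstains.

    The margin is the similarity gap between the top two candidates.
    A small gap means the top match barely beats the runner-up, so we
    decline to guess (an assumed, simplistic confidence rule).
    """
    ranked = sorted(
        ((cosine(unknown, vec), name) for name, vec in known.items()),
        reverse=True,
    )
    best_score, best_name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    margin = best_score - runner_up
    if margin < min_margin:
        return None, margin
    return best_name, margin


# Toy data: proportion of "since" versus "because" for two known authors.
known = {
    "alice": {"since": 0.9, "because": 0.1},
    "bob": {"since": 0.1, "because": 0.9},
}
confident = guess_author({"since": 0.8, "because": 0.2}, known)
unsure = guess_author({"since": 0.5, "because": 0.5}, known)
```

Here the first unknown text is attributed to "alice", while the second sits equally close to both authors, so the function abstains. Raising `min_margin` trades fewer guesses for higher accuracy on the guesses that are made, which is the trade-off described in the interview.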
BROOKE GLADSTONE: Your algorithms did not take topic into consideration. You can identify people fairly straightforwardly by what they obsess over online, or at least it certainly would be an effective marker. Why didn’t you use it?
ARVIND NARAYANAN: There were two reasons. To demonstrate the validity of these algorithms, it’s much clearer to compare the performance of one algorithm against another if we only look at people’s writing style because that’s something inherent, whereas the topic can vary greatly from one data set to another.
The second reason is that if you imagine the kind of situation where an oppressive government, let’s say, might be using this algorithm, it could turn out that the victim that they’re going after may have only written about something innocuous such as recipes on their blog under their real identity, but now they might be expressing something very sensitive, such as political speech under cover of a pseudonym.
And so, if we need an algorithm to connect these two together, that algorithm will only be successful if it has been developed without incorporating any topic information, because the two topics that it needs to match are different.
BROOKE GLADSTONE: Was part of your motivation for doing this research to see how much dissidents might be at risk online?
ARVIND NARAYANAN: That is, indeed, a big part of our motivation. There are a variety of technologies that we think can be used to compromise the anonymity of dissidents around the world, and so we wanted to put this out in the public domain.
BROOKE GLADSTONE: Is there any way for people to protect themselves?
ARVIND NARAYANAN: Yes. A previous study by researchers at Drexel has shown that if study participants are asked to think consciously about their writing style and either try to obfuscate it or even try to mimic another person’s writing style, people are reasonably good at doing this without any specialized training.
If somebody wanted to produce an anonymous document and didn’t want to be identified, one way to do that might be to try to mimic the writing style of some famous author.
BROOKE GLADSTONE: So throw in a few “forsooths” and make like Shakespeare?
ARVIND NARAYANAN: It’s a little bit more complex than that, but basically along those lines.
BROOKE GLADSTONE: Arvind, thank you very much.
ARVIND NARAYANAN: Thank you for having me.
BROOKE GLADSTONE: Arvind Narayanan is a computer science researcher at Stanford.