Deanonymizing Quora Answers

Our task is to identify the authors of texts, in particular short texts such as Quora answers. We investigate the application of the LSTM (long short-term memory) architecture for RNNs (recurrent neural networks) to learn linguistic cues and perform author identification. As humans, we often associate certain people with certain writing styles. In fact, historians have used this as a tool to identify the authors of works whose authorship was unknown. Apart from literary interest, these methods also have applications in identity tracing for cyber forensics, more generally referred to as forensic linguistics.

Typical approaches look at simple style markers such as the frequency of unigrams (single words) and bigrams (word pairs). This works well when, for example, we are trying to predict the author of a book, because many of the rarer words and phrases specific to that author are likely to be present. But these approaches don’t work as well for shorter texts.
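To make the frequency-based style markers concrete, here is a minimal sketch in Python of single-word and word-pair frequency profiles. The function names (`tokenize`, `ngram_profile`, `style_markers`) and the simple regex tokenizer are assumptions made for illustration; they are not taken from the project or its report.

```python
from collections import Counter
import re

def tokenize(text):
    # Deliberately simple tokenizer (an assumption): lowercase, keep letter runs.
    return re.findall(r"[a-z']+", text.lower())

def ngram_profile(text, n):
    """Relative frequency of each word n-gram in the text."""
    tokens = tokenize(text)
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def style_markers(text):
    # Combine single-word and word-pair frequencies into one feature dictionary.
    features = {}
    features.update(ngram_profile(text, 1))
    features.update(ngram_profile(text, 2))
    return features

sample = "the cat sat on the mat and the cat slept"
markers = style_markers(sample)
print(markers[("the",)])        # frequency of the single word "the"
print(markers[("the", "cat")])  # frequency of the word pair "the cat"
```

On a ten-word sample like this one, almost every word pair occurs exactly once, which hints at why such profiles carry little author-specific signal for short texts.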

However, even for shorter texts, it should be possible to distinguish between authors based on linguistic cues such as sentence construction. We investigate whether an LSTM-based RNN can learn these cues.
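As a rough illustration of how an LSTM can read a short text word by word and produce a distribution over candidate authors, here is a minimal forward-pass sketch in plain Python. It is not the project's model: the dimensions, the random untrained weights, and all names (`init_params`, `lstm_step`, `author_probabilities`) are assumptions made for illustration, and training is omitted entirely.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    # Compute W @ v + b for a matrix given as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]

def init_params(vocab_size, embed_dim, hidden_dim, n_authors, seed=0):
    # Random, untrained weights; a real model would learn these from data.
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {"E": mat(vocab_size, embed_dim),
              "Wy": mat(n_authors, hidden_dim), "by": [0.0] * n_authors}
    for gate in "ifog":  # input, forget, output gates and candidate cell state
        params["W" + gate] = mat(hidden_dim, embed_dim + hidden_dim)
        params["b" + gate] = [0.0] * hidden_dim
    return params

def lstm_step(p, x, h, c):
    # One LSTM time step on the concatenation of the input and previous hidden state.
    z = x + h
    i = [sigmoid(a) for a in affine(p["Wi"], p["bi"], z)]
    f = [sigmoid(a) for a in affine(p["Wf"], p["bf"], z)]
    o = [sigmoid(a) for a in affine(p["Wo"], p["bo"], z)]
    g = [math.tanh(a) for a in affine(p["Wg"], p["bg"], z)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def author_probabilities(token_ids, p):
    # Run the LSTM over the token sequence, then apply softmax over authors.
    hidden_dim = len(p["bi"])
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for t in token_ids:
        h, c = lstm_step(p, p["E"][t], h, c)
    scores = affine(p["Wy"], p["by"], h)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

params = init_params(vocab_size=10, embed_dim=4, hidden_dim=5, n_authors=3)
probs = author_probabilities([1, 4, 2, 7], params)  # hypothetical token ids
print(probs)
```

Because the hidden state is updated in order, the final state depends on word sequence and not only on word counts, which is what would let such a model pick up cues like sentence construction.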

Quora is a question-answering website where many prolific individuals answer many questions, but the answers may be short. This makes it a natural testing ground for author identification on shorter texts.

This was a course project for CS224D: Deep Learning for NLP, and the report can be found here.