Thesis
Recursive Neural Networks for Sensitive Information Detection in Emails
As large amounts of text data are created on the internet every day, it becomes increasingly relevant to find ways to process this data automatically. Machine learning and natural language processing techniques have been applied successfully to many problems in this area, such as sentiment analysis and question answering. A task that, to the best of our knowledge, has yet to be fully explored is sensitive information detection: detecting information in documents that the authors do not want leaked to the public. A specific example is the large volume of emails and other data surrounding court cases about fraud and corruption in large corporations, as in the Enron and Monsanto scandals. Past work on this task has focused on keyword-based solutions and achieved some success, but such approaches usually require domain knowledge and fail to capture more complex patterns. Neural-network-based techniques are effective at discovering such underlying patterns. In recent years, tree-structured recursive neural networks have become state of the art for analyzing text data at the sentence level. Many variants have been developed with steadily improving performance, but most have not yet been applied to sensitive information detection. In this thesis, we present a study on sensitive information detection using five different variations of recursive neural networks for parsing natural language and creating sentence representations. The primary purpose of this thesis is to discover how different hyper-parameters affect the different models and which models perform best on our specific task. We find a suitable set of hyper-parameters and discuss the effects of individual parameters on each model. Testing on a large dataset, we find that a recursive neural network that uses the long short-term memory (LSTM) architecture in combination with an LSTM context tracker achieves the highest performance when considering both training time and accuracy.