Deep Reader

What?

This project involves natural language processing and network science.

It was developed as part of our MSc in Digital Media Engineering at DTU, in collaboration with EasyTranslate.

The main goal is to provide the tools to understand a text without having to read it: we make it possible to explore a corpus of texts and to display the relations between them.

EasyTranslate is a translation company that works with specialized translators. Today, when a client uploads a text, a manager reads it quickly and routes it to the right translators. We provide tools to bypass this read-through by analysing the text with natural language processing. We also make it possible to get an overview of all the documents: the corpus network. This network gives a nice visualization of the texts and provides a notion of distance between them. Thanks to these (semantic) distances, the right translator can be found automatically: the one who has translated the closest text.

We provide analysis on two levels:

  • Relations with other texts, using cosine similarity
  • Topic distribution, complexity and the most important words within a text

How do I use it?

Corpus visualization

The Network page shows a network composed of nodes. Nodes represent texts, and links represent the similarity between texts: the thicker a link, the more similar the two texts it connects. The graph provides a nice way to visualize a corpus.

Play with the network, try to understand what the texts are about, and explore the relationships between them.

You can see different types of nodes:

  • The lightest and largest nodes represent topics built through the LDA
  • The smallest and darkest nodes represent the texts

Single text analysis

The Text Analysis page shows the analysis of a single text. By default it analyses a random text from the corpus, and you can access the analysis of any text of your choice from the network (the button is in the left panel).

This page shows the distribution of topics along the text, the complexity based on the Flesch reading ease, and the most important words for each topic, displayed with both a word cloud and a graph.

Keep in mind that the main goal is to provide a tool which helps a manager understand a text, giving an idea of its difficulty and of its most challenging words. The analysis page also provides tools for understanding the content of a text: the force-directed graph, for instance, gives a nice overview of the text because it visually shows examples of its most significant (and complex) words.

Communities

We use a Python implementation of the Louvain algorithm to partition the network into communities: groups of nodes whose documents are highly similar to each other.

The modularity, which gives an insight into the quality of the partition, is around 0.62, which is fairly good (modularity ranges from -0.5 to 1; a value close to 1 indicates well-separated groups).
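
As a rough illustration, here is a minimal sketch of this step using NetworkX and the python-louvain package, with a toy graph standing in for the real corpus network:

```python
# Minimal sketch of the community detection step: a toy graph stands in for
# the corpus network, whose weighted edges are the kept cosine similarities.
import networkx as nx
import community as community_louvain  # the python-louvain package

G = nx.karate_club_graph()  # placeholder for the real corpus network

# Louvain partition: maps each node to a community id
partition = community_louvain.best_partition(G, weight="weight")

# modularity of the partition (ranges from -0.5 to 1)
modularity = community_louvain.modularity(partition, G, weight="weight")
print(f"{len(set(partition.values()))} communities, modularity = {modularity:.2f}")
```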

We also coloured the connected components containing more than 2 documents, in order to display a nice visualization of the groups within the corpus. The analysis of the communities shows that the groups found by the Louvain algorithm really make sense: for instance, the texts in the purple community all relate to philosophy.

What data?

We extracted texts from several sources:

  • The MASC corpus provides a great variety of texts. However, some texts and categories have been removed in order to obtain a more balanced distribution of texts.
  • The Gutenberg Project provides a huge quantity of texts on different topics. The texts were chosen according to their topics.
  • And some other miscellaneous texts, such as the iPhone user guide or the Apache License.

Topics: the origins

The topics have been retrieved with LDA (Latent Dirichlet Allocation). We used Gensim, an open-source Python library for natural language processing, on the 5,025,150 articles of the English Wikipedia (one day of computation). LDA is a model which, given a corpus of texts, extracts a chosen number of topics. Topics are groups of semantically related words: each topic is a list of weighted words.
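
As an illustration, here is a minimal sketch of topic extraction with Gensim's LdaModel on a toy corpus (the real model was of course trained on the Wikipedia dump, with far more topics and passes):

```python
# Minimal sketch of topic extraction with Gensim's LDA on a toy corpus.
from gensim import corpora, models

documents = [
    "the stock market and the economy grew this quarter".split(),
    "the recipe calls for flour butter and sugar".split(),
    "prices inflation and interest rates worry investors".split(),
]

dictionary = corpora.Dictionary(documents)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

# train an LDA model with a chosen number of topics
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# each topic is a list of weighted words
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [(word, round(weight, 3)) for word, weight in words])
```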

The topics are produced by a mathematical model and are based on statistics. Thus, a word taken alone within a topic often doesn't make much sense: when the topics are retrieved, each word is given a probability of belonging to each topic, so the same word can appear in several topics because words are used in different contexts.

In order to provide an overview of the texts without showing the whole content of a topic, we manually named the most used topics. Some topic names are therefore just an overall description that allows us to tell the topics apart. For instance, the topic "everyday vocabulary" corresponds to common adjectives and vocabulary related to dialogue. This vocabulary can be used in many different contexts: texts about marketing, for example, are characterized by a high proportion of this topic because they use a lot of positive adjectives. The same observation holds for movie scripts and fiction, whether about war or fantasy.

What does "relation between texts" mean?

Cosine similarity

The distance between texts is based on the cosine similarity in the topic space. Each topic represents a dimension (in the example, the dimensions are the topics "economy", "mathematics" and "food"). The cosine similarity is simply the cosine of the angle between two vectors in the topic space: the diagram on the left shows how it works with three topics, where the similarity corresponds to the cosine of the angle Θ between the two vectors. The same principle applies to a space of 100 dimensions.

In order to obtain a graph that makes sense, we apply a threshold to the similarities: only similarities higher than 85% are displayed.
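
Here is a minimal sketch of this computation, with illustrative three-dimensional topic vectors and the 0.85 threshold used to build the network:

```python
# Minimal sketch: cosine similarity between topic-distribution vectors,
# keeping only the links above the 0.85 threshold (vectors are illustrative).
import numpy as np
import networkx as nx

def cosine_similarity(u, v):
    """cos(theta) between two vectors in the topic space."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy topic distributions over three topics: economy, mathematics, food
texts = {
    "text_a": np.array([0.8, 0.1, 0.1]),
    "text_b": np.array([0.7, 0.2, 0.1]),
    "text_c": np.array([0.1, 0.1, 0.8]),
}

G = nx.Graph()
G.add_nodes_from(texts)
names = list(texts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine_similarity(texts[a], texts[b])
        if sim > 0.85:                    # threshold used for the corpus network
            G.add_edge(a, b, weight=sim)  # link thickness ~ similarity

print(G.edges(data=True))
```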

How are the texts analysed?

Topic distribution

Using the LDA, we extract the topics from small chunks of the text. This gives a probability, for each part of the text, of belonging to each topic. The text needs to be tokenized before being fed to the LDA.
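
As a rough sketch, assuming the trained Gensim model `lda` and its `dictionary` from the earlier example, the per-chunk distribution could be computed like this (the chunk size is illustrative):

```python
# Minimal sketch of the per-chunk topic distribution, assuming a trained
# Gensim `lda` model and its `dictionary` (chunk size is illustrative).
from nltk.tokenize import word_tokenize

def topic_distribution_along_text(text, lda, dictionary, chunk_size=200):
    """Split the text into chunks and return one topic distribution per chunk."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    distributions = []
    for chunk in chunks:
        bow = dictionary.doc2bow(chunk)
        # list of (topic_id, probability) pairs for this chunk
        distributions.append(lda.get_document_topics(bow, minimum_probability=0.0))
    return distributions
```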

The LDA topics are mathematically derived and rest on assumptions which are not always verified, either in the Wikipedia corpus or in the texts we analyse. Moreover, the LDA is a model, a simplification of the world, and cannot handle every nuance of language. This is why some results need interpretation. Still, the results are quite good and give a nice overview of a text without reading it.

Complexity

We introduced complexity because of our motivation for this project: the goal is to provide a tool for managers in a translation company to allocate a text to the best qualified translator without reading it, and evaluating the complexity of the text can help the manager choose the right translator. The complexity will also be used for dynamic pricing: a complex text requires more resources to be translated well, so its translation should cost more than that of a less complex text, and vice versa.

We use the Flesch reading ease formula to estimate the complexity along the text. This formula is based on the length of the sentences and the number of syllables per word: basically, a text is hard to read if its sentences and words are long. This gives an estimation of the complexity on a scale from 0 to 100 (we inverted the original scale, so higher means harder). Scores can be interpreted with the following table (a minimal sketch of the computation follows it):

  • 0-10: easily understood by an average 11-year-old student
  • 30-40: easily understood by 13- to 15-year-old students
  • 70-100: best understood by university graduates
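
For illustration, here is a minimal sketch of the inverted score; the syllable counter is a crude heuristic and the inversion is an assumption, not necessarily the exact transformation used in the app:

```python
# Minimal sketch of an inverted Flesch reading ease score (0 = easy, 100 = hard).
# The syllable counter is a rough heuristic, not the one used in the app.
import re
from nltk.tokenize import sent_tokenize, word_tokenize

def count_syllables(word):
    """Rough syllable count: groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def inverted_flesch(text):
    sentences = sent_tokenize(text)
    words = [w for w in word_tokenize(text) if w.isalpha()]
    syllables = sum(count_syllables(w) for w in words)
    # classic Flesch reading ease: the higher the score, the easier the text
    flesch = (206.835
              - 1.015 * (len(words) / max(1, len(sentences)))
              - 84.6 * (syllables / max(1, len(words))))
    # invert and clamp so that 0 means easy and 100 means hard
    return min(100.0, max(0.0, 100.0 - flesch))

print(inverted_flesch("The cat sat on the mat. It was happy."))
```
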
Meaningful words

In order to help managers in their choices, we provide the most important words of a text, so that they can spot what the text is about (for instance, a text with difficult legal references or medical vocabulary).

Usual scoring functions such as TF-IDF did not work very well and were very time-consuming, so we developed our own scoring function to sort words according to their difficulty and importance in the text.

This function is quite simple: it is basically the length of the word divided by the frequency of the word in the English vocabulary. It may seem strange and very basic, but sometimes less is more: we obtain really good results compared to the traditional TF-IDF.

The word frequencies are taken from the Reuters corpus in NLTK. The choice of word length as a characteristic parameter is motivated by the fact that, in English, the longer a word is, the more meaning it tends to carry and the more likely it is to be complex. The frequencies are used to filter out common words.
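
A minimal sketch of this scoring function, using the Reuters word frequencies from NLTK (the +1 smoothing term is an illustrative detail):

```python
# Minimal sketch of the word-scoring function: word length divided by the
# word's frequency in the Reuters corpus (smoothing is illustrative).
from nltk import FreqDist
from nltk.corpus import reuters

freq = FreqDist(w.lower() for w in reuters.words() if w.isalpha())

def word_score(word):
    """Long, rare words score high; short, common words score low."""
    return len(word) / (freq[word.lower()] + 1)  # +1 avoids division by zero

def most_meaningful_words(tokens, n=10):
    candidates = {t.lower() for t in tokens if t.isalpha()}
    return sorted(candidates, key=word_score, reverse=True)[:n]
```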

Which technologies are used?

Back-End

The back-end gathers all the work done behind the scenes. Due to technical limitations, all the analyses were computed once and the results stored in a database: free hosts like Heroku do not provide enough computation power to load and run the LDA based on the whole English Wikipedia.

We also built a prototype which analyses texts on the fly: the user uploads a text and gets the results instantly.

All the analyses are made with Python. Here are the libraries we used in the project:

  • Gensim: a powerful natural language processing Python library with an implementation of the LDA model
  • NLTK: our favorite language processing platform for Python. It contains over 50 corpora and lexical resources and also provides tools for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. We use it for text cleaning, tokenization and word frequency counts, as well as for useful data like the Reuters corpus
  • NetworkX: an awesome Python library for network analysis
  • A Python implementation of the Louvain algorithm: an algorithm for community detection (detection of "groups" of nodes) that finds a partition of the network optimizing the modularity
  • Flask: a microframework for Python based on Werkzeug and Jinja 2. It enables us to build this application easily and manage the different webpages (see the minimal sketch after this list)
  • Hosts:
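
For illustration, here is a minimal Flask sketch of how such pages could be wired up (route names and templates are hypothetical, not the app's actual code):

```python
# Hypothetical Flask sketch: two routes serving the precomputed results
# (template names and data loading are illustrative).
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def network():
    # the precomputed corpus network (nodes, links, communities)
    return render_template("network.html")

@app.route("/analysis/<text_id>")
def analysis(text_id):
    # the precomputed analysis (topics, complexity, meaningful words) of one text
    return render_template("analysis.html", text_id=text_id)

if __name__ == "__main__":
    app.run(debug=True)
```
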
Front-End

The front-end represents all the work you can see here, on this app. It includes the webpages and the graphs:

  • Bootstrap: a popular framework for developing responsive, mobile-first projects on the web
  • d3: an amazing JavaScript library for data visualization; all the graphs have been made with this library

About us

Valentin Liévin

Student in MSc Digital Media Engineering at DTU (Denmark) and in MSc General Engineering at École Centrale de Nantes (France)

Yannis Flet-Berliac

Student in MSc Digital Media Engineering at DTU (Denmark) and in MSc General Engineering at École Centrale de Nantes (France)