diff --git a/docs/source/lexrank_tutorial.rst b/docs/source/lexrank_tutorial.rst
new file mode 100644
index 000000000..5f3e472dd
--- /dev/null
+++ b/docs/source/lexrank_tutorial.rst
@@ -0,0 +1,283 @@
+===================================
+Tutorial: Extractive Summarization
+===================================
+
+This tutorial will go through the implementation of several extractive
+summarization models with spaCy.
+
+An *extractive* summarization system is a filter over the original document(s):
+most of the text is removed, and the remaining text is formatted as a summary.
+In contrast, an *abstractive* summarization system generates new text.
+
+Application Context
+-------------------
+
+Extractive summarization systems need an application context. We can't ask how
+to design the system without some concept of what sort of summary will be
+useful for a given application. (Contrast this with speech recognition, where
+the notion of "correct" is much less application-sensitive.)
+
+For this tutorial, I've adopted the application context that `Flipboard`_
+discuss in a recent blog post: they want to display lead-text to readers on
+mobile devices, so that readers can easily choose interesting links.
+
+I've chosen this application context for two reasons. First, `Flipboard`_ say
+they're putting something like this into production. Second, there's a ready
+source of evaluation data: we can look at the lead-text that human editors
+have chosen, and evaluate whether our automatic system chooses similar text.
+
+Experimental Setup
+------------------
+
+Instead of scraping data, I'm using articles from the New York Times Annotated
+Corpus, a handy dump of XML-annotated `newyorktimes.com`_ articles distributed
+by the LDC. The annotations come with a field named "online lead paragraph".
+Our summarization systems will be evaluated on their `Rouge`_-1 overlap with
+this field.
+
+Further details of the experimental setup can be found in the appendices.
+
+.. _newyorktimes.com: http://newyorktimes.com
+
+.. _Flipboard: http://engineering.flipboard.com/2014/10/summarization/
+
+.. _vector-space model: https://en.wikipedia.org/wiki/Vector_space_model
+
+.. _LexRank algorithm: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
+
+.. _PageRank: https://en.wikipedia.org/wiki/PageRank
+
+Summarizer API
+--------------
+
+Each summarization model will have the following API:
+
+.. py:function:: summarize(nlp: spacy.en.English, headline: unicode, paragraphs: List[unicode], target_length: int) -> unicode
+
+We receive the headline, a list of paragraphs, and a target length. We have to
+produce a block of text where ``len(text) < target_length``. We want summaries
+that users will click on, and not bounce back out of. Long-term, we want
+summaries that would keep people using the app.
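+
+For concreteness, here is how a caller might drive one of these models. This
+is only an illustration: the article text is hypothetical, 140 characters is
+an arbitrary target length, and ``summarize`` stands in for any of the
+concrete implementations developed below.
+
+.. code:: python
+
+    from spacy.en import English
+
+    nlp = English()
+    headline = u'Example headline'
+    paragraphs = [u'First paragraph of a hypothetical article.',
+                  u'Second paragraph of the article.']
+    # `summarize` is whichever model we're evaluating, e.g. truncate_chars.
+    summary = summarize(nlp, headline, paragraphs, target_length=140)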
+
+Baselines: Truncate
+-------------------
+
+.. code:: python
+
+    def truncate_chars(nlp, headline, paragraphs, target_length):
+        text = ' '.join(paragraphs)
+        return text[:target_length - 3] + '...'
+
+    def truncate_words(nlp, headline, paragraphs, target_length):
+        text = ' '.join(paragraphs)
+        tokens = text.split()
+        n_words = 0
+        n_chars = 0
+        # Add whole words until the next one would push us past the limit.
+        while (n_words < len(tokens)
+               and n_chars + len(tokens[n_words]) + 1 <= target_length - 3):
+            n_chars += len(tokens[n_words]) + 1  # Word plus trailing space
+            n_words += 1
+        return ' '.join(tokens[:n_words]) + '...'
+
+    def truncate_sentences(nlp, headline, paragraphs, target_length):
+        summary = ''
+        for para in paragraphs:
+            tokens = nlp(para)
+            for sentence in tokens.sentences():
+                sent_text = str(sentence)
+                if len(summary) + len(sent_text) >= target_length:
+                    return summary
+                summary += sent_text
+        return summary
+
+I'd be surprised if Flipboard never had something like this in production.
+Details like lead-text take a while to float up the priority list. This
+strategy also has the advantage of transparency: it's obvious to users how the
+decision is being made, so nobody is likely to complain about the feature if
+it works this way.
+
+Instead of cutting off the text mid-word, we can tokenize the text and
+truncate on a word boundary instead; ``truncate_sentences`` goes one step
+further and uses spaCy's sentence segmentation to truncate on a sentence
+boundary. The results:
+
++----------------+----------------+
+| System         | Rouge-1 recall |
++================+================+
+| Truncate chars | 69.3           |
++----------------+----------------+
+| Truncate words | 69.8           |
++----------------+----------------+
+| Truncate sents | 48.5           |
++----------------+----------------+
+
+Sentence Vectors
+----------------
+
+A simple bag-of-words model can be created using the ``count_by`` method, which
+produces a dictionary of frequencies, keyed by string IDs:
+
+.. code:: python
+
+    >>> from spacy.en import English
+    >>> from spacy.en.attrs import SIC
+    >>> nlp = English()
+    >>> tokens = nlp(u'a a a. b b b b.')
+    >>> tokens.count_by(SIC)
+    {41L: 4, 11L: 3, 5L: 2}
+    >>> [s.count_by(SIC) for s in tokens.sentences()]
+    [{11L: 3, 5L: 1}, {41L: 4, 5L: 1}]
+
+Similar functionality is provided by `scikit-learn`_, but with a different
+style of API design. With spaCy, functions generally have more limited
+responsibility. The advantage of this is that spaCy's APIs are much simpler,
+and it's often easier to compose functions in a more flexible way.
+
+One particularly powerful feature of spaCy is its support for
+`word embeddings`_ --- the dense vectors introduced by deep learning models,
+and now commonly produced by `word2vec`_ and related systems.
+
+Once a set of word embeddings has been installed, the vectors are available
+from any token:
+
+.. code:: python
+
+    >>> from spacy.en import English
+    >>> from scipy.spatial.distance import cosine
+    >>> nlp = English()
+    >>> tokens = nlp(u'Apple banana Batman hero')
+    >>> cosine(tokens[0].vec, tokens[1].vec)
+
+.. _word embeddings: https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
+
+.. _word2vec: https://code.google.com/p/word2vec/
+
+Putting the pieces together, the top-level driver reads the articles, builds a
+feature extractor, and writes out a summary for each document:
+
+.. code:: python
+
+    import spacy.en
+
+    def main(db_loc, stops_loc, output_dir, feat_type='tfidf'):
+        nlp = spacy.en.English()
+
+        # Read the stop list and make the TF-IDF weights --- the data needed
+        # for the feature extraction.
+        with open(stops_loc) as file_:
+            stop_words = set(nlp.vocab.strings[word.strip()] for word in file_)
+        idf_weights = get_idf_weights(nlp, iter_docs(db_loc))
+        if feat_type == 'tfidf':
+            feature_extractor = tfidf_extractor(stop_words, idf_weights)
+        elif feat_type == 'vec':
+            feature_extractor = vec_extractor(stop_words, idf_weights)
+
+        for i, text in enumerate(iter_docs(db_loc)):
+            tokens = nlp(text)
+            sentences = tokens.sentences()
+            summary = summarize(sentences, feature_extractor)
+            write_output(summary, output_dir, i)
+
+.. _scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
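+
+The driver relies on two helpers that this tutorial doesn't show:
+``iter_docs``, which just needs to yield the article texts from the database
+as unicode strings, and ``get_idf_weights``. Here is a rough sketch of the
+latter, assuming (my assumption, not the tutorial's code) that the IDF weight
+of a term is ``log(n_docs / document_frequency)`` over the same article
+collection:
+
+.. code:: python
+
+    import math
+
+    from spacy.en.attrs import LEMMA
+
+    def get_idf_weights(nlp, texts):
+        # Count how many documents each lemma ID appears in.
+        doc_freqs = {}
+        n_docs = 0
+        for text in texts:
+            n_docs += 1
+            for term in nlp(text).count_by(LEMMA):
+                doc_freqs[term] = doc_freqs.get(term, 0) + 1
+        return {term: math.log(float(n_docs) / freq)
+                for term, freq in doc_freqs.items()}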
+
+The LexRank Algorithm
+---------------------
+
+The `LexRank algorithm`_ is graph-based, derived from Google's `PageRank`_.
+The nodes are sentences, and the edges are the similarities between one
+sentence and another. The "graph" is fully-connected, and its edges are
+undirected --- so, it's natural to represent it as a matrix:
+
+.. code:: python
+
+    from scipy.spatial.distance import cosine
+    import numpy
+
+
+    def lexrank(sent_vectors):
+        n = len(sent_vectors)
+        # Build the pairwise similarity matrix. Note that scipy's cosine()
+        # is a distance, so we convert it into a similarity.
+        matrix = numpy.zeros(shape=(n, n))
+        for i in range(n):
+            for j in range(n):
+                matrix[i, j] = 1.0 - cosine(sent_vectors[i], sent_vectors[j])
+        # Normalize, so that each row sums to 1
+        for i in range(n):
+            matrix[i] /= sum(matrix[i])
+        return _pagerank(matrix)
+
+The rows are normalized (i.e. each row sums to 1), allowing the PageRank
+algorithm to be applied. Unfortunately the PageRank implementation is rather
+opaque --- it's easier to just read the Wikipedia page:
+
+.. code:: python
+
+    def _pagerank(matrix, d=0.85):
+        # This is admittedly opaque --- just read the Wikipedia page.
+        n = len(matrix)
+        rank = numpy.ones(shape=(n,)) / n
+        new_rank = numpy.zeros(shape=(n,))
+        while not _has_converged(rank, new_rank):
+            rank, new_rank = new_rank, rank
+            for i in range(n):
+                new_rank[i] = ((1.0 - d) / n) + (d * sum(rank * matrix[i]))
+        return rank
+
+    def _has_converged(x, y, epsilon=0.0001):
+        return all(abs(x[i] - y[i]) < epsilon for i in range(len(x)))
+
+
+Initial Processing
+------------------
+
+
+Feature Extraction
+------------------
+
+The LexRank paper models each sentence as a bag-of-words in a
+`vector-space model`_, with each term weighted by its TF-IDF score:
+
+.. code:: python
+
+    from spacy.en.attrs import LEMMA
+
+    def sentence_vectors(sentences, idf_weights):
+        vectors = []
+        for sent in sentences:
+            tf_idf = {}
+            for term, freq in sent.count_by(LEMMA).items():
+                tf_idf[term] = freq * idf_weights.get(term, 0.0)
+            vectors.append(tf_idf)
+        return vectors
+
+This is simple and fairly standard, but often gives underwhelming results. My
+idea is to instead calculate vectors from `word embeddings`_, which have been
+one of the exciting outcomes of the recent work on deep learning. I had a
+quick look at the literature, and found a `recent workshop paper`_ that
+suggested the idea was plausible.
+
+Taking the feature representation and similarity function as parameters, the
+LexRank function looks like this: given a list of N sentences, a function that
+maps a sentence to a feature vector, and a function that computes a similarity
+measure of two feature vectors, it produces a vector of N floats, which
+indicate how well each sentence represents the document as a whole. (A sketch
+of this parameterized version is given below.)
+
+.. _Rouge: https://en.wikipedia.org/wiki/ROUGE_%28metric%29
+
+.. _recent workshop paper: https://www.aclweb.org/anthology/W/W14/W14-1504.pdf
+
+
+Document Model
+--------------
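+
+For LexRank, the document is modeled as the list of its sentences' feature
+vectors, and the summarizer scores each sentence by how central it is to that
+set. Below is a minimal sketch of the parameterized ranking function described
+above; the names ``rank_sentences``, ``get_vector`` and ``similarity`` are
+mine rather than the tutorial's, and ``_pagerank`` is the helper defined in
+the LexRank section:
+
+.. code:: python
+
+    import numpy
+
+    def rank_sentences(sentences, get_vector, similarity):
+        # Map each sentence to its feature vector.
+        vectors = [get_vector(sent) for sent in sentences]
+        n = len(vectors)
+        # Pairwise similarities between the sentences.
+        matrix = numpy.zeros(shape=(n, n))
+        for i in range(n):
+            for j in range(n):
+                matrix[i, j] = similarity(vectors[i], vectors[j])
+        # Row-normalize and rank, exactly as in lexrank() above.
+        for i in range(n):
+            matrix[i] /= sum(matrix[i])
+        return _pagerank(matrix)
+
+With the embedding-based features, ``get_vector`` would return a dense vector
+for each sentence, and ``similarity`` could be
+``lambda a, b: 1.0 - cosine(a, b)``, reusing the ``cosine`` distance imported
+earlier.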