mirror of https://github.com/explosion/spaCy.git
doctype html
html(lang='en')
    head
        meta(charset='utf-8')
        title spaCy Blog
        meta(name='description', content='')
        meta(name='author', content='Matthew Honnibal')
        link(rel='stylesheet', href='css/style.css')
        //if lt IE 9
            script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
    body#blog
        header(role='banner')
            h1.logo spaCy Blog
            .slogan Blog

        main#content(role='main')
            section.intro
                p
                    | Example use of the spaCy NLP tools for data exploration.
                    | Here we will look for reddit comments that describe Google doing something,
                    | i.e. discuss the company's actions. This is difficult, because other senses of
                    | "Google" now dominate usage of the word in conversation, particularly references to
                    | using Google products.

                p
                    | The heuristics used are quick and dirty – about 5 minutes' work.

                //| A better approach is to use the word vector of the verb. But the
                //| demo here is just to show what's possible to build up quickly, to
                //| start to understand some data.
            article.post
                header
                    h2 Syntax-specific Search
                    .subhead
                        | by
                        a(href='#', rel='author') Matthew Honnibal
                        | on
                        time(datetime='2015-08-14') August

                details
                    summary: h4 Imports

                    pre.language-python
                        code
                            | from __future__ import unicode_literals
                            | from __future__ import print_function
                            | import sys
                            |
                            | import plac
                            | import bz2
                            | import ujson
                            | import spacy.en
                details
                    summary: h4 Load the model and iterate over the data

                    pre.language-python
                        code
                            | def main(input_loc):
                            |     nlp = spacy.en.English()  # Loading the model takes 10-20 seconds.
                            |     for line in bz2.BZ2File(input_loc):  # Iterate over the reddit comments from the dump.
                            |         comment_str = ujson.loads(line)['body']  # Parse the JSON object and extract the 'body' attribute.
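The streaming step in main() can be tried without spaCy at all. The sketch below is a minimal, hypothetical example: it assumes a small sample file (path and contents invented for illustration) and uses the standard library's bz2 and json modules in place of ujson, writing a tiny line-delimited dump and reading the comment bodies back the same way the loop above does.

```python
import bz2
import json
import os
import tempfile

# Hypothetical stand-in for a reddit comment dump: one JSON object
# per line, bz2-compressed (stdlib json here instead of ujson).
path = os.path.join(tempfile.gettempdir(), 'comments_sample.bz2')
with bz2.open(path, 'wt', encoding='utf8') as f:
    f.write(json.dumps({'body': "Google dropped support for Android < 4.0 already"}) + '\n')
    f.write(json.dumps({'body': "google drive"}) + '\n')

# Stream the dump line by line, extracting the 'body' attribute of
# each comment, mirroring the iteration in main().
bodies = [json.loads(line)['body'] for line in bz2.open(path, 'rt', encoding='utf8')]
print(bodies)
```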
                details
                    summary: h4 Apply the spaCy NLP pipeline, and look for the cases we want

                    pre.language-python
                        code
                            |         comment_parse = nlp(comment_str)
                            |         for word in comment_parse:
                            |             if google_doing_something(word):
                            |                 # Print the clause
                            |                 print(''.join(w.string for w in word.head.subtree).strip())
                details
                    summary: h4 Define the filter function

                    pre.language-python
                        code
                            | def google_doing_something(w):
                            |     if w.lower_ != 'google':
                            |         return False
                            |     # Is it the subject of a verb?
                            |     elif w.dep_ != 'nsubj':
                            |         return False
                            |     # And not 'is'
                            |     elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
                            |         return False
                            |     # Exclude e.g. "Google says..."
                            |     elif w.head.lemma_ in ('say', 'show'):
                            |         return False
                            |     else:
                            |         return True
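The filter reads only four token attributes (lower_, dep_, head.lemma_ and head.dep_), so its branching can be exercised without loading a model by using a lightweight stand-in for spaCy's Token. The FakeToken and FakeHead classes below are a hypothetical sketch for illustration, not the real spaCy API.

```python
from dataclasses import dataclass

@dataclass
class FakeHead:
    # Stand-in for a token's syntactic head; only the attributes
    # the filter reads. 'ROOT' is an assumed default dep label.
    lemma_: str
    dep_: str = 'ROOT'

@dataclass
class FakeToken:
    # Stand-in for spacy's Token, exposing only what the filter uses.
    lower_: str
    dep_: str
    head: FakeHead

def google_doing_something(w):
    # Same branch logic as the filter defined in the post.
    if w.lower_ != 'google':
        return False
    elif w.dep_ != 'nsubj':
        return False
    elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
        return False
    elif w.head.lemma_ in ('say', 'show'):
        return False
    else:
        return True

# "Google dropped ..." -> 'google' is the subject of 'drop': accepted.
hit = FakeToken('google', 'nsubj', FakeHead('drop'))
# "Google says ..." -> reporting verb: rejected.
miss_say = FakeToken('google', 'nsubj', FakeHead('say'))
# "google drive" -> not a subject at all: rejected.
miss_dep = FakeToken('google', 'compound', FakeHead('drive'))

print(google_doing_something(hit))       # True
print(google_doing_something(miss_say))  # False
```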
                details
                    summary: h4 Call main

                    pre.language-python
                        code
                            | if __name__ == '__main__':
                            |     plac.call(main)
                details
                    summary: h4 Example output

                    p.
                        Many false positives remain. Some come from spaCy misinterpreting
                        the sentence, and some from flaws in our filtering logic. But
                        the results are vastly better than a string-based search, which returns
                        almost no examples of the pattern we're looking for.

                    code
                        | Google dropped support for Android < 4.0 already
                        | google drive
                        | Google to enforce a little more uniformity in its hardware so that we can see a better 3rd party market for things like mounts, cases, etc
                        | When Google responds
                        | Google translate cyka pasterino.
                        | A quick google looks like Synology does have a sync'ing feature which does support block level so that should work
                        | (google came up with some weird One Piece/FairyTail crossover stuff), and is their knowledge universally infallible?
                        | Until you have the gear, google some videos on best farming runs on each planet, you can get a lot REAL fast with the right loop.
                        | Google offers something like this already, but it is truly terrible.
                        | google isn't helping me
                        | Google tells me: 0 results, 250 pages removed from google.
                        | how did Google swoop in and eat our lunch

        script(src="js/prism.js")
        script(src="js/details_polyfill.js")