spaCy/docs/redesign/tute_syntax_search.jade

doctype html
html(lang='en')
head
meta(charset='utf-8')
title spaCy Blog
meta(name='description', content='')
meta(name='author', content='Matthew Honnibal')
link(rel='stylesheet', href='css/style.css')
//if lt IE 9
script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
body#blog
header(role='banner')
h1.logo spaCy Blog
.slogan Blog
main#content(role='main')
section.intro
p
| Example use of the spaCy NLP tools for data exploration.
| Here we will look for Reddit comments that describe Google doing something,
| i.e. that discuss the company's actions. This is difficult, because other senses of
| "Google" now dominate usage of the word in conversation, particularly references to
| using Google products.
p
| The heuristics used are quick and dirty – about 5 minutes' work.
//| A better approach is to use the word vector of the verb. But, the
// | demo here is just to show what's possible to build up quickly, to
// | start to understand some data.
article.post
header
h2 Syntax-specific Search
.subhead
| by
a(href='#', rel='author') Matthew Honnibal
| on
time(datetime='2015-08-14') August 14, 2015
details
summary: h4 Imports
pre.language-python
code
| from __future__ import unicode_literals
| from __future__ import print_function
|
| import bz2
|
| import plac
| import ujson
| import spacy.en
details
summary: h4 Load the model and iterate over the data
pre.language-python
code
| def main(input_loc):
|     nlp = spacy.en.English() # Loading the model takes 10-20 seconds.
|     for line in bz2.BZ2File(input_loc): # Iterate over the Reddit comments in the dump.
|         comment_str = ujson.loads(line)['body'] # Parse the JSON object and extract the 'body' attribute.
|
details
summary: h4 Apply the spaCy NLP pipeline, and look for the cases we want
pre.language-python
code
|         comment_parse = nlp(comment_str)
|         for word in comment_parse:
|             if google_doing_something(word):
|                 # Print the clause
|                 print(''.join(w.string for w in word.head.subtree).strip())
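details
summary: h4 Note: how the clause is recovered
p.
In the snippet above, word is the token "Google", word.head is the verb
it attaches to, and word.head.subtree yields all of that verb's syntactic
descendants – i.e. the whole clause. A minimal sketch of the same idea,
separate from the script above:
pre.language-python
code
| nlp = spacy.en.English()
| tokens = nlp(u'Google dropped support for Android.')
| google = tokens[0]
| # Join each token's whitespace-padded string to recover the clause.
| print(''.join(w.string for w in google.head.subtree).strip())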
details
summary: h4 Define the filter function
pre.language-python
code
| def google_doing_something(w):
|     if w.lower_ != 'google':
|         return False
|     # Is it the subject of a verb?
|     elif w.dep_ != 'nsubj':
|         return False
|     # And not the copula, e.g. "Google is..."
|     elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
|         return False
|     # Exclude reported speech, e.g. "Google says..."
|     elif w.head.lemma_ in ('say', 'show'):
|         return False
|     else:
|         return True
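details
summary: h4 Inspecting the attributes the filter uses
p.
The filter relies on four token attributes: lower_ (the lower-cased
form), dep_ (the dependency label), head (the token's syntactic parent)
and lemma_ (the base form). A quick way to see these values for each
token, as a sketch separate from the script above:
pre.language-python
code
| nlp = spacy.en.English()
| for w in nlp(u'Google says it dropped support.'):
|     print(w.orth_, w.dep_, w.head.orth_, w.lemma_)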
details
summary: h4 Call main
pre.language-python
code
| if __name__ == '__main__':
|     plac.call(main)
details
summary: h4 Example output
p.
Many false positives remain. Some come from spaCy misparsing the
sentence; others are flaws in our filtering logic. But the results are
vastly better than a string-based search, which returns almost no
examples of the pattern we're looking for.
code
| Google dropped support for Android < 4.0 already
| google drive
| Google to enforce a little more uniformity in its hardware so that we can see a better 3rd party market for things like mounts, cases, etc
| When Google responds
| Google translate cyka pasterino.
| A quick google looks like Synology does have a sync'ing feature which does support block level so that should work
| (google came up with some weird One Piece/FairyTail crossover stuff), and is their knowledge universally infallible?
| Until you have the gear, google some videos on best farming runs on each planet, you can get a lot REAL fast with the right loop.
| Google offers something like this already, but it is truly terrible.
| google isn't helping me
| Google tells me: 0 results, 250 pages removed from google.
| how did Google swoop in and eat our lunch
script(src="js/prism.js")
script(src="js/details_polyfill.js")