spaCy/docs/redesign/tute_syntax_search.jade

doctype html
html(lang='en')
head
meta(charset='utf-8')
title spaCy Blog
meta(name='description', content='')
meta(name='author', content='Matthew Honnibal')
link(rel='stylesheet', href='css/style.css')
//if lt IE 9
script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
body#blog
header(role='banner')
h1.logo spaCy Blog
.slogan Blog
main#content(role='main')
section.intro
p
| Example use of the spaCy NLP tools for data exploration.
| Here we will look for Reddit comments that describe Google doing something,
| i.e. that discuss the company's actions. This is difficult, because other senses of
| "Google" now dominate usage of the word in conversation, particularly references to
| using Google products.
p
| The heuristics used are quick and dirty – about 5 minutes' work.
//| A better approach is to use the word vector of the verb. But, the
// | demo here is just to show what's possible to build up quickly, to
// | start to understand some data.
article.post
header
h2 Syntax-specific Search
.subhead
| by
a(href='#', rel='author') Matthew Honnibal
| on
time(datetime='2015-08-14') August 14, 2015
details
summary: h4 Imports
pre.language-python
code
| from __future__ import unicode_literals
| from __future__ import print_function
|
| import bz2
|
| import plac
| import ujson
| import spacy.en
details
summary: h4 Load the model and iterate over the data
pre.language-python
code
| def main(input_loc):
|     nlp = spacy.en.English() # Loading the model takes 10-20 seconds.
|     for line in bz2.BZ2File(input_loc): # Iterate over the Reddit comments in the dump.
|         comment_str = ujson.loads(line)['body'] # Parse the JSON object and extract the 'body' attribute.
|
details
summary: h4 Apply the spaCy NLP pipeline, and look for the cases we want
pre.language-python
code
|         comment_parse = nlp(comment_str)
|         for word in comment_parse:
|             if google_doing_something(word):
|                 # Print the clause
|                 print(''.join(w.string for w in word.head.subtree).strip())
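details
summary: h4 Note: how the clause is recovered
p.
In the snippet above, word is the token "Google", word.head is the verb
it attaches to, and word.head.subtree yields all of that verb's syntactic
descendants – i.e. the whole clause. A minimal sketch of the same idea,
separate from the script above:
pre.language-python
code
| nlp = spacy.en.English()
| tokens = nlp(u'Google dropped support for Android.')
| google = tokens[0]
| # Join each token's whitespace-padded string to recover the clause.
| print(''.join(w.string for w in google.head.subtree).strip())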
details
summary: h4 Define the filter function
pre.language-python
code
| def google_doing_something(w):
|     if w.lower_ != 'google':
|         return False
|     # Is it the subject of a verb?
|     elif w.dep_ != 'nsubj':
|         return False
|     # And not the copula, e.g. "Google is..."
|     elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
|         return False
|     # Exclude reported speech, e.g. "Google says..."
|     elif w.head.lemma_ in ('say', 'show'):
|         return False
|     else:
|         return True
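details
summary: h4 Inspecting the attributes the filter uses
p.
The filter relies on four token attributes: lower_ (the lower-cased
form), dep_ (the dependency label), head (the token's syntactic parent)
and lemma_ (the base form). A quick way to see these values for each
token, as a sketch separate from the script above:
pre.language-python
code
| nlp = spacy.en.English()
| for w in nlp(u'Google says it dropped support.'):
|     print(w.orth_, w.dep_, w.head.orth_, w.lemma_)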
details
summary: h4 Call main
pre.language-python
code
| if __name__ == '__main__':
|     plac.call(main)
details
summary: h4 Example output
p.
Many false positives remain. Some come from spaCy misparsing the
sentence; others are flaws in our filtering logic. But the results are
vastly better than a string-based search, which returns almost no
examples of the pattern we're looking for.
code
| Google dropped support for Android < 4.0 already
| google drive
| Google to enforce a little more uniformity in its hardware so that we can see a better 3rd party market for things like mounts, cases, etc
| When Google responds
| Google translate cyka pasterino.
| A quick google looks like Synology does have a sync'ing feature which does support block level so that should work
| (google came up with some weird One Piece/FairyTail crossover stuff), and is their knowledge universally infallible?
| Until you have the gear, google some videos on best farming runs on each planet, you can get a lot REAL fast with the right loop.
| Google offers something like this already, but it is truly terrible.
| google isn't helping me
| Google tells me: 0 results, 250 pages removed from google.
| how did Google swoop in and eat our lunch
script(src="js/prism.js")
script(src="js/details_polyfill.js")