spaCy/spacy/lang/en/stop_words.py

# coding: utf8
from __future__ import unicode_literals


# Stop words
STOP_WORDS = set(
    """
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by

call can cannot ca could

did do does doing done down due during

each eight either eleven else elsewhere empty enough even ever every
everyone everything everywhere except

few fifteen fifty first five for former formerly forty four from front full
further

get give go

had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however hundred

i if in indeed into is it its itself

just

keep

last latter latterly least less

made make many may me meanwhile might mine more moreover most mostly move much
must my myself

name namely neither never nevertheless next nine no nobody none noone nor not
nothing now nowhere

of off often on once one only onto or other others otherwise our ours ourselves
out over own

part per perhaps please put

quite

rather re really regarding

same say see seem seemed seeming seems serious several she should show side
since six sixty so some somehow someone something sometime sometimes somewhere
still such

take ten than that the their them themselves then thence there thereafter
thereby therefore therein thereupon these they third this those though three
through throughout thru thus to together too top toward towards twelve twenty
two

under until up unless upon us used using

various very via was we well were what whatever when whence whenever where
whereafter whereas whereby wherein whereupon wherever whether which while
whither who whoever whole whom whose why will with within without would

yet you your yours yourself yourselves
""".split()
)

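# Contraction suffixes as the tokenizer splits them ("don't" -> "do" + "n't"),
# added once for each apostrophe-like character that may appear in text.
for apostrophe in ["'", "`", "‘", "´", "’"]:
    for stopword in u"n't 'd 'll 'm 're 's 've".split():
        STOP_WORDS.add(stopword.replace("'", apostrophe))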
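
The loop above can be illustrated with a minimal, self-contained sketch. It is not part of the module: the five-word `STOP_WORDS` set is a small stand-in for the full list, and `remove_stop_words` is a hypothetical helper introduced only for illustration. It shows how each contraction suffix is added once per apostrophe variant, and how a token written with a typographic apostrophe then matches the set.

```python
# Stand-in for the full stop word list (assumption: five words only).
STOP_WORDS = set("a an the is not".split())

# Apostrophe-like characters and contraction suffixes, as in the module.
APOSTROPHES = ["'", "`", "‘", "´", "’"]
CONTRACTION_SUFFIXES = "n't 'd 'll 'm 're 's 've".split()

# Add every suffix once per apostrophe variant: 7 suffixes x 5 characters.
for apostrophe in APOSTROPHES:
    for suffix in CONTRACTION_SUFFIXES:
        STOP_WORDS.add(suffix.replace("'", apostrophe))


def remove_stop_words(tokens):
    """Hypothetical helper: drop tokens whose lowercase form is a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


# "n’t" uses the typographic apostrophe, which the plain "n't" would miss.
tokens = ["The", "cat", "is", "n’t", "here"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing before the lookup is a choice made for this sketch; the set itself stores only lowercase forms.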