spaCy/spacy/lang/pl/__init__.py

from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import PolishLemmatizer

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import add_lookups
from ...lookups import Lookups


class PolishDefaults(Language.Defaults):
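    # Copy the shared getters so Polish-specific changes do not mutate the base class.
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "pl"
    # Check BASE_NORMS first, then fall back to the default NORM getter.
    lex_attr_getters[NORM] = add_lookups(
        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
    )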
    # Keep only the shared base exceptions that do not end in a period.
    mod_base_exceptions = {
        exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".")
    }
    tokenizer_exceptions = mod_base_exceptions
    stop_words = STOP_WORDS
    prefixes = TOKENIZER_PREFIXES
    infixes = TOKENIZER_INFIXES
    suffixes = TOKENIZER_SUFFIXES

    @classmethod
    def create_lemmatizer(cls, nlp=None, lookups=None):
        # Fall back to an empty Lookups container when none is supplied.
        if lookups is None:
            lookups = Lookups()
        return PolishLemmatizer(lookups)


class Polish(Language):
    lang = "pl"
    Defaults = PolishDefaults


__all__ = ["Polish"]
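
The period filter applied to `BASE_EXCEPTIONS` in `PolishDefaults` can be tried in isolation. The sketch below is only an illustration: `SAMPLE_BASE_EXCEPTIONS` and `drop_period_final` are hypothetical names, and the sample entries are made up rather than taken from spaCy's real exception table.

```python
# Hypothetical stand-in for spaCy's BASE_EXCEPTIONS; the entries are invented.
SAMPLE_BASE_EXCEPTIONS = {
    "a.": [{"ORTH": "a."}],
    "b.": [{"ORTH": "b."}],
    ":)": [{"ORTH": ":)"}],
    "<3": [{"ORTH": "<3"}],
}


def drop_period_final(exceptions):
    """Return a new dict without the entries whose key ends in a period.

    Hypothetical helper that mirrors the dict comprehension in PolishDefaults.
    """
    return {exc: val for exc, val in exceptions.items() if not exc.endswith(".")}


filtered = drop_period_final(SAMPLE_BASE_EXCEPTIONS)
print(sorted(filtered))
```

Because the comprehension builds a new dictionary, the shared table is left untouched and other language classes that reuse it are unaffected.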