Commit Graph

22 Commits

Author SHA1 Message Date
Ines Montani a624ae0675 Remove POS, TAG and LEMMA from tokenizer exceptions 2020-07-22 23:09:01 +02:00
Ines Montani b507f61629 Tidy up and move noun_chunks, token_match, url_match 2020-07-22 22:18:46 +02:00
Ines Montani 24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
adrianeboyd 07639dd6ac
Remove TAG from da/sv tokenizer exceptions (#5428)
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
2020-05-13 10:25:54 +02:00
Ines Montani 46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Adriane Boyd 9f740a9891 Add a few more Danish tokenizer exceptions 2020-02-26 14:59:03 +01:00
Ines Montani a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Søren Lind Kristiansen 26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Ines Montani eddeb36c96
💫 Tidy up and auto-format .py files (#2983)

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results.

At the moment, code style and linting aren't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
Søren Lind Kristiansen bef735aef7 Fix Danish abbreviation 'm.h.t.' 2017-12-21 09:24:31 +01:00
Søren Lind Kristiansen 7a2f2f6f94 Fix formatting. 2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen 15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Søren Lind Kristiansen ef03e9ea53 Remove unused import. 2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen 6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen 0c276ed020 Add weekday abbreviations and remove ambiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen 056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen ac8116510d Fix tokenization of 'i.' for Danish. 2017-11-24 11:16:53 +01:00
ines 819e30a26e Tidy up tokenizer exceptions 2017-11-01 23:02:45 +01:00
ines 7e424a1804 Don't copy exception dicts if not necessary and tidy up 2017-10-31 21:05:29 +01:00
mollerhoj e8f40ceed8 Add short names of months to tokenizer_exceptions 2017-07-03 15:49:51 +02:00
ines bb8be3d194 Add Danish language data 2017-05-10 21:15:12 +02:00