spaCy

Commit Graph

Author	SHA1	Message	Date
adrianeboyd	06b251dd1e	Add support for pos/morphs/lemmas in training data (#4941 ) Add support for pos/morphs/lemmas throughout `GoldParse`, `Example`, and `docs_to_json()`.	2020-01-28 11:36:29 +01:00
adrianeboyd	adc9745718	Modify morphology to support arbitrary features (#4932 ) * Restructure tag maps for MorphAnalysis changes Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features. * Use default tag map rather than duplicating for ca / uk / vi * Import tag map into defaults for ga * Modify tag maps so all morphological fields and features are strings * Move features from `"Other"` to the top level * Rewrite tuples as strings separated by `","` * Rewrite morph symbols for fr lemmatizer as strings * Export MorphAnalysis under spacy.tokens * Modify morphology to support arbitrary features Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported. * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain: * key: hash of UD FEATS string of morphological features * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to: * find features by field * represent multiple values for a given field * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field. * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`. * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string. * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where: * Each field has one entry in the dict * Multiple values remain separated by a separator in the value string * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict. * Modify get_by_field to use np.ndarray Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields. * Rewrite without MorphFeatureC * Add shortcut for existing feats strings as keys Add shortcut for existing feats strings as keys in `Morphology.add()`. * Check for '_' as empty analysis when adding morphs * Extend helper converters in Morphology Add and extend helper converters that convert and normalize between: * UD FEATS strings (`"Case=dat,gen|Number=sing"`) * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`) * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`) All converters sort fields and values where applicable.	2020-01-23 22:01:54 +01:00
Sofie Van Landeghem	0a0de85409	Fix gold training (#4938 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * ensure doc is not None	2020-01-23 22:00:24 +01:00
adrianeboyd	199d89943e	Add as_example to Sentencizer pipe() (#4933 )	2020-01-22 15:40:31 +01:00
adrianeboyd	d2f3a44b42	Improve train CLI sentrec scoring (#4892 ) * reorder metrics to prioritize F over P/R * add sentrec to model metrics	2020-01-08 16:52:14 +01:00
adrianeboyd	e55fa1899a	Report length of dev dataset correctly (#4891 )	2020-01-08 16:51:51 +01:00
adrianeboyd	e1b493ae85	Add sentrec shortcut to Language (#4890 )	2020-01-08 16:51:24 +01:00
Sofie Van Landeghem	581eeed98b	Warning goldparse (#4851 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * provide more friendly error msg for parsing file	2020-01-01 13:16:48 +01:00
Ines Montani	83e0a6f3e3	Modernize plac commands for Python 3 (#4836 )	2020-01-01 13:15:46 +01:00
Ines Montani	401946d480	Un-xfail passing tests	2019-12-25 18:02:20 +01:00
Ines Montani	a892821c51	More formatting changes	2019-12-25 17:59:52 +01:00
Ines Montani	c22f075509	Update pydantic version pin [ci skip]	2019-12-25 17:29:53 +01:00
Ines Montani	33a2682d60	Add better schemas and validation using Pydantic (#4831 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Add better schemas and validation using Pydantic * Revert lookups.md * Remove unused import * Update spacy/schemas.py Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com> * Various small fixes * Fix docstring Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>	2019-12-25 12:39:49 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	21b6d6e0a8	Fix typo	2019-12-21 21:17:31 +01:00
Ines Montani	de33b6d566	Merge branch 'master' into develop	2019-12-21 21:15:46 +01:00
Ines Montani	7c69d30de5	Tidy up and expect warning	2019-12-21 21:14:52 +01:00
Sofie Van Landeghem	732142bf28	facilitate larger training files (#4827 ) * add warning for large file and change start var to long * type for file_length	2019-12-21 21:12:19 +01:00
Ines Montani	d17e7dca9e	Fix problems caused by merge conflict	2019-12-21 19:57:41 +01:00
Ines Montani	947dba7141	Merge branch 'master' into develop	2019-12-21 19:04:43 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Ines Montani	158b98a3ef	Merge branch 'master' into develop	2019-12-21 18:55:03 +01:00
Olamilekan Wahab	a741de7cf6	Adding support for Yoruba Language (#4614 ) * Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added deleted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>	2019-12-21 14:11:50 +01:00
Ines Montani	1b838d1313	Divide models into core and starters [ci skip]	2019-12-21 14:10:22 +01:00
Ines Montani	0750d59e5a	Allow setting ner_missing_tag on docs_to_json	2019-12-21 13:47:21 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826 ) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Sofie Van Landeghem	12158c1e3a	Restore tqdm imports (#4804 ) * set 4.38.0 to minimal version with color bug fix * set imports back to proper place * add upper range for tqdm	2019-12-16 13:12:19 +01:00
Ines Montani	c466e02466	Update universe [ci skip]	2019-12-13 15:57:39 +01:00
Sofie Van Landeghem	557dcf5659	NEL requires sentences to be set (#4801 )	2019-12-13 15:55:18 +01:00
tamuhey	1707e77c5e	add char_span to Span (#4793 )	2019-12-13 15:54:58 +01:00
adrianeboyd	a4cacd3402	Add tag_map argument to CLI debug-data and train (#4750 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2019-12-13 10:46:18 +01:00
Sofie Van Landeghem	f9b541f9ef	More robust set entities method in KB (#4794 ) * add unit test for setting entities with duplicate identifiers * count the number of actual unique identifiers and throw duplicate warning	2019-12-13 10:45:29 +01:00
Thiago Lages de Alencar	a067ded495	Update doc.md (#4796 )	2019-12-11 18:21:40 +01:00
adrianeboyd	eb9b1858c4	Add NER map option to convert CLI (#4763 ) Instead of a hard-coded NER tag simplification function that was only intended for NorNE, map NER tags in CoNLL-U converter using a dict provided as JSON as a command-line option. Map NER entity types to a new tag or to "" for 'O', e.g.: ``` {"PER": "PERSON", "BAD": ""} => B-PER -> B-PERSON B-BAD -> O ```	2019-12-11 18:20:49 +01:00
Sofie Van Landeghem	5355b0038f	Update EL example (#4789 ) * update EL example script after sentence-central refactor * version bump * set incl_prior to False for quick demo purposes * clean up	2019-12-11 18:19:42 +01:00
adrianeboyd	38e1bc19f4	Add destructors for states in TransitionSystem (#4686 )	2019-12-10 13:23:27 +01:00
Matthew Honnibal	45efdb1ef7	Merge branch 'master' of https://github.com/explosion/spaCy	2019-12-10 00:54:18 +01:00
Matthew Honnibal	0a3175d46f	Require thinc v7.4.0.dev0	2019-12-10 00:47:51 +01:00
adrianeboyd	c208eb6e4d	Fix int value handling in Matcher (#4749 ) Add `int` values (for `LENGTH`) in _get_attr_values() instead of treating `int` like `dict`.	2019-12-06 19:22:57 +01:00
Tclack88	ab8dc2732c	Update token.md (#4767 ) * Update token.md documentation is confusing: A '?' is a right punct, but '¿' is a left punct * Update token.md add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarification, ensuring the question mark that follows is not perceived as an example of left and right punctuation * Move quotes into code block [ci skip]	2019-12-06 19:22:02 +01:00
Sofie Van Landeghem	780d43aac7	fix bug in EL predict (#4779 )	2019-12-06 19:18:14 +01:00
Ines Montani	bf611ebca7	Document jsonl option on converter [ci skip]	2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen	de5453cdcb	Fix link to user hooks in docs (#4778 ) * Fix link to user hooks in docs * Update mr_bjerre.md Mistake in contributor agreement * Apparently hard to get it right (wrong name of sca)	2019-12-06 19:17:12 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774 ) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
Antti Ajanki	e626a011cc	Improvements to the Finnish language data (#4738 ) * Enable lex_attrs on Finnish * Copy the Danish tokenizer rules to Finnish Specifically, don't break hyphenated compound words * Contributor agreement * A new file for Finnish tokenizer rules instead of including the Danish ones	2019-12-03 12:55:28 +01:00
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	68f711b409	Fix conllu2json n_sents and raw text (#4728 ) Update conllu2json converter to include raw text in final batch.	2019-11-29 10:22:03 +01:00
adrianeboyd	79ba1a3b92	Add lemmas to GoldParse / Example / docs_to_json (#4726 )	2019-11-28 14:53:44 +01:00
adrianeboyd	b841d3fe75	Add a tagger-based SentenceRecognizer (#4713 ) * Add sent_starts to GoldParse * Add SentTagger pipeline component Add `SentTagger` pipeline component as a subclass of `Tagger`. * Model reduces default parameters from `Tagger` to be small and fast * Hard-coded set of two labels: * S (1): token at beginning of sentence * I (0): all other sentence positions * Sets `token.sent_start` values * Add sentence segmentation to Scorer Report `sent_p/r/f` for sentence boundaries, which may be provided by various pipeline components. * Add sentence segmentation to CLI evaluate * Add senttagger metrics/scoring to train CLI * Rename SentTagger to SentenceRecognizer * Add SentenceRecognizer to spacy.pipes imports * Add SentenceRecognizer serialization test * Shorten component name to sentrec * Remove duplicates from train CLI output metrics	2019-11-28 11:10:07 +01:00
adrianeboyd	48ea2e8d0f	Restructure Sentencizer to follow Pipe API (#4721 ) * Restructure Sentencizer to follow Pipe API Restructure Sentencizer to follow Pipe API so that it can be scored with `nlp.evaluate()`. * Add Sentencizer pipe() test	2019-11-27 16:33:34 +01:00
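
The message for commit adc9745718 describes helper converters that normalize between three representations of morphological features: a UD FEATS string, a per-field dict, and a list of individual features, with `_` treated as an empty analysis and fields and values sorted. The round trip can be sketched in plain Python as below. This is only an illustration under stated assumptions: the function names are hypothetical, they are not spaCy's `Morphology` API, and the sort order is Python's default string order.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# spaCy's Morphology methods. They mirror the three representations
# named in the commit message.

EMPTY_FEATS = "_"  # the commit message treats '_' as an empty analysis


def feats_to_dict(feats: str) -> dict:
    """UD FEATS string -> per-field dict, e.g. {"Case": "dat,gen"}."""
    if not feats or feats == EMPTY_FEATS:
        return {}
    fields = {}
    for pair in feats.split("|"):
        field, _, values = pair.partition("=")
        # Multiple values stay in one string, sorted and comma-separated.
        fields[field] = ",".join(sorted(values.split(",")))
    return dict(sorted(fields.items()))


def dict_to_feats(fields: dict) -> str:
    """Per-field dict -> UD FEATS string with sorted fields."""
    if not fields:
        return EMPTY_FEATS
    return "|".join(f"{field}={fields[field]}" for field in sorted(fields))


def feats_to_list(feats: str) -> list:
    """UD FEATS string -> list of individual Field=Value features."""
    return [
        f"{field}={value}"
        for field, values in feats_to_dict(feats).items()
        for value in values.split(",")
    ]


def list_to_feats(features: list) -> str:
    """List of individual features -> normalized UD FEATS string."""
    grouped = {}
    for feature in features:
        field, _, value = feature.partition("=")
        grouped.setdefault(field, set()).add(value)
    return dict_to_feats(
        {field: ",".join(sorted(values)) for field, values in grouped.items()}
    )
```

With these helpers, an unsorted input such as `"Number=sing|Case=gen,dat"` normalizes to the same dict and list forms quoted in the commit message, and converting the list back yields `"Case=dat,gen|Number=sing"`.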


11151 Commits