2874 Commits

Author SHA1 Message Date
Matthew Honnibal fa89613444 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-16 13:42:56 -05:00
ines 1f9f867c70 Remove unused util function 2017-04-16 20:37:45 +02:00
ines 7670c745b6 Update spacy.load() and fix path checks 2017-04-16 20:37:45 +02:00
ines d3759dfb32 Fix docstring 2017-04-16 20:37:45 +02:00
ines ed7e19ad68 Remove unused import 2017-04-16 20:37:45 +02:00
ines 0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
Matthew Honnibal 89a4f262fc Fix training methods 2017-04-16 13:00:37 -05:00
Matthew Honnibal 6a4221a6de Allow lemma to be set from Python. Re #973 2017-04-16 18:07:53 +02:00
Matthew Honnibal 137b210bcf Restore use of FTRL training 2017-04-16 18:02:42 +02:00
ines d10bd0eaf9 Fix formatting 2017-04-16 13:42:34 +02:00
ines 8191e33cf1 Update link error message with info on permissions 2017-04-16 13:32:31 +02:00
ines a3ddbc0444 Add note about --force flag to error message 2017-04-16 13:14:36 +02:00
ines e3de035814 Add meta validation to check for required settings
Complain if no "lang", "name" or "version" is found (those settings are
used in directory / package names). Package will still build without,
but it'll inevitably fail somewhere down the line.
2017-04-16 13:13:17 +02:00
ines a7574b7572 Add more options to read in meta data in package command
Add meta option to supply path to meta.json. If no meta path is set,
check if meta.json exists in input directory and use it. Otherwise,
prompt for details on the command line.
2017-04-16 13:06:02 +02:00
ines 13c8a42d2b Fix typos 2017-04-16 13:03:58 +02:00
ines 31fa73293a Move read_json out to own util function 2017-04-16 13:03:28 +02:00
Matthew Honnibal 45464d065e Remove print statement 2017-04-15 16:11:43 +02:00
Matthew Honnibal c76cb8af35 Fix training for new labels 2017-04-15 16:11:26 +02:00
Matthew Honnibal 4884b2c113 Refix StepwiseState 2017-04-15 16:00:28 +02:00
Matthew Honnibal e6ee7e130f Fix parse package meta 2017-04-15 13:38:53 +02:00
Matthew Honnibal 1a98e48b8e Fix StepwiseState 2017-04-15 13:35:01 +02:00
ines 0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
ines fefe6684cd Fix symlink function to check for Windows 2017-04-15 12:17:27 +02:00
ines 35fb4febe2 Fix whitespace 2017-04-15 12:13:45 +02:00
ines e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines 958b12dec8 Use pathlib instead of os.path 2017-04-15 12:13:00 +02:00
ines 956dc36785 Move functions to deprecated 2017-04-15 12:12:31 +02:00
ines c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines 26445ee304 Add compat module for Python2/3 and platform compatibility 2017-04-15 12:07:02 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal d13f0a7017 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-04-14 23:54:57 +02:00
Matthew Honnibal 354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal 33ba5066eb Refactor Language.end_training, making new save_to_directory method 2017-04-14 23:51:24 +02:00
ines 84341c2975 Only compile list of models if data_path exists 2017-04-14 16:48:02 +02:00
Gyorgy Orosz dd3244c08a Made json dump to produce unicode strings in py2 2017-04-13 23:30:47 +02:00
Gyorgy Orosz a9469c8173 Fixed typo 2017-04-13 15:24:14 +02:00
ines 41037f0f07 Remove unused imports 2017-04-13 13:52:11 +02:00
ines 1b92c8d5d5 Use unicode paths on Windows/Python 2 and catch other errors (resolves #970)
try/except here is quite dirty, but it'll at least make sure users see
an error message that explains what's going on
2017-04-10 17:49:51 +02:00
Matthew Honnibal 49e2de900e Add costs property to StepwiseState, to show which moves are gold. 2017-04-10 11:37:04 +02:00
Matthew Honnibal e26577b202 Increment version 2017-04-07 18:45:06 +02:00
Matthew Honnibal 40bf7ecf27 Increment version 2017-04-07 18:44:20 +02:00
Matthew Honnibal 1dca7eeb03 Add unicode declaration on new regression test 2017-04-07 18:09:23 +02:00
ines 887827fc6a Merge branch 'develop' 2017-04-07 17:36:23 +02:00
ines 444dd511c5 Fix xpassing URL test case 2017-04-07 17:36:05 +02:00
ines bf0f15e762 Add / to tokenizer infixes (resolves #891) 2017-04-07 17:30:44 +02:00
ines 00b9011a49 Fix whitespace 2017-04-07 17:29:59 +02:00
ines f9869e4dc5 Merge branch 'master' into develop 2017-04-07 17:23:40 +02:00
Matthew Honnibal 4a6204dbad Merge remote-tracking branch 'origin/develop' 2017-04-07 17:20:09 +02:00
Matthew Honnibal 0513c43bf0 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 17:07:10 +02:00
Matthew Honnibal cc36c308f4 Fix noun_chunk rules around coordination
Closes #693.
2017-04-07 17:06:40 +02:00
Matthew Honnibal ab846256cf Merge pull request #966 from recognai/master
Prepare Spanish language for training models, including configuration, rich-UD tag map and tests
2017-04-07 16:12:29 +02:00
Matthew Honnibal 83dca920d4 Rename test #913 -> #957, comment
Make test for #957 reference correct bug. Add comment.

Previous commit closes #957.
2017-04-07 15:54:25 +02:00
Matthew Honnibal be204ed714 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-07 15:50:14 +02:00
Matthew Honnibal e7b1ee9efd Switch to regex module for URL identification
The URL detection regex was failing on input such as 0.1.2.3, as this
input triggered excessive back-tracking in the builtin re module.
The solution was to switch to the regex module, which behaves better.

Closes #913.
2017-04-07 15:47:36 +02:00
Matthew Honnibal 5887383fc0 Add test for Issue #913: Hang from bad regex 2017-04-07 15:47:27 +02:00
ines 7ea1673072 Fix whitespace 2017-04-07 13:28:48 +02:00
ines 255650dbc2 Add conllu2json converter from explosion/spacy-dev-resources/#11 2017-04-07 13:05:12 +02:00
ines 789ce8a45e Add convert command 2017-04-07 13:04:17 +02:00
ines 9952d3b08a Fix whitespace 2017-04-07 13:02:05 +02:00
ines 47ddce6eb7 Remove unused variable 2017-04-07 13:01:48 +02:00
ines dcf8ab0c47 Merge branch 'develop' 2017-04-07 12:00:09 +02:00
ines 75f9b4c6e2 Fix whitespace 2017-04-07 10:22:18 +02:00
oeg c693d40791 feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basic tests 2017-04-06 18:48:45 +02:00
oeg 010293fb2f fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli 2017-04-06 17:33:15 +02:00
ines 808cd6cf7f Add missing tags to verbs (resolves #948) 2017-04-03 18:12:52 +02:00
ines ad8bf1829f Import and combine Portuguese tokenizer exceptions (see #943) 2017-04-01 10:37:42 +02:00
Ines Montani f8b2d9c3b7 Merge pull request #943 from mamoit/master
Portuguese improvements
2017-04-01 10:32:00 +02:00
ines 3b667a24d4 Remove whitespace 2017-04-01 10:21:08 +02:00
ines e71a1f4bd0 Fix download commands in error messages (see #946) 2017-04-01 10:20:57 +02:00
ines 42382d5692 Fix download commands in error messages (see #946) 2017-04-01 10:19:32 +02:00
ines d4a59c254b Remove whitespace 2017-04-01 10:19:01 +02:00
Matthew Honnibal 51882ee2b8 Fix check for setting ent_id in merge 2017-03-31 19:32:01 +02:00
Miguel Almeida 4fde64c4ea Portuguese contractions and some abbreviations 2017-03-31 15:52:55 +01:00
Miguel Almeida 465b240bcb Review Portuguese stop words
Mainly to review typos and add missing masculines/feminines
2017-03-31 13:00:47 +01:00
Matthew Honnibal fc3900e5b2 Allow ent_id to be set in Token 2017-03-31 14:00:14 +02:00
Matthew Honnibal 9720103428 Improve attribute handling in doc.merge(). Still unsatisfying 2017-03-31 13:59:58 +02:00
Matthew Honnibal cfff4e0f61 Improve test 2017-03-31 13:59:32 +02:00
Matthew Honnibal 1bb7b4ca71 Add comment 2017-03-31 13:59:19 +02:00
Matthew Honnibal 725249c59a Add merge_phrase callback in matcher.pyx 2017-03-31 13:58:59 +02:00
Matthew Honnibal e854f28304 Add test for Issue #758
Issue #758 occurs when no actions are available for a single token
doc after merging.
2017-03-31 13:26:25 +02:00
Miguel Almeida c1d020b0a6 Remove "ista" from portuguese stop words 2017-03-31 12:26:13 +01:00
Miguel Almeida 17a1e7a119 Add Portuguese numbers and ordinals 2017-03-31 12:21:01 +01:00
Matthew Honnibal 47a3ef06a6 Unhack deprojectivization, moving it into pipeline
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898.
2017-03-31 12:31:50 +02:00
Joshua Reeter 564daf6dec Issue #934 symlink should not convert paths as_posix under windows. 2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita c2d48974bc Fix typos in Portuguese stop words 2017-03-30 21:59:18 +13:00
Matthew Honnibal 0fefdfcbda Merge pull request #935 from ericzhao28/master
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
2017-03-30 02:51:24 +02:00
ines 4759fd437d Merge branch 'master' into develop 2017-03-29 10:37:13 +02:00
ines 7e4befec88 Add Hebrew to init and setup.py 2017-03-29 10:34:57 +02:00
Grégory Howard 9c2996b27f correction of package.py (encoding on open instead of write) 2017-03-29 09:11:02 +02:00
Eric Zhao aafdf6ffb8 Add option to use label kwarg to determine ent_type in doc.merge 2017-03-28 23:35:03 -07:00
ines 7198cf1c8a Remove unused import 2017-03-26 20:56:05 +02:00
ines 7ceaa1614b Add experimental model init command 2017-03-26 20:51:40 +02:00
Matthew Honnibal 83ba6c247c Fix init of Language without model 2017-03-26 16:46:00 +02:00
Matthew Honnibal fa107f95f6 Remove unused train_config command 2017-03-26 09:28:59 -05:00
Matthew Honnibal df83921f0a Increment version 2017-03-26 09:27:32 -05:00
Matthew Honnibal 92ac3af21d Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-26 09:26:59 -05:00
Matthew Honnibal a9b1f23c7d Enable regression loss for parser 2017-03-26 09:26:30 -05:00
ines c00d997924 Merge branch 'develop' 2017-03-26 15:57:00 +02:00
Matthew Honnibal 2efdbc08ff Make training work with directories 2017-03-26 08:46:44 -05:00
ines 007a2492bd Remove train_config command for now 2017-03-26 15:40:50 +02:00
ines b297fab062 Update error message for missing commands 2017-03-26 15:40:02 +02:00
ines 7f95023fc0 Fix formatting 2017-03-26 15:37:37 +02:00
ines 5901c8f7f0 Update spacy train CLI documentation 2017-03-26 15:33:48 +02:00
Matthew Honnibal 9dcb58aaaf Merge CLI changes 2017-03-26 07:30:45 -05:00
Matthew Honnibal 6b7f7a2060 Connect parser L1 option to train CLI 2017-03-26 07:24:07 -05:00
Matthew Honnibal ed2b106f4d Fix circular import in lemmatizer 2017-03-26 07:17:07 -05:00
Matthew Honnibal dec5571bf3 Update train CLI 2017-03-26 07:16:52 -05:00
ines 53cf2f1c0e Make dev data optional 2017-03-26 11:48:17 +02:00
Matthew Honnibal 5eac089fbe Merge branch 'master' into develop 2017-03-26 04:45:43 -05:00
ines 0fc56e2544 Update flag and defaults 2017-03-26 11:42:11 +02:00
Matthew Honnibal 2f63806ddb Update config when adding label. Re #910 2017-03-25 22:35:44 +01:00
Matthew Honnibal b94286de30 Fix regression test 2017-03-25 22:35:07 +01:00
Matthew Honnibal c748907a66 Fix errors in previous commit 2017-03-25 22:25:01 +01:00
Matthew Honnibal 4f400fa486 Prevent lemmatization of base nouns
Update lemmatizer's base-form check, for change in morphology class.
Closes #903.
2017-03-25 21:51:12 +01:00
Matthew Honnibal 850d35dcb3 Make morphology use int attributes internally
The morphology class was calling the lemmatizer inconsistently,
with some string-valued attributes. This caused Issue #903.
2017-03-25 21:49:10 +01:00
Matthew Honnibal 4454c1b23f Block lemmatization of base-form adjectives
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912.
2017-03-25 21:29:57 +01:00
ines 97814f8da6 Update Windows Python 2 link workaround to use helper functions 2017-03-25 14:04:27 +01:00
ines fdec758113 Add is_windows and is_python2 utility functions 2017-03-25 14:04:02 +01:00
Ines Montani 09837158e4 Merge pull request #921 from solresol/master
Possible solution to #909
2017-03-25 13:51:55 +01:00
Greg Baker b7f714b498 Possible solution to #909 2017-03-25 21:36:38 +11:00
Ines Montani 97cb4d5e3c Merge branch 'master' into master 2017-03-25 10:03:47 +01:00
Iddo Berger da135bd823 add hebrew tokenizer 2017-03-24 18:27:44 +03:00
Matthew Honnibal f40fbc3710 Add test for Issue #910: Resuming entity training 2017-03-23 23:38:57 +01:00
Matthew Honnibal 9c9cd99144 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-23 11:11:24 +01:00
ines 0035fd9efe Add spacy train work in progress 2017-03-23 11:08:41 +01:00
ines d5ebf583a4 Fix formatting 2017-03-23 11:08:30 +01:00
ines 3f20efe165 Merge branch 'develop'
# Conflicts:
#	spacy/util.py
2017-03-22 17:14:15 +01:00
Ines Montani f86a3a92d5 Merge pull request #899 from raphael0202/duplicate_keys
Remove duplicate keys in [en|fi] language data dicts
2017-03-22 10:20:11 +01:00
Ines Montani 87a2c85e1b Merge pull request #900 from raphael0202/unused_imports
Remove unused import statements
2017-03-22 10:10:43 +01:00
ines ce065e5d65 Fix imports 2017-03-22 10:02:14 +01:00
Andrew Poliakov 07199c3e8b Fix infinite recursion in spacy.info 2017-03-22 11:43:22 +03:00
Raphaël Bournhonesque f332bf05be Remove unused import statements 2017-03-21 21:08:54 +01:00
ines c3a9f73896 Fix writing to file 2017-03-21 12:35:22 +01:00
ines d74aa428ad Fix path 2017-03-21 12:26:00 +01:00
ines 83a999ea83 Change default license from MIT to CC 2017-03-21 12:24:43 +01:00
ines ae46647560 Fix brackets 2017-03-21 12:21:42 +01:00
ines 3e134b5b2b Make sure paths in copytree and rmtree are strings 2017-03-21 12:15:33 +01:00
ines cf0094187e Fetch MANIFEST.in from GitHub as well 2017-03-21 11:32:38 +01:00
ines 09b24bc5a9 Add docs for package command 2017-03-21 11:19:21 +01:00
ines 3f4e3fda1d Update command and fetch file templates from GitHub
While feature is still experimental, this allows files to be modified
without having to ship a new version of spaCy.
2017-03-21 11:17:36 +01:00
ines 5230ed5b98 Move directory check and overwriting/creating dirs to own function 2017-03-21 02:06:53 +01:00
ines 46bc3c36b0 Fix typo 2017-03-21 02:06:37 +01:00
ines 64e38f304e Only import shutil 2017-03-21 02:06:29 +01:00
ines 448a916d0d Add --force option to override directory 2017-03-21 02:05:34 +01:00
ines 8eb9a2b355 Fix formatting 2017-03-21 02:05:14 +01:00
ines b2bcdec0f6 Update docstring 2017-03-20 22:50:55 +01:00
ines bf240132d7 Add cli.package command to build model packages 2017-03-20 22:50:13 +01:00
ines a54e3c2efe Remove empty line 2017-03-20 22:49:36 +01:00
ines 5aea327a5b Add util function to get raw user input 2017-03-20 22:48:56 +01:00
ines a6c0361803 Handle raw_input vs input in Python 2 and 3 2017-03-20 22:48:32 +01:00
ines adbcac6591 Fix spacing 2017-03-20 22:48:21 +01:00
Matthew Honnibal 692eb0603d Fix high memory usage in download command
Due to PyPi issue #2984, installing large packages via pip causes
a large spike in memory usage. The recommended fix is to disable
caching.
2017-03-20 18:24:44 +01:00
ines f830213c4c Remove compatibility check test
Will only cause problems when incrementing version and not updating
table. Also depends on external URL, which is bad.
2017-03-20 13:20:26 +01:00
Matthew Honnibal f314d3d044 Increment version 2017-03-20 12:58:24 +01:00
Matthew Honnibal b487b8735a Decrease beam density, and fix Python 3 problem in beam 2017-03-20 12:56:05 +01:00
Ines Montani b6ee241e26 Fix print statements 2017-03-20 11:46:37 +01:00
ines b8f8d5d8bf Make sure model_path is a Posix path
Otherwise, formatting the success message with model_path.as_posix()
fails when using a local path for linking (linking still works, but the
error message is confusing)
2017-03-19 11:57:13 +01:00
ines fe0ff00fe1 Fix spacing 2017-03-19 11:55:37 +01:00
ines 5712da6095 Add regression test for #891 2017-03-19 11:48:01 +01:00
Raphaël Bournhonesque 7f579ae834 Remove duplicate keys in [en|fi] data dicts 2017-03-19 11:40:29 +01:00
ines 8de5108af6 Exclude common cache directories from model list in cli.info
This means models called "cache" etc. won't show up in the list, but it
seems worth it.
2017-03-19 01:44:43 +01:00
Matthew Honnibal 6ee2ea1128 Increment version 2017-03-19 01:40:52 +01:00
Matthew Honnibal 797f286c38 Use import to find data package 2017-03-19 01:39:36 +01:00
Matthew Honnibal 5941fb9e92 Make spacy/data a package 2017-03-18 20:04:22 +01:00
Matthew Honnibal bc10d06bc2 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-18 19:32:54 +01:00
Matthew Honnibal 583628c350 Import metadata into __init__ 2017-03-18 19:30:03 +01:00
Matthew Honnibal 1754e0db9b Call pip via subprocess, to make it use virtualenv 2017-03-18 19:29:36 +01:00
ines 1277abcde2 Remove print statement 2017-03-18 19:14:58 +01:00
Matthew Honnibal dcec104643 Remove unused import 2017-03-18 18:57:45 +01:00
Matthew Honnibal 703eb7bdbd Fix link module 2017-03-18 18:57:31 +01:00
Matthew Honnibal f6c6c89546 Add empty data directory 2017-03-18 18:32:29 +01:00
ines 7d33104180 Use distutils.sysconfig.get_python_lib
site.getsitepackages seems to not work as expected in Python 2
2017-03-18 18:20:40 +01:00
Matthew Honnibal 1a53fcc685 Fix CLI for Python 2 2017-03-18 18:14:03 +01:00
ines aefb898e37 Add title-case version of morph rules (resolves #686) 2017-03-18 17:27:11 +01:00
ines 64ec17abc1 Pass xpassing tests and add xfails for failures 2017-03-18 17:20:46 +01:00
ines d0b85faf69 Pass regression test for #401 (resolves #401)
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines be9daefbdd Remove actual model downloading from tests 2017-03-18 17:01:10 +01:00
ines 850650221a Use correct command in deprecated download command message 2017-03-18 17:01:01 +01:00
ines 0dd7710556 Make sure paths are paths 2017-03-18 16:48:52 +01:00
Matthew Honnibal de0e6385b4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-18 16:17:28 +01:00
Matthew Honnibal fe442cac53 Fix #717: Set correct lemma for contracted verbs 2017-03-18 16:16:10 +01:00
ines ad934a9abd Add regression test for #693 2017-03-18 16:12:30 +01:00
ines f57c616830 Add regression test for #704 and test new model (resolves #704)
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal 413138de79 Fix #719: Lemmatizer can no longer output empty string 2017-03-18 16:02:06 +01:00
ines ab1451f997 Don't mark compatibility test as slow 2017-03-18 15:17:39 +01:00
ines ec3e810662 Add directory cli and set up command line interface 2017-03-18 15:14:48 +01:00
ines cd94ea1095 Use info module for spacy.info() 2017-03-18 13:01:26 +01:00
ines e3e25c0a33 Add spacy.info module
Print info about spaCy installation, local setup and models. Allow
export in Markdown format to copy-paste into GitHub issues.
2017-03-18 13:01:16 +01:00
ines 0eafc0f2c6 Add util functions to print data as table or markdown list 2017-03-18 13:00:14 +01:00
ines 6b9b444065 Fix imports 2017-03-18 12:59:41 +01:00
ines a035ebd32a Use pathlib.Path instead of os.path 2017-03-18 12:59:21 +01:00
ines 9605cf39cc Handle default path in Language classes 2017-03-18 12:58:45 +01:00
Matthew Honnibal ac4b88cce9 Fix auto-linking in download command 2017-03-17 21:36:13 +01:00
ines 8a34c3e666 Fix shortcut name 2017-03-17 20:07:34 +01:00
Matthew Honnibal 6420f86f02 Merge changes to __init__.py 2017-03-17 19:51:45 +01:00
ines e01fbacf81 Update resolve_model_name 2017-03-17 19:26:28 +01:00
ines aedefef49d Add function to resolve model names and link them 2017-03-17 18:47:05 +01:00
Matthew Honnibal d013aba7b5 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 18:30:53 +01:00
Matthew Honnibal 854cfce7cf Make vocabs more compatible across versions
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00