"slogan":"Emoji handling and meta data as a spaCy pipeline component",
"github":"ines/spacymoji",
"description":"spaCy v2.0 extension and pipeline component for adding emoji meta data to `Doc` objects. Detects emoji consisting of one or more unicode characters, and can optionally merge multi-char emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom `Doc`, `Token` and `Span` attributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and `._.emoji`.",
"pip":"spacymoji",
"category":["pipeline"],
"tags":["emoji","unicode"],
"thumb":"https://i.imgur.com/XOTYIgn.jpg",
"code_example":[
"import spacy",
"from spacymoji import Emoji",
"",
"nlp = spacy.load('en')",
"emoji = Emoji(nlp)",
"nlp.add_pipe(emoji, first=True)",
"",
"doc = nlp(u'This is a test 😻 👍🏿')",
"assert doc._.has_emoji == True",
"assert doc[2:5]._.has_emoji == True",
"assert doc[0]._.is_emoji == False",
"assert doc[4]._.is_emoji == True",
"assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'",
"assert len(doc._.emoji) == 2",
"assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')"
],
"author":"Ines Montani",
"author_links":{
"twitter":"_inesmontani",
"github":"ines",
"website":"https://ines.io"
}
},
{
"id":"spacy_hunspell",
"slogan":"Add spellchecking and spelling suggestions to your spaCy pipeline using Hunspell",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [Hunspell](http://hunspell.github.io) support for spellchecking.",
"slogan":"Language Tool style grammar handling with spaCy",
"description":"This packages leverages the [Matcher API](https://spacy.io/docs/usage/rule-based-matching) in spaCy to quickly match on spaCy tokens not dissimilar to regex. It reads a `grammar.yml` file to load up custom patterns and returns the results inside `Doc`, `Span`, and `Token`. It is extensible through adding rules to `grammar.yml` (though currently only the simple string matching is implemented).",
"github":"tokestermw/spacy_grammar",
"code_example":[
"import spacy",
"from spacy_grammar.grammar import Grammar",
"",
"nlp = spacy.load('en')",
"grammar = Grammar(nlp)",
"nlp.add_pipe(grammar)",
"doc = nlp('I can haz cheeseburger.')",
"doc._.has_grammar_error # True"
],
"author":"Motoki Wu",
"author_links":{
"github":"tokestermw",
"twitter":"plusepsilon"
},
"category":["pipeline"]
},
{
"id":"spacy_kenlm",
"slogan":"KenLM extension for spaCy 2.0",
"github":"tokestermw/spacy_kenlm",
"pip":"spacy_kenlm",
"code_example":[
"import spacy",
"from spacy_kenlm import spaCyKenLM",
"",
"nlp = spacy.load('en_core_web_sm')",
"spacy_kenlm = spaCyKenLM() # default model from test.arpa",
"nlp.add_pipe(spacy_kenlm)",
"doc = nlp('How are you?')",
"doc._.kenlm_score # doc score",
"doc[:2]._.kenlm_score # span score",
"doc[2]._.kenlm_score # token score"
],
"author":"Motoki Wu",
"author_links":{
"github":"tokestermw",
"twitter":"plusepsilon"
},
"category":["pipeline"]
},
{
"id":"spacy_readability",
"slogan":"Add text readability meta data to Doc objects",
"description":"spaCy v2.0 pipeline component for calculating readability scores of of text. Provides scores for Flesh-Kincaid grade level, Flesh-Kincaid reading ease, and Dale-Chall.",
"github":"mholtzscher/spacy_readability",
"pip":"spacy-readability",
"code_example":[
"import spacy",
"from spacy_readability import Readability",
"",
"nlp = spacy.load('en')",
"read = Readability(nlp)",
"nlp.add_pipe(read, last=True)",
"doc = nlp(\"I am some really difficult text to read because I use obnoxiously large words.\")",
"doc._.flesch_kincaid_grade_level",
"doc._.flesch_kincaid_reading_ease",
"doc._.dale_chall"
],
"author":"Michael Holtzscher",
"author_links":{
"github":"mholtzscher"
},
"category":["pipeline"]
},
{
"id":"spacy-sentence-segmenter",
"title":"Sentence Segmenter",
"slogan":"Custom sentence segmentation for spaCy",
"slogan":"Add language detection to your spaCy pipeline using CLD2",
"description":"spaCy-CLD operates on `Doc` and `Span` spaCy objects. When called on a `Doc` or `Span`, the object is given two attributes: `languages` (a list of up to 3 language codes) and `language_scores` (a dictionary mapping language codes to confidence scores between 0 and 1).\n\nspacy-cld is a little extension that wraps the [PYCLD2](https://github.com/aboSamoor/pycld2) Python library, which in turn wraps the [Compact Language Detector 2](https://github.com/CLD2Owners/cld2) C library originally built at Google for the Chromium project. CLD2 uses character n-grams as features and a Naive Bayes classifier to identify 80+ languages from Unicode text strings (or XML/HTML). It can detect up to 3 different languages in a given document, and reports a confidence score (reported in with each language.",
"github":"nickdavidhaynes/spacy-cld",
"pip":"spacy_cld",
"code_example":[
"import spacy",
"from spacy_cld import LanguageDetector",
"",
"nlp = spacy.load('en')",
"language_detector = LanguageDetector()",
"nlp.add_pipe(language_detector)",
"doc = nlp('This is some English text.')",
"",
"doc._.languages # ['en']",
"doc._.language_scores['en'] # 0.96"
],
"author":"Nicholas D Haynes",
"author_links":{
"github":"nickdavidhaynes"
},
"category":["pipeline"]
},
{
"id":"spacy-lookup",
"slogan":"A powerful entity matcher for very large dictionaries, using the FlashText module",
"description":"spaCy v2.0 extension and pipeline component for adding Named Entities metadata to `Doc` objects. Detects Named Entities using dictionaries. The extension sets the custom `Doc`, `Token` and `Span` attributes `._.is_entity`, `._.entity_type`, `._.has_entities` and `._.entities`. Named Entities are matched using the python module `flashtext`, and looked up in the data provided by different dictionaries.",
"doc = nlp(u\"I am a product manager for a java and python.\")",
"assert doc._.has_entities == True",
"assert doc[2:5]._.has_entities == True",
"assert doc[0]._.is_entity == False",
"assert doc[3]._.is_entity == True",
"print(doc._.entities)"
],
"author":"Marc Puig",
"author_links":{
"github":"mpuig"
},
"category":["pipeline"]
},
{
"id":"spacy-iwnlp",
"slogan":"German lemmatization with IWNLP",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [IWNLP-py](https://github.com/Liebeck/iwnlp-py) as German lemmatizer directly into your spaCy pipeline.",
"description":"This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [SentiWS](http://wortschatz.uni-leipzig.de/en/download) as German sentiment score directly into your spaCy pipeline.",
"slogan":"POS and French lemmatization with Lefff",
"description":"spacy v2.0 extension and pipeline component for adding a French POS and lemmatizer based on [Lefff](https://hal.inria.fr/inria-00521242/).",
"description":"Lemmy is a lemmatizer for Danish 🇩🇰 . It comes already trained on Dansk Sprognævns (DSN) word list (‘fuldformliste’) and the Danish Universal Dependencies and is ready for use. Lemmy also supports training on your own dataset. The model currently included in Lemmy was evaluated on the Danish Universal Dependencies dev dataset and scored an accruacy > 99%.\n\nYou can use Lemmy as a spaCy extension, more specifcally a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy makes use of POS tags to predict the lemmas. When wired up to the spaCy pipeline, Lemmy has the benefit of using spaCy’s builtin POS tagger.",
"github":"sorenlind/lemmy",
"pip":"lemmy",
"code_example":[
"import da_custom_model as da # name of your spaCy model",
"import lemmy.pipe",
"nlp = da.load()",
"",
"# create an instance of Lemmy's pipeline component for spaCy",
"pipe = lemmy.pipe.load()",
"",
"# add the comonent to the spaCy pipeline.",
"nlp.add_pipe(pipe, after='tagger')",
"",
"# lemmas can now be accessed using the `._.lemma` attribute on the tokens",
"nlp(\"akvariernes\")[0]._.lemma"
],
"thumb":"https://i.imgur.com/RJVFRWm.jpg",
"author":"Søren Lind Kristiansen",
"author_links":{
"github":"sorenlind"
},
"category":["pipeline"],
"tags":["lemmatizer","danish"]
},
{
"id":"wmd-relax",
"slogan":"Calculates word mover's distance insanely fast",
"description":"Calculates Word Mover's Distance as described in [From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf) by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.\n\n⚠️ **This package is currently only compatible with spaCy v.1x.**",
"doc1 = nlp(\"Politician speaks to the media in Illinois.\")",
"doc2 = nlp(\"The president greets the press in Chicago.\")",
"print(doc1.similarity(doc2))"
],
"author":"source{d}",
"author_links":{
"github":"src-d",
"twitter":"sourcedtech",
"website":"https://sourced.tech"
},
"category":["pipeline"]
},
{
"id":"neuralcoref",
"slogan":"State-of-the-art coreference resolution based on neural nets and spaCy",
"description":"This coreference resolution module is based on the super fast [spaCy](https://spacy.io/) parser and uses the neural net scoring model described in [Deep Reinforcement Learning for Mention-Ranking Coreference Models](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) by Kevin Clark and Christopher D. Manning, EMNLP 2016. With ✨Neuralcoref v2.0, you should now be able to train the coreference resolution system on your own dataset—e.g., another language than English! — **provided you have an annotated dataset**.",
"github":"huggingface/neuralcoref",
"thumb":"https://i.imgur.com/j6FO9O6.jpg",
"code_example":[
"from neuralcoref import Coref",
"",
"coref = Coref()",
"clusters = coref.one_shot_coref(utterances=u\"She loves him.\", context=u\"My sister has a dog.\")",
"slogan":"State-of-the-art coreference resolution based on neural nets and spaCy",
"description":"In short, coreference is the fact that two or more expressions in a text – like pronouns or nouns – link to the same person or thing. It is a classical Natural language processing task, that has seen a revival of interest in the past two years as several research groups applied cutting-edge deep-learning and reinforcement-learning techniques to it. It is also one of the key building blocks to building conversational Artificial intelligences.",
"url":"https://huggingface.co/coref/",
"image":"https://i.imgur.com/3yy4Qyf.png",
"thumb":"https://i.imgur.com/j6FO9O6.jpg",
"github":"huggingface/neuralcoref",
"category":["visualizers","conversational"],
"tags":["coref","chatbots"],
"author":"Hugging Face",
"author_links":{
"github":"huggingface"
}
},
{
"id":"spacy-vis",
"slogan":"A visualisation tool for spaCy using Hierplane",
"description":"A visualiser for spaCy annotations. This visualisation uses the [Hierplane](https://allenai.github.io/hierplane/) Library to render the dependency parse from spaCy's models. It also includes visualisation of entities and POS tags within nodes.",
"slogan":"Test spaCy's rule-based Matcher by creating token patterns interactively",
"description":"Test spaCy's rule-based `Matcher` by creating token patterns interactively and running them over your text. Each token can set multiple attributes like text value, part-of-speech tag or boolean flags. The token-based view lets you explore how spaCy processes your text – and why your pattern matches, or why it doesn't. For more details on rule-based matching, see the [documentation](https://spacy.io/usage/linguistic-features#rule-based-matching).",
"slogan":"A modern syntactic dependency visualizer",
"description":"Visualize spaCy's guess at the syntactic structure of a sentence. Arrows point from children to heads, and are labelled by their relation type.",
"description":"Visualize spaCy's guess at the named entities in the document. You can filter the displayed types, to only show the annotations you're interested in.",
"slogan":"Beautiful visualizations of how language differs among document types",
"description":"A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.",
"slogan":"Turn natural language into structured data",
"description":"Rasa NLU (Natural Language Understanding) is a tool for understanding what is being said in short pieces of text. Rasa NLU is primarily used to build chatbots and voice apps, where this is called intent classification and entity extraction. To use Rasa, *you have to provide some training data*.",
"github":"RasaHQ/rasa_nlu",
"pip":"rasa_nlu",
"thumb":"https://i.imgur.com/ndCfKNq.png",
"url":"https://nlu.rasa.com/",
"author":"Rasa",
"author_links":{
"github":"RasaHQ"
},
"category":["conversational"],
"tags":["chatbots"]
},
{
"id":"tochtext",
"title":"torchtext",
"slogan":"Data loaders and abstractions for text and NLP",
"slogan":"An open-source NLP research library, built on PyTorch and spaCy",
"description":"AllenNLP is a new library designed to accelerate NLP research, by providing a framework that supports modern deep learning workflows for cutting-edge language understanding problems. AllenNLP uses spaCy as a preprocessing component. You can also use Allen NLP to develop spaCy pipeline components, to add annotations to the `Doc` object.",
"github":"allenai/allennlp",
"pip":"allennlp",
"thumb":"https://i.imgur.com/U8opuDN.jpg",
"url":"http://allennlp.org",
"author":" Allen Institute for Artificial Intelligence",
"author_links":{
"github":"allenai",
"twitter":"allenai_org",
"website":"http://allenai.org"
},
"category":["standalone","research"]
},
{
"id":"textacy",
"slogan":"NLP, before and after spaCy",
"description":"`textacy` is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance `spacy` library. With the fundamentals – tokenization, part-of-speech tagging, dependency parsing, etc. – delegated to another library, `textacy` focuses on the tasks that come before and follow after.",
"slogan":"Full text geoparsing using spaCy, Geonames and Keras",
"description":"Extract the place names from a piece of text, resolve them to the correct place, and return their coordinates and structured geographic information.",
"github":"openeventdata/mordecai",
"pip":"mordecai",
"thumb":"https://i.imgur.com/gPJ9upa.jpg",
"code_example":[
"from mordecai import Geoparser",
"geo = Geoparser()",
"geo.geoparse(\"I traveled from Oxford to Ottawa.\")"
],
"author":"Andy Halterman",
"author_links":{
"github":"ahalterman",
"twitter":"ahalterman"
},
"category":["standalone"]
},
{
"id":"kindred",
"title":"Kindred",
"slogan":"Biomedical relation extraction using spaCy",
"description":"Kindred is a package for relation extraction in biomedical texts. Given some training data, it can build a model to identify relations between entities (e.g. drugs, genes, etc) in a sentence.",
"description":"sense2vec ([Trask et. al](https://arxiv.org/abs/1511.06388), 2015) is a nice twist on [word2vec](https://en.wikipedia.org/wiki/Word2vec) that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our [sense2vec demo](https://explosion.ai/demos/sense2vec) that lets you explore semantic similarities across all Reddit comments of 2015.",
"txt <- c(d1 = \"spaCy excels at large-scale information extraction tasks.\",",
" d2 = \"Mr. Smith goes to North Carolina.\")",
"",
"# process documents and obtain a data.table",
"parsedtxt <- spacy_parse(txt)"
],
"code_language":"r",
"author":"Kenneth Benoit & Aki Matsuo",
"category":["nonpython"]
},
{
"id":"cleannlp",
"title":"CleanNLP",
"slogan":"A tidy data model for NLP in R",
"description":"The cleanNLP package is designed to make it as painless as possible to turn raw text into feature-rich data frames. the package offers four backends that can be used for parsing text: `tokenizers`, `udpipe`, `spacy` and `corenlp`.",
"github":"statsmaths/cleanNLP",
"cran":"cleanNLP",
"author":"Taylor B. Arnold",
"author_links":{
"github":"statsmaths"
},
"category":["nonpython"]
},
{
"id":"spacy-cpp",
"slogan":"C++ wrapper library for spaCy",
"description":"The goal of spacy-cpp is to expose the functionality of spaCy to C++ applications, and to provide an API that is similar to that of spaCy, enabling rapid development in Python and simple porting to C++.",
"slogan":"NLP server for spaCy, WordNet and NeuralCoref as a Docker image",
"github":"artpar/languagecrunch",
"code_example":[
"docker run -it -p 8080:8080 artpar/languagecrunch",
"curl http://localhost:8080/nlp/parse?`echo -n \"The new twitter is so weird. Seriously. Why is there a new twitter? What was wrong with the old one? Fix it now.\" | python -c \"import urllib, sys; print(urllib.urlencode({'sentence': sys.stdin.read()}))\"`"
],
"code_language":"bash",
"author":"Parth Mudgal",
"author_links":{
"github":"artpar"
},
"category":["apis"]
},
{
"id":"spacy-nlp",
"slogan":" Expose spaCy NLP text parsing to Node.js (and other languages) via Socket.IO",
"github":"kengz/spacy-nlp",
"thumb":"https://i.imgur.com/w41VSr7.jpg",
"code_example":[
"const spacyNLP = require(\"spacy-nlp\")",
"// default port 6466",
"// start the server with the python client that exposes spacyIO (or use an existing socketIO server at IOPORT)",
"slogan":"Radically efficient machine teaching, powered by active learning",
"description":"Prodigy is an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you're working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. Stream in your own examples or real-world data from live APIs, update your model in real-time and chain models together to build more complex systems.",
"thumb":"https://i.imgur.com/UVRtP6g.jpg",
"image":"https://i.imgur.com/Dt5vrY6.png",
"url":"https://prodi.gy",
"code_example":[
"prodigy dataset ner_product \"Improve PRODUCT on Reddit data\"",
"title":"Introduction to Machine Learning with Python: A Guide for Data Scientists",
"slogan":"O'Reilly, 2016",
"description":"Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.",
"description":"*Text Analytics with Python* teaches you the techniques related to natural language processing and text analytics, and you will gain the skills to know which technique is best suited to solve a particular problem. You will look at each technique and algorithm with both a bird's eye view to understand how it can be used as well as with a microscopic view to understand the mathematical concepts and to implement them to solve your own problems.",
"description":"Master the essential skills needed to recognize and solve complex problems with machine learning and deep learning. Using real-world examples that leverage the popular Python machine learning ecosystem, this book is your perfect companion for learning the art and science of machine learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute machine learning systems and projects successfully.",
"title":"Natural Language Processing Fundamentals in Python",
"slogan":"Datacamp, 2017",
"description":"In this course, you'll learn Natural Language Processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.",
"title":"Learning Path: Mastering spaCy for Natural Language Processing",
"slogan":"O'Reilly, 2017",
"description":"spaCy, a fast, user-friendly library for teaching computers to understand text, simplifies NLP techniques, such as speech tagging and syntactic dependencies, so you can easily extract information, attributes, and objects from massive amounts of text to then document, measure, and analyze. This Learning Path is a hands-on introduction to using spaCy to discover insights through natural language processing. While end-to-end natural language processing solutions can be complex, you’ll learn the linguistics, algorithms, and machine learning skills to get the job done.",
"description":"EcoHealth Alliance uses EpiTator to catalog the what, where and when of infectious disease case counts reported in online news. Each of these aspects is extracted using independent annotators than can be applied to other domains. EpiTator organizes annotations by creating \"AnnoTiers\" for each type. AnnoTiers have methods for manipulating, combining and searching annotations. For instance, the `with_following_spans_from()` method can be used to create a new tier that combines a tier of one type (such as numbers), with another (say, kitchenware). The resulting tier will contain all the phrases in the document that match that pattern, like \"5 plates\" or \"2 cups.\"\n\nAnother commonly used method is `group_spans_by_containing_span()` which can be used to do things like find all the spaCy tokens in all the GeoNames a document mentions. spaCy tokens, named entities, sentences and noun chunks are exposed through the spaCy annotator which will create a AnnoTier for each. These are basis of many of the other annotators. EpiTator also includes an annotator for extracting tables embedded in free text articles. Another neat feature is that the lexicons used for entity resolution are all stored in an embedded sqlite database so there is no need to run any external services in order to use EpiTator.",
"slogan":"Excel Integration with spaCy. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG.",
"description":"ExcelCy is a toolkit to integrate Excel to spaCy NLP training experiences. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG. ExcelCy has pipeline to match Entity with PhraseMatcher or Matcher in regular expression.",
"slogan":"Query spaCy's linguistic annotations using GraphQL",
"github":"ines/spacy-graphql",
"description":"A very simple and experimental app that lets you query spaCy's linguistic annotations using [GraphQL](https://graphql.org/). The API currently supports most token attributes, named entities, sentences and text categories (if available as `doc.cats`, i.e. if you added a text classifier to a model). The `meta` field will return the model meta data. Models are only loaded once and kept in memory.",
"url":"https://explosion.ai/demos/spacy-graphql",
"category":["apis"],
"tags":["graphql"],
"thumb":"https://i.imgur.com/xC7zpTO.png",
"code_example":[
"{",
" nlp(text: \"Zuckerberg is the CEO of Facebook.\", model: \"en_core_web_sm\") {",
"slogan":"JavaScript API for spaCy with Python REST API",
"github":"ines/spacy-js",
"description":"JavaScript interface for accessing linguistic annotations provided by spaCy. This project is mostly experimental and was developed for fun to play around with different ways of mimicking spaCy's Python API.\n\nThe results will still be computed in Python and made available via a REST API. The JavaScript API resembles spaCy's Python API as closely as possible (with a few exceptions, as the values are all pre-computed and it's tricky to express complex recursive relationships).",
"code_language":"javascript",
"code_example":[
"const spacy = require('spacy');",
"",
"(async function() {",
" const nlp = spacy.load('en_core_web_sm');",
" const doc = await nlp('This is a text about Facebook.');",