Added support for Sanskrit language (#5956)

* Added support for Sanskrit language
* Added tests for lexical attribute like_num
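
For illustration, the `like_num` check that this commit adds and tests can be sketched on its own, outside spaCy. This is a minimal sketch, not the committed file: `NUM_WORDS` is a hypothetical four-word subset of the `_num_words` list in `lex_attrs.py`, and a set is used here instead of a list purely for the example.

```python
# Standalone sketch of the like_num logic from spacy/lang/sa/lex_attrs.py.
# NUM_WORDS is an illustrative subset (assumption), not the full committed list.
NUM_WORDS = {"एकः", "दश", "पञ्चदश", "शतम्"}


def like_num(text):
    """Return True if text resembles a number."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 1/2.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the spelled-out Sanskrit number words.
    return text in NUM_WORDS


for sample in ["10.000", "999,0", "1/2", "दश", "कूपे", ","]:
    print(sample, like_num(sample))
```

A lone separator such as `","` is reduced to the empty string, which fails every check, so it is not treated as a number. That matches the `(",", False)` case in the test file.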
2020-08-25 14:26:29 +05:30 · 2020-08-25 14:26:29 +05:30 · 450720aca2
parent b10c7bc56e
commit 450720aca2
8 changed files with 848 additions and 0 deletions
--- a/.github/contributors/snsten.md
+++ b/.github/contributors/snsten.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Shashank Shekhar     |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2020-08-23           |
+| GitHub username                | snsten               |
+| Website (optional)             |                      |
--- a/spacy/lang/sa/__init__.py
+++ b/spacy/lang/sa/__init__.py
@ -0,0 +1,24 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+
+from ...language import Language
+from ...attrs import LANG
+
+
+class SanskritDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: "sa"
+
+    stop_words = STOP_WORDS
+
+
+class Sanskrit(Language):
+    lang = "sa"
+    Defaults = SanskritDefaults
+
+
+__all__ = ["Sanskrit"]
--- a/spacy/lang/sa/examples.py
+++ b/spacy/lang/sa/examples.py
@ -0,0 +1,19 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.sa.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "अभ्यावहति कल्याणं विविधं वाक् सुभाषिता ।",
+    "मनसि व्याकुले चक्षुः पश्यन्नपि न पश्यति ।",
+    "यस्य बुद्धिर्बलं तस्य निर्बुद्धेस्तु कुतो बलम्?",
+    "परो अपि हितवान् बन्धुः बन्धुः अपि अहितः परः ।",
+    "अहितः देहजः व्याधिः हितम् आरण्यं औषधम् ॥",
+]
--- a/spacy/lang/sa/lex_attrs.py
+++ b/spacy/lang/sa/lex_attrs.py
@ -0,0 +1,131 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+# reference 1: https://en.wikibooks.org/wiki/Sanskrit/Numbers
+
+_num_words = [
+    "एकः",
+    "द्वौ",
+    "त्रयः",
+    "चत्वारः",
+    "पञ्च",
+    "षट्",
+    "सप्त",
+    "अष्ट",
+    "नव",
+    "दश",
+    "एकादश",
+    "द्वादश",
+    "त्रयोदश",
+    "चतुर्दश",
+    "पञ्चदश",
+    "षोडश",
+    "सप्तदश",
+    "अष्टादश",
+    "एकान्नविंशति",
+    "विंशति",
+    "एकाविंशति",
+    "द्वाविंशति",
+    "त्रयोविंशति",
+    "चतुर्विंशति",
+    "पञ्चविंशति",
+    "षड्विंशति",
+    "सप्तविंशति",
+    "अष्टाविंशति",
+    "एकान्नत्रिंशत्",
+    "त्रिंशत्",
+    "एकत्रिंशत्",
+    "द्वात्रिंशत्",
+    "त्रयत्रिंशत्",
+    "चतुस्त्रिंशत्",
+    "पञ्चत्रिंशत्",
+    "षट्त्रिंशत्",
+    "सप्तत्रिंशत्",
+    "अष्टात्रिंशत्",
+    "एकोनचत्वारिंशत्",
+    "चत्वारिंशत्",
+    "एकचत्वारिंशत्",
+    "द्वाचत्वारिंशत्",
+    "त्रयश्चत्वारिंशत्",
+    "चतुश्चत्वारिंशत्",
+    "पञ्चचत्वारिंशत्",
+    "षट्चत्वारिंशत्",
+    "सप्तचत्वारिंशत्",
+    "अष्टाचत्वारिंशत्",
+    "एकोनपञ्चाशत्",
+    "पञ्चाशत्",
+    "एकपञ्चाशत्",
+    "द्विपञ्चाशत्",
+    "त्रिपञ्चाशत्",
+    "चतुःपञ्चाशत्",
+    "पञ्चपञ्चाशत्",
+    "षट्पञ्चाशत्",
+    "सप्तपञ्चाशत्",
+    "अष्टपञ्चाशत्",
+    "एकोनषष्ठिः",
+    "षष्ठिः",
+    "एकषष्ठिः",
+    "द्विषष्ठिः",
+    "त्रिषष्ठिः",
+    "चतुःषष्ठिः",
+    "पञ्चषष्ठिः",
+    "षट्षष्ठिः",
+    "सप्तषष्ठिः",
+    "अष्टषष्ठिः",
+    "एकोनसप्ततिः",
+    "सप्ततिः",
+    "एकसप्ततिः",
+    "द्विसप्ततिः",
+    "त्रिसप्ततिः",
+    "चतुःसप्ततिः",
+    "पञ्चसप्ततिः",
+    "षट्सप्ततिः",
+    "सप्तसप्ततिः",
+    "अष्टसप्ततिः",
+    "एकोनाशीतिः",
+    "अशीतिः",
+    "एकाशीतिः",
+    "द्वशीतिः",
+    "त्र्यशीतिः",
+    "चतुरशीतिः",
+    "पञ्चाशीतिः",
+    "षडशीतिः",
+    "सप्ताशीतिः",
+    "अष्टाशीतिः",
+    "एकोननवतिः",
+    "नवतिः",
+    "एकनवतिः",
+    "द्विनवतिः",
+    "त्रिनवतिः",
+    "चतुर्नवतिः",
+    "पञ्चनवतिः",
+    "षण्णवतिः",
+    "सप्तनवतिः",
+    "अष्टनवतिः",
+    "एकोनशतम्",
+    "शतम्"
+]
+
+
+def like_num(text):
+    """
+    Check if text resembles a number
+    """
+    if text.startswith(("+", "-", "±", "~")):
+        text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
--- a/spacy/lang/sa/stop_words.py
+++ b/spacy/lang/sa/stop_words.py
@ -0,0 +1,518 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+# Source: https://gist.github.com/Akhilesh28/fe8b8e180f64b72e64751bc31cb6d323
+
+STOP_WORDS = set(
+    """
+अहम्
+आवाम्
+वयम्
+माम्  मा
+आवाम्
+अस्मान्  नः
+मया
+आवाभ्याम्
+अस्माभिस्
+मह्यम्  मे
+आवाभ्याम्  नौ
+अस्मभ्यम्  नः
+मत्
+आवाभ्याम्
+अस्मत्
+मम  मे
+आवयोः
+अस्माकम्  नः
+मयि
+आवयोः
+अस्मासु
+त्वम्
+युवाम्
+यूयम्
+त्वाम्  त्वा
+युवाम्  वाम्
+युष्मान्  वः
+त्वया
+युवाभ्याम्
+युष्माभिः
+तुभ्यम्  ते
+युवाभ्याम्  वाम्
+युष्मभ्यम्  वः
+त्वत्
+युवाभ्याम्
+युष्मत्
+तव  ते
+युवयोः  वाम्
+युष्माकम्  वः
+त्वयि
+युवयोः
+युष्मासु
+सः
+तौ
+ते
+तम्
+तौ
+तान्
+तेन
+ताभ्याम्
+तैः
+तस्मै
+ताभ्याम्
+तेभ्यः
+तस्मात्
+ताभ्याम्
+तेभ्यः
+तस्य
+तयोः
+तेषाम्
+तस्मिन्
+तयोः
+तेषु
+सा
+ते
+ताः
+ताम्
+ते
+ताः
+तया
+ताभ्याम्
+ताभिः
+तस्यै
+ताभ्याम्
+ताभ्यः
+तस्याः
+ताभ्याम्
+ताभ्यः
+तस्य
+तयोः
+तासाम्
+तस्याम्
+तयोः
+तासु
+तत्
+ते
+तानि
+तत्
+ते
+तानि
+तया
+ताभ्याम्
+ताभिः
+तस्यै
+ताभ्याम्
+ताभ्यः
+तस्याः
+ताभ्याम्
+ताभ्यः
+तस्य
+तयोः
+तासाम्
+तस्याम्
+तयोः
+तासु
+अयम्
+इमौ
+इमे
+इमम्
+इमौ
+इमान्
+अनेन
+आभ्याम्
+एभिः
+अस्मै
+आभ्याम्
+एभ्यः
+अस्मात्
+आभ्याम्
+एभ्यः
+अस्य
+अनयोः
+एषाम्
+अस्मिन्
+अनयोः
+एषु
+इयम्
+इमे
+इमाः
+इमाम्
+इमे
+इमाः
+अनया
+आभ्याम्
+आभिः
+अस्यै
+आभ्याम्
+आभ्यः
+अस्याः
+आभ्याम्
+आभ्यः
+अस्याः
+अनयोः
+आसाम्
+अस्याम्
+अनयोः
+आसु
+इदम्
+इमे
+इमानि
+इदम्
+इमे
+इमानि
+अनेन
+आभ्याम्
+एभिः
+अस्मै
+आभ्याम्
+एभ्यः
+अस्मात्
+आभ्याम्
+एभ्यः
+अस्य
+अनयोः
+एषाम्
+अस्मिन्
+अनयोः
+एषु
+एषः
+एतौ
+एते
+एतम्  एनम्
+एतौ  एनौ
+एतान्  एनान्
+एतेन
+एताभ्याम्
+एतैः
+एतस्मै
+एताभ्याम्
+एतेभ्यः
+एतस्मात्
+एताभ्याम्
+एतेभ्यः
+एतस्य
+एतस्मिन्
+एतेषाम्
+एतस्मिन्
+एतस्मिन्
+एतेषु
+एषा
+एते
+एताः
+एताम्  एनाम्
+एते  एने
+एताः  एनाः
+एतया  एनया
+एताभ्याम्
+एताभिः
+एतस्यै
+एताभ्याम्
+एताभ्यः
+एतस्याः
+एताभ्याम्
+एताभ्यः
+एतस्याः
+एतयोः  एनयोः
+एतासाम्
+एतस्याम्
+एतयोः  एनयोः
+एतासु
+एतत्  एतद्
+एते
+एतानि
+एतत्  एतद्  एनत्  एनद्
+एते  एने
+एतानि  एनानि
+एतेन  एनेन
+एताभ्याम्
+एतैः
+एतस्मै
+एताभ्याम्
+एतेभ्यः
+एतस्मात्
+एताभ्याम्
+एतेभ्यः
+एतस्य
+एतयोः  एनयोः
+एतेषाम्
+एतस्मिन्
+एतयोः  एनयोः
+एतेषु
+असौ
+अमू
+अमी
+अमूम्
+अमू
+अमून्
+अमुना
+अमूभ्याम्
+अमीभिः
+अमुष्मै
+अमूभ्याम्
+अमीभ्यः
+अमुष्मात्
+अमूभ्याम्
+अमीभ्यः
+अमुष्य
+अमुयोः
+अमीषाम्
+अमुष्मिन्
+अमुयोः
+अमीषु
+असौ
+अमू
+अमूः
+अमूम्
+अमू
+अमूः
+अमुया
+अमूभ्याम्
+अमूभिः
+अमुष्यै
+अमूभ्याम्
+अमूभ्यः
+अमुष्याः
+अमूभ्याम्
+अमूभ्यः
+अमुष्याः
+अमुयोः
+अमूषाम्
+अमुष्याम्
+अमुयोः
+अमूषु
+अमु
+अमुनी
+अमूनि
+अमु
+अमुनी
+अमूनि
+अमुना
+अमूभ्याम्
+अमीभिः
+अमुष्मै
+अमूभ्याम्
+अमीभ्यः
+अमुष्मात्
+अमूभ्याम्
+अमीभ्यः
+अमुष्य
+अमुयोः
+अमीषाम्
+अमुष्मिन्
+अमुयोः
+अमीषु
+कः
+कौ
+के
+कम्
+कौ
+कान्
+केन
+काभ्याम्
+कैः
+कस्मै
+काभ्याम्
+केभ्य
+कस्मात्
+काभ्याम्
+केभ्य
+कस्य
+कयोः
+केषाम्
+कस्मिन्
+कयोः
+केषु
+का
+के
+काः
+काम्
+के
+काः
+कया
+काभ्याम्
+काभिः
+कस्यै
+काभ्याम्
+काभ्यः
+कस्याः
+काभ्याम्
+काभ्यः
+कस्याः
+कयोः
+कासाम्
+कस्याम्
+कयोः
+कासु
+किम्
+के
+कानि
+किम्
+के
+कानि
+केन
+काभ्याम्
+कैः
+कस्मै
+काभ्याम्
+केभ्य
+कस्मात्
+काभ्याम्
+केभ्य
+कस्य
+कयोः
+केषाम्
+कस्मिन्
+कयोः
+केषु
+भवान्
+भवन्तौ
+भवन्तः
+भवन्तम्
+भवन्तौ
+भवतः
+भवता
+भवद्भ्याम्
+भवद्भिः
+भवते
+भवद्भ्याम्
+भवद्भ्यः
+भवतः
+भवद्भ्याम्
+भवद्भ्यः
+भवतः
+भवतोः
+भवताम्
+भवति
+भवतोः
+भवत्सु
+भवती
+भवत्यौ
+भवत्यः
+भवतीम्
+भवत्यौ
+भवतीः
+भवत्या
+भवतीभ्याम्
+भवतीभिः
+भवत्यै
+भवतीभ्याम्
+भवतीभिः
+भवत्याः
+भवतीभ्याम्
+भवतीभिः
+भवत्याः
+भवत्योः
+भवतीनाम्
+भवत्याम्
+भवत्योः
+भवतीषु
+भवत्
+भवती
+भवन्ति
+भवत्
+भवती
+भवन्ति
+भवता
+भवद्भ्याम्
+भवद्भिः
+भवते
+भवद्भ्याम्
+भवद्भ्यः
+भवतः
+भवद्भ्याम्
+भवद्भ्यः
+भवतः
+भवतोः
+भवताम्
+भवति
+भवतोः
+भवत्सु
+अये
+अरे
+अरेरे
+अविधा
+असाधुना
+अस्तोभ
+अहह
+अहावस्
+आम्
+आर्यहलम्
+आह
+आहो
+इस्
+उम्
+उवे
+काम्
+कुम्
+चमत्
+टसत्
+दृन्
+धिक्
+पाट्
+फत्
+फाट्
+फुडुत्
+बत
+बाल्
+वट्
+व्यवस्तोभति व्यवस्तुभ्
+षाट्
+स्तोभ
+हुम्मा
+हूम्
+अति
+अधि
+अनु
+अप
+अपि
+अभि
+अव
+आ
+उद्
+उप
+नि
+निर्
+परा
+परि
+प्र
+प्रति
+वि
+सम्
+अथवा उत
+अन्यथा
+इव
+च
+चेत् यदि
+तु परन्तु
+यतः करणेन हि यतस् यदर्थम् यदर्थे यर्हि यथा यत्कारणम् येन ही हिन
+यथा यतस्
+यद्यपि
+यात् अवधेस् यावति
+येन प्रकारेण
+स्थाने
+अह
+एव
+एवम्
+कच्चित्
+कु
+कुवित्
+कूपत्
+च
+चण्
+चेत्
+तत्र
+नकिम्
+नह
+नुनम्
+नेत्
+भूयस्
+मकिम्
+मकिर्
+यत्र
+युगपत्
+वा
+शश्वत्
+सूपत्
+ह
+हन्त
+हि
+""".split()
+)
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -212,6 +212,11 @@ def ru_lemmatizer():
    return get_lang_class("ru").Defaults.create_lemmatizer()


+@pytest.fixture(scope="session")
+def sa_tokenizer():
+    return get_lang_class("sa").Defaults.create_tokenizer()
+
+
@pytest.fixture(scope="session")
 def sr_tokenizer():
    return get_lang_class("sr").Defaults.create_tokenizer()
--- a/spacy/tests/lang/sa/__init__.py
+++ b/spacy/tests/lang/sa/__init__.py
--- a/spacy/tests/lang/sa/test_text.py
+++ b/spacy/tests/lang/sa/test_text.py
@ -0,0 +1,45 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_sa_tokenizer_handles_long_text(sa_tokenizer):
+    text = """नानाविधानि दिव्यानि नानावर्णाकृतीनि च।।"""
+    tokens = sa_tokenizer(text)
+    assert len(tokens) == 6
+
+
+@pytest.mark.parametrize(
+    "text,length",
+    [
+        ("श्री भगवानुवाच पश्य मे पार्थ रूपाणि शतशोऽथ सहस्रशः।", 9),
+        ("गुणान् सर्वान् स्वभावो मूर्ध्नि वर्तते ।", 6),
+    ],
+)
+def test_sa_tokenizer_handles_cnts(sa_tokenizer, text, length):
+    tokens = sa_tokenizer(text)
+    assert len(tokens) == length
+
+
+@pytest.mark.parametrize(
+    "text,match",
+    [
+        ("10", True),
+        ("1", True),
+        ("10.000", True),
+        ("1000", True),
+        ("999,0", True),
+        ("एकः ", True),
+        ("दश", True),
+        ("पञ्चदश", True),
+        ("चत्वारिंशत् ", True),
+        ("कूपे", False),
+        (",", False),
+        ("1/2", True),
+    ],
+)
+def test_lex_attrs_like_number(sa_tokenizer, text, match):
+    tokens = sa_tokenizer(text)
+    assert len(tokens) == 1
+    assert tokens[0].like_num == match
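
A note on how `STOP_WORDS` in `stop_words.py` is built: the triple-quoted string is split on any whitespace and wrapped in `set(...)`. As a result, lines holding several forms contribute each form separately, and forms repeated across paradigms collapse into one entry. The sketch below uses a small hypothetical excerpt of the committed list to show this.

```python
# Illustrative excerpt only (assumption); the full list is in
# spacy/lang/sa/stop_words.py.
STOP_WORDS = set(
    """
माम्  मा
आवाम्
आवाम्
एतत्  एतद्  एनत्  एनद्
च
""".split()
)

# str.split() with no arguments splits on spaces and newlines alike,
# so "माम्  मा" yields two entries and the repeated "आवाम्" is stored once.
print(len(STOP_WORDS))
print("मा" in STOP_WORDS)
print("माम्  मा" in STOP_WORDS)
```

One consequence of this construction is that multi-word entries, such as `येन प्रकारेण` in the committed list, are stored as individual words and not as a phrase.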