spaCy/spacy/lang/pt/lex_attrs.py

# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


# Cardinal number words. Both Brazilian and European spellings are listed
# where they differ (e.g. 'dezesseis' / 'dezasseis', 'bilhão' / 'bilião').
_num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
              'oito', 'nove', 'dez', 'onze', 'doze', 'treze', 'catorze',
              'quinze', 'dezesseis', 'dezasseis', 'dezessete', 'dezassete',
              'dezoito', 'dezenove', 'dezanove', 'vinte',
              'trinta', 'quarenta', 'cinquenta', 'sessenta', 'setenta',
              'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilhão', 'bilião',
              'trilhão', 'trilião', 'quatrilhão']

_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
                  'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
                  'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo',
                  'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo',
                  'trecentésimo', 'quadringentésimo', 'quingentésimo', 'sexcentésimo',
                  'septingentésimo', 'octingentésimo', 'nongentésimo', 'milésimo',
                  'milionésimo', 'bilionésimo']


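def like_num(text):
    """Return True if `text` looks like a number: digits, a simple fraction,
    or a Portuguese cardinal or ordinal word."""
    # Drop a single leading sign or approximation marker, e.g. "-5", "~10".
    if text.startswith(('+', '-', '±', '~')):
        text = text[1:]
    # Ignore thousands and decimal separators, e.g. "1.000,50".
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Word forms are matched case-insensitively.
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False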


LEX_ATTRS = {
    LIKE_NUM: like_num
}
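
The behaviour of `like_num` can be checked in isolation with a minimal sketch like the one below. It is not part of the module: `like_num_sketch` and the two `_SAMPLE_*` sets are hypothetical names, and the word sets are deliberately tiny stand-ins for `_num_words` and `_ordinal_words`, so that the example runs without spaCy's relative `...attrs` import.

```python
# -*- coding: utf-8 -*-
# Standalone sketch: mirrors the checks in like_num above, using small,
# illustrative word sets (hypothetical, not the module's full lists).
_SAMPLE_NUM_WORDS = {"zero", "um", "dois", "dezesseis", "dezasseis", "mil"}
_SAMPLE_ORDINAL_WORDS = {"primeiro", "segundo", "décimo"}


def like_num_sketch(text):
    # Drop one leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Strip separators wherever they occur.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept exactly one slash with digits on both sides.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Case-insensitive lookup of word forms.
    lowered = text.lower()
    return lowered in _SAMPLE_NUM_WORDS or lowered in _SAMPLE_ORDINAL_WORDS


examples = ["1.000,50", "-42", "3/4", "Dezesseis", "DÉCIMO", "1/2/3", "casa", ""]
results = {t: like_num_sketch(t) for t in examples}
for token in examples:
    print(repr(token), results[token])
```

Because separators are removed regardless of position, a string such as `1.2.3` also counts as number-like under this logic; only the first prefix character is stripped, so `--5` does not.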

Change history (from the blame view, oldest first):

2017-05-08 13:52:01 +00:00  Reorganise Portuguese language data
    Lines: `# coding: utf8` header and `unicode_literals` import.

2017-05-12 13:37:39 +00:00  Update Portuguese lexical attributes
    Lines: `LIKE_NUM` import, most of `_num_words`, the digit and fraction checks in `like_num`, and `LEX_ATTRS`.

2018-01-08 02:25:08 +00:00  Find lowercased forms of numeric words
    Lines: `if text.lower() in _num_words:`

2018-01-08 02:28:50 +00:00  Find lowercased forms of ordinal words, where possible
    Lines: `if text.lower() in _ordinal_words:` and its `return True`.

2018-01-08 02:38:44 +00:00  Rename variable to keep code consistent
    Lines: the `_ordinal_words` list.

2018-05-09 18:49:31 +00:00  Update lex_attrs.py (#2307)
    Lines: the `'quinze', 'dezesseis', 'dezasseis', ...` and `'oitenta', ..., 'bilhão', 'bilião', ...` lines of `_num_words`, through `'quatrilhão'`.
    * Update lex_attrs.py: Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).
    * Update lex_attrs.py: As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.

2018-10-01 08:49:14 +00:00  💫 Make like_num work for prefixed numbers (#2808)
    Lines: `if text.startswith(('+', '-', '±', '~')):` and `text = text[1:]`.
    * Only split + prefix if not numbers
    * Make like_num work for prefixed numbers
    * Add test for like_num