nltk – Make Me Engineer

How to tweak the NLTK sentence tokenizer

June 13, 2023 by Tarik

You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)

sentences is now:

    ['is THAT what you mean, Mrs. Hussey?']

Update: …

Tokenize a paragraph into sentence and then into words in NLTK

June 11, 2023 by Tarik

You probably intended to loop over sent_text:

    import nltk

    sent_text = nltk.sent_tokenize(text)  # this gives us a list of sentences

    # now loop over each sentence and tokenize it separately
    for sentence in sent_text:
        tokenized_text = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized_text)
        print(tagged)

Spell Checker for Python

June 10, 2023 by Tarik

You can use the autocorrect library to spell check in Python. Example usage:

    from autocorrect import Speller

    spell = Speller(lang='en')
    print(spell('caaaar'))
    print(spell('mussage'))
    print(spell('survice'))
    print(spell('hte'))

Result:

    caesar
    message
    service
    the

How to get rid of punctuation using NLTK tokenizer?

June 6, 2023 by Tarik

Take a look at the other tokenizing options that NLTK provides in its nltk.tokenize package. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Output:

    ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
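For a quick check of what the `\w+` pattern keeps, the same regular expression can be run through the standard library's `re.findall`, with no NLTK needed. This is only an illustrative sketch (the variable names are mine), not a replacement for NLTK's tokenizers.

```python
import re

text = 'Eighty-seven miles to go, yet. Onward!'

# \w+ matches runs of letters, digits and underscores, so hyphens,
# commas, periods and exclamation marks never end up inside a token.
tokens = re.findall(r'\w+', text)
print(tokens)
```

Keep in mind that underscores count as word characters, so something like `snake_case` would stay as a single token.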

How to get all the hyponyms of a word/synset in python nltk and wordnet?

June 2, 2023 by Tarik

    from nltk.corpus import wordnet as wn

    vehicle = wn.synset('vehicle.n.01')
    typesOfVehicles = list(set([w for s in vehicle.closure(lambda s: s.hyponyms()) for w in s.lemma_names()]))

This will give you all the unique words from every synset that is a hyponym of the noun "vehicle" (1st sense).
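To make the `closure` step less opaque: it keeps applying the function you pass in (here `hyponyms()`), so you get children, grandchildren and so on. Below is a rough sketch of that idea on a made-up hierarchy. The `HYPONYMS` dict and this `closure` function are hypothetical stand-ins for illustration, not WordNet data and not NLTK's implementation.

```python
from collections import deque

# Hypothetical toy hierarchy: each key maps to its direct hyponyms.
HYPONYMS = {
    'vehicle': ['car', 'boat'],
    'car': ['taxi', 'ambulance'],
    'boat': ['canoe'],
}

def closure(start, rel):
    """Yield every node reachable from start via rel, breadth-first, without repeats."""
    seen = set()
    queue = deque(rel(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        yield node
        # Follow the relation one level further down.
        queue.extend(rel(node))

all_hyponyms = list(closure('vehicle', lambda w: HYPONYMS.get(w, [])))
print(all_hyponyms)
```

The starting node itself is not part of the result, only the things below it.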

training data format for NLTK punkt

May 30, 2023 by Tarik

Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detector. And the authors' last names are pretty cool too: Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent). To train …

Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

May 28, 2023 by Tarik

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer: Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and …
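Since the tagger labels each token on its own, one common workaround is to merge runs of adjacent tokens that share the same label. The sketch below assumes the tagger output is a list of `(token, tag)` pairs with `'O'` marking non-entities. The sample `tagged` list and the `chunk_entities` helper are made up for illustration. Note that two different entities of the same type sitting next to each other would be merged into one, which is exactly the ambiguity IOB labels are meant to avoid.

```python
from itertools import groupby

# Hypothetical tagger output: (token, tag) pairs, 'O' means "not an entity".
tagged = [
    ('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'),
]

def chunk_entities(pairs):
    """Join consecutive tokens that carry the same non-'O' tag into one entity."""
    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

entities = chunk_entities(tagged)
print(entities)

# Keep only persons and organizations.
people_and_orgs = [e for e in entities if e[1] in ('PERSON', 'ORGANIZATION')]
print(people_and_orgs)
```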

How do I do dependency parsing in NLTK?

May 26, 2023 by Tarik

We can use the Stanford Parser from NLTK.

Requirements

You need to download two things from their website:

1. The Stanford CoreNLP parser.
2. The language model for your desired language (e.g. the English language model).

Warning! Make sure that your language model version matches your Stanford CoreNLP parser version!

The current CoreNLP version as of May 22, 2018 is …

NLTK download SSL: Certificate verify failed

May 22, 2023 by Tarik

TLDR: Here is a better solution: https://github.com/gunthercox/ChatterBot/issues/930#issuecomment-322111087

Note that when you run nltk.download(), a window will pop up and let you select which packages to download (the download does not start automatically).

To complement the accepted answer, the following is a complete list of directories that will be searched on Mac (not limited to …

error installing nltk supporting packages : nltk.download()

May 18, 2023 by Tarik

Try the code below. It downloaded the package as expected:

    import nltk
    import ssl

    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context

    nltk.download()

It looks like the download link was failing before, and the SSL workaround fixed it. Note: this was done on a Mac.