How to tweak the NLTK sentence tokenizer

You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)

sentences is now:

    ['is THAT what you mean, Mrs. Hussey?']

Update: …

How to get rid of punctuation using NLTK tokenizer?

Take a look at the other tokenizing options that NLTK provides in nltk.tokenize. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Output:

    ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
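As a rough cross-check without NLTK, the same \w+ pattern can be run through the standard library's re.findall, which should give the same tokens for this input. This is only an illustrative sketch; the variable names are made up here, and it is not a replacement for the tokenizer classes.

```python
import re

text = "Eighty-seven miles to go, yet. Onward!"

# \w+ matches runs of letters, digits and underscores, so punctuation
# and whitespace never become tokens.
tokens = re.findall(r"\w+", text)
print(tokens)
```

Note that \w also matches the underscore, so a string like "snake_case" stays a single token.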

Training data format for NLTK Punkt

Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detector. And the authors' last names are pretty cool too: Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent). To train …
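A minimal sketch of what training on raw text can look like, assuming the PunktTrainer and PunktSentenceTokenizer classes from nltk.tokenize.punkt. The training text below is invented for illustration; a real run would use a much larger plaintext corpus, and the learned parameters from such a tiny sample should not be relied on.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical training text: plain, unannotated prose (assumption for this sketch).
training_text = (
    "Dr. Smith arrived at noon. He met Mr. Jones in the lobby. "
    "They talked for an hour. Dr. Smith left early. "
    "Mr. Jones stayed behind. The meeting was short. "
) * 20

trainer = PunktTrainer()
trainer.train(training_text, finalize=False)  # accumulate statistics from raw text
trainer.finalize_training()                   # finish computing the learned parameters

# Build a tokenizer from the learned parameters and try it on new text.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize("The room was quiet. Nobody spoke.")
print(sentences)
```

Passing finalize=False makes it possible to call train() several times on different files before finalizing once at the end.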

Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer: Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and …
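Since the tagger labels tokens one at a time, one possible workaround is to merge runs of adjacent tokens that carry the same tag. The sketch below assumes the tagger output is a list of (token, tag) pairs with "O" marking non-entities; the function name group_entities and the sample data are hypothetical, not part of NLTK or the Stanford tools.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of adjacent tokens that share the same non-outside tag."""
    chunks = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            chunks.append((" ".join(token for token, _ in run), tag))
    return chunks

# Hypothetical tagger output, written by hand for illustration.
tagged = [
    ("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
    ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
    ("University", "ORGANIZATION"), ("in", "O"), ("NY", "LOCATION"),
]

entities = group_entities(tagged)
persons = [text for text, tag in entities if tag == "PERSON"]
organizations = [text for text, tag in entities if tag == "ORGANIZATION"]
print(persons, organizations)
```

Without IOB-style labels, two different entities of the same type that sit directly next to each other would be merged into one chunk, so this heuristic is approximate.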

How do I do dependency parsing in NLTK?

We can use the Stanford Parser from NLTK.

Requirements

You need to download two things from their website:

1. The Stanford CoreNLP parser.
2. The language model for your desired language (e.g. the English language model).

Warning! Make sure that your language model version matches your Stanford CoreNLP parser version! The current CoreNLP version as of May 22, 2018 is …

NLTK download SSL: Certificate verify failed

TL;DR: Here is a better solution: https://github.com/gunthercox/ChatterBot/issues/930#issuecomment-322111087

Note that when you run nltk.download(), a window will pop up and let you select which packages to download (the download does not start automatically).

To complement the accepted answer, the following is a complete list of directories that will be searched on Mac (not limited to …
