How to tweak the NLTK sentence tokenizer
You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)

sentences is now:

    ['is THAT what you mean, Mrs. Hussey?']

Update: …
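To see what the abbreviation list changes, it can help to run the same text through a tokenizer with and without it. The following is a minimal sketch, not a definitive recipe: `build_splitter` is a hypothetical helper introduced here for illustration, and the normalization step assumes abbreviations may be written with capitals or a trailing period. Punkt stores abbreviation types in lowercase without the final period, which is why the original list uses 'mrs' rather than 'Mrs.'.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer


def build_splitter(abbreviations=()):
    """Hypothetical helper: build a Punkt tokenizer that knows the given abbreviations."""
    params = PunktParameters()
    # Punkt compares against lowercased token types with the trailing period
    # removed, so normalize entries such as "Mrs." to "mrs".
    params.abbrev_types = {a.lower().rstrip(".") for a in abbreviations}
    return PunktSentenceTokenizer(params)


text = "is THAT what you mean, Mrs. Hussey?"

# No abbreviations supplied: "Mrs." is treated as the end of a sentence.
without_abbrev = build_splitter().tokenize(text)

# Same abbreviations as in the snippet above, written in their usual form.
with_abbrev = build_splitter(["Dr.", "vs.", "Mr.", "Mrs.", "Prof.", "Inc."]).tokenize(text)

print(without_abbrev)
print(with_abbrev)  # ['is THAT what you mean, Mrs. Hussey?']
```

Neither tokenizer here is trained on any text, so the abbreviation set is the only difference between the two runs.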