Skip to main content

Sentences

Sentences

NLP uses a language model to divide the source text into sentences. In general, NLP defines a sentence as a unit of text ending with a sentence terminator (usually a punctuation mark) followed by at least one space or line return. The next sentence begins with the next non-whitespace character. Capitalization is not required to indicate the beginning of a sentence.

Sentence terminators are (for most languages) the period (.), question mark (?), exclamation mark (!), and semi-colon (;). A sentence can be terminated by more than one terminator, such as ellipsis (...) or emphatic punctuation (??? or !!!). Any combination of terminators is permitted (...!?). A blank space between sentence terminators indicates a new sentence; therefore, an ellipsis containing spaces (. . .) is actually three sentences. A sentence terminator must be followed by either a whitespace character (space, tab, or line return), or by a single quote or double quote character, followed by a whitespace character. For example, "Why?" he asked. is two sentences, but "Why?", he asked. is a single sentence.

A double line return acts as a sentence terminator with or without a sentence terminator character. Therefore, a title or a section heading is considered to be a separate sentence, if followed by a blank line. The end of the file is also treated as a sentence terminator. Therefore, if a source contains any content at all (other than whitespace) it contains at least one sentence, regardless of the presence of a sentence terminator. Similarly, the last text in a file is treated as a separate sentence, regardless of the presence of a sentence terminator.

A period followed by a blank space usually indicates a sentence break, though NLP language models recognize exceptions to this rule. For example, the English language model recognizes common abbreviations, such as “Dr.” and “Mr.” (not case-sensitive) and removes the period rather than performing a sentence break. The English language model recognizes “No.” as an abbreviation, but treats lowercase “no.” as a sentence terminator.

You can use the UserDictionary option of your Configuration to cause or avoid sentence endings in specific cases. For example, the abbreviation “Fr.” (Father or Friar) is not recognized by the English language model. It is treated as a sentence break. You can use a UserDictionary to either remove the period or to specify that this use of a period should not cause a sentence break. A UserDictionary is applied as a source is loaded; already loaded sources are not affected.

FeedbackOpens in a new tab