Entities

An entity is a minimal logical unit of text. It is either a word or a group of words that NLP logically groups together into either a concept or a relation. Other logical units, such as a telephone number or an email address, are also considered entities (and are treated as concepts).

Note:

Japanese text cannot be divided into concepts and relations. Instead, NLP analyzes Japanese text as a sequence of entities with associated particles. The definition of an “entity” for Japanese is roughly equivalent to a Concept in other NLP languages. For a description of NLP Japanese support (written in Japanese), refer to NLP Japanese.

NLP normalizes entities so that they can be compared and counted. Normalization removes non-relevant words, converts entities to lowercase letters, and removes most punctuation and some special characters.
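The effect of normalization can be illustrated with a minimal Python sketch. This is not the NLP engine's implementation: the `normalize_entity` function, the `NON_RELEVANT` word list, and the choice of which punctuation to strip are all assumptions made for illustration, since the actual rules depend on the language model.

```python
import string
from collections import Counter

# Hypothetical stand-in for the language model's non-relevant word list.
NON_RELEVANT = {"the", "a", "an"}

# Assumption: strip ASCII punctuation except hyphens and apostrophes.
# The real engine removes "most punctuation and some special characters".
_STRIP = str.maketrans(
    "", "", string.punctuation.replace("-", "").replace("'", "")
)


def normalize_entity(text: str) -> str:
    """Lowercase, strip punctuation, and drop non-relevant words."""
    words = text.lower().translate(_STRIP).split()
    return " ".join(w for w in words if w not in NON_RELEVANT)


# Three surface forms collapse to one normalized entity, so they can be
# compared and counted together.
counts = Counter(
    normalize_entity(e) for e in ["The Quick Fox", "quick fox,", "A quick fox!"]
)
```

In this sketch all three inputs normalize to the same string, which is what makes frequency counts across differently written occurrences meaningful.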

By default, NLP restricts its analysis of entities to Concepts; Relations are analyzed only for their role in linking Concepts together. This default can be overridden, as described in the “Limiting by Position” section of the NLP Queries chapter.

Path-relevant Words

NLP identifies certain words in each language as being an essential part of its analysis of sentences and paths, but otherwise not relevant. Outside of the context of a sentence or path, these words have little informational content. The following are typical path-relevant words:

  • Pronouns of all types: definite, indefinite, possessive.

  • Indefinite expressions of time, frequency, or place. For example, “then”, “soon”, “later”, “sometimes”, “all”, “here”.

Path-relevant words are not considered Concepts, nor are they counted in frequency or dominance calculations. Path-relevant words may be negation or time attribute markers. For example, “none”, “nothing”, “nowhere”, “nobody”. Path-relevant words are not stemmed.
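As a rough illustration of how path-relevant words stay in a path yet drop out of frequency counts, consider the following Python sketch. The `PATH_RELEVANT` set, the `concept_frequencies` function, and the flat list used to represent a path are hypothetical simplifications and do not reflect how NLP stores or classifies entities.

```python
from collections import Counter

# Hypothetical sample of path-relevant words; the real list is defined
# by each language model.
PATH_RELEVANT = {
    "he", "she", "it", "they",
    "then", "soon", "later", "sometimes", "here",
    "none", "nothing", "nowhere", "nobody",
}


def concept_frequencies(path_entities):
    """Count entities in a path, skipping path-relevant words."""
    return Counter(e for e in path_entities if e not in PATH_RELEVANT)


# Simplified path: the path-relevant words remain part of the sequence...
path = ["nobody", "server log", "then", "disk error", "server log"]

# ...but only the remaining entities contribute to frequency.
freq = concept_frequencies(path)
```

The original `path` list is left unchanged, mirroring the idea that these words still matter for path analysis even though they are not counted.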

Non-relevant Words

NLP identifies certain words in each language as being non-relevant, and excludes these words from NLP indexing. There are several kinds of non-relevant words:

  • Articles (such as “the” and “a”) and other words that the NLP language model identifies as having little or no semantic importance.

  • Prefatory words or phrases at the beginning of a sentence, such as “And”, “Nevertheless”, “However”, “On the other hand”.

  • Character strings over 150 characters that are unbroken by spaces or sentence punctuation. A “word” of this length is highly likely to be a non-text entity, and is thus excluded from NLP indexing. Because in rare cases (such as chemical nomenclature or URL strings) these 150+ character words are semantically relevant, NLP flags them with the attribute “nonsemantic”.

Non-relevant words are excluded from NLP indexing, but are preserved when sentences are displayed.
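The exclusion rules above can be sketched in Python as follows. This is an illustrative approximation only: `split_for_indexing` and its `non_relevant` parameter are hypothetical names, and the sketch splits on whitespace alone, whereas the text describes strings unbroken by spaces or sentence punctuation.

```python
MAX_WORD_LENGTH = 150  # threshold stated in the text above


def split_for_indexing(sentence, non_relevant):
    """Return (indexed_tokens, nonsemantic_tokens) for one sentence.

    Tokens longer than MAX_WORD_LENGTH are set aside and flagged as
    "nonsemantic"; tokens in the non_relevant set are skipped.
    """
    indexed, nonsemantic = [], []
    for token in sentence.split():
        if len(token) > MAX_WORD_LENGTH:
            nonsemantic.append(token)
        elif token.lower() not in non_relevant:
            indexed.append(token)
    return indexed, nonsemantic


long_word = "x" * 151
sentence = f"The report cites {long_word} in a table"
indexed, nonsemantic = split_for_indexing(sentence, {"the", "a"})
```

The input `sentence` is never modified, which corresponds to non-relevant words being preserved when sentences are displayed even though they are left out of the index.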
