There is no agreement in the research community on the definition of a named entity. Often, only instances of concepts in a certain scenario are considered to be named entities. The following definition is taken from the CoNLL 2002 NER task.
"Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002)
The CoNLL 2002 definition is deliberately pragmatic: it serves to clarify the goals of the NER task. However, it also unnecessarily limits named entities to the concepts person, organization, location, time, and quantity, and it is arguable whether time and quantity are real named entities at all. The term "named" restricts the task to entities that are rigid designators as defined by Kripke in 1981: "A rigid designator designates the same object in all possible worlds in which that object exists and never designates anything else." Following that path, however, leads us into a very philosophical discussion about what an entity is. Nadeau (2008) also shows how difficult the definition of a named entity actually is. His definition is "[...] ugly and circular, but is practical!":
The types recognized by NER are any sets of words that intersect with an NER type.
We will therefore define the term for the scope of this thesis and use the terms "entity" and "named entity" interchangeably. Some researchers also call an entity that can be extracted from the web a "web object".
A named entity is a collection of rigidly designated chunks of text that refer to exactly one or multiple identical, real or abstract concept instances. These instances can have several aliases, and one name can refer to different instances.
The figure below shows the relation between concepts and entities. The figure allows us to explain our definition using some examples.
Ambiguity: We can see that the movie entity "Iron Man" has another alias, "Ironman", which refers to the same entity; that is, their UUIDs are identical. Furthermore, the chunk of text "Iron Man" is ambiguous and might also refer to the Marvel comic character. Chunks of text are often ambiguous and need to be disambiguated in the context in which they are mentioned. The names must be rigidly designated; the name "the 2008 movie where Robert Downey Jr. plays a comic hero" is therefore not an entity. Because names are ambiguous, each entity must get a Universally Unique Identifier (UUID).
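The alias and ambiguity relations described above can be sketched as a small data structure. This is only an illustrative sketch, not the implementation used in this thesis: the class names `Entity` and `EntityIndex`, the field `entity_id`, and the concept labels "Movie" and "Comic Character" are assumptions made for the example.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One entity: a concept instance with a UUID and a set of names (hypothetical model)."""
    concept: str        # assumed concept label, e.g. "Movie"
    names: set          # rigidly designating names, including aliases
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class EntityIndex:
    """Maps each name to every entity that carries it (hypothetical helper)."""

    def __init__(self):
        self._by_name = defaultdict(list)

    def add(self, entity):
        # Register the entity under each of its names and aliases.
        for name in entity.names:
            self._by_name[name].append(entity)
        return entity

    def candidates(self, text):
        # All entities a chunk of text may refer to; more than one means
        # the chunk is ambiguous and must be disambiguated by its context.
        return list(self._by_name.get(text, []))


index = EntityIndex()
movie = index.add(Entity("Movie", {"Iron Man", "Ironman"}))
character = index.add(Entity("Comic Character", {"Iron Man"}))

# Both aliases of the movie resolve to the same UUID.
same_id = {e.entity_id for e in index.candidates("Ironman")} == {movie.entity_id}

# "Iron Man" alone is ambiguous: it yields two candidates with different UUIDs.
ambiguous = index.candidates("Iron Man")
```

In this sketch, a description such as "the 2008 movie where Robert Downey Jr. plays a comic hero" is never registered as a name, so looking it up returns no candidates. Choosing among several candidates for an ambiguous chunk is left out here, since it depends on the surrounding context.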
Generic and Specific: "John Hiatt" the musician refers to exactly one real-world instance, while the mobile phone "Nexus One" refers to multiple identical real-world instances. People are specific; that is, a concrete entity exists only once in the real world. Products are generic: the mobile phone "Nexus One" exists many times in the real world, but we are not interested in these individual instances. We are interested in their common name, since they all share the same attributes, such as "display size" in the mobile phone example; hence the term "identical concept instances" in our definition. Many concepts are generic, for example gene, car, or movie names. More information about specific and generic entities can be found in the language processing toolkit LingPipe.
Abstract and Real: While "Nexus One" is an instance of a real-world object, the sport entity "Hockey" is an abstract entity. One can play hockey to make it real, but until then it is an abstract instance of a concept. It is not a concept according to our definition, since there are no instances of hockey itself. The same is true for instances of event concepts such as concerts or conferences. Our definition also allows us to have temporal and numerical instances such as "$1,000,000" as abstract entities.