LAPPS Interchange Format
[ Index | Overview | Tokens | Chunks & NER | Coreference | Phrase Structure | Dependencies ]
Chunks and Named Entities
Last updated: May 5th, 2017
Many annotations, including chunks and named entities, have a LIF realization that is similar to that of tags and tokens. In general, what they have in common is that the annotation is a sub type of Region where the annotation points at a span of text (or spans of text) and sets some attributes for that span.
Chunks
The vocabulary defines two annotations:
- http://vocab.lappsgrid.org/NounChunk.html
- http://vocab.lappsgrid.org/VerbChunk.html
These are direct subtypes of http://vocab.lappsgrid.org/Region.html and thus inherit the begin
, end
and targets
attributes. No other properties are defined by the vocabulary, but you are always allowed to add other properties to the features
dictionary. Here is an example of what a chunk could look like in LIF:
{
"@context": "http://vocab.lappsgrid.org/context-1.0.0.jsonld",
"text": { "@value": "Fido barks." },
"views": [
{
"id": "v1",
"metadata": {
"contains": {
"NounChunk": {
"producer": "GATE NounPhraseChunker v2.2.0 ",
"type": "chunker:gate" }}},
"annotations": [
{ "@type": "NounChunk", "id": "c0", "start": 0, "end": 4 }
]
}
]
}
The extent of a Region tag can be defined by the begin
and end
attributes, but it is also possible to use the targets
attribute instead and use it to refer to other annotations:
{
"@context": "http://vocab.lappsgrid.org/context-1.0.0.jsonld",
"text": { "@value": "Fido barks." },
"views": [
{
"id": "v1",
"metadata": {
"contains": {
"Token": {
"producer": "org.anc.lapps.stanford.SATokenizer:1.4.0",
"type": "tokenization:stanford" }}},
"annotations": [
{ "@type": "Token", "id": "tok0", "start": 0, "end": 4 },
{ "@type": "Token", "id": "tok1", "start": 5, "end": 10 },
{ "@type": "Token", "id": "tok2", "start": 10, "end": 11 }
]
},
{
"id": "v2",
"metadata": {
"contains": {
"NounChunk": {
"producer": "GATE NounPhraseChunker v2.2.0 ",
"type": "chunker:gate" }}},
"annotations": [
{ "@type": "NounChunk", "id": "c0", "targets": "v1:tok0" }
]
}
]
}
In general, this choice is available to any extent annotation. Note that in this case the chunker ran on top of a tokenizer and referred to the tokenizer input. It is totally legal for the chunker itself to create the tokenization itself and generate one view with both tokens and chunks.
Named Entities
The vocabulary item involved is http://vocab.lappsgrid.org/beta/NamedEntity.html. The vocabulary used to have subtypes of NamedEntity like Person, Organization and Location. These are now deprecated so a named entity annotation now looks as follows.
{
"@context": "http://vocab.lappsgrid.org/context-1.0.0.jsonld",
"text": { "@value": "Jill sleeps." },
"views": [
{
"id": "v1",
"metadata": {
"contains": {
"http://vocab.lappsgrid.org/NamedEntity": {
"producer":"edu.brandeis.cs.lappsgrid.stanford.corenlp.NamedEntityRecognizer:2.0.3",
"namedEntityCategorySet": "ner:stanford" }}}
"annotations": [
{ "@type": "NamedEntity", "id": "c0", "start": 0, "end": 4,
"features": { "category": "person", "gender": "female" } } ]
}
]
}
The thing to note here is the category
attribute. Possible values are expected to be defined in the vocabulary in the ner:stanford
section.