Multi-Word Tokens

Previous discussion can be found here.

Examples

German

1-2 im  _   _
1   in  in  PREP
1   dem der DET

Czech

4-5 abych   _   _
4   aby     aby SCONJ
5   bych    bÿt AUX

LIF Proposals

Put in separate views
Put in same view different annotation types
Put in same view differnt tokenTypes
Single token with features

Put In Separate Views

{
    "text": {
        "@value": "im",
        "@language": "de"
    },
    "views": [
        {
            "id": "v1",
            "metadata": {
                "contains": {
                    "http://vocab.lappsgrid.org/Token": {
                        "type": "lumped"
                    }
                }
            },
            "annotations": [
                {
                    "@type": "Token",
                    "id": "tk0",
                    "start": 0,
                    "end": 2
                }
            ]
        },
        {
            "id": "v2",
            "metadata": {
                "contains": {
                    "http://vocab.lappsgrid.org/Token": {
                        "type": "split"
                    }
                }
            },
            "annotations": [
                {
                    "@type": "Token",
                    "id": "tk0",
                    "targets": "v1:tk0"
                },
                {
                    "@type": "Token",
                    "id": "tk1",
                    "targets": "v1:tk0"
                }
            ]
        }
    ]
}

Issues

Complicates processing as tools will need to look in two (or more) views to reconcile all information. Naive tools may end up with the wrong token view.

Put In a Single View

Option #1

The surface token is annotated with http://vocab.lappsgrid.org/Token and the component tokens with http://vocab.lappsgrid.org/Word

{
    "text": {
        "@value": "im",
        "@language": "de"
    },
    "views": [
        {
            "id": "v1",
            "metadata": {
                "contains": {
                    "http://vocab.lappsgrid.org/Token": {
                        "type": "lumped"
                    },
                    "http://vocab.lappsgrid.org/Word": {
                        "type": "lumped"
                    }
                }
            },
            "annotations": [
                {
                    "@type": "Token",
                    "id": "tk0",
                    "start": 0,
                    "end": 2
                },
                {
                    "@type": "Word",
                    "id": "w0",
                    "features": {
                        "targets": "tk0",
                        "position": "1"
                    }
                },
                {
                    "@type": "Word",
                    "id": "w1",
                    "features": {
                        "targets": "tk0",
                        "position": "2"
                    }
                }
            ]
        }
    ]
}

Issues

How to annotate the Token with pos and lemma annotations.

Option #2

The surface token and component tokens are annotated with http://vocab.lappsgrid.org/Token and the component tokens have the tokenType feature set.

{
    "id": "tok4-5",
    "start": 177,
    "end": 182,
    "@type": "http://vocab.lappsgrid.org/Token",
    "features": {
        "word": "abych",
        "targets": [
            "mwt-4",
            "mwt-5"
        ]
    }
},
{
    "id": "mwt-4",
    "@type": "http://vocab.lappsgrid.org/Token",
    "features": {
        "word": "aby",
        "lemma": "aby",
        "pos": "SCONJ",
        "targets": [
            "tok4-5"
        ],
        "tokenType": "http://vocab.lappsgrid.org/ns/syntax/mwt"
    }
},
{
    "id": "mwt-5",
    "@type": "http://vocab.lappsgrid.org/Token",
    "features": {
        "word": "bych",
        "lemma": "b\u00fdt",
        "pos": "AUX",
        "targets": [
            "tok4-5"
        ],
        "tokenType": "http://vocab.lappsgrid.org/ns/syntax/mwt"
    }
},

Option #3

The surface token is annotated with http://vocab.lappsgrid.org/Token and the component tokens are features of the Token.

{
    "id": "tok4-5",
    "start": 177,
    "end": 182,
    "@type": "http://vocab.lappsgrid.org/Token",
    "features": {
        "word": "abych",
        "components": [
            {
                "word": "aby",
                "lemma": "aby",
                "pos": "SCONJ"
            },
            {
                "word": "bych",
                "lemma": "b\u00fdt",
                "pos": "AUX"
            }   
        ]
    }
}

Issues

What should really be an annotation is now the feature of another annotation.

The Language Applications Grid

An open framework for interoperable NLP web services

Multi-Word Tokens

Examples

LIF Proposals

Put In Separate Views

Put In a Single View

Option #1

Option #2

Option #3