Mark Amery - 1 month ago
Python Question

What do spaCy's part-of-speech and dependency tags mean?

spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ properties) and a syntactic dependency to its .head token (stored in the dep and dep_ properties).

Some of these tags are self-explanatory, even to somebody like me without a linguistics background:

>>> import spacy
>>> en_nlp = spacy.load('en')
>>> document = en_nlp("I shot a man in Reno just to watch him die.")
>>> document[1]
shot
>>> document[1].pos_
'VERB'


Others... are not:

>>> document[1].tag_
'VBD'
>>> document[2].pos_
'DET'
>>> document[3].dep_
'dobj'


Worse, the official docs don't even contain a list of the possible tags for most of these properties, let alone the meanings of any of them. They sometimes mention which tagging standard they use, but these claims aren't currently entirely accurate, and on top of that the standards are tricky to track down.

What are the possible values of the tag_, pos_, and dep_ properties, and what do they mean?

Answer

Part of speech tokens

The spaCy docs currently claim:

The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.

More precisely, the tag_ property exposes Treebank tags, and the pos_ property exposes tags based upon the Google Universal POS Tags (although spaCy extends the list).

spaCy's docs seem to recommend that users who just want to dumbly use its results, rather than training their own models, should ignore the tag_ attribute and use only the pos_ one, stating that the tag_ attributes...

are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent.

That is to say, if spaCy releases an improved model trained on a new treebank, the tag_ attribute may take different values from those it had before. This clearly makes it unhelpful for users who want a consistent API across version upgrades. However, since the current tags are a variant of Penn Treebank, they are likely to mostly intersect with the set described in any Penn Treebank POS tag documentation, like this: http://web.mit.edu/6.863/www/PennTreebankTags.html
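To make the relationship between the two tag sets concrete, here is a minimal sketch of a fine-to-coarse lookup. The table is a small hand-picked subset of Penn Treebank tags and is an assumption for illustration only; it is not spaCy's own tag map, and the names FINE_TO_COARSE and coarse_tag are hypothetical.

```python
# Illustrative, partial mapping from Penn Treebank tags (as seen in tag_)
# to Universal POS tags (as seen in pos_). NOT spaCy's actual tag map.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET",
    "PRP": "PRON",
    "CD": "NUM",
}


def coarse_tag(fine_tag, default="X"):
    """Return the coarse tag for a Treebank tag, or `default` if unlisted."""
    return FINE_TO_COARSE.get(fine_tag, default)


print(coarse_tag("VBD"))  # VERB
print(coarse_tag("DT"))   # DET
```

Several fine-grained tags collapse into one coarse tag, which is why pos_ is the easier attribute to rely on.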

The more useful pos_ tags are

A coarse-grained, less detailed tag that represents the word-class of the token

based upon the Google Universal POS Tag set. For English, a list of the tags in the Universal POS Tag set can be found here, complete with links to their definitions: http://universaldependencies.org/en/pos/index.html

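The list is as follows:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CONJ: coordinating conjunction (renamed CCONJ in Universal Dependencies v2)
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other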

However, we can see from spaCy's parts of speech module that it extends this schema with three additional POS constants, EOL, NO_TAG and SPACE, that are not part of the Universal POS Tag set. Of these:

  • From searching the source code, I don't think EOL gets used at all, although I'm not sure.
  • NO_TAG is an error code. If you try parsing a sentence with a model you don't have installed, all Tokens get assigned this POS. For instance, I don't have spaCy's German model installed, and I see this on my machine if I try to use it:

    >>> import spacy
    >>> de_nlp = spacy.load('de')
    >>> document = de_nlp('Ich habe meine Lederhosen verloren')
    >>> document[0]
    Ich
    >>> document[0].pos_
    ''
    >>> document[0].pos
    0
    >>> document[0].pos == spacy.parts_of_speech.NO_TAG
    True
    >>> document[1].pos == spacy.parts_of_speech.NO_TAG
    True
    >>> document[2].pos == spacy.parts_of_speech.NO_TAG
    True
    
  • SPACE is used for any spacing besides single normal ASCII spaces (which don't get their own token):

    >>> document = en_nlp("This\nsentence\thas      some weird spaces in\n\n\n\n\t\t   it.")
    >>> for token in document:
    ...   print('%r (%s)' % (str(token), token.pos_))
    ... 
    'This' (DET)
    '\n' (SPACE)
    'sentence' (NOUN)
    '\t' (SPACE)
    'has' (VERB)
    '     ' (SPACE)
    'some' (DET)
    'weird' (ADJ)
    'spaces' (NOUN)
    'in' (ADP)
    '\n\n\n\n\t\t   ' (SPACE)
    'it' (PRON)
    '.' (PUNCT)
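
Since SPACE tokens and untagged tokens (empty pos_) both turn up in ordinary output, it can help to filter them out before doing anything else. The following is a minimal sketch under stated assumptions: FakeToken is a hypothetical stand-in with just the text and pos_ attributes, so that the example runs without spaCy or a model installed, and content_tokens is an illustrative helper, not part of spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token; only the two attributes used here.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])


def content_tokens(tokens):
    """Drop whitespace tokens (SPACE) and untagged tokens (empty pos_)."""
    return [t for t in tokens if t.pos_ not in ("", "SPACE")]


tokens = [
    FakeToken("This", "DET"),
    FakeToken("\n", "SPACE"),
    FakeToken("sentence", "NOUN"),
    FakeToken("Ich", ""),  # what an untagged (NO_TAG) token looks like
]
print([t.text for t in content_tokens(tokens)])  # ['This', 'sentence']
```

The same function should work on a real Document, since it only reads the pos_ attribute of each token.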
    

Dependency tokens

As noted in the docs, the dependency tag scheme is based upon the ClearNLP project, and some documentation (unfortunately only in PDF form) of the tags can be found at http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf. That document lists these tokens:

  • ACOMP: adjectival complement
  • ADVCL: adverbial clause modifier
  • ADVMOD: adverbial modifier
  • AGENT: agent
  • AMOD: adjectival modifier
  • APPOS: appositional modifier
  • ATTR: attribute
  • AUX: auxiliary
  • AUXPASS: passive auxiliary
  • CC: coordinating conjunction
  • CCOMP: clausal complement
  • COMPLM: complementizer
  • CONJ: conjunct
  • CSUBJ: clausal subject
  • CSUBJPASS: clausal passive subject
  • DEP: unclassified dependent
  • DET: determiner
  • DOBJ: direct object
  • EXPL: expletive
  • HMOD: modifier in hyphenation
  • HYPH: hyphen
  • INFMOD: infinitival modifier
  • INTJ: interjection
  • IOBJ: indirect object
  • MARK: marker
  • META: meta modifier
  • NEG: negation modifier
  • NMOD: modifier of nominal
  • NN: noun compound modifier
  • NPADVMOD: noun phrase as adverbial modifier
  • NSUBJ: nominal subject
  • NSUBJPASS: nominal passive subject
  • NUM: numeric modifier
  • NUMBER: number compound modifier
  • OPRD: object predicate
  • PARATAXIS: parenthetical modifier
  • PARTMOD: participial modifier
  • PCOMP: complement of a preposition
  • POBJ: object of a preposition
  • POSS: possession modifier
  • POSSESSIVE: possessive modifier
  • PRECONJ: pre-correlative conjunction
  • PREDET: predeterminer (not used by spaCy - see below)
  • PREP: prepositional modifier
  • PRT: particle
  • PUNCT: punctuation
  • QUANTMOD: quantifier phrase modifier
  • RCMOD: relative clause modifier (not used by spaCy - relcl is used instead as noted below)
  • ROOT: root
  • XCOMP: open clausal complement

and also contains the actual linguistic definitions of these terms, complete with examples. However, as with part of speech tokens, spaCy doesn't quite adhere to the scheme it claims to adhere to. Looking in its symbols file, we can see that it defines a constant for each of the tokens above except PREDET, which spaCy doesn't use for some reason. Additionally, as noted in https://github.com/explosion/spaCy/issues/233, there are several dependency tokens that spaCy can emit that are included neither in the symbols file nor in the 2012 CLEAR documentation. These include acl, case, compound, dative, nummod, and relcl.

Fortunately, we can find at least brief descriptions of what these undocumented dependencies mean in code comments on the DEPTagEn interface inside the nlp4j (previously called ClearNLP) project, which spaCy uses to train its parser. For instance, the meanings of the tokens above:

  • acl - finite and non-finite clausal modifier
  • case - case marker
  • compound - compound nouns/numbers
  • dative - dative
  • nummod - numeric modifiers
  • relcl - relative clause modifiers
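
Because the lists above write labels in upper case while dep_ values such as 'dobj' come back in lower case, a case-insensitive lookup is convenient. This is a minimal sketch: the glossary is a deliberately partial copy of the descriptions in this answer, and DEP_GLOSSARY and describe_dep are hypothetical names, not anything spaCy provides.

```python
# Descriptions copied from the lists in this answer; deliberately partial.
DEP_GLOSSARY = {
    "dobj": "direct object",
    "nsubj": "nominal subject",
    "pobj": "object of a preposition",
    "acl": "finite and non-finite clausal modifier",
    "case": "case marker",
    "compound": "compound nouns/numbers",
    "dative": "dative",
    "nummod": "numeric modifiers",
    "relcl": "relative clause modifiers",
}


def describe_dep(label):
    """Look up a dependency label case-insensitively; None if unknown."""
    return DEP_GLOSSARY.get(label.lower())


print(describe_dep("dobj"))   # direct object
print(describe_dep("RELCL"))  # relative clause modifiers
print(describe_dep("foo"))    # None
```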

These admittedly aren't great descriptions, but at least they're something! The spaCy team is aware of the deficiencies of the documentation and working to fix it, so hopefully in a while we'll have better documentation all in one place.
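
In the meantime, one practical way to find labels you still need to look up is to tally the dep_ values a model emits on your own text and report the ones you don't recognise. The sketch below assumes a hypothetical FakeToken stand-in (so it runs without spaCy), a deliberately partial KNOWN_LABELS set, and hand-written labels rather than real parser output.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy Token; only dep_ is needed here.
FakeToken = namedtuple("FakeToken", ["text", "dep_"])

# Deliberately partial: extend with the labels you already understand.
KNOWN_LABELS = {"nsubj", "dobj", "det", "prep", "pobj", "punct", "root"}


def unknown_labels(tokens, known=KNOWN_LABELS):
    """Count dependency labels in `tokens` that are not in `known`."""
    counts = Counter(t.dep_.lower() for t in tokens)
    return {label: n for label, n in counts.items() if label not in known}


# Hand-written labels for illustration, not real parser output.
tokens = [
    FakeToken("I", "nsubj"),
    FakeToken("shot", "ROOT"),
    FakeToken("a", "det"),
    FakeToken("man", "dobj"),
    FakeToken("two", "nummod"),
    FakeToken("three", "nummod"),
]
print(unknown_labels(tokens))  # {'nummod': 2}
```

Labels are lower-cased before comparison so that differences in case don't produce false alarms.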