Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.0 (2017)


Words and tokens

While the definition of an MWE inherently relies on the notion of a word, manual annotation and automatic identification of VMWEs in our task are performed on texts which are automatically tokenized. It is therefore important to understand the distinction between words and tokens in the context of VMWEs.

A word is a linguistically (notably semantically) motivated unit. The detection of words is, thus, language-dependent and annotation experts should have a clear idea of how to define it for their own language (even if this definition proves hard in general).

A token is a technical and pragmatic notion, defined according to more or less linguistically motivated clues and depending on the particular tokenization tool at hand.

Tokens should ideally be as close as possible to words. However, in practice, due to the difficulty of the (automatic) tokenization task, the relation between tokens and words is not always 1-to-1. The following cases occur:

  • A token coincides with a word:
    • вземам, решение, наяве, бял, на, се, д-р
    • mít, hlad, se, úžas
    • einen, Spaziergang, machen, Überraschung
    • κάνω, άνω, κάτω, ποδήλατο, καλός
    • take, a, walk, astonishment
    • dar, un, paseo, sorpresa
    • من، کتاب، دوست
    • faire, une, promenade, étonnement
    • napraviti/činiti, jedan, šetnja, začuđenost
    • tesz, egy, séta, meglepetés
    • fare, una, passeggiata, sorpresa
    • ferħ, libes, sabiħ
    • robić to do, na on, dokładność precision
    • dar, uma, caminhada, surpresa
    • face, o, plimbare
    • gå, på, promenad, förvåning
    • iti, na, en, sprehod, začudenost
  • Several tokens build up one word, as in abbreviations, possessive markers, words with "accidental" separators, inflected or derived forms of foreign names, etc. In this case we speak of a multitoken word (MTW). The pipe symbol '|' indicates token separation in these examples:
    • т|.|н|. etc.
      год|. year
    • z|.|B|. for instance
      Wie geht|'|s How goes it How are you
    • κ. κύριος Mister
      υπΔρ υποψήφιος διδάκτορας PhD candidate
    • M|. Mister
      pp|. pages
      Pandora|'|s
    • a|. |C|. antes de Cristo before Christ
      p|. |ej|. por ejemplo for instance
      Rte|. remitente sender
    • می|-|روم، آیت|-|الله، کتاب|-|ها
    • aujourd|'|hui today
    • danas today
    • időjárás|-|jelentés weather forecast
    • vice|-|presidente vice-president
    • libs|et she wore
    • Chomsky|'|ego of Chomsky
      SMS|-|ować to write an SMS
    • vice|-|presidente vice-president
    • prim|-|ministru prime minister
      d|-|voastră polite "you"
    • EU|:|s EU's
    • g|. Mister
      str|. pages
      le|-|to
  • One token can contain several words, as in contractions and compounds. In this case we speak of a multiword token (MWT). See also the representation of MWTs in Universal Dependencies. The precise word forms cannot always be straightforwardly deduced from the MWT containing them and vice versa, as in don't, della, du, etc.:
    • вагон-ресторант train carriage+restaurant train buffet
    • Schulaufgabe = Schule+Aufgabe school+exercise homework
      Apfelbaum = Apfel+Baum apple+tree apple tree
    • στου = σε+του at+the.GEN
      στον = σε+τον at+the.ACC
    • don't = do+not
    • del = de+el from the
      pelirrojo = pelo + rojo hair+red red-haired
    • کتابش=کتاب+ش
    • du = de+le from the
    • della = di+la of the
    • Białymstoku=Białym+stoku white+slope Białystok.INST (a city name)
      robiłem=robi+łem do.3.SG.PRES+be.1.SG.PAST.AGL I did
      żeśmy = że+śmy that+be.1.PL.AGL that-we
    • neles = em+eles on them
    • într-o = într-+o in a
    • arvsmassa = arv+massa genetic stock
    • nanj = na+njega on him
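
The three token-word relations listed above can be sketched in code. This is a minimal illustration, not part of the shared task tooling: the class Alignment and the helpers from_pipe_notation and from_plus_notation are hypothetical names, and the sketch only reads the '|' and '+' notations used in the examples of these guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Tokens and words covered by one surface string (hypothetical model)."""
    tokens: tuple
    words: tuple

    @property
    def kind(self) -> str:
        # Classify the relation by comparing the number of tokens and words.
        if len(self.tokens) == 1 and len(self.words) == 1:
            return "token = word"
        if len(self.tokens) > 1 and len(self.words) == 1:
            return "multitoken word (MTW)"
        if len(self.tokens) == 1 and len(self.words) > 1:
            return "multiword token (MWT)"
        return "other"


def from_pipe_notation(form: str) -> Alignment:
    """Read an MTW written with '|' token separators, e.g. "aujourd|'|hui"."""
    tokens = tuple(form.split("|"))
    # The word is the surface form with the separators removed.
    return Alignment(tokens=tokens, words=(form.replace("|", ""),))


def from_plus_notation(entry: str) -> Alignment:
    """Read an MWT written as 'token = word+word', e.g. "du = de+le"."""
    token, _, parts = entry.partition("=")
    words = tuple(w.strip() for w in parts.split("+"))
    return Alignment(tokens=(token.strip(),), words=words)


print(from_pipe_notation("aujourd|'|hui").kind)  # multitoken word (MTW)
print(from_plus_notation("du = de+le").kind)     # multiword token (MWT)
```

Note that the sketch takes the word forms of an MWT from the annotation itself; as stated above, they cannot in general be deduced from the token.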

While a VMWE always contains at least two words, the relation between VMWEs and tokens can be twofold:

  • A VMWE contains several tokens, whether each of them coincides with a word or not:
    • вземам решение make a decision (2 words, 2 tokens)
      прочитам от корица до корица to read from cover to cover (5 words, 5 tokens)
    • eine Rede halten (2 words, 2 tokens) a speech hold to give a speech
      wie geht's (2 words, 4 tokens) how goes it how are you
    • δίνω τον λόγο μου (3 words, 3 tokens) give the speech to promise
      παίζω στα δάχτυλα (3 words, possibly 4 tokens) play in-the fingers know very well
    • to take a walk (2 words, 2 tokens)
      to open Pandora's box (3 words, possibly 5 tokens)
    • dar un paseo (2 words, 2 tokens) to give a walk to take a walk
      dar por sentado (3 words, 3 tokens) to give for seated to take for granted
      irse de rositas (3 words, 4 tokens) to go_self of little_roses to get off scot free
    • دستور داد (2 words, 2 tokens)
    • napraviti šetnju (2 words, 2 tokens)
      otvoriti Pandorinu kutiju (3 words, 3 tokens)
    • sétát tesz to take a walk (2 words, 2 tokens)
    • tenere un discorso (2 words, 2 tokens) hold a speech to give a speech
      cavalcare l'onda (3 words, 4 tokens) ride the wave ride the wave
    • kien idur fuq il-fatt turns on the fact
    • robi z igły widły make.3.SG a pitchfork out of a needle he makes a mountain out of a molehill (4 words, 4 tokens)
      robiłem z igły widły made.3.SG.M1+be.1.SG.AGL a pitchfork out of a needle I made a mountain out of a molehill (4 words, 5 tokens)
    • dar uma caminhada to give a walk (2 words, 2 tokens)
      cair de pára-quedas to fall with parachute to arrive unprepared in the middle of a situation (3 words, possibly 5 tokens). Note: according to the new orthography rules, this word would be written 'paraquedas'; the old spelling may still be found in annotated texts.
      queixar-se-ia complain-self-would would complain (2 words, possibly 5 tokens)
    • a da ortul popii to die (3 words, 3 tokens)
    • hålla ett tal (2 words, 2 tokens) hold a speech to give a speech
    • klicati jelene to call deer to vomit (2 words, 2 tokens)
      vreči puško v koruzo throw a rifle in the corn to give up (4 words, 4 tokens)
  • A VMWE contains one (multiword) token:
    • no example found for Bulgarian
    • vorbereiten to pre-arrange to prepare
      anfangen at-catch to begin
    • έδωσα-πήρα gave-1SG took-1SG to manage
    • to pretty-print
    • suicidarse suicide_self to commit suicide
    • court-circuiter to short circuit
    • pripremiti unaprijed napraviti/urediti to prepare
    • kinyír out.cut to kill
    • corto-circuitare to short circuit
      suicidarsi suicide_self to commit suicide
    • no example found for Polish
    • queixar-se-ia complain-SELF-would would complain
    • a se-ndura RCLI.ACC-have.the.heart to have the heart
    • klargöra clear-make clarify
      påpeka on-point point out
    • pripraviti to pre-arrange to prepare

Note finally that multitoken words are not considered verbal MWEs since they contain one (multitoken) word only:

  • no example found for Bulgarian
  • ??
  • n.a.
  • maldecir bad say curse
    bienvivir well live to live in comfort
  • ricaricare to recharge
  • SMS-ować to write an SMS
  • pós-datar to post-date
  • a re-mpărți PREFIX-split to split again (with the aphaeresis of the sound 'î' in rapid speech; this is one word, multitoken)
  • SMS-jati to write an SMS
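
The word and token counts given with the examples above, and the exclusion of multitoken words, can be illustrated with a small sketch. The encoding is an assumption made for illustration only: each word records the indices of the tokens it covers, so that two words sharing a token index model a multiword token. The names Word, count_units and has_enough_words are hypothetical. The check covers only the requirement that a VMWE contains at least two words; it does not decide whether a candidate actually is a VMWE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    token_ids: tuple  # indices of the tokens this word covers (assumed encoding)


def count_units(words):
    """Return (number of words, number of distinct tokens) of a candidate."""
    n_tokens = len({i for w in words for i in w.token_ids})
    return len(words), n_tokens


def has_enough_words(words):
    """Necessary condition only: a VMWE contains at least two words."""
    return len(words) >= 2


# wie geht's: 2 words, 4 tokens (wie | geht | ' | s)
wie_gehts = [Word("wie", (0,)), Word("geht's", (1, 2, 3))]
# vorbereiten: 2 words inside 1 multiword token
vorbereiten = [Word("vor", (0,)), Word("bereiten", (0,))]
# SMS-ować: 1 multitoken word (SMS | - | ować), hence not a VMWE
sms_owac = [Word("SMS-ować", (0, 1, 2))]

for candidate in (wie_gehts, vorbereiten, sms_owac):
    print(count_units(candidate), has_enough_words(candidate))
# (2, 4) True
# (2, 1) True
# (1, 3) False
```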

Whenever the distinction between a word and a token is judged by a particular language team as hard to tackle, a possible option is to consider these two notions equivalent for the needs of this shared task.