Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)


Words and tokens

While the definition of an MWE inherently relies on the notion of a word, manual annotation and automatic identification of VMWEs in our task are performed on texts which are automatically tokenized. It is therefore important to understand the distinction between words and tokens in the context of VMWEs.

A word is a linguistically (notably semantically) motivated unit. The detection of words is, thus, language-dependent and annotation experts should have a clear idea of how to define it for their own language (even if this definition proves hard in general).

A token is a technical and pragmatic notion, defined according to more or less linguistically motivated clues and depending on the particular tokenization tool at hand. Note that the notion of a token is ambiguous in NLP. It can also mean an individual occurrence of a certain linguistic unit, as opposed to a type, i.e. the set of all surface realisations of a unit. In these guidelines, we refrain from using this second sense.

Tokens should ideally be as close as possible to words. However, in practice - due to the difficulty of the (automatic) tokenization task - the relation between tokens and words is not always 1-to-1. The following cases occur:

  • A token coincides with a word:
    • вземам, решение, наяве, бял, на, се, д-р
    • mít, hlad, se, úžas
    • einen, Spaziergang, machen, Überraschung
    • κάνω, μία, βόλτα, έκπληξη
    • take, a, walk, astonishment
    • dar, un, paseo, sorpresa, maldecir, bienvivir
    • ibilaldi, bat, egin, ezuste
    • من، کتاب، دوست
    • faire, une, promenade, étonnement
    • tóg, siúl, ionadh
    • napraviti, jedan, šetnja, začuđenost
    • tesz, egy, séta, meglepetés
    • mengambil, sebuah, berjalan, heran
    • fare, una, passeggiata, sorpresa
    • 取る, その, 歩く, 驚き
    • ferħ, libes, sabiħ
    • robić to do, na on, dokładność precision
    • dar, uma, caminhada, surpresa
    • face, o, plimbare
    • iti, na, en, sprehod, začudenost
    • gå, på, promenad, förvåning
    • 采取, 一个, 步行, 惊愕
  • Several tokens build up one word, as in abbreviations, possessive markers, words with "accidental" separators, inflected or derived forms of foreign names, etc. In this case we speak of a multitoken word (MTW). The pipe symbol '|' indicates token separation in these examples.
    • т|.|н|. etc.
      год|. year
    • z|.|B|. for instance
      Wie geht|'|s How goes it How are you
    • κ. κύριος Mister
      υπΔρ υποψήφιος διδάκτορας PhD candidate
    • M|. Mister
      pp|. pages
      Pandora|'|s
    • A|/|A|. a la atención de for the attention of
      a|/|f|. a favor in favor
      Rte|. remitente sender
    • etab|. eta abar and so on
    • می|-|روم، آیت|-|الله، کتاب|-|ها
    • aujourd|'|hui today
    • danas today
    • időjárás|-|jelentés weather forecast
    • vice|-|presidente vice-president
    • libs|et she wore
    • Chomsky|'|ego of Chomsky
      SMS|-|ować to write an SMS
    • vice|-|presidente vice-president
    • prim|-|ministru prime minister
      d|-|voastră polite "you"
    • g|. Mister
      str|. pages
      le|-|to
    • EU|:|s EU's
  • One token can contain several words, as in contractions and compounds. In this case we speak of a multiword token (MWT); see also the representation of MWTs in Universal Dependencies. The precise word forms cannot always be straightforwardly deduced from the MWT containing them and vice versa, as in don't, della, du, etc.
    • вагон-ресторант train carriage+restaurant train buffet
    • Schulaufgabe = Schule+Aufgabe school+exercise homework
      Apfelbaum = Apfel+Baum apple+tree apple tree
    • στου = σε+του at+the.GEN
      στον = σε+τον at+the.ACC
    • don't = do+not
    • del = de+el of the from/of the
      al = a+el to+the to the
      compárese = compare+se compare SE_PARTICLE be it compared
      suicidarse = suicidar+se suicide SELF to commit suicide
    • sudurluze = sudur+luze nose+long long-nosed
      jarleku = jar(ri)+leku sit+place seat
    • کتابش=کتاب+ش
    • du = de+le from the
    • sa = i+an in the
      b'fhearr = ba+fhearr be.COND better prefer
    • uzbrdo = uz+brdo uphill
    • della = di+la of the
    • Białymstoku=Białym+stoku white+slope Białystok.INST (a city name)
      robiłem=robi+łem do.3.SG.PRES+be.1.SG.PAST.AGL I did
      żeśmy = że+śmy that+be.1.PL.AGL that-we
    • neles = em+eles on them
    • într-o = într-+o in a
    • nanj = na+njega on him
    • arvsmassa = arv+massa genetic stock
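The three token-word relations listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the pipe notation ('|' between tokens) and the '+' notation (between the words of a multiword token) are borrowed from the examples above, and the number of words is supplied by the annotator rather than computed, since word detection is language-dependent.

```python
def split_tokens(form: str) -> list[str]:
    """Split a form written in pipe notation, e.g. "Pandora|'|s", into tokens."""
    return form.split("|")


def split_words(decomposition: str) -> list[str]:
    """Split a decomposition written with '+', e.g. "de+el", into words."""
    return decomposition.split("+")


def token_word_relation(n_tokens: int, n_words: int) -> str:
    """Name the relation between a span's tokens and the words it contains."""
    if n_tokens < 1 or n_words < 1:
        raise ValueError("a span must contain at least one token and one word")
    if n_tokens == 1 and n_words == 1:
        return "token coincides with word"
    if n_words == 1:
        return "multitoken word (MTW)"
    if n_tokens == 1:
        return "multiword token (MWT)"
    return "several tokens, several words"


# "walk": one token, one word
print(token_word_relation(len(split_tokens("walk")), 1))
# "Pandora|'|s": three tokens, one word
print(token_word_relation(len(split_tokens("Pandora|'|s")), 1))
# "del = de+el": one token, two words
print(token_word_relation(1, len(split_words("de+el"))))
```

The actual number of tokens always depends on the tokenizer used for a given corpus, so the pipe-notation strings here stand for one possible tokenization only.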

While a VMWE always contains at least two words, the relation between VMWEs and tokens can be twofold:

  • A VMWE contains several tokens, whether each of them coincides with a word or not:
    • вземам решение make a decision (2 words, 2 tokens)
      прочитам от корица до корица to read from cover to cover (5 words, 5 tokens)
    • eine Rede halten (2 words, 2 tokens) a speech hold to give a speech
      wie geht's (2 words, 4 tokens) how goes it how are you
    • δίνω τον λόγο μου (3 words, 3 tokens) give the speech to promise
      παίζω στα δάχτυλα (3 words, possibly 4 tokens) play in-the fingers know very well
    • to take a walk (2 words, 2 tokens)
      to open Pandora's box (3 words, possibly 5 tokens)
    • dar un paseo (2 words, 2 tokens) to give a walk to take a walk
      dar por sentado (3 words, 3 tokens) to give for seated to take for granted
      irse de rositas (3 words, 4 tokens) to go_self of little_roses to get off scot free
    • ibilaldia egin (2 words, 2 tokens)
    • دستور داد (2 words, 2 tokens)
    • b'fhearr liom (2 words, 4 tokens) I would prefer
    • dignuti ruke to raise hands to give up (2 words, 2 tokens)
      otvoriti Pandorinu kutiju open Pandora's box to face with problems (3 words, 3 tokens)
    • sétát tesz to take a walk (2 words, 2 tokens)
    • tenere un discorso (2 words, 2 tokens) hold a speech to give a speech
      cavalcare l'onda (3 words, 4 tokens) ride the wave ride the wave
    • kien idur fuq il-fatt turns on the fact
    • robi z igły widły make.3.SG a pitchfork out of a needle he makes a mountain out of a molehill (4 words, 4 tokens)
      robił|em z igły widły made.3.SG.M1+be.1.SG.AGL a pitchfork out of a needle I made a mountain out of a molehill (4 words, 5 tokens)
    • dar uma caminhada to give a walk (2 words, 2 tokens)
      cair de pára-quedas to fall with parachute to arrive unprepared in the middle of a situation (3 words, possibly 5 tokens). Note: according to the new orthography rules, 'pára-quedas' would be written 'paraquedas'. The old spelling may still be found in annotated texts, though.
      queixar-se-ia complain-self-would would complain (2 words, possibly 5 tokens)
    • a da ortul popii to die (3 words, 3 tokens)
    • klicati jelene to call deer to vomit (2 words, 2 tokens)
      vreči puško v koruzo throw a rifle in the corn to give up (4 words, 4 tokens)
    • hålla ett tal (2 words, 2 tokens) hold a speech to give a speech
  • A VMWE contains one (multiword) token:
    • no example found for Bulgarian
    • vorbereiten to pre-arrange to prepare
      anfangen at-catch to begin
    • έδωσα-πήρα gave-1SG took-1SG I tried hard
    • to pretty-print
    • suicidarse suicide_self to commit suicide
    • n.a.
    • court-circuiter to short circuit
    • pripremiti to pre-arrange to prepare
    • kinyír out.cut to kill
    • corto-circuitare to short circuit
      suicidarsi suicide_self to commit suicide
    • no example found for Polish
    • queixar-se-ia complain-SELF-would would complain
    • a se-ndura RCLI.ACC-have.the.heart to have the heart
    • pripraviti to pre-arrange to prepare
    • klargöra clear-make clarify
      påpeka on-point point out
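The word and token counts given with the multi-token examples above can be reproduced with a small sketch. It assumes, hypothetically, that each counted word of a VMWE is written in the pipe notation used earlier ('|' between the tokens of one word); the function name is illustrative only, and the resulting token count holds for one particular tokenization, which is why some counts above are marked "possibly".

```python
def count_words_and_tokens(words: list[str]) -> tuple[int, int]:
    """Return (number of words, number of tokens) for a VMWE.

    Each element of `words` is one word in pipe notation, so a multitoken
    word such as "Pandora|'|s" contributes one word but three tokens.
    """
    n_words = len(words)
    n_tokens = sum(len(word.split("|")) for word in words)
    return n_words, n_tokens


# to take a walk: 2 counted words, 2 tokens
print(count_words_and_tokens(["take", "walk"]))
# to open Pandora's box: 3 words, 5 tokens under this tokenization
print(count_words_and_tokens(["open", "Pandora|'|s", "box"]))
# wie geht's: 2 words, 4 tokens
print(count_words_and_tokens(["wie", "geht|'|s"]))
```

This sketch does not cover a VMWE made of a single multiword token, such as suicidarse, where the token count is 1 and the words have to be recovered from the token itself.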

Note finally that multitoken words are not considered verbal MWEs since they contain one (multitoken) word only:

  • no example found for Bulgarian
  • ??
  • αερολογώ air+talk to talk aimlessly
  • n.a.
  • odolustu blood+empty to bleed
  • SMS-ati to write an SMS
  • anteporre to put + in front of
  • SMS-ować to write an SMS
  • pós-datar to post-date
  • a binedispune well-dispose to cheer up
  • SMS-jati to write an SMS

Whenever the distinction between a word and a token is judged by a particular language team as hard to tackle, a possible option is to consider these two notions equivalent for the needs of this shared task.