Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)

Lexicalized components and open slots

Just like a regular verb, the head verb of a VMWE may have a varying number of compulsory arguments, that is, arguments that must be present in each occurrence of this VMWE. For instance, the direct object and the prepositional complement are compulsory in the VMWE to take someone by surprise.

Some components of such compulsory arguments may be lexicalized, that is, always realized by the same lexemes. Here, by surprise is lexicalized while someone is not. This definition of a lexicalized component naturally extends to any syntactic type of MWE. Namely, the head of a (nominal, adjectival, prepositional, etc.) MWE is lexicalized (always realized by the same lexeme) together with at least one component of at least one of its modifiers. The head verb of a VMWE is always considered lexicalized. When it can be replaced by another verb, as in to make/take a decision, we consider that these are two different VMWEs, although possibly synonymous.

Conversely, a component of a compulsory argument which can be realized by a free lexeme taken from a relatively large semantic class is called an open slot. In the following VMWE examples (cited after Gross 1994), all having the same syntactic structure NP V NP Prep NP, the lexicalized arguments are highlighted in bold:

  • Max took the bull by the horns.
  • The news took John by surprise.
  • Bob took part in the inquiry.
  • Money burns a hole in Bob’s pocket.

Note on terminology: our definition of lexicalization applies to the component words of a VMWE, and not to the whole VMWE. This might be counter-intuitive, given the traditional definition of lexicalization as a diachronic process by which a lexeme (word or phrase) acquires the status of an autonomous lexical unit, that is, "a form which it could not have if it had arisen by the application of productive rules" (Bauer 1983, p. 50, apud Lipka et al. 2004, p. 6). In other words, traditionally, linguistic studies would use the term "lexicalized" to refer to the whole VMWE, as it has idiosyncratic behavior and thus must be listed in the language's lexicon. Our definition, however, stems from computational linguistics and in particular from the parsing literature, in which lexicalized rules are rules containing terminal lexemes attached to non-terminal symbols, and a lexicalized grammar is a grammar in which the rules are lexicalized (Manning and Schütze 1999, p. 417; Jurafsky and Martin 2009, p. 507). In this sense, we regard VMWEs as syntactic subtrees in which some of the nodes are annotated with the corresponding terminal symbols that are always realized by the same lexeme (i.e. the lexicalized components) and others are non-terminal nodes that can be realized by any lexeme taken from a larger class (i.e. the open slots).
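
The distinction between lexicalized components and open slots can be illustrated with a small Python sketch. It is not part of the guidelines or of any PARSEME tool: the names Component, matches and TAKE_BY_SURPRISE are hypothetical, and the pattern is flattened to a sequence of (category, lemma) pairs rather than a full syntactic subtree, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Component:
    """One node of a VMWE pattern (hypothetical, illustrative only)."""
    category: str                # e.g. "V", "NP", "Prep", "N"
    lemma: Optional[str] = None  # a lemma means lexicalized; None means open slot

    @property
    def is_lexicalized(self) -> bool:
        return self.lemma is not None


def matches(pattern: Sequence[Component],
            tokens: Sequence[Tuple[str, str]]) -> bool:
    """Check a flat sequence of (category, lemma) pairs against a pattern.

    Lexicalized components must match on category and lemma;
    open slots only constrain the category.
    """
    if len(pattern) != len(tokens):
        return False
    for comp, (category, lemma) in zip(pattern, tokens):
        if comp.category != category:
            return False
        if comp.is_lexicalized and comp.lemma != lemma:
            return False
    return True


# "to take someone by surprise": the verb and "by surprise" are lexicalized,
# the direct object is an open slot.
TAKE_BY_SURPRISE = (
    Component("V", "take"),
    Component("NP"),            # open slot (someone)
    Component("Prep", "by"),
    Component("N", "surprise"),
)
```

Under these assumptions, the sequence for "took John by surprise" and the one for "took Mary by surprise" both match the pattern, since the object is an open slot, while replacing the head verb lemma yields a non-match, in line with the rule that the head verb is always lexicalized.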

Special cases

Prepositions have a special status with respect to the notion of lexicalization. In the first, second and fourth example above, the prepositions by and in are lexicalized since they introduce lexicalized complements (the horns, surprise and pocket). However, in the third case the preposition in introduces an open slot whose meaning compositionally combines with the meaning of the VMWE took part. We say in this case that the preposition is selected by the VMWE, i.e. it belongs to the valency properties of the verb. Selected prepositions were discarded in edition 1.0 of the guidelines, and are now re-introduced experimentally and optionally via the inherently adpositional verbs (IAV). If the language team decides to take them into account, they are to be considered in the post-annotation step (step 4), i.e. when all other categories have previously been identified and categorized in the given sentence.

Reflexive clitics in inherently reflexive verbs and possessive pronouns in verbal idioms also have a special lexicalization status (see also the note on more or less frozen determiners). In some languages, the same reflexive clitic or possessive pronoun is used regardless of the person and number, inflecting for case only:

  • смея се laugh se.REFL to laugh
    намирам се find se.REFL to be (somewhere)
  • ??
  • n.a.
  • n.a.
  • n.a.
  • smijem se laugh.1.SG self I laugh
    smiješ se laugh.2.SG self You laugh
    smiju se laugh.3.PL self they laugh
  • n.a.
  • znajduję się find.1.SG.PRES self I find myself
    znajdujesz się find.2.SG.PRES self you find yourself
    znajdują się find.3.PL.PRES self they find themselves
    pójdą na swoje they will go on one's own they will establish their own household
    pójdziemy na swoje we will go on one's own we will establish our own household
  • n.a.
  • n.a.
  • smejim se laugh.1.SG self I laugh
    smejiš se laugh.2.SG self You laugh
    smejijo se laugh.3.PL self they laugh

In other languages, reflexive clitics and possessive pronouns agree with the subject and the verb:

  • No examples found for Bulgarian.
  • sie wundert sich she wonders self.3.SG she wonders
    ihr wundert euch you.PL wonder.2.PL self.2.PL you wonder
  • Ο Γιάννης έκανε την πλάκα του, Τα παιδιά έκαναν την πλάκα τους
  • I will do my best, They will do their best
  • yo me quejo I self.1.SG complain I complain
    te quejas you self.2.SG complain You complain
  • n.a.
  • je me trouve I self.1.SG find I find myself
    tu te trouves you self.2.SG find you find yourself
    je vide mon sac I empty my bag I express my secret feelings
    elle vide son sac she empties her bag she expresses her secret feelings
  • io mi meraviglio I self.1.SG wonder I wonder
    tu ti meravigli you self.2.SG wonder you wonder
  • eu me queixo I self.1.SG complain I complain
    tu te queixas you self.2.SG complain You complain
  • eu mă gândesc I Refl.Cl.1sg.Acc. think I am thinking
    tu te gândești you Refl.Cl.2sg.Acc. think you are thinking

In this case, the clitic or the pronoun is realized by different lexemes, depending on the number and gender. Strictly speaking, it is not lexicalized. However, we admit that, regardless of the language, the reflexive clitic and the possessive pronoun are each a unique lexeme (with lemma się, se, sich, etc. or swój, son, one's) inflecting for person and number. It is thus lexicalized in inherently reflexive verbs and verbal idioms.
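
One way to picture this convention is a lookup that maps inflected clitic forms to a single lemma before checking lexicalization. The sketch below is only an illustration: the table REFLEXIVE_LEMMA and the function clitic_lemma are hypothetical, the form lists are deliberately incomplete (they cover only forms appearing in the examples above plus the citation form), and the sketch ignores the fact that some of these forms also have non-reflexive uses.

```python
# Hypothetical, incomplete table: language code -> (lemma, known surface forms).
REFLEXIVE_LEMMA = {
    "fr": ("se", {"me", "te", "se"}),
    "de": ("sich", {"sich", "euch"}),
}


def clitic_lemma(lang: str, form: str) -> str:
    """Return the shared reflexive lemma for a surface form, if listed.

    Forms or languages missing from the table are returned lowercased
    and otherwise unchanged.
    """
    form = form.lower()
    entry = REFLEXIVE_LEMMA.get(lang)
    if entry is None:
        return form
    lemma, forms = entry
    return lemma if form in forms else form
```

With this assumed table, me in je me trouve and te in tu te trouves both map to the lemma se, so the clitic counts as one and the same lexeme across persons.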