Shared task on automatic identification of verbal MWEs - edition 1.0 (2017)
Frequently Asked Questions (FAQ)
Annotators often face questions and challenging examples. When several annotators ask the same question, we will update the list of frequently asked questions.
However, we suggest that language teams set up another communication platform to deal with questions that are specific to a language. This can take the form of a shared online document, a wiki, a dedicated bug tracking system or a mailing list. We also suggest keeping track of decisions taken concerning borderline examples (with a list of expressions to which each decision applies). These should be kept in a centralized document or page that all annotators can access.
Whenever you think that a question can also be interesting to other languages, please notify the organizers and we will try to update this page.
- How to define an unexpected change in meaning?
- How to annotate lexicalized words which belong to contractions and compounds?
- How to annotate coordinated VMWEs sharing some components?
- How to annotate elliptical occurrences of VMWEs?
- How to annotate VMWEs that seem to belong to more than one category?
- How to annotate embedded VMWEs?
- Are existential expressions with there is/are considered VMWEs?
- How to categorize VMWEs which seem LVCs but do not pass all LVC tests?
- Why are verb+noun constructions with pure operator verbs (to commit, to make, to have etc.) considered LVCs?
- Does the IReflV category include verbs with non-reflexive clitics?
- Should nominalizations of VMWEs be annotated?
- How to express hesitation between different VMWE categories?
- In test 9, how can one decide whether an abstract noun is an event or a state?
- How does one decide if a more or less frozen determiner is a lexicalized VMWE component?
- Should I annotate compound and serial verbs as VMWEs? Of which category?
- If an LVC contains a complex (fixed) NP as a dependent, should I include the whole NP or just the head?
Check the glossary entry that defines unexpected change in meaning.
In some languages prepositions, clitics and determiners are subject to contractions (i.e. they yield multiword tokens, MWTs). Tokenizers might not handle contraction splitting properly. In this case, a lexicalized component of a VMWE can be merged with an external word:
- haberse suicidado have+REFL suicided committed suicide
A similar problem occurs in languages with productive compounding, where a lexicalized component of a VMWE and a free modifier can build up a multitoken word (since compound splitting might not be a standard feature of a tokenizer):
unter Drogeneinfluss stehen to be under the influence of drugs
Heisshunger haben to have hot hunger to be ravenously hungry
Since the current annotation format is token-based, we prohibit correcting tokenization errors and compound splitting by the annotators for the sake of coherence. Therefore the annotation of such contractions and compounds finds no fully satisfactory solution in our schema. We propose to annotate a whole MWT each time it contains a word which is part of a VMWE. Annotators should add a textual comment about the mixed status of this MWT:
Drogeneinfluss → MWT containing a lexicalized VMWE component Einfluss and an external word Drogen
Heisshunger → MWT containing a lexicalized VMWE component Hunger and an additional modifier heiss
- haberse → MWT containing a lexicalized VMWE component se and an external word haber
A component shared by two or more coordinated VMWEs should be annotated as belonging to both of them.
- Regeln und Richtlinien aufstellen to set up rules and guidelines to draw up rules and guidelines → aufstellen must be annotated both as part of Regeln aufstellen to lay down rules and of Richtlinien aufstellen to draw up guidelines
- to have a walk or a ride → have must be annotated both as part of to have a walk and of to have a ride
- odprawić mszę i pokutę celebrate a mass and a penance → odprawić should be annotated both as part of odprawić mszę to celebrate a mass and of odprawić pokutę to celebrate a penance
- imeti dober želodec in dobre živce to have a good stomach to bear something well and good nerves to be mentally strong → imeti have must be annotated both as part of imeti dober želodec and of imeti dobre živce
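The shared-component rule above can be pictured as a token that carries the identifiers of every VMWE it belongs to. The following is only a minimal sketch: the `Token` class, the `annotate` helper and the numeric VMWE identifiers are hypothetical names chosen for illustration, not part of the official annotation format or tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    form: str
    # Identifiers of all VMWEs this token belongs to (hypothetical scheme).
    vmwe_ids: List[int] = field(default_factory=list)


def annotate(tokens, vmwe_id, indices):
    """Mark the tokens at the given positions as components of one VMWE."""
    for i in indices:
        if vmwe_id not in tokens[i].vmwe_ids:
            tokens[i].vmwe_ids.append(vmwe_id)


tokens = [Token(w) for w in "to have a walk or a ride".split()]
annotate(tokens, 1, [1, 3])  # have ... walk
annotate(tokens, 2, [1, 6])  # have ... ride

for t in tokens:
    print(t.form, t.vmwe_ids)
```

Here the shared verb have ends up with both identifiers, while walk and ride each carry only their own.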
Such hesitation issues should normally be solved by the decision trees 1 and 2. For instance, consider the German expression
Similarly, the French expression
Candidate VMWEs embedded in other VMWEs should be annotated only if they have a VMWE status also outside the particular context. For instance, the VMWE to
On the other hand, the French expression
Hesitations about a possible LVC status can arise with respect to existential constructions with nouns introducing events or properties (see test 9 [N-EVENT]) as in:
- es gibt Beschwerden there are complaints
- there are complaints
- il existe des plaintes it there has complaints there are complaints
- há queixas has complaints there are complaints
- imeti pripombe have complaints there are complaints
Namely, the noun keeps its original sense and the existential verb to be or to have brings no additional meaning. However, a candidate LVC must also pass test 12 [V-REDUC]. This requires the modification of the noun by the verb's subject, which is impossible with impersonal and empty subjects like there. Therefore, such candidates cannot be LVCs.
Note, however, that existential expressions themselves can be VMWEs of type ID. For instance, in the French example
If at least one of the five LVC tests (9 to 13) is not passed, the candidate is not considered an LVC. For the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we admit a definition of an LVC which might seem more restrictive than some linguistic studies usually assume. Thus, we exclude from the LVC scope:
- expressions in which the verb's syntactic subject is not necessarily the noun's semantic subject, like
to give courage or to make an impression. These candidates do not pass test 12 [V-REDUC].
- expressions where the lexicalized nominal dependent of the verb is its subject, as in
the problem lies in something; these candidates do not pass test 12 [V-REDUC].
- expressions with aspectual verbs, as in
to start, to pursue, to stop a walk. These do not pass test 11 [V-LIGHT] since they add (aspectual) semantics to the noun. The only exception is when the noun itself is already aspectual, as in to come into bloom.
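The rule that a candidate must pass all five LVC tests can be sketched as a simple conjunction. This is an illustration only: the function name and the dictionary layout are assumptions, and the test outcomes shown for to give courage are placeholders, except for the failure of test 12, which is the one stated above.

```python
# Test numbers taken from the guidelines (tests 9 to 13).
LVC_TESTS = (9, 10, 11, 12, 13)


def is_lvc_candidate(results):
    """Return True only if every LVC test is passed.

    `results` maps a test number to True (passed) or False (failed).
    """
    missing = [t for t in LVC_TESTS if t not in results]
    if missing:
        raise KeyError(f"missing results for tests {missing}")
    return all(results[t] for t in LVC_TESTS)


# Placeholder outcomes; only the failure of test 12 comes from the text.
give_courage = {9: True, 10: True, 11: True, 12: False, 13: True}
print(is_lvc_candidate(give_courage))  # False
```

A single failed test is enough to reject the LVC hypothesis, which is what keeps the categorization deterministic.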
Pure operator verbs, i.e. such verbs which never have any semantics per se but only carry the grammatical (tense, mood etc.) information, seem to contradict the intuition behind a VMWE. Namely, they usually select a whole semantic class of nouns. For instance to commit selects any negative act (a crime, a suicide, a theft) and to perform selects any activity (a task, an experiment, a miracle). In this sense, their complements resemble open slots and the whole combinations resemble collocations. However, for the sake of a deterministic VMWE categorization and higher inter-annotator agreement, we do include verb+noun combinations with pure operator verbs, such as to
We could have organized decision tree 1 differently and excluded such cases from the VMWE scope by eliminating the LVC hypothesis. Then, to
No, the IReflV category only includes (some) combinations of a head verb with a reflexive clitic. As indicated in the borderline cases page of the IReflV category, other pronouns, whenever lexicalized, trigger the ID category. Recall that whenever more than one dependent of the verb is lexicalized (whether or not these include a reflexive clitic), the VMWE is always categorized as an ID.
- sich Fragen stellen SELF questions put to doubt
- s'en aller SELF of-there go to leave
- ucvreti jo to escape her to escape something/someone by running
The only nominal VMWE variants within our annotation scope are those:
- headed by the gerund stemming from the head verb of the VMWE -
taking of the decision, and
- in which a noun stemming from a VMWE is modified by a participle or a relative clause headed by the verb stemming from the same VMWE - the
decisions taken yesterday, the decision which he took.
Other nominalizations are excluded:
- Wortbruch word-break a promise which has not been kept
- a break-down, a forget-me-not
- la prise en compte the taking into account the fact of taking something into account, peut-être may-be maybe, porte-feuilles carry-sheets wallet
- zabawa czyimś kosztem a play at someone else's expense derived from bawić się czyimś kosztem to enjoy oneself at someone else's expense
- šala na tuj račun a joke at someone else's expense derived from šaliti se na tuj račun to play a joke on someone
For practical reasons (e.g. compatibility with an existing annotation, or usefulness for a particular application), they can be considered language-specific VMWEs, but then a new category should be defined for them, so as to keep the universal and the quasi-universal categories intact.
Once identified in a text, each VMWE is to be assigned to exactly one category. Note that in this version of the guidelines we no longer admit "hesitation labels" (e.g. LVC/ID) used in the pilot annotation. Hesitation can, however, be expressed in a comment and a particular value of the annotator's confidence assigned to a particular VMWE occurrence.
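As a rough illustration of this constraint, an annotation record can hold exactly one category, with hesitation moved to a confidence value and a free-text comment. Everything in this sketch is an assumption made for illustration: the class name, the field names, the 0 to 1 confidence scale, the example hesitation, and the category set, which lists only the categories named on this page.

```python
from dataclasses import dataclass

# Only the categories mentioned on this page; the full inventory is
# defined in the guidelines, not here.
CATEGORIES = {"ID", "LVC", "IReflV"}


@dataclass
class VMWEAnnotation:
    components: tuple
    category: str
    confidence: float = 1.0  # assumed scale: 0.0 (unsure) to 1.0 (sure)
    comment: str = ""

    def __post_init__(self):
        # A hesitation label such as "LVC/ID" is not a single category.
        if self.category not in CATEGORIES:
            raise ValueError(f"not a single known category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie between 0.0 and 1.0")


ann = VMWEAnnotation(
    components=("make", "decision"),
    category="LVC",
    confidence=0.6,
    comment="hesitating between LVC and ID",
)
print(ann)
```

Constructing the record with a pilot-style label such as "LVC/ID" raises a `ValueError`, mirroring the fact that hesitation labels are no longer admitted.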
The goal of test 9 is to identify whether a noun is predicative, that is, whether it requires some semantic arguments. We talk about events and states to circumvent the question of whether a noun is predicative. Here, they are understood very broadly, as roughly corresponding to binary and unary predicates. For instance, we consider that an event is something that happens, and can be related to an action, activity, process or phenomenon. A state is understood as a property that may or may not change over time, including feelings, sensations, permanent and temporary properties and relations between entities. These are very generic definitions that go far beyond the scope of what is commonly understood as an event or state.
While it is hard to define necessary tests for identifying a predicative noun, there are some useful clues (sufficient criteria) that can be used for abstract nouns.
Verb paraphrase: Is the abstract noun derivationally related to a verb with the same semantics?
John makes a decision = John decides
John has a walk = John walks
Adjective paraphrase: Is the abstract noun derivationally related to an adjective with the same semantics?
John has courage = John is courageous → and, more generally, characteristics and attributes
John has hunger/thirst = John is hungry/thirsty → and, more generally, physical sensations
John has passion/fear/anger = John is passionate/afraid/angry → and, more generally, feelings and emotions
John has problems/difficulties = Something is problematic/difficult for John → and, more generally, states
Synonym verb or adjective paraphrase: Does the abstract noun have a synonym/hypernym derivationally related to a verb or adjective with the same semantics?
John and Mary reach a consensus = John and Mary agree → consensus has no corresponding verb or adjective, but agreement is a synonym
John has a chance to do something = John is likely to do something → chance has no corresponding verb or adjective, but likelihood is a synonym
For many classes of abstract nouns, it can be tricky to apply the tests above. We advise listing in a separate document those classes of nouns that pass test 9 in your language. We suggest considering that the following categories pass test 9:
Illnesses, symptoms and health conditions:
John has a flu = John is ill (illness is a hypernym of flu)
John has contact with somebody = John contacts somebody
John has an affair with somebody = John is involved with somebody (involvement is a synonym of affair)
Mental content (internal to a cognizer):
John has a worry = John worries
John has an idea = John thinks (thought is a synonym of idea)
John has an opinion = John believes (belief is a synonym of opinion)
Please notice that events and states that have no semantic arguments do not pass test 9, even if they have verbal/adjectival paraphrases:
Natural phenomena: rain, snow, tornado, flood, earthquake
Informational content (external to a cognizer): information, news
Finally, notice that not every verb + predicative noun combination forms an LVC. Additionally, the verb needs to be "light", that is, it must not add semantics to the noun. The remaining LVC tests (tests 10 to 13) guarantee this.
Most of the time, it is easy to test whether a determiner is lexicalized by searching alternatives in corpora (or on the web). For instance, the is lexicalized in to kick the bucket because searches for other determiners (this, a, some, three, many, etc.) either do not return any result or return only literal uses of this verb phrase.
However, borderline cases do exist, in which alternatives are rare but possible, especially for LVCs and decomposable IDs. For instance, while the standard form of the idiom spill the beans forbids some determiners (#spill three/twenty beans), it is possible to find some variation (spill these/many/all/my/his/more/no beans).
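The corpus search described above can be approximated by counting which words fill the determiner slot of a candidate expression. This is a minimal sketch under several assumptions: the function name and the regular expression are hypothetical, the sentences are invented toy data rather than corpus results, and the pattern cannot tell literal from idiomatic readings, so every hit still needs manual inspection.

```python
import re
from collections import Counter


def determiner_counts(sentences, pattern):
    """Count the words captured in group 1 (the determiner slot)."""
    counts = Counter()
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts


# One word between an inflected form of "spill" and "beans".
SPILL_BEANS = re.compile(r"\bspill(?:s|ed|ing)?\s+(\w+)\s+beans\b", re.IGNORECASE)

# Invented toy sentences, for illustration only.
toy_sentences = [
    "He spilled the beans about the merger.",
    "She would never spill these beans.",
    "They spilled all beans in the interview.",
    "Nobody spilled the beans.",
]

counts = determiner_counts(toy_sentences, SPILL_BEANS)
print(counts.most_common())
```

A distribution dominated by a single determiner suggests lexicalization, whereas several well-attested alternatives suggest that the determiner should be left out of the annotated VMWE.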
We argue that the selection of some determiners (but not all) by a VMWE is comparable to selected prepositions for verbs. Thus, it can be seen as a regular grammatical phenomenon, suggesting that when the determiner varies, then it should not be included. In some VMWEs, though, determiner variation may be considered as marginal and/or incorrect, which means that it should be included in the scope of the annotated VMWE.
In short, determiners can exhibit limited variability. As a consequence, each language team should document its decisions as to whether to include them or not for particular VMWE classes, to ensure consistency.
avoir la pêche have the peach to have much energy
avoir de la chance have some luck to be lucky
avoir l'occasion to have the opportunity
After annotation, we suggest that language leaders (LLs) use the provided analysis scripts to detect inconsistencies in the annotation of the same VMWE (e.g., including or not a determiner). They can then take an arbitrary decision and homogenise all annotated occurrences.
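To make the idea of such a consistency check concrete, here is a minimal sketch. It is not the provided analysis script: the function name, the (lemma, part-of-speech) input layout and the "DET" tag are assumptions, and the annotated occurrences are invented toy data.

```python
from collections import defaultdict


def find_determiner_inconsistencies(occurrences):
    """Group annotated VMWEs by their non-determiner lemmas and report
    groups that were annotated both with and without a determiner."""
    variants = defaultdict(set)
    for occurrence in occurrences:
        key = tuple(lemma for lemma, pos in occurrence if pos != "DET")
        variants[key].add(tuple(lemma for lemma, _ in occurrence))
    return {key: forms for key, forms in variants.items() if len(forms) > 1}


# Invented toy annotations: each VMWE is a list of (lemma, POS) pairs.
annotated = [
    [("avoir", "VERB"), ("le", "DET"), ("pêche", "NOUN")],
    [("avoir", "VERB"), ("pêche", "NOUN")],
    [("avoir", "VERB"), ("occasion", "NOUN")],
]

inconsistent = find_determiner_inconsistencies(annotated)
for key, forms in inconsistent.items():
    print(key, sorted(forms))
```

Only avoir ... pêche is reported here, because it was annotated once with and once without its determiner; the decision on which variant to keep remains with the language team.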
It depends. Most of the languages currently covered by the shared task do not have this kind of verb. The guidelines were written with these languages in mind, so they are not clear about compound verbs.
In many Indo-European languages (including Germanic, Romance and Balto-Slavic families), verbal chains using auxiliary and modal verbs are used to express tense, modality and aspect. This is a regular linguistic phenomenon that can be applied to any verb and should not be annotated.
On the other hand, some languages like Maltese have many compound verbs that do not necessarily express tense, mood and modality. We suggest that, when the verb combinations regularly combine with any other verb adding a given meaning, they should not be annotated. Future versions of these guidelines should study the need for a new category for compound verbs, in order to cover this phenomenon.
In short, verbal chains should only be annotated as ID when they are idiomatic:
laisser tomber let fall to give up
vouloir dire want say to mean
faire tomber make fall to drop
vouloir changer want change to want to change
- dak x'mar jgħid ilbieraħ that (person) what'he-went he-says yesterday what the hell did he say yesterday
querer dizer want say to mean
querer falar want speak to want to speak
The guidelines determine that only lexicalized components should be annotated. Therefore, we suggest that, in such cases, if the NP is compositional, only the head of the NP is included in the scope of the LVC. This may lead to the annotation of odd LVCs that never actually occur without a modifier. This is not a problem and is already the case for other VMWEs, e.g. those that only occur with a determiner although the determiner is not lexicalized. The NP should be included as a whole only if the complement is a non-compositional MWE, so that it would not make any sense to annotate only the head.
παίζω το χαρτί του ευρωσκεπτικισμού to-play the paper the.SG.GEN euroscepticism.SG.GEN to use the asset of euroscepticism, to use euroscepticism as an asset
κάνω στάση εργασίας to-make stop work.SG.GEN to go on strike, to strike → the expression στάση εργασίας is non-compositional (term)
présenter un Syndrome Coronairien Aigu to present an acute coronary syndrome
mener une vie de débauche to have a life of pleasures
faire un faux pas make a false step to commit a faux pas → the expression faux pas is non-compositional
- mieć wyrzuty sumienia to have reproaches of the conscience to feel guilty
fazer uma sessão de fotos/autógrafos to make a photo/autograph session
fazer roleta russa to make russian roulette to play russian roulette → the expression roleta russa is non-compositional
ter uma situação financeira/profissional/estável to have a financial/professional/stable situation
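The scope rule illustrated by these examples can be summarized in a few lines. This is a sketch only: the function and parameter names are hypothetical, and whether an NP is non-compositional is a judgement made by the annotator, passed in here as a flag rather than computed.

```python
def lvc_scope(verb, fixed_np, head, noncompositional):
    """Return the components to annotate for an LVC with a complex NP.

    fixed_np: tokens of the fixed expression (without free determiners).
    head: the head noun of the NP.
    noncompositional: annotator's judgement about the NP.
    """
    if noncompositional:
        # The whole fixed expression is lexicalized.
        return [verb] + list(fixed_np)
    # Compositional NP: only the head is lexicalized.
    return [verb, head]


print(lvc_scope("fazer", ["sessão", "de", "fotos"], "sessão", False))
print(lvc_scope("fazer", ["roleta", "russa"], "roleta", True))
```

With these inputs, the first call keeps only fazer and sessão, while the second keeps fazer together with the whole expression roleta russa.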
Notice that these suggestions also apply to LVCs whose nominal complements are introduced by prepositions (i.e. verb+PP LVCs). As usual, the preposition should be included if it is lexicalized and then the NP introduced by the preposition is analyzed exactly as described above.
If the complex dependent is an acronym, you may want to add the textual comment "PART" to indicate that only part of the full version is lexicalized (generally, the head), just like for contractions and compounds.