Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)


Verbal multiword expressions versus collocations

Collocations are not considered VMWEs in this task and should not be annotated. However, the boundary between the two categories is not always easy to draw, so borderline cases should be handled with care.

We understand collocations as combinations of words whose idiosyncrasy is purely statistical. In other words, tokens in collocations tend to co-occur more often than expected by chance, but they show no substantial orthographic, morphological, syntactic or (most notably) semantic idiosyncrasy. This is how we distinguish collocations from MWEs.

Note that other authors understand collocations slightly differently. For Sag et al. (2002), collocations are any statistically significant co-occurrences, i.e. they include all forms of MWE. For Baldwin and Kim (2010), collocations form a proper subset of MWEs. According to Melcuk (2010), collocations are binary, semantically compositional combinations of words subject to lexical selection constraints, i.e. they intersect with what is here understood as MWEs.

Some combinations happen to be very frequent and are perceived as "frozen":

  • качвам цената raise the price
  • eine Frage beantworten to answer a question, die Graphik zeigt the graphic shows, einen Bus nehmen to take a bus
  • κάνω βόλτα take-1SG a walk
  • drastically drop
    the graphic shows
    to take a bus
  • responder a una pregunta to answer a question
    el gráfico muestra the graphic shows
    coger el autobús to take the bus
  • interesa agertu interest show to show interest
    galdera bati erantzun question one-to answer answer a question
    autobusa hartu bus take to take the bus
  • riješiti dvojbu to solve a dilemma, pripremati jelo to prepare a meal
  • rispondere a una domanda to answer a question
    il grafico mostra the graphic shows
    prendere un bus to take a bus
  • zalać rynek to flood the market to dominate the market
  • bater um recorde to break a record (bater to beat has a regular sense of to overcome in addition to the literal sense)
    entrar em cartaz enter into poster arrive in theaters (for a movie) (the MWE is em cartaz in poster in theaters, the verb just usually collocates with this MWE)
  • lua un autobuz take a bus
  • drastičen upad drastic drop, graf prikazuje graphic shows, vzeti taksi to take a taxi

However, applying regular lexical alternations to them does not markedly change their meaning:

  • вдигам цената raise the price, увеличавам цената raise the price, качвам залога raise the bet, качвам температурата raise the temperature
  • eine Anfrage beantworten to answer a request, das Diagramm zeigt the diagram shows, mit einem Bus fahren to go by bus
  • πάω βόλτα go for a walk
  • significantly drop, drastically decrease, the diagram shows, the graphic illustrates, to take a coach
  • responder a una petición to answer a request
    el diagrama muestra the diagram shows
    coger el tren to take the train
  • interesa erakutsi interest show to show interest →'erakutsi' and 'agertu' are synonyms in this context in Basque
    zalantza bati erantzun doubt one-to answer answer a doubt
    trena hartu train take to take the train
  • riješiti dilemu to solve a dilemma, pripremati obrok to prepare a meal
  • rispondere a una richiesta to answer a request
    il diagramma mostra the diagram shows
  • zdominować/zarzucić/zapełnić/nasycić rynek to dominate/overwhelm/fill/saturate the market
  • quebrar/bater/ultrapassar/estabelecer um recorde to break/beat/overcome/establish a record
    o recorde foi quebrado the record was broken
    entrar/estar/permanecer/ficar/continuar/ter em cartaz enter/be/remain/stay/continue/have in poster
  • lua o mașină take a car
  • občuten upad significant drop, drastično zmanjšanje drastic decrease, diagram prikazuje diagram shows, slika prikazuje picture shows

The difficulty of distinguishing collocations from VMWEs lies in the fact that some VMWEs also allow lexical variation:

  • нямам пукната пара/пукнат грош to not have a single penny, to be very poor
    имам твърда/дебела глава to have a thick head, to be stubborn and not listen to advice
  • einen Willen/Menschen brechen to break a will/person
  • to come in handy/useful, to stand firm/fast, to break someone's spirit/will, to take the cake/biscuit
  • dar un paseo/una vuelta give a walk / a turn to go for a walk
    darse/tomar una ducha give.self/take a shower take a shower
  • min eman/egin pain give/do to hurt (somebody)
    eskola/klasea eman class give to give a class →'eskola' and 'klasea' are synonyms in Basque
  • slomiti čiju/čiji volju/duh to break someone's will/spirit
  • cogliere/prendere di sorpresa to catch/take by surprise, dare/fornire un contributo to give/provide a contribution
  • zapisać się złotymi literami/zgłoskami to record itself with golden letters/syllables to be remembered and commemorated for a merit
    zamarznąć na kość/lód/sopel to freeze to bone/ice/icicle to freeze strongly
  • levar em conta/consideração take into account/consideration
    chutar o balde/pau da barraca to kick the bucket/the tent's stick to act irresponsibly
  • lua o decizie/hotărâre make a decision
  • imeti nekaj na voljo/razpolago to have something available/at disposal, odpreti nekomu pot/vrata to open a way/a door (for someone) to give someone an opportunity to do something

However, the extent of the vocabulary affected by this variability differs between collocations and VMWEs. A head verb in a collocation usually selects a whole semantic class for each of its required arguments. For instance, the verb to take, in the sense of using a vehicle to travel, selects the whole semantic class of means of transport. Similarly, the verb to drop can select a large set of adverbs describing degree: drastically/significantly/remarkably/slightly/reasonably drop. Conversely, lexical variability in a VMWE is limited to a closed list of lexemes, sometimes only loosely semantically related. For instance, the VMWEs to take the cake/biscuit and to stand firm/fast do not keep their idiomatic readings with semantically close complements: #to take the cookie/wafer, *to stand hard/rigid/solid etc. See also Test VID.2.
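The contrast between open-class and closed-list variability can be pictured with a small Python sketch. It is only an illustration, not part of the guidelines: the function names, the word lists (taken from the English examples above) and the 0.8 threshold are all assumptions, and the acceptability judgments are entered by hand rather than computed.

```python
# Toy illustration only: names, word lists and threshold are assumptions,
# not part of the annotation guidelines.

def variability_profile(accepted, candidates):
    """Share of semantically close candidates that keep the intended reading."""
    candidates = set(candidates)
    if not candidates:
        raise ValueError("candidates must not be empty")
    return len(set(accepted) & candidates) / len(candidates)


def looks_like_collocation(accepted, candidates, threshold=0.8):
    """Hypothetical heuristic: near-total coverage of a semantic class
    suggests a collocation; a small closed subset suggests a VMWE."""
    return variability_profile(accepted, candidates) >= threshold


# 'to take' + means of transport: every close candidate keeps the meaning.
transport = ["bus", "coach", "train", "taxi"]
take_vehicle = looks_like_collocation(transport, transport)

# 'to take the ...' (idiomatic reading): only a closed list survives.
sweets = ["cake", "biscuit", "cookie", "wafer"]
take_the_cake = looks_like_collocation(["cake", "biscuit"], sweets)

print(take_vehicle, take_the_cake)
```

Under these hand-entered judgments, to take + vehicle covers its whole candidate class, while to take the cake/biscuit covers only half of the semantically close complements. An annotator's decision should still rest on the linguistic tests, not on a numeric cutoff of this kind.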

Some light-verb constructions (LVCs) and multiverb constructions (MVCs) belong to the gray zone between MWEs and collocations, in the sense that some operator (light) verbs seem to select large classes of nouns, as in to make a speech/declaration/remark/etc. However, some studies (e.g. Bonial 2014) show that there is no such thing as a truly productive light verb (e.g. to give a look vs. to give a stare). Therefore, we do include LVCs and MVCs in our annotation scope.