Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)


Adding new examples in your language

It is often useful to see examples of a phenomenon in your own language. We collect these examples for each language in a shared online spreadsheet and present them as in the template below:

  • MWEs with their lexicalized components in Arabic are indicated like this.
  • MWEs with their lexicalized components in Bulgarian are indicated like this.
  • MWEs with their lexicalized components in Czech are indicated like this.
  • MWEs with their lexicalized components in German are indicated like this.
  • MWEs with their lexicalized components in Greek are indicated like this.
  • MWEs with their lexicalized components in English are indicated like this.
  • MWEs with their lexicalized components in Spanish are indicated like this.
  • MWEs with their lexicalized components in Basque are indicated like this.
  • MWEs with their lexicalized components in Farsi are indicated like this.
  • MWEs with their lexicalized components in French are indicated like this.
  • MWEs with their lexicalized components in Irish are indicated like this.
  • MWEs with their lexicalized components in Hebrew are indicated like this.
  • MWEs with their lexicalized components in Hindi are indicated like this.
  • MWEs with their lexicalized components in Croatian are indicated like this.
  • MWEs with their lexicalized components in Hungarian are indicated like this.
  • MWEs with their lexicalized components in Indonesian are indicated like this.
  • MWEs with their lexicalized components in Italian are indicated like this.
  • MWEs with their lexicalized components in Japanese are indicated like this.
  • MWEs with their lexicalized components in Lithuanian are indicated like this.
  • MWEs with their lexicalized components in Maltese are indicated like this.
  • MWEs with their lexicalized components in Polish are indicated like this.
  • MWEs with their lexicalized components in Portuguese are indicated like this.
  • MWEs with their lexicalized components in Romanian are indicated like this.
  • MWEs with their lexicalized components in Slovene are indicated like this.
  • MWEs with their lexicalized components in Swedish are indicated like this.
  • MWEs with their lexicalized components in Turkish are indicated like this.
  • MWEs with their lexicalized components in Chinese are indicated like this.

Examples are preceded by the two-letter language code in parentheses (e.g. EN for English). You can control which languages are shown or hidden by toggling the header buttons. See the section on notation for more information.

To see the IDs of all examples, make sure the ID button in the header of the current page is toggled on. Now look at the template above. You should see this ID: 7.2_A_template-mwe. The 7.2 represents the current section number (in bold in the TOC on the left). The letter A (or B, C, D...) indicates the position of the example inside this page. The name template-mwe is a more human-readable identifier for this example.
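
The structure of an ID can be illustrated with a short Python sketch. It is not part of the guidelines or of any PARSEME tool: the names ExampleId and parse_example_id are hypothetical, and the sketch assumes that the three parts of an ID are always separated by underscores, as in 7.2_A_template-mwe.

```python
from typing import NamedTuple


class ExampleId(NamedTuple):
    section: str  # e.g. "7.2", the section number shown in the TOC
    order: str    # e.g. "A", the position of the example on the page
    name: str     # e.g. "template-mwe", the human-readable identifier


def parse_example_id(example_id: str) -> ExampleId:
    """Split an ID such as '7.2_A_template-mwe' into its three parts."""
    # Split on the first two underscores only, in case the name contains one.
    parts = example_id.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a valid example ID: {example_id!r}")
    return ExampleId(*parts)


print(parse_example_id("7.2_A_template-mwe"))
```

The three fields correspond to the section number, the position letter and the readable name described above.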

The spreadsheet

The spreadsheet can be accessed through this link to Google Docs. From time to time, the guidelines will be updated based on the contents of the spreadsheet.

The spreadsheet is divided into the following columns: ID-section, ID-order, ID-name, lang, HTML-example and Status. To edit an example, look at its ID and then find the corresponding rows in the spreadsheet. For example, for the ID 7.2_A_template-mwe, look for the rows with ID-section 7.2 in the first column (towards the bottom of the spreadsheet). Then look for ID-order A in the second column. Finally, check that the third column contains the ID-name template-mwe.
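
The same lookup can be sketched in Python for readers who prefer to work on a local copy. This is only an illustration under assumptions: it supposes that the spreadsheet has been exported to CSV with the six column names listed above as its header row, the two sample rows are stand-ins modelled on the template, and find_examples is a hypothetical helper.

```python
import csv
import io

# Illustrative stand-in for a CSV export of the spreadsheet (assumed layout).
SAMPLE_CSV = """ID-section,ID-order,ID-name,lang,HTML-example,Status
7.2,A,template-mwe,EN,MWEs with <lex>their lexicalized components</lex> in English are indicated like this.,
7.2,A,template-mwe,FR,MWEs with <lex>their lexicalized components</lex> in French are indicated like this.,
"""


def find_examples(csv_text, section, order, name):
    """Return a {lang: HTML-example} mapping for one example ID."""
    rows = csv.DictReader(io.StringIO(csv_text))
    wanted = (section, order, name)
    return {
        row["lang"]: row["HTML-example"]
        for row in rows
        # Match on the first three columns, exactly as in the manual lookup.
        if (row["ID-section"], row["ID-order"], row["ID-name"]) == wanted
    }


examples = find_examples(SAMPLE_CSV, "7.2", "A", "template-mwe")
print(sorted(examples))
```

Matching on all three ID columns mirrors the manual procedure: the section narrows the search, the order letter selects the example, and the name serves as a final check.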

You will then see a sequence of examples, one for each language. The examples in the template above were collected from this spreadsheet. The rest of this page will teach you how to add your own examples to this spreadsheet.

When adding examples for your own language, we advise you to always start by copying an example that has already been filled in for another language, and then adapting it to your language. Remember that you should not translate an example, but rather find an example of the target phenomenon in your language, regardless of whether it is a direct translation. Therefore, before entering an example in the spreadsheet, you should always check its context using its ID. A quick way to do this is to search (Ctrl+F) for the ID of the example in the full-text version of the guidelines (where the ID button is on).

If we notice something wrong or suspicious with your example, we may correct it (e.g. you forgot a closing <lex> tag). If we cannot correct the example, we will ask you to check it by using the last column of the spreadsheet, Status.
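
Before submitting an example, you may want to check yourself that every tag is properly closed. The following Python sketch shows one possible way to do this. It is not the check used by the guideline maintainers: is_well_formed is a hypothetical helper, and it assumes that the example text is valid XML once wrapped in a dummy root element (so a bare & or < character in the text would also be reported as an error).

```python
import xml.etree.ElementTree as ET


def is_well_formed(example: str) -> bool:
    """Return True if every tag in the example is properly closed and nested."""
    try:
        # Wrap the fragment in a dummy root so it parses as one XML document.
        ET.fromstring(f"<example>{example}</example>")
    except ET.ParseError:
        return False
    return True


print(is_well_formed("He will <lex>take</lex> a <lex>shower</lex>"))
print(is_well_formed("He will <lex>take a shower"))
```

The first call contains only matched tags, while the second is missing a closing </lex> tag and is therefore rejected by the parser.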

If you think that a phenomenon is not relevant for your language or that examples are not needed for a given phenomenon, just leave the corresponding cell empty.

Examples with tags

If you have not done so yet, open the spreadsheet and look for the entry 7.2_A_template-mwe. Let us analyse the English example (look for EN in the fourth column). The fifth column should read as follows:

MWEs with <lex>their lexicalized components</lex> in English are indicated like this.

As you can see, this is exactly the same text that was shown in the template above, except that the lexicalized components are surrounded by the tags <lex> and </lex>. When writing an example, you will often have to use XML tags. We describe below the most important ones.

Bold: You should surround lexicalized components with the tags <lex> and </lex>. For example, consider the code He will <lex>take</lex> a <lex>shower</lex>. This code is presented as follows:

  • He will take a shower

Red: By default, all examples are typeset using the language's color. Sometimes, examples contain counter-examples, that is, something that looks like a VMWE but that should not be annotated. The <nmwe> and </nmwe> tags can be used to represent these non-MWEs, which will be shown in red. For example, the code <nmwe>This is not an MWE</nmwe> yields the following:

  • This is not an MWE

Underlining: Some examples use underlining to focus on some of the words. This can be done with the tags <u> and </u>. For example, the code <nmwe>This is <u>not</u> an MWE</nmwe> yields the following:

  • This is not an MWE

Latin-script transcription: You can optionally provide a Latin-script transcription if your language does not use Latin characters. Latin-script transcriptions must be surrounded by the tags <latin> and </latin>. For example, the code الدرس <latin>ad-dars</latin> generates the example below. The Latin transcription should always appear after the example in the original script, and before glosses and translations.

  • الدرس ad-dars

Gloss icon: You should also provide English glosses and translations for your examples. Glosses and translations should always be provided in English, and never in another language. Glosses must be surrounded by the tags <gl> and </gl>. Translations must be surrounded by <trans> and </trans>. English examples can also use the tag <trans> to indicate the meaning of an idiomatic expression. For example, the code <lex>défendre</lex> son <lex>bifteck</lex> <gl>defend one's beefsteak</gl> <trans>to defend one's interests</trans> generates the example below. Notice that the gloss and translation are shown only when the user hovers over the gloss icon. For consistency, you should always follow this order: original text <latin>transcription (optional)</latin> <gl>the gloss</gl> <trans>the translation</trans>.

  • défendre son bifteck defend one's beefsteak to defend one's interests
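
The recommended order of transcription, gloss and translation can be checked with a small Python sketch. This is an illustration under assumptions rather than an official validator: follows_tag_order is a hypothetical helper, it only inspects top-level <latin>, <gl> and <trans> tags, it treats each <br/> as the start of a new example, and it does not check that the original text comes first.

```python
import xml.etree.ElementTree as ET

# Recommended order: transcription, then gloss, then translation.
EXPECTED_ORDER = ["latin", "gl", "trans"]


def follows_tag_order(example: str) -> bool:
    """Check that <latin>, <gl> and <trans> appear in the recommended order."""
    root = ET.fromstring(f"<example>{example}</example>")
    last_rank = -1
    for child in root:
        if child.tag == "br":
            # A line break starts a new example, so the order starts over.
            last_rank = -1
        elif child.tag in EXPECTED_ORDER:
            rank = EXPECTED_ORDER.index(child.tag)
            if rank < last_rank:
                return False
            last_rank = rank
    return True


french = (
    "<lex>défendre</lex> son <lex>bifteck</lex> "
    "<gl>defend one's beefsteak</gl> <trans>to defend one's interests</trans>"
)
print(follows_tag_order(french))
```

An example in which <trans> precedes <gl> would be rejected, whereas omitting the optional <latin> tag is accepted.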

Normal: Some examples are presented followed by an explanation, in normal font (black color). This is done by using the tags <n> and </n>. For example, the code some words <n>→ further details</n> generates this:

  • some words → further details

Newline: Sometimes, one may want to add several examples for a single phenomenon in the same language. If they are rather long, they should be presented on separate lines using the tag <br/>. This tag is special as it does not come in pairs: you only write one tag with the slash at the end (technically, it is an empty XML element). For example, the code example 1 <br/> example 2 <br/> example 3 will be rendered as follows:

  • example 1
    example 2
    example 3

Inside normal text, you may also use tags such as <i> (italics), <strong> (bold), as well as other HTML tags. If another language is using a given tag for an example, you can use it too. Otherwise, try to stick to the established conventions.