Annotation guidelines
PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)

Best practices

Annotating VMWEs in text is a hard task. Many tests are semantic and require not only strong knowledge of the language, but also familiarity with advanced notions in linguistics. As a consequence, ensuring annotation quality and, above all, intra- and inter-annotator consistency is a challenge. We provide here a set of hints that you can use to optimize the annotation effort and ensure the quality of the resulting corpus.

Resources and people

This website only covers the annotation guidelines. Do not forget that many other resources are available on the PARSEME shared task 1.1 website. That website is not for system authors, but for language leaders, annotators and organizers. It contains a lot of useful information, notably the names and contacts of people who can help you, and user manuals for FLAT, for the language leaders, etc. Also, you can use the mailing lists if you need to ask questions that could be relevant for other teams as well. In short, don't be shy to ask if you would like to do something but you're not exactly sure where to start :-)

NotVMWE label

The new FLAT configurations for edition 1.1 allow you to use an optional annotation label called NotVMWE. This is not a new VMWE category, but an auxiliary label which simply means "this is not a VMWE". You can use it to indicate that something should not be annotated, especially if it is a borderline case. Adding this annotation also lets you attach a textual comment explaining why you decided not to annotate the construction (e.g. after discussing it with fellow annotators and recording the decision in the list of solved cases).

While you don't need to use this label, we recommend that you use it for challenging/hard cases which, in the end, you decide not to annotate as a VMWE. This kind of annotation will be useful when performing consistency checks. Of course, NotVMWE labels will all be removed in the final released corpora, since this kind of information is irrelevant for shared task participants.

List of solved cases

In edition 1.0, some language teams ensured consistency by keeping a separate shared document (e.g. a Google spreadsheet) where hard/challenging cases were documented. We advise language leaders to implement such a list of solved cases. This allows all annotators to contribute to the discussion of hard cases, and to reach a common decision that can later be applied systematically to all occurrences of the expression and to similar expressions. In our experience, this greatly enhances the satisfaction of annotators and saves valuable time during the consistency checks. Even in a language with a single annotator, she/he can keep a personal list of difficult cases and the decisions made, to ensure intra-annotator consistency.

Consistency checks

Once all files have been annotated, language leaders will perform the final consistency checks using semi-automatic tools. During these consistency checks, all occurrences of a single expression annotated by all annotators will be shown together. There, language leaders may change annotations performed by individual annotators if they are inconsistent with the other annotations. Therefore, do not worry too much if you are unsure about an annotation. Try to be as consistent as possible, but if you do not remember a particular annotation performed earlier, it is not necessary to search through the corpus on FLAT (this is quite time-consuming). If there is some minor inconsistency, it will probably be corrected later by the language leader. Do, however, note your decision in the list of solved cases, so that the next time you come across the same expression (or a similar one) you do not spend as much time thinking about it.
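
To illustrate the idea behind such a check, here is a minimal Python sketch that groups annotated occurrences by expression and flags the expressions that received more than one label. It is not the tool used by the organizers: the tuple representation, the function names and the sample data are assumptions made for this example, and real data would have to be read from the annotated files.

```python
from collections import defaultdict

# Hypothetical representation of annotations (an assumption for this sketch):
# (annotator, lemmas of the lexicalized components, label).
annotations = [
    ("ann1", ("take", "decision"), "LVC.full"),
    ("ann2", ("decision", "take"), "LVC.full"),
    ("ann2", ("take", "decision"), "VID"),
    ("ann1", ("kick", "bucket"), "VID"),
]


def group_by_expression(annotations):
    """Group (annotator, label) pairs by the sorted tuple of component lemmas."""
    groups = defaultdict(list)
    for annotator, lemmas, label in annotations:
        key = tuple(sorted(lemmas))  # word order in the text is ignored
        groups[key].append((annotator, label))
    return dict(groups)


def inconsistent_expressions(groups):
    """Keep only the expressions that received more than one distinct label."""
    return {
        key: occurrences
        for key, occurrences in groups.items()
        if len({label for _, label in occurrences}) > 1
    }


groups = group_by_expression(annotations)
conflicts = inconsistent_expressions(groups)
for key, occurrences in conflicts.items():
    print(" ".join(key), "->", occurrences)
```

In this sample, only the expression with the lemmas "decision" and "take" is reported, because one of its three occurrences carries a different label from the other two.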

Intuition and tradition vs. guidelines

You may sometimes (often) find that the guidelines do not reflect your intuition about a given construction, or that they contradict the linguistic tradition and literature in your language. We understand that this is frustrating, but please remember that our main objective is achieving universal modelling of MWEs while preserving diversity. Therefore, please refrain from using undocumented criteria (a.k.a. intuition), or tests that are only known/documented in your language.

The guidelines were designed taking feedback from many language teams into account. They are also meant to evolve continuously, and we do count on you to play an active role in this process. Therefore, if you disagree with their current version, please choose one of the following two options:

  • Follow the guidelines anyway to ensure corpus-to-guidelines consistency, but express your criticism (documented with glossed and translated examples in your language), preferably via GitLab issues. You may also add comments to those annotations which you would like to modify once the guidelines have been enhanced.
  • Create a language-specific section for the guidelines, describing your own tests and decision trees. We will be happy to publish it online.

Inter-annotator agreement

Usually, data annotation campaigns require measuring inter-annotator agreement (e.g. kappa) to verify that the guidelines are clear and that the annotators are well trained. We encourage language teams to measure inter-annotator agreement. However, in the PARSEME shared task, the organizers do not set any hard threshold on the kappa value required to accept your annotations as part of the shared task. This is a collaborative effort, so we do not feel comfortable imposing such requirements on language teams.

Furthermore, VMWE annotation is a very hard task, so inter-annotator agreement is expected to be low. We recommend that language teams use complementary tools and resources to compensate for the low agreement, such as the list of solved cases and consistency checks mentioned on this page. After the annotation is completed, we may ask you to double-annotate a sample of your data so that we can calculate inter-annotator agreement, for instance, to report it in a corpus description article. But you should not worry too much about this: do your best in trying to understand the guidelines, do not hesitate to suggest improvements, and try to train annotators as much as possible, for instance, with pilot annotations and discussions. This way, you will ensure that the data released in the shared task for your language will be of high quality. And remember you will have the opportunity to improve it incrementally for the next shared task.
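
For teams that want to compute kappa themselves on a double-annotated sample, here is a minimal Python sketch of Cohen's kappa for two annotators. It assumes that both annotators labeled the same fixed list of candidates, so it measures agreement on labels only and says nothing about whether the annotators selected the same spans. The function name and the sample labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Illustrative sample: six candidates labeled by two annotators.
annotator_1 = ["VID", "VID", "LVC.full", "NotVMWE", "VID", "LVC.full"]
annotator_2 = ["VID", "LVC.full", "LVC.full", "NotVMWE", "VID", "VID"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # prints 0.455
```

Here the annotators agree on 4 of 6 items (observed agreement 0.667), while the agreement expected by chance is 14/36 (0.389), which gives a kappa of 10/22.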

TODO label

We have introduced a new label on FLAT called "{change-me} TODO". This label is a temporary mark-up used to indicate that a given VMWE must be dealt with by a human annotator. It will be used when a corpus is automatically converted and some annotations must be manually checked. For instance, the OTH category from shared task 1.0 disappeared in edition 1.1. Therefore, all VMWEs annotated as OTH in the 1.0 corpora will be automatically converted using the TODO label. This means that all TODO labels must be changed into a valid new category (e.g. VID). In the final annotated corpora, any remaining TODO label will be removed, since this is not actually a VMWE category but just an auxiliary label.
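
The conversion described above can be pictured with the following minimal Python sketch. It is only an illustration: the actual conversion is performed by the organizers' tools, and the list-of-labels representation and function names are assumptions made for this example.

```python
# Label name as it appears on FLAT according to this page.
TODO_LABEL = "{change-me} TODO"


def convert_oth_to_todo(labels):
    """Replace every edition 1.0 OTH label with the temporary TODO label."""
    return [TODO_LABEL if label == "OTH" else label for label in labels]


def remaining_todo(labels):
    """Return the positions of annotations that still need a human decision."""
    return [i for i, label in enumerate(labels) if label == TODO_LABEL]


# Hypothetical labels from a 1.0 corpus, one per annotated VMWE.
old_labels = ["VID", "OTH", "IRV", "OTH"]
converted = convert_oth_to_todo(old_labels)
print(remaining_todo(converted))  # prints [1, 3]

# After manual review, the annotator assigns a valid category to each TODO.
converted[1] = "VID"
converted[3] = "VID"
print(remaining_todo(converted))  # prints []
```

A check like `remaining_todo` can help annotators confirm that no temporary label is left before the corpus is finalized.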

Existence questions and corpus queries

Some tests ask whether it is possible or impossible to find some attested variant of a candidate. While in many cases this is straightforward (the variant can be easily found), some borderline cases will inevitably occur in which it is hard to tell whether a given variant is impossible or just very rare.

Decisions for hard cases like this should not be made based solely on introspection and intuition. In case of doubt, we recommend that annotators:

  1. check existing lexicons for their languages
  2. perform corpus queries using any available large raw monolingual corpus
  3. run web queries, e.g. using Sketch Engine, Linguee or plain Google
  4. discuss the case with other annotators, reach a decision and mark it in the list of solved cases

In all cases, the list of lexicons, monolingual corpora and/or web platforms to consult should be agreed upon in advance by all annotators.