Work Package 1: MWE representation and annotation

  • Partners in charge: LLF (Marie Candito) and ATILF (Mathieu Constant)
  • Partners involved: LLF, LI, LIF, LIFO, ATILF
  • Objectives: Select the criteria to be used in the project for identifying MWEs, classifying them, and describing their properties. Produce a gold standard corpus.
  • Final products:
    • FP.1.1: A state-of-the-art report on MWE representation
    • FP.1.2: Guidelines indicating the criteria to identify and classify MWEs, as well as the list of properties to be encoded in the lexicon and an annotation scheme
    • FP.1.3: A gold standard corpus manually annotated by experts, including deep MWE annotation, together with the annotation guidelines
  • Subtasks:
    • WP 1.1: State of the art on MWEs in language resources
    • WP 1.2: Setup of formal criteria for MWE identification and classification
    • WP 1.3: A gold standard


In the framework of the PARSEME Shared Task on identification of verbal MWEs, Agata Savary, Carlos Ramisch and Marie Candito contributed to writing the annotation guidelines (Savary et al. MWE 2017). Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Yannick Parmentier, Caroline Pasquer and Jean-Yves Antoine produced the French dataset (Candito et al. TALN 2017). This dataset, composed of the Sequoia corpus and the French UD treebank (about 19,000 sentences), includes 5,000 annotated verbal MWEs.

Work in progress

The annotation of the Sequoia corpus is now being extended to all MWEs, using annotation guidelines that are still being drafted. The release of the data is planned for the end of 2017.

