Training datasets and evaluation tools are now ready to download. Please do not distribute this unofficial data set. An official version will soon be available on the CROHME website.

CROHME 2019 Submission Instructions

Participants submit CROHME 2019 results directly online. Results may be resubmitted for each task as many times as necessary. NOTE: Each competition task has a 'test' version used for computing competition results, along with a 'validation' version for error and sanity-checking.
For the full instructions, click here: Instructions
To submit your results: Online Submission Tool


Handwritten Formulas

Handwritten formula data is provided in two formats, online and offline. Online data uses the same InkML and Label Graph (.lg) formats as the previous CROHME. The online data is described below; the offline data is provided as greyscale images rendered automatically from the online data. By default, equation images (MainTask and SubTask-b) are 1000x1000 pixels with 5 pixels of padding, and isolated symbol images (SubTask-a) are 28x28 pixels with the same 5-pixel padding.

*** The resolution of the offline dataset is fixed, but participants can always resize the original images as a preprocessing step in their systems. ***
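Because the offline images are rendered from the online strokes, participants who prefer a different resolution can also re-render from the stroke coordinates. The following is a minimal sketch of mapping strokes onto a square canvas with padding. The function name `fit_to_canvas`, the uniform scaling, and the centring are assumptions made for illustration; they are not taken from the official rendering tool.

```python
def fit_to_canvas(strokes, size=1000, padding=5):
    """Scale and centre strokes inside a size x size canvas.

    `strokes` is a list of strokes, each a list of (x, y) points.
    A border of `padding` pixels is left free on every side.
    Uniform scaling and centring are assumptions of this sketch.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    width = max(xs) - min_x
    height = max(ys) - min_y

    usable = size - 2 * padding
    extent = max(width, height)
    # A single scale factor keeps the aspect ratio of the expression;
    # a degenerate one-point input is left unscaled.
    scale = usable / extent if extent > 0 else 1.0

    # Centre the shorter dimension inside the usable area.
    off_x = padding + (usable - width * scale) / 2
    off_y = padding + (usable - height * scale) / 2

    return [
        [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
         for x, y in stroke]
        for stroke in strokes
    ]
```

Calling `fit_to_canvas(strokes, size=28, padding=5)` would give coordinates for an isolated symbol image with the default SubTask-a dimensions.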

Handwritten Training Data. Stroke data is defined by lists of (x,y) coordinates, representing sampled points as each stroke is written. Groupings of strokes into symbols (segments), along with the true class of each symbol, are provided in two formats: InkML and Label Graphs (.lg). Both provide a representation of expression symbols and structure. The InkML format represents formula structure in Presentation MathML (an XML tag-based representation), whereas label graph files use a simpler CSV-based representation. Please note that both MathML and LG files can be used as the ground truth, but the LaTeX representations are not the official ground truth because they are not normalized.
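As an illustration of the stroke and segmentation data described above, the sketch below reads traces and symbol groups from a small InkML string using the Python standard library. The sample document is hypothetical and heavily simplified (it omits the MathML annotation), and the assumed layout (symbols stored as nested `traceGroup` elements with a `truth` annotation and `traceView` references) should be checked against the actual training files.

```python
import xml.etree.ElementTree as ET

# InkML namespace as defined by the W3C InkML specification.
NS = {"ink": "http://www.w3.org/2003/InkML"}

# Hypothetical, simplified sample: three strokes forming "1 +".
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">10 20, 11 22, 12 25</trace>
  <trace id="1">30 20, 30 40</trace>
  <trace id="2">25 30, 35 30</trace>
  <traceGroup>
    <traceGroup>
      <annotation type="truth">1</annotation>
      <traceView traceDataRef="0"/>
    </traceGroup>
    <traceGroup>
      <annotation type="truth">+</annotation>
      <traceView traceDataRef="1"/>
      <traceView traceDataRef="2"/>
    </traceGroup>
  </traceGroup>
</ink>"""


def parse_inkml(text):
    """Return (strokes, symbols) from an InkML string.

    strokes: dict mapping trace id -> list of (x, y) points
    symbols: list of (label, [trace ids]) pairs
    """
    root = ET.fromstring(text)

    strokes = {}
    for trace in root.findall("ink:trace", NS):
        points = []
        for point in trace.text.strip().split(","):
            fields = point.split()
            # Keep only x and y; any extra channels are ignored.
            points.append((float(fields[0]), float(fields[1])))
        strokes[trace.get("id")] = points

    symbols = []
    # Assumed layout: one outer traceGroup containing one group per symbol.
    for group in root.findall("ink:traceGroup/ink:traceGroup", NS):
        label = group.find("ink:annotation", NS).text
        ids = [v.get("traceDataRef") for v in group.findall("ink:traceView", NS)]
        symbols.append((label, ids))

    return strokes, symbols
```

Here `parse_inkml(SAMPLE)` yields three strokes and two symbols, with the two-stroke plus sign grouped under a single label.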

Handwritten Test Data. Contains stroke data, but no segments (stroke groups), symbol classes, or recognized structure (represented in MathML). For a given test expression, participating systems must recognize symbol locations and identities, and then produce a file representing the expression structure. Participants may use the CROHME InkML/MathML format as given in the training data, or a simpler label graph (CSV) format representing segmentation, layout and classification for strokes in the input (this is an adjacency matrix defined by labels over strokes). Tools to convert one format to the other are publicly available, and participants will be provided links to these (see dprl).
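To make the "adjacency matrix defined by labels over strokes" concrete, here is a minimal sketch that reads a label graph and builds that matrix. The sample file content is hypothetical, and the line layout used (`N, stroke, label, weight` for nodes and `E, parent, child, label, weight` for edges, with strokes of the same symbol linked by edges carrying the symbol label, and `_` standing for "no relation") is an assumption that should be verified against the lgEval documentation.

```python
import csv
import io

# Hypothetical label graph for "2 + x", where "+" is written with two strokes.
SAMPLE_LG = """# Nodes
N, s0, 2, 1.0
N, s1, +, 1.0
N, s2, +, 1.0
N, s3, x, 1.0

# Edges
E, s1, s2, +, 1.0
E, s2, s1, +, 1.0
E, s0, s1, Right, 1.0
E, s0, s2, Right, 1.0
E, s1, s3, Right, 1.0
E, s2, s3, Right, 1.0
"""


def read_label_graph(text):
    """Return (stroke_ids, matrix) for a label graph given as text.

    matrix[i][j] holds the symbol label when i == j, the edge label
    from stroke i to stroke j when one exists, and "_" otherwise.
    """
    nodes = {}
    edges = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        kind = row[0].strip()
        if kind == "N":
            nodes[row[1].strip()] = row[2].strip()
        elif kind == "E":
            edges[(row[1].strip(), row[2].strip())] = row[3].strip()

    ids = sorted(nodes)
    matrix = [
        [nodes[a] if a == b else edges.get((a, b), "_") for b in ids]
        for a in ids
    ]
    return ids, matrix
```

In the resulting matrix, the diagonal carries the classification of each stroke, the `+` entries between `s1` and `s2` carry the segmentation, and the `Right` entries carry the layout.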

We expanded the training set for CROHME 2019 by adding the previous test sets (2012 and 2013). The training set for this competition therefore consists of three parts: the previous training set, the 2012 test set, and the 2013 test set. The validation set for 2019 is the 2014 test set. Finally, we provide a new test set for the main task. Please note that the 2016 test set will be used as a reference to compare systems against the results of the last competition, and should not be used for training or validating current systems. The test sets for SubTask-a and SubTask-b are drawn from the 2016 test set. You can download the CROHME 2019 package below.

To download the CROHME 2019 dataset package for Task 1 and Task 2, click here: Download CROHME2019 Recognition Dataset

Document Images

For math detection in document images, we will use the recently released GTDB datasets, which consist of document page images collected from scientific journals and textbooks. The datasets are shared under a Creative Commons by-nc-nd license, which permits copying and redistributing the material in any medium or format for non-commercial purposes, provided that no derivative products are released (none will be released for the competition).

These datasets were produced by the group that created the well-known Infty system and datasets (i.e., Masakazu Suzuki et al.). The GTDB-1 dataset contains 31 English articles on mathematics in PDF files. The GTDB-2 dataset contains 16 articles. We will provide tools to automatically generate page images of a fixed size.

Diverse font faces and mathematical notation styles are used in these articles. Ground truth annotations are provided in CSV files. They include character locations, labels, and the structure of formulas in the documents. For this task, we consider only formula locations; the ground truth permits comparison of detected regions at the raw image or character levels.
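Comparing a detected formula region with a ground-truth region at the image level is commonly done with intersection over union (IoU). The sketch below computes IoU for two axis-aligned boxes. The box layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for illustration; the matching rules and thresholds of the official evaluation tool are not described here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a value between 0.0 (disjoint) and 1.0 (identical).
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

For example, two 10x10 boxes offset by 5 pixels in each direction overlap in a 5x5 region, giving an IoU of 25/175.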

To download the CROHME 2019 dataset package for Task 3, click here: Download CROHME2019 Detection Dataset

Evaluation Tools

The CROHME organizers provide tools for math expression selection, running the test phase, evaluation, and visualization:

  • InkML viewer
  • lgEval
  • IOUeval

To download the tools, click here:
Download Evaluation Tools
Download symbolic LG converters
Download IOU evaluation tool