Resources

Update, June 2024. A backup of the ARQMath data, runs, evaluation scripts, etc. stored in Google Drive (for the 'Data & Tools' button above) is available here.

Participating in ARQMath

Registration. To participate in ARQMath, please register online through the CLEF 2022 conference.

Information and Community. The organizers communicate with the ARQMath community primarily through this web page and the ARQMath Forum. We also post announcements on our Twitter page. If you plan to participate, please request membership in the ARQMath Forum; important information will be provided there during the lab.

Submission (closes May 6). Browse through the ARQMath collection to find data, tools, topic data (i.e., training data), and runs from previously participating systems. Make sure to review the most recent ARQMath-Guidelines document in the top-level directory of the collection, which explains the required formats for system and manual runs in each task.

NOTE -- Please cite the ARQMath-3 Overview paper if you use ARQMath data or tools in your work:
Mansouri, B., Zanibbi, R., Oard, D.W., and Agarwal, A. Overview of ARQMath-3 (2022): Third CLEF Lab on Answer Retrieval for Questions on Math (Working Notes Version). CLEF 2022 Working Notes, pp. 1-27.

Baseline Systems. Baseline systems for Tasks 1 and 2 are available through the ARQMath Google Drive. Note that Task 1 baselines can be converted to a Task 3 baseline simply by returning only the first hit obtained by the Task 1 system.
New: for Task 1/3, a text-based baseline system built on top of PyTerrier is now available (pt-arqmath).
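
The Task 1 to Task 3 conversion described above can be sketched as follows. This is a minimal illustration, not the organizers' code: the `Hit` record, its field names, and the topic and answer identifiers are hypothetical, and the actual run file format is the one defined in the ARQMath-Guidelines document.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One retrieved answer for a topic (hypothetical in-memory record)."""
    topic_id: str
    answer_id: str
    rank: int
    score: float


def top_hit_per_topic(hits: Iterable[Hit]) -> list[Hit]:
    """Keep only the best-ranked (lowest rank number) hit for each topic."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.topic_id)
        if current is None or hit.rank < current.rank:
            best[hit.topic_id] = hit
    # Topics come back in the order they were first seen in the input
    return list(best.values())


# Made-up Task 1 hits; identifiers are for illustration only
task1_run = [
    Hit("T1", "1001", 1, 12.5),
    Hit("T1", "1002", 2, 11.0),
    Hit("T2", "2001", 2, 7.1),
    Hit("T2", "2002", 1, 9.3),
]
task3_run = top_hit_per_topic(task1_run)
```

Selecting by rank rather than by input position means the sketch does not assume the Task 1 hits are already sorted.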

Evaluation Tools, Data, and Previous Runs. Tools for evaluation are available, along with qrel files (i.e., relevance scores), a detailed record of assessment data that produced the qrels, and system runs from previous ARQMath participants. See the Evaluation directory on the ARQMath Google Drive for more information.

Formula Index Files. Formula index files are provided in three encodings -- LaTeX (from Mathematics Stack Exchange (MSE) posts) and Presentation MathML represent formula appearance, while formula syntax is represented using Content MathML. Both the appearance and math syntax encodings are trees: Symbol Layout Trees (SLTs) for appearance, and Operator Trees (OPTs) for formula operation syntax. Formulas are grouped by appearance into 'visually distinct' groups prior to assessment. For ARQMath 2021 and 2022, we have pre-computed and enumerated these groups, and provided the unique 'visual group' identifier in the formula index files.

Visually distinct formula groups are computed using their Tangent-S Symbol Layout Tree (SLT) representations, falling back to LaTeX strings where SLT construction fails. See the previous task overview papers for more details. Our thanks to Frank Tompa for suggesting including formula appearance groups in the provided index files.
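
The grouping rule with its LaTeX fallback can be illustrated with a short sketch. It assumes each formula already has an SLT string (or `None` where SLT construction failed); the function name, the tuple layout, and the placeholder SLT strings are hypothetical, and the real Tangent-S SLT strings look different. Whether the organizers normalize LaTeX before comparison is not stated here, so the sketch compares raw strings.

```python
from typing import Optional


def assign_visual_groups(
    formulas: list[tuple[int, Optional[str], str]]
) -> dict[int, int]:
    """Map each formula id to a visual group id.

    Each input item is (formula_id, slt_string_or_None, latex_string).
    """
    group_of_key: dict[tuple[str, str], int] = {}
    groups: dict[int, int] = {}
    for formula_id, slt, latex in formulas:
        # Prefer the SLT string; fall back to the LaTeX string when
        # no SLT is available. Tagging the key keeps the two
        # namespaces from colliding.
        key = ("slt", slt) if slt is not None else ("latex", latex)
        if key not in group_of_key:
            group_of_key[key] = len(group_of_key) + 1
        groups[formula_id] = group_of_key[key]
    return groups


# Placeholder data: "SLT:..." strings stand in for real SLT strings
formulas = [
    (1, "SLT:x^2", "x^2"),
    (2, "SLT:x^2", "x^{2}"),        # same appearance, different LaTeX
    (3, None, "\\badmacro{x}"),     # SLT construction failed
    (4, None, "\\badmacro{x}"),
    (5, "SLT:y", "y"),
]
visual_groups = assign_visual_groups(formulas)
```

Formulas 1 and 2 share a group because their SLT strings match even though their LaTeX differs, which is the point of grouping by appearance.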

Collection and Topics. The test collection is built from Mathematics Stack Exchange, an online Question Answering (QA) site. There were approximately 1.1 million questions on the forum when the main collection was created. After ARQMath-3, over 200 annotated topics are available for each of Task 1 and Task 2, including qrel files for use in evaluation. Previously submitted runs from participants are available in the Runs directory. Please consult the Guidelines document and README files in the collection for additional details.

Question Threads (HTML). Within the collection, tools (in Python) are provided for generating readable question threads from the raw collection snapshot, along with the question thread HTML pages that the tools produce (provided with the collection and topic data mentioned above). Threads are intended for studying the collection and sanity checking results during development, and are also used by human assessors for relevance assessment after the submission deadline has passed. Formulas that are indexed for ARQMath are placed in span tags with the class 'math-container'; each tag includes the integer identifier for the formula in the ARQMath formula index, e.g.,

<span class="math-container" id="844">...</span>
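
As a rough sketch of how these identifiers might be read back out of a thread page, the following uses only Python's standard `html.parser` module. The class and function names and the sample HTML string are made up for illustration; spans without a numeric `id` are skipped.

```python
from html.parser import HTMLParser


class FormulaSpanCollector(HTMLParser):
    """Collect integer ids of <span class="math-container" id="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.formula_ids: list[int] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        raw_id = attr.get("id") or ""
        # Only indexed formulas carry an integer identifier
        if "math-container" in classes and raw_id.isdecimal():
            self.formula_ids.append(int(raw_id))


def formula_ids_in_thread(html_text: str) -> list[int]:
    """Return formula ids in document order."""
    collector = FormulaSpanCollector()
    collector.feed(html_text)
    collector.close()
    return collector.formula_ids


# Made-up fragment in the style of the example above
sample = (
    '<p>Solve <span class="math-container" id="844">x^2 = 4</span> '
    'and <span class="math-container">y</span> '
    'then <span class="math-container" id="845">x = 2</span></p>'
)
found = formula_ids_in_thread(sample)
```

The resulting identifiers can then be looked up in the formula index files.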

The ARQMath Collection and Associated Code

The ARQMath collection itself is provided as a set of XML files from the Internet Archive, along with HTML thread files and formula index files. In addition, two tools were created: one to generate viewable question threads, and one to generate formula index files.

Formula index - LaTeX formulas are assigned identifiers, and a separate TSV file with a formula index is produced. The second part of this process converts formulas from LaTeX to Presentation MathML and Content MathML, and finally new TSV formula index files are created for the MathML representations.
HTML question thread files - Viewable question threads generated from the XML post data. These are intended for use by participants for study/checking, and are also used during assessment.
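
The first step of the formula index process (assigning identifiers and writing a TSV file) might look roughly like the sketch below. The column names, the function name, and the choice to number formulas from 1 are assumptions made for illustration; the actual columns are documented in the README files in the collection, and the real tool is in the ARQMathCode repository.

```python
import csv
import io


def write_latex_index(formulas_by_post, out) -> int:
    """Write a TSV formula index and return the number of formulas.

    formulas_by_post: iterable of (post_id, [latex strings]) pairs.
    out: a text file-like object opened for writing.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Hypothetical header; the real index files have their own columns
    writer.writerow(["formula_id", "post_id", "latex"])
    next_id = 1
    for post_id, latex_list in formulas_by_post:
        for latex in latex_list:
            writer.writerow([next_id, post_id, latex])
            next_id += 1
    return next_id - 1


# Made-up posts and formulas, written to an in-memory buffer
buffer = io.StringIO()
count = write_latex_index(
    [(10, ["x^2", "a+b"]), (11, ["\\frac{1}{2}"])],
    buffer,
)
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
```

The same identifiers would then be carried over to the Presentation MathML and Content MathML index files produced in the second step.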

The following two GitHub repositories are used: (1) ARQMathCode generates the LaTeX formula index and HTML thread files, and provides scripts for creating the MathML TSV index files using LaTeXML (v. 0.8.5); and (2) MIR-MU ARQMath data processing converts LaTeX formulas to Presentation MathML and Content MathML.

The GitHub repositories for these two software tools are available here:
