# ARQMath 2020: Task 2 Results Errata (v2)

## B. Mansouri and R. Zanibbi, Apr. 30, 2021

In the first Errata reported for ARQMath, Table 2, which provides results
for the formula retrieval task, was updated to correct a
normalization error in the nDCG' measure. The correction increased nDCG'
scores, but did not affect the relative order of any pairwise system
comparisons between submitted runs.

We have found and corrected additional errors in Table 2, where the
MAP' and P@10 scores also required modification. Again, **system
comparison ordering among submitted systems is unaffected,** but the
true MAP' scores are higher, and there are small changes in the P@10
results, largely due to the behavior of trec_eval when
ordering hits with equal scores.

Accompanying this errata is a spreadsheet that illustrates how
switching from formula ids to visually distinct formula ids, as defined for
ARQMath 2020, changes the values of specific evaluation measures.

Here is a more detailed explanation for the differences in computed
evaluation measures, with reference to illustrations in the
spreadsheet (from B. Mansouri):

> In both Errata 1 and 2, the miscalculation was caused by using
  formula ids in the qrels judgement file, rather than the visually
  distinct formula ids. This can affect the calculation of both nDCG'
  and MAP' values, as both depend upon the number of relevant items in
  the qrels file.
 
> For instance, as shown in the spreadsheet, for query 'B.60', there
  are 394 formula instances with scores of high (3) or medium (2), but
  this number drops to 39 when identifiers for visually distinct
  formulae are used instead of formula (instance) ids. That is why for
  systems such as Tangent-S the AP' value for this query increases
  from 0.02 to 0.20, and the nDCG' value changes from 0.07 to 0.38.
 
> In the updated Table 2 of the ARQMath 2020 overview paper, instead
  of formula ids, visual ids (i.e., the ids for visually distinct
  formulas) are used to compare submitted runs with relevance
  judgments (ground truth) in the qrels file. It should be noted that
  there are systems such as Tangent-S that return the same score for
  different formula instances; the trec_eval tool reorders hits with
  identical scores in **reverse lexicographic order of formula (what
  trec\_eval calls 'document') id** (see Ian Soboroff's comments here:
  [https://github.com/usnistgov/trec_eval/issues/22](https://github.com/usnistgov/trec_eval/issues/22)).
 
> Therefore, when formula ids are replaced by visual ids, P@10 values
  may also change. For example, for topic B.60, Tangent-S returned
  formulas from rank 2 to rank 25 with a score of 2.738. The P@10
  value for this topic changes from 0.6 to 0.8 when switching from
  formula ids to the correct visual ids.
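
The effect of tie-breaking on P@10 can be sketched in a few lines of Python. This is a toy illustration, not trec_eval itself: the function names (`trec_eval_order`, `precision_at_k`), the ids, and the scores are all made up, and the only behavior assumed is what the explanation above states, namely that hits are ranked by descending score and ties are resolved in reverse lexicographic order of id.

```python
def trec_eval_order(hits):
    """Order (doc_id, score) pairs by descending score; ties by descending id.

    Assumption: this mirrors the tie-breaking described above, not trec_eval's code.
    """
    # Python's sort is stable, so sorting by id first and by score second
    # leaves equal-score hits in reverse lexicographic id order.
    by_id = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return sorted(by_id, key=lambda hit: hit[1], reverse=True)


def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k ids that are in the relevant set."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


# Hypothetical run: one top hit, then three hits tied at the same score.
run = [("f1", 3.0), ("f2", 2.0), ("f3", 2.0), ("f4", 2.0)]
relevant = {"f1", "f2"}

submitted = [doc_id for doc_id, _ in run]
reordered = [doc_id for doc_id, _ in trec_eval_order(run)]

print(submitted, precision_at_k(submitted, relevant, k=2))
print(reordered, precision_at_k(reordered, relevant, k=2))
```

In this toy run the submitted order gives P@2 = 1.0, while the tie-broken order moves `f2` to the bottom of the tied group and gives P@2 = 0.5. Replacing formula ids with visual ids changes the strings being compared, so the order within a tied group, and therefore P@10, can change as well.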

The spreadsheet shows (from first to last sheet):
 
 1. Evaluation measures for the Tangent-S system in Task 2 before and after correction.
 
 2. The Tangent-S results for query B.60 using formula ids, first as given and then after sorting by trec\_eval.
 
 3. The same Tangent-S results for B.60 using visually distinct formula ids (ARQMath 2020 values), then after sorting by trec\_eval.
 
 4. The ground truth (judge) qrels file using formula ids. The file has 515 entries, 394 with High (3) or Medium (2) relevance. For MAP' and P@10, lower relevance ratings are considered non-relevant.
 
 5. The ground truth (judge) qrels file using visually distinct formula ids. The file has 141 entries, 39 with High (3) or Medium (2) relevance. Again, for MAP' and P@10, lower relevance ratings are considered non-relevant.
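
The following minimal Python sketch illustrates why the number of relevant entries in the qrels file matters for average precision. It assumes the binarization described in items 4 and 5 (grades of 2 or 3 count as relevant) and the standard definition of average precision, which divides by the total number of relevant items. The function names, ids, and grades are hypothetical toy values, not ARQMath data.

```python
def relevant_ids(qrels, min_grade=2):
    """Ids whose grade is Medium (2) or High (3); lower grades count as non-relevant."""
    return {doc_id for doc_id, grade in qrels.items() if grade >= min_grade}


def average_precision(ranked_ids, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy qrels: the first has many relevant entries, the second only a few.
qrels_instances = {"a": 3, "c": 1, "d": 2, "e": 3, "g": 2, "h": 2}
qrels_distinct = {"a": 3, "c": 1, "d": 2}

ranked = ["a", "c", "d"]  # relevant hits at ranks 1 and 3 in both cases

ap_instances = average_precision(ranked, relevant_ids(qrels_instances))
ap_distinct = average_precision(ranked, relevant_ids(qrels_distinct))

print(round(ap_instances, 2))  # 0.33, denominator is 5
print(round(ap_distinct, 2))   # 0.83, denominator is 2
```

The ranked list and its relevant hits are identical in both calls; only the denominator changes. The same mechanism explains why AP' for B.60 rises when the 394 relevant formula instances are replaced by 39 relevant visually distinct formulas.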

