Source code provided where possible

PDF Document Processing

  • SymbolScraper (by Ritvik Joshi, Alex Keller, Jessica Diehl, Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi, 2019/2020). An extension of Apache PDFBox that reports character and symbol codes along with their precise bounding box locations in PDF files.  

Formula Extraction and Recognition

  • Formula Extraction - Scanning Single Shot Detector (ScanSSD) for formula detection in document images (by Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi). A PyTorch-based system for detecting formulas in scanned or rendered PDF documents. The dataset used to train the system is available from the CROHME + TFD 2019 competition web pages.
  • Formula Recognition - Line-of-Sight Parsing with Graph-Based Attention (LPGA) for Formula Recognition (by Mahshad Mahdavi, Michael Condon, and Kenny Davila). A Python-based system with modules for recognizing formulas in handwritten strokes or images. The version of Infty used in our ICDAR 2019 paper can also be found online: InftyMCCDB-2.

Formula Search Engines

  • MathDeck (Nishizawa, Diaz, Dmello, Liu, 2020). A new online search interface for math-aware search engines with multi-modal formula input and a 'chip and card' model for storing, manipulating, sharing, and reusing formulas and associated information about them (ECIR 2020 video demo). Source code release is planned for Summer 2020.
  • Tangent-CFT (Behrooz Mansouri, 2019).  An embedding-based formula retrieval model that incorporates both formula appearance and semantics. Retrieval is performed using the cosine similarity of embedding vectors.
  • Approach0 (Wei Zhong, 2019). Search engine that uses operator trees representing the mathematical operations in a formula for retrieval. Retrieval is done using paths of varying lengths from the leaves (operands) to the root of an operator tree.
  • Tangent-v (ECIR 2019 version) (Kenny Davila, 2019). A version of Tangent-v created for visual formula search in .png (raster) and .pdf (vector) formats. Search results from Kenny and Ritvik's ECIR 2019 paper are included in the package.
  • Tangent-v (Kenny Davila, 2018). A visual formula (and more generally, graphics) search engine. This system searches for formulas based on the appearance of their constituent symbols and relative spatial positions in line-of-sight graphs. Tangent-v has been successfully applied to formulas in raster images, vector images (e.g., PDF), and to search formulas in lecture videos using rendered LaTeX formula queries.
  • Tangent-s (Revised Feb 2021 by M. Langsenkamp and B. Mansouri; original system by Kenny Davila, Richard Zanibbi, Andrew Kane, and Frank Wm. Tompa, 2017). Search engine using a combination of operator trees and layout trees representing formula appearance. Both formula appearance and semantics are queried using pairs of symbols and their relative paths in each type of tree, and the appearance and semantics search results are then combined before final results are returned.
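
Two of the retrieval ideas named in the entries above can be illustrated with a small sketch. It is not taken from any of the listed systems: the toy vectors, the nested-tuple tree encoding, and the function names are all assumptions made for illustration. The first part ranks candidates by cosine similarity of embedding vectors (the comparison the Tangent-CFT entry describes), and the second part enumerates leaf-to-root paths in an operator tree (the kind of path the Approach0 entry describes).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, index):
    """Return (name, score) pairs sorted from most to least similar.

    `index` is a hypothetical dict mapping a formula name to its vector.
    """
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def leaf_to_root_paths(tree):
    """List the label path from every leaf (operand) up to the root.

    `tree` is a hypothetical encoding: (label, [child, child, ...]).
    Each returned path starts at a leaf label and ends at the root label.
    """
    paths = []

    def walk(node, ancestors):
        label, children = node
        if not children:
            paths.append([label] + ancestors)
        for child in children:
            walk(child, [label] + ancestors)

    walk(tree, [])
    return paths


# Toy 2-D "embeddings"; real systems use learned, higher-dimensional vectors.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_cosine([1.0, 0.0], toy_index)

# Operator tree for x + y * z, in the assumed nested-tuple encoding.
toy_tree = ("add", [("x", []), ("times", [("y", []), ("z", [])])])
paths = leaf_to_root_paths(toy_tree)
```

With the toy data, `ranking` orders the candidates as a, c, b, and `paths` holds one path per operand, for example y, times, add. The actual systems index and score such vectors and paths in their own ways, which this sketch does not attempt to reproduce.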

MathSeer is supported by the Alfred P. Sloan Foundation and the National Science Foundation