Available for Download

PDF Document Processing

  • SymbolScraper (by Ritvik Joshi, Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi, 2019). An extension of Apache PDFBox that reports character and symbol codes along with their precise bounding box locations in PDF files.  

Formula Extraction and Recognition

  • Formula Extraction - Scanning Single Shot Detector (ScanSSD) for formula detection in document images (by Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi). A PyTorch-based system for detecting formulas in scanned or rendered PDF documents. The dataset used to train the system is available from the CROHME + TFD 2019 competition web pages.
  • Formula Recognition - Line-of-Sight Parsing with Graph-Based Attention (LPGA) for Formula Recognition (by Mahshad Mahdavi, Michael Condon, and Kenny Davila). A Python-based system with modules for recognizing formulas in handwritten strokes or images. The version of infty used in our ICDAR 2019 paper can also be found online here: InftyMCCDB-2.

Formula Search Engines

  • Tangent-CFT (Behrooz Mansouri, 2019).  An embedding-based formula retrieval model that incorporates both formula appearance and semantics. Retrieval is performed using the cosine similarity of embedding vectors.
  • Approach0 (Wei Zhong, 2019). Search engine that uses operator trees representing the mathematical operations in a formula for retrieval. Retrieval is done using paths of varying lengths from the leaves (operands) to the root of an operator tree.
  • Tangent-v (ECIR 2019 version) (Kenny Davila, 2019). A version of Tangent-v created for visual formula search in .png (raster) and .pdf (vector) formats. Search results from Kenny and Ritvik's ECIR 2019 paper are included in the package.
  • Tangent-v (Kenny Davila, 2018). A visual formula (and more generally, graphics) search engine. This system searches for formulas based on the appearance of their constituent symbols and their relative spatial positions in line-of-sight graphs. Tangent-v has been successfully applied to formulas in raster images, to formulas in vector images (e.g., PDF), and to formula search in lecture videos using rendered LaTeX formula queries.
  • Tangent-s (Kenny Davila, Richard Zanibbi, Andrew Kane, and Frank Wm. Tompa, 2017). Search engine using a combination of operator trees, which represent formula semantics, and layout trees, which represent formula appearance. Both appearance and semantics are queried using pairs of symbols and their relative paths in each type of tree; the appearance and semantics search results are then combined before the final results are returned.
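
As a rough illustration of the cosine-similarity retrieval step mentioned for Tangent-CFT above, the sketch below ranks a few indexed formulas against a query vector. It is not taken from the Tangent-CFT code: the function names (cosine_similarity, rank_formulas), the three-dimensional vectors, and the toy index are all hypothetical, and a real system would obtain its vectors from a trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_formulas(query_vec, index):
    """Return (formula_id, score) pairs sorted from most to least similar."""
    scored = [(fid, cosine_similarity(query_vec, vec)) for fid, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, invented for illustration only.
toy_index = {
    "x^2 + y^2": [0.9, 0.1, 0.0],
    "a^2 + b^2": [0.8, 0.2, 0.1],
    "sin(x)": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    for formula_id, score in rank_formulas(toy_query, toy_index):
        print(f"{formula_id}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the vectors, formulas whose embeddings point the same way as the query rank highest regardless of vector length.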

In Development

Search Interfaces and Search Engines

  • MathSeer Formula Editor and Search Interface (demo videos available; by Gavin Nishizawa, Yancarlos Diaz, Jennifer Liu, and Wei Zhong). The front-end for the MathSeer system, with support for math input using handwriting, formula images, and LaTeX. The interface supports saving and re-using formulas and parts of formulas. (First release planned for late 2019/early 2020.)

MathSeer is supported by the Alfred P. Sloan Foundation and the National Science Foundation.