Software

Source code provided where possible

PDF Document Processing

  • [Old] SymbolScraper (by Ritvik Joshi, Alex Keller, Jessica Diehl, Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi, 2019/2020). An extension of Apache PDFBox that reports character and symbol codes along with their precise bounding box locations in PDF files.
  • SymbolScraper Server (cleaner and faster version, created by Matt Langsenkamp). The new version includes a substantially improved implementation and a Docker container for fast asynchronous processing.

Formula Extraction and Recognition

  • MathSeer Formula Extraction Pipeline - locates and extracts formulas in PDF documents. Available online (open source) through GitLab. First released in September 2021, updates ongoing. Includes updated SymbolScraper, ScanSSD-XYc and QD-GGA models, along with a command-line interface for combining their outputs. Created by Ayush K. Shah, Abhisek Dey, Matt Langsenkamp, and Richard Zanibbi.
  • YOLO detector (JP Ramissini, 2022). A fast YOLO-based detector for math formulas and chemical diagrams, created in Spring 2022. The source code is available in the GitLab project. This detector is also included in the extraction pipeline.
  • [Old] Formula Extraction - Scanning Single Shot Detector (ScanSSD) for formula detection in document images (by Parag Mali, Puneeth Kukkadapu, and Mahshad Mahdavi). A PyTorch-based system for detecting formulas in scanned or rendered PDF documents. The dataset used to train the system is available from the CROHME + TFD 2019 competition web pages.
  • [Old] Formula Recognition - Line-of-Sight Parsing with Graph-Based Attention (LPGA) for Formula Recognition (by Mahshad Mahdavi, Michael Condon, and Kenny Davila). Python-based system with modules for recognizing formulas in handwritten strokes or images. The version of infty used in our ICDAR 2019 paper is also available online: InftyMCCDB-2.

Formula Search Engines

  • MathDeck (Nishizawa, Diaz, Dmello, Liu, Langsenkamp 2020 -- 2022). Online search interface for math-aware search engines with multi-modal formula input, and a 'chip and card' model for storing, manipulating, sharing, and reusing formulas and associated information about them. (ECIR 2020 video demo -- CHI 2021 video demo) - source will be released in late summer 2022.
  • PHOC formula retrieval models (Robin Avenoso and Matt Langsenkamp, 2022): a new approach that represents formulas by the locations of their symbols, and is implemented using a variation of standard inverted indices for text. Pertinent Python libraries include anyphoc (PHOC vector generation), phocindexing (indexing and retrieval using OpenSearch), and arqmath-compare (for comparing formula retrieval results from the ARQMath labs). This approach was originally devised by Robin Avenoso; the extensions and the Python libraries were created by Matt Langsenkamp.
  • MathFIRE (Behrooz Mansouri, 2022). Math Formula Indexing and Retrieval using Elastic Search. A framework for retrieval using Tangent-CFT vectors (see below), built on OpenSearch (closely related to Elastic Search).
  • Tangent-s (Revised Feb 2021 by M. Langsenkamp and B. Mansouri; original system by Kenny Davila, Richard Zanibbi, Andrew Kane, and Frank Wm. Tompa, 2017). Search engine using a combination of operator trees representing formula semantics and layout trees representing formula appearance. Both appearance and semantics are queried using pairs of symbols and their relative paths in each type of tree, and the appearance and semantics search results are then combined before returning final results.
  • Tangent-CFT (Behrooz Mansouri, 2019). An embedding-based formula retrieval model that incorporates both formula appearance and semantics. Retrieval is performed using the cosine similarity of embedding vectors.
  • Approach0 (Wei Zhong, 2019). Search engine that uses operator trees representing the mathematical operations in a formula for retrieval. Retrieval is done using paths of varying lengths from the leaves (operands) to the root of an operator tree.
  • Tangent-v (ECIR 2019 version) (Kenny Davila, 2019). A version of Tangent-v created for visual formula search in .png (raster) and .pdf (vector) formats. Search results from Kenny and Ritvik's ECIR 2019 paper are included in the package.
  • Tangent-v (Kenny Davila, 2018). A visual formula (and more generally, graphics) search engine. This system searches for formulas based on the appearance of their constituent symbols and their relative spatial positions in line-of-sight graphs. Tangent-v has been successfully applied to formulas in raster images, vector images (e.g., PDF), and to searching for formulas in lecture videos using rendered LaTeX formula queries.
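
Embedding-based retrieval of the kind described for Tangent-CFT above ranks candidate formulas by the cosine similarity between a query vector and each indexed vector. The following is a minimal sketch of that ranking step only, not code from Tangent-CFT or MathFIRE: the function names, the three-dimensional vectors, and the formula labels are all invented for illustration, and real embeddings would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, index):
    """Return (formula_id, score) pairs sorted by decreasing similarity."""
    scored = [(fid, cosine_similarity(query_vec, vec)) for fid, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy index: formula label -> made-up embedding vector.
toy_index = {
    "x^2 + y^2": [0.9, 0.1, 0.0],
    "a^2 + b^2": [0.8, 0.2, 0.1],
    "sin(x)": [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]  # made-up query embedding
ranking = rank_by_cosine(query, toy_index)
for fid, score in ranking:
    print(f"{fid}: {score:.3f}")
```

Because cosine similarity ignores vector length, formulas are compared by the direction of their embeddings only. A search backend such as OpenSearch would typically perform this comparison over an index rather than a Python dictionary.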

MathSeer is supported by the Alfred P. Sloan Foundation and the National Science Foundation