Paper on an Improved Tangent Formula Search Engine to Appear at SIGIR 2016
Computer Science PhD student Kenny Davila, Dr. Richard Zanibbi, and Drs. Frank Tompa and Andrew Kane from the University of Waterloo (Canada) will have their paper, "Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale," appear in the proceedings of SIGIR 2016, the leading international conference on information retrieval. They will present the paper at the conference in Pisa, Italy this coming July.
The paper describes improvements to the Tangent formula search engine created by the Document and Pattern Recognition Lab. David Stalnaker created Tangent for his MSc thesis in 2013, building upon ideas from Thomas Schellenberg's earlier MSc thesis. For his MSc project in 2014, Nidhin Pattaniyil added support for matrices and keywords, producing top results for the international NTCIR-11 math retrieval task (EE Times Story, ACM TechNews Story).
Drs. Tompa and Kane began collaborating with Kenny Davila and Dr. Zanibbi in October 2014, when Dr. Zanibbi visited the University of Waterloo for a month during his sabbatical. Dr. Tompa is a Distinguished Professor Emeritus at the University of Waterloo, a Fellow of the ACM, and an influential researcher in the areas of databases and information retrieval. Dr. Kane is a postdoctoral researcher at the University of Waterloo specializing in high-performance retrieval systems.
The SIGIR paper describes an improved retrieval model based on pairs of symbols in formulae, a new two-pass retrieval architecture, and a new formula similarity metric, the Maximum Subtree Similarity (MSS). Together, these modifications produce state-of-the-art retrieval results and greatly improve Tangent's handling of queries containing variables and wildcards (e.g., to identify variants of the Pythagorean theorem that use different variable names, or to find all exponents with base 'e'). The paper also describes substantial reductions in storage and run-time requirements, allowing the system to be used in real time for large collections such as Wikipedia.
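To give a rough feel for how symbol-pair retrieval with a second re-ranking pass might work, here is a minimal Python sketch. It is not Tangent's implementation. The tree encoding, the edge labels (">" and "^"), the window size, the Dice-coefficient scoring, and all function names are assumptions made for illustration. In particular, the second pass simply renames variables in order of first appearance and rescores; it is a simple stand-in, not the Maximum Subtree Similarity metric from the paper.

```python
from collections import Counter, defaultdict

# A formula node is (symbol, [(edge_label, child_node), ...]).
# Edge labels are hypothetical: ">" = next symbol on the baseline, "^" = superscript.

def symbol_pairs(node, window=2):
    """Return (ancestor, descendant, path) tuples for symbols up to `window` edges apart."""
    pairs = []
    sym, children = node
    stack = [(child, label) for label, child in children]
    while stack:
        (d_sym, d_children), path = stack.pop()
        pairs.append((sym, d_sym, path))
        if len(path) < window:
            stack.extend((c, path + label) for label, c in d_children)
    for _, child in children:
        pairs.extend(symbol_pairs(child, window))
    return pairs

def dice(pairs_a, pairs_b):
    """Dice coefficient over two multisets of symbol pairs."""
    a, b = Counter(pairs_a), Counter(pairs_b)
    total = sum(a.values()) + sum(b.values())
    return 2 * sum((a & b).values()) / total if total else 0.0

def rename_variables(node, mapping=None):
    """Replace alphabetic symbols with V1, V2, ... in order of first appearance."""
    mapping = {} if mapping is None else mapping
    sym, children = node
    if sym.isalpha():
        sym = mapping.setdefault(sym, f"V{len(mapping) + 1}")
    return (sym, [(label, rename_variables(child, mapping)) for label, child in children])

def build_index(collection):
    """Inverted index: symbol pair -> ids of the formulas that contain it."""
    index = defaultdict(set)
    for fid, formula in collection.items():
        for pair in symbol_pairs(formula):
            index[pair].add(fid)
    return index

def search(query, collection, index, top_k=10):
    # Pass 1: fetch formulas sharing at least one pair with the query, rank by raw Dice.
    q_pairs = symbol_pairs(query)
    candidates = set()
    for pair in q_pairs:
        candidates |= index.get(pair, set())
    first = {fid: dice(q_pairs, symbol_pairs(collection[fid])) for fid in candidates}
    shortlist = sorted(first, key=first.get, reverse=True)[:top_k]
    # Pass 2 (stand-in for a finer metric): rescore the shortlist after variable renaming.
    q_renamed = symbol_pairs(rename_variables(query))
    second = {fid: dice(q_renamed, symbol_pairs(rename_variables(collection[fid])))
              for fid in shortlist}
    ranked = sorted(shortlist, key=lambda fid: (-second[fid], -first[fid]))
    return [(fid, second[fid]) for fid in ranked]

def sum_of_squares(u, v):
    """Build the tree for u^2 + v^2."""
    return (u, [("^", ("2", [])), (">", ("+", [(">", (v, [("^", ("2", []))]))]))])

collection = {
    "pyth_xy": sum_of_squares("x", "y"),
    "pyth_ab": sum_of_squares("a", "b"),
    "other": ("x", [(">", ("+", [(">", ("1", []))]))]),  # x + 1
}
index = build_index(collection)
results = search(sum_of_squares("x", "y"), collection, index)
for fid, score in results:
    print(fid, round(score, 3))
```

In this toy collection, the first pass retrieves a^2 + b^2 only because it shares a single pair with the query, while the second pass scores it as a full match once variable names are normalized. This loosely mirrors the motivation for a cheap candidate-selection pass followed by a more detailed similarity computation.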
Image Caption: Example search results from the Tangent search engine. The results shown were produced using the NTCIR-11 Wikipedia collection containing over 380,000 formulae.