CoMMa: Corpus Meta-Mathematics
Representing and Extracting the Meaning of Mathematical/Technical Documents

From: 2006

Funding: internal

Prof. Dr. Michael Kohlhase
M.Sc. Deyan Ginev
Jan Frederik Schaefer


Many of the foundations of the wealth of western societies are laid down in mathematical/technical documents. Such documents are rich in structure and communicate complex meaning effectively and efficiently, but we do not yet understand their underlying knowledge structures and linguistic characteristics well enough to extract their meaning and represent it explicitly, so that machines can support working with it.

This leads to the one-brain barrier in Science, Technology, Engineering, and Mathematics (STEM): instead of being machine-processable, mathematical/technical knowledge must always pass through a human brain for innovation, application, and education. In the CoMMa project, we work towards remedying this problem for mathematics (which can serve as a test tube for STEM documents) by applying methods from


Meta-Mathematics, i.e. employing languages that allow us to talk about the concepts, objects, and models of mathematics. We employ the OMDoc (Open Mathematical Documents) format for representing document and knowledge structures at all levels and with flexible formality, and the MMT system for formalizing knowledge in a foundation-independent manner.


Corpus Linguistics, i.e. extracting meaning-carrying structures from documents at the scale of large, representative corpora. We prepare and manage large mathematical corpora in the arXMLiv effort and distribute math-linguistic data sets via the SIGMathLing initiative. We contribute to the infrastructure for math linguistics with the LLaMaPUn library and the KAT annotation tool.