The LLaMaPUn library is a Rust library that provides a wide range of processing tools for natural language and mathematics. It can be used to investigate the structure and meaning of scientific and technical documents, and to build tools that extract semantic representations from them in order to improve access to and interaction with document corpora.
In particular, LLaMaPUn is used on the arXMLiv data set, a translation of the arXiv corpus to “HTML5 with MathML”.
Some of the library’s features are:
- Plaintext generation with many options (unicode normalization, word stemming, custom handling of e.g. math nodes, …)
- Word/Sentence tokenization
- Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, …)
- Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
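
To make the last item more concrete, here is a minimal, self-contained sketch of the general idea behind mapping plaintext offsets back to document nodes. It is not LLaMaPUn's DNM API: the `Node` and `OffsetMap` types, the `node_at` method, and the `[MATH]` placeholder are all hypothetical names chosen for illustration. The sketch flattens a toy tree into plaintext, records which node produced each byte range, and then looks up the node for a given offset.

```rust
/// A toy document node, standing in for a parsed HTML node (hypothetical type).
#[derive(Debug)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Plaintext plus a record of which node produced each byte range (hypothetical type).
#[derive(Debug, Default)]
struct OffsetMap {
    plaintext: String,
    /// (start, end, path) entries sorted by start; ranges are half-open byte ranges.
    /// A path is the list of child indices leading from the root to the node.
    spans: Vec<(usize, usize, Vec<usize>)>,
}

impl OffsetMap {
    /// Flattens the tree rooted at `root` and records the offset of every piece of text.
    fn build(root: &Node) -> Self {
        let mut map = OffsetMap::default();
        let mut path = Vec::new();
        map.visit(root, &mut path);
        map
    }

    fn visit(&mut self, node: &Node, path: &mut Vec<usize>) {
        match node {
            Node::Text(content) => self.record(content, path),
            Node::Element { name, children } => {
                if name == "math" {
                    // Custom handling: the whole math subtree becomes one placeholder
                    // token (the token text is an assumption made for this sketch).
                    self.record("[MATH]", path);
                    return;
                }
                for (index, child) in children.iter().enumerate() {
                    path.push(index);
                    self.visit(child, path);
                    path.pop();
                }
            }
        }
    }

    /// Appends `text` to the plaintext and remembers which node it came from.
    fn record(&mut self, text: &str, path: &[usize]) {
        let start = self.plaintext.len();
        self.plaintext.push_str(text);
        self.spans.push((start, self.plaintext.len(), path.to_vec()));
    }

    /// Returns the path of the node that produced the byte at `offset`, if any.
    fn node_at(&self, offset: usize) -> Option<&[usize]> {
        // Number of spans that start at or before `offset`; the last of them is the candidate.
        let count = self.spans.partition_point(|span| span.0 <= offset);
        let span = self.spans.get(count.checked_sub(1)?)?;
        if offset < span.1 {
            Some(span.2.as_slice())
        } else {
            None
        }
    }
}
```

Because spans are appended in document order, they are already sorted by start offset, so a binary search is enough for lookups. A real implementation would also have to account for normalization steps that change text length, which this sketch ignores.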
For a more complete overview, take a look at the README file.