Archive for the ‘mathml’ Category

TEI Guidelines mention MathML, OpenMath, and OMDoc

Saturday, July 31st, 2010

Someone in the humanities must be interested in OMDoc. I was really surprised to find a reference to OMDoc in the section “Formulæ and Mathematical Expressions” of the guidelines (a.k.a. specification) for TEI. TEI (Text Encoding Initiative) is the standard semantic markup language for the humanities, social sciences and linguistics, much like DocBook is for technical manuals. All that TEI itself has is an element <formula notation="…"/>, where notation names the language in which the formula is represented. But the guidelines refer to some mathematical markup languages, from which the document author is asked to “make an informed choice”:

  • TeX – the obvious candidate, also used in some examples
  • MathML – the obvious candidate when XML is desired. They give one Presentation MathML example but also mention Content MathML.
  • OpenMath – much less expected. Nice to see that here. On the other hand, the links to the OpenMath standard are outdated. I should probably report that.
  • OMDoc – I didn’t expect that at all.

egomath search engine talk at DML Workshop

Sunday, July 27th, 2008

I am sitting in the DML (Digital Mathematical Libraries) workshop in Birmingham, listening to Jozef Misutka’s talk on his search engine.

It is surprising how many math search engines are out there; this project was started in 2004, and I had not really known about it. Jozef also uses an existing search engine for indexing, but he does a syntactic analysis before he indexes, at least for formulae. This is the central part of his talk. He tries to deduce the correct meaning from the input, which seems to be PDF.


  1. normalization (heuristic),
  2. linearization (since his search engine works on strings/words),
  3. partial evaluation (e.g. with distributivity),
  4. generalization (introduction of variables in the index),
  5. ordering (for commutative operators).
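
To make three of these steps concrete, here is a toy sketch in Python. It is not Jozef's implementation: the tuple-based expression format, the placeholder names id1, id2, … and the choice to order before generalizing are all assumptions made for illustration, and normalization and partial evaluation are left out.

```python
# Toy index-key pipeline: ordering, generalization, linearization.
# Expressions are nested tuples (operator, arg, ...); leaves are strings.
# All names and the expression format are hypothetical.

COMMUTATIVE = {"+", "*"}


def linearize(expr):
    """Flatten an expression tree into a space-separated prefix string."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return " ".join([op, "("] + [linearize(a) for a in args] + [")"])


def order(expr):
    """Sort the arguments of commutative operators by their linearized form."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [order(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=linearize)
    return (op, *args)


def generalize(expr, names=None):
    """Replace variables by placeholders numbered in order of first occurrence."""
    names = {} if names is None else names
    if isinstance(expr, str):
        if expr.isdigit():  # keep numeric constants as they are
            return expr
        return names.setdefault(expr, f"id{len(names) + 1}")
    op, *args = expr
    return (op, *[generalize(a, names) for a in args])


def index_key(expr):
    """Combine the three steps into one string suitable for a word index."""
    return linearize(generalize(order(expr)))
```

With this sketch, x + b*a and a*b + x get the same key, so a query for one would match the other in a string-based index. The sketch makes no claim to produce a canonical form for every expression.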

This seems to be an attempt to get semantics into the search, i.e. E-retrieval, but do we have clear information about the equivalence relation E?

I think that the most important contribution is probably the analysis phase, since he extracts formulae from something as poor as PDF.


Integrating Presentation into OMBase

Monday, March 24th, 2008

I have just been reading up on REST again, since I found a very palatable pair of articles (REST intro, and practices). This got me thinking about the state of OMBase, and the integration of our presentation pipeline into the OMBase interface. The interface is RESTful, since we have implemented MMT addressing via URIs: you just use a GET to retrieve the addressed content.

What I have talked with Florian about, but maybe not with the OMBase team, is how to integrate presentation. That should be very simple from the interface point of view: we just take the same URLs, but a different HTTP header.

GET /arith1/lcm
Accept: application/omdoc+xml

gives you the OMDoc file and

GET /arith1/lcm
Accept: application/xhtml+xml

gives you the presented version (with the standard options). Now, we have written a paper about presentation and submitted it to MKM, and Christine has put a lot of ingenuity into defining user options for the presentation process. These should be easy to integrate with the URI query interface:

GET /arith1/lcm?ext=foo.ntn&int=lang:ntn;style:physics
Accept: application/xhtml+xml

That should do the trick.
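
A minimal sketch of the server-side decision, in Python. This is not OMBase code: the function names are hypothetical, q-values in the Accept header are ignored for brevity, and falling back to the OMDoc source for unknown media types is just one possible choice (a real server might answer 406 instead).

```python
# Hypothetical content negotiation for one resource such as /arith1/lcm.
# Only the two media types from the examples above are known.

SUPPORTED = ("application/omdoc+xml", "application/xhtml+xml")


def negotiate(accept_header, default="application/omdoc+xml"):
    """Return the first supported media type listed in the Accept header."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()  # drop parameters like q=0.9
        if media_type in SUPPORTED:
            return media_type
    return default


def parse_options(value):
    """Parse an option string like 'lang:ntn;style:physics' into a dict."""
    return dict(pair.split(":", 1) for pair in value.split(";") if pair)
```

The same URL then yields either the OMDoc source or the presented XHTML, depending only on what negotiate returns, and the parsed options would be passed on to the presentation process.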

MathML Support in Firefox 3 beta 4.

Sunday, March 16th, 2008

I have just installed the new Firefox 3 beta 4. I have been using betas of Firefox 3 for a while and have been enthusiastic about the release, but I always had to keep a Firefox 2 copy around for viewing MathML. Without having tested this extensively, I would say that in beta 4 the level of MathML support is up to the level of Firefox 2, which allows me to make the transition to Firefox 3 fully. For some reason I do not fully understand, the font problems I was having with FF2 also seem to have gone away.

To test FF3, I have looked at our MathML version of the Cornell eprint arXiv and the results are really impressive.

Success Rates in the arXMLiv project

Sunday, December 9th, 2007

I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…

We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have over 50% “success rate”. Aaron Krowne asked me what success rate means and when we are going to reach 100%. Here is the story.

First I would like to briefly talk about what we are doing. We are running Bruce Miller’s LaTeXML converter over the ca. 370,000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). With 370,000 LaTeX documents from all over the world, written over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.

The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergrads is helping with this. Since we are still in development mode, we only download last year’s collection of articles after New Year (about 80,000+ new ones in a couple of weeks).

Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transformation is mathematically correct is another matter; this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.
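
In code, this notion of success is just a count. The Python sketch below assumes that the number of errors per document has already been extracted from the logs; the document identifiers are made up, and the real harness and log format are not shown here.

```python
# Hypothetical input: a mapping from document id to the number of errors
# that the converter reported for that document.

def success_rate(error_counts):
    """Fraction of documents that were converted without any error."""
    if not error_counts:
        return 0.0
    successes = sum(1 for errors in error_counts.values() if errors == 0)
    return successes / len(error_counts)


# Made-up sample: two of four documents convert without errors.
sample = {"doc-a": 0, "doc-b": 2, "doc-c": 0, "doc-d": 1}
rate = success_rate(sample)
```

Note that a document counts as a success here even if the resulting formulae are mathematically wrong, which is exactly the caveat made above.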


And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be successfully processed by LaTeX, since arXiv does not have a viable package management system.

Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca. 80,000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10 percentage points, to 50%, when we started a big new category (we had been at about 60% before).

My personal suspicion is that we will reach 70% in the next three months; then the going will become slower, and I am not sure how far we will realistically get beyond 80% any time soon with the current resources (a couple of undergrads). To get there, we would have to take the project global, which I would not mind, but which I do not necessarily see as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.

A radical new referencing scheme for OpenMath and MathML (and OMDoc)

Monday, July 2nd, 2007

We are thinking about how to reference theory-constitutive elements in content dictionaries. We had distinguished “reference by location” (via usual URIs) and “reference by context” (via the OMDoc theories and their constitutive elements) in OMDoc 1.1. It was very hard to explain the latter, and the encoding was a little weird, so I dropped it again from OMDoc 1.2. But the concept is valid and important, so here we go again.

This topic is important, since we are thinking about OpenMath3 and we are adding CDs in MathML3. I guess it will be quite a while until we can change these two again, so we had better get it right. Moreover, the referencing scheme had better be compatible with those two.

Here is the idea: we have nested theories in OMDoc1.2, and we need to reference symbols from them. Now, symbols are referenced by their name, which need not be document-unique (and we do not want to require that, since we want to compose theories in documents). That is why their names have three components: the theory name (cd name in OM; which is document-unique, at least in OMDoc), a symbol name, which is theory/cd-unique, and, to disambiguate, a URI for the cd in the cdbase attribute.

We would like to generalize that: in OMDoc1.8, theory names should only have to be unique in their context (which might be the document context or a theory). So far so good, but then we need a path-like referencing scheme, at least for the cd names. So we can really combine them in one path/URI, as described in a post on MathML/OM referencing.

The next step in OMDoc would be to allow any content element to be theory-like, and allow it to import. Here is a somewhat extreme example of what we would be able to do.

<!-- all statements are theories, so this is also one -->
<symbol name="nat"/>

<!-- this symbol declaration imports from theory "nat" -->
<symbol name="zero">
<type><csymbol pref="nat/nat"/></type>
</symbol>

<!-- this one also needs a function type, so we import it -->
<symbol name="suc">
<csymbol pref="simple-types/funtype"/>
<csymbol pref="nat"/>
<csymbol pref="nat"/>
</symbol>

<!-- the third Peano Axiom (1&2 are about types) is only about suc -->
<axiom name="peano3">
<imports from="suc"/>
<imports from="quant1"/>
<csymbol pref="quant1/forall"/>
<apply><csymbol pref="suc/suc"/><ci>a</ci></apply>
<apply><csymbol pref="suc/suc"/><ci>b</ci></apply>
</axiom>

Referencing symbols in OpenMath and MathML

Monday, July 2nd, 2007

We are currently working on an aligned OpenMath/cMathML model for mathematical objects, based on the model for OpenMath objects. This will go into the MathML3 and OpenMath3 specifications due in spring. Afterwards we will not be able to change much for a long time, I expect, so we had better get this one right.

There has been some discussion about the OpenMath referencing triplet: a symbol (OMS in OM) has three attributes, a name, a cd, and a cdbase; e.g. the symbol for addition might be
<OMS cdbase="" cd="arith1" name="plus"/>

The cdbase and cd attributes determine a content dictionary (in this case a file), and the name attribute determines a symbol declaration in it (the name must be cd-unique).

In MathML3 we want to follow the same general model, but have the definitionURL attribute for specifying meaning. Here we would currently use the URL. There was some discussion about whether we should have one big CD for MathML or many small ones, … Sam Dooley remarked that if we were to use the OM triplet, then he would like to treat the cd attribute like a cdbase now, which inherits…; then we could write <apply cd="mathml">...<csymbol name="plus"/> ...</apply> (especially if we had one big CD for all MathML, then we could make the cd="mathml" a default on the <math> element…). Frankly, I find this quite attractive (after having thought about it).
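
To see what an inheriting cd attribute would mean in practice, here is a small Python sketch. It assumes the proposed syntax (csymbol with a name attribute, cd on any ancestor), which is not part of any released MathML version; the function name and the sample document are made up.

```python
import xml.etree.ElementTree as ET


def resolve_symbols(element, inherited_cd=None):
    """Yield (cd, name) for every csymbol, inheriting cd from ancestors."""
    # A cd attribute on this element overrides the inherited one.
    cd = element.get("cd", inherited_cd)
    if element.tag == "csymbol":
        yield (cd, element.get("name"))
    for child in element:
        yield from resolve_symbols(child, cd)


# Made-up example: a default cd on <math>, overridden on an inner <apply>.
doc = ET.fromstring(
    '<math cd="mathml"><apply><csymbol name="plus"/>'
    '<apply cd="arith1"><csymbol name="times"/></apply></apply></math>'
)
symbols = list(resolve_symbols(doc))
```

Here plus picks up the default cd from the math element, while times gets the cd from the nearest enclosing apply.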

I would like to take this idea a little further in MathML3: like MathML2, we use a single URI-type attribute for symbol referencing; let’s call it pref (path ref; just to distinguish it from definitionURL for this post, it could in the end become definitionURL to keep backwards compatibility; after all, MathML does not say what kind of URLs definitionURL should be; convenient).

So if we use pref attributes on csymbols and take xml:base into the picture, we can write

<csymbol pref=""/>

<math xml:base="">.... <csymbol pref="arith1/plus"/>...</math>

and even

<math xml:base="">
<apply xml:base="arith1">
<csymbol pref="plus"/>...</apply></math>

This would make a very simple framework. All the URIs can be used for RESTful access to the relevant features (symbol declarations in the CDs), and relative URIs work as expected. And if we write content dictionaries in a somewhat atomic way, then we can even supply them on a static web server. It would be quite simple to configure Apache so that it generates the right files; for instance, in the directory …/arith1 we could have ocd.php with the CD skeleton and inclusions for the symbol declarations, which are represented as files in the directory, e.g. arith1/plus.

That would make it quite simple to set up a structure that would make the cds meaningful.
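
The relative pref references in the examples above can be checked against standard URI resolution, here with Python's urllib.parse.urljoin. The base URL http://example.org/cds/ is a made-up placeholder, and resolve is a hypothetical helper, not part of any MathML tool.

```python
from urllib.parse import urljoin

# Hypothetical CD base; the real value would come from the outermost xml:base.
BASE = "http://example.org/cds/"


def resolve(pref, *bases):
    """Resolve pref against a chain of xml:base values, outermost first."""
    uri = ""
    for base in bases:
        uri = urljoin(uri, base)  # each inner base is relative to the outer one
    return urljoin(uri, pref)


full = resolve("arith1/plus", BASE)
nested = resolve("plus", BASE, "arith1/")
no_slash = resolve("plus", BASE, "arith1")
```

One detail to watch: under these resolution rules an inner base of "arith1" without a trailing slash resolves "plus" to a sibling of arith1 rather than to arith1/plus, so the nested form would probably need xml:base="arith1/" to behave as intended.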