I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…
We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have reached a “success rate” of over 50%. Aaron Krowne has asked me what “success rate” means and when we are going to reach 100%. Here is the story.
First, I would like to briefly describe what we are doing. We are running Bruce Miller’s LaTeXML converter over the ca. 370,000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). With 370,000 LaTeX documents written all over the world over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.
The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergraduates is helping with this. Since we are still in development mode, we only download the previous year’s collection of articles after New Year (about 80,000 new ones in a couple of weeks).
Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transformation is mathematically correct is another matter: this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.
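To make the definition concrete, here is a minimal sketch of how such a success rate could be computed from conversion logs. It is not the actual test harness. The message prefixes `Error:` and `Fatal:`, the function names, and the sample log lines are all assumptions made for illustration, and the real log format may differ.

```python
from typing import Dict, Iterable, List

# Assumed prefixes that mark a failed conversion; the real logs may differ.
FAILURE_PREFIXES = ("Error:", "Fatal:")


def is_success(log_lines: Iterable[str]) -> bool:
    """A document counts as a success if no log line reports an error."""
    return not any(
        line.lstrip().startswith(FAILURE_PREFIXES) for line in log_lines
    )


def success_rate(logs: Dict[str, List[str]]) -> float:
    """Fraction of documents whose logs contain no error lines."""
    if not logs:
        return 0.0
    successes = sum(1 for lines in logs.values() if is_success(lines))
    return successes / len(logs)


if __name__ == "__main__":
    # Hypothetical log excerpts, one list of lines per document.
    sample_logs = {
        "paper-a": ["Conversion complete"],
        "paper-b": ["Error:undefined:\\mymacro is not defined"],
        "paper-c": ["Warning: minor issue", "Conversion complete"],
        "paper-d": ["Fatal: too many errors"],
    }
    print(f"success rate: {success_rate(sample_logs):.0%}")
```

Note that warnings do not count against a document here, and that a document passing this check may still be mathematically wrong, which is exactly the part that needs human testers.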
And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be processed successfully by LaTeX itself, since arXiv does not have a viable package management system.
Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca. 80,000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10 percentage points to 50% when we started a big new category (we had been at about 60% before).
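The 80/20 prioritization can be sketched as follows. This is only an illustration, not how we actually pick packages. The function name `packages_to_cover` is hypothetical and the usage counts are invented. It also simplifies matters by counting package uses rather than papers, whereas a real paper only converts if all of its packages have bindings.

```python
from typing import Dict, List


def packages_to_cover(usage: Dict[str, int], target: float = 0.8) -> List[str]:
    """Return the most-used packages whose combined uses reach `target`
    (a fraction between 0 and 1) of all recorded package uses."""
    total = sum(usage.values())
    if total == 0:
        return []
    chosen: List[str] = []
    covered = 0
    # Most-used first; ties are broken alphabetically for stable output.
    for name, count in sorted(usage.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        chosen.append(name)
        covered += count
    return chosen


if __name__ == "__main__":
    # Invented counts, for illustration only.
    usage = {
        "amsmath": 500,
        "graphicx": 300,
        "revtex4": 150,
        "rare-package-x": 30,
        "rare-package-y": 20,
    }
    print(packages_to_cover(usage))
```

With a long tail like the one in the arXiv, a short list of bindings covers most uses, and each further binding buys less and less.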
My personal suspicion is that we will reach 70% in the next three months, then the going will become slower, and I am not sure how far beyond 80% we can realistically get any time soon with our current resources (a couple of undergrads). To get there, we would have to take the project global, which I would not mind, but which I do not necessarily see as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.