CodeML competitors (or hopefuls)

March 24th, 2008 by kohlhase

I have just stumbled upon another justification (as in people having problems with the current state of the art) for our CodeML project: integrating code with syntax highlighting into presentations (and web pages, …), i.e. into situations where we do not have a suitable parser at hand but still want to change the appearance of the code and have access to its semantics and structure.

Submitting content to OMBase and logging

March 24th, 2008 by kohlhase

While I was reading up on the REST papers in my last post, I stumbled upon the following best practice for making sure that material is only submitted once to a RESTful application. This is something we should adopt in OMBase as well, just to be safe.

Another thing that we should think about in this arena is some form of RESTful logging facility, so that users can find out what happened to the content. The technology that seems best suited for that is RSS or Atom syndication (probably the latter). The nice thing is that we could log all the changes to any URI we use in the system. I am not sure under which URL we would address the log; one idea is to just use the MIME type application/atom+xml, just as for the XHTML presentation suggested in my last post. That would at least save us from having to choose a URL.
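To make the logging idea a bit more concrete, here is a minimal sketch in Python of what one log entry could look like as an Atom entry. Everything about it is an assumption on my part: the function name log_entry, the choice of fields, and the way the entry id is formed are made up for illustration, and nothing like this exists in OMBase yet. Only the Atom namespace and the element names come from the Atom format itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "http://www.w3.org/2005/Atom"   # namespace of the Atom syndication format
ET.register_namespace("", ATOM)        # serialize Atom as the default namespace


def log_entry(resource_uri, action, when):
    """Build one Atom entry recording a change to a resource (hypothetical layout)."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    # The id scheme (URI plus timestamp) is an assumption, not an OMBase convention.
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"{resource_uri}#{when.isoformat()}"
    ET.SubElement(entry, f"{{{ATOM}}}title").text = f"{action} {resource_uri}"
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = when.isoformat()
    ET.SubElement(entry, f"{{{ATOM}}}link", href=resource_uri)
    return entry


entry = log_entry(
    "http://cds.omdoc.org/arith1/lcm",
    "updated",
    datetime(2008, 3, 24, 12, 0, tzinfo=timezone.utc),
)
xml_text = ET.tostring(entry, encoding="unicode")
```

A log for a given URI would then just be an Atom feed collecting such entries, served when the client asks for application/atom+xml.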

Integrating Presentation into OMBase

March 24th, 2008 by kohlhase

I have just been reading up on REST again, since I found a very palatable pair of articles (REST intro, and practices). This got me thinking about the state of OMBase, and the integration of our presentation pipeline into the OMBase interface. OMBase is already RESTful in that we have MMT addressing via URIs implemented: you just use a GET to retrieve the addressed content.

What I have talked about with Florian, but maybe not with the OMBase team, is how to integrate presentation. That should be very simple from the interface point of view: we just take the same URLs, but send a different HTTP Accept header.

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/omdoc+xml

gives you the OMDoc file and

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/xhtml+xml

gives you the presented version (with the standard options). Now, we have written a paper about presentation and submitted it to MKM, and Christine has spent a lot of ingenuity on defining user options for the presentation process. These should be easy to integrate via the URI query interface:

GET /arith1/lcm?ext=foo.ntn&int=lang:ntn;style:physics
Host: cds.omdoc.org
Accept: application/xhtml+xml

That should do the trick.
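From the client side, the idea can be sketched in a few lines of Python. This only builds the requests and does not send them. The helper name build_request, the http scheme, and the option names passed in the usage lines are assumptions for illustration; the host, the path, and the two media types are the ones from the examples above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://cds.omdoc.org"  # host from the examples above; the scheme is assumed


def build_request(path, media_type, options=None):
    """Build (but do not send) a GET request for one representation of a resource."""
    url = BASE + path
    if options:
        # Keep ':' and ';' readable in the query string instead of percent-encoding them.
        url += "?" + urlencode(options, safe=":;")
    # The Accept header, not the URL, selects OMDoc source vs. presented XHTML.
    return Request(url, headers={"Accept": media_type}, method="GET")


# Same URL, two representations.
source = build_request("/arith1/lcm", "application/omdoc+xml")
presented = build_request(
    "/arith1/lcm",
    "application/xhtml+xml",
    {"ext": "foo.ntn", "int": "lang:ntn;style:physics"},  # placeholder option names
)
```

The point of the design is that the resource keeps a single address; the Accept header picks the representation, and the query string only carries the user's presentation options.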

MathML Support in Firefox 3 beta 4.

March 16th, 2008 by kohlhase

I have just installed the new Firefox 3 beta 4. I have been using betas of Firefox 3 for a while, and have been enthusiastic about the release, but always had to keep a Firefox 2 copy around for viewing MathML. Without having tested this extensively, I would say that in beta 4 the level of MathML support is up to the level of Firefox 2, which allows me to make the transition to Firefox 3 fully. For some reason I do not fully understand, it seems that the font problems I was having with FF2 have also gone away.

To test FF3, I have looked at our MathML version of the Cornell eprint arXiv and the results are really impressive.

Ontology repair in Physics

February 21st, 2008 by kohlhase

I am just sitting in the CIAO workshop, and Alan Bundy and Michael Chan are talking about a very nice topic: the evolution of ontologies in Physics. They are applying this to historical examples like the latent heat problem and the MOND theory that is hot in Physics at the moment. The idea is that when experiments contradict theory, there is a clash between the theory ontology Ot and the sensory ontology Os, which they resolve by renaming apart selected concepts between the two ontologies. So they change the ontologies by renaming. The nice thing is that they can interpret the operation of renaming as a conservative theory extension, which gives a nice interpretation of minimal theory change/repair.

You can find the details here.

Even though I totally buy into their observations, I think that it would be better to keep the theories as they are and interpret the repair operations as theory morphisms. That would be a non-destructive operation, and the operations would become very natural theory morphisms.

Success Rates in the arXMLiv project

December 9th, 2007 by kohlhase

I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…

We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have over 50% “success rate”. I have been asked by Aaron Krowne what success rate means and when we are going to reach 100%. Here is the story.

First I would like to briefly talk about what we are doing. We are running Bruce Miller’s LaTeXML converter over the ca. 370,000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). With 370,000 LaTeX documents from all over the world, written over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.

The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergrads is helping with this. Since we are still in development mode, we only download the previous year’s collection of articles after New Year (about 80,000+ new ones in a couple of weeks).

Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transformation is mathematically correct is another matter: this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.
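To pin the definition down, here is a minimal sketch in Python. It assumes we already have, for each document, the number of errors found in its conversion log; the function name, the dictionary layout, and the sample numbers are hypothetical and are not taken from our test harness or from real arXMLiv data.

```python
def success_rate(error_counts):
    """Fraction of documents whose conversion log reports zero errors."""
    if not error_counts:
        return 0.0
    successes = sum(1 for n in error_counts.values() if n == 0)
    return successes / len(error_counts)


# Hypothetical per-document error counts, not real arXMLiv data.
sample = {"doc-a": 0, "doc-b": 3, "doc-c": 0, "doc-d": 1}
rate = success_rate(sample)  # two of the four documents converted without errors
```

Note that a document counts as a success here as soon as no error is logged, which says nothing yet about whether the resulting MathML is mathematically correct.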

 

And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be successfully processed by LaTeX, since arXiv does not have a viable package management system.

Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca. 80,000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10 percentage points to 50% when we started a big new category (we had been at about 60% before).

My personal suspicion is that we will reach 70% in the next three months; then the going will become slower, and I am not sure how far beyond 80% we will realistically get any time soon with the resources we have (a couple of undergrads). To get beyond that, we would have to take the project global, which I would not mind, but which I am not necessarily seeing as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.

More Scoop Musing

August 30th, 2007 by kohlhase

In the second invited talk, Toby White is talking about SciSpace, an experiment in social-software-mediated collaborative scientific research.

The main thrust of the intro is that a new kind of scientific practice is emerging, e.g. in the environmental sciences. This involves massive cross-institutional collaboration of scientists and programs. The problem in collaboration is not a lack of communication: we have giant bandwidth, but making sense of what is communicated is the problem. Just managing e-mail discussions across multiple interlocutors is almost unworkable (think of adding a person to a long thread). In particular, you are interested in the history of the project, and that is extremely hard to extract from the discussions, since they are multi-threaded and distributed.

Toby and some colleagues decided that they needed something like a scientific Facebook. SciSpace is like MySpace, but for scientists. The logic is trivial to implement as a system, but it is very hard to make it look nice and easy to use. They are using an open-source social networking framework called ELGG out of Oxford. SciSpace has about 100-200 users, about 30 of them active.

Toby claims that the nice thing about SciSpace is that you kind of know which people you are blogging to, and you can just keep up with what your colleagues or your boss are doing and what you may contribute to.

This would be a great thing to integrate with Panta-Rhei; maybe we can even re-implement that system in ELGG. I really wonder whether they have some kind of repository feature. Toby tells us that there are wikis, but they are not very well-integrated, at least not in the same cool way.

SCOOP 2

August 30th, 2007 by kohlhase

This is about the talk by German Nemirovskij (Fachhochschule Albstadt-Sigmaringen) on Semantic Document Annotation for Global Search on Study Programs (e.g. for a semester abroad). In the SWAPS project they are looking at the Bologna module descriptions of the European documents; these have a similar structure, so they can be screen-scraped into a database. Applications: module search, personalized search, and comparison. For CoPs the second is interesting, since (German claims) CoPs can be used to tweak this.

They want to reach semantic search by doing “search wrt. Ontologies”.

Problems:

  1. how to populate Ontologies (is this only ABox population?)
  2. how to index Ontologies for search

ad 1: From the layout, scrape attribute-value pairs, then semantically annotate document fragments (the values of the attributes). It seems that this is ABox population, and it is still quite at the beginning.

SCOOP workshop (Communities of Practice)

August 30th, 2007 by kohlhase

I am sitting in the SCOOP workshop of the JEM Network, which is really shaping up nicely: we have the MKM people meeting with education and social software guys. I will blog a couple of impressions from the KWARC angle.

The discussion is quite stimulating.

Ralf Klamma (RWTH Aachen) gives an intro to Community Information Systems and claims that the constitutive features of CoPs are:

Mutual Engagement (ME): “You have to know which community you are belonging to”; I am a little sceptical whether this is really true for CoPs in Science which are very distributed, and may even be disconnected.

Joint Enterprise (JE): There is something you want to do together, and you want to learn to do it better. This is at the center of Wenger’s definition of CoPs. We have been neglecting this in our KWARC models here, or taking it for granted. We need to think more about this.

Joint Resources (JR): This is really where our MKM paper sits, and I have the feeling that we have something to bring to the table here. Klamma is interested in multi-medial theories. I must say that with the OMDoc approach, we are interested in an omni-medial approach (OMDoc as an omnipresent semantic medium that covers all). The idea here is that the content markup allows us to generate multiple medial representations from this source, and any medium can be marked up in OMDoc. So maybe this is compatible.

Klamma also talks about a cross-media theory of transcription that sounds interesting (Jäger, Stanitzek: Transkribieren - Medien/Lektüre, 2002). The gist of it seems to be that events (e.g. historical) and objects are transcribed across media (e.g. to OMDoc or SciML). So we only have access to the media trace, not the event itself (it is long gone). I wonder what this theory predicts; it seems compatible with what we are doing.

A great example: The Babylonian Talmud has been transcribed to an XML markup, where you can annotate relations. Then the text can be accessed as semantic hypertext. One effect was that Talmud students were asking tougher questions earlier. That is very encouraging. I wonder if the sources are available for this, how an OMDoc version of the Talmud would fare, and how much of the structure could be transferred into the CD-based structure we claim to be so essential.

Disambiguation of Mathematical Text

August 17th, 2007 by kohlhase

Oooops, this is a left-over draft from MKM

…..

Claudio Sacerdoti Coen (HELM group in Bologna) is talking about disambiguation. It seems that he has really nailed down most of the practical aspects of the problem.

When types do not help, we have to ask the user, and he is trying to do this with the least nuisance. He defines the notion of a spurious error (most errors are spurious), as opposed to the “real errors”. As always it is great to hear him talk. I wonder what information he needs for the algorithm: is what we have in the new OMDoc presentation system enough?

He even has a correctness proof. I want a demo.