Archive for the ‘kohlhase’ Category

DITA/OMDoc Compatibility (or topic-based writing in OMDoc)

Friday, April 9th, 2010

When I was at the WritersUA conference before Easter, the compatibility (and transformation) between DITA (as a topic-centered format) and DocBook (as a narrative one) was one of the topics of wider interest. In OMDoc we have always maintained that we can follow both the topic-centered approach (which is quite natural for mathematical texts and indeed for wiki-based approaches like the one in SWiM) and the narrative one. So I got thinking about how we would really do the topic-centered approach in OMDoc.

When I was reading Christine Müller’s Ph.D. thesis, which looked at the integration of topic-based and narrative writing styles, I noticed that she says that OMDoc does not have support for topic-style writing. I think that this is wrong. Taking her example (slightly simplified)

<concept id="A.dita">
 <title>Natural Numbers</title>
 <conbody>
 <p>The set of <term>natural numbers</term>
 defined <cite>here</cite> or in <xref href="nat.dita#nat1"/>.
 </p>
 <para conref="topic/p2"/>
 </conbody>
 <related-link>http://example.com/nats.html</related-link>
</concept>

it is obviously directly expressible in OMDoc as

<omdoc>
 <omgroup type="concept" xml:id="A.dita">
 <metadata>
 <dc:title>Natural Numbers</dc:title>
 <link rel="dita:related-link" resource="http://example.com/nats.html"/>
 </metadata>
 <omgroup type="conbody">
 <omtext>
 <CMP>
 <p>The set of <phrase role="term">natural numbers</phrase>
 defined <cite>here</cite> or in <ref type="cite" href="nat.dita#nat1"/>.
 </p>
 </CMP>
 </omtext>
 <ref href="topic/p2" type="include"/>
 </omgroup>
 </omgroup>
</omdoc>

(again slightly simplified; I am leaving out the relevant namespace declarations). It should be directly obvious that we can define an OMDoc sublanguage that is isomorphic to DITA. Indeed I think that this is an exercise that would be worth doing. After all, there was a message from Bryce Nordgren about opening up a Math domain in DITA (see http://openmath.org/pipermail/om/2009-February/001203.html for details), which could use this isomorphism as a guiding light.
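To make the correspondence a bit more tangible, here is a minimal Python sketch that mechanically rewrites the simplified DITA concept above into the OMDoc shape shown. It is only an illustration of the mapping in this post, not a real DITA-to-OMDoc converter: the function names are made up, namespace declarations are left out as above (so dc:title is just written as a literal tag name), and only the handful of elements from the example are handled.

```python
import copy
import xml.etree.ElementTree as ET

# The simplified DITA example from this post.
DITA_CONCEPT = """
<concept id="A.dita">
 <title>Natural Numbers</title>
 <conbody>
  <p>The set of <term>natural numbers</term>
  defined <cite>here</cite> or in <xref href="nat.dita#nat1"/>.
  </p>
  <para conref="topic/p2"/>
 </conbody>
 <related-link>http://example.com/nats.html</related-link>
</concept>
"""

# ElementTree's notation for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def convert_paragraph(p):
    """Wrap a DITA <p> in omtext/CMP, renaming <term> and <xref>."""
    p = copy.deepcopy(p)
    for el in p.iter():
        if el.tag == "term":
            el.tag = "phrase"
            el.set("role", "term")
        elif el.tag == "xref":
            el.tag = "ref"
            el.set("type", "cite")
    omtext = ET.Element("omtext")
    cmp_el = ET.SubElement(omtext, "CMP")
    cmp_el.append(p)
    return omtext


def concept_to_omdoc(concept):
    """Translate one (simplified) DITA concept into an OMDoc tree."""
    omdoc = ET.Element("omdoc")
    group = ET.SubElement(
        omdoc, "omgroup", {"type": "concept", XML_ID: concept.get("id")}
    )
    # Title and related links become metadata of the topic-level omgroup.
    metadata = ET.SubElement(group, "metadata")
    title = ET.SubElement(metadata, "dc:title")  # prefix left undeclared, as in the post
    title.text = concept.findtext("title")
    for link in concept.findall("related-link"):
        ET.SubElement(
            metadata,
            "link",
            {"rel": "dita:related-link", "resource": link.text.strip()},
        )
    # The body becomes a nested omgroup; conrefs become include references.
    body = ET.SubElement(group, "omgroup", {"type": "conbody"})
    for child in concept.find("conbody"):
        if child.get("conref") is not None:
            ET.SubElement(body, "ref", {"href": child.get("conref"), "type": "include"})
        elif child.tag == "p":
            body.append(convert_paragraph(child))
    return omdoc


omdoc = concept_to_omdoc(ET.fromstring(DITA_CONCEPT))
print(ET.tostring(omdoc, encoding="unicode"))
```

The point of the sketch is that every rule is a local renaming or re-nesting, which is what one would expect if the two formats really are isomorphic on this fragment.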

Of course DITA not only has topics, but also topic maps; let me again use an example from Christine’s thesis.

<map title="title">
 <topichead navtitle="navi-title" audience="math"/>
 <topicref href="A.dita" collection-type="sequence">
 <topicref href="A1.dita"/>
 <topicref href="A2.dita"/>
 </topicref>
 <reltable>
 <relrow>
 <relcell>A.dita</relcell>
 <relcell>B.dita</relcell>
 </relrow>
 </reltable>
</map>

The first part of this map is just what we have always thought of as a narrative structure in our NarCon approach in OMDoc. So we can directly represent it as something like

<omdoc>
 <metadata>
 <dc:title>title</dc:title>
 <link rel="dita:audience" resource="something:math"/>
 <link rel="dita:navtitle" resource="navi-title"/>
 </metadata>
 <omgroup xml:id="A.narrative" type="sequence">
 <ref type="include" href="A1.omdoc"/>
 <ref type="include" href="A2.omdoc"/>
 </omgroup>
</omdoc>

I must confess that I do not really understand what the href on the top-level topicref means, so I have left it out. Note that I am only interested in the general compatibility of the formats and not the details of the translation, which will have to be worked out. That leaves us with the reltable, which (as far as I can understand it) is a way to specify cross-references that is a better alternative to <related-links>, since it is more portable and attached to DITA maps (which we can think of as discourse-level presentations of the content structure given by the graph of DITA topics). So I would just add the following metadata section to the <omgroup> element:

<metadata>
 <link rel="dita:related-link" resource="http://example.com/nats.html"/>
</metadata>

OK, that ends our little comparison exercise. There are a couple of conclusions I would like to draw from this:

  1. OMDoc can do topic-oriented writing quite nicely
  2. the OMDoc1.3-style metadata help significantly
  3. rather than develop a DITA ontology (hinted at with the dita: namespace prefixes) we should develop ontologies that describe the various aspects of topic-based writing in generality and find the respective markup primitives. For instance, dita:audience seems weird; there must be an ontology in the eLearning realm that already formalizes this.
  4. The OMDoc-1.6 idea of leaving out the <metadata> element and freely intermixing the metadata <link>, <resource> and <meta> with the OMDoc content will make the translation much simpler and direct, e.g. for the <reltable> and <related-link> elements from DITA which are situated at the end in the original.

OK, that is all I have to say at the moment, please give me feedback.

A MathSearch Competition

Sunday, July 27th, 2008

In my last post I just learned about a new search engine. We should really have a competition and example library for Math Search Engines. We talked about this some years back but we really need to get our act together, probably for the next MKM.

I can see five tasks that we have to accomplish for a competition:

  1. collect a search corpus. It seems that the arXiv would be the right thing to start from here; it is big enough to pick competition examples randomly.
  2. cooperate on an analysis pipeline and corpora. This would allow people to cooperate without having a full analysis pipeline.
  3. collect a corpus of search queries. This may be the biggest hurdle, since we need a gold standard of what we expect the hits to be.
  4. come up with “divisions”. Not all engines can do the same things, so we should only let comparable engines compete; multiple divisions will also allow us to award multiple trophies.
  5. build a competition harness. So that tests can be automated. This will also require and thus lead to general search APIs.
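To give an idea of what the harness in task 5 could look like, here is a small Python sketch that scores engines against a gold standard by precision and recall over the top k hits. Everything in it is hypothetical: the engine names, the document identifiers, the toy queries, and the choice of precision/recall as the measure are assumptions for illustration, not an agreed competition format.

```python
from typing import Callable, Dict, List, Set, Tuple


def precision_recall(hits: List[str], relevant: Set[str], k: int = 10) -> Tuple[float, float]:
    """Precision and recall of the top-k hits against the gold-standard set."""
    top = hits[:k]
    found = sum(1 for doc in top if doc in relevant)
    precision = found / len(top) if top else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


def run_competition(
    engines: Dict[str, Callable[[str], List[str]]],
    gold: Dict[str, Set[str]],
    k: int = 10,
) -> Dict[str, Tuple[float, float]]:
    """Run every query through every engine; return mean (precision, recall)."""
    scores = {}
    for name, search in engines.items():
        results = [precision_recall(search(query), rel, k) for query, rel in gold.items()]
        n = len(results)
        scores[name] = (
            sum(p for p, _ in results) / n,
            sum(r for _, r in results) / n,
        )
    return scores


# Hypothetical gold standard: query -> set of documents we expect as hits.
GOLD = {
    "x^2+y^2": {"doc1", "doc2"},
    "lcm(a,b)": {"doc3"},
}

# Two toy "engines", each just a lookup table standing in for a search API.
EXACT = {"x^2+y^2": ["doc1"], "lcm(a,b)": ["doc3"]}
NOISY = {"x^2+y^2": ["doc1", "doc9"], "lcm(a,b)": ["doc8"]}

ENGINES = {
    "exact": lambda q: EXACT.get(q, []),
    "noisy": lambda q: NOISY.get(q, []),
}

scores = run_competition(ENGINES, GOLD)
for name, (p, r) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The only thing an engine has to provide here is a function from a query string to a ranked list of document identifiers, which is roughly the kind of general search API that the harness would force us to agree on.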

This is all I can think about at the moment, so give me your feedback.

egomath search engine talk at DML Workshop

Sunday, July 27th, 2008

I am sitting in the DML (Digital Mathematical Libraries) workshop in Birmingham listening to Jozef Misutka’s talk on his search engine.

It is surprising how many math search engines are out there; this project was started in 2004, and I had not really known about it. Jozef also uses an existing search engine for indexing, but he does a syntactic analysis before he indexes, at least for formulae. This is the central part of his talk. He tries to deduce the correct meaning from the input, which seems to be PDF.

Steps:

  1. Normalization (heuristic),
  2. linearization (since his search engine works on strings/words)
  3. partial evaluation (e.g. with distributivity)
  4. generalization (introduction of variables in the index)
  5. ordering (for commutative operators)
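To illustrate two of these steps, here is a minimal Python sketch of ordering (for commutative operators) followed by linearization into a string that a word-based index could store. This is my own toy reconstruction, not Jozef's implementation: the tuple representation of formulae, the set of commutative operators, and the prefix string format are all assumptions.

```python
# Formulae are nested tuples (operator, arg1, arg2, ...); leaves are strings.
COMMUTATIVE = {"+", "*"}  # assumed set of commutative operators


def linearize(expr):
    """Turn a formula tree into a flat prefix string for a text index."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return "(" + " ".join([op] + [linearize(a) for a in args]) + ")"


def order(expr):
    """Recursively sort the arguments of commutative operators."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [order(a) for a in args]
    if op in COMMUTATIVE:
        # Any fixed total order works; here we sort by the linearized form.
        args.sort(key=linearize)
    return (op, *args)


# Two spellings of the same formula: y + x*2 and 2*x + y.
a = ("+", "y", ("*", "x", "2"))
b = ("+", ("*", "2", "x"), "y")

print(linearize(order(a)))
print(linearize(order(b)))
```

After ordering, both spellings yield the same index string, so a query written one way finds a formula written the other way. Subtraction and division are deliberately not in the commutative set, so their argument order is preserved.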

This seems to be an attempt to get semantics into the search, i.e. E-retrieval, but do we have clear information about the equivalence relation E?

I think that the most important contribution here is probably the analysis phase, since he extracts formulae from something as poor as PDF.

Interesting

CodeML competitors (or hopefuls)

Monday, March 24th, 2008

I have just stumbled upon another justification (as in people having problems with the current state of the art) of our CodeML project: integrating code with syntax highlighting into presentations (and web pages, …), i.e. into situations where we do not have a suitable parser at hand, but still want to change the appearance of the code and have access to the semantics and structure.

Submitting content to OMBase and logging

Monday, March 24th, 2008

While I was reading up on the REST papers in my last post, I stumbled upon the following best practice for making sure that material is only submitted once to a RESTful application. This is something we should adopt in OMBase as well, just to be safe.

Another thing that we should think of in this arena is to enable some form of RESTful logging facility, so that users can find out what happened to the content. The technology best suited for that seems to be RSS or Atom Syndication (probably the latter). The nice thing is that we could log all the changes to any URI we use in the system. I am not sure under which URL we would address the log; one idea is to just make use of the MIME type application/atom+xml, just as for the XHTML presentation suggested in my last post, which would at least spare us the choice of URL.

Integrating Presentation into OMBase

Monday, March 24th, 2008

I have just been reading up on REST again, since I found a very palatable pair of articles (REST intro, and practices). This got me thinking about the state of OMBase, and the integration of our presentation pipeline into the OMBase interface. It is RESTful, since we have MMT addressing via URIs implemented. You just use a GET to retrieve them.

What I have talked with Florian about, but maybe not with the OMBase team, is how to integrate presentation. That should be very simple from the interface point of view: we just take the same URLs, but a different HTTP header.

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/omdoc+xml

gives you the OMDoc file and

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/xhtml+xml

gives you the presented version (with the standard options). Now, we have written a paper about presentation and submitted it to MKM, and Christine has spent a lot of ingenuity on defining user options for the presentation process. This should be easy to integrate with the URI query interface:

GET /arith1/lcm?ext=foo.ntn&int=lang:ntn;style:physics
Host: cds.omdoc.org
Accept: application/xhtml+xml

That should do the trick.
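From the client side, the idea could be sketched as follows in Python. The sketch only builds the requests and does not send them, since this is a proposal and not a description of a running service. The http scheme, the helper name build_request, and the query parameter names ext and int are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://cds.omdoc.org"  # assumed scheme; the host is the one from the examples


def build_request(path, media_type, options=None):
    """Build a GET request for one resource in a given representation.

    The same path is used for every representation; only the Accept
    header (and optionally the presentation options) differ.
    """
    url = BASE + path
    if options:
        # Keep ':' and ';' readable in the query, as in the example above.
        url += "?" + urlencode(options, safe=":;")
    return Request(url, headers={"Accept": media_type}, method="GET")


# The OMDoc source and its presented version live under the same URL.
source = build_request("/arith1/lcm", "application/omdoc+xml")
presented = build_request(
    "/arith1/lcm",
    "application/xhtml+xml",
    {"ext": "foo.ntn", "int": "lang:ntn;style:physics"},  # hypothetical options
)

for req in (source, presented):
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing one of these objects to urllib.request.urlopen would actually perform the request; whether the server honours the Accept header in this way is exactly the design question discussed here.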

MathML Support in Firefox 3 beta 4.

Sunday, March 16th, 2008

I have just installed the new Firefox 3 beta 4. I have been using betas of Firefox 3 for a while, and have been enthusiastic about the release, but always had to keep a Firefox 2 copy around for viewing MathML. Without having tested this extensively, I would say that in beta 4, the level of MathML support is up to the level of Firefox 2, which allows me to make the transition to Firefox 3 fully. For some reason I do not fully understand, it seems that the font problems I was having with FF2 have also gone away.

To test FF3, I have looked at our MathML version of the Cornell eprint arXiv and the results are really impressive.

Ontology repair in Physics

Thursday, February 21st, 2008

I am just sitting in the CIAO workshop, and Alan Bundy and Michael Chan are talking about a very nice topic: the evolution of ontologies in Physics. They are applying this to historical examples like the latent heat problem and the MOND theory that is hot in Physics at the moment. The idea is that when experiments contradict theory, there is a clash between the theory ontology Ot and the sensory ontology Os, which they solve by renaming apart selected concepts between the ontologies to resolve the contradiction. So they change the ontologies by renaming. The nice thing is that they can interpret the operation of renaming as a conservative theory extension, which gives a nice interpretation of minimal theory change/repair.

You can find the details here.

Even though I totally buy into their observations, I think that it would be better to keep the theories as they are and interpret the repair operations as theory morphisms. That would be a non-destructive operation, and the operations would become very natural theory morphisms.

Success Rates in the arXMLiv project

Sunday, December 9th, 2007

I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…

We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have over 50% “success rate”. I have been asked by Aaron Krowne what success rate means and when we are going to reach 100%. Here is the story.

First I would like to briefly talk about what we are doing. We are using Bruce Miller’s LaTeXML converter over the ca. 370,000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). At 370,000 LaTeX documents from all over the world over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.

The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergrads is helping with this. Since we are still in development mode, we only download last year’s collection of articles after New Year (about 80,000+ new ones in a couple of weeks).

Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transform is mathematically correct is another matter; this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.

 

And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be successfully processed by LaTeX, since arXiv does not have a viable package management system.

Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca. 80,000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10 percentage points to 50% when we started a big new category (we had been at about 60% before).

My personal suspicion is that we will reach 70% in the next three months; then the going will become slower, and I am not sure how far beyond 80% we will realistically get any time soon with our resources (a couple of undergrads). To get further, we would have to take the project global, which I would not mind, but which I am not necessarily seeing as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.

More Scoop Musing

Thursday, August 30th, 2007

In the second invited talk, Toby White is talking about SciSpace, an experiment in social-software-mediated collaborative scientific research.

The main thrust of the intro is that a new kind of scientific practice is emerging, e.g. in the environmental sciences. This involves massive cross-institutional collaboration of scientists and programs. The problem in collaboration is not the lack of communication: we have giant bandwidth, but understanding it is the problem. Just managing e-mail discussions across multiple interlocutors is almost unworkable (think of adding a person to a long one). In particular, you are interested in the history of the project, and that is extremely hard to extract from the discussions, since it is multi-threaded and distributed.

Toby and some colleagues decided that they need something like a scientific Facebook. SciSpace is like MySpace, but for scientists. The logic is trivial to implement as a system, but it is very hard to get it to look nice and be easy to use. They are using an open-source social networking framework called ELGG out of Oxford. SciSpace has about 100-200 users, of whom about 30 are active.

Toby claims that the nice thing about SciSpace is that you kind of know which people you are blogging to; you can just keep up with what your colleagues or boss are doing, and what you may contribute to.

This would be a great thing to integrate with Panta-Rhei; maybe we can even re-implement that system in ELGG. I really wonder whether they have some kind of repository feature. Toby tells us that there are wikis, but they are not very well-integrated, at least not in the same cool way.