Archive for the ‘semantic documents’ Category

Microdata vs. RDFa – What does it mean to us?

Wednesday, October 28th, 2009

Only today I became aware of microdata, the proposed way of embedding semantic annotations into HTML5. (Yes, they adopted the syntax that Michael also prefers for OMDoc, and which I personally hate, but I will get used to it.) Microdata are not to be confused with microformats, a poor man’s way of annotation that (ab)uses CSS classes and thus is compatible with HTML 4. Microdata are something like RDFa but

  1. are slightly easier to use for people who don’t understand XML namespaces
    • granted, RDFa’s excessive reliance on XML namespaces makes it hard to parse, and makes it unbearably complex to copy/paste a fragment, which is an important use case for HTML5
  2. allow for ad hoc pseudo-semantic markup when you do not use an ontology
    • What’s the point in annotating at all, then?
  3. are compatible with the non-XML syntax of HTML5 (which should have been ditched IMHO, but, well, in the interest of reactionary users and software, they decided differently)
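To make the copy/paste point concrete, here is a minimal Python sketch. The two fragments, the example.org-free vocabulary choice (Dublin Core `dc:title`), and the helper `expand` are my own illustration and not taken from either specification; prefix scoping is flattened for brevity. It shows that an RDFa property name can only be expanded if the `xmlns:` declaration travels with the fragment, whereas a plain microdata property name survives being copied on its own.

```python
from html.parser import HTMLParser

# Hypothetical fragments annotating the same fact (a title).
RDFA_FULL = (
    '<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="#lcm">'
    '<span property="dc:title">Least common multiple</span></div>'
)
RDFA_PASTED = '<span property="dc:title">Least common multiple</span>'
MICRODATA_PASTED = '<span itemprop="title">Least common multiple</span>'


class AnnotationParser(HTMLParser):
    """Collect prefix declarations and property names (scoping is ignored)."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}
        self.properties = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value
            elif name in ("property", "itemprop"):
                self.properties.append(value)


def expand(html):
    """Expand each property name; None marks an undeclared prefix."""
    parser = AnnotationParser()
    parser.feed(html)
    parser.close()
    expanded = []
    for prop in parser.properties:
        prefix, colon, local = prop.partition(":")
        if not colon:
            expanded.append(prop)  # plain name, nothing to resolve
        elif prefix in parser.prefixes:
            expanded.append(parser.prefixes[prefix] + local)
        else:
            expanded.append(None)  # the declaration was left behind
    return expanded
```

With these inputs, `expand(RDFA_FULL)` resolves the property to a full URI, while `expand(RDFA_PASTED)` cannot, because the prefix declaration stayed on the enclosing element.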

The fight for the future of RDFa in HTML is going on, but what does that mean to KWARC? We have incorporated RDFa into OMDoc as a means of extending the metadata vocabularies. RDFa, originally designed for XHTML, is prepared for being integrated into any XML language, including OMDoc. HTML5 microdata are an integral part of the HTML5 specification and would not work in other XML languages. OK, but we present OMDoc documents as HTML to make them human-readable. In this output, we want to preserve the semantics of the OMDoc markup, and for that we had always been thinking about using RDFa. (We know exactly how to do it; we just have not implemented that step yet.) We could use HTML5 microdata instead, but:

  1. RDFa has little software support so far, but microdata have none (beyond proofs of concept)
  2. We generate XML-compliant HTML. The non-XML syntax of HTML5 supports embedded MathML, but I doubt that it will support parallel OpenMath markup, where elements from yet another namespace are embedded into the MathML formulae.
  3. We generate HTML. The embedded annotations need not be authored manually, so they do not have to be easy to author.
  4. We are interested in using well-defined ontologies to express semantics, so we don’t need ad hoc “semantic” markup.

What do you think?

Submitting content to OMBase and logging

Monday, March 24th, 2008

While I was reading up on the REST papers in my last post, I stumbled upon the following best practice for making sure that material is only submitted once to a RESTful application. This is something we should adopt in OMBase as well, just to be safe.

Another thing that we should think about in this arena is some form of RESTful logging facility, so that users can find out what happened to the content. The technology best suited for that seems to be RSS or Atom Syndication (probably the latter). The nice thing is that we could log all the changes to any URI we use in the system. I am not sure under which URL we would address the log; one idea is to use the MIME type application/atom+xml, just as for the XHTML presentation suggested in my last post. That would at least spare us the choice of URL.
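As a rough illustration of what such a log could look like, here is a Python sketch that serializes a list of changes to one resource as an Atom feed. The function name `build_change_log`, the entry id scheme, and the (timestamp, summary) input format are assumptions of mine, not anything OMBase defines; only the Atom namespace and element names come from the Atom format itself.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_change_log(resource_uri, changes):
    """Return an Atom feed (as a string) logging changes to one resource.

    `changes` is a chronological list of (timestamp, summary) pairs with
    RFC 3339 timestamps; this layout is illustrative only.
    """
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = f"Change log for {resource_uri}"
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = resource_uri
    if changes:
        # The feed is as fresh as its most recent entry.
        ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = changes[-1][0]
    for index, (timestamp, summary) in enumerate(changes):
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = summary
        # Hypothetical id scheme: resource URI plus a running change number.
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"{resource_uri}#change-{index}"
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = timestamp
        ET.SubElement(entry, f"{{{ATOM_NS}}}link").set("href", resource_uri)
    return ET.tostring(feed, encoding="unicode")


log = build_change_log(
    "http://cds.omdoc.org/arith1/lcm",
    [("2008-03-24T10:00:00Z", "created"), ("2008-03-24T12:30:00Z", "definition revised")],
)
```

A server could return this string whenever a GET on the resource URI asks for application/atom+xml.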

Integrating Presentation into OMBase

Monday, March 24th, 2008

I have just been reading up on REST again, since I found a very palatable pair of articles (REST intro, and practices). This got me thinking about the state of OMBase, and the integration of our presentation pipeline into the OMBase interface. OMBase is RESTful, since we have implemented MMT addressing via URIs: you just use a GET to retrieve the documents.

What I have talked with Florian about, but maybe not with the OMBase team, is how to integrate presentation. That should be very simple from the interface point of view: we just take the same URLs, but a different HTTP header.

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/omdoc+xml

gives you the OMDoc file and

GET /arith1/lcm
Host: cds.omdoc.org
Accept: application/xhtml+xml

gives you the presented version (with the standard options). Now, we have written a paper about presentation and submitted it to MKM, and Christine has spent a lot of ingenuity on defining user options for the presentation process. This should be easy to integrate with the URI query interface:

GET /arith1/lcm?ext=foo.ntn&int=lang:ntn;style:physics
Host: cds.omdoc.org
Accept: application/xhtml+xml

That should do the trick.
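On the server side, the dispatch could look roughly like the following Python sketch. The helper names, the internal labels "omdoc" and "presented", and the reading of the query string (an `ext` parameter naming a notation file and an `int` parameter holding `name:value` options separated by semicolons) are my assumptions based on the example requests, not an OMBase API. Quality values in the Accept header are ignored.

```python
# Media types from the example requests, mapped to hypothetical internal labels.
REPRESENTATIONS = {
    "application/omdoc+xml": "omdoc",
    "application/xhtml+xml": "presented",
}


def choose_representation(accept_header):
    """Return the label of the first acceptable media type we can serve."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()  # drop parameters such as q=0.9
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return None  # nothing we can serve


def parse_presentation_query(query):
    """Split the query string into the notation file and a dict of options."""
    params = {}
    for part in query.split("&"):
        if part:
            key, _, value = part.partition("=")
            params[key] = value
    options = {}
    for option in params.get("int", "").split(";"):
        if option:
            name, _, value = option.partition(":")
            options[name] = value
    return params.get("ext"), options
```

The query is split by hand rather than with a library parser so that the semicolons inside the option list are never mistaken for parameter separators.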

Ontology repair in Physics

Thursday, February 21st, 2008

I am just sitting in the CIAO workshop, and Alan Bundy and Michael Chan are talking about a very nice topic: the evolution of ontologies in physics. They are applying this to historical examples like the latent heat problem and to the MOND theory that is hot in physics at the moment. The idea is that when experiments contradict theory, there is a clash between the theory ontology Ot and the sensory ontology Os, which they resolve by renaming apart selected concepts between the two ontologies. So they change the ontologies by renaming. The nice thing is that they can interpret the renaming operation as a conservative theory extension, which gives a nice interpretation of minimal theory change/repair.

You can find the details here.

Even though I totally buy into their observations, I think that it would be better to keep the theories as they are and interpret the repair operations as theory morphisms. That would be a non-destructive operation, and the repairs would become very natural theory morphisms.
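Here is a toy Python sketch of what I mean by non-destructive. It is not Bundy and Chan's system: the encoding of axioms as (sign, predicate, constant) triples, the symbols `P`, `c`, `c_s`, and both helper functions are invented for illustration. The renaming is a symbol mapping that is applied along the way to the combined theory, while the original ontologies stay untouched.

```python
def contradictory(axioms):
    """True if some literal occurs both positively and negatively."""
    return any((not sign, pred, const) in axioms for sign, pred, const in axioms)


def rename(theory, morphism):
    """Translate a theory along a renaming of constants, returning a new theory."""
    return frozenset(
        (sign, pred, morphism.get(const, const)) for sign, pred, const in theory
    )


# Hypothetical clash: the theory says P(c), the observations say not P(c).
theory_ot = frozenset({(True, "P", "c")})
sensory_os = frozenset({(False, "P", "c")})

# Renaming apart: the sensory ontology's c becomes c_s in the combined theory.
morphism = {"c": "c_s"}
repaired = theory_ot | rename(sensory_os, morphism)
```

The naive union `theory_ot | sensory_os` is contradictory, the repaired union is not, and `sensory_os` itself is exactly what it was before.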

Success Rates in the arXMLiv project

Sunday, December 9th, 2007

I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…

We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have over 50% “success rate”. I have been asked by Aaron Krowne what success rate means and when we are going to reach 100%. Here is the story.

First I would like to briefly talk about what we are doing. We are running Bruce Miller’s LaTeXML converter over the ca. 370,000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). With 370,000 LaTeX documents from all over the world, written over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.

The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergrads is helping with this. Since we are still in development mode, we only download last year’s collection of articles after New Year (about 80,000+ new ones in a couple of weeks).

Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transformation is mathematically correct is another matter: this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.
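In code, this notion of success is very simple. The Python sketch below assumes that the test harness has already boiled each log file down to an error count per document; that input format, the document names, and the function name are hypothetical and do not reflect the actual harness or the LaTeXML log format.

```python
def success_rate(error_counts):
    """Fraction of documents whose conversion reported zero errors.

    `error_counts` maps a document identifier to the number of errors
    found in its conversion log (a hypothetical summary format).
    """
    if not error_counts:
        return 0.0
    successes = sum(1 for count in error_counts.values() if count == 0)
    return successes / len(error_counts)


# Made-up example: two clean conversions out of four documents.
sample = {"paper-a": 0, "paper-b": 2, "paper-c": 0, "paper-d": 1}
rate = success_rate(sample)
```

Note that a document counts as a success here even if the output is mathematically wrong, which is exactly the limitation described above.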

 

And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be successfully processed by LaTeX, since arXiv does not have a viable package management system.

Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca. 80,000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10% to 50% when we started on a big new category (we had been at about 60% before).
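The drop is just pooling at work: the overall rate is the document-weighted average over the categories. The Python sketch below uses invented counts, chosen only so that the pooled rate moves from 60% to 50%; they are not the real arXMLiv figures.

```python
def pooled_rate(categories):
    """Overall success rate from (successes, documents) pairs per category."""
    successes = sum(s for s, _ in categories)
    documents = sum(n for _, n in categories)
    return successes / documents if documents else 0.0


# Invented numbers for illustration only.
before = [(60000, 100000)]          # existing categories at 60%
new_category = (15000, 50000)       # a big new category at 30%
after = before + [new_category]

rate_before = pooled_rate(before)
rate_after = pooled_rate(after)
```

So a large category with many unsupported packages can pull the headline number down even though nothing got worse for the documents that were already converting.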

My personal suspicion is that we will reach 70% in the next three months, then the going will become slower, and I am not sure how far we will realistically get beyond 80% any time soon with our resources (a couple of undergrads). To get beyond that, we would have to take the project global, which I would not mind, but which I do not necessarily see as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.

SCOOP 2

Thursday, August 30th, 2007

This is about the talk by German Nemirovskij (Fachhochschule Albstadt-Sigmaringen) on Semantic Document Annotation for Global Search on Study Programs (e.g. for a semester abroad). In the SWAPS project they are looking at the Bologna Module descriptions of the European Documents; they have a similar structure, so they can be screen-scraped into a database. Applications: module search, personalized search, and comparison. For CoPs the second is interesting, since (German claims) CoPs can be used to tweak this.

They want to reach semantic search by doing “search wrt. Ontologies”.

Problems:

  1. how to populate ontologies (is this only ABox population?)
  2. how to index Ontologies for search

ad 1: From the layout they scrape attribute-value pairs, followed by semantic annotation of document fragments (the values of attributes). It seems that this is ABox population, and it is still quite at the beginning.
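As I understand it, the population step amounts to something like the following Python sketch. Everything in it is hypothetical: the attribute labels, the property names, the individual `module42`, and the function itself are mine, not taken from the SWAPS project. Scraped attribute-value pairs are turned into assertions about one individual, and labels without a known ontology property are set aside.

```python
def abox_from_pairs(individual, pairs, property_map):
    """Turn scraped attribute-value pairs into (subject, property, value) assertions.

    Returns the assertions plus the attribute labels that could not be mapped.
    """
    assertions = []
    unmapped = []
    for attribute, value in pairs:
        prop = property_map.get(attribute.strip().lower())
        if prop is None:
            unmapped.append(attribute)
        else:
            assertions.append((individual, prop, value.strip()))
    return assertions, unmapped


# Hypothetical mapping from layout labels to ontology properties.
PROPERTY_MAP = {"ects credits": "hasCredits", "language": "taughtIn"}
scraped = [("ECTS Credits", " 6 "), ("Language", "German"), ("Room", "B 12")]
facts, leftovers = abox_from_pairs("module42", scraped, PROPERTY_MAP)
```

No new classes or properties are created in this step, which is why it looks like pure ABox population to me.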

SCOOP workshop (Communities of Practice)

Thursday, August 30th, 2007

I am sitting in the SCOOP workshop of the JEM Network, which is really shaping up nicely: we have the MKM people meeting with education and social software guys. I will blog a couple of impressions from the KWARC angle.

The discussion is quite stimulating.

Ralf Klamma (RWTH Aachen) gives an intro to Community Information Systems and claims that the constitutive features of CoPs are:

Mutual Engagement (ME): “You have to know which community you are belonging to”; I am a little sceptical whether this is really true for CoPs in Science which are very distributed, and may even be disconnected.

Joint Enterprise (JE): There is something you want to do together, and you want to learn to do it better. This is at the center of Wenger’s definition of CoPs. We have been neglecting this in our KWARC models here, or taking it for granted. We need to think more about this.

Joint Resources (JR): This is really where our MKM paper sits, and I have the feeling that we have something to bring to the table here. Klamma is interested in multi-medial theories. I must say that with the OMDoc approach, we are interested in an omni-medial approach (OMDoc as an omnipresent semantic medium that covers all). The idea here is that content markup allows us to generate multiple medial representations from one source, and any medium can be marked up in OMDoc. So maybe this is compatible.

Klamma also talks about a cross-media theory of transcription that sounds interesting (Jäger, Stanitzek: Transkribieren. Medien/Lektüre, 2002). The gist of it seems to be that events (e.g. historical ones) and objects are transcribed across media (e.g. to OMDoc or SciML). So we only have access to the media trace, not to the event itself (it is long gone). I wonder what this theory predicts; it seems compatible with what we are doing.

A great example: The Babylonian Talmud has been transcribed to an XML markup, where you can annotate relations. Then the text can be accessed as semantic hypertext. One effect was that Talmud students were asking tougher questions earlier. That is very encouraging. I wonder if the sources are available for this, how an OMDoc version of the Talmud would fare, and how much of the structure could be transferred into the CD-based structure we claim to be so essential.

Grokking Conflicts in Management of Change

Friday, August 17th, 2007

We have been working on a semantics-based management of change system (sCMS) for a while (see our locutor project at KWARC).

This project builds on three main intuitions. In a nutshell: if we know more about the semantics of a document type,

  1. then we know what the meaning-atomic document fragments are (that have an explicit contribution and dependency on the document context, we call them “information items” or “infom”s) [infom]
  2. then we can determine less intrusive differences (some differences don’t really change the document) [mDiff]
  3. then we know when changes in document (fragment) A will affect document (fragment) B, even if A and B are far apart. [long-range effects]

Today I am thinking some more about 2. and 3. The main application of this is the notion of conflicts in change management, which I had not fully grokked (at least to my own satisfaction) in the past. Here goes my new-found understanding.

  1. conflicts are about focus (maybe the word focus is not ideal, but I will use it for now).
  2. you focus on an infom, if you write or change it, or if you explicitly set a focus on it.
  3. If there is a long-range effect on an infom A from an infom B and B changes, then there is a conflict from A to B (interestingly, conflicts are directed, I claim).

Now, let us see whether this concept is enough to understand Subversion (SVN), our paradigmatic CMS. For lack of anything else, SVN considers lines to be infoms, and does not have long-range effects, but only line/infom-local effects: a change affects its whole line/infom. Furthermore, focus is set to an infom exactly when it is changed. As a consequence, we have a conflict exactly when there are two changes to a line (one from the update and one in the local copy).
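The line-based model from this paragraph fits into a few lines of Python. This is a sketch of my simplified reading, not of SVN's actual merge algorithm: it assumes all three versions have the same number of lines, and the function names are made up.

```python
def changed_lines(base, revised):
    """Indices of lines that differ between two equally long versions."""
    return {i for i, (old, new) in enumerate(zip(base, revised)) if old != new}


def line_conflicts(base, local, incoming):
    """Lines changed both in the local copy and by the incoming update."""
    return sorted(changed_lines(base, local) & changed_lines(base, incoming))


base = ["Definition", "Theorem", "Proof"]
local = ["Definition", "Theorem (revised)", "Proof"]
incoming = ["Definition", "Theorem (extended)", "Proof (shortened)"]
clashes = line_conflicts(base, local, incoming)
```

Only the second line (index 1) is reported, since it is the only one touched on both sides; the incoming change to the third line has no effect on anything else in this model.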

For a more semantic document type like OMDoc we have non-trivial infoms (statements and paragraphs usually) and long-range effects given by the dependency relation (e.g. a definition depends on all the concepts in the definiens, or a theorem depends on its proof, which in turn depends on all theorems it uses). If we assume a focus on everything we have ever written, then we come to a very interesting notion of conflict: if A depends on B, B changes, and I focus on both, then I get a conflict from A to B, and will be notified by the sCMS.

As I said, I am not sure that focus is exactly the right concept; we might have to think of “read focus” and “write focus” to account for the directionality of conflicts. But I am pretty sure that I understand more about CMS now. I have not really checked the literature; if this is all well-known, then please tell me.
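Here is a small Python sketch of this notion of conflict, under my own assumptions: infoms are plain strings, the dependency relation is a dict from an infom to the infoms it depends on directly, and long-range effects are taken to be the transitive closure of that relation. None of the names come from the locutor code.

```python
def dependencies_of(infom, depends_on):
    """All infoms reachable from `infom` via the dependency relation."""
    seen = set()
    stack = [infom]
    while stack:
        current = stack.pop()
        for target in depends_on.get(current, ()):
            if target not in seen:  # also guards against cyclic dependencies
                seen.add(target)
                stack.append(target)
    return seen


def conflicts(depends_on, focus, changed):
    """Directed conflicts (A, B): A depends on B, B changed, both are in focus."""
    return sorted(
        (a, b)
        for a in focus
        for b in dependencies_of(a, depends_on)
        if b in changed and b in focus
    )


# Hypothetical document: a theorem depends on its proof, which uses a lemma.
depends_on = {"theorem": ["proof"], "proof": ["lemma"]}
focus = {"theorem", "proof", "lemma"}  # everything I have ever written
found = conflicts(depends_on, focus, changed={"lemma"})
```

Changing the lemma yields conflicts from the proof and from the theorem to the lemma, but none in the other direction, which is the directedness I am claiming.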