Microdata vs. RDFa – What does it mean to us?

Only today I became aware of microdata, the proposed way of embedding semantic annotations into HTML5. (Yes, they adopted the syntax that Michael also prefers for OMDoc, and which I personally hate, but I will get used to it.) Microdata are not to be confused with microformats, a poor man’s way of annotation that (ab)uses CSS classes and thus is compatible with HTML 4. Microdata are something like RDFa but

  1. are slightly easier to use for people who don’t understand XML namespaces
    • granted, RDFa’s excessive reliance on XML namespaces makes it hard to parse, and makes it unbearably complex to copy/paste a fragment, which is an important use case for HTML5
  2. allow for ad hoc pseudo-semantic markup when you do not use an ontology
    • What’s the point in annotating at all, then?
  3. are compatible with the non-XML syntax of HTML5 (which should have been ditched IMHO, but, well, in the interest of reactionary users and software, they decided differently)

The fight for the future of RDFa in HTML is going on, but what does that mean to KWARC? We have incorporated RDFa into OMDoc as a means of extending the metadata vocabularies. RDFa, originally designed for XHTML, is prepared for being integrated into any XML language, including OMDoc. HTML5 microdata are an integral part of the HTML5 specification and would not work in other XML languages. OK, but we present OMDoc documents as HTML to make them human-readable. In this output, we want to preserve the semantics of the OMDoc markup, and for that we had always been thinking about using RDFa. (We know exactly how to do it, but have not yet implemented that step.) We could use HTML5 microdata instead, but:

  1. RDFa has little software support so far, but microdata have none (beyond proofs of concept)
  2. We generate XML-compliant HTML. The non-XML syntax of HTML5 supports embedded MathML, but I doubt that it will support parallel OpenMath markup, where elements from yet another namespace are embedded into the MathML formulae.
  3. We generate HTML. The embedded annotations need not be authored manually, so they do not have to be easy to author.
  4. We are interested in using well-defined ontologies to express semantics, so we don’t need ad hoc “semantic” markup.

What do you think?

6 Responses to “Microdata vs. RDFa – What does it mean to us?”

  1. kohlhase says:

    Hmmm, interesting. Hixie is at it again, suggesting new syntax (that he thinks simpler) for something that involves namespaces.

    And as always it is unclear what will become of it. But all in all (like HTML5), I think that this is a useful experiment. Even as a friend of namespaces, I never quite liked the RDFa syntax anyway, since it uses namespace prefixes inside attributes. That creates some management problems, since it is not really supported by XML tools and applications (e.g. in sTeX we have to generate some xxx:dummy attribute so that the namespace xxx of an ontology is generated at all). So I view the microdata proposal with some sympathy.

    All of this, of course, only IF microdata can express all that RDFa can; the Turtle example in the spec seems to suggest that, but I am not sure.

    But I think with OMDoc we are in a comfortable position.

    On the representation side, we can use the RDFa-based syntax we propose in OMDoc1.6 (I like HTML5 without the space indeed) and if microdata wins out over RDFa in the marketplace change to that syntax in OMDoc1.x. So there is nothing really at stake here except our implementations.

    On the generation side things are equally relaxed. We can generate microdata from our RDFa or copy our RDFa or even generate both.

    In particular, I cannot see that there is anything XML-incompatible with HTML5 or microdata. HTML5 just adds a non-XML syntax to XHTML5 (it has been made clear by the HTML5 crowd that you can see HTML5 as a convenient? input syntax for XHTML5, which is the data model beneath it, see e.g. http://html5doctor.com/html-5-xml-xhtml-5/). So in a way HTML5 does something very similar to what we have in mind with strict/pragmatic OMDoc in OMDoc2, only they restrict it to the syntax. I view the non-XML syntax of HTML5 as “Pragmatic HTML” (easier to write) which is translated by the syntax rules in the HTML5 spec to a DOM (that is equivalent to XHTML5) i.e. “strict HTML”. We do the same: strict/pragmatic OMDoc, only that our translation is much more far-reaching.

    Finally, in the light of this, a response for your last point, we are indeed interested in an “ad hoc markup” syntax for pragmatic OMDoc which can be used for convenience. Only that we would view it as a pragmatic syntax for a well-defined ontology-based strict syntax.

  2. ako says:

    I only want to comment on the point labelled ‘4’ (“We are interested in using well-defined ontologies to express semantics, so we don’t need ad hoc “semantic” markup.”) and Michael’s statement concerning this.

    I strongly believe that “ad hoc markup” is very important for an ever-growing acceptance of semantic technology. Why? The standard argument is convenience, mostly hinting at the laziness of users. But that is not all. Convenience also refers to the necessity of dealing with existing documents in all formats. I want to stress it again: necessity! As researchers we sometimes think we invent the world, so we can start at point zero. We all know that this is a false assumption for people living in the world.
    Look at our SAMSDocs project at DFKI: In a nutshell, we have strongly interrelated project documents that need to be converted to OMDoc files in order to set up a change management system (DocTip) potentially as a project management system. It was set up as a use case, but now that the SAMS project has been closed for 10 days, ideas are starting to float around to continue the project in this or that direction, reusing its ‘verified’ base, only that the verification depends on working change management facilities. This sounds rather typical, don’t you think? Michael’s favorite example for the demand for semantic technology, the Apollo-rocket disaster, fits in as well. You don’t start out with semantics, but you realize you need it at some point.
    Back to the SAMSDocs project and my experiences there. First, I tried very hard to convert their TeX documents into standard OMDoc. Problem after problem there, because sTeX is fine, but only for authoring a semantic document (thereby already brainwashing the author towards conceptual thinking). What I found was a set of documents whose authors thought they were well-structured, with many relations, and all in all ‘semantic’. Mmh. Semantic yes, but in an unordered, unorthodox, narrative way.
    Then, I realized that there really is no need for standard OMDoc. With Christoph and Michael’s RDFa-extension (the resource/link elements) I’m now writing distinct system ontologies in OMDoc, e.g. one for the project’s organizational relations and one for an easy-going, ‘light’ object-approach in OMDoc. In the end, this will help spread the use of the OMDoc format.
    My point is that usability is NOT equal to user-friendliness, it is much more. So I would say: “We are interested in using well-defined ontologies to express semantics, so we DO need SEMANTIC AD HOC MARKUP in order to get it to work!”.

  3. Christoph says:

    Hi, let me reply to some of your points:

    On Microdata vs. RDFa: Microdata cannot express two things that RDFa supports: datatypes of literals, and XML literals (see http://www.jenitennison.com/blog/node/103). These are probably not the most important features we need for OMDoc, but generally we’d like OMDoc to support more rather than fewer features. Wait, XML literals are probably important. In a sense, in parallel markup you also annotate stuff with XML literals. We do not want annotations to be just strings.

    Also, certain facts require more verbose markup in Microdata, compared to RDFa.

    On HTML5 vs. XML: I might have expressed it wrongly. HTML5 is not incompatible with XML. However, the non-XML syntax of HTML5 does not support namespaces. That is why RDFa cannot directly be incorporated into HTML5, and that is the root of the Microdata vs. RDFa conflict. Non-support of namespaces in some XML tools and in sTeX is IMHO rather to blame on these tools than on XML.

    On pragmatic vs. strict: I agree that we could easily switch to Microdata (both on the input side of pragmatic OMDoc metadata, and on the output side of presentational XHTML+MathML) if it turns out that they become more successful than RDFa. My position of not switching was rather in the sense of not dropping the RDFa support that we already have, but instead waiting.

    I agree with both of you that we need pragmatic markup. However, we should be careful in how much of Microdata to adopt for pragmatic OMDoc. Microdata, being almost as expressive as RDFa, is IMHO still too strict to be pragmatic ;-) So I think that pragmatic OMDoc should be much more pragmatic than Microdata (and thus: much easier to use, much less expressive, but covering most of what you need on a daily basis).

    On ad hoc markup: I am not sure about that. Ad hoc markup is not the same as pragmatic content markup. Pragmatic content markup has a well-defined semantics, by way of translation to strict content markup. Ad hoc markup can be quite un-semantic, just think of <omtext type="foobar">. What does it mean? If it is not documented anywhere, it does not mean anything. My position here is that we should only allow @type values for which we have specified a pragmatic→strict translation (i.e. basically the OMDoc 1.2 ones), or that we should allow authors to extend the vocabulary, provided that they specify such a translation for their extensions. I think the design of that can be guided by formula markup: We don’t allow ad hoc math content markup either. Either we use presentation markup (i.e. PMML embedded into OMDoc, which is possible), or we use content markup, and then we have to say what symbol from what CD we are using.

    Or am I wrong? I just realize that we could write <OMA><OMV name="some-function"/> … </OMA>, i.e. use an ad hoc placeholder for something we can/want/need not yet define as a proper symbol.

    On dealing with existing documents: When we convert legacy documents to OMDoc, we can easily map ad hoc annotations to URIs (granted, not really “semantic” ones) by generating new namespaces, e.g. http://URI-of-legacy-document/annotation#. This process is automated, so it wouldn’t bother human authors.
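
    A minimal sketch of such an automated mapping, in Python. The function name legacy_annotation_uri and the percent-encoding of annotation names are assumptions made for illustration; they are not part of any existing OMDoc tool:

```python
from urllib.parse import quote


def legacy_annotation_uri(document_uri: str, annotation: str) -> str:
    """Mint a URI for an ad hoc annotation name found in a legacy document.

    Hypothetical helper: the namespace is derived from the document's own
    URI by appending "/annotation#", following the pattern in the text.
    """
    namespace = document_uri.rstrip("/") + "/annotation#"
    # Percent-encode anything that is not safe in a URI fragment
    # (an assumption; the text does not say how odd names are handled).
    return namespace + quote(annotation.strip(), safe="")


print(legacy_annotation_uri("http://URI-of-legacy-document", "foobar"))
```

    The resulting URIs are stable and unique per document, but they still carry no semantics of their own; that has to be added later.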

    However, the SAMS use case may be different. Andrea, I like your idea of “realizing at some point that you need semantics” and the reference to brainwashing. But then you probably only start coding OMDoc after you realized that. Or should we also design an OMDoc-based workflow that allows you to start in a completely ad hoc way?

    Can you elaborate on the idea that “we need ad hoc markup in order to create well-defined ontologies”? You said that you are using RDFa to create ontologies within your OMDoc document. Indeed these ontologies are ad hoc from the point of view that you do not start by first defining symbols in OMDoc, but that you simply refer to names in your RDFa attributes. But compared to what you can do with Microdata, it is less ad hoc already, in that you commit to well-defined URIs for your annotation properties. THIS is IMHO what prepares you to complete the formal ontologies afterwards, as you “only” have to add semantics to those URIs. But as I see it, it differs from what I’d call ad hoc annotation, e.g. <meta property="foobar">value</meta>, where there is no way of identifying “foobar” in a scalable/maintainable way. (OK, you could change it to a URI in a later version of the document, you see, the case for RDFa is not as easy as I thought.)

  4. ako says:

    Hi and good morning to everyone.

    Before Michael left this morning, we started a discussion about what ‘ad-hoc markup’ is.
    On the one hand, we can define every markup that has no explicit documentation as ‘ad-hoc markup’. I believe Michael and Christoph used it this way and associated potentially ‘senseless’ or ‘meaningless’ with it.
    On the other hand, we can think of ‘ad-hoc markup’ as SEMANTIFICATION, i.e., as a process of adding semantics by-and-by.
    I argue for the second understanding, because if we have something like ‘type=argument’, it is meaningful, but it may not be defined (yet). Even ‘type=foobar’ is meaningful, I assume it refers to ‘type=later-to-come’ or ‘type=uninteresting/irrelevant (at least for the moment)’. Again, I’m on the side of the people: implicit meaning is meaning, but machines cannot understand it in this form. Our goal here would be the same, namely to get machines to work with the content. Many people find it hard to teach machines meaning, because they don’t ‘think’ like them. Here, support is needed, and therefore, yes, I believe an OMDoc-based workflow starting in a completely ad-hoc way would be wonderful for many.
    In short, the argument is: ad-hoc (almost) never means meaningless.

    You asked me to elaborate on “we need ad hoc markup in order to create well-defined ontologies”. I would like to take the SAMSDocs project as an example. When I was introduced to the new task, I was presented a white paper called “Die semantische Struktur der Dokumente im SAMS-Projekt” containing lots of intersection points between all SAMS documents. BUT: it didn’t say “is a copy of”, or “refines”, or “implements”; it said “is the definition of” resp. “is concerned with” resp. “is specified by”. Please note that the terms correspond in the order of their listing. This is important, because the so-called ontology given in the white paper was completely misleading. In particular, the defining occurrence of an object happened to be in a different document, so all the work of annotating it as definition in the first document was superfluous. The white paper is still meaningful, but you cannot start the ontology creation with it. The ontology emerges at some point. As I mentioned before, now I can distinguish several ontologies, which explain distinct views of relations between objects. By the way, the change management itself should act differently depending on which view is taken.

    Another point with ad-hoc markup: If I annotate a term with a “light stex” extension as \begin{moreDefinitionFor}…\end{…} instead of <omdoc:link rel=moreDefinitionFor ….., then the first feels more ad-hoc than the second. Probably, you will say now that it is just more user-friendly as it is an abbreviation. But I’m not so sure as it is not only much easier to write, it is also much easier to change — thereby losing its serious character. So, there may be ad-hoc levels?

  5. Christoph says:

    Hi Andrea, indeed I tend to associate the term “ad hoc markup” to “meaningless” markup, i.e. not having any documentation, neither formal nor informal.
    However, your further elaboration makes me realize that the thing I objected to was again something different. I objected to TECHNICALLY useless ad hoc markup, and my objections were against the SYNTAX, not against the nature of the markup. type="foobar" is meaningless (and will always be), as “foobar” is in no namespace. When defining a pragmatic→strict mapping (which is still to be done), we were planning to define that words in no namespace should denote URIs in some special OMDoc namespace, which is legal in RDFa, so that we get proper URIs for OMDoc-1.2-compatible syntaxes like <omtext type="introduction">. Then, however, any document author out there should not be able to abuse that OMDoc namespace, because we are the only ones who should be allowed to introduce new terms there.
    So let’s combine what you call semantification (which I like and which is IMHO very much in the OMDoc spirit) with what I call syntactically meaningful markup. An author who wants to introduce new terms ad hoc would have to start by defining some namespace prefix (preferably assisted by software), e.g. myns="http://my.home.page/terms#". Then, (s)he can write things like <meta property="myns:later-to-come">, or why not some actual term <meta property="myns:interestingness">, but only later give these terms a proper meaning by starting to author the “myns” ontology.
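
    The prefix mechanism described here can be sketched in a few lines of Python. The function name expand_term and the choice to reject unprefixed terms outright are assumptions for illustration; how RDFa or OMDoc processors actually resolve prefixes is not modelled:

```python
def expand_term(term: str, prefixes: dict) -> str:
    """Expand a prefixed term such as 'myns:interestingness' to a full URI.

    Hypothetical helper: `prefixes` maps author-declared prefixes to
    namespace URIs. Terms without a declared prefix are rejected, so that
    every accepted term ends up with a proper URI.
    """
    prefix, separator, local_name = term.partition(":")
    if not separator:
        raise ValueError(f"term {term!r} has no namespace prefix")
    if prefix not in prefixes:
        raise ValueError(f"undeclared prefix {prefix!r}")
    return prefixes[prefix] + local_name


prefixes = {"myns": "http://my.home.page/terms#"}
print(expand_term("myns:later-to-come", prefixes))
```

    A bare "foobar" raises an error in this sketch, whereas "myns:later-to-come" already has a URI that the “myns” ontology can describe later.
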
    From this point of view, the decision between RDFa and Microdata is again back at its core: enforce namespaces or not? If we had not RDFa but Microdata-like metadata in OMDoc, we would be able to do without namespaces. But IMHO that would promote syntactic meaninglessness. With Microdata, we can use “foobar” as an ad hoc attribute value (e.g. for a metadata property), but not properly give it a meaning afterwards, as it belongs (by the Microdata→RDF translation specification) to the “XHTML vocabulary” namespace that we cannot control.
    Thanks for describing the “emerging ontology” use case. I agree that that is exactly what we should aim at supporting.
    Concerning sTeX vs. OMDoc: For frequently-used annotations, we may introduce pragmatic OMDoc elements that do not look like RDFa, or we already have them in OMDoc 1.2. Then, it is IMHO really only a matter of taste whether you prefer curly over angle brackets, or whether your favorite editor supports TeX better than XML. Still I agree that sTeX has the potential of being one ad hoc level “lower” (= closer to the user) even than pragmatic OMDoc, as it is easily extensible by macros, whereas an XML language is not.

  6. Christoph says:

    An interesting paper on that topic, which summarizes the different standards from an e-learning point of view: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-506/tomberg.pdf
