Archive for the ‘kohlhase’ Category

SCOOP 2

Thursday, August 30th, 2007

This is about the talk by German Nemirovskij (Fachhochschule Albstadt-Sigmaringen) on Semantic Document Annotation for Global Search on Study Programs (e.g. for a semester abroad). In the SWAPS project they are looking at the Bologna module descriptions in the European documents; these have a similar structure, so they can be screen-scraped into a database. Applications: module search, personalized search, and comparison. For CoPs the second is interesting, since (German claims) CoPs can be used to tweak it.

They want to reach semantic search by doing “search wrt. Ontologies”.

Problems:

  1. how to populate Ontologies (is this only ABox population?)
  2. how to index Ontologies for search

ad 1: scrape attribute-value pairs from the layout, then semantically annotate document fragments (the values of the attributes). It seems that this is ABox population, and that the work is still quite at the beginning.

SCOOP workshop (Communities of Practice)

Thursday, August 30th, 2007

I am sitting in the SCOOP workshop of the JEM Network, which is really shaping up nicely: we have the MKM people meeting with education and social software guys. I will blog a couple of impressions from the KWARC angle.

The discussion is quite stimulating.

Ralf Klamma (RWTH Aachen) gives an intro to Community Information Systems and claims that the constitutive features of CoPs are:

Mutual Engagement (ME): “You have to know which community you are belonging to”; I am a little sceptical whether this is really true for CoPs in Science which are very distributed, and may even be disconnected.

Joint Enterprise (JE): There is something you want to do together, and you want to learn to do it better. This is at the center of Wenger’s CoP definition. We have been neglecting this in our KWARC models here, or taking it for granted. We need to think more about this.

Joint Resources (JR): This is really where our MKM paper sits, and I have the feeling that we have something to bring to the table here. Klamma is interested in Multi-medial theories. I must say that with the OMDoc approach, we are interested in an Omni-Medial approach (OMDoc as an omnipresent semantic medium that covers all). The idea here is that content markup allows us to generate multiple medial representations from this source, and any medium can be marked up in OMDoc. So maybe this is compatible.

Klamma also talks about a cross-media theory of transcription that sounds interesting (Jäger, Stanitzek: Transkribieren – Medien/Lektüre, 2002). The gist of it seems to be that events (e.g. historical) and objects are transcribed across media (e.g. to OMDoc or SciML). So we only have access to the media trace, not the event itself (it is long gone). I wonder what this theory predicts; it seems compatible with what we are doing.

A great example: the Babylonian Talmud has been transcribed to an XML markup, where you can annotate relations. Then the text can be accessed as semantic hypertext. One effect was that Talmud students were asking tougher questions earlier. That is very encouraging. I wonder if the sources are available for this, how an OMDoc version of the Talmud would fare, and how much of the structure could be transferred into the CD-based structure we claim to be so essential.

Disambiguation of Mathematical Text

Friday, August 17th, 2007

Oooops, this is a left-over draft from MKM

…..

Claudio Sacerdoti Coen (HELM group in Bologna) is talking about disambiguation. It seems that he has really nailed down most of the practical aspects of the problem.

When types do not help, then we have to ask the user, and he is trying to do this with the least nuisance. He defines the notion of a spurious error (most errors are spurious), as opposed to the “real errors”. As always it is great to hear him talk. I wonder what information he needs for the algorithm; is what we have in the new OMDoc presentation system enough?

He even has a correctness proof. I want a demo.

Grokking Conflicts in Management of Change

Friday, August 17th, 2007

We have been working on semantics-based management-of-change systems (sCMS) for a while (see our locutor project at KWARC).

This project builds on three main intuitions. In a nutshell: if we know more about the semantics of a document type,

  1. then we know what the meaning-atomic document fragments are (that have an explicit contribution and dependency on the document context, we call them “information items” or “infom”s) [infom]
  2. then we can determine less intrusive differences (some differences don’t really change the document) [mDiff]
  3. then we know when changes in document (fragment) A will affect document (fragment) B, even if A and B are far apart. [long-range effects]

Today I am thinking some more about 2. and 3. The main application of this is the notion of conflicts in change management, which I had not fully grokked (at least to my own satisfaction) in the past. Here goes my new-found understanding.

  1. conflicts are about focus (maybe the word focus is not ideal, but I will use it for now).
  2. you focus on an infom, if you write or change it, or if you explicitly set a focus on it.
  3. If there is a long-range effect on an infom A from an infom B and B changes, then there is a conflict from A to B (interestingly conflicts are directed, I claim).

Now, let us see whether this concept is enough to understand Subversion (SVN), our paradigmatic CMS. For lack of anything better, SVN considers lines to be infoms, and does not have long-range effects, but only line/infom-local effects: a change affects its whole line/infom. Furthermore, focus is set to an infom exactly when it is changed. As a consequence, we have a conflict exactly when there are two changes to a line (one from the update and one in the local copy).

For a more semantic document type like OMDoc we have non-trivial infoms (statements and paragraphs usually) and long-range effects given by dependency relations (e.g. a definition depends on all the concepts in the definiens, or a theorem depends on its proof, which depends on all theorems it uses in turn). If we assume a focus on everything we have ever written, then we come to a very interesting notion of conflict. If A depends on B, which changes, and I focus on both, then I get a conflict from A to B, and will be notified by the sCMS.

As I said, I am not sure that focus is exactly the right concept; we might have to think of “read focus” and “write focus” to account for the directionality of conflicts. But I am pretty sure that I understand more about CMS now. I have not really checked the literature; if this is all well-known, then please tell me.
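The notion of conflict described in this post can be written down in a few lines. What follows is only a minimal sketch under my reading of the text: infoms are plain strings, the names `affected_by` and `conflicts` are made up for illustration, and an effect is assumed to reach the changed infom itself plus everything that depends on it, directly or transitively. A conflict from A to B is reported when B changes, B's change reaches A, and both are in focus.

```python
def affected_by(changed, depends_on):
    """Return the changed infom plus every infom that depends on it.

    depends_on maps an infom to the set of infoms it depends on directly;
    dependents are followed transitively (the long-range effects).
    """
    affected = {changed}
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for infom, deps in depends_on.items():
            if current in deps and infom not in affected:
                affected.add(infom)
                frontier.append(infom)
    return affected


def conflicts(changed, depends_on, focus):
    """Return directed conflicts (A, B) for a change to infom B = changed.

    Assumption of this sketch: a conflict needs focus on both A and B.
    """
    if changed not in focus:
        return set()
    return {(a, changed) for a in affected_by(changed, depends_on) if a in focus}


# OMDoc-like case: a theorem depends on its proof, which uses a lemma.
dependencies = {"theorem": {"proof"}, "proof": {"lemma"}}
print(sorted(conflicts("lemma", dependencies, {"theorem", "lemma"})))

# SVN-like case: lines are infoms, there are no dependencies, and
# focus is set on exactly the lines changed in the local copy.
print(sorted(conflicts("line 12", {}, {"line 12"})))
```

With an empty dependency relation the model degenerates to the SVN behaviour: the only possible conflict is from a line to itself. A split into read focus and write focus would replace the single `focus` set by two sets.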

A radical new referencing scheme for OpenMath and MathML (and OMDoc)

Monday, July 2nd, 2007

We are thinking about how to reference theory-constitutive elements in content dictionaries. We had distinguished “reference by location” (via usual URIs) and “reference by context” (via the OMDoc theories and their constitutive elements) in OMDoc 1.1. It was very hard to explain the latter, and the encoding was a little weird, so I dropped it again from OMDoc 1.2. But the concept is valid and important, so here we go again.

This topic is important, since we are thinking about OpenMath3 and we are adding CDs in MathML3. And I guess that it will be quite a while until we can change these two again, so we had better get it right. Moreover, the referencing scheme had better be compatible with those two.

Here is the idea: we have nested theories in OMDoc1.2, and we need to reference symbols from them. Now, symbols are referenced by their name, which need not be document-unique (and we do not want to require that, since we want to compose theories in documents). That is why their names have three components: the theory name (cd name in OM), which is document-unique at least in OMDoc; the symbol name, which is theory/cd-unique; and, to disambiguate, a URI for the cd in the cdbase attribute.

We would like to generalize this in OMDoc1.8: theory names should only be unique in their context (which might be the document context or a theory). So far so good, but then we need a path-like referencing scheme, at least for the cd names. So we can really combine them in one path/URI, as described in a post on MathML/OM referencing.

The next step in OMDoc would be to allow any content element to be theory-like, and allow it to import. Here is a somewhat extreme example of what we would be able to do.

<!-- all statements are theories, so this is also one -->
<symbol name="nat"/>

<!-- this symbol declaration imports from theory "nat" -->
<symbol name="zero">
  <imports from="nat"/>
  <type><csymbol pref="nat/nat"/></type>
</symbol>

<!-- this one also needs a function type, so we import it -->
<symbol name="suc">
  <imports from="nat"/>
  <imports from="simple-types"/>
  <type>
    <apply>
      <csymbol pref="simple-types/funtype"/>
      <csymbol pref="nat"/>
      <csymbol pref="nat"/>
    </apply>
  </type>
</symbol>

<!-- the third Peano Axiom (1&2 are about types) is only about suc -->
<axiom name="peano3">
  <imports from="suc"/>
  <imports from="quant1"/>
  <bind>
    <csymbol pref="quant1/forall"/>
    <bvar><ci>a</ci><ci>b</ci></bvar>
    <apply>
      <iff/>
      <apply><eq/><ci>a</ci><ci>b</ci></apply>
      <apply><eq/>
        <apply><csymbol pref="suc/suc"/><ci>a</ci></apply>
        <apply><csymbol pref="suc/suc"/><ci>b</ci></apply>
      </apply>
    </apply>
  </bind>
</axiom>

Referencing symbols in OpenMath and MathML

Monday, July 2nd, 2007

We are currently working on an aligned OpenMath/cMathML model for mathematical objects, based on the model for OpenMath objects. This will go into the MathML3 and OpenMath3 specifications due in spring. Afterwards we will not be able to change much for a long time, I expect, so we had better get this one right.

There has been some discussion about the OpenMath referencing triplet: a symbol (OMS in OM) has three attributes: a name, a cd, and a cdbase. E.g. the symbol for addition might be
<OMS cdbase="http://openmath.org/cds" cd="arith1" name="plus"/>

The cdbase and cd attributes determine a content dictionary (in this case the file http://openmath.org/cds/arith1.ocd) and the name attribute a symbol declaration in it (the name of which must be cd-unique).

In MathML3 we want to follow the same general model, but have the definitionURL attribute for specifying meaning. Here we would currently use the URL http://openmath.org/cds/arith1#plus. There was some discussion whether we should have one big CD for MathML or many small ones, … Sam Dooley remarked that if we were to use the OM triplet, then he would like to treat the cd attribute like a cdbase now, which inherits…, then we could write <apply cd="mathml">...<csymbol name="plus"/> ...</apply> (especially if we had one big CD for all MathML, then we could make the cd="mathml" a default on the <math> element…). Frankly, I find this quite attractive (after having thought about it).
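To make the inheritance idea concrete, here is a minimal sketch of how a processor could compute the effective cd of each csymbol. It assumes the proposed (not standardized) behaviour that a cd attribute on any ancestor element acts as a default for the symbols below it; the function name `symbols_with_cd` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def symbols_with_cd(element, inherited_cd=None):
    """Yield (cd, name) for every csymbol below element, in document order.

    A cd attribute on an element overrides the value inherited from its
    ancestors, in the way xml:base does for URIs.
    """
    cd = element.get("cd", inherited_cd)
    if element.tag == "csymbol":
        yield cd, element.get("name")
    for child in element:
        yield from symbols_with_cd(child, cd)


# Hypothetical input: a default cd on <math>, overridden on an inner <apply>.
document = ET.fromstring(
    '<math cd="mathml">'
    '<apply><csymbol name="plus"/><ci>a</ci>'
    '<apply cd="arith1"><csymbol name="times"/><ci>b</ci><ci>c</ci></apply>'
    '</apply>'
    '</math>'
)

for cd, name in symbols_with_cd(document):
    print(cd, name)
```

The recursion passes the inherited value downwards because ElementTree elements do not keep a pointer to their parent.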

I would like to take this idea a little further in MathML3: like MathML2 we use a single URI-type attribute for symbol referencing, let’s call it pref (path ref; just to distinguish it from definitionURL for this post, it could in the end become definitionURL to keep backwards compatibility; after all MathML does not say what kind of URLs definitionURL should be; convenient).

So if we use pref attributes on csymbols and take xml:base into the picture, we can write

<csymbol pref="http://openmath.org/cds/arith1/plus"/>

<math xml:base="http://openmath.org/cds/">.... <csymbol pref="arith1/plus"/>...</math>

and even

<math xml:base="http://openmath.org/cds/">
  <apply xml:base="arith1/">
    <csymbol pref="plus"/>...
  </apply>
</math>

This would make a very simple framework. All the URIs can be used for REST-ful access to the relevant features (symbol declarations in the CDs), and relative URIs work as expected. And if we write content dictionaries in a somewhat atomic way, then we can even supply them on a static web server. It would be quite simple to configure Apache so that it generates the right files; for instance, in the directory …/arith1 we could have ocd.php with the CD skeleton and inclusions for the symbol declarations, which are represented as files in the directory, e.g. arith1/plus.

That would make it quite simple to set up a structure that would make the cds meaningful.
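The way relative pref values resolve against nested xml:base attributes can be checked with the standard URI resolution in Python's urllib. This is only a sketch: `resolve_pref` is a made-up helper, and pref itself is the hypothetical attribute proposed in this post. It also shows one detail worth keeping in mind: an intermediate base such as arith1 needs a trailing slash, otherwise standard resolution replaces that last path segment instead of descending into it.

```python
from urllib.parse import urljoin


def resolve_pref(pref, bases):
    """Resolve a pref value against a chain of xml:base values, outermost first."""
    base = ""
    for xml_base in bases:
        base = urljoin(base, xml_base)
    return urljoin(base, pref)


cds = "http://openmath.org/cds/"

# Absolute pref, no base needed.
print(resolve_pref("http://openmath.org/cds/arith1/plus", []))

# One xml:base on <math>, relative pref on the csymbol.
print(resolve_pref("arith1/plus", [cds]))

# Nested bases: the inner one ends in a slash, so "plus" lands inside arith1.
print(resolve_pref("plus", [cds, "arith1/"]))

# Without the trailing slash the arith1 segment is dropped.
print(resolve_pref("plus", [cds, "arith1"]))
```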

OMDoc Versions

Monday, July 2nd, 2007

I am surprised that it is already almost a year since OMDoc 1.2 appeared, and we have to do something.

We have been thinking about the future of OMDoc version numbering. Currently OMDoc is at 1.2, and we have collected a lot of ideas for 2.0. Some of them have been presented at the MKM 2007 Workshops, and have met with friendly comments. Somehow a direct push towards 2.0 seems scary, and it may be better to use the summer to make a revision 1.3 with these new ideas, bring it out, and move on to the next set of ideas and improvements.

We came up with a new scheme: we will try to have our cake and eat it: we come out with a specification that is somewhat incremental soon (I guess fall) integrating the new ideas, and call it 1.8. That shows that we are on our way to 2.0. This version will be incremental in that we will mostly only add things, and redefine the old syntax in terms of the new, and deprecate some of the old syntax (e.g. the use of xslt in the presentation elements like we do in content MathML3). Then we have one more version (1.9) to accumulate new stuff (e.g. the new MathML3/OM3 syntax), and in 2.0 we will throw out the accumulated deprecated functionality.

That will give us a relatively clean roadmap towards 2.0, I think.

Narrative Structure of Mathematical Text

Saturday, June 30th, 2007

Here we are again at MKM 2007, listening to Krzysztof Retel from the Ultra group at Heriot-Watt, who is talking about the narrative structure of mathematical text. This is very much related to our own MathUI paper.

He proposes to annotate text fragments with names and to annotate the relations between the boxes with RDF triples. Then the “dependency graph” is transformed to the “graph of logical precedences”, changing some directions. The first is used for checking what we call the document ontology, and the second for checking the consistency of the text. I do not see anything that we cannot do in OMDoc.

Q: are there any relations that we do not already have in OMDoc? I think not.
Q: is this more than just a standoff-version of OMDoc in RDF? I think not.

MathLang and OMDoc and Souring and Aggregation

Saturday, June 30th, 2007

I am sitting in Robert Lamar’s (from the Ultra group at Heriot-Watt) talk on MathLang. He has a very ambitious goal: he wants to restore natural language as an input method for mathematics. The idea is that he does a linguistic analysis of the mathematical text (including the formulae), and at every level the “boxes” can be annotated with meaning. (I would guess that he is using a categorial grammar approach for that; in any case, the result is a nicely hierarchical phrase structure, at least for English.) This seems to build on the old Nederpelt & Kamareddine weak type theory, which we have also talked about in a KWARC graduate seminar.

In any case, all he does seems to be at the text level, and does not seem to transcend sentences. So it would really work inside the OMDoc statement level. We could just come up with an XML encoding of the MathLang boxes (do they have one?) and make it an OMDoc module. That would standardize it, would keep it in sync with OMDoc, and would of course give OMDoc much better control over natural language. I wonder how much of this is automatic.

A wonderful concept he is introducing is the concept of “souring” i.e. the inverse of sugaring (i.e. making it palatable to the human). So souring makes things palatable to the computer. We would probably call this preloading. The souring operation is used for analyzing chains of equations, … This seems quite similar to things I have done in sTeX (and was very proud of at the time). I will have to look it up and compare it.

He takes the souring notation to the extreme, so that he can even take aggregation into account, e.g. \forall x,y:A --> \forall x:A \forall y:A. This is really nice to see for a lambda-person like me, quite nifty. Is this really automated? He has souring constructors share, chain, fold, map, position.
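The aggregation example can be mimicked with a small right fold over the variable list. This is only an illustration of the rewriting step, not MathLang's actual machinery: the tuple representation of terms and the name `sour_binder` are my own assumptions.

```python
def sour_binder(quantifier, variables, type_name, body):
    """Expand one aggregated binder into nested single-variable binders.

    Terms are plain tuples (quantifier, variable, type, body); the
    variables are folded from the right so the first one ends up outermost.
    """
    term = body
    for variable in reversed(variables):
        term = (quantifier, variable, type_name, term)
    return term


# \forall x,y:A. P(x,y)  becomes  \forall x:A. \forall y:A. P(x,y)
print(sour_binder("forall", ["x", "y"], "A", "P(x,y)"))
```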

I wonder whether this gives a very strong presentation language for OMDoc; we already have map in our system, so maybe we should look at this. I am quite intrigued.

Lessons from the DLMF search

Saturday, June 30th, 2007

I am sitting in Abdou Youssef’s talk on his search engine for the DLMF. One thing that struck me is that he generates hit fragment descriptions by pre-computing the fragments at indexing time, storing them in a database, and then doing a fragment search. In comparison with MWS, where we compute the fragments at reporting time, he only assembles the hit page from the database, which seems more reliable and of course faster. I think that this should be a standard technique in math search, independent of the search engine.

Of course they have it good, since they generate all their documents from LaTeX and have good control over what is a good fragment. If we are in the general case, this is not true. But we could use some discourse grammar techniques to do the fragment computation.
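The indexing-time idea can be sketched independently of any particular search engine. The toy index below is only an assumption-laden illustration, not how the DLMF search or MWS is implemented: the class `FragmentIndex`, the document identifiers, and the whitespace tokenization are all made up, and the fragments are supplied by the caller, which is exactly the part that is hard in the general case.

```python
from collections import defaultdict


class FragmentIndex:
    """Toy index that stores display-ready fragments at indexing time."""

    def __init__(self):
        self.fragments = {}                # fragment id -> stored display text
        self.postings = defaultdict(set)   # term -> ids of fragments containing it

    def index_document(self, doc_id, fragments):
        """Store each pre-computed fragment and index its terms."""
        for number, text in enumerate(fragments):
            fragment_id = (doc_id, number)
            self.fragments[fragment_id] = text
            for term in text.lower().split():
                self.postings[term].add(fragment_id)

    def search(self, query):
        """Return stored fragments containing all query terms; nothing is recomputed."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(term, set()) for term in terms))
        return [(fragment_id, self.fragments[fragment_id]) for fragment_id in sorted(hits)]


index = FragmentIndex()
index.index_document("doc-a", ["gamma function definition", "gamma function reflection formula"])
index.index_document("doc-b", ["bessel function definition"])

for fragment_id, text in index.search("gamma definition"):
    print(fragment_id, text)
```

At query time the hit page is assembled purely from what was stored, so the cost of deciding what a good fragment is moves entirely to indexing.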