Archive for the ‘MKM’ Category

A MathSearch Competition

Sunday, July 27th, 2008

In my last post I just learned about a new search engine. We should really have a competition and example library for Math Search Engines. We talked about this some years back but we really need to get our act together, probably for the next MKM.

I can see five tasks that we have to accomplish for a competition:

  1. Collect a search corpus. The arXiv seems to be the right place to start from: it is big enough to pick competition examples randomly.
  2. Cooperate on an analysis pipeline and corpora. This would allow people to participate without having a full analysis pipeline of their own.
  3. Collect a corpus of search queries. This may be the biggest hurdle, since we need a gold standard of what we expect the hits to be.
  4. Come up with “divisions”. Not all engines can do the same things, so we should only let comparable engines compete; multiple divisions will also allow for multiple trophies.
  5. Build a competition harness, so that tests can be automated. This will also require, and thus lead to, general search APIs.
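
To make the harness idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: an engine is modelled as a plain function from a query string to a list of document identifiers, the gold standard is a mapping from queries to the expected hits, and precision/recall is just one possible scoring scheme.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interface: an engine maps a query string to a list of document ids.
Engine = Callable[[str], List[str]]


def precision_recall(hits: List[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a hit list against the gold-standard set."""
    found = set(hits)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def run_competition(
    engines: Dict[str, Engine], gold: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Run every engine on every gold-standard query and average the scores."""
    results = {}
    for name, engine in engines.items():
        scores = [precision_recall(engine(q), expected) for q, expected in gold.items()]
        n = len(scores)
        results[name] = {
            "precision": sum(p for p, _ in scores) / n if n else 0.0,
            "recall": sum(r for _, r in scores) / n if n else 0.0,
        }
    return results
```

A division would then simply be one call to `run_competition` with the engines that are comparable and the queries that belong to that division.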

This is all I can think of at the moment, so give me your feedback.

Stephen Watt’s talk on analyzing subject areas by symbol frequencies

Sunday, July 27th, 2008

Stephen Watt is proposing to automate subject classification from the document content. He says we should believe the document more than the classifiers. I think this is potentially very useful for our KWARC work, in particular to aid in large-scale analysis of documents, e.g. in the arXiv, for instance in notation understanding (@Christine are you listening?). In fact, that is what Stephen is talking about at the moment. He is also interested in pen-based formula recognition, and it is clear that this is helpful there. This also provides the motivation for looking at formulae only, since pen-based math only has the formulae (not a lot of words there). I think there is also another motivation: formulae and text constrain each other.

They use the arXiv (of course), and they also have a corpus of engineering math texts, which is a good corpus, since engineering students imprint on this (like geese on Konrad Lorenz’ rubber boots). I would like to get my hands on this corpus.

So Stephen computed the symbol frequencies on the corpora and used pre-existing area classifications as the reference classification. The ranking of symbols seems to give a nice key to distinguish areas. In fact, you only need to look at the ten most common symbols to identify the area. This really looks like CoP data. This is certainly very interesting for us.
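
As a rough illustration of the idea (not Stephen's actual method, which I only know from the talk), one could rank the symbols of a document by frequency and compare the top ten with a per-area profile. The overlap score and the function names below are my own assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def top_symbols(symbols: Iterable[str], k: int = 10) -> List[str]:
    """Return the k most frequent symbols, most frequent first."""
    return [s for s, _ in Counter(symbols).most_common(k)]


def guess_area(
    doc_symbols: Iterable[str], area_profiles: Dict[str, List[str]], k: int = 10
) -> str:
    """Pick the area whose top-symbol profile overlaps most with the document.

    The plain set overlap is an assumed, deliberately naive similarity measure.
    """
    doc_top = set(top_symbols(doc_symbols, k))
    return max(area_profiles, key=lambda area: len(doc_top & set(area_profiles[area])))
```

The area profiles themselves would be computed with `top_symbols` over all documents that the pre-existing classification assigns to an area.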

I would really like to see whether this technique can be used to predict citation cliques/cartels or the math genealogy database.

egomath search engine talk at DML Workshop

Sunday, July 27th, 2008

I am sitting in the DML (Digital Mathematical Libraries) workshop in Birmingham listening to Jozef Misutka’s talk on his search engine.

It is surprising how many math search engines are out there; this project was started in 2004, and I had not really known about it. Jozef also uses an existing search engine for indexing, but he does a syntactic analysis before he indexes, at least for formulae. This is the central part of his talk. He tries to deduce the correct meaning from the input, which seems to be PDF.

Steps:

  1. normalization (heuristic)
  2. linearization (since his search engine works on strings/words)
  3. partial evaluation (e.g. with distributivity)
  4. generalization (introduction of variables in the index)
  5. ordering (for commutative operators)
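
To get a feeling for how such a pipeline produces index keys, here is a small sketch of three of the steps (generalization, ordering, linearization). It is not egomath's implementation: the term representation as nested tuples, the `?` placeholder, and all function names are assumptions, and the heuristic normalization and partial evaluation steps are left out.

```python
COMMUTATIVE = {"+", "*"}  # assumed set of commutative operators


def linearize(term):
    """Flatten a term into a prefix string that a word-based index can store."""
    if isinstance(term, str):
        return term
    op, *args = term
    return "(" + " ".join([op] + [linearize(a) for a in args]) + ")"


def generalize(term, variables):
    """Replace every listed variable by the placeholder "?"."""
    if isinstance(term, str):
        return "?" if term in variables else term
    op, *args = term
    return (op, *[generalize(a, variables) for a in args])


def order(term):
    """Sort the arguments of commutative operators so a+b and b+a coincide."""
    if isinstance(term, str):
        return term
    op, *args = term
    args = [order(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=linearize)
    return (op, *args)


def index_key(term, variables):
    """Generalize, order, and linearize a term into a single index string."""
    return linearize(order(generalize(term, variables)))
```

With this, `index_key(("+", "x", "2"), {"x", "y"})` and `index_key(("+", "2", "y"), {"x", "y"})` yield the same string, so a query for one formula would find the other.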

This seems to be an attempt to get semantics into the search, i.e. E-retrieval, but do we have clear information about the equivalence relation E?

I think that the most important contribution here is probably the analysis phase, since he extracts formulae from something as poor as PDF.

Interesting

CodeML competitors (or hopefuls)

Monday, March 24th, 2008

I have just stumbled upon another justification (as in people having problems with the current state of the art) for our CodeML project: integrating code with syntax highlighting into presentations (and web pages, …), i.e. into situations where we do not have a suitable parser at hand but still want to change the appearance of the code and have access to its semantics and structure.

Submitting content to OMBase and logging

Monday, March 24th, 2008

While I was reading up on the REST papers in my last post, I stumbled upon the following best practice for making sure that material is only submitted once to a RESTful application. This is something we should adopt in OMBase as well, just to be safe.

Another thing that we should think about in this arena is enabling some form of RESTful logging facility, so that users can find out what happened to the content. The technology that seems best suited for that is RSS or Atom Syndication (probably the latter). The nice thing is that we could log all the changes to any URI we use in the system. I am not sure under which URL we would address the log; one idea is to just make use of the MIME type application/atom+xml, just as for the XHTML presentation as suggested in my last post, which would at least alleviate the choice of URL.
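
A minimal sketch of what such a log could look like, using only the Python standard library. The function names, the shape of the change records, and the entry id scheme are hypothetical and not part of OMBase; only the Atom namespace and the required `id`, `title`, and `updated` elements come from the Atom format itself.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ATOM_MIME = "application/atom+xml"


def wants_log(accept_header: str) -> bool:
    """Naive content negotiation: serve the log if the client asks for Atom."""
    return ATOM_MIME in accept_header


def change_log_feed(resource_uri: str, changes) -> str:
    """Build an Atom feed for one resource.

    `changes` is a non-empty list of (timestamp, summary) pairs, oldest first,
    with RFC 3339 timestamps such as "2008-03-24T10:00:00Z".
    """
    if not changes:
        raise ValueError("an Atom feed needs an 'updated' timestamp")
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = resource_uri
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = f"Change log for {resource_uri}"
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = changes[-1][0]
    for timestamp, summary in changes:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        # Hypothetical id scheme: resource URI plus the change timestamp.
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = f"{resource_uri}#{timestamp}"
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = summary
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = timestamp
    return ET.tostring(feed, encoding="unicode")
```

The point of `wants_log` is that the log lives at the same URI as the content; only the requested MIME type decides which representation is returned.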

Ontology repair in Physics

Thursday, February 21st, 2008

I am just sitting in the CIAO workshop, and Alan Bundy and Michael Chan are talking about a very nice topic: the evolution of ontologies in Physics. They are applying this to historical examples like the latent heat problem and the MOND theory that is hot in Physics at the moment. The idea is that when experiments contradict theory, there is a clash between the theory ontology Ot and the sensory ontology Os, which they solve by renaming apart selected concepts between the ontologies to resolve the contradiction. So they change the ontologies by renaming. The nice thing is that they can interpret the operation of renaming as a conservative theory extension, which gives a nice interpretation of minimal theory change/repair.

You can find the details here.

Even though I totally buy into their observations, I think that it would be better to keep the theories as they are and interpret the repair operations as theory morphisms. That would be a non-destructive operation, and the operations would become very natural theory morphisms.

Disambiguation of Mathematical Text

Friday, August 17th, 2007

Oooops, this is a left-over draft from MKM

…..

Claudio Sacerdoti Coen (HELM group in Bologna) is talking about disambiguation. It seems that he has really nailed down most of the practical aspects of the problem.

When types do not help, we have to ask the user, and he is trying to do this with the least nuisance. He defines the notion of a spurious error (most errors are spurious), as opposed to the “real errors”. As always, it is great to hear him talk. I wonder what information he needs for the algorithm: is what we have in the new OMDoc presentation system enough?

He even has a correctness proof. I want a demo.

Narrative Structure of Mathematical Text

Saturday, June 30th, 2007

Here we are again at MKM 2007, listening to Krzysztof Retel from the Ultra group at Heriot Watt; he is talking about the narrative structure of mathematical text. This is very much related to our own MathUI paper.

He proposes to annotate text fragments with names and to annotate the relations between the boxes with RDF triples. Then the “dependency graph” is transformed into the “graph of logical precedences” by changing some directions. The first is used for checking what we call the document ontology, and the second for checking the consistency of the text. I do not see anything that we cannot do in OMDoc.

Q: are there any relations that we do not already have in OMDoc? I think not.
Q: is this more than just a standoff-version of OMDoc in RDF? I think not.

MathLang and OMDoc and Souring and Aggregation

Saturday, June 30th, 2007

I am sitting in Robert Lamar’s (from the Ultra Group at Heriot Watt) talk on MathLang. He has a very ambitious goal: he wants to restore natural language as an input method for mathematics. The idea is that he does a linguistic analysis of the mathematical text (including the formulae), and at every level the “boxes” can be annotated with meaning. (I would guess that he is using a categorial grammar approach for that; in any case, the result is a nicely hierarchical phrase structure, at least for English.) This seems to build on the old Nederpelt & Kamareddine weak type theory, which we have also talked about in a KWARC graduate seminar.

In any case, all he does seems to be at the text level and does not seem to transcend sentences. So it would really work inside the OMDoc statement level. We could just come up with an XML encoding of the MathLang boxes (do they have one?) and make it an OMDoc module. That would standardize it, keep it in sync with OMDoc, and of course give OMDoc much better control over natural language. I wonder how much of this is automatic.

A wonderful concept he is introducing is “souring”, i.e. the inverse of sugaring (which makes things palatable to the human). So souring makes things palatable to the computer. We would probably call this preloading. The souring operation is used for analyzing chains of equations, … This seems quite similar to things I have done in sTeX (and was very proud of at the time). I will have to look it up and compare it.

He takes the souring notion to the extreme, so that he can even take aggregation into account, e.g. \forall x,y:A –> \forall x:A \forall y:A. This is really nice to see for a lambda-person like me, quite nifty. Is this really automated? He has souring constructors share, chain, fold, map, position.
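
The aggregation example is easy to replay in a few lines. This is only my own sketch of the expansion, not MathLang's souring machinery; the tuple encoding of binders and the function name are assumptions.

```python
def sour_forall(variables, typ, body):
    """Expand an aggregated binder into nested single-variable binders.

    sour_forall(["x", "y"], "A", body) models
    \\forall x,y:A. body  -->  \\forall x:A \\forall y:A. body
    """
    result = body
    # Wrap from the innermost variable outwards so the original order is kept.
    for var in reversed(variables):
        result = ("forall", var, typ, result)
    return result
```

For instance, `sour_forall(["x", "y"], "A", "P")` gives `("forall", "x", "A", ("forall", "y", "A", "P"))`.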

I wonder whether this gives a very strong presentation language for OMDoc; we already have map in our system, so maybe we should look at this. I am quite intrigued.

Lessons from the DLMF search

Saturday, June 30th, 2007

I am sitting in Abdou Youssef’s talk on his search engine for the DLMF. One thing that struck me is that he generates hit fragment descriptions by pre-computing the fragments at indexing time, storing them in a database, and then doing a fragment search. In contrast to MWS, where we compute the fragment at reporting time, he only assembles the hit page from the database, which seems more reliable and of course faster. I think that this should be a standard technique in Math Search, independent of the search engine.
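
The difference is easy to see in a toy version. The sketch below is an assumption-laden illustration, not the DLMF implementation: fragments are stored once at indexing time, the index points at fragment ids, and reporting a hit is nothing more than a lookup.

```python
from typing import Dict, List


class FragmentIndex:
    """Toy index that stores pre-computed fragments and searches over them."""

    def __init__(self) -> None:
        self.fragments: Dict[int, str] = {}     # fragment id -> display text
        self.postings: Dict[str, List[int]] = {}  # word -> fragment ids

    def add_document(self, fragments: List[str]) -> None:
        """Index a document that has already been split into fragments."""
        for text in fragments:
            frag_id = len(self.fragments)
            self.fragments[frag_id] = text
            for word in set(text.lower().split()):
                self.postings.setdefault(word, []).append(frag_id)

    def search(self, word: str) -> List[str]:
        """Return the stored fragments for a hit; nothing is computed here."""
        return [self.fragments[i] for i in self.postings.get(word.lower(), [])]
```

All the hard work (deciding what a good fragment is) happens before `add_document` is called, which is exactly where control over the document sources pays off.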

Of course they have it good, since they generate all their documents from LaTeX and have good control over what is a good fragment. In the general case, this is not true. But we could use some discourse grammar techniques to do the fragment computation.