Archive for July, 2008

Message from CICM: MathTran

Sunday, July 27th, 2008

Jonathan Fine focuses on capturing and improving large amounts of materials.

MathTran is a public web service, with samples. Display on a web page is based on img tags and JavaScript, with instant previews. The aim is coping with breadth: doing a little with large amounts of content, instead of doing much with small amounts of content (depth).

See paper

A MathSearch Competition

Sunday, July 27th, 2008

In my last post I just learned about a new search engine. We should really have a competition and example library for Math Search Engines. We talked about this some years back but we really need to get our act together, probably for the next MKM.

I can see five tasks that we have to accomplish for a competition:

  1. collect a search corpus. It seems that the arXiv would be the right thing to start from here; it is big enough to pick competition examples randomly.
  2. cooperate on an analysis pipeline and corpora. This would allow people to participate without having a full analysis pipeline of their own.
  3. collect a corpus of search queries. This may be the biggest hurdle, since we need a gold standard of what we expect the hits to be.
  4. come up with “divisions”. Not all engines can do the same things, so we should only let comparable engines compete; multiple divisions will also allow for multiple trophies.
  5. build a competition harness, so that tests can be automated. This will also require, and thus lead to, general search APIs.
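
To make tasks 3 and 5 a bit more concrete, here is a minimal sketch of what such a harness could look like. Everything in it is an assumption for illustration: the `SearchEngine` interface (a function from a query string to a ranked list of document ids), the shape of the gold standard, and the choice of precision and recall at k as scores. A real competition would have to agree on all of these.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interface: an engine maps a query string to a ranked list of document ids.
SearchEngine = Callable[[str], List[str]]


def precision_recall_at_k(hits: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall of the first k hits against the gold-standard set."""
    top = hits[:k]
    found = sum(1 for doc in top if doc in relevant)
    precision = found / len(top) if top else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


def run_competition(
    engines: Dict[str, SearchEngine],
    gold: Dict[str, Set[str]],
    k: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Run every engine on every gold-standard query and average the scores."""
    scores: Dict[str, Dict[str, float]] = {}
    for name, engine in engines.items():
        results = [precision_recall_at_k(engine(q), rel, k) for q, rel in gold.items()]
        n = len(results) or 1  # avoid division by zero on an empty query corpus
        scores[name] = {
            "precision": sum(p for p, _ in results) / n,
            "recall": sum(r for _, r in results) / n,
        }
    return scores
```

The “divisions” of task 4 would then simply be separate `engines` dictionaries, each run against the queries its division is expected to handle.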

This is all I can think about at the moment, so give me your feedback.

Message from CICM: MathEdit – an Alternative to the Sentido Formulae Editor?

Sunday, July 27th, 2008

See MathUI submission

Message from CICM: Notation and Frequency of Symbols

Sunday, July 27th, 2008

At CICM 2008 (Workshop DML), Stephen Watt presented his work on analyzing the frequency of symbols; that would be an interesting infrastructure for further CoP-based analysis.
See Michael’s blog and the DML Proceedings.

Another talk (Workshop MathUI) was on his handwriting recognition of mathematical notations, presenting his representation approach. See paper

The challenge is that there is no fixed dictionary. But maybe CoPs provide some restrictions of potential parsing results? Or is frequency a better approach?

Message from CICM: iMath – Case Study on Mathematical Notation Writing

Sunday, July 27th, 2008

Marc Wagner implemented a plugin for TeXmacs which tracks a user writing and modifying a document. This was done to gain intuitions for extending his Plato editor (identifying the linguistic phenomena). An interesting aspect is that the process of writing notations is also a practice, not just the selection of a notation. Another aspect is the level of formality users choose to solve their tasks. An analysis of the solutions might be an interesting case study for CoPs.

  • Most modifications were due to notation errors (so automatic verification would be very helpful).
  • Sentence fragments were classified (linguistic ontology to deal with linguistic proofs).
  • Pointed out practice for “concluding step”.
  • Pointed out practice for “justifying” steps. (partly very hard to parse automatically: specific science or natural language)

Plan for the future: Additional components for the “ideal mathematical assistance system”. Among others

  • Linguistic Ontology for concepts, types, theory structures.
  • Dynamic Adaptation of Notations (Change Management).
  • Context Memory???.

See paper at MathUI 2008

Stephen Watt’s talk on analyzing subject areas by symbol frequencies

Sunday, July 27th, 2008

Stephen Watt is proposing to automate subject classification from the document content. He says we should believe the document more than the classifiers. I think this is potentially very useful for our KWARC work, in particular to aid in large-scale analysis of documents, e.g. in the arXiv, for instance in notation understanding (@Christine are you listening?). In fact that is what Stephen is talking about just at the moment. He is also interested in pen-based formula recognition, and it is clear that this is helpful there. This also provides the motivation for looking at formulae only, since pen-based math only has the formulae (not a lot of words there). I think there is also another motivation: formulae and text constrain each other.

They use the arXiv (of course), and they also have a corpus of engineering math texts, which is a good corpus, since engineering students imprint on this (like geese on Konrad Lorenz’ rubber boots). I would like to get my hands on this corpus.

So Stephen computed the symbol frequencies on the corpora, and used pre-existing area classifications as the reference classes. The ranking of symbols seems to give a nice key to distinguish areas. In fact, you only need to look at the 10 most common symbols to identify the area. This really looks like CoP data. This is certainly very interesting for us.
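
To illustrate the idea of identifying an area from its most common symbols, here is a minimal sketch. It is not Stephen's method: the `rank_overlap` similarity measure, the representation of symbols as plain strings, and the toy area profiles in the usage note are all assumptions made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def top_symbols(symbols: Iterable[str], n: int = 10) -> List[str]:
    """Return the n most frequent symbols, most frequent first."""
    counts = Counter(symbols)
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [symbol for symbol, _ in ranked[:n]]


def rank_overlap(a: List[str], b: List[str]) -> float:
    """Assumed similarity: shared symbols score higher the closer their ranks are."""
    position_in_b = {symbol: i for i, symbol in enumerate(b)}
    score = 0.0
    for i, symbol in enumerate(a):
        if symbol in position_in_b:
            score += 1.0 / (1 + abs(i - position_in_b[symbol]))
    return score


def classify(doc_symbols: Iterable[str], area_profiles: Dict[str, List[str]], n: int = 10) -> str:
    """Pick the area whose ranked symbol profile is most similar to the document's."""
    doc_top = top_symbols(doc_symbols, n)
    # Iterating over sorted area names makes ties resolve deterministically.
    return max(sorted(area_profiles), key=lambda area: rank_overlap(doc_top, area_profiles[area]))
```

With hypothetical profiles such as `{"algebra": ["x", "+", "=", "G"], "analysis": ["\\int", "d", "x", "\\epsilon"]}`, a document whose symbols are dominated by `\int` and `d` would be assigned to "analysis". The profiles themselves would come from running `top_symbols` over a corpus with known area classifications.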

I would really like to see whether this technique can be used to predict citation cliques/cartels or the math genealogy database.

egomath search engine talk at DML Workshop

Sunday, July 27th, 2008

I am sitting in the DML (Digital Mathematical Libraries) workshop in Birmingham listening to Jozef Misutka’s talk on his search engine.

It is surprising how many math search engines are out there; this project was started in 2004, and I had not really known about it. Jozef also uses an existing search engine for indexing, but he does a syntactic analysis before he indexes, at least for formulae. This is the central part of his talk. He tries to deduce the correct meaning from the input, which seems to be PDF.

Steps:

  1. Normalization (heuristic),
  2. linearization (since his search engine works on strings/words)
  3. partial evaluation (e.g. with distributivity)
  4. generalization (introduction of variables in the index)
  5. ordering (for commutative operators)

This seems to be an attempt to get semantics into the search, i.e. E-retrieval, but do we have clear information about the equivalence relation E?
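
To get a feeling for what the later steps do, here is a minimal sketch of ordering, generalization and linearization on toy expression trees. It is not Jozef's implementation: the nested-tuple representation, the set of commutative operators, the placeholder names `v1`, `v2`, ... and the parenthesized token format are all assumptions for illustration, and normalization and partial evaluation are left out entirely.

```python
from typing import Dict, List, Optional, Tuple, Union

# Assumed toy representation: a leaf is a string, a node is (operator, arg1, arg2, ...).
Expr = Union[str, Tuple]

COMMUTATIVE = {"+", "*"}  # assumption: which operators may have their arguments reordered


def linearize(expr: Expr) -> str:
    """Flatten a tree into space-separated tokens that a word-based engine can index."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return " ".join(["(", op, *[linearize(a) for a in args], ")"])


def order(expr: Expr) -> Expr:
    """Sort the arguments of commutative operators into a canonical order."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [order(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=linearize)
    return (op, *args)


def generalize(expr: Expr, names: Optional[Dict[str, str]] = None) -> Expr:
    """Replace alphabetic leaves (variables) by v1, v2, ... in order of first occurrence."""
    if names is None:
        names = {}
    if isinstance(expr, str):
        if expr.isalpha():
            return names.setdefault(expr, f"v{len(names) + 1}")
        return expr  # constants such as "2" are kept as they are
    op, *args = expr
    return (op, *[generalize(a, names) for a in args])


def index_keys(expr: Expr) -> List[str]:
    """Keys to index for one formula: the exact ordered form and its generalized form."""
    ordered = order(expr)
    return [linearize(ordered), linearize(generalize(ordered))]
```

For example, `(y + x) * 2` and `2 * (a + b)` get the same generalized key `( * ( + v1 v2 ) 2 )`, so a query for one would find the other. In terms of the question above, the relation E realized by this sketch is just commutativity plus variable renaming, and even that only approximately, since the ordering is computed on the original variable names.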

I think that the most important contribution here is probably the analysis phase, since he extracts formulae from something as poor as PDF.

Interesting