Archive for the ‘MathSearch’ Category

A MathSearch Competition

Sunday, July 27th, 2008

In my last post I just learned about a new search engine. We should really have a competition and example library for Math Search Engines. We talked about this some years back but we really need to get our act together, probably for the next MKM.

I can see five tasks that we have to accomplish for a competition:

  1. collect a search corpus. It seems that the arXiv would be the right place to start, since it is big enough to pick competition examples randomly.
  2. cooperate on an analysis pipeline and corpora. This would allow people to cooperate without having a full analysis pipeline.
  3. collect a corpus of search queries. This may be the biggest hurdle, since we need a gold standard of what we expect the hits to be.
  4. come up with “divisions”. Not all engines can do the same things, so we should only let comparable engines compete; multiple divisions will also allow us to award multiple trophies.
  5. build a competition harness, so that tests can be automated. This will also require, and thus lead to, general search APIs.
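
To make the harness idea a bit more concrete, here is a minimal sketch in Python of how one division could be scored. Everything in it is an assumption for illustration: the names `score_query` and `run_division`, the idea that an engine is just a function from a query string to a list of document ids, and the choice of precision, recall and F1 as the measure. A real competition would have to agree on all of these.

```python
from dataclasses import dataclass


@dataclass
class Score:
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        total = self.precision + self.recall
        return 0.0 if total == 0 else 2 * self.precision * self.recall / total


def score_query(returned: list[str], gold: set[str]) -> Score:
    """Compare one engine's hits for one query against the gold-standard hits."""
    hits = set(returned)
    correct = len(hits & gold)
    precision = correct / len(hits) if hits else 0.0
    recall = correct / len(gold) if gold else 0.0
    return Score(precision, recall)


def run_division(engines, queries):
    """Mean F1 per engine.

    engines: hypothetical mapping from engine name to a search function.
    queries: mapping from query string to the set of expected document ids.
    """
    results = {}
    for name, search in engines.items():
        scores = [score_query(search(q), gold).f1 for q, gold in queries.items()]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results
```

The point of the sketch is only that once the gold standard (task 3) and a common search API (task 5) exist, the scoring itself is a few lines.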

This is all I can think of at the moment, so give me your feedback.

Stephen Watt’s talk on analyzing subject areas by symbol frequencies

Sunday, July 27th, 2008

Stephen Watt is proposing to automate subject classification from the document content. He says we should believe the document more than the classifiers. I think this is potentially very useful for our KWARC work, in particular to aid in large-scale analysis of documents, e.g. in the arXiv, for instance in notation understanding (@Christine are you listening?). In fact that is what Stephen is talking about just at the moment. He is also interested in pen-based formula recognition, and it is clear that this is helpful there. This also provides the motivation for looking at formulae only, since pen-based math only has the formulae (not a lot of words there). I think there is also another motivation: formulae and text constrain each other.

They use the arXiv (of course) and they also have a corpus of engineering math texts, which is a good corpus, since engineering students imprint on this (like geese on Konrad Lorenz’ rubber boots). I would like to get my hands on this corpus.

So Stephen computed the symbol frequencies on the corpora, and used pre-existing area classifications as the classes. The ranking of symbols seems to give a nice key to distinguish areas. In fact, you only need to look at the 10 most common symbols to identify the area. This really looks like CoP data. This is certainly very interesting for us.
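
To get a feeling for the idea, here is a small Python sketch. It is not Stephen’s actual method: I am assuming that each document has already been reduced to a list of symbols, and I compare top-10 lists by simple set overlap rather than by rank. The function names are made up for the example.

```python
from collections import Counter


def top_symbols(formula_symbols, k=10):
    """Return the k most frequent symbols, most frequent first."""
    return [sym for sym, _ in Counter(formula_symbols).most_common(k)]


def build_profiles(labelled_docs, k=10):
    """Build one top-k symbol list per area.

    labelled_docs: iterable of (area, list_of_symbols) pairs, where the
    area labels come from a pre-existing classification.
    """
    counts = {}
    for area, symbols in labelled_docs:
        counts.setdefault(area, Counter()).update(symbols)
    return {area: [s for s, _ in c.most_common(k)] for area, c in counts.items()}


def classify(symbols, profiles, k=10):
    """Pick the area whose top-k symbols overlap most with the document's."""
    doc_top = set(top_symbols(symbols, k))
    return max(profiles, key=lambda area: len(doc_top & set(profiles[area])))
```

A rank-aware comparison would probably separate areas better than plain overlap, but even this crude version shows how little machinery is needed once the symbol counts are available.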

I would really like to see whether this technique can be used to predict citation cliques/cartels or the math genealogy database.

egomath search engine talk at DML Workshop

Sunday, July 27th, 2008

I am sitting in the DML (Digital Mathematical Libraries) workshop in Birmingham listening to Jozef Misutka’s talk on his search engine.

It is surprising how many math search engines are out there; this project was started in 2004, and I had not really known about it. Jozef also uses an existing search engine for indexing, but he does a syntactic analysis before he indexes, at least for formulae. This is the central part of his talk. He tries to deduce the correct meaning from the input, which seems to be PDF.

Steps:

  1. normalization (heuristic)
  2. linearization (since his search engine works on strings/words)
  3. partial evaluation (e.g. with distributivity)
  4. generalization (introduction of variables in the index)
  5. ordering (for commutative operators)
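
To see how such steps fit together, here is a toy Python sketch of three of them (ordering, generalization, linearization). This is my own illustration and not egomath’s implementation: I assume formulae are already parsed into nested tuples like `("+", "b", "a")`, I treat only `+` and `*` as commutative, and I leave out normalization and partial evaluation entirely.

```python
COMMUTATIVE = {"+", "*"}


def order(term):
    """Sort the arguments of commutative operators so a+b and b+a coincide."""
    if isinstance(term, str):
        return term
    op, *args = term
    args = [order(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def generalize(term, names=None):
    """Replace variable names by numbered placeholders in order of first use."""
    names = {} if names is None else names
    if isinstance(term, str):
        if term.isalpha():
            return names.setdefault(term, f"id{len(names) + 1}")
        return term
    op, *args = term
    return (op, *[generalize(a, names) for a in args])


def linearize(term):
    """Flatten a tree into a space-separated prefix string for a text index."""
    if isinstance(term, str):
        return term
    op, *args = term
    return " ".join([op, *map(linearize, args)])
```

With this, `linearize(generalize(order(("+", "b", "a"))))` and the same call on `("+", "x", "y")` both give `"+ id1 id2"`, so a word-based engine would index them under the same string.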

This seems to be an attempt to get semantics into the search, i.e. E-retrieval, but do we have clear information about the equivalence relation E?

I think that the most important contribution here is probably the analysis phase, since he extracts formulae from something as poor as PDF.

Interesting

Success Rates in the arXMLiv project

Sunday, December 9th, 2007

I have been silent for a long time, since the semester and various papers have kept me busy. But the semester is over, now…

We have been making some progress on the conversion of the arXiv collection from LaTeX to XHTML+MathML (see the arXMLiv project at KWARC), and I have announced that we have over 50% “success rate”. I have been asked by Aaron Krowne what success rate means and when we are going to reach 100%. Here is the story.

First I would like to briefly talk about what we are doing. We are using Bruce Miller’s LaTeXML converter over the ca 370 000 documents contained in the arXiv. Heinrich Stamerjohanns has built a test harness for LaTeXML that parses the log files and makes the statistics available on the web. This is a very powerful way of doing things: it has exposed a lot of problems in LaTeXML and has allowed Bruce to make the program much more stable (see e.g. the fatal error development). At 370 000 LaTeX documents from all over the world over 15 years, there is almost no error you will not encounter. The other result is that we are sitting on what is probably the largest collection of documents with MathML in them worldwide.

The main technical task of the arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collections. A group of Jacobs University undergrads is helping with this. Since we are still in development mode, we only download last year’s collection of articles after New Year (about 80000+ new ones in a couple of weeks).

Now, let’s come back to the questions. Technically, success means that the LaTeXML program does not throw any errors, i.e. that all macros are known. Whether the transformation is mathematically correct is another matter; this needs human testers, and organizing that is an interesting problem in itself. We have first ideas in this direction and will try to make progress on this front in the next months.
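
Spelled out as a small Python sketch, the definition looks like this. The input format is an assumption made for the example (a plain mapping from document id to the number of errors logged); it says nothing about what the LaTeXML logs or our harness actually look like, and the function names are hypothetical.

```python
def success_rate(error_counts):
    """Fraction of documents whose conversion reported zero errors.

    error_counts: mapping from document id to number of errors logged
    (a hypothetical summary, not the real log format).
    """
    if not error_counts:
        return 0.0
    ok = sum(1 for n in error_counts.values() if n == 0)
    return ok / len(error_counts)


def rate_by_category(error_counts, category_of):
    """Success rate per category; category_of maps document id -> category."""
    grouped = {}
    for doc, n in error_counts.items():
        grouped.setdefault(category_of[doc], {})[doc] = n
    return {cat: success_rate(docs) for cat, docs in grouped.items()}
```

Note that a document counts as a success here as soon as no error is reported, whether or not the output is mathematically right.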

 

And now to the percentages: I am not sure whether we will hit 90% at all. The problem is that this is about the number of files that can still be successfully processed by LaTeX, since arXiv does not have a viable package management system.

Furthermore, arXiv papers use about 7000 packages and classes, of which I guess three quarters are used by fewer than five papers. So we are only going to bother about giving LaTeXML bindings for the more important ones (following the 80/20 rule). Moreover, the older the papers get, the less likely they are to be successful, so I guess conversion success rates will go up automatically when we add the 2007 papers (ca 80000+). Finally, the success rates vary considerably over the different categories of the arXiv. The success rate actually dropped by 10 percentage points to 50% when we started on a big new category (we had been at about 60% before).

My personal suspicion is that we will reach 70% in the next three months, then the going will become slower, and I am not sure how far beyond 80% we will realistically get any time soon with the resources we have (a couple of undergrads). To get beyond that, we would have to take the project global, which I would not mind, but which I am not necessarily seeing as one of my priorities. But you are of course invited to join our little project, so just contact me if you are interested.

Lessons from the DLMF search

Saturday, June 30th, 2007

I am sitting in Abdou Youssef’s talk on his search engine for the DLMF. One thing that struck me is that he does hit fragment descriptions by pre-computing the fragments at indexing time, storing them in a database, and then doing a fragment search. In comparison with MWS, where we compute the fragment at reporting time, he only assembles the hit page from the database, which seems more reliable and of course faster. I think that this should be a standard technique in Math Search that is independent of the search engine.

Of course they have it good, since they generate all their documents from LaTeX and have good control over what is a good fragment. If we are in the general case, this is not true. But we could use some discourse grammar techniques to do the fragment computation.
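
The index-time approach can be sketched in a few lines of Python. This is a toy under my own assumptions, not the DLMF engine or MWS: the class name `FragmentIndex` is made up, the “database” is a dictionary, terms are just lower-cased words, and the fragments are assumed to be cut already (which is exactly the hard part in the general case).

```python
class FragmentIndex:
    """Toy index that stores display fragments when documents are indexed."""

    def __init__(self):
        self.fragments = {}   # fragment id -> precomputed display text
        self.postings = {}    # term -> set of fragment ids

    def add_document(self, doc_id, fragments):
        """fragments: list of text pieces already cut at indexing time."""
        for i, text in enumerate(fragments):
            frag_id = (doc_id, i)
            self.fragments[frag_id] = text
            for term in text.lower().split():
                self.postings.setdefault(term, set()).add(frag_id)

    def search(self, term):
        """Assemble the hit list purely by lookup; nothing is recomputed."""
        ids = sorted(self.postings.get(term.lower(), ()))
        return [(frag_id, self.fragments[frag_id]) for frag_id in ids]
```

All the work of deciding what a fragment is happens in `add_document`; `search` only looks things up, which is why the reporting step is fast and predictable.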

Treating Mathdex Presentation trees for MathWebSearch

Saturday, June 30th, 2007

I am sitting in Robert Miner’s (Design Science) talk on Mathdex at MKM2007. He is stressing that the most important thing in their experiments turned out to be data normalization. He is actually going a long way towards semantics for the general case. At least he is generating some kind of trees, so we will be able to index them in the MWS system. They are interpretable as first-order terms, which is all that MWS needs. It would be very nice if we could build on the idea we had last MKM to have a set of challenge examples that we all can work on and compare our systems.