Stephen Watt’s talk on analyzing subject areas by symbol frequencies
Stephen Watt proposes to automate subject classification from the document content. He says we should believe the document more than the classifiers. I think this is potentially very useful for our KWARC work, in particular to aid in large-scale analysis of documents, e.g. in the arXiv, and for instance in notation understanding (@Christine are you listening?). In fact that is what Stephen is talking about just at the moment. He is also interested in pen-based formula recognition, and it is clear that subject information is helpful there. This also provides the motivation for looking at formulae only, since pen-based math input consists almost entirely of formulae (not a lot of words there). I think there is also another motivation: formulae and text constrain each other.
They use the arXiv (of course), and they also have a corpus of engineering math texts, which is a good corpus, since engineering students imprint on this (like geese on Konrad Lorenz’ rubber boots). I would like to get my hands on this corpus.
So Stephen computed the symbol frequencies on the corpora, and used pre-existing area classifications as the reference labels. The ranking of symbols seems to give a nice key to distinguish areas. In fact, you only need to look at the ten most common symbols to identify the area. This really looks like CoP data. This is certainly very interesting for us.
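To make the idea concrete for myself, here is a minimal sketch of how such a ranking-based classification could look. This is my own toy reconstruction, not Stephen's actual method: the function names (`top_symbols`, `rank_similarity`, `classify`), the similarity score, and the tiny example corpus are all assumptions made up for illustration.

```python
from collections import Counter

def top_symbols(symbol_lists, k=10):
    """Return the k most frequent symbols over a collection of documents,
    each given as a list of symbol tokens. Ties are broken alphabetically
    so that the ranking is deterministic."""
    counts = Counter()
    for symbols in symbol_lists:
        counts.update(symbols)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [symbol for symbol, _ in ranked[:k]]

def rank_similarity(profile, candidate):
    """Hypothetical score: each shared symbol contributes more the closer
    its rank is in the two rankings. Not taken from the talk."""
    position = {symbol: i for i, symbol in enumerate(profile)}
    score = 0.0
    for i, symbol in enumerate(candidate):
        if symbol in position:
            score += 1.0 / (1 + abs(i - position[symbol]))
    return score

def classify(doc_symbols, profiles, k=10):
    """Pick the area whose top-k symbol ranking best matches the document."""
    candidate = top_symbols([doc_symbols], k)
    return max(profiles, key=lambda area: rank_similarity(profiles[area], candidate))

# Toy corpus with pre-existing area labels (invented data).
corpus = {
    "analysis": [["∫", "d", "x", "=", "ε", "∫", "lim"], ["ε", "δ", "<", "lim", "∫"]],
    "algebra": [["∘", "=", "G", "∈", "⊕"], ["G", "H", "∘", "≅", "∈"]],
}
profiles = {area: top_symbols(docs) for area, docs in corpus.items()}

print(classify(["∫", "ε", "lim", "x", "∫"], profiles))
```

On real data one would of course extract the symbols from the formulae of a corpus such as the arXiv and compare against the existing area classifications; the scoring function here is just the simplest thing I could think of.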
I would really like to see whether this technique can be used to predict citation cliques/cartels or the math genealogy database.
July 27th, 2008 at 3:41 pm
[...] of symbols, that would be an interesting infrastructure for further cop-based analysis. See Michael’s blog and the DML [...]