The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems

Presentation by Andreas Hotho at Lernen, Wissen, Adaptivität (LWA 2008), University of Würzburg, 6.-8. October 2008. Track: KDML

The social bookmarking system BibSonomy has to deal with a lot of spam, which hampers the quality of search results and navigation. This talk focuses on classifying users as spammers, which makes all their posts invisible in the system. The decision is based on their tagging behaviour and on personal data such as the e-mail address. The authors present a framework that allows for automatic classification of spammers.

How to detect spammers: by checking all their tags and, possibly, the bookmarked sites. A post is considered spam if:

  • Tags describing a web page do not fit the content of the site.
  • Tags and/or topic of a post are not interesting for the system.

Problems:

  • Subjective notion of what is spam
  • No cross-check; noise
  • Only two classes: spam or non-spam
  • Identifying spammers at the user level may not be granular enough; it might be better to flag individual posts as spam
  • User may have several accounts

Features:

  • Profile features (digits in the name, digits in the e-mail address, length of the name and of the e-mail address)
  • Activity features (time between registration and first post, number of tags per post – spammers use more, …)
  • Location features (number of users in the same domain or IP address)
  • Semantic features (the tag “$Group”, set automatically by spamming software, which can be used to make tags public in some bookmarking systems; a blacklist of spam tags; co-occurrence information such as “a spammer shares resources with about 18 other spammers, but only with 0.5 non-spammers”)
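
To make the feature groups above more concrete, here is a minimal Python sketch of how one feature from each group might be computed. It is not the authors' implementation: all function names, argument names, and data layouts (for example `domain_of` and `resources_of`) are hypothetical, and the exact feature definitions used in the talk may differ.

```python
from collections import Counter


def profile_features(username, email):
    """Profile features: digit counts and lengths of name and e-mail address."""
    return {
        "name_digits": sum(ch.isdigit() for ch in username),
        "mail_digits": sum(ch.isdigit() for ch in email),
        "name_length": len(username),
        "mail_length": len(email),
    }


def activity_features(registered_at, post_times, tags_per_post):
    """Activity features: delay before the first post, average tags per post."""
    # Timestamps are assumed to be plain numbers (e.g. seconds).
    delay = min(post_times) - registered_at if post_times else None
    avg_tags = sum(tags_per_post) / len(tags_per_post) if tags_per_post else 0.0
    return {"first_post_delay": delay, "avg_tags_per_post": avg_tags}


def location_features(user, domain_of):
    """Location feature: users registered under the same domain (including this one)."""
    counts = Counter(domain_of.values())
    return {"users_in_same_domain": counts[domain_of[user]]}


def cooccurrence_features(user, resources_of, known_spammers):
    """Semantic feature: other users who share at least one resource with `user`,
    split into already-known spammers and everyone else."""
    mine = resources_of[user]
    spam = ham = 0
    for other, resources in resources_of.items():
        if other == user or not (mine & resources):
            continue
        if other in known_spammers:
            spam += 1
        else:
            ham += 1
    return {"shared_with_spammers": spam, "shared_with_non_spammers": ham}
```

The dictionaries returned by these functions could be merged into one feature vector per user and passed to a classifier.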

Classification algorithms: SVM (best), J48, Logistic regression (worse), and Naive Bayes
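
As an illustration of how such a classifier works on numeric user features, the following is a from-scratch sketch of Gaussian Naive Bayes, one of the algorithm families listed above. It is only a toy: the two features and the six training rows are made up, and it is an assumption that a Gaussian model suits these features. The talk's experiments presumably relied on existing, more robust implementations.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Estimate the prior and the per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one feature vector."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy data: [digits in name, average tags per post].
rows = [[0, 2.0], [1, 3.0], [0, 2.5], [5, 9.0], [4, 8.0], [6, 10.0]]
labels = ["non-spam", "non-spam", "non-spam", "spam", "spam", "spam"]
model = train_gaussian_nb(rows, labels)
```

Calling `predict(model, [5, 9.5])` classifies a new user with many digits in the name and many tags per post; with real data, the model would be evaluated on held-out users rather than on hand-picked examples.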
