
Detecting events in a million New York Times articles


By iliasfl - Posted on 20 September 2010

Title: Detecting events in a million New York Times articles
Publication Type: Conference Proceedings
Year of Conference: 2010
Authors: Snowsill, T., I. Flaounas, D. T. Bie, and N. Cristianini
Conference Name: Machine Learning and Knowledge Discovery in Databases, European Conference (ECML/PKDD)
Volume: 6323
Pagination: 615-618
Publisher: Springer, LNCS
Conference Location: Barcelona, Spain
Abstract

We address the task of detecting surprising patterns in large textual data streams. These patterns can reveal real-world events when the streams are generated by online news media, emails, Twitter feeds, movie subtitles, scientific publications, and more. The volume of such text streams often exceeds human capacity for analysis, making automatic pattern recognition tools indispensable.

In particular, we are interested in surprising changes in the frequency of n-grams of words, or more generally of symbols drawn from an alphabet of unlimited size. Despite the exponentially large number of possible n-grams over the alphabet (which is itself unbounded), we show how such changes can be detected efficiently. To this end, we rely on a data structure known as a generalised suffix tree, annotated with a limited amount of statistical information. Crucially, we show how both the generalised suffix tree and these statistical annotations can be updated efficiently in an on-line fashion.
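
To make the idea of a "surprising change in n-gram frequency" concrete, here is a minimal Python sketch. It is not the method of the paper: it swaps the generalised suffix tree for a plain dictionary of counts, caps n at a small `max_n`, and scores surprise with a simple Poisson-style z-score using add-one smoothing. The function names, the `threshold` value, and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs, max_n=2):
    """Count every n-gram (1 <= n <= max_n) in a batch of documents.

    Assumption: documents are plain strings, tokenised on whitespace.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def surprising_ngrams(history, current, threshold=3.0, max_n=2):
    """Flag n-grams that occur far more often in `current` than `history` predicts.

    Illustrative scoring only: z = (observed - expected) / sqrt(expected),
    where `expected` is the smoothed historical count rescaled to the
    size of the current batch.
    """
    past = count_ngrams(history, max_n)
    now = count_ngrams(current, max_n)
    # Rescale historical counts to the number of documents in the new batch.
    scale = len(current) / max(len(history), 1)
    flagged = {}
    for gram, observed in now.items():
        expected = (past[gram] + 1) * scale  # add-one smoothing avoids zero
        z = (observed - expected) / sqrt(expected)
        if z >= threshold:
            flagged[gram] = z
    # Most surprising n-grams first.
    return sorted(flagged.items(), key=lambda item: -item[1])
```

Calling `surprising_ngrams(history, current)` with two lists of article texts returns the n-grams whose counts jumped in the newer batch, ordered by score. A dictionary like this stores every n-gram explicitly and must be given a fixed `max_n`; the abstract's generalised suffix tree is what the authors rely on to avoid that limitation and to support on-line updates.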

URL: http://www.springerlink.com/content/2811715q36363736/