Non-topical factors in information access

Jussi Karlgren, SICS

Research in information retrieval has traditionally concentrated on making assumptions about the content of documents based on very shallow semantic analysis through word occurrence statistics of various kinds.
But texts are more than bags of words, and the semantic analysis that information retrieval systems typically use is overly simple. There is ample reason to broaden our view of what a text is and why it is written and read.
But better content analysis alone will not be enough. Texts are more than their meaning. Texts have structure and a context; they are written in a style that conforms to or departs from the genre in which they are to be understood; they may be carefully written or hastily thrown together; and they are written by various types of agent for various reasons.
Beyond what can be learned from the text itself or from its author, texts are used by readers of various backgrounds, for various reasons, and with varying degrees of satisfaction.
This talk will outline a framework for extracting more knowledge from texts than an approximation of their topic, and for using this knowledge to design useful tools for information access.