Concept-Based Text Representations for Categorization Problems

by Magnus Sahlgren

Standard practice in most of today's research on text categorization is to represent texts simply as the bag of words they contain, ignoring both syntax and semantics. As an alternative to this, we have developed a novel form of text representation that we call Bag-of-Concepts, which constitutes a distributed representation of the concepts contained in a text.

By far the most common representational scheme in text categorization research is the Bag-of-Words (BoW) approach. Here a text is represented as a vector whose elements are frequency-based weights of the words in the text. These BoW vectors are then typically refined, for example by feature selection, in which words are removed from the representations according to statistical measures such as document frequency, information gain or mutual information. Another refinement method is feature extraction. In this case, 'artificial' features are created from the original ones, either by using clustering methods such as distributional clustering, or by using factor analytic methods such as singular value decomposition.
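As a rough illustration of the BoW scheme described above, the following Python sketch builds raw term-frequency vectors and applies a simple document-frequency filter as the feature selection step. The function names, the toy documents and the `min_df` threshold are assumptions made for this example only; practical systems normally use more refined weights than raw counts.

```python
from collections import Counter

def select_by_document_frequency(docs, min_df):
    """Keep the words that occur in at least `min_df` documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    return sorted(w for w, n in df.items() if n >= min_df)

def bag_of_words(tokens, vocabulary):
    """Return a raw term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Toy corpus (hypothetical data for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat chased a dog".split(),
]

vocabulary = select_by_document_frequency(docs, min_df=2)
vectors = [bag_of_words(d, vocabulary) for d in docs]
```

Note that every vector has one element per retained word, so the dimensionality of the representation follows the size of the vocabulary.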

It is important to note that feature extraction methods handle problems with word variability in one of two ways. Either they group together words that mean similar things, or they restructure the data (ie the number of features) according to a small number of salient dimensions, so that similar words get similar representations. Since these methods do not represent texts merely as collections of words, but rather as collections of concepts - whether these be synonym sets or latent dimensions - we suggest that a more fitting label for these representations is Bag-of-Concepts (BoC).

One serious problem with BoC approaches is that they tend to be either computationally expensive or dependent on external resources such as dictionaries. To overcome this problem, we have developed an alternative approach for producing BoC representations based on Random Indexing (see ERCIM News No.50, July 2002). This is a vector space methodology for producing 'context vectors' for words based on co-occurrence data. Very simply, this is achieved by first assigning a unique random 'index vector' to each context in the data. Context vectors are then produced by summing the index vectors of the contexts in which words occur. The point of the context vectors is that they represent the relative meanings of words; they can also be used to compute the semantic similarity of words.
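The two steps of random indexing described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: it treats each document as one context, uses sparse index vectors with a few randomly placed +1 and -1 entries, and the dimensionality (64), the number of non-zero entries (4) and all function names are illustrative choices.

```python
import math
import random

def make_index_vector(dim, nonzeros, rng):
    """Sparse random vector: `nonzeros` positions set to +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec

def build_context_vectors(docs, dim=64, nonzeros=4, seed=0):
    """For each word, sum the index vectors of the documents it occurs in."""
    rng = random.Random(seed)
    context = {}
    for tokens in docs:
        index_vec = make_index_vector(dim, nonzeros, rng)  # one per context
        for word in tokens:  # one addition per occurrence of the word
            cv = context.setdefault(word, [0] * dim)
            for i, x in enumerate(index_vec):
                cv[i] += x
    return context

def cosine(u, v):
    """Cosine similarity, used here as the semantic similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical data for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat chased a dog".split(),
]
context = build_context_vectors(docs)

# "sat" and "on" occur in exactly the same documents, equally often,
# so they receive identical context vectors.
similarity = cosine(context["sat"], context["on"])
```

Words that share contexts accumulate the same index vectors and therefore end up close to each other, while the vector length stays at the chosen dimensionality regardless of how many documents are processed.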

We use the context vectors produced with random indexing to generate BoC representations by summing the (weighted) context vectors of every word in a text. The resulting BoC vectors are effectively combinations of the concepts (ie word meanings) that occur in the text. Note that the representations are produced using standard vector addition, which means that their dimensionality never increases even though the data might grow: the dimensionality of the vectors is a parameter in random indexing. Since we typically choose a dimensionality much lower than the number of words and contexts in the data, we also achieve a reduction in dimensionality as compared to the original BoW representations.
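The summation step can be illustrated with a short Python sketch. The context vectors and word weights below are made-up toy values (in practice the context vectors would come from random indexing and the weights from a frequency-based scheme), and the function name `bag_of_concepts` is an assumption for this example.

```python
def bag_of_concepts(tokens, context_vectors, weights=None):
    """Sum the (optionally weighted) context vectors of the words in a text."""
    dim = len(next(iter(context_vectors.values())))
    boc = [0.0] * dim
    for word in tokens:
        cv = context_vectors.get(word)
        if cv is None:  # word has no context vector: skip it
            continue
        w = 1.0 if weights is None else weights.get(word, 1.0)
        for i, x in enumerate(cv):
            boc[i] += w * x
    return boc

# Toy 4-dimensional context vectors and weights (hypothetical values).
context_vectors = {
    "cat": [1, 0, -1, 0],
    "dog": [1, 1, 0, 0],
    "mat": [0, 0, 1, -1],
}
weights = {"cat": 2.0, "dog": 1.0, "mat": 0.5}

short_text = ["cat", "mat"]
long_text = ["cat", "dog", "dog", "mat", "unknown"]

boc_short = bag_of_concepts(short_text, context_vectors, weights)
boc_long = bag_of_concepts(long_text, context_vectors, weights)
```

Both texts map to vectors of the same length, whatever the number of words they contain, which is the property that keeps the dimensionality fixed as the data grows.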

To evaluate the BoC representations, we have used them for text categorization, which is the task of assigning a text to one or more predefined categories from a given set. Our experiments use a support vector machine classifier on a standard text categorization collection, and show that the BoC representations outperform BoW (88.74% vs. 88.09%) when counting only the ten largest categories. This suggests that BoC representations might be more appropriate for large categories.

Our experiments also showed that it is always the same categories that are improved using BoC. This suggests that we might be able to improve the performance of the classifier by combining the two types of representations. When doing so, the result improves from 82.77% to 83.91% for all categories. For the top ten categories, the result improves from 88.74% to 88.99%. While the differences are admittedly small, the increase in performance when combining representations is consistent, and indicates that concept-based text representations deserve further study.


Please contact:
Magnus Sahlgren, SICS, Sweden
Tel: +46 8 633 1604