US 5675819 Document information retrieval using global word co-occurrence patterns
ABSTRACT – A method and apparatus accesses relevant documents based on a query. A thesaurus of word vectors is formed for the words in the corpus of documents. The word vectors represent global lexical co-occurrence patterns and relationships between word neighbors. Document vectors, which are formed from the combination of word vectors, are in the same multi-dimensional space as the word vectors. A singular value decomposition is used to reduce the dimensionality of the document vectors. A query vector is formed from the combination of word vectors associated with the words in the query. The query vector and document vectors are compared to determine the relevant documents. The query vector can be divided into several factor clusters to form factor vectors. The factor vectors are then compared to the document vectors to determine the ranking of the documents within the factor cluster.
FIELD OF THE INVENTION
This invention relates to improvements in retrieving relevant documents from a corpus of documents. More particularly, this invention determines the co-occurrence patterns of words in a document to form a thesaurus of word vectors.
BACKGROUND OF THE INVENTION
Information retrieval systems typically define similarity between queries and documents in terms of a weighted sum of matching words. The usual approach is to represent documents and queries as long vectors and use similarity search techniques. One problem is that a query and a document may share ambiguous words that are used in a different sense. Therefore, a false similarity match could occur (ambiguity problem). A second problem is that the same content may be expressed in different words. Therefore, a short query may miss a relevant document (synonymy problem). If a document is relevant but uses words synonymous to words in the query, the document cannot be found.
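The two problems can be seen in a minimal sketch of matching-word similarity. The function name `match_score`, the optional `weights` table, and the sample texts are illustrative assumptions, not part of any system cited here.

```python
from collections import Counter

def match_score(query, document, weights=None):
    """Weighted sum over the words that occur in both query and document."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    weights = weights or {}  # hypothetical per-word weights; default is 1.0
    return sum(weights.get(w, 1.0) * q[w] * d[w] for w in q if w in d)

doc_legal = "the litigation ended when the judge ruled"
doc_software = "the interpreter executes each statement in the program"

# Synonymy problem: the relevant document says "litigation", not "lawsuit".
synonymy_score = match_score("lawsuit", doc_legal)

# Ambiguity problem: "interpreter" matches, but in the programming sense.
ambiguity_score = match_score("court interpreter", doc_software)
```

Here the relevant legal document receives no credit at all, while the unrelated software document scores on the shared ambiguous word.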
One solution is to lengthen the query through relevance feedback. After the first search is conducted using an initial query, additional words are added to the query to narrow the search for the next search iteration.
Another solution is to expand a query through synonym relations as found in a thesaurus. A synonym relation is one that finds “closely related words” as opposed to “syntactically and semantically interchangeable words”. Thus, the procedure simply maps from one word to other closely related words.
For a thesaurus to be useful in information retrieval, it must be specific enough to offer synonyms for words as used in the corpus of interest. For example, in a corpus of computer science documents, the word “interpreter” would have meanings quite different from everyday language. A thesaurus must also cover all or most of the words found in queries, including the potentially unbounded set of proper nouns.
These two considerations suggest that generic thesauri, which are restricted to common usage, are unlikely to be helpful. Thus, the thesauri must be tuned to the corpus of interest. These might be hand built for a restricted domain or computed from the text of the corpus itself.
A thesaurus is a data structure that defines semantic relatedness between words. It is typically used in information retrieval to expand search terms with other closely related words. Even if the thesaurus is not explicitly computed, the mapping performed by query expansion implicitly defines a thesaurus.
The simplest, and perhaps most conventional, approach to thesaurus construction is to manually build an explicit semantic mapping table. This is clearly labor-intensive, and hence only possible in specialized domains where repeated use may justify the cost. For example, the RUBRIC and TOPIC text retrieval systems require a domain expert to prepare a hierarchical structure of “topics” germane to a particular subject area. See McCune et al.; “Rubric, A System for Rule-based Information Retrieval”; IEEE Transactions on Software Engineering 9; pp. 939-44; 1985. Each topic is a boolean combination of other topics and search terms. Searchers then employ terms from this hierarchy to form queries that automatically expand to complex boolean expressions.
Another approach is to reuse existing on-line lexicographic databases, such as WordNet (see Voorhees et al.; “Vector Expansion in a Large Collection”; Proceedings of TREC, 1992.) or Longman’s subject codes (see Liddy et al.; “Statistically-guided Word Sense Disambiguation”; Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language; 1992 AAAI Press). However, generic thesauri of this sort will often not be specific enough for the text collection at hand. For example, in Voorhees et al., “acts” is expanded with the meaning “acts of the apostles” in a corpus of legal documents. In addition, they frequently do not record information about proper nouns, yet proper nouns are often excellent retrieval cues.
Corpus-based methods perform a computation on the text of the documents in the corpus to produce a thesaurus. For example, a hierarchical thesaurus is formed from a computer list of complex noun phrases where subsumption roughly corresponds to the subset relation defined on terms, e.g., “intelligence” subsumes “artificial intelligence”. See Evans et al.; “Automatic Indexing Using Selective NLP and First-order Thesauri”; Proceedings of the RIAO; Vol. 2, pp. 624-43; 1991. While this method is superior to approaches that treat phrase terms as unanalyzed segments, there is no notion of semantic similarity of basic terms. For example, the semantic similarity of “astronaut” and “cosmonaut” is not represented in the hierarchy.
Head-modifier relationships can be used to determine semantic closeness. See Grefenstette, G.; “Use of Syntactic Context to Produce Term Association Lists for Text Retrieval”; Proceedings of SIGIR 1992; pp. 89-97. See Ruge, G.; “Experiments on Linguistically-based Term Associations”; Information Processing & Management 28 (3); pp. 317-32; 1992. This solution is costly since parsing technology is required to determine head-modifier relations in sentences. It is also unclear to what extent words with similar heads or modifiers are good candidates for expansion. For example, adjectives referring to countries have similar heads (“the Japanese/Chilean capital”, “the Japanese/Chilean government”), but adding “Japanese” to a query that contains “Chilean” will rarely produce good results.
Semantic relatedness can be determined by considering the occurrence of terms in documents. See Crouch, C.; “An Approach to the Automatic Construction of Global Thesauri”; Information Processing & Management 26 (5); pp. 629-40; 1990. Documents are clustered into small groups based on a similarity measure. Two documents are considered similar if they share a significant number of terms, with medium frequency terms preferentially weighted. Terms are then grouped by their occurrence in these document clusters. Since a complete-link document clustering is performed, the procedure is very computationally intensive and does not scale to a large reference corpus. Further, the central assumption that terms are related if they often occur in the same documents seems problematic for corpora with long documents. It also does not capture the intuitive notion that synonyms do not co-occur, but rather have similar co-occurrence patterns.
Crouch constructs thesaurus classes by grouping words into bins of related words. Unfortunately, the boundaries between classes will inevitably be somewhat artificial. If classes are made too small, some words will be cut off from part of their topical neighborhood. If classes are too large, words will be forced into classes with words from different topics. Any particular class size will either separate some words from close neighbors or lump together some words with distant terms.
A thesaurus can be constructed by defining a similarity measure on terms within the document. See Qiu et al.; “Concept Based Query Expansion”; Proceedings of SIGIR 1993. Terms are represented as high-dimensional vectors with a component for each document in the corpus. The value of each component is a function of the frequency the term has in that document. They show that query expansion using the cosine similarity measure on these vectors improves retrieval performance. However, the time complexity for computing the similarity between terms is related to the size of the corpus because the term vectors are high-dimensional. Thus, the method cannot be used on a large scale.
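A minimal sketch of this document-occurrence representation follows, assuming raw term counts as the component function (the cited work may weight them differently). The helper names and the three sample documents are hypothetical.

```python
import math

def term_document_vectors(docs):
    """Map each term to a vector with one component per document (raw counts)."""
    vectors = {}
    for j, doc in enumerate(docs):
        for term in doc.lower().split():
            vectors.setdefault(term, [0] * len(docs))[j] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "the court heard the lawsuit",
    "the judge dismissed the lawsuit",
    "the rocket reached orbit",
]
vecs = term_document_vectors(docs)

# Terms sharing documents are similar; terms in disjoint documents are not.
sim_related = cosine(vecs["lawsuit"], vecs["court"])
sim_unrelated = cosine(vecs["lawsuit"], vecs["rocket"])
```

Every term vector has as many components as there are documents, which is why the cost of each comparison grows with the size of the corpus.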
A large-scale singular value decomposition can be used for information retrieval. See Deerwester et al.; “Indexing by Latent Semantic Analysis”; Journal of the American Society of Information Science 41 (6); pp. 391-407; 1990. Retrieval performance is improved by inducing representations for documents that reflect term dependencies and remedy the bumpiness of small counts. However, undetected term dependencies and small counts are a problem if document occurrence is the basis of representation.
The Deerwester method is similar to context vectors, because the LSI (latent semantic indexing) algorithm can compute document and query vectors as weighted sums of term vectors. However, LSI is based on document occurrence. Decomposing a term-by-document matrix of a large collection can take days or even weeks because the time complexity is quadratic in the number of documents to process.
Documents can be represented as vectors whose entries correspond to microfeatures such as finance, animal kingdom, etc. See Gallant, S. I.; “A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks”; Neural Computation 3 (3); pp. 293-309; 1991. Word vectors are manually encoded for a medium number of words. Then, the document vectors are computed as sums of word vectors. However, this hand-encoding of documents is labor-intensive. Furthermore, the microfeatures chosen or the manual encoding may not accurately reflect the specifics of the document collection.
Machine-readable dictionaries can be used to derive “context vectors.” See Wilks et al., “Providing Machine Tractable Dictionary Tools”; Machine Translation; Vol. 5, No. 2, pp. 99-154; 1990. The representation of a word is based on words it co-occurs with in a dictionary rather than a corpus. Context vectors are then computed from these dictionary-based word representations. This method has the same problems as other dictionary-based approaches. In particular, a genre-specific distinction that is not covered in the dictionary is not adequately represented in the dictionary-based representations.
Some researchers argue against the utility of co-occurrence-based expansion. See Peat et al.; “The Limitations of Term Co-occurrence Data for Query Expansion in Document Retrieval Systems”; Journal of the American Society for Information Science 42 (5); pp. 378-83; 1991. Because synonyms often do not occur together, they assert that a co-occurrence-based approach may have difficulty identifying synonymy relations. Furthermore, many researchers use measures for defining closeness that will group words according to frequency. By using these measures, it is impossible for a frequent word to have an infrequent neighbor.
SUMMARY OF THE INVENTION
An object of the invention is to provide a new corpus-based method for constructing a thesaurus from the lexical co-occurrence of terms in the corpus.
Another object of the invention is to provide a method for accessing and browsing documents based on content similarity. Words and documents are represented as vectors in the same multi-dimensional space, which is derived from global lexical co-occurrence patterns. The method improves retrieval performance for non-literal matches with queries.
The computation of the lexical co-occurrence thesaurus proceeds in two phases. First, the lexical co-occurrence pattern of each word is represented as a multidimensional vector, the thesaurus vector. Second, a similarity measure is induced on words by comparing these vectors. Given a particular word, its synonyms are then defined to be its nearest neighbors with respect to the similarity measure. This method of exploiting the lexical co-occurrence structure of words, i.e., forming a word’s vector representation from entries of its near lexical neighbors rather than from only itself, is superior to conventional methods.
Lexical co-occurrence is more informative both qualitatively and quantitatively. Two terms lexically co-occur if they appear in text within some distance of each other, i.e., a window of k words. Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same documents. Quantitatively, there are more co-occurrence events than occurrence-in-document events in a given document collection. For a word occurring n times in the document collection and for a definition of co-occurrence as occurring in a window of k words, there are nk co-occurrence events. However, there are only n occurrence-in-document events. If the goal is to capture information about specific words, lexical co-occurrence is the preferred basis for statistical thesaurus construction.
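The quantitative argument can be checked on a toy token sequence. This sketch assumes the window consists of the k tokens following each occurrence, which is one possible reading of "a window of k words"; the function name and sample text are hypothetical.

```python
def count_events(tokens, word, k):
    """Return (occurrence events, co-occurrence events) for `word`.

    Assumption: the window is the k tokens that follow each occurrence.
    Windows that run past the end of the text are truncated.
    """
    occurrences = [i for i, t in enumerate(tokens) if t == word]
    cooccurrences = sum(len(tokens[i + 1 : i + 1 + k]) for i in occurrences)
    return len(occurrences), cooccurrences

tokens = "judge ruled court judge adjourned court recessed today".split()
n, events = count_events(tokens, "judge", k=3)
```

With n = 2 occurrences and k = 3, the word contributes n·k = 6 co-occurrence events but only 2 occurrence events.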
Synonyms rarely co-occur. Synonyms tend to share neighbors that occur with both. For example, “litigation” and “lawsuit” share neighbors such as “court”, “judge”, and “proceedings”. A thesaurus represented by lexical co-occurrence patterns defines semantic closeness in terms of common neighbors. Thus, synonyms are not required to co-occur, but they must have similar co-occurrence patterns.
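The two phases, and the idea that synonyms share neighbors without co-occurring, can be sketched as follows. The toy sentences, the symmetric window of k = 2 words, the raw-count vectors, and the cosine measure are all illustrative assumptions.

```python
import math
from collections import defaultdict

def cooccurrence_vectors(sentences, k=2):
    """Phase 1: for each word, count the words within k positions of it."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return {word: dict(neigh) for word, neigh in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def nearest_neighbors(word, vectors, n=3):
    """Phase 2: rank the other words by similarity of co-occurrence vectors."""
    target = vectors[word]
    scored = sorted(
        ((cosine(target, vec), other) for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    return [other for _, other in scored[:n]]

sentences = [
    "court judge litigation proceedings",
    "court judge lawsuit proceedings",
    "rocket engine orbit launch",
]
vectors = cooccurrence_vectors(sentences)
closest = nearest_neighbors("litigation", vectors, n=1)
```

In this toy corpus “litigation” and “lawsuit” never appear in the same window, yet each is the other's nearest neighbor because both co-occur with “court”, “judge”, and “proceedings”.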
A multi-dimensional continuous space is formed where each word’s thesaurus vector represents its individual position. A continuous space does not force a classification choice, and hence avoids some of the ensuing problems.
The dimensionality of the thesaurus space is reduced by using a singular value decomposition (SVD). In unreduced term vectors, terms of equal frequency tend to appear close to one another because they have about the same number of zero entries. For a given term, SVD assigns values to all dimensions of the space, so that frequent and infrequent terms can be close in the reduced space if they occur with similar terms. For example, the word “accident,” which may occur 2590 times, and the word “mishaps,” which may occur only 129 times, can have similar vectors that are close despite the frequency difference between them. SVD thus achieves a dimensional reduction that yields a compact and tractable representation for search purposes. The uniform representation for words and documents provides a simple and elegant user interface for query focusing and expansion.
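The effect on a frequent and an infrequent word can be illustrated on a small, made-up co-occurrence matrix. To stay within the standard library, this sketch obtains the leading right singular vectors by power iteration on the Gram matrix, as a stand-in for a full SVD routine (in practice a library call such as `numpy.linalg.svd` would normally be used). All counts, column meanings, and function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def leading_right_singular_vectors(rows, k, iters=100):
    """Approximate the top-k right singular vectors of `rows`.

    Power iteration on G = A^T A, re-orthogonalizing against the vectors
    already found. Assumes the matrix has rank of at least k.
    """
    dim = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    basis = []
    for _ in range(k):
        v = [1.0 / math.sqrt(dim)] * dim
        for _step in range(iters):
            w = [dot(gram_row, v) for gram_row in gram]
            for b in basis:  # remove components along earlier vectors
                proj = dot(w, b)
                w = [x - proj * y for x, y in zip(w, b)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        basis.append(v)
    return basis

# Rows: words. Columns (hypothetical): crash, collision, injury, orbit, launch.
words = ["accident", "mishaps", "wreck", "rocket", "satellite"]
counts = [
    [30, 25, 20, 0, 0],   # frequent word
    [2, 0, 1, 0, 0],      # infrequent word with a sparser vector
    [10, 12, 5, 0, 0],
    [0, 0, 0, 12, 9],
    [0, 0, 0, 8, 10],
]

basis = leading_right_singular_vectors(counts, k=2)
# Coordinates of each word in the 2-dimensional reduced space (A V_k).
reduced = {w: [dot(row, b) for b in basis] for w, row in zip(words, counts)}

raw_sim = cosine(counts[0], counts[1])
reduced_sim = cosine(reduced["accident"], reduced["mishaps"])
```

In this example the raw cosine between “accident” and “mishaps” is about 0.82, held down by the zero entry in the sparser vector, while in the reduced space the two words are nearly collinear and both remain orthogonal to “rocket”.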
After forming the thesaurus vectors, a context vector for each document is computed. The context vector is the weighted sum of the thesaurus vectors of all the words contained in the document. These context vectors then induce a similarity measure on documents and queries that can be directly compared to standard vector-space methods.
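A minimal sketch of the context-vector computation follows. The two-dimensional thesaurus vectors are made up for illustration, and the weight of each word is assumed to be its count in the text times an optional per-word factor; the weighting scheme is not specified in this passage.

```python
import math
from collections import Counter

def context_vector(text, thesaurus, weights=None):
    """Weighted sum of the thesaurus vectors of the words in `text`.

    Assumption: weight = count in text * optional per-word factor (default 1.0).
    Words without a thesaurus vector are skipped.
    """
    weights = weights or {}
    dim = len(next(iter(thesaurus.values())))
    total = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        if word in thesaurus:
            w = weights.get(word, 1.0) * count
            total = [t + w * x for t, x in zip(total, thesaurus[word])]
    return total

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0

# Hypothetical thesaurus vectors in a 2-dimensional reduced space.
thesaurus = {
    "lawsuit": [0.9, 0.1],
    "litigation": [0.8, 0.2],
    "judge": [0.7, 0.1],
    "rocket": [0.0, 1.0],
    "orbit": [0.1, 0.9],
}

query_vec = context_vector("lawsuit", thesaurus)
legal_sim = cosine(query_vec, context_vector("judge litigation", thesaurus))
space_sim = cosine(query_vec, context_vector("rocket orbit", thesaurus))
```

The query shares no word with either document, yet the legal document scores far higher because queries and documents are compared in the same vector space.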
Another application for the thesaurus vectors is to divide a query into sub-topics. A vector for each query sub-topic is formed and compared to the document vectors. The document vectors are then scored and ranked by the degree to which they simultaneously match the sub-topics of the query.
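One way to picture this is sketched below under stated assumptions: query words are grouped by a simple greedy threshold clustering, each group's vector is the sum of its word vectors, and a document's score is its weakest match over all groups, so that only documents covering every sub-topic rank highly. The clustering rule, the threshold, the min-based score, and all vectors are hypothetical choices, not the procedure claimed in the patent.

```python
import math

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0

def split_into_subtopics(words, thesaurus, threshold=0.8):
    """Greedy grouping: a word joins the first cluster whose summed vector it resembles."""
    clusters = []
    for word in words:
        vec = thesaurus[word]
        for cluster in clusters:
            if cosine(vec, cluster["vector"]) >= threshold:
                cluster["words"].append(word)
                cluster["vector"] = [a + b for a, b in zip(cluster["vector"], vec)]
                break
        else:  # no sufficiently similar cluster: start a new sub-topic
            clusters.append({"words": [word], "vector": list(vec)})
    return clusters

def rank_documents(doc_vectors, clusters):
    """Order documents by their weakest sub-topic match, best first."""
    scores = {
        name: min(cosine(vec, c["vector"]) for c in clusters)
        for name, vec in doc_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical thesaurus and document context vectors.
thesaurus = {
    "lawsuit": [1.0, 0.1],
    "litigation": [0.9, 0.2],
    "rocket": [0.1, 1.0],
    "launch": [0.2, 0.9],
}
doc_vectors = {
    "legal_only": [1.0, 0.0],
    "space_only": [0.0, 1.0],
    "both": [1.0, 1.0],
}

clusters = split_into_subtopics(["lawsuit", "rocket", "litigation", "launch"], thesaurus)
ranking = rank_documents(doc_vectors, clusters)
```

The query splits into a legal sub-topic and a space sub-topic, and the document that touches both ranks first, ahead of documents that match only one sub-topic strongly.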