for the degree of Doctor of Philosophy
at The University of Hong Kong
Corpora annotated with semantic information are essential resources for natural language understanding by computers. There are two broad types of semantic information (McEnery & Wilson, 2001): (i) semantic features of words, that is, the annotation of word senses, and (ii) semantic relations among items in a sentence, essentially the annotation of the event roles of participants involved in events. Recent developments in large Chinese treebanks such as the Penn Chinese Treebank and the Sinica Corpus have shown a shift of interest from syntactic annotation to semantic annotation, in the form of shallow semantic parsing. Typically, these treebanks annotate the event roles of syntactic constituents in relation to the event denoted by the main verb of a sentence, but not the semantic features of individual words. Li et al. (2003) tagged both word senses and semantic relations among words in a sentence. However, two different frameworks were used for the two kinds of tagging, and these cannot be readily integrated. This study adopts the approach of HowNet, which provides a consistent way to incorporate both kinds of semantic information when annotating a Chinese corpus.
HowNet is a knowledge base that describes inter-concept and inter-attribute relations. Meanings of concepts are constructed from a closed set of primitives, or sememes: the basic units of meaning that cannot be decomposed further. Message Structures are built on the semantic features revealed by sememes, and they provide a consistent way of constructing meaning from the level of words up to phrases and sentences. This approach enables the present study to incorporate both word senses and semantic dependency relations in semantic annotation.
Previous studies tagged a corpus of 30,976 words with word senses, based on the 1999 version of HowNet (Gan & Tham, 1999). Having updated the word senses to the 2000 version, this study proceeded to annotate Chinese texts with Message Structures, following Gan & Wong (2000a). The aim is to provide resources that help computers work out the semantic relations among all words in a sentence, and thus arrive at an understanding of the message conveyed by the sentence.
Compared with other HowNet-based studies, good results were obtained from off-the-shelf automatic annotation tools in word sense disambiguation and semantic dependency parsing of this corpus. These tools were applied to tag the part of the corpus left unfinished by previous studies. The outputs of the tools were proofread by human annotators to check for errors, which saved much manpower and time compared with the purely manual annotation used in previous studies.
Problems encountered in the manual annotation of HowNet’s Message Structures and in their application to automatic parsing are discussed. Some Message Structures violate Robinson’s axioms, which constrain the well-formedness of dependency structures. This thesis proposes modifications of these structures to solve the resulting problems in automatic parsing, and suggests rules for recovering the more complicated structures in post-processing. The results show that HowNet provides a robust model for incorporating linguistic knowledge into a Chinese corpus, shedding light on the nature of Chinese and providing an effective means of analyzing the language.
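As an illustration of the kind of well-formedness check referred to above, the following is a minimal sketch of how a dependency structure might be tested against Robinson's axioms as they are commonly paraphrased: exactly one word is independent, no word depends directly on more than one head, and any word lying between a head and its dependent must itself depend on a word inside that span. The function name `axiom_violations` and the representation of a structure as (head, dependent) index pairs are assumptions made for this example; they are not the tools or data format used in this study, and the sketch does not test for cycles.

```python
def axiom_violations(n_words, arcs):
    """Report which of Robinson's axioms a dependency structure breaks.

    n_words: number of words in the sentence, indexed 0..n_words-1
    arcs:    iterable of (head, dependent) index pairs (hypothetical format)
    Returns a list of (axiom_name, offending_items) tuples; empty if none.
    """
    arcs = list(arcs)

    # Collect every head recorded for each word.
    heads = {w: [] for w in range(n_words)}
    for head, dep in arcs:
        heads[dep].append(head)

    found = []

    # Exactly one word may be independent (the root); all others need a head.
    roots = [w for w in range(n_words) if not heads[w]]
    if len(roots) != 1:
        found.append(("root", roots))

    # No word may depend directly on more than one head.
    multi = [w for w in range(n_words) if len(heads[w]) > 1]
    if multi:
        found.append(("multiple_heads", multi))

    # Every word strictly between a head and its dependent must depend on
    # some word inside that span (the projectivity condition).
    crossing = []
    for head, dep in arcs:
        lo, hi = min(head, dep), max(head, dep)
        for c in range(lo + 1, hi):
            if not any(lo <= h <= hi for h in heads[c]):
                crossing.append((head, dep))
                break
    if crossing:
        found.append(("projectivity", crossing))

    return found


# A three-word sentence whose middle word heads the other two is well formed.
print(axiom_violations(3, [(1, 0), (1, 2)]))
```

Under this representation, a structure in which one word is attached to two heads at once is reported under `multiple_heads`; a parser that assumes a tree cannot produce such a structure directly, which is why it would have to be simplified for parsing and restored afterwards by rule.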