Building a Word Sense Taxonomy and Automatic Annotation for Mandarin Chinese

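ACKNOWLEDGEMENT

First and foremost, I thank my supervisor, Dr Wang Hui, for her constant guidance, valuable advice, and unfailing support throughout my four years at the National University of Singapore. She is not only an incredibly responsible and patient advisor but also an unforgettable friend who understands and cares about my life.

I thank Dr Zhang Min and Dr Chung Raung-Fu for serving on my committee. Dr Zhang, a computer scientist, advised me on technical matters. Dr Chung offered much advice from a linguist's point of view.

I thank Lin Jinzhan, Liu Zengjiao, Qin Shaokang, Wang Yuelong, Xiao Hang, and Xu Tingting for their critical and useful comments on my thesis. The time spent with them enriched my life in Singapore.

I thank NUS for supporting my dissertation research with the Research Scholarship. This support made it possible for me to pursue a doctoral degree at NUS. I thank the Faculty of Arts and Social Sciences at NUS for the research support services it provides. For general administrative services, I thank the Chinese Studies Department at NUS. They are such a brilliant group, and I have learned so much from each one of them.

The most special thanks, of course, go to my family: my wife, Ms Wu Yue, and my parents. I could never have gone this far without their sweetest love and unconditional support.

Table of Contents

ACKNOWLEDGEMENT i
Abstract 1
Chapter 1 Introduction 3
1.1 Sense classes and sense taxonomies 3
1.2 Motivation 4
1.3 Research objectives 6
1.3.1 Sense taxonomy 6
1.3.2 Sense class annotation 8
1.4 Corpus sources 8
1.5 Organization of the dissertation 9
Chapter 2 Literature review and theoretical framework 10
2.1 Building sense classes 10
2.1.1 Methods based on word sense features 10
2.1.2 Methods based on combinatorial features 13
2.1.3 Methods based on semantic fields and domain features 15
2.1.4 Summary of methods for building sense classes 17
2.2 Semantic annotation 19
2.2.1 Knowledge-based word sense disambiguation 20
2.2.2 Corpus-based supervised disambiguation 21
2.2.3 Corpus-based unsupervised disambiguation 22
2.2.4 Summary of word sense disambiguation methods 24
2.3 Theoretical framework and research method 27
2.3.1 Premise theories 29
2.3.2 Research method 31
Chapter 3 Defining the Chinese word sense taxonomy (I): nouns 33
3.1 Features for defining Chinese sense classes 33
3.2 Noun sense taxonomy 37
3.2.1 The role of syntactic function in noun classification 39
3.2.2 The role of semantic roles in noun classification 40
3.2.3 The role of semantic selectional restrictions in noun classification 41
3.2.4 Noun sense classes in detail 42
3.3 Summary 111
Chapter 4 Defining the Chinese word sense taxonomy (II): verbs and adjectives 112
4.1 Verb sense taxonomy 112
4.1.1 The role of syntactic function in verb classification 113
4.1.2 The role of argument structure in verb classification 113
4.1.3 The role of semantic selectional restrictions in verb classification 114
4.1.4 Verb sense classes in detail 114
4.2 Adjective sense taxonomy 138
4.2.1 The role of syntactic function in adjective classification 139
4.2.2 The role of argument structure in adjective classification 139
4.2.3 The role of semantic selectional restrictions in adjective classification 140
4.2.4 Adjective sense classes in detail 140
4.3 Difficulties in building the sense taxonomy 148
Chapter 5 Automatic sense class annotation 152
5.1 Dictionary-based sense class annotation 152
5.1.1 The relation between pinyin and sense classes 153
5.1.2 The relation between dictionary senses and sense classes 154
5.2 Disambiguating multi-class words 155
5.2.1 High-frequency class tagging 156
5.2.2 Supervised automatic disambiguation experiments 158
5.3 Summary 171
Chapter 6 A corpus-based statistical study of sense classes 173
6.1 Frequency and distribution of sense classes 173
6.1.1 Distribution and frequency of noun sense classes 173
6.1.2 Distribution and frequency of verb sense classes 177
6.1.3 Distribution and frequency of adjective sense classes 179
6.2 Internal features of sense classes 181
6.2.1 The complementary relation between internal and external features of sense classes 182
6.2.2 Calculation method 183
6.2.3 Case study 1: fixed-root proportions in the subclasses of noun class "1.1 生物" (living things) 185
6.2.4 Case study 2: fixed-root proportions in the subclasses of verb class "1 自主变化" (autonomous change) 188
6.3 Summary 191
Chapter 7 Conclusion 192
Appendices 196
Appendix 1: Noun sense class word list 197
Appendix 2: Verb sense class word list 236
Appendix 3: Adjective sense class word list 252
Appendix 4: Distribution tables of noun, verb, and adjective sense classes 259
References 262

List of Tables and Figures

Table 1 Statistics on 13 methods for building sense taxonomies 18
Table 2 Accuracy of learning algorithms on the DSO data 22
Table 3 Comparison of word sense disambiguation methods 25
Table 4 Three types of classification features for defining and annotating sense classes 32
Table 5 Semantic roles in the sense class definition framework 35
Table 6 Example entries with multiple pronunciations and multiple sense classes 153
Table 7 Example entries with multiple pronunciations and a single sense class 153
Table 8 Example entries whose sense classes can be distinguished by pronunciation 154
Table 9 Example entries whose sense classes can be distinguished by dictionary sense 154
Table 10 Entries with high-frequency class tagging accuracy above 90% 156
Table 11 Entries with high-frequency class tagging accuracy above 70% 157
Table 12 Entries with high-frequency class tagging accuracy above 50% 157
Table 13 Sense class frequency statistics for the corpus 157
Table 14 Average disambiguation accuracy and rate of excellent results for seven classifiers 167
Table 15 Disambiguation results for 560 multi-class words 167
Table 16 Subclass distribution of noun class "1.1.5 生物部分" (parts of living things) 175
Table 17 Subclass distribution of noun class "1.2 非生物" (non-living things) 176
Table 18 Frequency and member-word counts for subclasses of verb class "1 自主变化" (autonomous change) 178
Table 19 Frequency and member-word counts for subclasses of adjective class "2.2 属性值" (attribute values) 181
Table 20 Fixed-root proportions for subclasses of noun class "1.1 生物" (living things) 186
Table 21 Fixed-root proportions for subclasses of noun class "1.1.1 人" (humans) 186
Table 22 Fixed-root proportions for subclasses of noun class "1.1.2 动物" (animals) 187
Table 23 Fixed-root proportions for subclasses of noun class "1.1.3 植物" (plants) 187
Table 24 Fixed-root proportions for subclasses of noun class "1.1.4 群体" (groups) 187
Table 25 Fixed-root proportions for subclasses of noun class "1.1.5 生物部分" (parts of living things) 188
Table 26 Fixed-root proportions for subclasses of verb class "1 自主变化" (autonomous change) 188
Table 27 Fixed-root proportions for subclasses of verb class "1.1 过程" (process) 189
Table 28 Fixed-root proportions for subclasses of verb class "1.2 状态" (state) 189
Table 29 Fixed-root proportions for subclasses of verb class "1.2.1 境遇" (circumstances) 189
Table 30 Fixed-root proportions for subclasses of verb class "1.3 经历" (experience) 190
Figure 1 Flowchart of the corpus sense class annotation experiment 152
Figure 2 Flowchart of the automatic sense class disambiguation experiment 159
Figure 3 WEKA main interface 162
Figure 4 WEKA classifier selection interface 162
Figure 5 WEKA classifier parameter settings interface 163
Figure 6 Test options provided by WEKA 164
Figure 7 Frequency and member-word counts of noun sense classes 174
Figure 8 Frequency and member-word counts for subclasses of noun class "1 具体名词" (concrete nouns) 174
Figure 9 Subclass distribution of noun class "1.1 生物" (living things) 175
Figure 10 Frequency and member-word counts for subclasses of noun class "2 抽象名词" (abstract nouns) 176
Figure 11 Subclass distribution of noun class "2.1 属性" (attributes) 177
Figure 12 Frequency and member-word counts of verb subclasses 178
Figure 13 Frequency and member-word counts for subclasses of verb class "3 行为活动" (actions and activities) 179
Figure 14 Frequency and member-word counts for subclasses of adjective class "2 一价形容词" (one-place adjectives) 180
Figure 15 Frequency and member-word counts for subclasses of adjective class "2.1 生物值" (living-thing values) 180

Abstract

In this dissertation, we study word sense classes and word sense taxonomies. We created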
word sense taxonomies for Chinese nouns, verbs, and adjectives for use in natural language processing. We then conducted automatic sense class annotation, producing a Chinese corpus annotated with word sense classes. The noun taxonomy has 97 sense classes, the verb taxonomy 39, and the adjective taxonomy 13. We annotated 46,650 words in the corpus: 25,517 nouns, 15,920 verbs, and 5,213 adjectives. Each class in the taxonomy has an operational definition, which makes the taxonomy suitable for corpus annotation; these are its distinguishing characteristics.

We hold that a word's behavior is determined by its meaning, and that word sense is expressed through the word's usage. We study word sense taxonomy within the framework of distributional theory, the theory of semantic selectional restrictions, and syntagmatic theory. Each class in the taxonomy is defined with three types of features: syntactic function, semantic role (for nouns) or argument structure (for verbs), and semantic selectional restrictions. In particular, the features of syntactic function and selectional restrictions are limited in number, which makes the definitions operational enough to serve as annotation instructions for human annotators. The results show that these class definitions make the taxonomy operational in sense class annotation. The methodology used to build the taxonomy is one of the contributions of this dissertation.

Scholars believe that the accuracy of word sense disambiguation systems suffers from the fine sense granularity of polysemous words, especially for senses whose outward patterns of usage are highly similar. We performed automatic classification experiments on multi-class words. Simply tagging each word with its highest-frequency class gives 328 multi-class words a disambiguation precision above 90%. We then used supervised machine-learning classification to disambiguate 560 multi-class
words with more than 13 freqency in corpus The result of the experiments is quite encouraging, the accuracy is inspiring high: 84.1% words get the precision of over 90%; 96% words get the precision of over 85% The disambiguation results show that the word sense taxonomy has enough distintion among sense classes, and verify that the word sense taxonomy is applicable in automatic annotation The capability of being used in automatic annotation distinguishes our word sense taxonomy from other word sense taxonomy built by now Additionally, we propose the complementary relationship of internal and external features of word sense class in classification process, in a corpus based quantitative research of word sense classes Chapter proposes the idea of this dissertation and our research objectives In chapter 2, we make literature review of creating lexical knowledge base and art-of-state word sense disambiguation technologies In chapter and Chapter 4, Chinese word sense taxonomy is presented, with class definition and sample words A critical discussion of the methods we applied in creating the taxonomy is made at the end of chapter In chapter 5, we perform the automatic annotation for corpus, and automatic classification experiments for multi-class words In chapter 6, there is a corpus based quantitative research of word sense classes Chapter is the conclusion of this dissertation Keywords: word sense taxonomy; corpus annotation; syntagmatic theory; word sense disambiguation; Chinese lexical semantics 现代汉语词义分类体系的建立和自动标注第一章绪论 1.1 义类和义类体系本文研究的内容是汉语词义义类及义类体系。义类是在意义上有相似性的词义的集合，被语言学分类特征明确定义，词义是义类的成员或实例；义类体系是义类由分类特征的继承和扩充而形成的树形体系，它的基本单位是义类。在我们看来，义类和义类体系与同义词词典，知识本体，词汇网络都有一定的相似性。同义词词典是把相同词义的词的聚合起来形成的词典。同义词词典和义类体系中都有若干集合，每一个集合表达了一个意义或概念，集合中有若干词语，这些词是集合中的实例。从这个角度来说，义类和同义词是一样的，都是词义类聚的集合；义类体系和同义词词典也是一样的，都是若干词义集合的集合体。知识本体（ontology）是模式化描述知识的表示方式，它要求用形式化的特征按照一定的方式（模型）去描述知识，这样用知识本体描述出来的知识库具备共享性、可扩充性和可移植性特征（Gruber, T, 
2009）。共享性是说用知识本体描述的知识可以被不同的用户使用。可扩充性是说知识本体可以描述知识库中不存在的新知识，简单地扩充知识库。可移植性是说本体知识库是一种被规则定义的数据库，所以可以方便的用于不同的计算系统。义类是被分类特征定义的词义集合，在这点上义类体系的基本思想与本体知识库相同，描述义类的是支持词义分类的语言学特征（Farrar, S, et al.，2002），这样做的目的是使得义类体系具备一定的共享性和可扩充性。词汇网络使用词语间的关系把词义连接起来。在义类体系中，词义之间没有直接的联系，义类间的联系是通过分类特征的继承和扩充得以实现，所以义类体系提供的关系基本上只有上下位关系，义类体系是以树的形式存在。 1.2 课题的提出本文课题的提出主要来源于我们对计算语言学及基于语料库的语言学研究的关注。当代语言学的研究大多是基于语料库的，尤其是应用语言学领域（如计算语言和语言教学），不仅需要高质量的语料库，也需要其他的语言知识资源，如机读词典，义类词典，语义网络等。从计算语言学（自然语言处理）研究的两个方面来看：语言学和计算技术，数据（语言知识）和算法（计算技术）是这个领域的两个基本研究问题。目前自然处理技术的主流方法是统计和规则相结合的方法。所谓规则与统计相结合，指的是运用统计模型从大规模语料库中学习到一些语言知识，然后把这些语言知识作为规则参数实现一些应用，在应用的过程中对规则参数不断进行修正，从而使得系统达到最佳。由于进入程序的语言知识都必须是规则化的、量化的信息，而面向信息处理的语言理论在短时间不能获得，所以训练程序能够学习到多少、多好的语言知识，完全依赖语料库提供了什么东西，所用到的语言学规则只是那些概率几乎为的公理化知识，这样的规则当然是很难得的，也就是说，语料库的加工越精细，统计模型的效率越高。而现在被大规模运用的语料库大多只是提供了浅层的句法信息，如词性，标注了短语的树库都很难得。可以看到，现在机器可用的语言知识大多还是语法知识，解决的问题主要是句法剖析（parsing）、模式识别等不需要太多语义知识参与的问题。目前，计算语言学正面临知识瓶颈的问题，现有的语言知识已经无法满足进一步研究的需要了，需要有语义知识的加入。义类作为一种词义知识，可以被现有的技术利用，提高现有技术的效率并拓展研究领域。要使得义类成为计算机可用的语言知识，首先得有一个合适的义类体系，这是我们关注义类问题的第一个原因。第二个原因是词义标注的问题。人们对语义标注的内容没能达成统一的认识，即语义标注要标注什么。以词义标注为例，词义标注其实是对多义词进行标注的问题。标注的目的是要对词义知识进行建模，所以首先要求用来标注的词典在多义词的释义方面做到颗粒度一致，第二要求被标注的词义知识是可以被建模的，即在语料库中可以抽取到足够的数据来作为词义分类的分类特征。词义标注首先烦胀熨帖凉清爽拘束骄傲轻快光彩沉重清明鼓舞踏实沉痛洋气乐意高昂遗憾反感难受难过安然飘飘然茫然难为情恶心无聊开心刺眼心慌喜恍惚粗直精正正黑软酸风流光明正经冷凶急 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 6 5 5 4 4 3 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.2 心理值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值暴开放狭隘含蓄痛快狭窄深沉优柔神气灵通腐败用功仗义努力霸道听话负责心虚自觉天真矜持乖坏硬窝囊厉害积极浑疯牛唐突烂漫温和恬淡苦涩开朗香香生疏宽广讲理蔫碌碌坎坷巧乏 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 
品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.3 品性值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 253 冲稀松浅薄宽阔香甜险峻精神简单低下混沌运气背运舒展昏沉落魄要好眼红忙啰唆轩昂闲懒反辛苦险恶空闲低下下三烂严整走运识趣偏心没出息在行有理有福要强像话受气守本分深宽粗老破老 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 11 8 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.1.4 状况值 2.2.1.1 可度量值 2.2.1.1 可度量值 2.2.1.1 可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值光粉肥鼓单花花死快细腻粗沉细松紧活灰粗薄响横整齐粗重立体纵直早圆阴腥细稀秃烫碎瘦生烂淡脆粗潮笨温润清朗 8 7 6 6 5 5 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 2.2.1.2 不可度量值 254 整齐朴素勉强经典深朴素清淡肯定别扭干巴软浅抽象软晓畅堂皇谨严高尚邃密轻盈冷僻干瘪锋利粗犷基本根本深邃平易缠绵曲折支离通畅轻快玄虚散漫平直平白精要含蓄冠冕简捷大概严密古典深层纯正 4 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2.2.1.2 不可度量值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 
2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值绝对肯定忠实现实严谨行对中光好通尖好死花好错死顺紧齐轻紧坏值糟圆凶透酥清干逗差清淡不行得力准袅袅冥冥考究透随便软怯尖 2 2 10 10 9 8 7 7 6 5 5 4 4 4 4 4 4 3 3 3 3 3 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.2 内容值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 255 可人可怜含糊干净平和宽绰紧张轻巧净头头正细老热左正新熟老干干矛盾小土铁死熟深浅老快绝活厚鬼肥多毒大错勉强经济一定野新 3 3 3 3 10 10 7 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.3 状态值 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他土熟生满灵死精好高沉洋牢活动好粗轻细好急紧正面宽狠淡急盈盈深尖锐火热自由狠义务稳巧盈盈随便辣悠悠形象朴实简单肯定立体自动密切惨烈 4 4 4 4 4 4 10 6 6 5 4 4 4 4 3 3 3 3 3 3 3 3 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.2.4 其他 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 256 
粲然任意统一民主自足自然慢恍惚荒唐团团漂亮横盈盈随便瑟瑟偏直头深早晚旧晚临时定期久近短原始悠远原初始空余长远长久原先永久短暂久长永恒永远久远深永隽永中生史前 2 2 2 2 2 2 2 2 11 3 2 1 1 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.3 方式事件值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值先决很久好久持久悠久悠长漫长暂时长期短期历次悠邈间接前不久死幽幽相近拥挤寂寞闭塞远近相近开豁空茫卑下冷僻深邃宽松雅静喧闹热闹旷远广大茫茫旷溟濛颟顸迢迢音近黑寂开旷浩渺浩淼浩莽浩茫 2 2 1 1 1 1 1 1 1 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.1 时间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 257 遥远辽远空寂空凹荒芜荒僻荒莽荒凉狭小宽敞 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值偏僻嘈杂幽雅幽僻幽寂僻静宁静孤僻 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 2.4.2 空间值 258 附录四：名词、动词、形容词义类分布表名词义类分布表名词义类标记 1.1.1.1 身份 1.1.1.2 关系 1.1.1.3 超人 1.1.1.4 其他 1.1.2.1 兽 1.1.2.2 鸟 1.1.2.3 鱼 1.1.2.4 昆虫 1.1.2.5 微生物 1.1.3.1 草木 1.1.3.2 果实 1.1.4.1 机构 1.1.4.2 团体 1.1.4.3 其他 1.1.5.1 肢体 1.1.5.2 器官 1.1.5.3 其他 1.2.1.1.1 固体 1.2.1.1.2 液体 1.2.1.1.3 气体 1.2.1.2 能源 1.2.1.3 天文 1.2.1.4 地理 1.2.2.10 其他 1.2.2.1 食物 1.2.2.2 药物 1.2.2.3 衣物 1.2.2.4 材料 1.2.2.5 工具 1.2.2.6 标志物 1.2.2.7 作品 1.2.2.8 建筑物 1.2.2.9 钱财 1.2.3.1 生物废弃物 1.2.3.2 非生物废弃物 1.2.3.3 痕迹 1.2.3 废弃物 1.2.4 非生物部分 1.3 统称 2.1.10 其他属性 2.1.1 数量属性词性 n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n 频次 13570 11905 688 11495 4906 2777 1091 1514 66 5735 1898 1722 
3708 2758 6700 1604 5365 3319 3018 844 97 2718 10677 94 2225 188 1612 876 14012 944 6839 7072 1103 167 131 541 3422 1284 2359 1388 词形数量 1664 702 153 273 370 269 147 165 19 642 296 229 566 230 315 107 841 505 196 79 18 158 815 35 538 71 400 182 2793 186 892 1265 216 36 45 161 875 170 244 254 259 2.1.2 物理属性 2.1.3 生理属性 2.1.4 心理属性 2.1.5 社会属性 2.1.6 内容属性 2.1.7 事件属性 2.1.8 动作行为属性 2.1.9 时空属性 2.2.1.1 光影 2.2.1.2 声 2.2.1.3 其他自然现象 2.2.2 社会现象 2.2.3 生理现象 2.2.4 心理现象 2.3.1 具体符号 2.3.2 抽象符号 2.4.1 社会规范 2.4.2 学科领域 2.4.3 其他信息 2.5.1 事件 2.5.2 活动 2.6.1 数量值 2.6.2 物理属性值 2.6.3 生理属性值 2.6.4 心理属性值 2.6.5 社会属性值 2.6.6 内容值 2.6.7 动作行为属性值 2.6.8 其他属性值 2.7 统称 3.1 具体时间 3.2 相对时间 3.3 时间单位 4.1 处所 4.2 方位总数动词义类分布表动词义类标记 1.1.1 存现 1.1.2 位移 1.1.3 变化 1.2.1.1 情绪 1.2.1.2 生理状态 1.2.1.3 其他境遇 1.2.2 自然现象 n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n 5008 2534 3402 3975 197 1501 2859 3189 1744 2103 4943 573 424 986 708 3537 623 2841 10689 1626 5150 379 2108 272 50 153 86 196 2286 784 821 3795 3065 6998 1198 223244 词性 v v v v v v v 223 219 273 512 38 130 298 165 171 180 586 110 155 192 111 360 113 230 1251 114 715 43 518 93 24 88 61 99 331 16 85 245 49 808 69 25801 频次 10770 13101 1931 1295 2198 2285 526 词性数量 153 294 149 259 257 601 75 260 1.2.3 一般状态 1.2.4 运动 1.3.1 经历 1.3.2 感知意向 1.3.3 所有 1.3.4 影响 1.3.5 产生关系动词 3.1.1 人自动行为 3.1.2 社会行为 3.1.3 位移 3.1.4 一般自动行为 3.2.1 人对象行为 3.2.2 一般对象行为 3.3.1 予取 3.3.2 交际 3.3.3 生产 3.3.4 其他 3.4.1 认知 3.4.2 一般心理活动能愿总数形容词义类分布表形容词义类标记 1.1 关系值 2.1.1 生理值 2.1.2 心理值 2.1.3 品性值 2.1.4 状况值 2.2.1.1 可度量值 2.2.1.2 不可度量值 2.2.2 内容值 2.2.3 状态值 2.2.4 其他 2.3 方式事件值 2.4.1 时间值 2.4.2 空间值总数 v v v v v v v v v v v v v v v v v v v v v 词性 a a a a a a a a a a a a a 7888 2034 3010 9000 4547 2705 2920 37910 6971 11283 5944 9283 15039 15110 6648 11195 2561 41808 3024 7390 5499 243875 频次 339 2730 4533 4261 443 200 22896 1579 5504 20201 5168 1177 1267 70298 1179 190 285 378 149 407 147 649 845 2695 279 453 1092 340 419 202 130 3918 101 235 38 15919 词形数量 20 220 347 488 112 1420 181 644 1232 420 43 83 5213 261 参考文献 
Books:
Lehrer, Adrienne, Semantic Fields and Lexical Structure, American Elsevier Publishing Co., 1974.
Levin, Beth, English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, 1993.
Fellbaum, Christiane (ed.), WordNet: An Electronic Lexical Database, Massachusetts, USA: MIT Press, 1998.
Cruse, D. A., Lexical Semantics, New York: Cambridge University Press, 1986.
Agirre, Eneko and Philip Edmonds (eds.), Word Sense Disambiguation: Algorithms and Applications, Springer, 2006.
Harris, Zellig, Mathematical Structures of Language, New York: Interscience Publishers, 1968.
Witten, Ian H., Eibe Frank and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, 2011.
Lyons, John, Semantics, Cambridge University Press, 1973.
Murphy, M. Lynne, Semantic Relations and the Lexicon: Antonymy, Synonymy and other Paradigms, Cambridge University Press, 2003.
Chomsky, Noam, Aspects of the Theory of Syntax, MIT Press, 1965.
Saint-Dizier, Patrick and E. Viegas, Computational Lexical Semantics, Cambridge University Press, 1998.
Quinlan, Ross, C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann Publishers, 1993.
Abraham, Samuel and Ferenc Kiefer, A Theory of Structural Semantics, The Hague, Paris: Mouton & Co., 1966.
Vossen, P. (ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, 1998.
Dong, Zhendong and Qiang Dong, HowNet and the Computation of Meaning, Hackensack, NJ: World Scientific Publishing Co., 2006.

Articles:
Agirre, Eneko & David Martinez, “Learning Class-to-Class Selectional Preferences”, in Proceedings of the Conference on Natural Language Learning, Toulouse, France, 2001, pp.15-22.
Bai, Xiaopeng & Wang Hui, “Syntactic-Semantic Model for Defining and Subcategorizing for Attribute Noun Class”, in Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, Tohoku University, Sendai, Japan, 2010.
Chklovski et al., “The Senseval-3 Multilingual English-Hindi Lexical Sample Task”, in Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, 2004, pp.5-8.
Chou, Ya-Min and Huang Chu-Ren, “Hantology: An Ontology based on Conventionalized Conceptualization”, in Ontologies and Lexical Resources for Natural Language Processing, Chu-Ren Huang et al. (eds.), Cambridge: Cambridge University Press, 2008.
Baker, Collin F., Charles J. Fillmore & John B. Lowe, “The Berkeley FrameNet Project”, in Proceedings of the 17th International Conference on Computational Linguistics, Vol 1, 1998.
Miltsakaki, Eleni and Livio Robaldo et al., “Sense Annotation in the Penn Discourse Treebank”, in Lecture Notes in Computer Science, Vol 4919, Computational Linguistics and Intelligent Text, 2008, pp.275-286.
John, George H. and Pat Langley, “Estimating Continuous Distributions in Bayesian Classifiers”, in Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, 1995, pp.338-345.
Chao, Gerald and Michael G. Dyer, “Word Sense Disambiguation of Adjectives Using Probabilistic Networks”, in Proceedings of the 17th International Conference on Computational Linguistics, Saarbrucken, 2000.
Hirst, Graeme and David St-Onge, “Lexical Chains as Representations of Context in the Detection and Correction of Malapropisms”, in WordNet: An Electronic Lexical Database, ed. by Christiane Fellbaum, Massachusetts, USA: MIT Press, 1998, pp.305-332.
Huang, Chu-Ren, “Towards a Chinese Wordnet and a CE/EC Bi-Wordnet”, in Chinese Language Sciences Workshop: Lexical Semantics, Department of Chinese, Translation and Linguistics, City University of Hong Kong, 2000.
Platt, J., “Fast Training of Support Vector Machines using Sequential Minimal Optimization”, in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp.185-208.
Joachims, T., “Making Large-Scale SVM Learning Practical”, in Advances in Kernel Methods, MIT Press, 1999, pp.169-184.
Kilgarriff, Adam and Joseph Rosenzweig, “English Senseval: Report and Results”, in Proceedings of the International Conference on Language Resources and Evaluations (LREC), Athens, Greece, 2000, pp.1239-1244.
Leacock, Claudia et al., “Using Corpus Statistics and WordNet Relations for Sense Identification”, in Computational Linguistics, Vol 24 (1), 1998, pp.147-165.
Lee, Y.K. and H.T. Ng, “Supervised Word Sense Disambiguation with Support Vector Machines and Multiple Knowledge Sources”, in Proceedings of the SENSEVAL-3 Workshop, Barcelona, Spain, 2004.
Lesk, Michael, “Automatic Sense Disambiguation using Machine Readable Dictionaries: How to tell a pine cone from an ice cream cone”, in Proceedings of the ACM-SIGDOC Conference, Toronto, Canada, 1986, pp.24-26.
McCarthy, Diana and John Carroll, “Disambiguating Nouns, Verbs and Adjectives Using Automatically Acquired Selectional Preferences”, in Computational Linguistics, Vol 29 (4), 2003, pp.639-654.
Mihalcea, Rada, “Large Vocabulary Unsupervised Word Sense Disambiguation with Graph-Based Algorithm for Sequence Data Labeling”, in Proceedings of the Joint Human Language Technology and Empirical Methods in Natural Language Processing Conference, Vancouver, Canada, 2005, pp.411-418.
Mihalcea, Rada and Dan Moldovan, “An Iterative Approach to Word Sense Disambiguation”, in Proceedings of the Annual Meeting of the Association for Computational Linguistics, Maryland, USA, 1999, pp.152-158.
Miller, George and Walter Charles, “Contextual Correlates of Semantic Similarity”, in Language and Cognitive Processes, Vol (1), 1991, pp.1-28.
Miller, George et al., “Using a Semantic Concordance for Sense Identification”, in Proceedings of the Fourth ARPA Human Language Technology Workshop, 1994, pp.303-308.
Ide, Nancy and Jean Veronis, “Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art”, in Computational Linguistics, MIT Press, 1998, pp.1-40.
Ng, H.T. and Hian B. Lee, “Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach”, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, USA, 1996, pp.40-47.
Okumura, Manabu and Takeo Honda, “Word Sense Disambiguation and Text Segmentation based on Lexical Cohesion”, in Proceedings of the International Conference on Computational Linguistics, Kyoto, Japan, 1994, pp.755-761.
Redington, M., N. Chater & S. Finch, “Distributional information: a powerful cue for acquiring syntactic categories”, in Cognitive Science, Vol 22, 1998, pp.425-469.
Resnik, Philip, “Selection and Information: A Class-Based Approach to Lexical Relationships”, PhD thesis, University of Pennsylvania, 1993.
Saffran, J.R. et al., “Statistical Learning by 8-Month-Old Infants”, in Science, Vol 274, 1996, pp.1926-1928.
Farrar, Scott et al., “A Common Ontology for Linguistic Concepts”, in Proceedings of the Knowledge Technologies Conference, Seattle, WA, 2002.
McDonald, Scott and Michael Ramscar, “Testing the distributional hypothesis: The influence of context on judgements of semantic similarity”, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society, Edinburgh, Scotland, 2001.
Park, Seong-Bae et al., “Word Sense Disambiguation by Learning Decision Trees for Unlabeled Data”, in Applied Intelligence 19, 2003, pp.27-38.
Stevenson, Mark and Yorick Wilks, “The Interaction of Knowledge Sources in Word Sense Disambiguation”, in Computational Linguistics, Vol 27 (3), 2001, pp.321-349.
Gruber, Tom, “Ontology”, in Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (eds.), Springer-Verlag, 2009, pp.1963-1965.
Wu, Yunfang et al., “A Chinese Corpus with Word Sense Annotation”, in Proceedings of ICCPOL, Singapore, 2006, pp.414-421.
Freund, Yoav and Robert E. Schapire, “Experiments with a new boosting algorithm”, in Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp.148-156.

Books (in Chinese):
董大年主编，《现代汉语分类词典》，北京：汉语大词典出版社，1998。
符淮青，《现代汉语词汇》，北京：北京大学出版社，1985。
符淮青，《词义的分析和描写》，语文出版社，1996。
郭大方，《现代汉语动词分类词典》，长春：吉林教育出版社，1994。
胡明扬译，C. J. Fillmore 著，《格辩》，北京：商务印书馆，2002。
林杏光，《词汇语义和计算语言学》，北京：语文出版社，1999。
林杏光、菲白编，《简明汉语义类词典》，北京：商务印书馆，1987。
梅家驹，《同义词词林》，上海：上海辞书出版社，1983。
王安节、周殿龙，《形容词分类词典》，长春：吉林教育出版社，1993。
王惠，《现代汉语名词词义组合分析》，北京：北京大学出版社，2004。
许嘉璐、傅永和主编，《中文信息处理现代汉语词汇研究》，广东教育出版社，2006。
张志毅、张庆云，《词和词典》，中国广播电视出版社，1994。
张志毅、张庆云，《词汇语义学》，北京：商务印书馆，2001。

Articles (in Chinese):
柏晓鹏、林进展，，第届汉语词汇语义学研讨会，新加坡，2008，页 254-261。
陈群秀，，《语言文字应用》2001 年第期，页 14-20。
陈小荷，，《语言文字应用》1998 年第期，页 71-76。
程月、陈小荷等，，《中国计算技术与语言问题研究——第七届中文信息处理国际会议论文集》，2007。
贾玉祥、俞士汶，，《中文信息学报》2011 年第期，页 99-104。
苏新春、洪桂治、唐师瑶，，《世界汉语教学》第 24 卷第期，2010，页 158-169。
田久乐、赵蔚，，《吉林大学学报（信息科学版）》2010 年第期，页 602-608。
王惠，，in Computational Linguistics and Chinese Language Processing, 2002, Vol (2), pp.77-88。
王惠、詹卫东、俞士汶，，in The Journal of Chinese Language and Computing, 2003, Vol 13 (2), pp.159-176。
肖航，，新加坡：新加坡国立大学硕士论文，2008。
顏國偉、譚慧敏，，香港科技大學計算機科學系，新加坡南洋理工大學中華語言文化中心，1999，页 5-9。
