ACE05: A Natural Language Information Extraction Dataset
Introduction
Dataset Overview
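ACE05 provides annotated entities, relations, and events of multiple types. It is currently used mainly for event extraction.
The dataset covers Chinese, English, and Arabic.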
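Annotation Notes
- The annotation process is as follows:
  - First, two annotation passes, 1P and DUAL, are carried out; their results are stored in the fp1 and fp2 directories of the corresponding corpus.
  - The results of these two passes are then adjudicated, and the adjudicated annotations are stored in the adj directory of the corresponding corpus.
  - For the English corpus, the annotations in adj go through one more processing step, and the result is stored in the timex2norm directory.
The annotation stages and the content annotated at each stage are shown below: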
      1P: entities            DUAL: entities
          values                    values
          events                    events
          relations                 relations
             |                         |
             |                         |
             |_________________________|
                          |
                          |
                          |
                          V
                ADJ: entities
                     values
                     events
                     relations
                          |
                          |
                          |
                          V
               NORM: TIMEX2 normalization
                     (English only)
Directory Structure
The directory layout is as follows:
    ├─Arabic   # Arabic corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─Chinese  # Chinese corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─dtd      # format description files
    └─English  # English corpus
        ├─bc
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─bn
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─cts
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─nw
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─un
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        └─wl
            ├─adj
            ├─fp1
            ├─fp2
            └─timex2norm
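As a rough illustration of how the layout and the annotation stages fit together, the following Python sketch picks the directory that holds the last annotation stage for a given language and genre. The root name `ace05` and the helper `final_annotation_dir` are hypothetical; the genre lists are copied from the tree above, and the Arabic altAdj directories are ignored here.

```python
from pathlib import PurePosixPath

# Genres per language, copied from the directory tree above.
GENRES = {
    "Arabic": ("bn", "nw", "wl"),
    "Chinese": ("bn", "nw", "wl"),
    "English": ("bc", "bn", "cts", "nw", "un", "wl"),
}


def final_annotation_dir(root: str, language: str, genre: str) -> PurePosixPath:
    """Return the directory holding the last annotation stage (hypothetical helper)."""
    if genre not in GENRES.get(language, ()):
        raise ValueError(f"unknown language/genre: {language}/{genre}")
    # Only English has the extra TIMEX2 normalization step after adjudication.
    stage = "timex2norm" if language == "English" else "adj"
    return PurePosixPath(root) / language / genre / stage


# "ace05" is an assumed root directory name, not part of the dataset.
print(final_annotation_dir("ace05", "English", "bn"))  # ace05/English/bn/timex2norm
print(final_annotation_dir("ace05", "Chinese", "wl"))  # ace05/Chinese/wl/adj
```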
File Walkthrough
Each document comes with the five kinds of files listed below:
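- Source Text (.sgm) files: the source text in SGM format, UTF-8 encoded.
- ACE Program Format (APF) (.apf.xml) files: annotations in the ACE annotation file format.
- AG (.ag.xml) files: annotation files created with LDC's annotation tool; they are converted into the corresponding .apf.xml files.
- ID table (.tab) files: mapping tables between the IDs used in the ag.xml files and those in the corresponding apf.xml files.
- AIF (.aif.xml) files: annotation files created with MITRE's Callisto; they apply only to the Arabic data produced by Valorem.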
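To make the five file kinds concrete, here is a minimal Python sketch that groups a flat list of file names by document. The function name `group_by_document` and the short kind labels are made up for illustration; only the suffixes come from the list above.

```python
from collections import defaultdict

# Suffix -> short label. The labels are illustrative, not part of ACE05.
SUFFIXES = {
    ".sgm": "source",
    ".apf.xml": "apf",
    ".ag.xml": "ag",
    ".tab": "id_table",
    ".aif.xml": "aif",
}


def group_by_document(filenames):
    """Map each document ID to the files that belong to it, keyed by kind."""
    docs = defaultdict(dict)
    for name in filenames:
        for suffix, kind in SUFFIXES.items():
            if name.endswith(suffix):
                doc_id = name[: -len(suffix)]
                docs[doc_id][kind] = name
                break  # at most one suffix can match a given file name
    return dict(docs)


files = [
    "CNN_ENG_20030630_085848.18.sgm",
    "CNN_ENG_20030630_085848.18.apf.xml",
    "CNN_ENG_20030630_085848.18.ag.xml",
    "CNN_ENG_20030630_085848.18.tab",
]
grouped = group_by_document(files)
print(sorted(grouped["CNN_ENG_20030630_085848.18"]))
```

Files with unknown suffixes are silently skipped, which is a design choice of this sketch rather than anything the dataset prescribes.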
The rest of this section uses /English/bn/CNN_ENG_20030630_085848.18 as a worked example.
Contents of CNN_ENG_20030630_085848.18.sgm (for the meaning of tags such as <DOC>, see dtd/ace_source_sgml.v1.0.2.dtd):
    <DOC>
    <DOCID> CNN_ENG_20030630_085848.18 </DOCID>    # document name
    <DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>    # document source
    <DATETIME> 2003-06-30 09:23:30 </DATETIME>    # time
    <BODY>
    <TEXT>
    <TURN>    # the actual content
    a wildfire in california forced hundreds of people from their homes. the fire, near the historic state park started yesterday when a trailer, hauled by a pickup, ignited on the golden state freeway. the fire consumed more than 500 acres is only about 35% contained. no injuries have been reported thankfully hat this time.
    </TURN>
    </TEXT>
    </BODY>
    <ENDTIME> 2003-06-30 09:23:54 </ENDTIME>
    </DOC>

CNN_ENG_20030630_085848.18.apf.xml
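The .apf.xml file presents the text in XML after ACE annotation of entities, relations, events, and other elements (the format of .apf.xml files is documented in dtd/ace_source_sgml.apf.v5.1.1.dtd).
A quick note on how to read dtd/ace_source_sgml.apf.v5.1.1.dtd: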
    <!ATTLIST relation                             # the relation tag has the following attributes
        ID        ID                               #REQUIRED   # REQUIRED means the attribute is mandatory
        TYPE      (PHYS|PART-WHOLE|PER-SOC|ORG-AFF|
                   ART|GEN-AFF|METONYMY)           #REQUIRED
        SUBTYPE   (Located|Near|Geographical|      # second-level category
                   Subsidiary|Artifact|Business|
                   Family|Lasting-Personal|Employment|
                   Ownership|Founder|Student-Alum|
                   Sports-Affiliation|
                   Investor-Shareholder|
                   Membership|
                   User-Owner-Inventor-Manufacturer|
                   Citizen-Resident-Religion-Ethnicity|
                   Org-Location)                   #IMPLIED
        MODALITY  (Asserted|Other)                 #IMPLIED
        TENSE     (Past|Present|Future|            # tense
                   Unspecified)                    #IMPLIED
    >

The corresponding relation tag:
    <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">

Returning to CNN_ENG_20030630_085848.18.apf.xml, the annotated elements include:
ENTITY
    <entity ID="CNN_ENG_20030630_085848.18-E2" TYPE="PER" SUBTYPE="Group" CLASS="USP">
      <entity_mention ID="CNN_ENG_20030630_085848.18-E2-2" TYPE="NOM" LDCTYPE="NOM">
        <extent>
          <charseq START="100" END="117">hundreds of people</charseq>
        </extent>
        <head>
          <charseq START="112" END="117">people</charseq>
        </head>
      </entity_mention>
      <entity_mention ID="CNN_ENG_20030630_085848.18-E2-3" TYPE="PRO" LDCTYPE="PRO">
        <extent>
          <charseq START="124" END="128">their</charseq>
        </extent>
        <head>
          <charseq START="124" END="128">their</charseq>
        </head>
      </entity_mention>
    </entity>
    <entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
      <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
        <extent>
          <charseq START="124" END="134">their homes</charseq>
        </extent>
        <head>
          <charseq START="130" END="134">homes</charseq>
        </head>
      </entity_mention>
    </entity>

entity has four required attributes: ID, TYPE, SUBTYPE, and CLASS.
The TYPE attribute of entity takes one of seven values: PER, ORG, LOC, GPE, FAC, VEH, and WEA. Each type has several subtypes; see dtd/ace_source_sgml.apf.v5.1.1.dtd for the full list:
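    TYPE="PER" SUBTYPE="Individual"
    TYPE="PER" SUBTYPE="Group"
    TYPE="PER" SUBTYPE="Indeterminate"
    TYPE="ORG" SUBTYPE="Government"
    ...

entity_mention describes one individual mention of the entity. It has two child tags, extent and head: extent is the full span of the mention, and head is the most important word within it. It also carries a set of attributes such as ID, TYPE, LDCTYPE, and ROLE.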
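entity can also contain external_link and entity_attributes (both appear as child elements rather than XML attributes). external_link records external links associated with the entity, and entity_attributes is meant for new terms that may be added to the lexicon later.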
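The entity structure described above can be read with Python's standard `xml.etree.ElementTree`. This is a minimal sketch on a trimmed copy of the first entity from the example; the helper names `read_span` and `entity_mentions` are made up here. In this example END is inclusive, since 117 - 100 + 1 equals the length of "hundreds of people".

```python
import xml.etree.ElementTree as ET

# Trimmed copy of entity E2 from the example document (first mention only).
ENTITY_XML = """
<entity ID="CNN_ENG_20030630_085848.18-E2" TYPE="PER" SUBTYPE="Group" CLASS="USP">
  <entity_mention ID="CNN_ENG_20030630_085848.18-E2-2" TYPE="NOM" LDCTYPE="NOM">
    <extent><charseq START="100" END="117">hundreds of people</charseq></extent>
    <head><charseq START="112" END="117">people</charseq></head>
  </entity_mention>
</entity>
""".strip()


def read_span(parent, tag):
    """Return (start, end, text) for the charseq under the given child tag."""
    charseq = parent.find(f"{tag}/charseq")
    return int(charseq.get("START")), int(charseq.get("END")), charseq.text


def entity_mentions(entity):
    """Flatten one <entity> element into a list of mention records."""
    rows = []
    for mention in entity.findall("entity_mention"):
        rows.append({
            "mention_id": mention.get("ID"),
            "entity_type": entity.get("TYPE"),
            "mention_type": mention.get("TYPE"),
            "extent": read_span(mention, "extent"),
            "head": read_span(mention, "head"),
        })
    return rows


mentions = entity_mentions(ET.fromstring(ENTITY_XML))
for row in mentions:
    start, end, text = row["extent"]
    # END is inclusive in this example, so the span length is end - start + 1.
    assert end - start + 1 == len(text)
    print(row["mention_id"], row["entity_type"], row["extent"], row["head"])
```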
VALUE
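    <value ID="CNN_ENG_20030630_085848.18-V1" TYPE="Numeric" SUBTYPE="Percent">
      <value_mention ID="CNN_ENG_20030630_085848.18-V1-1">
        <extent>
          <charseq START="319" END="320">35</charseq>
        </extent>
      </value_mention>
    </value>

value carries three attributes: ID, TYPE, and SUBTYPE. ID and TYPE are always present; SUBTYPE applies only to types that have subtypes (Crime, Job-Title, and Sentence have none).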
The TYPE of a value takes one of five values: Numeric, Contact-Info, Crime, Job-Title, and Sentence. Some of these types have subtypes; see dtd/ace_source_sgml.apf.v5.1.1.dtd for details:
    TYPE="Numeric" SUBTYPE="Money"
    TYPE="Numeric" SUBTYPE="Percent"
    TYPE="Contact-Info" SUBTYPE="Phone-Number"
    TYPE="Contact-Info" SUBTYPE="E-Mail"
    TYPE="Contact-Info" SUBTYPE="URL"
    TYPE="Crime"
    TYPE="Job-Title"
    TYPE="Sentence"

The value_mention tag is similar to the entity_mention tag described above, although in the example it contains only an extent child and no head.
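The TYPE/SUBTYPE pairs listed above can be written down as a small lookup table. The Python sketch below is only an illustration: the name `is_valid_value` is hypothetical, and treating a missing SUBTYPE as invalid for Numeric and Contact-Info is an assumption of this sketch, not something stated by the DTD excerpt.

```python
# TYPE -> allowed SUBTYPEs, copied from the list above.
# An empty set means the type has no subtype.
VALUE_SUBTYPES = {
    "Numeric": {"Money", "Percent"},
    "Contact-Info": {"Phone-Number", "E-Mail", "URL"},
    "Crime": set(),
    "Job-Title": set(),
    "Sentence": set(),
}


def is_valid_value(value_type, subtype=None):
    """Check a (TYPE, SUBTYPE) pair against the table above."""
    if value_type not in VALUE_SUBTYPES:
        return False
    allowed = VALUE_SUBTYPES[value_type]
    if not allowed:
        # Types without subtypes must not carry one.
        return subtype is None
    # Assumption: types that have subtypes always specify one.
    return subtype in allowed


print(is_valid_value("Numeric", "Percent"))  # True
print(is_valid_value("Crime"))               # True
print(is_valid_value("Crime", "Money"))      # False
```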
TIMEX2
    <timex2 ID="CNN_ENG_20030630_085848.18-T1" VAL="2003-06-30T09:23:30">
      <timex2_mention ID="CNN_ENG_20030630_085848.18-T1-1">
        <extent>
          <charseq START="44" END="62">2003-06-30 09:23:30</charseq>
        </extent>
      </timex2_mention>
    </timex2>
    <timex2 ID="CNN_ENG_20030630_085848.18-T2" VAL="2003-06-29">
      <timex2_mention ID="CNN_ENG_20030630_085848.18-T2-1">
        <extent>
          <charseq START="184" END="192">yesterday</charseq>
        </extent>
      </timex2_mention>
    </timex2>
    <timex2 ID="CNN_ENG_20030630_085848.18-T3" VAL="2003-06-30TMO">
      <timex2_mention ID="CNN_ENG_20030630_085848.18-T3-1">
        <extent>
          <charseq START="380" END="388">this time</charseq>
        </extent>
      </timex2_mention>
    </timex2>

The optional attributes of timex2 include VAL (the time in normalized form).
timex2_mention works the same way as the mention tags above.
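The VAL strings in the example are not all plain ISO timestamps (one ends in TMO), so a parser has to be tolerant. The Python sketch below only recovers a leading YYYY-MM-DD date when one is present and returns None otherwise. The helper name `val_to_date` is hypothetical, and this is not a full TIMEX2 parser.

```python
import re
from datetime import date

# Matches a calendar date at the very start of a VAL string.
_DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def val_to_date(val):
    """Return the leading calendar date of a VAL string, or None if absent."""
    match = _DATE_PREFIX.match(val)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# The three VAL strings from the example above.
for val in ("2003-06-30T09:23:30", "2003-06-29", "2003-06-30TMO"):
    print(val, "->", val_to_date(val))
```

Everything after the date (the clock time or a suffix such as TMO) is dropped here; keeping it would need a fuller treatment of the TIMEX2 value format.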
RELATION
    <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
      <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
      <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
      <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
        <extent>
          <charseq START="124" END="134">their homes</charseq>
        </extent>
        <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
          <extent>
            <charseq START="124" END="128">their</charseq>
          </extent>
        </relation_mention_argument>
        <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
        </relation_mention_argument>
      </relation_mention>
    </relation>

The TYPE attribute of relation names the relation that holds between the two arguments marked ROLE="Arg-1" and ROLE="Arg-2". The main relation types are:
    <!-- List of TYPE/SUBTYPE pairs (as of May 7, 2005)
         TYPE="PHYS" SUBTYPE="Located"
         TYPE="PHYS" SUBTYPE="Near"
         TYPE="PART-WHOLE" SUBTYPE="Geographical"
         TYPE="PART-WHOLE" SUBTYPE="Subsidiary"
         TYPE="PART-WHOLE" SUBTYPE="Artifact"
         ...
         TYPE="METONYMY" (no SUBTYPE)
    -->
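For relation extraction, a relation is often reduced to an (Arg-1, label, Arg-2) triple. The following Python sketch does that for a trimmed copy of the example relation. The function name `relation_triples` and the `TYPE:SUBTYPE` label format are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of relation R1 from the example document.
RELATION_XML = """
<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
  <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
    <extent><charseq START="124" END="134">their homes</charseq></extent>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
      <extent><charseq START="124" END="128">their</charseq></extent>
    </relation_mention_argument>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
      <extent><charseq START="124" END="134">their homes</charseq></extent>
    </relation_mention_argument>
  </relation_mention>
</relation>
""".strip()


def relation_triples(relation):
    """Return one (arg1_text, label, arg2_text) triple per relation_mention."""
    subtype = relation.get("SUBTYPE")  # absent for METONYMY
    label = relation.get("TYPE") if subtype is None else f"{relation.get('TYPE')}:{subtype}"
    triples = []
    for mention in relation.findall("relation_mention"):
        args = {
            arg.get("ROLE"): arg.findtext("extent/charseq")
            for arg in mention.findall("relation_mention_argument")
        }
        triples.append((args.get("Arg-1"), label, args.get("Arg-2")))
    return triples


triples = relation_triples(ET.fromstring(RELATION_XML))
print(triples)
```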
EVENT
    <event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
      <event_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Artifact"/>
      <event_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Origin"/>
      <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
        <extent>
          <charseq START="93" END="134">forced hundreds of people from their homes</charseq>
        </extent>
        <ldc_scope>
          <charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq>
        </ldc_scope>
        <anchor>
          <charseq START="93" END="98">forced</charseq>
        </anchor>
        <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
          <extent>
            <charseq START="100" END="117">hundreds of people</charseq>
          </extent>
        </event_mention_argument>
        <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
        </event_mention_argument>
      </event_mention>
    </event>

The TYPE and SUBTYPE values of event are as follows:
    TYPE="Life" SUBTYPE="Be-Born"
    TYPE="Life" SUBTYPE="Die"
    TYPE="Life" SUBTYPE="Marry"
    TYPE="Life" SUBTYPE="Divorce"
    TYPE="Life" SUBTYPE="Injure"
    TYPE="Transaction" SUBTYPE="Transfer-Ownership"
    TYPE="Transaction" SUBTYPE="Transfer-Money"
    TYPE="Movement" SUBTYPE="Transport"
    TYPE="Business" SUBTYPE="Start-Org"
    TYPE="Business" SUBTYPE="End-Org"
    ...
    TYPE="Justice" SUBTYPE="Pardon"
    TYPE="Justice" SUBTYPE="Appeal"

event has six required attributes: TYPE, SUBTYPE, MODALITY, POLARITY, GENERICITY, and TENSE.
Its child tags are event_argument and event_mention.
event_mention contains the child tags extent, ldc_scope, anchor, and event_mention_argument. ldc_scope covers the whole sentence, and anchor is the event trigger.
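Since the dataset is mostly used for event extraction, the pieces that usually matter are the trigger (anchor), the event type, and the role-labelled arguments. The Python sketch below pulls these out of a trimmed copy of the example event. The function name `event_records` and the `TYPE.SUBTYPE` label format are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of event EV1 from the example document.
EVENT_XML = """
<event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
  <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
    <ldc_scope><charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq></ldc_scope>
    <anchor><charseq START="93" END="98">forced</charseq></anchor>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
      <extent><charseq START="100" END="117">hundreds of people</charseq></extent>
    </event_mention_argument>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
      <extent><charseq START="124" END="134">their homes</charseq></extent>
    </event_mention_argument>
  </event_mention>
</event>
""".strip()


def event_records(event):
    """Return one record per event_mention: type, trigger, sentence, arguments."""
    event_type = f"{event.get('TYPE')}.{event.get('SUBTYPE')}"
    records = []
    for mention in event.findall("event_mention"):
        records.append({
            "event_type": event_type,
            "trigger": mention.findtext("anchor/charseq"),
            "sentence": mention.findtext("ldc_scope/charseq"),
            "arguments": [
                (arg.get("ROLE"), arg.findtext("extent/charseq"))
                for arg in mention.findall("event_mention_argument")
            ],
        })
    return records


records = event_records(ET.fromstring(EVENT_XML))
for record in records:
    print(record["event_type"], record["trigger"], record["arguments"])
```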
Reference: the article at https://blog.csdn.net/carrie_0307/article/details/91417203