ACE05 Relation Extraction Dataset

ACE05 Natural Language Information Extraction Dataset

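Introduction

  • Dataset overview

    The dataset provides annotated entities, relations, and events of many types. At present it is mainly used for event extraction tasks.

    It includes data in Chinese, English, and Arabic.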

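Annotation notes

  • The annotation procedure is as follows
  1. First, two annotation passes, 1P and DUAL, are carried out; their results are stored in the fp1 and fp2 directories of the corresponding corpus, respectively
  2. The results of these two passes are then adjudicated, and the adjudicated annotations are stored in the adj directory of the corresponding corpus
  3. For the English corpus, the annotations in the adj directory go through one more processing step, and the results are stored in the timex2norm directory

The annotation stages and the content annotated at each stage are shown below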

    1P: entities        DUAL: entities
        values                values
        events                events
        relations             relations
            |                    |
            |                    |
            |_________?__________|
                      |
                      |
                      |
                      V
                 ADJ: entities
                      values
                      events
                      relations
                      |
                      |
                      |
                      V
                 NORM: TIMEX2 normalization 
                       (English only)
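
The flow above implies that the "final" annotations live in a different directory depending on the language. A minimal sketch of that lookup, assuming the corpus is unpacked under some local root directory; `annotation_dir` and the `corpus_root` argument are hypothetical names used only for illustration:

```python
from pathlib import Path


def annotation_dir(corpus_root: str, language: str, genre: str) -> Path:
    """Return the directory holding the last annotation stage for one genre."""
    # Adjudicated output is stored in adj/; only English gets the extra
    # TIMEX2 normalization pass, whose output is stored in timex2norm/.
    final_stage = "timex2norm" if language == "English" else "adj"
    return Path(corpus_root) / language / genre / final_stage


print(annotation_dir("ace05", "English", "bn").as_posix())
print(annotation_dir("ace05", "Chinese", "nw").as_posix())
```

Whether to read from adj or timex2norm for English depends on whether normalized time values are needed; the sketch simply picks the last stage.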

Directory structure

  • The directory structure is as follows

    ─Arabic              # Arabic corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─Chinese             # Chinese corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─dtd               # data description (DTD) files
    └─English           # English corpus
        ├─bc
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─bn
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─cts
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─nw
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─un
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        └─wl
            ├─adj
            ├─fp1
            ├─fp2
            └─timex2norm
    

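File walkthrough

  • Each document consists of the five kinds of files shown below

    Source Text (.sgm) Files
        - These are the source text files in SGM format; the .sgm files are UTF-8 encoded.
    ACE Program Format (APF) (.apf.xml) Files
        - These files use the ACE annotation file format.
    AG (.ag.xml) Files
        - These are the annotation files created with LDC's annotation tool; they are converted into the corresponding .apf.xml files.
    ID table (.tab) Files
        - These files store the mapping table between the IDs used in the ag.xml files and those in the corresponding apf.xml files.
    AIF (.aif.xml) Files
        - These are the annotation files created with MITRE's Callisto tool; they exist only for the Arabic data produced by Valorem.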
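
Since one document is spread over several files that share a base name, it can help to group them before processing. The following is a rough sketch that assumes the files of a document differ only in the suffixes listed above; `group_by_document` is a hypothetical helper, not part of any ACE tooling:

```python
from collections import defaultdict
from pathlib import Path

# The five suffixes described above.
SUFFIXES = (".apf.xml", ".ag.xml", ".aif.xml", ".sgm", ".tab")


def group_by_document(filenames):
    """Map each document ID to a {suffix: path} dict of its files."""
    docs = defaultdict(dict)
    for name in filenames:
        base = Path(name).name
        for suffix in SUFFIXES:
            if base.endswith(suffix):
                # The document ID is the file name without the suffix.
                docs[base[: -len(suffix)]][suffix] = name
                break
    return dict(docs)


files = [
    "English/bn/adj/CNN_ENG_20030630_085848.18.sgm",
    "English/bn/adj/CNN_ENG_20030630_085848.18.apf.xml",
    "English/bn/adj/CNN_ENG_20030630_085848.18.ag.xml",
    "English/bn/adj/CNN_ENG_20030630_085848.18.tab",
]
for doc_id, parts in group_by_document(files).items():
    print(doc_id, sorted(parts))
```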
    

The following walks through /English/bn/CNN_ENG_20030630_085848.18 as a concrete example

  • Contents of CNN_ENG_20030630_085848.18.sgm (for the meaning of tags such as <DOC>, see dtd/ace_source_sgml.v1.0.2.dtd)

    <DOC>
    <DOCID> CNN_ENG_20030630_085848.18 </DOCID>  # document ID (file name)
    <DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>  # document source
    <DATETIME> 2003-06-30 09:23:30 </DATETIME>  # timestamp
    <BODY>
    <TEXT>
    <TURN>  # the actual content
    a wildfire in california forced hundreds of people from their homes.
    the fire, near the historic state park started yesterday when a
    trailer, hauled by a pickup, ignited on the golden state freeway. the
    fire consumed more than 500 acres is only about 35% contained. no
    injuries have been reported thankfully hat this time.
    </TURN>
    </TEXT>
    </BODY>
    <ENDTIME> 2003-06-30 09:23:54 </ENDTIME>
    </DOC>
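
The START/END offsets that appear in the .apf.xml annotations seem to index into the .sgm text with all tags removed, newlines included. In this sample the DATETIME value is annotated at 44 to 62 and the first sentence starts at 68, which matches that reading. Below is a minimal sketch under that assumption; it uses a simple regex to drop tags and does no special handling of escaped characters, and `strip_tags` and `charseq` are hypothetical helper names. Note that the "#" comments in the listing above were added for explanation and are not part of the original file, so the offsets apply to the untouched .sgm.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(sgm_text: str) -> str:
    """Remove SGML tags but keep every other character, including newlines."""
    return TAG_RE.sub("", sgm_text)


def charseq(sgm_text: str, start: int, end: int) -> str:
    """Return the annotated span; END is treated as inclusive."""
    return strip_tags(sgm_text)[start : end + 1]


# Beginning of the sample document, without the explanatory comments.
sgm = (
    "<DOC>\n"
    "<DOCID> CNN_ENG_20030630_085848.18 </DOCID>\n"
    '<DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>\n'
    "<DATETIME> 2003-06-30 09:23:30 </DATETIME>\n"
    "<BODY>\n<TEXT>\n<TURN>\n"
    "a wildfire in california forced hundreds of people from their homes.\n"
)
print(charseq(sgm, 44, 62))
print(charseq(sgm, 100, 117))
```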
    
  • CNN_ENG_20030630_085848.18.apf.xml

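    The .apf.xml file is the text with its ACE annotations (entities, relations, events, and so on) rendered as XML. Its format is documented in dtd/ace_source_sgml.apf.v5.1.1.dtd.

    Here is how to read dtd/ace_source_sgml.apf.v5.1.1.dtd

    <!ATTLIST relation           # the relation tag has the following attributes
                                 ID       ID                        #REQUIRED
                                 									# REQUIRED means the attribute is mandatory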
                                 TYPE     (PHYS|PART-WHOLE|PER-SOC|ORG-AFF|
                                           ART|GEN-AFF|METONYMY)    #REQUIRED
                                 SUBTYPE  (Located|Near|Geographical| # second-level category
                                           Subsidiary|Artifact|Business|
                                           Family|Lasting-Personal|Employment|
                                           Ownership|Founder|Student-Alum|
                                           Sports-Affiliation|
                                           Investor-Shareholder|
                                           Membership|
                                           User-Owner-Inventor-Manufacturer|
                                           Citizen-Resident-Religion-Ethnicity|
                                           Org-Location)            #IMPLIED
                                 MODALITY (Asserted|Other)          #IMPLIED
                                 TENSE    (Past|Present|Future|		# tense
                                           Unspecified)             #IMPLIED
    >
    

    An example relation tag:

    <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
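
To make the ATTLIST reading concrete, here is a small sketch that checks a relation element against the enumerations shown above: #REQUIRED attributes must be present, while #IMPLIED ones may be absent but must take a listed value when given. `check_relation` is a hypothetical helper, and the long SUBTYPE list is left out to keep the example short:

```python
import xml.etree.ElementTree as ET

# Enumerations copied from the ATTLIST above (SUBTYPE omitted for brevity).
ALLOWED = {
    "TYPE": {"PHYS", "PART-WHOLE", "PER-SOC", "ORG-AFF", "ART", "GEN-AFF", "METONYMY"},
    "MODALITY": {"Asserted", "Other"},
    "TENSE": {"Past", "Present", "Future", "Unspecified"},
}
REQUIRED = ("ID", "TYPE")  # marked #REQUIRED; the others are #IMPLIED (optional)


def check_relation(elem):
    """Return a list of human-readable problems for one <relation> element."""
    problems = [
        f"missing required attribute {name}"
        for name in REQUIRED
        if name not in elem.attrib
    ]
    for name, allowed in ALLOWED.items():
        value = elem.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not in the DTD enumeration")
    return problems


good = ET.fromstring(
    '<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" '
    'SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" '
    'MODALITY="Asserted"/>'
)
print(check_relation(good))
```

A validating XML parser would do this against the real DTD; the hand-written check is only meant to show what the declarations express.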
    
  • Back to CNN_ENG_20030630_085848.18.apf.xml: the annotated elements include

    1. ENTITY

      <entity ID="CNN_ENG_20030630_085848.18-E2" TYPE="PER" SUBTYPE="Group" CLASS="USP">
        <entity_mention ID="CNN_ENG_20030630_085848.18-E2-2" TYPE="NOM" LDCTYPE="NOM">
          <extent>
            <charseq START="100" END="117">hundreds of people</charseq>
          </extent>
          <head>
            <charseq START="112" END="117">people</charseq>
          </head>
        </entity_mention>
        <entity_mention ID="CNN_ENG_20030630_085848.18-E2-3" TYPE="PRO" LDCTYPE="PRO">
          <extent>
            <charseq START="124" END="128">their</charseq>
          </extent>
          <head>
            <charseq START="124" END="128">their</charseq>
          </head>
        </entity_mention>
      </entity>
      <entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
        <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
          <head>
            <charseq START="130" END="134">homes</charseq>
          </head>
        </entity_mention>
      </entity>
      
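      • entity has four required attributes: ID, TYPE, SUBTYPE, and CLASS

      • The TYPE attribute of entity takes one of seven values: PER, ORG, LOC, GPE, FAC, VEH, and WEA. Each type has several subtypes; see dtd/ace_source_sgml.apf.v5.1.1.dtd for the full list.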

        TYPE="PER" SUBTYPE="Individual"
        TYPE="PER" SUBTYPE="Group"
        TYPE="PER" SUBTYPE="Indeterminate"
        
        TYPE="ORG" SUBTYPE="Government"
        ...
        
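      • entity_mention marks one individual mention of the entity in the text. It has two child tags, extent and head: extent is the full span of the mention, and head is the most important word within it. It carries attributes such as ID, TYPE, LDCTYPE, and ROLE.

      • entity can also contain external_link and entity_attributes, which are child elements rather than XML attributes. external_link records links from the entity to external resources, and entity_attributes holds further information about the entity, such as its names.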
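
The entity structure described above maps naturally onto a flat list of mentions. A minimal sketch using Python's standard xml.etree.ElementTree, assuming every entity_mention has both an extent and a head as in the sample; `iter_entity_mentions` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def iter_entity_mentions(root):
    """Yield one dict per entity_mention found under root."""
    for entity in root.iter("entity"):
        for mention in entity.iter("entity_mention"):
            extent = mention.find("extent/charseq")
            head = mention.find("head/charseq")
            yield {
                "entity_id": entity.get("ID"),
                "entity_type": entity.get("TYPE"),
                "entity_subtype": entity.get("SUBTYPE"),
                "mention_id": mention.get("ID"),
                "mention_type": mention.get("TYPE"),
                "extent": extent.text,
                "head": head.text,
                # Offsets are kept as (START, END) with END inclusive.
                "head_span": (int(head.get("START")), int(head.get("END"))),
            }


sample = """
<document>
  <entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
    <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
      <extent><charseq START="124" END="134">their homes</charseq></extent>
      <head><charseq START="130" END="134">homes</charseq></head>
    </entity_mention>
  </entity>
</document>
"""
rows = list(iter_entity_mentions(ET.fromstring(sample)))
print(rows[0]["entity_type"], rows[0]["head"], rows[0]["head_span"])
```

For a real file, `ET.parse(path).getroot()` would replace the inline sample string.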

    2. VALUE

      <value ID="CNN_ENG_20030630_085848.18-V1" TYPE="Numeric" SUBTYPE="Percent">
        <value_mention ID="CNN_ENG_20030630_085848.18-V1-1">
          <extent>
            <charseq START="319" END="320">35</charseq>
          </extent>
        </value_mention>
      </value>
      
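      • VALUE has three main attributes: ID, TYPE, and SUBTYPE. As the list below shows, some types (Crime, Job-Title, Sentence) have no SUBTYPE.

      • The TYPE attribute of VALUE takes one of five values: Numeric, Contact-Info, Crime, Job-Title, and Sentence. Where subtypes exist they are listed in dtd/ace_source_sgml.apf.v5.1.1.dtd.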

        TYPE="Numeric" SUBTYPE="Money"
        TYPE="Numeric" SUBTYPE="Percent"
        TYPE="Contact-Info" SUBTYPE="Phone-Number"
        TYPE="Contact-Info" SUBTYPE="E-Mail"
        TYPE="Contact-Info" SUBTYPE="URL"
        
        TYPE="Crime"
        TYPE="Job-Title"
        TYPE="Sentence"
        
      • The value_mention tag is similar to the entity_mention tag above, although in this example it contains only an extent child tag and no head

    3. timex2

      <timex2 ID="CNN_ENG_20030630_085848.18-T1" VAL="2003-06-30T09:23:30">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T1-1">
          <extent>
            <charseq START="44" END="62">2003-06-30 09:23:30</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      <timex2 ID="CNN_ENG_20030630_085848.18-T2" VAL="2003-06-29">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T2-1">
          <extent>
            <charseq START="184" END="192">yesterday</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      <timex2 ID="CNN_ENG_20030630_085848.18-T3" VAL="2003-06-30TMO">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T3-1">
          <extent>
            <charseq START="380" END="388">this time</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      
      • Optional attributes of timex2 include VAL (the time in normalized form)

      • timex2_mention works the same way as the mention tags above
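
In all the samples so far the END offset is inclusive: "yesterday" has 9 characters and spans 184 to 192. A quick sanity check along those lines is sketched below. It is an illustration only; in real files a charseq may contain line breaks or escaped characters, so a mismatch is a hint to look closer rather than proof of an error. `inconsistent_charseqs` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def inconsistent_charseqs(root):
    """Return (START, END, text) for each charseq whose length disagrees with its offsets."""
    bad = []
    for cs in root.iter("charseq"):
        start, end = int(cs.get("START")), int(cs.get("END"))
        text = cs.text or ""
        # With an inclusive END, the span covers end - start + 1 characters.
        if end - start + 1 != len(text):
            bad.append((start, end, text))
    return bad


timex_sample = ET.fromstring(
    "<doc>"
    '<charseq START="44" END="62">2003-06-30 09:23:30</charseq>'
    '<charseq START="184" END="192">yesterday</charseq>'
    '<charseq START="380" END="388">this time</charseq>'
    "</doc>"
)
print(inconsistent_charseqs(timex_sample))
```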

    4. RELATION

      <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
        <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
        <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
        <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
          <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
            <extent>
              <charseq START="124" END="128">their</charseq>
            </extent>
          </relation_mention_argument>
          <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
            <extent>
              <charseq START="124" END="134">their homes</charseq>
            </extent>
          </relation_mention_argument>
        </relation_mention>
      </relation>
      
      • The TYPE attribute of relation gives the relation between the two arguments whose ROLE is 'Arg-1' and 'Arg-2'. The main relation types include

        <!-- List of TYPE/SUBTYPE pairs (as of May 7, 2005)
        
        TYPE="PHYS" SUBTYPE="Located"
        TYPE="PHYS" SUBTYPE="Near"
        
        TYPE="PART-WHOLE" SUBTYPE="Geographical"
        TYPE="PART-WHOLE" SUBTYPE="Subsidiary"
        TYPE="PART-WHOLE" SUBTYPE="Artifact"
        ...
        TYPE="METONYMY" (no SUBTYPE)
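
For relation extraction experiments, each relation is often flattened into an (Arg-1, TYPE, SUBTYPE, Arg-2) tuple at the entity level. A rough sketch of that step, assuming the two arguments are given as relation_argument children with ROLE "Arg-1" and "Arg-2" as in the sample above; `relation_triples` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def relation_triples(root):
    """Return (arg1_id, type, subtype, arg2_id) for every relation under root."""
    triples = []
    for rel in root.iter("relation"):
        # Index the entity-level arguments by their ROLE.
        args = {a.get("ROLE"): a.get("REFID") for a in rel.findall("relation_argument")}
        if "Arg-1" in args and "Arg-2" in args:
            triples.append(
                (args["Arg-1"], rel.get("TYPE"), rel.get("SUBTYPE"), args["Arg-2"])
            )
    return triples


relation_sample = ET.fromstring(
    "<document>"
    '<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" '
    'SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">'
    '<relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>'
    '<relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>'
    "</relation>"
    "</document>"
)
print(relation_triples(relation_sample))
```

SUBTYPE comes back as None for relations that have none, such as METONYMY.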
        
    5. EVENT

      <event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
        <event_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Artifact"/>
        <event_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Origin"/>
        <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
          <extent>
            <charseq START="93" END="134">forced hundreds of people from their homes</charseq>
          </extent>
          <ldc_scope>
            <charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq>
          </ldc_scope>
          <anchor>
            <charseq START="93" END="98">forced</charseq>
          </anchor>
          <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
            <extent>
              <charseq START="100" END="117">hundreds of people</charseq>
            </extent>
          </event_mention_argument>
          <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
            <extent>
              <charseq START="124" END="134">their homes</charseq>
            </extent>
          </event_mention_argument>
        </event_mention>
      </event>
      
      • The TYPE and SUBTYPE values of event are as follows

        TYPE="Life" SUBTYPE="Be-Born"
        TYPE="Life" SUBTYPE="Die"
        TYPE="Life" SUBTYPE="Marry"
        TYPE="Life" SUBTYPE="Divorce"
        TYPE="Life" SUBTYPE="Injure"
        TYPE="Transaction" SUBTYPE="Transfer-Ownership"
        TYPE="Transaction" SUBTYPE="Transfer-Money"
        TYPE="Movement" SUBTYPE="Transport"
        TYPE="Business" SUBTYPE="Start-Org"
        TYPE="Business" SUBTYPE="End-Org"
        ...
        TYPE="Justice" SUBTYPE="Pardon"
        TYPE="Justice" SUBTYPE="Appeal"
        
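      • event has six required attributes: TYPE, SUBTYPE, MODALITY, POLARITY, GENERICITY, and TENSE

      • Its child tags are event_argument and event_mention

      • event_mention contains the child tags extent, ldc_scope, anchor, and event_mention_argument; ldc_scope covers the whole sentence, and anchor is the event trigger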
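
Putting the event structure together, the trigger and arguments of each event mention can be pulled out as follows. This is a minimal sketch that assumes every event_mention has an anchor, as in the sample above; `iter_event_mentions` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def iter_event_mentions(root):
    """Yield one dict per event_mention with its trigger and arguments."""
    for event in root.iter("event"):
        for mention in event.iter("event_mention"):
            anchor = mention.find("anchor/charseq")
            yield {
                "event_type": event.get("TYPE"),
                "event_subtype": event.get("SUBTYPE"),
                "trigger": anchor.text,
                "trigger_span": (int(anchor.get("START")), int(anchor.get("END"))),
                # Each argument is a (ROLE, text) pair taken from its extent.
                "arguments": [
                    (arg.get("ROLE"), arg.findtext("extent/charseq"))
                    for arg in mention.findall("event_mention_argument")
                ],
            }


event_sample = ET.fromstring(
    "<document>"
    '<event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" '
    'MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">'
    '<event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">'
    '<anchor><charseq START="93" END="98">forced</charseq></anchor>'
    '<event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">'
    '<extent><charseq START="100" END="117">hundreds of people</charseq></extent>'
    "</event_mention_argument>"
    '<event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">'
    '<extent><charseq START="124" END="134">their homes</charseq></extent>'
    "</event_mention_argument>"
    "</event_mention>"
    "</event>"
    "</document>"
)
events = list(iter_event_mentions(event_sample))
print(events[0]["trigger"], events[0]["arguments"])
```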

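This post draws on the article at https://blog.csdn.net/carrie_0307/article/details/91417203


Copyright notice: this is an original article by L_x_4, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.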