Decoding UIE Entity Extraction

UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks in a single unified way, which gives the tasks good transfer and generalization ability across one another.

Code: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie

doccano is an open-source text-annotation tool. Its exported format looks like this (one JSON object per line; `end_offset` is exclusive):

#doccano_ext.json
{"id": 1, "text": "昨天晚上十點加班打車回家58元", "relations": [], "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "時間"}, {"id": 1, "start_offset": 11, "end_offset": 12, "label": "目的地"}, {"id": 2, "start_offset": 12, "end_offset": 14, "label": "費用"}]}
{"id": 2, "text": "三月三號早上12點46加班,到公司54", "relations": [], "entities": [{"id": 3, "start_offset": 0, "end_offset": 11, "label": "時間"}, {"id": 4, "start_offset": 15, "end_offset": 17, "label": "目的地"}, {"id": 5, "start_offset": 17, "end_offset": 19, "label": "費用"}]}
{"id": 3, "text": "8月31號十一點零四工作加班五十塊錢", "relations": [], "entities": [{"id": 6, "start_offset": 0, "end_offset": 10, "label": "時間"}, {"id": 7, "start_offset": 14, "end_offset": 16, "label": "費用"}]}
{"id": 4, "text": "5月17號晚上10點35分加班打車回家,36塊五", "relations": [], "entities": [{"id": 8, "start_offset": 0, "end_offset": 13, "label": "時間"}, {"id": 1, "start_offset": 18, "end_offset": 19, "label": "目的地"}, {"id": 9, "start_offset": 20, "end_offset": 24, "label": "費用"}]}
{"id": 5, "text": "2009年1月份通訊費一百元", "relations": [], "entities": [{"id": 10, "start_offset": 0, "end_offset": 7, "label": "時間"}, {"id": 11, "start_offset": 11, "end_offset": 13, "label": "費用"}]}
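Since each line is a standalone JSON object and `end_offset` is exclusive, a plain slice recovers every labeled span. The helper below is not part of the repo; it is a small sketch for inspecting such an export:

```python
import json

def load_doccano_ext(path):
    """Read a doccano JSONL export and yield (text, label, span_text) triples.

    end_offset is exclusive, so text[start_offset:end_offset] is the span.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for ent in record["entities"]:
                span = record["text"][ent["start_offset"]:ent["end_offset"]]
                yield record["text"], ent["label"], span
```

Running it over the file above should yield, for the first record, the span 昨天晚上十點 with label 時間.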
The export is then converted with doccano.py (here: no negative samples, 0.8/0.2/0 train/dev/test split):

python doccano.py \
    --doccano_file ./data/doccano_ext.json \
    --task_type ext \
    --save_dir ./data \
    --negative_ratio 0 \
    --splits 0.8 0.2 0

After parsing, the resulting train.txt looks like this (one example per prompt; start/end are character offsets into content):

{"content": "6月2日交通費123元", "result_list": [{"text": "6月2日", "start": 0, "end": 4}], "prompt": "時間"}
{"content": "上海虹橋高鐵到杭州時間是9月24日費用是73元", "result_list": [{"text": "上海虹橋", "start": 0, "end": 4}], "prompt": "出發(fā)地"}
{"content": "從北京飛往上海出差飛機票費150元", "result_list": [{"text": "上海", "start": 5, "end": 7}], "prompt": "目的地"}
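The core of this conversion is grouping a record's entities by label, so that each label becomes one prompt with its matching spans as `result_list`. A simplified, hypothetical sketch of that positive-sample step (the real doccano.py also generates negative samples according to `--negative_ratio` and handles the splits):

```python
def doccano_to_uie(record):
    """Simplified sketch: group a doccano record's entities by label and
    emit one UIE training example per label (prompt)."""
    by_label = {}
    for ent in record["entities"]:
        span = {"text": record["text"][ent["start_offset"]:ent["end_offset"]],
                "start": ent["start_offset"],
                "end": ent["end_offset"]}
        by_label.setdefault(ent["label"], []).append(span)
    return [{"content": record["text"], "result_list": spans, "prompt": label}
            for label, spans in by_label.items()]
```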

The train.txt examples are then converted into input_ids:

import numpy as np  # used to pack the outputs below

def convert_example(example, tokenizer, max_seq_len):
    # Prompt learning: encode the prompt and the content as a sentence pair,
    # i.e. [CLS] prompt [SEP] content [SEP]
    encoded_inputs = tokenizer(text=[example["prompt"]],
                               text_pair=[example["content"]],
                               truncation=True,
                               max_seq_len=max_seq_len,
                               pad_to_max_seq_len=True,
                               return_attention_mask=True,
                               return_position_ids=True,
                               return_dict=False,
                               return_offsets_mapping=True)
    encoded_inputs = encoded_inputs[0]
    # offset_mapping maps each token back to its character span in the raw
    # text; shift the content's offsets so they come after the prompt tokens
    offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
    bias = 0
    for index in range(1, len(offset_mapping)):
        mapping = offset_mapping[index]
        if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
            bias = offset_mapping[index - 1][1] + 1  # Includes [SEP] token
        if mapping[0] == 0 and mapping[1] == 0:
            continue
        offset_mapping[index][0] += bias
        offset_mapping[index][1] += bias
    start_ids = [0 for x in range(max_seq_len)]
    end_ids = [0 for x in range(max_seq_len)]
    for item in example["result_list"]:
        # map_offset (from the repo's utils.py) converts a character offset
        # into the index of the token that covers it
        start = map_offset(item["start"] + bias, offset_mapping)
        end = map_offset(item["end"] - 1 + bias, offset_mapping)
        # start and end are token positions in input_ids
        start_ids[start] = 1.0
        end_ids[end] = 1.0

    tokenized_output = [
        encoded_inputs["input_ids"], encoded_inputs["token_type_ids"],
        encoded_inputs["position_ids"], encoded_inputs["attention_mask"],
        start_ids, end_ids
    ]
    tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output]
    return tuple(tokenized_output)
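convert_example relies on map_offset to turn a character offset into a token index. The repo defines it in utils.py; a minimal version consistent with its usage above:

```python
def map_offset(ori_offset, offset_mapping):
    """Return the index of the token whose character span covers ori_offset,
    or -1 if no token does (e.g. the offset falls on special/padding tokens,
    which carry the span [0, 0])."""
    for index, span in enumerate(offset_mapping):
        if span[0] <= ori_offset < span[1]:
            return index
    return -1
```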

Model:

class UIE(ErniePretrainedModel):

    def __init__(self, encoding_model):
        super(UIE, self).__init__()
        self.encoder = encoding_model
        hidden_size = self.encoder.config["hidden_size"]
        self.linear_start = paddle.nn.Linear(hidden_size, 1)
        # Two binary-classification heads: per-token start and end probabilities
        self.linear_end = paddle.nn.Linear(hidden_size, 1)
        self.sigmoid = paddle.nn.Sigmoid()

    def forward(self, input_ids, token_type_ids, pos_ids, att_mask):
        sequence_output, pooled_output = self.encoder(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=pos_ids,
            attention_mask=att_mask)
        start_logits = self.linear_start(sequence_output)
        start_logits = paddle.squeeze(start_logits, -1)
        start_prob = self.sigmoid(start_logits)
        end_logits = self.linear_end(sequence_output)
        end_logits = paddle.squeeze(end_logits, -1)
        end_prob = self.sigmoid(end_logits)
        return start_prob, end_prob
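At inference time the two probability vectors are decoded into spans. A framework-free, simplified sketch of the idea (the repo's utils.py implements this more carefully, handling overlapping candidates): threshold each vector, then pair every start position with the first end position at or after it.

```python
def decode_spans(start_prob, end_prob, threshold=0.5):
    """Greedy span decoding from per-token start/end probabilities.

    Returns (start, end) token-index pairs, end inclusive. A simplified
    sketch, not the repo's exact algorithm.
    """
    starts = [i for i, p in enumerate(start_prob) if p > threshold]
    ends = [i for i, p in enumerate(end_prob) if p > threshold]
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, candidates[0]))
    return spans
```

For example, start probabilities peaking at positions 1 and 3 and end probabilities peaking at 2 and 4 decode to the spans (1, 2) and (3, 4).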

Training snippet:

    model = UIE.from_pretrained(args.model)
    criterion = paddle.nn.BCELoss()
    for epoch in range(1, args.num_epochs + 1):
        for batch in train_data_loader:
            # batch follows convert_example's output order
            input_ids, token_type_ids, pos_ids, att_mask, start_ids, end_ids = batch
            start_prob, end_prob = model(input_ids, token_type_ids, pos_ids, att_mask)
            # start_ids/end_ids have shape batch_size * max_seq_len,
            # with 1.0 at the gold start/end token positions
            start_ids = paddle.cast(start_ids, 'float32')
            end_ids = paddle.cast(end_ids, 'float32')
            loss_start = criterion(start_prob, start_ids)
            loss_end = criterion(end_prob, end_ids)
            loss = (loss_start + loss_end) / 2.0
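The loss is plain binary cross-entropy averaged over every token position of both pointer heads. A numpy sketch of what paddle.nn.BCELoss computes with its default 'mean' reduction, useful for sanity-checking shapes and values outside the framework:

```python
import numpy as np

def bce_loss(prob, target):
    """Element-wise binary cross-entropy, averaged over all positions
    (matches BCELoss's default 'mean' reduction)."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)  # numerical safety near 0 and 1
    return float(np.mean(-(target * np.log(prob)
                           + (1 - target) * np.log(1 - prob))))
```

A completely uncertain prediction (probability 0.5 everywhere) gives a loss of ln 2 ≈ 0.693 regardless of the targets.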

Takeaways

For beginners: first understand the whole pipeline (training-data processing, choice of loss function, choice of model, evaluation function), then get the code running end to end, and finally step through it in a debugger to watch how the shapes change at each key stage.

