UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks under a single architecture, which gives the tasks good transfer and generalization between each other.
Code: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie
doccano is an open-source text annotation tool; the annotated data is exported in the following format:
#doccano_ext.json
{"id": 1, "text": "昨天晚上十點加班打車回家58元", "relations": [], "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "時間"}, {"id": 1, "start_offset": 11, "end_offset": 12, "label": "目的地"}, {"id": 2, "start_offset": 12, "end_offset": 14, "label": "費用"}]}
{"id": 2, "text": "三月三號早上12點46加班,到公司54", "relations": [], "entities": [{"id": 3, "start_offset": 0, "end_offset": 11, "label": "時間"}, {"id": 4, "start_offset": 15, "end_offset": 17, "label": "目的地"}, {"id": 5, "start_offset": 17, "end_offset": 19, "label": "費用"}]}
{"id": 3, "text": "8月31號十一點零四工作加班五十塊錢", "relations": [], "entities": [{"id": 6, "start_offset": 0, "end_offset": 10, "label": "時間"}, {"id": 7, "start_offset": 14, "end_offset": 16, "label": "費用"}]}
{"id": 4, "text": "5月17號晚上10點35分加班打車回家,36塊五", "relations": [], "entities": [{"id": 8, "start_offset": 0, "end_offset": 13, "label": "時間"}, {"id": 1, "start_offset": 18, "end_offset": 19, "label": "目的地"}, {"id": 9, "start_offset": 20, "end_offset": 24, "label": "費用"}]}
{"id": 5, "text": "2009年1月份通訊費一百元", "relations": [], "entities": [{"id": 10, "start_offset": 0, "end_offset": 7, "label": "時間"}, {"id": 11, "start_offset": 11, "end_offset": 13, "label": "費用"}]}
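The start_offset/end_offset fields in the doccano export are character-level, half-open [start, end) indices into text, which can be verified by slicing the first record:

```python
# Sanity check: doccano offsets are character-level, half-open [start, end).
record = {
    "text": "昨天晚上十點加班打車回家58元",
    "entities": [
        {"start_offset": 0, "end_offset": 6, "label": "時間"},
        {"start_offset": 11, "end_offset": 12, "label": "目的地"},
        {"start_offset": 12, "end_offset": 14, "label": "費用"},
    ],
}
for ent in record["entities"]:
    span = record["text"][ent["start_offset"]:ent["end_offset"]]
    print(ent["label"], span)  # 時間 昨天晚上十點 / 目的地 家 / 費用 58
```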
python doccano.py \
--doccano_file ./data/doccano_ext.json \
--task_type ext \
--save_dir ./data \
--negative_ratio 0 \
--splits 0.8 0.2 0
After conversion, the resulting train.txt looks like this:
{"content": "6月2日交通費123元", "result_list": [{"text": "6月2日", "start": 0, "end": 4}], "prompt": "時間"}
{"content": "上海虹橋高鐵到杭州時間是9月24日費用是73元", "result_list": [{"text": "上海虹橋", "start": 0, "end": 4}], "prompt": "出發(fā)地"}
{"content": "從北京飛往上海出差飛機票費150元", "result_list": [{"text": "上海", "start": 5, "end": 7}], "prompt": "目的地"}
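The core of what doccano.py does for task_type=ext can be sketched roughly as follows (doccano_ext_to_prompts is a hypothetical helper name; the real script also generates negative examples according to --negative_ratio and performs the train/dev/test split, both omitted here):

```python
def doccano_ext_to_prompts(record):
    """Simplified sketch: turn one doccano record into one training
    example per entity label, where the label becomes the prompt and
    each entity becomes a (text, start, end) span in result_list."""
    by_label = {}
    for ent in record["entities"]:
        span = {
            "text": record["text"][ent["start_offset"]:ent["end_offset"]],
            "start": ent["start_offset"],
            "end": ent["end_offset"],
        }
        by_label.setdefault(ent["label"], []).append(span)
    return [
        {"content": record["text"], "result_list": spans, "prompt": label}
        for label, spans in by_label.items()
    ]
```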
Convert the train.txt data into input_ids:
import numpy as np

def convert_example(example, tokenizer, max_seq_len):
    # Prompt learning: encode the prompt and the content as a text pair
    encoded_inputs = tokenizer(text=[example["prompt"]],
                               text_pair=[example["content"]],
                               truncation=True,
                               max_seq_len=max_seq_len,
                               pad_to_max_seq_len=True,
                               return_attention_mask=True,
                               return_position_ids=True,
                               return_dict=False,
                               return_offsets_mapping=True)
    encoded_inputs = encoded_inputs[0]
    # Use offset_mapping to map character offsets in the original text
    # to token positions in the encoded (prompt + content) sequence
    offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
    bias = 0
    for index in range(1, len(offset_mapping)):
        mapping = offset_mapping[index]
        if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
            bias = offset_mapping[index - 1][1] + 1  # Includes [SEP] token
        if mapping[0] == 0 and mapping[1] == 0:
            continue
        offset_mapping[index][0] += bias
        offset_mapping[index][1] += bias
    start_ids = [0 for x in range(max_seq_len)]
    end_ids = [0 for x in range(max_seq_len)]
    for item in example["result_list"]:
        start = map_offset(item["start"] + bias, offset_mapping)
        end = map_offset(item["end"] - 1 + bias, offset_mapping)
        # start and end are token positions in input_ids
        start_ids[start] = 1.0
        end_ids[end] = 1.0
    tokenized_output = [
        encoded_inputs["input_ids"], encoded_inputs["token_type_ids"],
        encoded_inputs["position_ids"], encoded_inputs["attention_mask"],
        start_ids, end_ids
    ]
    tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output]
    return tuple(tokenized_output)
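convert_example relies on a helper map_offset that maps a character offset to the index of the token whose offset span covers it; a minimal sketch of such a helper (the actual one ships with PaddleNLP's utilities):

```python
def map_offset(ori_offset, offset_mapping):
    """Return the index of the token whose half-open [start, end) span
    covers ori_offset, or -1 if none does (special tokens map to (0, 0)
    and so never match)."""
    for index, span in enumerate(offset_mapping):
        if span[0] <= ori_offset < span[1]:
            return index
    return -1
```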
Model:
import paddle
import paddle.nn as nn
from paddlenlp.transformers import ErniePretrainedModel

class UIE(ErniePretrainedModel):
    def __init__(self, encoding_model):
        super(UIE, self).__init__()
        self.encoder = encoding_model
        hidden_size = self.encoder.config["hidden_size"]
        # Two per-token binary classification heads:
        # one for span starts, one for span ends
        self.linear_start = paddle.nn.Linear(hidden_size, 1)
        self.linear_end = paddle.nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, token_type_ids, pos_ids, att_mask):
        sequence_output, pooled_output = self.encoder(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=pos_ids,
            attention_mask=att_mask)
        start_logits = self.linear_start(sequence_output)
        start_logits = paddle.squeeze(start_logits, -1)
        start_prob = self.sigmoid(start_logits)
        end_logits = self.linear_end(sequence_output)
        end_logits = paddle.squeeze(end_logits, -1)
        end_prob = self.sigmoid(end_logits)
        return start_prob, end_prob
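The two linear heads project each token's hidden vector down to a single start and end probability. A numpy sketch of the shape flow (random weights, purely illustrative; the sizes 2, 16, 8 are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_seq_len, hidden_size = 2, 16, 8

# Stand-ins for the encoder output and the two head weight matrices
sequence_output = rng.normal(size=(batch_size, max_seq_len, hidden_size))
w_start = rng.normal(size=(hidden_size, 1))  # linear_start
w_end = rng.normal(size=(hidden_size, 1))    # linear_end

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

start_logits = sequence_output @ w_start            # (2, 16, 1)
start_prob = sigmoid(start_logits.squeeze(-1))      # (2, 16)
end_prob = sigmoid((sequence_output @ w_end).squeeze(-1))  # (2, 16)

print(start_prob.shape, end_prob.shape)  # (2, 16) (2, 16)
```

Each of the max_seq_len positions thus gets an independent probability of being a span start (or end), which is why binary cross-entropy is the natural loss below.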
Training loop snippet:
model = UIE.from_pretrained(args.model)
criterion = paddle.nn.BCELoss()
for epoch in range(1, args.num_epochs + 1):
    for batch in train_data_loader:
        # Same order as the tuple returned by convert_example
        input_ids, token_type_ids, pos_ids, att_mask, start_ids, end_ids = batch
        start_prob, end_prob = model(input_ids, token_type_ids, pos_ids, att_mask)
        # start_ids / end_ids shape: batch_size x max_seq_len,
        # with 1 at the start / end token position of each answer span
        start_ids = paddle.cast(start_ids, 'float32')
        end_ids = paddle.cast(end_ids, 'float32')
        loss_start = criterion(start_prob, start_ids)
        loss_end = criterion(end_prob, end_ids)
        loss = (loss_start + loss_end) / 2.0
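The loss averages two binary cross-entropies, one over start positions and one over end positions. A manual numpy version of the same computation, with small illustrative values:

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Element-wise binary cross-entropy, averaged over all positions
    (the same reduction paddle.nn.BCELoss uses by default)."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean()

# One sequence of length 3: the span starts at token 0 and ends at token 2
start_prob = np.array([[0.9, 0.1, 0.1]])
start_ids  = np.array([[1.0, 0.0, 0.0]])
end_prob   = np.array([[0.1, 0.1, 0.8]])
end_ids    = np.array([[0.0, 0.0, 1.0]])

loss = (bce(start_prob, start_ids) + bce(end_prob, end_ids)) / 2.0
```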
Takeaways
For beginners: first understand the end-to-end pipeline (training data processing, loss function choice, model choice, evaluation), then get the code running, and finally step through it with a debugger to watch how the shapes change at each key stage.