UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks under a single architecture, which gives the tasks good transfer and generalization between each other.
Code: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie
doccano is an open-source text annotation tool; the annotated data is exported in the following format:
#doccano_ext.json
{"id": 1, "text": "昨天晚上十點加班打車回家58元", "relations": [], "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "時間"}, {"id": 1, "start_offset": 11, "end_offset": 12, "label": "目的地"}, {"id": 2, "start_offset": 12, "end_offset": 14, "label": "費用"}]}
{"id": 2, "text": "三月三號早上12點46加班,到公司54", "relations": [], "entities": [{"id": 3, "start_offset": 0, "end_offset": 11, "label": "時間"}, {"id": 4, "start_offset": 15, "end_offset": 17, "label": "目的地"}, {"id": 5, "start_offset": 17, "end_offset": 19, "label": "費用"}]}
{"id": 3, "text": "8月31號十一點零四工作加班五十塊錢", "relations": [], "entities": [{"id": 6, "start_offset": 0, "end_offset": 10, "label": "時間"}, {"id": 7, "start_offset": 14, "end_offset": 16, "label": "費用"}]}
{"id": 4, "text": "5月17號晚上10點35分加班打車回家,36塊五", "relations": [], "entities": [{"id": 8, "start_offset": 0, "end_offset": 13, "label": "時間"}, {"id": 1, "start_offset": 18, "end_offset": 19, "label": "目的地"}, {"id": 9, "start_offset": 20, "end_offset": 24, "label": "費用"}]}
{"id": 5, "text": "2009年1月份通訊費一百元", "relations": [], "entities": [{"id": 10, "start_offset": 0, "end_offset": 7, "label": "時間"}, {"id": 11, "start_offset": 11, "end_offset": 13, "label": "費用"}]}
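The start_offset/end_offset fields in the doccano export are character-level, half-open [start, end) indices into text, which can be verified by slicing the first record:

```python
# Sanity check: doccano offsets are character-level, half-open [start, end).
record = {
    "text": "昨天晚上十點加班打車回家58元",
    "entities": [
        {"start_offset": 0, "end_offset": 6, "label": "時間"},
        {"start_offset": 11, "end_offset": 12, "label": "目的地"},
        {"start_offset": 12, "end_offset": 14, "label": "費用"},
    ],
}
for ent in record["entities"]:
    span = record["text"][ent["start_offset"]:ent["end_offset"]]
    print(ent["label"], span)  # 時間 昨天晚上十點 / 目的地 家 / 費用 58
```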
python doccano.py \
--doccano_file ./data/doccano_ext.json \
--task_type ext \
--save_dir ./data \
--negative_ratio 0 \
--splits 0.8 0.2 0
After conversion, the resulting train.txt looks like this:
{"content": "6月2日交通費123元", "result_list": [{"text": "6月2日", "start": 0, "end": 4}], "prompt": "時間"}
{"content": "上海虹橋高鐵到杭州時間是9月24日費用是73元", "result_list": [{"text": "上海虹橋", "start": 0, "end": 4}], "prompt": "出發(fā)地"}
{"content": "從北京飛往上海出差飛機票費150元", "result_list": [{"text": "上海", "start": 5, "end": 7}], "prompt": "目的地"}
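The core of what doccano.py does for task_type=ext can be sketched roughly as follows (doccano_ext_to_prompts is a hypothetical helper name; the real script also generates negative examples according to --negative_ratio and performs the train/dev/test split, both omitted here):

```python
def doccano_ext_to_prompts(record):
    """Simplified sketch: turn one doccano record into one training
    example per entity label, where the label becomes the prompt and
    each entity becomes a (text, start, end) span in result_list."""
    by_label = {}
    for ent in record["entities"]:
        span = {
            "text": record["text"][ent["start_offset"]:ent["end_offset"]],
            "start": ent["start_offset"],
            "end": ent["end_offset"],
        }
        by_label.setdefault(ent["label"], []).append(span)
    return [
        {"content": record["text"], "result_list": spans, "prompt": label}
        for label, spans in by_label.items()
    ]
```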
Convert the train.txt data into input_ids:
import numpy as np

def convert_example(example, tokenizer, max_seq_len):
    # Prompt learning: encode the prompt and the content as a text pair
    encoded_inputs = tokenizer(text=[example["prompt"]],
                               text_pair=[example["content"]],
                               truncation=True,
                               max_seq_len=max_seq_len,
                               pad_to_max_seq_len=True,
                               return_attention_mask=True,
                               return_position_ids=True,
                               return_dict=False,
                               return_offsets_mapping=True)
    encoded_inputs = encoded_inputs[0]
    # Use offset_mapping to map character offsets in the original text
    # to token positions in the encoded (prompt + content) sequence
    offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
    bias = 0
    for index in range(1, len(offset_mapping)):
        mapping = offset_mapping[index]
        if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
            bias = offset_mapping[index - 1][1] + 1  # Includes [SEP] token
        if mapping[0] == 0 and mapping[1] == 0:
            continue
        offset_mapping[index][0] += bias
        offset_mapping[index][1] += bias
    start_ids = [0 for x in range(max_seq_len)]
    end_ids = [0 for x in range(max_seq_len)]
    for item in example["result_list"]:
        start = map_offset(item["start"] + bias, offset_mapping)
        end = map_offset(item["end"] - 1 + bias, offset_mapping)
        # start and end are token positions in input_ids
        start_ids[start] = 1.0
        end_ids[end] = 1.0
    tokenized_output = [
        encoded_inputs["input_ids"], encoded_inputs["token_type_ids"],
        encoded_inputs["position_ids"], encoded_inputs["attention_mask"],
        start_ids, end_ids
    ]
    tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output]
    return tuple(tokenized_output)
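convert_example relies on a helper map_offset that maps a character offset to the index of the token whose offset span covers it; a minimal sketch of such a helper (the actual one ships with PaddleNLP's utilities):

```python
def map_offset(ori_offset, offset_mapping):
    """Return the index of the token whose half-open [start, end) span
    covers ori_offset, or -1 if none does (special tokens map to (0, 0)
    and so never match)."""
    for index, span in enumerate(offset_mapping):
        if span[0] <= ori_offset < span[1]:
            return index
    return -1
```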
Model:
import paddle
import paddle.nn as nn
from paddlenlp.transformers import ErniePretrainedModel

class UIE(ErniePretrainedModel):
    def __init__(self, encoding_model):
        super(UIE, self).__init__()
        self.encoder = encoding_model
        hidden_size = self.encoder.config["hidden_size"]
        # Two per-token binary classification heads:
        # one for span starts, one for span ends
        self.linear_start = paddle.nn.Linear(hidden_size, 1)
        self.linear_end = paddle.nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, token_type_ids, pos_ids, att_mask):
        sequence_output, pooled_output = self.encoder(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=pos_ids,
            attention_mask=att_mask)
        start_logits = self.linear_start(sequence_output)
        start_logits = paddle.squeeze(start_logits, -1)
        start_prob = self.sigmoid(start_logits)
        end_logits = self.linear_end(sequence_output)
        end_logits = paddle.squeeze(end_logits, -1)
        end_prob = self.sigmoid(end_logits)
        return start_prob, end_prob
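The two linear heads project each token's hidden vector down to a single start and end probability. A numpy sketch of the shape flow (random weights, purely illustrative; the sizes 2, 16, 8 are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_seq_len, hidden_size = 2, 16, 8

# Stand-ins for the encoder output and the two head weight matrices
sequence_output = rng.normal(size=(batch_size, max_seq_len, hidden_size))
w_start = rng.normal(size=(hidden_size, 1))  # linear_start
w_end = rng.normal(size=(hidden_size, 1))    # linear_end

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

start_logits = sequence_output @ w_start            # (2, 16, 1)
start_prob = sigmoid(start_logits.squeeze(-1))      # (2, 16)
end_prob = sigmoid((sequence_output @ w_end).squeeze(-1))  # (2, 16)

print(start_prob.shape, end_prob.shape)  # (2, 16) (2, 16)
```

Each of the max_seq_len positions thus gets an independent probability of being a span start (or end), which is why binary cross-entropy is the natural loss below.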
Training loop snippet:
model = UIE.from_pretrained(args.model)
criterion = paddle.nn.BCELoss()
for epoch in range(1, args.num_epochs + 1):
    for batch in train_data_loader:
        # Same order as the tuple returned by convert_example
        input_ids, token_type_ids, pos_ids, att_mask, start_ids, end_ids = batch
        start_prob, end_prob = model(input_ids, token_type_ids, pos_ids, att_mask)
        # start_ids / end_ids shape: batch_size x max_seq_len,
        # with 1 at the start / end token position of each answer span
        start_ids = paddle.cast(start_ids, 'float32')
        end_ids = paddle.cast(end_ids, 'float32')
        loss_start = criterion(start_prob, start_ids)
        loss_end = criterion(end_prob, end_ids)
        loss = (loss_start + loss_end) / 2.0
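The loss averages two binary cross-entropies, one over start positions and one over end positions. A manual numpy version of the same computation, with small illustrative values:

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Element-wise binary cross-entropy, averaged over all positions
    (the same reduction paddle.nn.BCELoss uses by default)."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean()

# One sequence of length 3: the span starts at token 0 and ends at token 2
start_prob = np.array([[0.9, 0.1, 0.1]])
start_ids  = np.array([[1.0, 0.0, 0.0]])
end_prob   = np.array([[0.1, 0.1, 0.8]])
end_ids    = np.array([[0.0, 0.0, 1.0]])

loss = (bce(start_prob, start_ids) + bce(end_prob, end_ids)) / 2.0
```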
Takeaways
For beginners: first understand the end-to-end pipeline (training data processing, loss function choice, model choice, evaluation), then get the code running, and finally step through it with a debugger to watch how the shapes change at each key stage.