Abstract: Optical character recognition (OCR) system performance has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep-learning-based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so is the labeling task that follows, which we focus on here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use, in addition to labeled data, some unlabeled samples to improve performance compared to fully supervised ones. Consequently, such methods may adapt to unseen images at test time.
We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model that can generate images of words with arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as a performance boost over state-of-the-art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin the pen stroke is.
Handwritten text recognition (HTR), a subfield of deep-learning-based optical character recognition (OCR), is constrained mainly by the scarcity of training samples: collecting the data is difficult in itself, and the labeling that follows is just as hard. Training with semi-supervised methods can reduce the annotation workload. This post presents ScrabbleGAN, a semi-supervised generative adversarial network (GAN) for synthesizing handwritten text.

The architecture of ScrabbleGAN is sketched in the figure above. Besides the usual discriminator D, it adds a handwriting recognizer R. As shown in the figure, given the text to be generated, the filters for the corresponding characters are looked up in a filter bank; each filter is multiplied by a noise vector z, which controls the style of the text. Adjacent character filters overlap spatially. This makes generation very flexible: each character can vary in size and shape, and the model can learn dependencies between neighboring characters. The discriminator distinguishes generated images from real ones, while a localized text recognizer identifies individual characters in the generated text. The recognizer deliberately avoids the recurrent models common in text recognition and uses only a convolutional network (in other words, each character is recognized without looking at its neighbors; this prevents the recognizer from guessing based on such priors, which would let even unclear generated characters be "recognized", and thereby pushes up the quality of generation). The training loss therefore consists of two terms,
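To make the filter-bank mechanism concrete, here is a minimal numpy sketch. All dimensions (`Z_DIM`, `PATCH_W`, `PATCH_H`, `OVERLAP`) are hypothetical stand-ins, and `fake_upsample` is a fixed random projection standing in for the learned transposed-convolution stack; the point is only to show how per-character filters, modulated by one shared style noise `z`, are placed with horizontal overlap so that word images of arbitrary length come out of the same generator.

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
Z_DIM = 32      # noise/filter dimension (hypothetical; the paper uses larger filters)
PATCH_W = 16    # width each character patch occupies before overlap
PATCH_H = 32    # image height
OVERLAP = 4     # horizontal overlap between adjacent character patches

# Filter bank: one learnable vector per character (random stand-ins here).
filter_bank = {c: rng.standard_normal(Z_DIM) for c in ALPHABET}

# Fixed random projection playing the role of the learned conv stack.
_proj = rng.standard_normal((PATCH_H * PATCH_W, Z_DIM))

def fake_upsample(v):
    """Stand-in for the generator's upsampling layers: vector -> image patch."""
    return (_proj @ v).reshape(PATCH_H, PATCH_W)

def generate(word, z):
    """Concatenate per-character patches with spatial overlap."""
    stride = PATCH_W - OVERLAP
    width = stride * (len(word) - 1) + PATCH_W
    canvas = np.zeros((PATCH_H, width))
    for i, c in enumerate(word):
        patch = fake_upsample(filter_bank[c] * z)  # style noise modulates the filter
        x = i * stride
        canvas[:, x:x + PATCH_W] += patch          # overlapping regions are blended
    return canvas

z = rng.standard_normal(Z_DIM)
img = generate("scrabble", z)
print(img.shape)  # image width grows linearly with word length
```

Because the same `z` modulates every character filter, the whole word shares one style, while the overlap lets neighboring characters influence each other, which is the intuition behind cursive-looking output.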

Since these two loss terms do not have the same magnitude, the paper uses the following formula to compute the final gradient used for the update:
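The balancing formula itself did not survive in this copy, so here is a sketch of the rule as I understand it from the paper: instead of weighting the losses, the recognizer gradient is rescaled so that its mean and spread match those of the discriminator gradient, roughly `grad_R <- alpha * (sigma_D / sigma_R) * (grad_R - mu_R) + mu_D`. Treat the exact form as an assumption; the code below only illustrates the idea.

```python
import numpy as np

def balance_grad(grad_R, grad_D, alpha=1.0):
    """Rescale the recognizer gradient to the discriminator gradient's scale.

    mu_* and sigma_* are the mean and standard deviation of each gradient;
    alpha trades off readability (R) against realism (D).
    """
    mu_R, sigma_R = grad_R.mean(), grad_R.std()
    mu_D, sigma_D = grad_D.mean(), grad_D.std()
    return alpha * (sigma_D / sigma_R) * (grad_R - mu_R) + mu_D

rng = np.random.default_rng(1)
gR = rng.normal(5.0, 10.0, size=1000)   # recognizer gradient: large scale
gD = rng.normal(0.0, 0.1, size=1000)    # discriminator gradient: small scale
balanced = balance_grad(gR, gD)
print(balanced.mean(), balanced.std())  # matches gD's statistics
```

With `alpha = 1` the rescaled recognizer gradient has exactly the discriminator gradient's mean and standard deviation, so neither term drowns out the other during the update.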

The paper proposes two main ways to apply ScrabbleGAN to training a concrete handwriting recognition system. The first is straightforward augmentation: use data generated by ScrabbleGAN to enlarge an existing training set. The second is transfer learning: starting from the labeled IAM dataset, and without using any CVL labels, ScrabbleGAN is trained on CVL to generate data in the CVL style and lexicon; this synthetic data is added to IAM, and the recognition model trained on the enlarged set successfully achieves better performance.
Several points remained unclear to me while reading; since the code has not been released, I leave them here for discussion:
1. The filter bank is used to generate different characters. How is this filter bank obtained, or how is it learned?
2. Why is the filter for each letter exactly 8192-dimensional?
3. How are the handwriting style and the lexicon of a dataset separated? How can the IAM style be combined with the CVL lexicon?
4. How can this GAN be trained without supervision, given that it contains a recognizer R? If R is pre-trained, doesn't that already use labels from the domain? And if it is not pre-trained, won't R classify so poorly that it cannot possibly supervise the quality of the GAN's output?