運(yùn)行GATK時(shí),報(bào)錯(cuò):java.lang.IllegalArgumentException: samples cannot be empty
這個(gè)問(wèn)題在GATK Forum有討論到;
IllegalArgumentException: samples cannot be empty
使用 ValidateSamFile確定bam文件是否有問(wèn)題?
Picard ValidateSamFile
Validates a SAM or BAM file.
ValidateSamFile下有兩種mode:
VERBOSE : 檢查到100個(gè)錯(cuò)誤之后退出,并且輸出錯(cuò)誤到終端;
SUMMARYL: 輸出結(jié)果是一個(gè)表格,展示errors和warnings的數(shù)目;
$ java -jar picard.jar ValidateSamFile \
I=input.bam \
MODE=SUMMARY
Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 1287903
問(wèn)題在bam文件 MISSING_READ_GROUP;請(qǐng)自動(dòng)屏蔽WARNING結(jié)果;
獲取GATK格式bam文件
- 方法法1
bwa 比對(duì)使用參數(shù)-R
-R STR: Complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’. [null]
bwa -R 參數(shù)的使用
# 不加參數(shù)R
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
# 加參數(shù)R
bwa mem -R '@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n' ref.fa read1.fq read2.fq > aln-pe.R.sam
'@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n'
# ID:每一個(gè)Read group獨(dú)有的ID;
# LB:DNA preparation library identifier
# PL:測(cè)序使用的平臺(tái): ILLUMINA, SOLID, LS454, HELICOS and PACBIO;
# PU:Platform Unit,由三部分組成: {FLOWCELL_BARCODE}.{LANE};
# SM:樣品名
read group參數(shù)詳細(xì)信息見本文后半部分
注:GATK 使用時(shí),PU不是必須要求的;但是PU與ID同時(shí)存在時(shí),PU優(yōu)先級(jí)高于ID。
# 查看文件aln-pe.sam與aln-pe.R.sam中的group
samtools view -H aln-pe.sam | grep '@RG'
# 不會(huì)有輸出
samtools view -H aln-pe.R.sam | grep '@RG'
@RG ID:group_n LB:library_n PL:illumina PU:unit1 SM:sample_n
# 多個(gè)比對(duì)的時(shí)候,使用循環(huán)
ls *.gRNA.fastq | while read id ;
do
bwa mem -R "@RG\tID:group_${id}\tLB:library_${id}\tPL:illumina\tSM:sample_${id}" ref.fa $id > ${id}.R.sam
done
- 方法2
Picard AddOrReplaceReadGroups
Replace read groups in a BAM file.This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.
$ java -jar picard.jar AddOrReplaceReadGroups \
I=input.bam \
O=output.bam \
RGID=4 \
RGLB=library1 \
RGPL=illumina \
RGPU=unit1 \
RGSM=sample1
Option Description
INPUT (String) Input file (BAM or SAM or a GA4GH url). Required.
OUTPUT (File) Output file (BAM or SAM). Required.
SORT_ORDER (SortOrder) Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RGID (String) Read Group ID Default value: 1. This option can be set to 'null' to clear the default value.
RGLB (String) Read Group library Required.
RGPL (String) Read Group platform (e.g. illumina, solid) Required.
RGPU (String) Read Group platform unit (eg. run barcode) Required.
RGSM (String) Read Group sample name Required.
Read groups詳細(xì)介紹
There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.
測(cè)序時(shí):
樣本建一個(gè)庫(kù)在同一條lane上完成測(cè)序產(chǎn)生reads sets可定義為一個(gè)Read group;
樣本建成分開獨(dú)立的庫(kù)測(cè)序得到的reads sets也就分屬于不同的Read groups;
存在于SAM/BAM /CRAM 文件中Read groups是由一系列標(biāo)簽組成;這些標(biāo)簽代表著測(cè)序中的一些技術(shù)特征;有了這些數(shù)據(jù)之后,大家就可以對(duì)bam文件進(jìn)行重復(fù)序列標(biāo)識(shí)和堿基質(zhì)量重新矯正。
GATK要求輸入的bam文件包含Read groups,如果沒有就會(huì)報(bào)錯(cuò)。
例子:
samtools view -H sample.bam | grep '@RG'
@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
GATk 要求read group的格式
??Read group是@RG開始。
ID = Read group identifier
??每一個(gè)Read group獨(dú)有的ID;
??Illumina 測(cè)序數(shù)據(jù)中,read group IDs由flowcell ,lane name 和number組成。
??在矯正堿基質(zhì)量時(shí),read group IDs對(duì)區(qū)分技術(shù)批次效應(yīng)是必須的;在這過(guò)程中,同一read group的reads假 定為有一樣的技術(shù)誤差。
PU = Platform Unit
??Platform Unit由三部分組成: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
??{FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell;
??The {LANE} indicates the lane of the flow cell ;
??The {SAMPLE_BARCODE} is a sample/library-specific identifier;
??GATK 使用時(shí),PU不是必須要求的;但是PU與ID同時(shí)存在時(shí),PU優(yōu)先級(jí)高于ID。
SM = Sample
??reads屬于的樣品名;SM要設(shè)定正確,因?yàn)镚ATK產(chǎn)生的VCF文件也使用這個(gè)名字。
PL= Platform/technology used to produce the read
??測(cè)序使用的平臺(tái): ILLUMINA, SOLID, LS454, HELICOS and PACBIO。
LB = DNA preparation library identifier
??對(duì)一個(gè)read group的reads進(jìn)行重復(fù)序列標(biāo)記時(shí),需要使用LB來(lái)區(qū)分reads來(lái)自那條lane;有時(shí)候,同一個(gè)庫(kù)可能在不同的lane上完成測(cè)序;為了加以區(qū)分,同一個(gè)或不同庫(kù)只要是在不同的lane產(chǎn)生的reads都要單獨(dú)給一個(gè)ID。
從read names中提取ID與PU
H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288
H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
例子-Multi-sample and multiplexed example
三個(gè)樣品:MOM, DAD, KID;
建庫(kù):每個(gè)樣品分別建兩個(gè)庫(kù),一個(gè)insert為200bp,一個(gè)insert為400bp;
上機(jī):每個(gè)測(cè)序庫(kù)使用Illumina HiSeq的 兩條lane;
reads:來(lái)自 3 x 2 x 2 = 12條lane,可以產(chǎn)生12個(gè)bam文件,結(jié)果如下:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
#解釋
樣本DAD建了兩個(gè)庫(kù)LIB-DAD-1(200bp)和LIB-DAD-2(400bp),使用了FLOWCELL1上面的4條lane;
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
參考:
Read groups
Picard
從零開始完整學(xué)習(xí)全基因組測(cè)序(WGS)數(shù)據(jù)分析:第4節(jié) 構(gòu)建WGS主流程