Fastp reports the error "sequence and quality have different length"
While analyzing sequencing data with HiC-Pro, the bowtie2 alignment step failed with the following error:
Run HiC-Pro 3.1.0
--------------------------------------------
Sat Aug 19 22:49:39 CST 2023
Bowtie2 alignment step1 ...
Logs: logs/SRR1648/mapping_step1.log
Exit: Error in reads alignment - Exit
make: *** [/mnt/data/soft/hicpro/HiC-Pro_3.1.0/bin/../scripts//Makefile:88:bowtie_global] Error 1
I could not track down the cause from within HiC-Pro, so I tried processing the fastq file with fastp instead.
This time a detailed error message appeared:
ERROR: sequence and quality have different length:
@SRR1648.49420915 E00563:451:HGW3TCCX2:1:1110:23490:58075/2
GTGAGCCAAGATTGCGCCACTGCACTCCAGCCTGGGCAACAAGAGCAAAACTGTCTCAAAAAAAAAAAAGAAAAAAATGAGTAGGGGATTGA--F-7--ATATAC49420AGTAGGG:58075/2
TGATAACAATGTCATTTTGTGAA70.49420JJJJJJJJJJJJJJJJJJJJJJJJJJJ-FJJJAGGT
+JJJJJJJJJJFJJJJX2:1:11JJJJJJJJJJJJJJ670.494208JJJJJJJJ7F7FJJJJJFJJJJJJJAJJJJJFJ494208JJJJJJJJ7075/2
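Both the base sequence and the quality string of this record are garbled.
My guess was that, because Hi-C data files are very large, something went wrong during the download.
(In principle an md5 check should have been done, but I could not find a checksum on the data website.)
So I downloaded the file again and checked this read once more. It was back to normal: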
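To locate a record like this without running the whole pipeline, the file can be scanned directly. The following is a minimal sketch, not what fastp does internally: it assumes plain 4-line FASTQ records, and the function name `first_bad_record` is made up for illustration. It stops at the first problem, because once one record has the wrong number of lines, the 4-line framing of everything after it can no longer be trusted.

```python
import gzip
from itertools import islice


def first_bad_record(path):
    """Return (record_number, reason, lines) for the first malformed record, or None."""
    # Assumption: a ".gz" suffix means gzip-compressed, anything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        record_number = 0
        while True:
            # Read the next four lines as one record.
            lines = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not lines:
                return None  # end of file reached without finding a problem
            record_number += 1
            if len(lines) < 4:
                return record_number, "truncated record", lines
            header, seq, plus, qual = lines
            if not header.startswith("@"):
                return record_number, "header does not start with '@'", lines
            if not plus.startswith("+"):
                return record_number, "separator line does not start with '+'", lines
            if len(seq) != len(qual):
                return record_number, "sequence and quality have different length", lines

# Example call (the file name is taken from this post):
# print(first_bad_record("SRR1648_2.fastq.gz"))
```

The returned record number counts records, not lines, so the offending header sits at line `4 * (record_number - 1) + 1` of the decompressed file.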
@SRR1648.49420915 E00563:451:HGW3TCCX2:1:1110:23490:58075/2
GTGAGCCAAGATTGCGCCACTGCACTCCAGCCTGGGCAACAAGAGCAAAACTGTCTCAAAAAAAAAAAAGAAAAAAATGAGTAGGGGATTGATCACGCCATTGCACTTCCGCTTGGGCAACAAGAGCAAAACTCTGTCTCAAAAAAAAAA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJJ<FJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJFJAFFJJFJJJJJJJJJJJJJ
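A fresh download can be verified before it goes back into the pipeline. The sketch below shows two checks using only the Python standard library: an md5 digest to compare against a published checksum, when one is available, and a full decompression pass that fails on a truncated or damaged gzip stream. The function names `md5sum` and `gzip_is_intact` are illustrative. Note that the gzip test only detects damage to the compressed file itself; it cannot detect content that was already wrong before compression.

```python
import gzip
import hashlib
import zlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Decompress the whole file and report whether it could be read to the end."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # the data is discarded; only the absence of errors matters
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header or CRC mismatch, EOFError a truncated stream.
        return False
    return True

# Example calls (the file name is taken from this post):
# print(md5sum("SRR1648_2.fastq.gz"))
# print(gzip_is_intact("SRR1648_2.fastq.gz"))
```

For files of this size both checks take a while, since each reads the whole file once.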
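If the problem persists after re-downloading, the only remaining option is to delete the affected read from the fastq file. Note that for paired-end data, once a read has been deleted from the _1 file, its mate must also be deleted from the _2 file; otherwise the alignment will fail with a new error.
A Python script for deleting a specified read from a fastq file, for reference: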
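## gzip_fix.py
# Remove every record whose read ID equals --fix_key from a gzipped FASTQ file.
# EXAMPLE: python ./gzip_fix.py -f ./SRR1648_2.fastq.gz -o ./SRR1648_fix_2.fastq.gz -k SRR1648.49420915
import argparse
import gzip

parser = argparse.ArgumentParser(description='remove a read from a fastq.gz file')
parser.add_argument('--input_file', '-f', dest='input_file', required=True, help='input fastq.gz file')
parser.add_argument('--out_file', '-o', dest='out_file', required=True, help='output file name, ending in .gz')
parser.add_argument('--fix_key', '-k', dest='fix_key', required=True, help='read ID to remove, e.g. SRR164xxx.500')
args = parser.parse_args()
print(args.input_file + '\n' + args.out_file + '\n' + args.fix_key)

if_write = False
with gzip.open(args.input_file, 'rb') as fastq, gzip.open(args.out_file, 'wb') as outfile:
    for line in fastq:
        # errors='replace' keeps the script running if the damaged region is not valid UTF-8.
        text = line.decode(errors='replace')
        # A line starting with '@' is treated as a record header. Writing is switched
        # off at the matching header and stays off until the next header line, so a
        # damaged record is dropped even if it does not have exactly four lines.
        # Caveat: a quality line can also start with '@' and would be mistaken
        # for a header here.
        if text.startswith('@'):
            fastqid = text.strip().split()[0][1:]
            if_write = fastqid != args.fix_key
        if if_write:
            outfile.write(line)
        else:
            print('Fix Succeed: ' + args.fix_key + ' - ' + text.rstrip())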
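After the read has been removed from both the _1 and _2 files, it is worth confirming that the mates still line up before rerunning the alignment. The sketch below walks both files in parallel and reports the first position where the read IDs differ. It assumes well-formed 4-line records in both files, and the function names `read_ids` and `first_pair_mismatch` are illustrative only.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID of every record, assuming well-formed 4-line FASTQ records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only header lines carry the read ID
            fields = line.split()
            if not fields:
                continue  # tolerate a blank line, e.g. at the end of the file
            name = fields[0][1:]  # drop the leading '@' and any description
            if name.endswith(("/1", "/2")):
                name = name[:-2]  # drop a mate suffix attached to the ID itself
            yield name


def first_pair_mismatch(path_1, path_2):
    """Return (record_number, id_1, id_2) at the first disagreement, or None if in sync."""
    pairs = zip_longest(read_ids(path_1), read_ids(path_2))
    for record_number, (id_1, id_2) in enumerate(pairs, start=1):
        if id_1 != id_2:
            # An ID of None means that file ended before the other one.
            return record_number, id_1, id_2
    return None

# Example call (output names follow the EXAMPLE line of gzip_fix.py):
# print(first_pair_mismatch("SRR1648_fix_1.fastq.gz", "SRR1648_fix_2.fastq.gz"))
```

A result of `None` means both files contain the same read IDs in the same order.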