Python DNA序列中子序列出現頻率

作者：佚名 2010-03-15 09:47:43

Python DNA序列是在Python語言中使用比較廣泛的一個。當然我們在不斷的進行學習中還是有不少的問題存在。下面我們就詳細的看看。

Python DNA序列在使用的時候有很多需要我們注意的東西，其實在不斷的學習中有很多問題存在，下面我們就詳細的看看如何進行相關的技術學校。ms是我師弟的rotation project：給定一堆Python DNA序列，即由字符A, C, G, T組成的字符串，統計所有長度為n的子序列出現的頻率。

比如 ACGTACGT，子序列長度為2，于是 AC=2, CG=2, GT=2, TA=1，其余長度為2的子序列頻率為0.

最先想到的就是建一個字典，key是所有可能的子序列，value是這個子序列出現的頻率。Python DNA序列但是當子序列比較長的時候，比如 n=8，需要一個有65536 (4的8次方) 個key-value pair的字典，且每個key的長度是8字符。這樣ms有點浪費內存。。

于是想到，所有的長度為n的子序列是有序且連續的，所以可以映射到一個長度為4的n次方的的list里。令 A=0, C=1, G=2, T=3，則把子序列 ACGT 轉換成 0*4^3 + 1*4^2 + 2*4 + 3 = 27, 映射到list的第27位。如此，list的index對應子序列，而list這個index位置則儲存這個子序列出現的頻率。

于是我們先要建立2個字典，表示ACGT和0123一一對應的關系：

i2mD = {0:'A', 1:'C', 2:'G', 3:'T'}  
m2iD = dict(A=0,C=1,G=2,T=3)  
# This is just another way to initialize a 
dictionary

以及下面的子序列映射成整數函數：

def motif2int(motif):  
'''convert a sub-sequence/motif to a non-negative 
integer'''  
total = 0 
for i, letter in enumerate(motif):  
total += m2iD[letter]*4**(len(motif)-i-1)  
return total  
Test:  
>>> motif2int('ACGT')

雖然我們內部把子序列當成正整數來存儲（確切地說，其實這個整數是沒有存在內存里的，而是由其在list的index表示的），為了方便生物學家們看，輸出時還是轉換回子序列比較好。

于是有了下面的整數映射成子序列函數，其中調用了另外一個函數baseN()，來源在此，感謝作者～

def baseN(n,b):  
'''convert non-negative decimal integer n to  
equivalent in another base b (2-36)'''  
return ((n == 0) and '0' ) or ( baseN(n // b, b).lstrip('0') + \  
"0123456789abcdefghijklmnopqrstuvwxyz"[n % b])  
def int2motif(n, motifLen):  
'''convert non-negative integer n to a sub-sequence/motif with length motifLen'''  
intBase4 = baseN(n,4)  
return ''.join(map(lambda x: i2mD[int(x)],'0'*(motifLen-len(intBase4))+intBase4))  
Test:  
>>> int2motif(27,4)  
'ACGT'

以下代碼從命令行讀入一個存有DNA序列的fasta文件，以及子序列長度，并輸出子序列和頻率。注意以下代碼需要Biopython module。

if __name__ == '__main__':  
import sys  
from Bio import SeqIO  
# read in the fasta file name and motif length  
# from command line parameters  
fastafile = sys.argv[1]  
motifLen = int(sys.argv[2])  
# list to store subsequence frequency  
frequencyL = [0]*4**motifLen  
# go over each DNA sequence in the fasta file  
# and count the frequency of subsequences  
it = SeqIO.parse(open(fastafile),'fasta')  
for rec in it:  
chrom = rec.seq.tostring()  
for i in range(len(chrom)-motifLen+1):  
motif = chrom[i:i+motifLen]  
frequencyL[motif2int(motif)] += 1  
# print frequency result to screen  
for i, frequency in enumerate(frequencyL):  
print int2motif(i, motifLen), frequency

以上就是對Python DNA序列的相關介紹。

【編輯推薦】

簡介Python程序的校園網的認證方法
Python文本亂碼發生時的解決方案
Python編程語言總體性能優點評測
Python編程語言具有相當高的適應能力
Python編程語言維和受到眾人的追捧

責任編輯：張浩來源： IT168

Python DNA序列

成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看

Python DNA序列中子序列出現頻率