生信 - 序列比对 | Gaaising's Blog

笔记，简单写写，有误望指正。

1. 两种比对方式

全局比对：用于两个长度相似的序列，使用Needleman-Wunsch算法

局部比对：用于两个长度不一的序列，使用Smith–Waterman算法（由Needleman-Wunsch算法扩展得到）

Needleman-Wunsch算法

\begin{array}{l} \begin{array}{l} s(0,0) = 0 \\ s(0, j) = \text { gap } * j, \quad 1< = j< = m \\ s(i, 0) = \text { gap } * i, \quad 1< = i< = n \end{array}\\ s(i, j) = \max \left\{\begin{array}{l} s(i-1, j-1)+w(i, j) \\ s(i-1, j)+\text { gap } \\ s(i, j-1)+\text { gap } \end{array}\right. \end{array}

Smith–Waterman算法

\begin{array}{l} \begin{array}{l} s(0,0) = 0 \\ s(0, j) = 0, \quad 1< = j< = m \\ s(i, 0) = 0, \quad 1< = i< = n \end{array}\\ s(i, j) = \max \left\{\begin{array}{l} 0 \\ s(i-1, j-1)+w(i, j) \\ s(i-1, j)+\text { gap } \\ s(i, j-1)+\text { gap } \end{array}\right. \end{array}

其中，gap是空位罚分

对于两个序列字符串p和q

m=length(p)

n=length(q)

w(i,j)是根据替换记分矩阵得到字符q[i]和p[j]对应的得分

2. 使用JavaScript实现Smith–Waterman算法

本人写的一小段demo，用于辅助完成我的生信作业（老师是要求手算的😡，但我选择偷懒😋）

作业要求如下：

序列x：AGCTAAGT

序列y：GCAA

空位罚分：-5

替换记分矩阵：

A G C T
A 10 -1 -3 -4
G -1 7 -5 -3
C -3 -5 9 0
T -4 -3 0 8

根据替换积分矩阵，完成两条序列的局部比对。

	A	G	C	T
A	10	-1	-3	-4
G	-1	7	-5	-3
C	-3	-5	9	0
T	-4	-3	0	8

其实这个作业手算起来难度并不大，主要是步骤有点繁琐，纯体力劳动，一走神可能就搞错了，所以当然得用电脑代劳。

虽然有很多现成的序列比对工具，但作业要求用箭头标出回溯路径，一时间我也找不到能做这个的，所以干脆自己造了这个轮子。

为啥不用python？原本是想用numpy搞的，打开vscode的瞬间就改变想法了，我还是对js熟悉一点，py太久没碰已经很生疏了到时候估计还得翻文档。（其实js的数学库，比如math.js，我也不熟，但是矩阵计算本就不算复杂，所以就手搓吧）

function smithWaterman(x, y, substitutionMatrix, gap) {
  const rows = x.length + 1
  const cols = y.length + 1

  const scoreMatrix = Array.from({ length: rows }, () => Array(cols).fill(0))
  const pathMatrix = Array.from({ length: rows }, () => Array(cols).fill(null))

  let maxScore = 0
  let maxPos = null

  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const upper_left = scoreMatrix[i - 1][j - 1] + substitutionMatrix[x[i - 1]][y[j - 1]]
      const upper = scoreMatrix[i - 1][j] + gap
      const left = scoreMatrix[i][j - 1] + gap

      const score = Math.max(0, upper_left, upper, left)
      scoreMatrix[i][j] = score

      if (score > maxScore) {
        maxScore = score
        maxPos = [i, j]
      }

      switch (score) {
        case 0:
          break
        case upper_left:
          pathMatrix[i][j] = [i - 1, j - 1]
          break
        case upper:
          pathMatrix[i][j] = [i - 1, j]
          break
        case left:
          pathMatrix[i][j] = [i, j - 1]
          break
      }
    }
  }

  let current = maxPos
  const tracebackArray = []

  while (current) {
    tracebackArray.push(current)
    const [i, j] = current
    current = pathMatrix[i][j]
  }

  return {
    scoreMatrix, //得分矩阵
    pathMatrix, //路径矩阵
    tracebackArray //回溯路径
  }
}

const x = 'AGCTAAGT' //序列x
const y = 'GCAA' //序列y
const substitutionMatrix = { //替换矩阵
  A: { A: 10, G: -1, C: -3, T: -4 },
  G: { A: -1, G: 7, C: -5, T: -3 },
  C: { A: -3, G: -5, C: 9, T: 0 },
  T: { A: -4, G: -3, C: 0, T: 8 },
}
const gap = -5 //空位罚分

const result = smithWaterman(x, y, substitutionMatrix, gap)

console.dir(result, { depth: null })

解释一下输出结果：得分矩阵scoreMatrix顾名思义；路径矩阵pathMatrix是每个位点最高值的来源位点（即，每个位点上的箭头指向的位点）；回溯路径tracebackArray是从最大值位点开始回溯到值为0的位点的路径。