求解最小编辑距离并通过对矩阵回溯展示两个字符串的对齐

题目来自*Speech and Language Processing
Daniel-Jurafsky Stanford-University
*

Augment the minimum edit distance algorithm to output an alignment; you will need to store pointers and add a stage to compute the backtrace.

# To extend the edit distance algorithm to produce an alignment, we can start by
# visualizing an alignment as a path through the edit distance matrix.
# In the first step, we augment the minimum edit distance algorithm to store backpointers in each cell.
# The backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell.
# Some cells have multiple backpointers because the minimum extension could have come from multiple previous cells.
# In the second step, we perform a backtrace. In a backtrace, we start from the last cell (at the final row and column),
# and follow the pointers back through the dynamic programming matrix.
# Each complete path between the final cell and the initial cell is a minimum distance alignment.

import numpy as np


def edit(str1, str2):
    # 计算两个单词之间的最小编辑距离,并返回一个从源字符串转换到到目标字符串的操作列表(其中0代表无操作,1代表替换,2代表插入,3代表删除)
    # 初始化矩阵的行、列长度
    len_str1 = len( str1 ) + 1
    len_str2 = len( str2 ) + 1
    # 初始化矩阵
    matrix = np.zeros( shape=(len_str1, len_str2), dtype=int )  # np.zeros 创建指定大小的数组,数组元素以 0 来填充
    # numpy 产生三维矩阵,i行j列,每个单元2个元素,这两个元素代表matrix中的一个坐标,用于存储matrix中(i,j)位置的回溯,回溯即当前位置的编辑距离由min(x,y,z)决定
    matrix_pointer = np.zeros( shape=(len_str1, len_str2, 3), dtype=int )
    print( "编辑距离矩阵和回溯矩阵的性状:", matrix.shape, matrix_pointer.shape )
    # 初始化x维度
    for i in range( 1, len_str1 ):
        matrix[i][0] = i
        matrix_pointer[i][0] = i - 1, 0, 3
    # 初始化y维度
    for j in range( 1, len_str2 ):
        matrix[0][j] = j
        matrix_pointer[0][j] = 0, j - 1, 2
    # 动态规划,计算最小编辑距离
    for i in range( 1, len_str1 ):
        for j in range( 1, len_str2 ):
            # 当前i,j字符是否相同
            if str1[i - 1:i] == str2[j - 1:j]:  # 字符串截取,左闭右开
                flag = 0
            else:
                flag = 2  # substitute操作,权重设置为2,其他版本的算法也将该权重设置为1
            # 选择距离
            matrix[i][j] = min( matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + flag )
            # 如果数值相同,优先级是斜上、左和上
            if str1[i - 1:i] == str2[j - 1:j]:  # 若两字符串在该位置的字符相同,则直接向左上方回溯(none 0)
                matrix_pointer[i][j] = i - 1, j - 1, 0
            elif matrix[i][j] - 2 == matrix[i - 1][j - 1]:  # 若最小距离是由左上+2得来(substitute 1),则向左上方回溯
                matrix_pointer[i][j] = i - 1, j - 1, 1
            #  If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target;
            #  two boldfaced cells in the same column indicate a deletion.
            elif matrix[i][j] - 1 == matrix[i][j - 1]:  # 若最小距离是由左边+1得来(insert 2),则向左边回溯
                matrix_pointer[i][j] = i, j - 1, 2
            elif matrix[i][j] - 1 == matrix[i - 1][j]:  # 若最小距离是由上方+1得来(delete 3),则向上方回溯
                matrix_pointer[i][j] = i - 1, j, 3
    i, j, k = matrix_pointer[len_str1 - 1][len_str2 - 1]  # i,j代表回溯路径中的坐标,k代表操作
    op_list = [k]  # 初始化操作列表
    print("编辑距离矩阵:")
    print(matrix)
    print( "回溯路径的各单元坐标(从右下角单元回溯到左上角单元,其间的这条路径展示了两个字符串的对齐方式):" )
    while True:
        if i != 0 or j != 0:
            # print( i, j, k, matrix[i][j] )
            print( i, j )
            i, j, k = matrix_pointer[i][j]
            op_list.append( k )  # 扩展操作列表
        elif i == 0 and j == 0:
            # print( i, j, k )
            break
    op_list.reverse()
    # print( matrix, matrix.shape, matrix_pointer )
    # 返回最小编辑距离矩阵的右下角元素(最小编辑距离),源字符串对应的操作列表
    return matrix[len_str1 - 1][len_str2 - 1], list( enumerate( op_list ) )


def print_alignment(src, tar, op_list):
    #  根据源字符串和目标字符串以及操作序列输出对齐方式
    list_src = list( src )
    list_tar = list( tar )
    list_ope = []
    # I N T E * N T I O N
    # * E X E C U T I O N  0n 1s 2i 3d
    # d s s   i s

    for i, j in op_list:
        if j == 0:
            list_ope.append( " " )
        elif j == 3:
            list_tar.insert( i, "*" )
            list_ope.append( "d" )  # delete
        elif j == 1:
            list_ope.append( "s" )  # substitute
        elif j == 2:
            list_src.insert( i, "*" )
            list_ope.append( "i" )  # insert 

    print( ''.join( list_src ) )
    print( ''.join( list_tar ) )
    print( ''.join( list_ope ) )


if __name__ == '__main__':
    src = "intention"
    tar = "execution"

    res, op_list = edit( src, tar )
    print( "intention和execution的最小编辑距离是:", res )
    print( "对源字符串的操作序列:", op_list )
    print( "源字符串和目标字符串的对齐方式如下:" )
    print_alignment( src, tar, op_list )

output:
编辑距离矩阵和回溯矩阵的性状: (10, 10) (10, 10, 3)
编辑距离矩阵:
[[ 0  1  2  3  4  5  6  7  8  9]
 [ 1  2  3  4  5  6  7  6  7  8]
 [ 2  3  4  5  6  7  8  7  8  7]
 [ 3  4  5  6  7  8  7  8  9  8]
 [ 4  3  4  5  6  7  8  9 10  9]
 [ 5  4  5  6  7  8  9 10 11 10]
 [ 6  5  6  7  8  9  8  9 10 11]
 [ 7  6  7  8  9 10  9  8  9 10]
 [ 8  7  8  9 10 11 10  9  8  9]
 [ 9  8  9 10 11 12 11 10  9  8]]
回溯路径的各单元坐标(从右下角单元回溯到左上角单元,其间的这条路径展示了两个字符串的对齐方式):
8 8
7 7
6 6
5 5
4 4
4 3
3 2
2 1
1 0
intention和execution的最小编辑距离是: 8
对源字符串的操作序列: [(0, 3), (1, 1), (2, 1), (3, 0), (4, 2), (5, 1), (6, 0), (7, 0), (8, 0), (9, 0)]
源字符串和目标字符串的对齐方式如下:
inte*ntion
*execution
dss is 

版权声明:本文为qq_43552032原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。