How do I find the overlapping part of two strings in bash?

Answer 1

You could write a function like this, then add some error checking:

common_prefix() {
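  local n=0
  # Stop at the end of $1; otherwise two identical strings would keep
  # comparing empty to empty forever.
  while [[ -n "${1:n:1}" && "${1:n:1}" == "${2:n:1}" ]]; do
    ((n++))
  done
  printf '%s\n' "${1:0:n}"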
}
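
One way to add the error checking mentioned above is sketched here. The name `common_prefix_checked` and the exit status 2 for a usage error are assumptions made for illustration; they are not part of the original answer.

```bash
# Hypothetical variant of common_prefix with basic argument validation.
common_prefix_checked() {
  if (( $# != 2 )); then
    printf 'usage: common_prefix_checked STRING1 STRING2\n' >&2
    return 2
  fi
  local n=0
  # Never read past the end of the first string, so equal inputs terminate.
  while (( n < ${#1} )) && [[ "${1:n:1}" == "${2:n:1}" ]]; do
    n=$((n + 1))
  done
  printf '%s\n' "${1:0:n}"
}

common_prefix_checked "test toast" "test test"
```

The loop compares one character per iteration, so it is only reasonable for short strings.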

Answer 2

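This can be done entirely inside bash. Looping over string operations in bash is slow, but there is a simple algorithm that needs only a logarithmic number of shell operations, so pure bash is a viable option even for long strings.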

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    ## Quote the right-hand side so it is compared literally, not as a glob pattern
    if [[ ${1:0:$n} == "${2:0:$n}" ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = "$2" ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}
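
To see the logarithmic behaviour, the following sketch repeats the same binary search but counts loop iterations. The function name `lcp_count_steps` and the global result variables `LCP_PREFIX` and `LCP_STEPS` are hypothetical names introduced only for this illustration.

```bash
# Hypothetical instrumented copy of the binary search. Results are left in
# the globals LCP_PREFIX and LCP_STEPS (both names are assumptions).
lcp_count_steps() {
  local n
  LCP_PREFIX='' LCP_STEPS=0
  # Truncate both strings to the length of the shorter one.
  if (( ${#1} > ${#2} )); then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  while (( ${#1} > 1 )); do
    LCP_STEPS=$((LCP_STEPS + 1))
    n=$(( (${#1} + 1) / 2 ))
    if [[ "${1:0:n}" == "${2:0:n}" ]]; then
      LCP_PREFIX+=${1:0:n}          # first half is common: keep it, search the rest
      set -- "${1:n}" "${2:n}"
    else
      set -- "${1:0:n}" "${2:0:n}"  # difference lies in the first half
    fi
  done
  if [[ "$1" == "$2" ]]; then LCP_PREFIX+=$1; fi
}

# Build two 1024-character strings that differ only in the last character.
printf -v pad '%*s' 1023 ''
pad=${pad// /a}
lcp_count_steps "${pad}b" "${pad}c"
printf 'prefix length: %d, loop iterations: %d\n' "${#LCP_PREFIX}" "$LCP_STEPS"
```

For inputs of this size the loop runs about ten times, where a character-by-character scan would need over a thousand iterations.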

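The standard toolbox includes cmp, which compares binary files. By default it reports the byte offset of the first differing byte. There is one special case, when one string is a prefix of the other: cmp then prints a different message on stderr. A simple way to handle this is to take the shorter string.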

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
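
The two parameter expansions that extract the offset can be tried in isolation. This sketch feeds them a sample message in the format GNU cmp prints in the C locale (POSIX specifies `char` where GNU prints `byte`). The helper name `parse_cmp_offset` is hypothetical.

```bash
# Hypothetical helper: pull the byte offset out of a cmp "differ" message.
parse_cmp_offset() {
  local msg=$1
  msg=${msg%,*}    # drop the trailing ", line 1"
  msg=${msg##* }   # keep only the last word, which is the offset
  printf '%s\n' "$msg"
}

parse_cmp_offset '/dev/fd/63 /dev/fd/62 differ: byte 5, line 1'   # prints 5
```

The offset is 1-based, which is why the function above takes `offset-1` bytes as the common prefix.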

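Note that cmp operates on bytes, whereas bash's string operations operate on characters. The two differ in multibyte locales, for example locales that use the UTF-8 character set. The function above prints the longest common prefix of the two byte strings. To handle character strings with this approach, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.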

longest_common_prefix () {
  # Use :- rather than := so the caller's LC_ALL is not assigned as a side effect
  local offset prefix LC_CTYPE="${LC_ALL:-$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
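    # (offset-1)/4 is the 0-based index of the 4-byte unit holding the first
    # difference; subtract 1 more for the byte-order mark that iconv is
    # assumed to emit at the start of its UTF-32 output.
    prefix=${1:0:$(( (offset-1)/4 - 1 ))}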
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
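
The byte/character distinction can be observed directly in bash. In this sketch the helper name `byte_length` is hypothetical; it relies on the same `local LC_ALL=C` trick used by the byte-oriented function above.

```bash
# Hypothetical helper: length of a string in bytes, whatever the caller's
# locale. Setting LC_ALL=C locally makes ${#1} count bytes.
byte_length() {
  local LC_ALL=C
  printf '%s\n' "${#1}"
}

s=$'caf\xc3\xa9'   # "café" encoded as UTF-8: 5 bytes, 4 characters
printf 'bytes: %s\n' "$(byte_length "$s")"
printf 'characters in the current locale: %s\n' "${#s}"
```

The second line should report 4 in a UTF-8 locale and 5 in the C locale, which is exactly the discrepancy the UTF-32 conversion works around.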

Answer 3

With sed, assuming the strings contain no newlines:

string1="test toast"
string2="test test"
printf "%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
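
The expression works because `N` appends the second line to the pattern space, `\(.*\)` greedily captures a prefix of the first line, and `\n\1` requires the second line to start with the same text, so the longest match wins. Here is a sketch of a wrapper that enforces the no-newline assumption; the function name `sed_common_prefix` is hypothetical.

```bash
# Hypothetical wrapper around the sed one-liner. It rejects inputs that
# contain a newline, since the technique relies on exactly two input lines.
sed_common_prefix() {
  local s
  for s in "$1" "$2"; do
    if [[ $s == *$'\n'* ]]; then
      printf 'sed_common_prefix: newlines are not supported\n' >&2
      return 1
    fi
  done
  printf '%s\n' "$1" "$2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
}

sed_common_prefix "test toast" "test test"
```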

Answer 4

This seems crude to me, but you can do it by brute force:

#!/bin/bash

string1="test toast"
string2="test test"

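L=1  # Candidate prefix length

# Stop at the end of string1 so identical strings cannot loop forever, and
# quote the right-hand side so it is compared literally, not as a pattern.
while (( L <= ${#string1} )) && [[ ${string1:0:L} == "${string2:0:L}" ]]
do
    ((L = L + 1))
done

echo "Overlap: ${string1:0:L-1}"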

I was hoping a clever algorithm existed, but a brief search did not turn one up.
