Sorting a list numerically

Answer 1

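You didn't specify a language requirement, so here is a quick-and-dirty solution in Python 3.8. I'm sure others can come up with something better, but this should do.

The code assumes the text is in a file named list.txt in the current directory, and it creates a new file named new-list.txt.

It also does not handle the missing space in "-La isla del tesoro".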

import re

booklist = []
bookcount = 0
entry = ''
line_numbers = []

# Find and return the volume number (as a string) for a book
def get_volnum(book):
    volstring = re.search(r'\(volume (\d+)\)', book)
    return volstring.group(1)

# Read the file into a list of lines
with open('list.txt', 'r') as infile:
    doc = infile.readlines()

# Group each book into a single string and append it to booklist
for line in doc:
    # If the line begins with three digits followed by 'G.', start a new entry.
    if re.match(r'\d\d\dG\.', line):
        # Save the line number (the three digits)
        line_numbers.append(line.split('G.', 1)[0])
        # Add the previous entry to booklist (without the three digits and 'G.')
        if bookcount > 0:
            booklist.append(entry.split('G.', 1)[1])

        entry = line
        bookcount += 1
    # If the line begins with '- ', append it to the current entry.
    elif line.startswith('- '):
        entry += line

# Append the last entry
booklist.append(entry.split('G.', 1)[1])

# Make a list (booktable) that contains [volnum, book]
booktable = [[get_volnum(book), book] for book in booklist]

# Sort that list by volnum (index 0 of each item), compared as integers
booktable.sort(key=lambda x: int(x[0]))

line_numbers.sort()

# Write the result to a file, reusing the original line numbers in order
with open('new-list.txt', 'w') as f:
    for number, (_, book) in zip(line_numbers, booktable):
        f.write(number + 'G.' + book)
        f.write('\n')
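
The sort key in the code above wraps the volume number in int() because the regular expression group is a string, and strings compare character by character. A small illustration with made-up volume numbers (the values are not taken from the question):

```python
# Made-up volume numbers, as strings like those returned by a regex group
volnums = ['10', '9', '2', '100']

# Plain string sort: compares character by character
print(sorted(volnums))            # ['10', '100', '2', '9']

# Numeric sort: convert each item to int for the comparison only
print(sorted(volnums, key=int))   # ['2', '9', '10', '100']
```

The line numbers can be sorted as plain strings only because they all have exactly three digits.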

Answer 2

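This (using GNU awk for the third argument to match(), sorted_in, and FPAT) sorts only the section you want (that is, collection "one" with sequence numbers of 292 or greater), can handle titles containing any characters or strings, including ;, (, ) or (volume <N>), and outputs the sorted section in its original position among the surrounding unsorted sections: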

$ cat tst.awk
BEGIN {
    RS = ""
    ORS = "\n\n"
    FPAT = "[^;]*(\"[^\"]*\")*[^;]*"
    tgtColl = "one"
    begSeqNr = 292
    maxSeqs = 100
}
match($2,/Collection (.*) \(volume ([0-9]+)\)/,a) {
    coll  = a[1]
    volNr = a[2]
    seqNr = $1+0
}
(coll == tgtColl) && (seqNr >= begSeqNr) && (++seqCnt <= maxSeqs) {
    vols[volNr] = $0
    next
}
{
    prtVols()
    print
}
END { prtVols() }

function prtVols(       volNr, seqNr, vol) {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    seqNr = begSeqNr
    for (volNr in vols) {
        vol = vols[volNr]
        sub(/[0-9]+/,seqNr++,vol)
        print vol
    }
    delete vols
}

For example, given this input, which is modified from the sunny-day case in the question to add a few useful test cases:

$ cat file
  100G.- some earlier collection ; Collection zero (volume 1) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST earlier collection ID

  200G.- right collection, too early sequence number; Collection one (volume 6) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
  - TEST earlier sequence number

  292G.- La Ilíada ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
  - I have to download more ancient greek texts.
  - Another note line.

  293G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
  - Masterpiece.

  294G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
  - Russian masterpiece.

  295G.- "Kill Bill; Bury Him (volume 2)" ; Collection one (volume 5) ; Tarantino ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST quoted title with sparator chars and target string

  296G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - I read this one as a kid.

  300G.- some later collection ; Collection twenty-three (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST later collecion ID

it will output:

$ awk -f tst.awk file
  100G.- some earlier collection ; Collection zero (volume 1) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST earlier collection ID

  200G.- right collection, too early sequence number; Collection one (volume 6) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
  - TEST earlier sequence number

  292G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
  - Masterpiece.

  293G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - I read this one as a kid.

  294G.- La Ilíada ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
  - I have to download more ancient greek texts.
  - Another note line.

  295G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
  - Russian masterpiece.

  296G.- "Kill Bill; Bury Him (volume 2)" ; Collection one (volume 5) ; Tarantino ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST quoted title with sparator chars and target string

  300G.- some later collection ; Collection twenty-three (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
  - TEST later collecion ID

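Because ; is the field separator, any ; that appears in a title must be inside double quotes, either on its own, as in Kill Bill";" Bury Him, or as part of a fully quoted title, as in the example above. No other characters or strings in a title need special handling.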
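
The quoting rule can be illustrated outside awk as well. Below is a rough Python sketch of the same idea; it is not a translation of the FPAT value itself. A field is taken to be a run of double-quoted strings and characters that are neither ; nor ". The name split_fields is made up for this example, and the record is a shortened copy of the test entry.

```python
import re

# A field is one or more of: a double-quoted string (which may contain ';')
# or a single character that is neither ';' nor '"'.
FIELD = re.compile(r'(?:"[^"]*"|[^;"])+')

def split_fields(record):
    """Split a record on ';', ignoring any ';' inside double quotes."""
    return [field.strip() for field in FIELD.findall(record)]

# Shortened version of the quoted-title test entry (URL left out)
record = ('295G.- "Kill Bill; Bury Him (volume 2)" ; '
          'Collection one (volume 5) ; Tarantino')

for field in split_fields(record):
    print(field)
```

The quoted title stays in one field even though it contains both the separator and a (volume 2) string, so the volume is read from the second field only.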

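If you actually want all of collection one rather than only the entries from a given sequence number onward, or vice versa, it is a trivial tweak: just don't test for one or the other. Similarly, if you want the whole collection from begSeqNr onward sorted rather than only 100 entries, leave out the test on seqCnt, and if you don't want the surrounding collections/sequences printed, remove the standalone print statement.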
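
To make those tweaks concrete, here is a small Python sketch of the selection condition with each of the three tests made optional. The function is_target and its parameter names are invented for this illustration. Unlike the awk condition, which increments seqCnt as part of the test, it assumes the caller passes in the count of records selected so far, including the current one.

```python
def is_target(coll, seq_nr, seq_cnt, tgt_coll="one", beg_seq_nr=292, max_seqs=100):
    """Return True if a record belongs to the block that should be sorted.

    Passing None for tgt_coll, beg_seq_nr or max_seqs switches that test off,
    which mirrors deleting the corresponding condition from the awk script.
    """
    if tgt_coll is not None and coll != tgt_coll:
        return False
    if beg_seq_nr is not None and seq_nr < beg_seq_nr:
        return False
    if max_seqs is not None and seq_cnt > max_seqs:
        return False
    return True

print(is_target("one", 200, 1))                    # False: sequence number too early
print(is_target("one", 200, 1, beg_seq_nr=None))   # True: the whole collection is wanted
print(is_target("zero", 300, 1, tgt_coll=None))    # True: any collection from 292 onward
```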

Answer 3

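This uses awk with a GNU-specific feature (!) to define the array traversal order. Note: it stores the whole file in RAM at once, but you said "more than 100 volumes", so I assume the file is not terribly large.

The idea is to:

- separate records by empty lines (two newlines in a row, assuming the separating lines contain no tabs)
- use parentheses as field separators and put each record into an array with the volume number as the index; for that, the number has to be isolated with sub
- sort the output by the "volume X" index
- simply replace each entry's number (293G etc.) in sorted order

Script: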

BEGIN { RS="" ; ORS="\n\n" ; FS="[()]" }

{id=$2 ; sub(/volume /,"",id) ; vol[id]=$0}    

END {PROCINFO["sorted_in"]="@ind_num_asc"
    n=292
    for ( id in vol ) { gsub(/^\t.../,"\t"n++,vol[id]) ; print vol[id] } }

Run it with

awk -f script inputfile
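
For readers who don't use awk, the four steps can also be sketched in Python. This is only an approximation of the script above. It finds the volume with a regular expression instead of splitting on parentheses, and it replaces the first run of digits in each entry instead of the three characters after a leading tab. The function name renumber_by_volume and the shortened sample text are assumptions for this example.

```python
import re

def renumber_by_volume(text, first_number=292):
    """Sort blank-line-separated entries by volume and renumber them."""
    # Step 1: records are separated by empty lines.
    records = re.split(r'\n{2,}', text.strip('\n'))
    # Step 2: index each record by the number inside "(volume N)".
    by_volume = {}
    for record in records:
        volume = int(re.search(r'\(volume (\d+)\)', record).group(1))
        by_volume[volume] = record
    # Steps 3 and 4: walk the volumes in numeric order and replace the
    # leading sequence number of each entry with the next consecutive one.
    result = []
    for offset, volume in enumerate(sorted(by_volume)):
        new_number = str(first_number + offset)
        result.append(re.sub(r'\d+', new_number, by_volume[volume], count=1))
    return '\n\n'.join(result)

# Shortened sample entries (URLs left out), with a leading tab on each line
sample = (
    "\t292G.- La Ilíada ; Collection one (volume 3) ; Homer\n"
    "\t- Another note line.\n"
    "\n"
    "\t293G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes\n"
    "\t- Masterpiece.\n"
    "\n"
    "\t294G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson\n"
)
print(renumber_by_volume(sample))
```

As in the awk script, two entries with the same volume number would overwrite each other in the dictionary.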

Answer 4

<infile awk -F';' -v RS= '
        /Collection one/{ n=$2; gsub(/[^0-9]*/, "", n); sub(/[0-9]+/, 292+n-1) }
        { print sep $0; sep="\0" }' |sort -z |tr '\0' '\n'
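
The pipeline comes without an explanation. As far as can be read from the code, it rewrites the leading number of every paragraph that mentions "Collection one" to 292 plus the volume number minus 1, and then lets sort put the NUL-separated paragraphs in order. Below is a rough Python rendering of that reading. The function name renumber_and_sort and the shortened sample are assumptions, and Python's default string ordering stands in for whatever collation sort uses in your locale.

```python
import re

def renumber_and_sort(text, base=292):
    """Renumber 'Collection one' paragraphs by volume, then sort all paragraphs."""
    records = re.split(r'\n{2,}', text.strip('\n'))
    out = []
    for record in records:
        if 'Collection one' in record:
            # Second ';'-separated field, reduced to its digits: the volume number.
            volume = int(re.sub(r'\D', '', record.split(';')[1]))
            # Replace the first run of digits (the sequence number).
            record = re.sub(r'\d+', str(base + volume - 1), record, count=1)
        out.append(record)
    # Plain text sort of whole paragraphs, like sort -z on NUL-separated records.
    return '\n\n'.join(sorted(out))

# Shortened sample entries (URLs left out)
sample = (
    "\t292G.- La Ilíada ; Collection one (volume 3) ; Homer\n"
    "\t- Another note line.\n"
    "\n"
    "\t293G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes\n"
    "\n"
    "\t294G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson\n"
)
print(renumber_and_sort(sample))
```

Note that this approach sorts every paragraph in the file as text, so it relies on all sequence numbers having the same width and on volume 1 belonging at number 292.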
