Microsoft Word 提供了搜索格式化文本并替换它和格式的功能。此功能对我的工作非常有用,因为我需要将 Word 文档转换为在线调查。一个简单的例子是找到一个bold
单词并将其替换为<strong>bold</strong>
。
然而,有些情况下,我们收到的文档在上述格式之间有多余的、未格式化的空白。这使得查找和替换所有粗体文本的过程有点棘手。此外,有些情况下,空白被格式化了,而实际上不应该被格式化。
什么是宏或通配符(正则表达式)搜索/替换,用于查找并用格式正确的空格替换所有格式不正确的空格?
“不正确”的两个标准是,行上的最后一个空格必须是未格式化的,并且两个格式化单词之间的空格必须是格式化的。本质上,我试图创建最干净尽可能全部替换。
以以下截图为例:
粉色/紫色高亮表示空白,这是正常风格,但应该用斜体表示。
红色/橙色突出显示表示加粗的空白,但应为正常的、非加粗的样式。
在这两种情况下,我都需要宏或通配符查找/替换来知道将一个转换为斜体,并从另一个中完全删除粗体样式。
进一步阐述:
目前,如果我仅利用 Microsoft Word 的字体样式格式进行查找和替换,则会<em>
在某些行中产生三个元素,例如:
<em>The average American expects the rate of deflation (opposite</em> <em>of</em> <em>inflation)</em> will be between 0% and 2%
理想的结果应该是一个<em>
元素:
<em>The average American expects the rate of deflation (opposite of inflation)</em> will be between 0% and 2%
(请注意,我使用斜体和粗体作为示例,但下划线文本也是如此。)
答案1
Word 的“通配符”查找和替换使用(非常)有限的、非标准的正则表达式形式。再加上您还想查找和替换格式,这意味着无论是否使用通配符,仅使用内置的查找和替换都无法完成您的要求。
然而,它是可以利用 Word 的宏中的查找/替换功能来实现智能空格转换。还可以仅使用 VBA 可用的正确正则表达式来编写宏,而无需访问 Word 的查找/替换功能。
以下解决方案采用前者,使用Find
对象以编程方式执行 Word 的查找/替换,而无需使用通配符。但它确实在几个辅助函数中使用了 VBA(或更严格地说是 VBScript)的正则表达式,以使它们更简单。
该脚本并非只是适当地转换空格,这样仍然需要进一步的查找和替换所有步骤,而是有效地将空格转换为和同时进行 HTML 换行和格式删除。
'============================================================================================
' Module : <in any standard module>
' Version : 0.1.4
' Part : 1 of 1
' References : Microsoft VBScript Regular Expressions 5.5 [VBScript_RegExp_55]
' Source : https://superuser.com/a/1321448/763880
'============================================================================================
Option Explicit
Private Const s_BoldReplacement = "<strong>^&</strong>"
Private Const s_ItalicReplacement = "<em>^&</em>"
Private Const s_UnderlineReplacement = "<u>^&</u>"
Private Enum FormatType
Bold
Italic
Underline
End Enum
Public Sub ConvertFormattedTextToHTML()
With Application
.ScreenUpdating = True ' Set to False to speed up execution for large documents
ConvertTextToHTMLIf Bold
ConvertTextToHTMLIf Italic
ConvertTextToHTMLIf Underline
.ScreenUpdating = True
End With
End Sub
Private Sub ConvertTextToHTMLIf _
( _
ByVal peFormatType As FormatType _
)
' Create/setup a Find object
Dim rngFound As Range: Set rngFound = ActiveDocument.Content
With rngFound.Find
.MatchCase = True ' Required, otherwise an all-caps found chunk's replacement is converted to all-caps
.Format = True
Select Case peFormatType
Case FormatType.Bold:
.Font.Bold = True
.Replacement.Font.Bold = False
.Replacement.Text = s_BoldReplacement
Case FormatType.Italic:
.Font.Italic = True
.Replacement.Font.Italic = False
.Replacement.Text = s_ItalicReplacement
Case FormatType.Underline:
.Font.Underline = True
.Replacement.Font.Underline = False
.Replacement.Text = s_UnderlineReplacement
End Select
End With
' Main "chunk" loop:
' - Finds the next chunk (contiguous appropriately formatted text);
' - Expands it to encompass the following chunks if only separated by unformatted grey-space (white-space + punctuation - vbCr - VbLf)
' - Removes (and unformats) leading and trailing formatted grey-space from the expanded-chunk
' - Converts the trimmed expanded-chunk to unformatted HTML
Do While rngFound.Find.Execute() ' (rngFound is updated to the "current" chunk if the find succeeds)
If rngFound.End = rngFound.Start Then Exit Do ' ## bug-workaround (Bug#2 - see end of sub) ##
' Create a duplicate range in order to track the endpoints for the current chunk's expansion
Dim rngExpanded As Range: Set rngExpanded = rngFound.Duplicate
rngFound.Collapse wdCollapseEnd ' ## bug-workaround (Bug#2 - see end of sub) ##
' Expansion loop
Do
' If more chunks exist ~> the current chunk is fully expanded
If Not rngFound.Find.Execute() Then Exit Do ' (rngFound is updated to the next chunk if the find succeeds)
If rngFound.End = rngFound.Start Then Exit Do ' ## bug-workaround (Bug#2 - see end of sub) ##
' If the formatting continues across a line boundary ~> terminate the current chunk at the boundary
If rngFound.Start = rngExpanded.End And rngExpanded.Characters.Last.Text = vbCr Then Exit Do ' ## requiring the vbCr check is a bug-workaround (Bug#1 - see end of sub) ##
' If the intervening (unformatted) text doesn't just consist of grey-space ~> the current chunk is fully expanded
' (Note that since vbCr & vbLf aren't counted as grey-space, chunks don't expand across line boundaries)
If NotJustGreySpace(rngFound.Parent.Range(rngExpanded.End, rngFound.Start)) Then Exit Do
' Otherwise, expand the current chunk to encompass the inter-chunk (unformatted) grey-space and the next chunk
rngExpanded.SetRange rngExpanded.Start, rngFound.End
rngFound.Collapse wdCollapseEnd ' ## bug-workaround (Bug#2 - see end of sub) ##
Loop
With rngExpanded.Font
' Clear the appropriate format for the expanded-chunk
Select Case peFormatType
Case FormatType.Bold: .Bold = False
Case FormatType.Italic: .Italic = False
Case FormatType.Underline: .Underline = False
End Select
End With
With TrimRange(rngExpanded) ' (rngExpanded also gets updated as a side-effect)
With .Font
' Restore the appropriate format for the trimmed expanded-chunk
Select Case peFormatType
Case FormatType.Bold: .Bold = True
Case FormatType.Italic: .Italic = True
Case FormatType.Underline: .Underline = True
End Select
' (Leading and trailing grey-space is now unformatted wrt the appropriate format)
End With
' Unformat the trimmed expanded-chunk and convert it to HTML
If .Start = .End _
Then ' ~~ Grey-space Only ~~
' Don't convert. (Has already been unformatted by the previous trim)
Else ' ~~ Valid Text ~~
' Need to copy the trimmed expanded-chunk endpoints back to rngFound as we can't use rngExpanded for the replace
' since a duplicate's Find object gets reset upon duplication.
rngFound.SetRange .Start, .Start ' ## Second .Start instead of .End is a bug-workaround (Bug#2 - see below) ##
rngFound.Find.Text = rngExpanded.Text ' ## bug-workaround (Bug#2 - see end of sub) ##
rngFound.Find.Execute Replace:=wdReplaceOne
rngFound.Find.Text = vbNullString ' ## bug-workaround (Bug#2 - see end of sub) ##
End If
rngFound.Collapse wdCollapseStart ' ## bug-workaround (Bug#1 & Bug#2 - see end of sub) ##
End With
Loop
' ## Bug#1 ## Normally, after a range has been updated as a result of performing the Execute() method to *find*
' something, performing a second "find" will continue the search in the rest of the document. If, however, the range
' is modified in such a way that the same find would not succeed in the range (as is what typically happens when using
' Execute() to perform a find/replace), then a second "find" will *NOT* continue the search in the rest of the
' document and fails instead. The solution is to "collapse" the range to zero width. See the following for more info:
' http://web.archive.org/web/20180512034406/https://gregmaxey.com/word_tip_pages/words_fickle_vba_find_property.html
' ## Bug#2 ## Good ol' buggy Word sometimes decides to split a chunk up even though it doesn't cross a line boundary.
' Also, even when the Find object's wrap property is set to wdFindStop (default value), it sometimes behaves as if the
' property is set to wdFindContinue, which is also buggy, resulting in Execute() not returning False when no more
' chunks exist after wrapping (and *correctly* not updating rngFound). This requires a few work-arounds to cater for
' all the resulting combination of edge cases.
' See the following for a example doc reproducing this bug:
' https://drive.google.com/open?id=11Z9fpxllk2ZHAU90_lTedhYSixQQucZ5
' See the following for more details on when this occurs:
' https://chat.stackexchange.com/rooms/77370/conversation/word-bug-finding-formats-in-line-before-table
End Sub
' Note that vbCr & vbLf are NOT treated as white-space.
' Also note that "GreySpace" is used to indicate it is not purely white-space, but also includes punctuation.
Private Function IsJustGreySpace _
( _
ByVal TheRange As Range _
) _
As Boolean
Static rexJustWhiteSpaceExCrLfOrPunctuation As Object '## early binding:- As VBScript_RegExp_55.RegExp
If rexJustWhiteSpaceExCrLfOrPunctuation Is Nothing Then
Set rexJustWhiteSpaceExCrLfOrPunctuation = CreateObject("VBScript.RegExp") ' ## early binding:- = New VBScript_RegExp_55.RegExp
rexJustWhiteSpaceExCrLfOrPunctuation.Pattern = "^(?![^\r\n]*?[\r\n].*$)[\s?!.,:;-]*$" ' ## the last * instead of + is a bug-workaround (Bug#2 - see end of main sub) ##
End If
IsJustGreySpace = rexJustWhiteSpaceExCrLfOrPunctuation.test(TheRange.Text)
End Function
Private Function NotJustGreySpace _
( _
ByVal TheRange As Range _
) _
As Boolean
NotJustGreySpace = Not IsJustGreySpace(TheRange)
End Function
Private Function TrimRange _
( _
ByRef TheRange As Range _
) _
As Range
Static rexTrim As Object '## early binding:- As VBScript_RegExp_55.RegExp
If rexTrim Is Nothing Then
Set rexTrim = CreateObject("VBScript.RegExp") ' ## early binding:- = New VBScript_RegExp_55.RegExp
rexTrim.Pattern = "(^[\s?!.,:;-]*)(.*?)([\s?!.,:;-]*$)"
End If
With rexTrim.Execute(TheRange.Text)(0)
If Len(.SubMatches(1)) = 0 _
Then ' ~~ Grey-space Only ~~
TheRange.Collapse wdCollapseEnd
Else
TheRange.SetRange TheRange.Start + Len(.SubMatches(0)), TheRange.End - Len(.SubMatches(2))
End If
End With
Set TrimRange = TheRange
End Function
标准:
我擅自扩展/推断了空白转换的标准。如果这些标准不符合您的确切要求,您可以进行修改。目前这些标准如下:
- 转换针对每种格式类型独立进行,例如粗体、斜体、下划线。目前脚本仅处理这三种类型。可以轻松添加/删除类型。
- 转换按行进行。不会跨越行边界。这是因为将回车符和换行符视为非空白字符,并利用 Word 的内置查找功能在行边界处终止搜索。
- 根据评论中的请求,标点符号
?!.,:;-
现在被视为与空格相同。 - 任何连续的空格/标点符号序列,如果序列前面的非空格/标点符号与序列后面的字符具有相同的格式,则将转换为该格式。请注意,这会导致未格式化单词之间的空格/标点符号的格式被删除,以及“扩展”格式化文本以包含未格式化的空格/标点符号。
- 如果连续的空格/标点符号序列的前一个和后一个字符格式不同,则强制取消格式化该空格/标点符号序列。结合每行转换,结果如下:
- 行首或行末的空格/标点符号未格式化;
- 格式化文本部分开始或结束处的空格/标点符号未被格式化。
笔记:
该脚本有相当好的文档记录,因此应该是不言自明的。
它使用后期绑定,因此不需要设置引用。
编辑: 根据评论更新了新版本。