Asked by: Moses Harding  Asked: 10/16/2023  Updated: 10/16/2023  Views: 11
Parsing docx with SwiftSoup and SSZipArchive yields unexpected results due to "gramstart" tag
Q:
I retrieve a docx file through a web service call, unzip it with SSZipArchive, and then parse it with SwiftSoup. See the code below:
if let xmlURL = self.extractDocxAttachment(data: data) {
    let string = self.getStringFrom(docURL: xmlURL)
    print(string)
} else {
    print("Could not convert doc")
}
func extractDocxAttachment(data: Data) -> URL? {
    print(#function)
    do {
        // Save ZIP data to a temporary file
        let tempZipURL = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("temp.zip")
        try data.write(to: tempZipURL)
        // Extract ZIP archive using SSZipArchive
        let destinationDir = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("extracted-docx")
        let success = SSZipArchive.unzipFile(atPath: tempZipURL.path, toDestination: destinationDir.path)
        // Clean up temporary ZIP file
        try FileManager.default.removeItem(at: tempZipURL)
        if success {
            // Get URL of word/document.xml file
            let documentXMLFileURL = destinationDir.appendingPathComponent("word").appendingPathComponent("document.xml")
            return documentXMLFileURL
        } else {
            print("Failed to extract DOCX file.")
            return nil
        }
    } catch {
        print("Error extracting DOCX file: \(error)")
        return nil
    }
}
func getStringFrom(docURL: URL) -> String {
    print(#function)
    // Initialize an ordered set to store unique text content while preserving order
    var uniqueTexts = OrderedSet<String>()
    do {
        // Read XML file as string
        let xmlString = try String(contentsOf: docURL, encoding: .utf8)
        // Parse the XML string using SwiftSoup
        let document = try SwiftSoup.parse(xmlString)
        // Extract text content from XML document, preserving newline characters
        let elements = try document.select("body *").array() // Select all elements inside <body>
        for element in elements {
            // Get text content of the element
            let elementText = try element.text()
            // Insert non-empty element texts into the ordered set
            if !elementText.isEmpty {
                uniqueTexts.insert(elementText)
            }
        }
    } catch {
        // Handle any parsing or file reading errors and print an error message
        print("Error parsing XML file \(docURL.lastPathComponent): \(error)")
    }
    // Return the concatenated XML text with newline characters
    return uniqueTexts.arrayRepresentation().joined(separator: "\n")
}
struct OrderedSet<T: Hashable> {
    private var array = [T]()
    private var set = Set<T>()
    // Append the element only if it has not been seen before
    mutating func insert(_ element: T) {
        if !set.contains(element) {
            array.append(element)
            set.insert(element)
        }
    }
    func arrayRepresentation() -> [T] {
        return array
    }
}
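The problem I'm running into is that the parsed data is split strangely because of something called "gramStart". Looking at the XML, I can see that it sometimes separates the last word or words of a given line. For example, see below: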
<w:t xml:space="preserve">4 cups cooked white </w:t>
</w:r>
<w:proofErr w:type="gramStart"/>
<w:r w:rsidRPr="0052776F">
<w:rPr>
<w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica"/>
<w:kern w:val="0"/>
<w:sz w:val="27"/>
<w:szCs w:val="27"/>
<w14:ligatures w14:val="none"/>
</w:rPr>
<w:t>rice</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
</w:p>
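From what I can tell from the documentation, this tag appears when Word flags a phrase as needing a grammar check for some reason. It breaks my parsing because the pieces are interpreted as separate lines, i.e. I receive:
4 cups cooked white
rice
My question is: how can I avoid this? Can I 1) have SwiftSoup ignore that tag, 2) have ZipArchive ignore that tag when unzipping, or 3) use a different library to do this?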
Thanks!
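One possible direction, offered as a sketch under assumptions and not taken from the post: `w:proofErr` is an empty marker element that sits between runs, so the split seems to come from emitting each run's text as its own line, not from the tag itself. The example below swaps SwiftSoup for Foundation's `XMLParser` and concatenates every `w:t` run inside a `w:p` paragraph, which skips `w:proofErr` naturally because it carries no text. `ParagraphTextCollector`, `paragraphTexts(fromDocumentXML:)`, and the sample XML are hypothetical names and data introduced only for illustration.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Hypothetical helper (not from the original post): builds one string per
/// <w:p> paragraph by joining the text of its <w:t> runs. Marker elements
/// such as <w:proofErr/> contain no text, so they contribute nothing.
final class ParagraphTextCollector: NSObject, XMLParserDelegate {
    private(set) var paragraphs: [String] = []
    private var current = ""
    private var insideText = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String]) {
        // Namespace processing is off by default, so names keep the "w:" prefix.
        if elementName == "w:p" { current = "" }
        if elementName == "w:t" { insideText = true }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Only characters inside <w:t> are document text; the rest is layout whitespace.
        if insideText { current += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "w:t" { insideText = false }
        if elementName == "w:p" && !current.isEmpty { paragraphs.append(current) }
    }
}

/// Assumption: `data` holds the contents of word/document.xml.
func paragraphTexts(fromDocumentXML data: Data) -> [String] {
    let collector = ParagraphTextCollector()
    let parser = XMLParser(data: data)
    parser.delegate = collector
    parser.parse()
    return collector.paragraphs
}

// Usage with a reduced version of the fragment from the question.
let sampleXML = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">4 cups cooked white </w:t></w:r>
      <w:proofErr w:type="gramStart"/>
      <w:r><w:t>rice</w:t></w:r>
      <w:proofErr w:type="gramEnd"/>
    </w:p>
  </w:body>
</w:document>
"""
let lines = paragraphTexts(fromDocumentXML: Data(sampleXML.utf8))
print(lines.joined(separator: "\n"))
```

Grouping by paragraph also makes the `OrderedSet` de-duplication unnecessary, since each text run is visited exactly once. The sketch ignores `w:br`, `w:tab`, and paragraphs nested inside text boxes, which a fuller implementation would likely need to handle.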
A: No answers yet