Parsing docx with SwiftSoup and SSZipArchive yields unexpected results due to "gramstart" tag

Asked by: Moses Harding · Asked: 10/16/2023 · Updated: 10/16/2023 · Views: 11

Q:

I retrieve a docx via a web service call, unzip it with SSZipArchive, and then parse it with SwiftSoup. See the code below:

    if let xmlURL = self.extractDocxAttachment(data: data) {
        let string = self.getStringFrom(docURL: xmlURL)
        print(string)
    } else {
        print("Could not convert doc")
    }

    func extractDocxAttachment(data: Data) -> URL? {
        print(#function)
        
        do {
            // Save ZIP data to a temporary file
            let tempZipURL = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("temp.zip")
            try data.write(to: tempZipURL)
            
            // Extract ZIP archive using SSZipArchive
            let destinationDir = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("extracted-docx")
            let success = SSZipArchive.unzipFile(atPath: tempZipURL.path, toDestination: destinationDir.path)
            
            // Clean up temporary ZIP file
            try FileManager.default.removeItem(at: tempZipURL)
            
            if success {
                // Get URL of word/document.xml file
                let documentXMLFileURL = destinationDir.appendingPathComponent("word").appendingPathComponent("document.xml")
                return documentXMLFileURL
            } else {
                print("Failed to extract DOCX file.")
                return nil
            }
        } catch {
            print("Error extracting DOCX file: \(error)")
            return nil
        }
    }
    
    func getStringFrom(docURL: URL) -> String {
        print(#function)
        
        // Initialize an ordered set to store unique text content while preserving order
        var uniqueTexts = OrderedSet<String>()
        
        do {
            // Read XML file as string
            let xmlString = try String(contentsOf: docURL, encoding: .utf8)
            
            // Parse the XML string using SwiftSoup
            let document = try SwiftSoup.parse(xmlString)
            
            // Extract text content from XML document, preserving newline characters
            let elements = try document.select("body *").array() // Select all elements inside <body>
            for element in elements {
                // Get text content of the element
                let elementText = try element.text()
                
                // Insert non-empty element texts into the ordered set
                if !elementText.isEmpty {
                    uniqueTexts.insert(elementText)
                }
            }
        } catch {
            // Handle any parsing or file reading errors and print an error message
            print("Error parsing XML file \(docURL.lastPathComponent): \(error)")
        }
        
        // Return the concatenated XML text with newline characters
        return uniqueTexts.arrayRepresentation().joined(separator: "\n")
    }

    struct OrderedSet<T: Hashable> {
        private var array = [T]()
        private var set = Set<T>()

        // Append the element only if it has not been seen before
        mutating func insert(_ element: T) {
            if !set.contains(element) {
                array.append(element)
                set.insert(element)
            }
        }

        func arrayRepresentation() -> [T] {
            return array
        }
    }

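The problem I'm running into is that the parsed data is split oddly because of something called "gramStart". Looking at the XML, I can see that it sometimes separates the last word or words of a given line from the rest. For example: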

    <w:t xml:space="preserve">4 cups cooked white </w:t>
    </w:r>
    <w:proofErr w:type="gramStart"/>
    <w:r w:rsidRPr="0052776F">
    <w:rPr>
    <w:rFonts w:ascii="Helvetica" w:hAnsi="Helvetica"/>
    <w:kern w:val="0"/>
    <w:sz w:val="27"/>
    <w:szCs w:val="27"/>
    <w14:ligatures w14:val="none"/>
    </w:rPr>
    <w:t>rice</w:t>
    </w:r>
    <w:proofErr w:type="gramEnd"/>
    </w:p>

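From what I can tell from the documentation, this tag appears when Word flags a phrase as needing a grammar check for some reason. It breaks my parsing because the two runs are interpreted as separate lines, i.e. I receive: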

    4 cups cooked white
    rice

My question is, how can I avoid this? Can I 1) make SwiftSoup ignore that tag, 2) get ZipArchive to ignore that tag when unzipping, or 3) use a different extension to do this?

Thanks!
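
For illustration only: the XML fragment in the question suggests that the paragraph (`<w:p>`), rather than the individual run, is the natural unit for one line of text. The following is a minimal sketch of that idea, not the approach used in the code above. It assumes Foundation's `XMLParser` is an acceptable substitute for SwiftSoup at this step and that the input declares its namespaces the way a real `word/document.xml` does. It concatenates the text of every `<w:t>` inside a `<w:p>` and ignores all other elements, including `<w:proofErr>`. The names `ParagraphTextCollector` and `paragraphTexts(fromDocumentXML:)` are hypothetical.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Hypothetical helper: gathers the text of each <w:p> paragraph by
/// concatenating the contents of its <w:t> elements.
final class ParagraphTextCollector: NSObject, XMLParserDelegate {
    private(set) var paragraphs: [String] = []
    private var current = ""        // text of the paragraph being read
    private var insideText = false  // true while between <w:t> and </w:t>

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        if elementName == "w:p" { current = "" }
        if elementName == "w:t" { insideText = true }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Only characters inside <w:t> are document text
        if insideText { current += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "w:t" { insideText = false }
        if elementName == "w:p", !current.isEmpty {
            paragraphs.append(current)
            current = ""
        }
    }
}

/// Returns one string per non-empty paragraph in the given document.xml data.
func paragraphTexts(fromDocumentXML data: Data) -> [String] {
    let collector = ParagraphTextCollector()
    let parser = XMLParser(data: data)
    parser.delegate = collector
    parser.parse()
    return collector.paragraphs
}
```

With the code from the question, this could be called as `paragraphTexts(fromDocumentXML: try Data(contentsOf: xmlURL)).joined(separator: "\n")`. The sketch does not handle paragraphs nested inside other paragraphs (for example in text boxes), nor tabs or line breaks stored as `<w:tab/>` and `<w:br/>`.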

Tags: swift parsing docx ssziparchive swiftsoup

A: No answers yet.