Converting Word DOC or DOCX to and from HTML in Python

Author: System | Date: August 19, 2024 | Categories: All, html | Word count: 1000





import html

from bs4 import BeautifulSoup
from docx import Document


def docx_to_html(file_path):
    """Convert a Word DOCX file to an HTML string, one <p> per non-empty paragraph."""
    doc = Document(file_path)
    parts = []
    for para in doc.paragraphs:
        # para.text already joins the text of every run, so the runs must not
        # be emitted again or each paragraph would appear twice.
        if para.text:
            # Escape &, < and > so document text cannot break the markup.
            parts.append(f'<p>{html.escape(para.text)}</p>')
    return ''.join(parts)


def html_to_docx(html_content, output_file):
    """Convert HTML content to a Word DOCX file, one paragraph per <p> element."""
    document = Document()
    soup = BeautifulSoup(html_content, 'html.parser')
    for p in soup.find_all('p'):
        # get_text() drops nested tags and decodes entities such as &amp;.
        document.add_paragraph(p.get_text())
    document.save(output_file)


# Example usage
html_content = docx_to_html('example.docx')
print(html_content)
html_to_docx(html_content, 'example_converted.docx')

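This code provides two functions, docx_to_html and html_to_docx, which convert a Word DOCX file to an HTML string and convert HTML content back to a DOCX file, respectively. The example assumes the document is a flat sequence of paragraphs with no complex structure or styling. More complex conversions need to handle paragraph styles, images, lists, and similar elements explicitly. Note that python-docx reads only the .docx format, so a legacy .doc file must first be saved as .docx (for example in Word or LibreOffice) before it can be converted this way.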
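As one illustration of handling paragraph styles, the sketch below maps heading styles to <h1> through <h6> and wraps bold and italic runs in <strong> and <em>. It is a minimal sketch under stated assumptions, not a complete converter: the helper names run_to_html, paragraph_to_html, and docx_to_rich_html are hypothetical, the heading detection assumes the default English style names ("Heading 1", "Heading 2", ...), and only formatting set explicitly on a run is detected, because run.bold and run.italic are None when the value is inherited from a style. Lists, images, and tables are still ignored.

```python
import html
import re

# Assumption: the document uses the default English style names.
HEADING_RE = re.compile(r"Heading ([1-6])$")


def run_to_html(run):
    """Render one run, adding <em>/<strong> only for explicitly set formatting."""
    text = html.escape(run.text)
    if not text:
        return ""
    if run.italic:  # None (inherited) and False are both treated as "not italic"
        text = f"<em>{text}</em>"
    if run.bold:  # None (inherited) and False are both treated as "not bold"
        text = f"<strong>{text}</strong>"
    return text


def paragraph_to_html(para):
    """Render one paragraph as <h1>..<h6> for heading styles, otherwise <p>."""
    body = "".join(run_to_html(run) for run in para.runs)
    if not body:
        return ""
    match = HEADING_RE.match(para.style.name or "")
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{body}</{tag}>"


def docx_to_rich_html(file_path):
    """Convert a DOCX file to HTML, keeping headings and bold/italic runs."""
    # Imported here so the two helpers above work without python-docx installed.
    from docx import Document

    doc = Document(file_path)
    rendered = (paragraph_to_html(para) for para in doc.paragraphs)
    return "\n".join(part for part in rendered if part)
```

Calling docx_to_rich_html('example.docx') returns one HTML element per non-empty paragraph, separated by newlines. Keeping the rendering helpers separate from the file loading makes it straightforward to extend them later, for example with list or table handling.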
