Python in Practice | Extracting Tables (with Incomplete Borders) from a PDF

Author: System  Date: August 16, 2024  Categories: All, python  Word count: 576

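import camelot

# Extract tables from a PDF file
def extract_tables(pdf_path):
    # flavor='stream' groups text by the whitespace between words, so it
    # works on tables whose ruling lines are missing or incomplete.
    # strip_text removes newline characters from the text of each cell.
    tables = camelot.read_pdf(pdf_path, pages='1', flavor='stream', strip_text='\n')
    return tables

# Example usage
pdf_path = 'example.pdf'  # replace with the path to your PDF file
tables = extract_tables(pdf_path)

# Print the extracted tables
for idx, table in enumerate(tables):
    print(f"Table {idx+1}")
    print(table.df)  # table.df is the pandas DataFrame holding the cell text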

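Make sure the camelot library is installed; you can install it with pip install camelot-py[cv]. The code tries to extract the tables on page '1' of the given PDF file and prints them. When a table's borders are incomplete, flavor='stream' is the mode to use, because it infers rows and columns from the spacing of the text rather than from drawn lines. Recognition can usually be improved by tuning stream options such as row_tol, edge_tol, and columns, or by restricting parsing to a region with table_areas. strip_text only cleans unwanted characters out of the cell text, and line_scale applies only to flavor='lattice', where it controls how short a drawn line can be and still be detected.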
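
Since camelot's stream and lattice modes take different tuning options, it can be convenient to filter a dictionary of options by flavor before calling read_pdf. The following is only a minimal sketch: options_for_flavor and read_tables are hypothetical helpers, not part of camelot, and the two sets of option names are an assumption about which keyword arguments camelot treats as flavor-specific, so check them against the version you have installed.

```python
# Assumed flavor-specific option names (verify against your camelot version).
STREAM_ONLY = {"columns", "edge_tol", "row_tol", "column_tol"}
LATTICE_ONLY = {
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def options_for_flavor(flavor, options):
    """Return a copy of `options` without keys that belong to the other flavor.

    Hypothetical helper. Any flavor other than 'stream' or 'lattice'
    passes its options through unchanged.
    """
    blocked = {"stream": LATTICE_ONLY, "lattice": STREAM_ONLY}.get(flavor, set())
    return {key: value for key, value in options.items() if key not in blocked}


def read_tables(pdf_path, flavor="stream", pages="1", **options):
    """Call camelot.read_pdf with only the options that fit `flavor`."""
    import camelot  # imported lazily so the filter above works without camelot

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor=flavor, **options_for_flavor(flavor, options)
    )


if __name__ == "__main__":
    requested = {"line_scale": 40, "row_tol": 10, "strip_text": "\n"}
    # Only the options that make sense for stream mode are kept.
    print(options_for_flavor("stream", requested))
```

With this in place, read_tables('example.pdf', flavor='stream', row_tol=10, line_scale=40) would forward row_tol and drop line_scale. Silently dropping an option is a design choice made here for brevity; raising an error instead may be preferable if you want typos and mismatches to be visible.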
