[Python] Scraping web page text with XPath and saving it to a txt file (with text decoding code)
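import requests
from lxml import etree


def save_text_to_txt(text, file_path):
    # Write the text to file_path as UTF-8, overwriting any existing file
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(text)


def crawl_text_with_xpath(url, xpath_query):
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        # Return None so the caller does not save an error message as page text
        return None
    # Decode using the encoding detected from the body; this avoids garbled
    # text when the server does not declare a charset
    response.encoding = response.apparent_encoding
    # Parse the HTML content with etree
    tree = etree.HTML(response.text)
    # Extract the text nodes with the XPath query
    text = tree.xpath(xpath_query)
    # Join the extracted text nodes into one string and return it
    return ''.join(text)


# Example URL and XPath query
url = 'http://example.com'
xpath_query = '//div[@class="content"]//text()'

# Run the crawler and save the result to a txt file
crawled_text = crawl_text_with_xpath(url, xpath_query)
if crawled_text is not None:
    save_text_to_txt(crawled_text, 'crawled_text.txt')
else:
    print('Failed to retrieve the webpage')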
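This code defines two functions: save_text_to_txt writes a string to a TXT file, and crawl_text_with_xpath scrapes text from a web page with an XPath query. The comments inside crawl_text_with_xpath describe each step. The function fetches the page with the requests library, parses the HTML with etree.HTML, and runs the XPath query to extract the text. Because the example query ends in //text(), tree.xpath returns a list of strings, which is why the result can be joined with ''.join. Finally, we call the function and save the result to a TXT file.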
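Chinese pages are often served as GBK or GB2312 rather than UTF-8, and decoding them with the wrong encoding produces garbled characters. As a minimal sketch of the decoding step, the helper below tries a list of candidate encodings in order. The name decode_html and the default candidate list are assumptions for illustration, not part of requests or lxml.

```python
def decode_html(raw, candidates=("utf-8", "gb18030")):
    """Decode raw page bytes, trying each candidate encoding in order.

    decode_html and the default candidates are hypothetical; adjust the
    list to the encodings you expect from the sites you scrape.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one
            continue
    # Last resort: decode with the first candidate and replace bad bytes
    return raw.decode(candidates[0], errors="replace")


# The same text encoded two ways decodes to the same string
gbk_bytes = "网页内容".encode("gbk")
utf8_bytes = "网页内容".encode("utf-8")
print(decode_html(gbk_bytes))
print(decode_html(utf8_bytes))
```

With requests, the raw bytes are available as response.content, so the decoded string could be passed to etree.HTML in place of response.text. UTF-8 is tried first because it is strict and rejects most GBK byte sequences, while gb18030 is a superset of GBK and GB2312.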