Day 005 | Python Web Scraping: Programming Techniques for Efficient Data Extraction (Crawler Efficiency)
import requests
from bs4 import BeautifulSoup
import time

def get_html(url):
    """Fetch a page and return its decoded text, or None on failure."""
    headers = {
        # A browser-like User-Agent makes the request look like a normal visit
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # raise on 4xx/5xx status codes
        # Guess the encoding from the body; more reliable than the headers alone
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return None

def parse_html(html):
    """Extract the article body from the first div with class 'post-text'."""
    soup = BeautifulSoup(html, 'html.parser')
    post = soup.find('div', class_='post-text')
    return post.text if post else ''

def main():
    start_time = time.time()
    url = 'https://www.example.com/some-article'
    html = get_html(url)
    if html is not None:  # skip parsing if the request failed
        text = parse_html(html)
        print(f"The article text is: {text}")
    execution_time = time.time() - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

if __name__ == "__main__":
    main()
This code shows how to scrape web page data with Python's requests and BeautifulSoup libraries. The exception handling and the browser-like request headers make the request resemble a regular browser visit, which reduces the chance of being blocked and improves the success rate. The time module measures how long the scrape takes, giving a concrete baseline for evaluating crawler efficiency.
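Once a single fetch is timed, the natural next step for efficiency is to reuse TCP connections and download pages concurrently, since page fetching is I/O-bound. The sketch below is a minimal illustration of that idea, not part of the original code: the URL list is hypothetical, and max_workers = 5 is just an example value. It combines requests.Session (connection pooling) with concurrent.futures.ThreadPoolExecutor (parallel downloads).

import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical URL list for illustration; replace with real targets.
URLS = [f'https://www.example.com/article-{i}' for i in range(10)]

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/58.0.3029.110 Safari/537.3'}

# A Session reuses the underlying TCP connection across requests to the
# same host, avoiding a fresh handshake for every page. Note that requests
# does not formally guarantee Session is thread-safe; sharing one for
# simple GETs is common practice, but for strict safety use one per thread.
session = requests.Session()
session.headers.update(HEADERS)

def fetch(url):
    """Fetch one page through the shared session; return text or None."""
    try:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return None

start = time.time()
# Threads overlap the network wait time of individual downloads;
# max_workers is a tuning knob, not a universally correct value.
with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(fetch, URLS))
print(f"Fetched {sum(p is not None for p in pages)} pages "
      f"in {time.time() - start:.2f} seconds")

On a list of pages from the same site, the timing printout makes the gain from connection reuse and concurrency easy to compare against the sequential version above. Keep the target site's rate limits and robots.txt in mind before raising max_workers.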