Below is a simple Python crawler example that uses the requests library and BeautifulSoup to fetch and parse web pages, and the threading library to download multiple pages concurrently.
import requests
from bs4 import BeautifulSoup
import threading

def download_page(url, page_number):
    print(f"Downloading page {page_number}...")
    response = requests.get(f"{url}/list-{page_number}.html")
    if response.status_code == 200:
        return response.text
    return None

def parse_page(html_content, page_number):
    # A failed download returns None; skip parsing in that case.
    if html_content is None:
        return
    soup = BeautifulSoup(html_content, 'html.parser')
    for link in soup.find_all('a', href=True):
        if 'video' in link['href']:
            print(f"Found video link on page {page_number}: {link['href']}")

def download_and_parse(url, page_number):
    # Download inside the worker thread. Calling download_page() in the
    # Thread(...) argument list would run it in the main thread and
    # serialize every download, defeating the point of multithreading.
    parse_page(download_page(url, page_number), page_number)

def main():
    base_url = 'http://www.hahavip.com'
    start_page = 1
    end_page = 100
    threads = []
    for page_number in range(start_page, end_page + 1):
        t = threading.Thread(target=download_and_parse, args=(base_url, page_number))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
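
Spawning one thread per page (100 here) can strain both the local machine and the target site. One common refinement is to cap concurrency with the standard library's concurrent.futures.ThreadPoolExecutor. The sketch below reuses the download_and_parse function defined above; the max_workers=10 cap is an illustrative assumption, not a value required by any particular site.

from concurrent.futures import ThreadPoolExecutor

def main_pooled():
    base_url = 'http://www.hahavip.com'
    # max_workers=10 is an assumed, illustrative concurrency cap
    with ThreadPoolExecutor(max_workers=10) as pool:
        for page_number in range(1, 101):
            pool.submit(download_and_parse, base_url, page_number)
    # leaving the with-block waits for all submitted tasks to finish

This keeps at most 10 downloads in flight at once while still processing all 100 pages, and the pool handles thread creation and joining for you.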
Note that in real-world use you should respect the site's robots.txt rules and crawl only within what they permit, avoid putting excessive load on the server, and account for anti-scraping mechanisms, which may require setting proxies, sending cookies, or adding suitable delays between requests.
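
As a minimal sketch of those precautions, the standard library's urllib.robotparser can check whether a URL is allowed before fetching it, and a short time.sleep between requests keeps the crawl rate polite. The User-Agent string and the one-second delay below are illustrative assumptions, not values any particular site requires.

import time
import urllib.robotparser

import requests

# Assumed, illustrative User-Agent string
HEADERS = {'User-Agent': 'Mozilla/5.0 (example-crawler)'}

def polite_get(url, robot_parser, delay=1.0):
    # Skip URLs that the site's robots.txt disallows for our user agent.
    if not robot_parser.can_fetch(HEADERS['User-Agent'], url):
        print(f"Blocked by robots.txt: {url}")
        return None
    response = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(delay)  # fixed pause between requests to limit server load
    return response.text if response.status_code == 200 else None

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.hahavip.com/robots.txt')
rp.read()
html = polite_get('http://www.hahavip.com/list-1.html', rp)

When combining this with the multithreaded version above, note that a per-thread sleep only spaces out requests within each thread; a shared rate limiter or a small thread pool is needed to bound the overall request rate.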