Parsing a Job Site with XPath and the lxml Library in Python: a Job-Site Crawler Using lxml + XPath
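import requests
from lxml import etree


def get_job_info(url):
    """Fetch the page at url and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        # A timeout keeps the script from hanging on an unresponsive server.
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None


def first_text(node, path, default=''):
    """Return the first stripped text match for an XPath, or default if none."""
    result = node.xpath(path)
    return result[0].strip() if result else default


def parse_jobs(html):
    """Print the title, company, location, and salary of each job entry."""
    tree = etree.HTML(html)
    job_list = tree.xpath('//div[@class="job-list"]/div')
    for job in job_list:
        # A missing field yields '' instead of raising IndexError on [0].
        job_title = first_text(job, './/h3/text()')
        company_name = first_text(job, './/div[@class="company-name"]/a/text()')
        location = first_text(job, './/div[@class="location"]/text()')
        job_salary = first_text(job, './/div[@class="money"]/text()')
        print(f'Job title: {job_title}, Company: {company_name}, Location: {location}, Salary: {job_salary}')


def main():
    url = 'https://www.lagou.com/jobs/list_%E8%BD%AF%E4%BB%B6%E7%BC%96%E7%A8%8B%E5%B8%88?labelWords=label'
    html = get_job_info(url)
    if html:
        parse_jobs(html)


if __name__ == '__main__':
    main()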
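This code first defines get_job_info, which fetches the page content by sending an HTTP request with the Requests library. It then defines parse_jobs, which parses the page with lxml and uses XPath expressions to locate and extract each job's title, company name, location, and salary. Finally, main calls both functions to fetch job listings from Lagou (拉勾网) for the search keyword encoded in the URL (软件编程师, roughly "software programmer") and prints them.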
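To try the XPath expressions above without touching the network, the following is a minimal sketch that runs the same selectors against an inline HTML snippet. The markup in SAMPLE_HTML and the helper names first_text and extract_jobs are assumptions made for illustration. The snippet only mirrors the class names the selectors expect, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup that mirrors the class names used by the XPath
# expressions; it is NOT copied from the real site.
SAMPLE_HTML = """
<div class="job-list">
  <div>
    <h3>Python Developer</h3>
    <div class="company-name"><a href="#">Example Co</a></div>
    <div class="location"> Beijing </div>
    <div class="money">15k-25k</div>
  </div>
  <div>
    <h3>Data Engineer</h3>
    <div class="company-name"><a href="#">Sample Ltd</a></div>
    <div class="location"> Shanghai </div>
  </div>
</div>
"""


def first_text(node, path, default=""):
    """Return the first stripped text match for a relative XPath, or default."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


def extract_jobs(html):
    """Return one dict per job card instead of printing, so results are testable."""
    tree = etree.HTML(html)
    jobs = []
    # Each direct child <div> of the job-list container is treated as one card.
    for card in tree.xpath('//div[@class="job-list"]/div'):
        jobs.append({
            "title": first_text(card, ".//h3/text()"),
            "company": first_text(card, './/div[@class="company-name"]/a/text()'),
            "location": first_text(card, './/div[@class="location"]/text()'),
            # The second sample card has no salary element, so the default is used.
            "salary": first_text(card, './/div[@class="money"]/text()', default="N/A"),
        })
    return jobs


if __name__ == "__main__":
    for job in extract_jobs(SAMPLE_HTML):
        print(job)
```

Returning a list of dicts rather than printing inside the loop makes it easier to check the selectors against a saved copy of a page before pointing them at the live site.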