Tech Blog

2024-08-16

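In Python, parsing HTML with XPath is usually done with the lxml library. The following example uses XPath to parse a web page.

First, install lxml and requests (if they are not already installed):

pip install lxml requests

Then use the following code to fetch the page and extract the data you need: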




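from lxml import etree
import requests

# Send an HTTP request to fetch the page
url = 'http://example.com'
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page into an element tree that supports XPath queries
    html = etree.HTML(response.text)

    # Define the XPath expression that selects the data you need,
    # for example: //div[@class="content"]/ul/li/a
    xpath_expression = 'YOUR_XPATH_EXPRESSION_HERE'

    # Use the XPath expression to extract the matching elements
    results = html.xpath(xpath_expression)

    # Process the extracted elements
    for result in results:
        # With the sample expression above, each result is an <a> element:
        # read its href attribute and its text content
        href = result.get('href')
        text = result.text
        print(f'Href: {href}, Text: {text}')
else:
    print("Failed to retrieve the webpage")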

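In this example, replace YOUR_XPATH_EXPRESSION_HERE with the XPath expression for the data you want to extract. The expression tells the parser how to locate the elements you care about in the HTML document.

Note that writing an XPath expression requires a good understanding of the page's HTML structure. If the data you want is buried deep inside nested tags, you may need a more complex expression to locate it precisely.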
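To make the role of the expression concrete, here is a minimal sketch that applies the sample path //div[@class="content"]/ul/li/a to a small document. It deliberately swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so that it runs without extra packages. The markup, the SAMPLE name, and the extract_links helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE = """
<html>
  <body>
    <div class="content">
      <ul>
        <li><a href="/a">First</a></li>
        <li><a href="/b">Second</a></li>
      </ul>
    </div>
    <div class="sidebar">
      <ul><li><a href="/ignored">Ignored</a></li></ul>
    </div>
  </body>
</html>
"""


def extract_links(markup):
    """Return (href, text) pairs for links inside the div with class "content"."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a limited XPath subset; this path stays within it.
    anchors = root.findall(".//div[@class='content']/ul/li/a")
    return [(a.get("href"), (a.text or "").strip()) for a in anchors]


links = extract_links(SAMPLE)
for href, text in links:
    print(f"Href: {href}, Text: {text}")
```

The sidebar link is not selected because the predicate [@class='content'] restricts the match to the first div. With lxml, the same path can be passed to html.xpath() as in the example above.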

System

2024-08-16

All, Crawlers




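import tkinter as tk
import requests
from bs4 import BeautifulSoup

def fetch_quotes(symbol):
    """Return "name - price" for a stock symbol, or an error message."""
    url = f"https://finance.yahoo.com/quote/{symbol}"
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return "Error fetching quote"
    if not response.ok:
        return "Error fetching quote"
    soup = BeautifulSoup(response.text, 'html.parser')
    # These selectors depend on Yahoo Finance's markup and may need updating
    header = soup.find(id='quote-header-info')
    name_tag = header.find('h1') if header else None
    price_tag = soup.find(class_='My(b)')
    if name_tag is None or price_tag is None:
        return "Error parsing quote"
    return f"{name_tag.text} - {price_tag.text}"

def update_quote(symbol):
    # Read the symbol from the entry widget and show the result in the label
    quote = fetch_quotes(symbol.get())
    quote_label.config(text=quote)

# Build the window: a prompt, an input box, a button, and a result label
root = tk.Tk()
root.title("Yahoo Finance Stock Quote Fetcher")
root.geometry("400x200")

symbol_label = tk.Label(root, text="Enter stock symbol:")
symbol_label.pack()

symbol_entry = tk.Entry(root)
symbol_entry.pack()

fetch_button = tk.Button(root, text="Fetch Quote", command=lambda: update_quote(symbol_entry))
fetch_button.pack()

quote_label = tk.Label(root, text="")
quote_label.pack()

root.mainloop()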

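This code uses the tkinter library to build a simple user interface in which the user enters a stock symbol and, after clicking the button, gets that stock's name and price. The example shows how to integrate scraping logic into a graphical user interface and how to do simple GUI programming in Python.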

- Read more -

Getting to Know Crawlers: How to Use the requests Module to Simulate Browser Requests and Scrape Web Pages

System

2024-08-16

All, Crawlers




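import requests

# Target URL
url = 'http://example.com'

# Send a GET request with the requests module
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Success: print the page content
    print(response.text)
else:
    # Failure: print the status code
    print(f"Request failed, status code: {response.status_code}")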

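This code shows how to use Python's requests module to send a simple GET request and retrieve a web page. We first import the requests module and specify the URL we want to visit. We then call the module's get method to send a GET request to that URL. Finally, we check the response status code: 200 means the request succeeded, so we print the response text; any other value means the request failed, so we print the status code.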
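The same status-code check can be tried without touching a real website. The sketch below is an illustration only: it uses the standard library's urllib.request in place of requests, and it starts a throwaway local server so that both the 200 branch and the failure branch can be exercised. DemoHandler and fetch are hypothetical names introduced here.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Throwaway handler: 200 for "/", 404 for everything else."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body>hello</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# An opener with an empty ProxyHandler ignores any proxy environment variables.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, timeout=5):
    """Return (status_code, text); text is None when the server reports an error."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, None


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

ok_status, ok_text = fetch(base + "/")
missing_status, missing_text = fetch(base + "/missing")
print(ok_status, ok_text)
print(missing_status, missing_text)

server.shutdown()
server.server_close()
```

Unlike requests, urllib raises HTTPError for 4xx and 5xx responses, which is why the failure branch lives in an except clause here rather than in an if/else on the status code.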

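- Read more -

Python Crawlers "From Getting Started to Going to Jail": This Article May Change Your Mind (Why Do Crawler Authors Go to Jail?)

System

2024-08-16

All, Crawlers

A crawler is a program that automatically extracts data from web pages and is used for data collection. It can gather large amounts of information quickly, but it may also trip a website's anti-crawler mechanisms, which can lead to account bans or even legal trouble.

Installing the requests and BeautifulSoup libraries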




pip install requests
pip install beautifulsoup4

A simple crawler example




import requests
from bs4 import BeautifulSoup
 
url = 'http://example.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
 
for link in soup.find_all('a'):
    print(link.get('href'))

In this example, we use the requests library to fetch the page, parse it with BeautifulSoup, and print the href attribute of every a tag, that is, every link.
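In practice, some a tags have no href at all and many href values are relative paths. The following sketch shows one way to skip the former and resolve the latter against the page URL. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; LinkCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip <a> tags that have no href
            self.links.append(urljoin(self.base_url, href))


# Made-up markup: one relative link, one tag without href, one absolute link.
SAMPLE = (
    '<p><a href="/about">About</a> <a>no link</a> '
    '<a href="https://example.org/x">X</a></p>'
)

collector = LinkCollector("http://example.com/index.html")
collector.feed(SAMPLE)
print(collector.links)
```

With BeautifulSoup, the same idea amounts to checking that link.get('href') is not None and passing it through urljoin before using it.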

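Anti-crawler measures

Websites may block crawlers in various ways, such as dynamic JavaScript rendering, cookie validation, and IP bans.

Handling JavaScript-rendered pages

Many websites today render their content dynamically with JavaScript; tools such as Selenium or Pyppeteer can handle these pages.

Using proxies and setting request headers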




import requests

# Route traffic through a proxy (replace with a proxy you are allowed to use)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}

# Send a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

response = requests.get('http://example.com', headers=headers, proxies=proxies, timeout=10)

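Countering anti-crawler measures

If a website has sophisticated anti-crawler mechanisms, you can try the following strategies:

Rotate IP addresses through proxies
Limit the crawl rate
Vary the User-Agent header to mimic different browsers
Log in with an authenticated account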
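Two of the strategies above, limiting the crawl rate and varying the User-Agent, can be sketched with the standard library alone. This is an illustration under assumptions rather than a recommended configuration: the Throttle class, the next_headers helper, and the two User-Agent strings are example values introduced here.

```python
import itertools
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Example User-Agent values only; use strings appropriate for your crawler.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]
ua_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers carrying the next User-Agent in the rotation."""
    return {"User-Agent": next(ua_cycle)}


throttle = Throttle(min_interval=0.05)  # short interval to keep the demo fast
for page in range(3):
    throttle.wait()
    headers = next_headers()
    # A real crawler would send its request here, e.g. with headers=headers.
    print(page, headers["User-Agent"][:40])
```

The clock and sleep parameters exist so the throttle can be tested with a fake clock; in normal use the defaults are enough.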

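Crawlers and the law

Follow the website's robots.txt rules and do not crawl pages it disallows. Before collecting data, make sure you are authorized to access it, and comply with the applicable laws and regulations.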
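Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the "my-crawler" user agent name, and the allowed helper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
print(parser.crawl_delay("my-crawler"))
```

Against a live site, set_url() followed by read() fetches and parses the site's robots.txt instead of the inline string.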

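These are the basic concepts and strategies for getting started with crawlers. In real projects, you may need to adjust both the approach and the code to the anti-crawler measures of the specific website.

- Read more -

Advanced Web Crawler Practice: Scraping Weibo Page Content

System

2024-08-16

All, Crawlers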




import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Fetch the HTML of a Weibo user's home page
def get_page_content(url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        return None

# Parse the page and extract each post's text and counts
def parse_weibo_data(html):
    soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
    weibo_data = []
    for container in soup.find_all('div', class_='c'):
        weibo = {}
        # Extract the post text; skip containers that have none
        text = container.find('span', class_='ctt')
        if text:
            weibo['text'] = text.text.strip()
        else:
            continue
        # Extract the repost ('转发') and comment ('评论') counts
        for info in container.find_all('span', class_='cc'):
            match = re.search(r'\d+', info.text)
            if match is None:
                continue  # label without a number
            if '转发' in info.text:
                weibo['retweet_count'] = match.group(0)
            elif '评论' in info.text:
                weibo['comment_count'] = match.group(0)
        weibo_data.append(weibo)
    return weibo_data

# Save the extracted posts to a CSV file
def save_weibo_data_to_csv(weibo_data, file_name):
    df = pd.DataFrame(weibo_data)
    df.to_csv(file_name + '.csv', index=False, encoding='utf-8-sig')

# Entry point; since_date is used only in the output file name
def main(user_id, since_date, count):
    url = f'https://weibo.com/p/100505{user_id}/home?is_search=0&visible=0&is_all=1&since_id=0&sort=time&page={count}'
    html = get_page_content(url)
    if html:
        weibo_data = parse_weibo_data(html)
        save_weibo_data_to_csv(weibo_data, f'weibo_data_{user_id}_{since_date}_{count}')
        print(f'Weibo data saved to weibo_data_{user_id}_{since_date}_{count}.csv')
    else:
        print('Failed to fetch the page')

# Example: user ID 1234567890, start date 2023-01-01, page 1
main(1234567890, '2023-01-01', 1)

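In this example, we define a main function that takes a user ID, a start date, and a page number, and saves the Weibo data to a CSV file. The example shows how to scrape a page with Python, extract specific data, and save it to a file. Note that…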
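Separately from the excerpt above, the extract-and-save step can be tried in isolation. The sketch below is one possible way to turn count labels into integers and to write rows that are missing a field, using only the standard library's csv module instead of pandas. The parse_count and rows_to_csv helpers and the label strings are made up for illustration.

```python
import csv
import io
import re


def parse_count(text):
    """Return the first integer found in text, or 0 when there is none."""
    match = re.search(r"\d+", text)
    return int(match.group(0)) if match else 0


def rows_to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text, filling missing fields with 0."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical labels as they might appear next to a post.
rows = [
    {
        "text": "first post",
        "retweet_count": parse_count("转发[12]"),
        "comment_count": parse_count("评论[3]"),
    },
    # The second row has no comment label at all.
    {"text": "second post", "retweet_count": parse_count("转发")},
]

output = rows_to_csv(rows, ["text", "retweet_count", "comment_count"])
print(output)
```

Writing to an io.StringIO buffer keeps the example free of side effects; passing an open file object to csv.DictWriter writes to disk instead.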

- Read more -

Python Crawler Tutorial: A Tmall Product Data Crawler

System

2024-08-16

All, Crawlers




import requests
from lxml import etree

class TianMaoSpider:
    def __init__(self, keyword, page_num):
        self.keyword = keyword      # search keyword
        self.page_num = page_num    # number of result pages to crawl
        self.base_url = "https://s.taobao.com/search?q="

    def get_response(self, url):
        """Fetch a URL and return its HTML, or None on failure."""
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException:
            return None
        if response.status_code == 200:
            return response.text
        return None

    def parse_page(self, page_url):
        """Print the title and price of every product found on one result page."""
        html = self.get_response(page_url)
        if html:
            parser = etree.HTML(html)
            items = parser.xpath('//div[@class="item J_MouserOnverReq  "]')
            for item in items:
                titles = item.xpath('.//div[@class="title"]/a/text()')
                prices = item.xpath('.//div[@class="price"]/strong/text()')
                if not titles or not prices:
                    continue  # skip items that lack a title or a price
                print(f"Product: {titles[0].strip()}, Price: {prices[0]}")

    def run(self):
        # The "s" query parameter advances by 44 for each result page
        for page in range(1, self.page_num + 1):
            page_url = f"{self.base_url}{self.keyword}&s={(page-1)*44}"
            self.parse_page(page_url)

if __name__ == "__main__":
    spider = TianMaoSpider("口罩", 2)  # search for "口罩" (face masks), 2 pages
    spider.run()

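This code implements a simple Tmall product data crawler (the requests go to the s.taobao.com search endpoint). It defines a TianMaoSpider class with methods for initialization, fetching a response, parsing a page, and running the crawler. The __init__ method stores the search keyword and the number of pages to crawl. The get_response method sends the request and returns the response body. The parse_page method parses a page, extracts each product's title and price, and prints them. Finally, the run method loops over the pages and calls parse_page on each one. This simple example shows how to do basic web scraping with Python.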
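The page loop in run builds each URL by string concatenation. As a sketch of an alternative, the query string can be assembled with urllib.parse.urlencode, which percent-encodes the keyword explicitly. The build_page_urls helper and the PAGE_SIZE constant are names introduced here; the value 44 is taken from the code above.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
PAGE_SIZE = 44  # offset step used by the run() method above


def build_page_urls(keyword, page_num):
    """Return the search URL for each page, from page 1 to page_num."""
    urls = []
    for page in range(1, page_num + 1):
        query = urlencode({"q": keyword, "s": (page - 1) * PAGE_SIZE})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_page_urls("口罩", 2)
for url in urls:
    print(url)
```

Keeping URL construction in its own function also makes the pagination logic easy to check without sending any requests.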

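- Read more -

Simulating Login in a Python 3 Crawler

System

2024-08-16

All, Crawlers

To simulate a login, use Python's requests library to send HTTP requests and manage cookies through a Session object. The following is a basic example: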




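import requests

# Login URL
login_url = 'http://example.com/login'

# Create a Session object so that cookies persist across requests
session = requests.Session()

# Login parameters, such as the username and password
login_params = {
    'username': 'your_username',
    'password': 'your_password'
}

# Send a POST request to log in
response = session.post(login_url, data=login_params, timeout=10)

# Check whether the login request succeeded.
# response.ok only means the status code is below 400; some sites
# return 200 even when the credentials are rejected.
if response.ok:
    print('Login succeeded')
    # After logging in, keep using the session for further requests,
    # for example to fetch the user's profile
    user_url = 'http://example.com/user'
    user_response = session.get(user_url, timeout=10)
    if user_response.ok:
        print('User info:', user_response.text)
    else:
        print('Failed to fetch user info')
else:
    print('Login failed')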

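Note that a real login may involve cookies, CSRF tokens, custom headers, and other security measures, which you may have to handle manually or with a dedicated library. The code above is only a simple example; adapt it to the target website before using it.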
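As an illustration of the CSRF point, many login forms carry the token in a hidden input that has to be sent back with the credentials. The sketch below collects hidden fields from a form using the standard library's html.parser. The form markup, the csrf_token field name, and the helper names are hypothetical; a real site may name or deliver its token differently.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Made-up login page; the csrf_token field name is an assumption.
LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""


def build_login_payload(page_html, username, password):
    """Merge the form's hidden fields with the user's credentials."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload.update({"username": username, "password": password})
    return payload


payload = build_login_payload(LOGIN_PAGE, "your_username", "your_password")
print(payload)
```

With requests, the login page would be fetched through the same Session that later sends the POST, so that the cookie issued alongside the token is sent back with it.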

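- Read more -

Puppeteer! Programmers Who Still Can't Write a Crawler?

System

2024-08-16

All, Crawlers

Puppeteer is a Node library that provides a high-level API for controlling Chrome or Chromium. If you can build a crawler with Puppeteer, you have already mastered the basics of the tool.

The following is a simple example of scraping a web page with Puppeteer: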




const puppeteer = require('puppeteer');

async function fetchHTML(url) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        // Read the rendered <body> markup from inside the page
        const html = await page.evaluate(() => document.body.innerHTML);
        return html;
    } finally {
        // Always close the browser, even if navigation fails
        await browser.close();
    }
}

// Usage:
fetchHTML('https://example.com').then(html => console.log(html));

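This code launches a new browser instance, opens a new page, visits the given URL, reads the page's HTML content, and returns it once the browser instance has been closed. It is a very basic crawler example; a real crawler may have to deal with more complex situations, such as logging in, handling AJAX requests, and coping with anti-crawler measures.

- Read more -

How to Build a Web Crawler with Node and Puppeteer

System

2024-08-16

All, Crawlers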




const puppeteer = require('puppeteer');

async function crawlWithPuppeteer(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    // Open a new page
    const page = await browser.newPage();
    // Navigate to the URL
    await page.goto(url);
    // Wait until the <body> element is present
    await page.waitForSelector('body');

    // Extract the page title
    const title = await page.title();
    console.log('Page title:', title);

    // Close the browser
    await browser.close();
}

// Usage: crawlWithPuppeteer('https://example.com');

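This code shows how to use Puppeteer to launch a browser instance, open a new page, navigate to the given URL, and wait for the page to load. It then extracts the page title, prints it to the console, and closes the browser. This is a basic step in the crawling process that can be extended to extract more data.

- Read more -

Python Crawler in Practice: Scraping News Data (A Simple Deep Crawler)

System

2024-08-16

All, Crawlers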




import requests
from bs4 import BeautifulSoup
import time
import random
import csv

def get_html(url, headers):
    """Fetch a URL and return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        return None

def parse_html(html, keyword):
    """Yield the detail-page URL of every article whose title contains keyword."""
    if not html:
        return  # the list page could not be fetched
    soup = BeautifulSoup(html, 'lxml')
    news_list = soup.select('div.news-box > ul > li > a')
    for news in news_list:
        heading = news.select_one('h3')
        if heading is None:
            continue  # link without a title element
        title = heading.text
        if keyword.lower() in title.lower():
            url = news['href']
            print(f'Crawling article: {title}')
            yield url

def get_news_detail(url, headers):
    """Return the article text of a detail page, or None if unavailable."""
    html = get_html(url, headers)
    if not html:
        return None
    soup = BeautifulSoup(html, 'lxml')
    node = soup.select_one('div.article-content')
    return node.text if node else None

def save_to_csv(data, filename):
    # Append one row to the CSV file
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(data)

def main(keyword):
    base_url = 'https://news.sina.com.cn/china/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Cookie': 'your_cookie_here'
    }
    filename = f'{keyword}_news.csv'
    for url in parse_html(get_html(base_url, headers), keyword):
        content = get_news_detail(url, headers)
        if content is None:
            continue  # skip articles that could not be fetched or parsed
        save_to_csv([url, content], filename)
        time.sleep(random.uniform(1, 3))  # random wait to reduce the risk of an IP ban

if __name__ == '__main__':
    keyword = '科技'  # "technology"
    main(keyword)

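This code implements a simple deep crawler. It first fetches the HTML of the news list page, then parses it to extract article titles and URLs, keeps the articles whose titles contain the given keyword, and passes each detail-page URL to another function that fetches the article text. The article text is then saved to a CSV file. To avoid being banned by the target website, the crawler waits a random amount of time before requesting the next URL.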
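The two-level flow described above (list page first, then each matching detail page) can be exercised without network access by injecting the fetch function. The sketch below is an illustration under assumptions: FAKE_SITE is an in-memory stand-in for a website, and crawl is a hypothetical helper that also skips detail URLs it has already visited.

```python
import csv
import io

# Hypothetical in-memory "site": one list page and two detail pages.
FAKE_SITE = {
    "/list": [
        ("Tech news today", "/news/1"),
        ("Sports roundup", "/news/2"),
        ("More tech", "/news/1"),  # duplicate link to the same article
    ],
    "/news/1": "Article one body",
    "/news/2": "Article two body",
}


def crawl(list_url, keyword, fetch, delay=lambda: None):
    """Visit the list page, then each matching detail page exactly once."""
    seen = set()
    rows = []
    for title, url in fetch(list_url):
        if keyword.lower() not in title.lower() or url in seen:
            continue
        seen.add(url)
        content = fetch(url)
        if content is None:  # skip pages that failed to download
            continue
        rows.append([url, content])
        delay()  # a real crawler might call time.sleep(random.uniform(1, 3))
    return rows


rows = crawl("/list", "tech", FAKE_SITE.get)

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

In a real crawler, fetch would download and parse the page and delay would sleep for a random interval; passing them in as parameters keeps the crawl logic testable on its own.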

- Read more -