Articles in the category "Crawlers"

Python 3 Web Crawler Development in Practice, Chapter 3: Parsing and Extracting Web Data with pyquery

2024-08-13




from pyquery import PyQuery as pq

# Sample HTML string
html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
    </ul>
</div>
'''

# Parse the HTML string with pyquery
doc = pq(html)

# Extract the text of every li element with class item-0
items = [item.text() for item in doc('.list .item-0').items()]
print(items)  # Output: ['first item', 'third item']

# Extract the href attribute of every a element inside an li with class item-1
links = [link.attr('href') for link in doc('.list .item-1 a').items()]
print(links)  # Output: ['link2.html', 'link4.html']

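This example shows how to use the pyquery library to parse an HTML string and extract the text or attributes of specific elements. The code first defines an HTML string and parses it with pyquery's pq() function. It then uses CSS selectors to locate specific elements, iterates over the matches with the .items() method, and finally calls .text() or .attr() to extract the text content or an attribute.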

2024-08-13




import csv
import time

import requests
from lxml import etree

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}

# CSV column headers
csv_headers = ['Product name', 'Price', 'Rating', 'Seller', 'Product link']

# Proxy server (example address; replace it with a proxy you can actually use)
proxy = {'http': 'http://120.77.138.138:80', 'https': 'https://120.77.138.138:80'}

# Request timeout in seconds
timeout = 10

# Maximum number of attempts per page
max_retries = 5


def first(nodes):
    """Return the first XPath result with whitespace stripped, or '' if nothing matched."""
    return nodes[0].strip() if nodes else ''


# Create the CSV file and write the header row
with open('temu_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_headers)
    writer.writeheader()

    # Crawl pages 1 through 10
    for page in range(1, 11):
        print(f'Crawling page {page}...')
        url = f'https://www.temu.com/search/?q=%E6%B5%8B%E8%AF%95&sort=rank&page={page}'

        # Retry loop: the attempt counter starts fresh for every page
        response = None
        for attempt in range(1, max_retries + 1):
            try:
                response = requests.get(url, headers=headers, proxies=proxy, timeout=timeout)
                response.raise_for_status()  # Raise on 4xx/5xx responses
                break
            except requests.exceptions.RequestException as e:
                print(f'Request failed (attempt {attempt}/{max_retries}): {e}')
                response = None
                time.sleep(5)  # Wait 5 seconds before retrying

        # Skip the page if every attempt failed, instead of parsing a missing response
        if response is None:
            print(f'Giving up on page {page}.')
            continue

        # Parse the HTML
        tree = etree.HTML(response.text)
        product_items = tree.xpath('//div[@class="product-item"]')

        for product_item in product_items:
            name = first(product_item.xpath('.//h3/a/text()'))
            price = first(product_item.xpath('.//div[@class="product-price"]/span/text()'))
            score = first(product_item.xpath('.//div[@class="product-score"]/text()'))
            seller = first(product_item.xpath('.//div[@class="product-seller"]/text()'))
            link = first(product_item.xpath('.//h3/a/@href'))

            # Write the row to the CSV file
            writer.writerow({
                'Product name': name,
                'Price': price,
                'Rating': score,
                'Seller': seller,
                'Product link': link
            })

        print(f'Finished page {page}.\n')
        time.sleep(2)  # Pause 2 seconds to avoid putting too much load on the server

print('All pages crawled.')

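This code uses the requests library to send HTTP requests and lxml to parse the HTML, and it saves the scraped data with the csv module. The retry mechanism handles failed network requests by trying them again. Finally, the code prints crawl status to the console, and on completion…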
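The retry mechanism described above can also be factored into a small reusable helper. The following is a minimal sketch using only the standard library; the names fetch_with_retries and flaky_fetch are hypothetical, the exponential backoff is an assumption rather than what the original script does (it waits a fixed 5 seconds), and the simulated failure stands in for a real network error.

```python
import time


def fetch_with_retries(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # a real crawler would catch requests.exceptions.RequestException
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff: 1s, 2s, 4s, ... with the default base_delay
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error


# Hypothetical stand-in for a network call: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


delays = []  # record the delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)  # <html>ok</html> [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the helper easy to test, since the demo can record the delays without waiting.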


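[Xiaomu Learns Python] Web Crawling with lxml

System

2024-08-12

All, Crawlers

In Python, lxml is a fast, flexible, and easy-to-use library for processing HTML and XML. It is built on top of the libxml2 library, which gives it high-performance XML parsing.

Below is a basic example of using lxml in a web crawler: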




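from lxml import etree
import requests

# Send the HTTP request
url = 'http://example.com'
response = requests.get(url)

# Check that we received a valid response
if response.status_code == 200:
    # Parse the HTML
    html = etree.HTML(response.text)

    # Use XPath to select the data we want
    # For example, select the text of every paragraph
    paragraphs = html.xpath('//p/text()')

    # Print the paragraph text
    for p in paragraphs:
        print(p)

# Note: this is only a simple example. A real crawler may need to handle more complex cases,
# such as pages rendered dynamically by JavaScript, AJAX requests, and login verification.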

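In this example, we first fetch the page with the requests library and then use etree.HTML to turn it into an lxml Element object. Next, we use XPath to select the text inside every paragraph tag. Finally, we iterate over the results and print them.

XPath is a language for finding information in XML and HTML documents; it is used to navigate through the elements and attributes of a document.

This is only a basic use of lxml in a web crawler. Real applications may need to combine several techniques and strategies to handle more complex sites and data.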
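To make the XPath description above more concrete, here is a small sketch that runs a few path expressions against an inline document. It deliberately swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages; ElementTree supports only a limited XPath subset (for instance, it has no text() step, so the text is read from the .text attribute) and requires well-formed markup, whereas lxml's xpath() accepts full XPath 1.0 and etree.HTML tolerates broken HTML. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document used only for illustration
doc = ET.fromstring(
    "<html><body>"
    "<p class='intro'>Hello</p>"
    "<div><p>World</p><a href='/next'>next</a></div>"
    "</body></html>"
)

# .//p selects every p element at any depth below the root
paragraphs = [p.text for p in doc.findall(".//p")]

# [@class='intro'] filters elements by attribute value
intro = doc.find(".//p[@class='intro']").text

# [@href] keeps only the elements that carry an href attribute
hrefs = [a.get("href") for a in doc.findall(".//a[@href]")]

print(paragraphs, intro, hrefs)  # ['Hello', 'World'] Hello ['/next']
```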


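Advanced Python: A Web Page Collector (a Python 3 Crawler Based on Baidu Search)

System

2024-08-12

All, Crawlers

Below is a simplified Python 3 crawler example that fetches the result links for a given keyword from Baidu search. Note that a real crawler may need to comply with the site's robots.txt rules and handle more complex situations, such as dynamically loaded pages and login verification.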




import requests
from bs4 import BeautifulSoup


def crawl_web(keyword):
    # Build request headers that mimic a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    # Search endpoint; the keyword is passed via params so that requests URL-encodes it
    base_url = 'https://www.baidu.com/s'

    try:
        # Send the GET request
        response = requests.get(base_url, params={'wd': keyword}, headers=headers, timeout=10)
        # Check the response status
        if response.status_code == 200:
            # Parse the page
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract the links from the search result titles
            links = soup.find_all('h3', class_='t')
            for link in links:
                if link.a is not None:  # Skip titles that contain no anchor
                    print(link.a.get('href'))
        else:
            print('Failed to retrieve search results for:', keyword)
    except requests.exceptions.RequestException as e:
        print('Failed to crawl web. Error:', e)


# Call the function with a keyword
crawl_web('Python')

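This code uses the requests library to send HTTP requests and BeautifulSoup to parse the HTML and extract data. The crawl_web function takes a search keyword, builds the request URL, sends the request, and parses the returned HTML to extract the links in the search results.

Note that this example does not parse the page content in detail; it only extracts the links under the search result titles. A real crawler may need to extract further data of value and may have to deal with AJAX and JavaScript-rendered content.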
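The article mentions complying with robots.txt but does not show how. One possible approach, sketched here with the standard library's urllib.robotparser, is to check each URL before requesting it. The robots.txt content, the example.com URLs, and the "my-crawler" user agent are all made up for illustration; a real crawler would load the file from the target site with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed = parser.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")

print(allowed, blocked)  # True False
```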


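Web Crawling with the urllib Library

System

2024-08-12

All, Crawlers

urllib is Python's built-in library for network requests. It provides several modules for working with URLs, including urllib.request for opening and reading URLs, urllib.parse for parsing URLs, and urllib.error for the exceptions raised during a request.

Below is an example that downloads a web page with the urllib.request module: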




import urllib.request

# URL of the page to download
url = 'http://www.example.com/'

# Open the URL; the with block closes the connection when done
with urllib.request.urlopen(url) as response:
    # Read the page content as bytes
    html = response.read()

# Decode the bytes into a string
html_string = html.decode('utf-8')

# Print the page content
print(html_string)

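The code above shows how to open and read a web page with urllib.request and convert its content to a string. In practice you may also need to handle HTTPS requests, send POST data, set request headers, and so on, but this example covers the basic usage.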
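As a rough illustration of the POST data and request headers mentioned above, the sketch below builds a urllib.request.Request object without sending it. The endpoint URL, the form fields, and the User-Agent string are assumptions chosen for the example.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields; urlencode percent-encodes non-ASCII values as UTF-8
form = {"q": "爬虫", "page": 1}
body = urllib.parse.urlencode(form).encode("utf-8")

request = urllib.request.Request(
    "https://example.com/search",  # hypothetical endpoint
    data=body,                     # supplying data turns the request into a POST
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"},
)

print(request.get_method())              # POST
print(request.get_header("User-agent"))  # urllib stores header names capitalized
print(body)                              # b'q=%E7%88%AC%E8%99%AB&page=1'

# To actually send it:
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read().decode("utf-8")
```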


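Writing Crawler Results to MySQL

System

2024-08-12

All, Crawlers

To write crawler results to a MySQL database, follow these steps:

1. Install MySQL and a Python database driver (such as mysql-connector-python or pymysql).
2. Create a database and a table to store the crawled data.
3. Connect to the MySQL database from Python.
4. Insert the crawled data into the table.

Below is a simple Python script showing how to write crawler results to MySQL with the mysql-connector-python library: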




import mysql.connector

# Connect to the MySQL database
db_connection = mysql.connector.connect(
    host="localhost",
    user="your_username",
    password="your_password",
    database="your_database"
)
cursor = db_connection.cursor()

# Assume you already have the crawled data
data_to_insert = ("value1", "value2", "value3")

# Parameterized INSERT statement; the driver escapes the values
sql_insert_query = """
INSERT INTO your_table_name (column1, column2, column3)
VALUES (%s, %s, %s)
"""

# Execute the statement
cursor.execute(sql_insert_query, data_to_insert)

# Commit the transaction
db_connection.commit()

# Close the cursor and the connection
cursor.close()
db_connection.close()

Be sure to replace your_username, your_password, your_database, your_table_name, column1, column2, column3, and data_to_insert with your actual database details and data.
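A crawler usually produces many rows at once, in which case a batch insert with executemany is often more convenient than calling execute per row. The sketch below illustrates the pattern with the standard library's sqlite3 module as a stand-in so that it runs without a MySQL server; this is a substitution, not the MySQL driver itself. The articles table and its columns are hypothetical. With mysql-connector-python the cursor also offers executemany, but the placeholders would be %s rather than ?.

```python
import sqlite3

# Hypothetical crawled rows: (title, url, published date)
rows = [
    ("Title A", "https://example.com/a", "2024-08-12"),
    ("Title B", "https://example.com/b", "2024-08-12"),
]

connection = sqlite3.connect(":memory:")  # in-memory database for the demo
try:
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE articles (title TEXT, url TEXT, published TEXT)")

    # One parameterized statement, executed once per row in the list
    cursor.executemany(
        "INSERT INTO articles (title, url, published) VALUES (?, ?, ?)",
        rows,
    )
    connection.commit()

    cursor.execute("SELECT COUNT(*) FROM articles")
    inserted = cursor.fetchone()[0]
    print(inserted)  # 2
finally:
    connection.close()
```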


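Crawling with Pandas: One Line Is All You Need

System

2024-08-12

All, Crawlers

If you want to scrape web data with Pandas and only need to keep a single row, you can combine Pandas's read_html function with BeautifulSoup to extract it. Below is a simple example that assumes we only need the first row of a table.

First, install the required libraries (if they are not installed yet):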




pip install pandas beautifulsoup4 lxml requests

Then use BeautifulSoup and Pandas's read_html function to parse the HTML and extract the table data:




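from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Page URL
url = 'http://example.com/table.html'

# Send the HTTP request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'lxml')
    # Find the first table element
    table = soup.find('table')

    if table is None:
        print("No table found on the page")
    else:
        # Convert the table element to a string
        table_html = str(table)

        # Read the table with read_html; wrapping the string in StringIO avoids
        # the deprecation warning newer pandas versions emit for literal HTML
        df = pd.read_html(StringIO(table_html))[0]  # Only one table was passed in

        # Print the first row
        print(df.iloc[0])
else:
    print("Failed to retrieve the webpage")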

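Note that this example assumes the table is defined with regular HTML tags and is not generated dynamically by JavaScript. If the table is loaded by JavaScript, you may need a tool such as Selenium to drive a browser and obtain the content after it has been rendered.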


Crawling with Selenium: A Summary of Element Locator Methods (Locating Elements Dynamically)

System

2024-08-12

All, Crawlers

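When using Selenium for web automation testing, we usually need to locate elements on the page in order to interact with them, for example to click or to type text. Below are some common locator strategies and how to use them in Python. The find_element_by_* helper methods were deprecated in Selenium 4 and removed in version 4.3, so the examples call find_element and find_elements with the By class. They assume driver is an already created WebDriver instance.

from selenium.webdriver.common.by import By

Locate by ID:

element = driver.find_element(By.ID, "element_id")

Locate by class name:

elements = driver.find_elements(By.CLASS_NAME, "class_name")

Locate by name:

element = driver.find_element(By.NAME, "element_name")

Locate by tag name:

elements = driver.find_elements(By.TAG_NAME, "tag_name")

Locate by link text:

link = driver.find_element(By.LINK_TEXT, "Link Text")

Locate by partial link text:

links = driver.find_elements(By.PARTIAL_LINK_TEXT, "Partial Text")

Locate by CSS selector:

element = driver.find_element(By.CSS_SELECTOR, "css_selector")

Locate by XPath:

element = driver.find_element(By.XPATH, "xpath_expression")

In practice, the page content may be dynamic, and you may need JavaScript to obtain an XPath or CSS selector at run time. The examples below read the locator from a data-xpath or data-css attribute, which the page is assumed to provide on an element that has already been located:

# Use JavaScript to read a dynamic XPath from the element's data-xpath attribute
dynamic_xpath = driver.execute_script("return arguments[0].getAttribute('data-xpath');", element)
dynamic_element = driver.find_element(By.XPATH, dynamic_xpath)

# Use JavaScript to read a dynamic CSS selector from the element's data-css attribute
dynamic_css = driver.execute_script("return arguments[0].getAttribute('data-css');", element)
dynamic_element = driver.find_element(By.CSS_SELECTOR, dynamic_css)

The code above shows how to locate page elements with Selenium and how to use JavaScript to obtain an element's locator dynamically.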


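Ways to Implement a Crawler in Java

System

2024-08-12

All, Crawlers

A crawler written in Java typically needs to send HTTP requests, parse the returned HTML, and sometimes handle JavaScript-rendered content. Below is a simple Java crawler example that uses HttpClient to fetch a page and the Jsoup library to parse it.

First, add the dependencies to your pom.xml: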




<dependencies>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
    <!-- Apache HttpClient @ https://hc.apache.org/ -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
</dependencies>

Then implement a simple Java crawler with the following code:




import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SimpleCrawler {

    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://example.com"); // Replace with the URL you want to crawl

        // try-with-resources closes the client and the response, replacing the
        // deprecated getConnectionManager().shutdown() call
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {

            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                Document doc = Jsoup.parse(content);

                // Use doc for DOM operations, for example reading the page title
                String title = doc.title();
                System.out.println("Title: " + title);
            }
        }
    }
}

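This simple crawler uses HttpClient to send the HTTP request and Jsoup to parse the HTML. You can extend it as needed, for example by adding more elaborate crawling logic, handling AJAX requests, handling login authentication, or crawling dynamic content.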


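How to Correctly Save Crawler Data as JSON in Python

System

2024-08-12

All, Crawlers

In Python, saving crawler data as JSON usually involves two main libraries: json and pandas. Below is an example that shows how to save crawled data in JSON format.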




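import json
import pandas as pd

# Assume you already have data that can be loaded into a DataFrame
data = {
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c']
}
df = pd.DataFrame(data)

# Method 1: save the DataFrame as a JSON file
df.to_json('data.json', orient='records')

# Method 2: to use the json module directly, first convert the data to a list of dicts
data_list_dict = df.to_dict(orient='records')

# Write the data to a JSON file (this overwrites the file written by method 1)
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data_list_dict, f)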

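Both methods save the data as JSON. The first uses the pandas library, and the second uses the json module from the Python standard library. Which one to choose depends on your data format and personal preference. If you already have a DataFrame, to_json is more direct; if you have some other data structure, you may first need to convert it to a list of dicts.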
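Crawled data often contains non-ASCII text such as Chinese titles. By default json.dump escapes such characters as \uXXXX sequences, which is valid JSON but hard to read. The sketch below, which uses only the standard library, shows one way to keep the text readable by passing ensure_ascii=False and opening the file as UTF-8. The records and the temporary file path are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical crawled records containing non-ASCII text
records = [
    {"title": "网络爬虫之lxml", "tags": ["爬虫"], "views": 120},
    {"title": "urllib basics", "tags": [], "views": 45},
]

# Write to a temporary directory so the demo does not touch the working directory
path = os.path.join(tempfile.mkdtemp(), "data.json")

# ensure_ascii=False writes the characters as-is; the file must then be UTF-8
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back, both as raw text and as parsed JSON
with open(path, encoding="utf-8") as f:
    raw = f.read()
loaded = json.loads(raw)

print(loaded == records)    # True
print("网络爬虫" in raw)     # True: the text was not escaped
```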
