分类爬虫下的文章

Python学习之路-爬虫提高:框架功能升级

2024-08-12




import requests
from bs4 import BeautifulSoup
import re
import os
 
class DoubanSpider:
    def __init__(self, url):
        self.url = url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
 
    def get_page(self, url):
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response.text
        return None
 
    def parse_index(self, html):
        soup = BeautifulSoup(html, 'lxml')
        movie_list = soup.find('ol', 'grid_view')
        if movie_list:
            for movie in movie_list.find_all('li'):
                yield {
                    'name': movie.find('span', 'title').getText().strip(),
                    'score': movie.find('span', 'rating_num').getText()
                }
 
    def save_to_file(self, data):
        with open('douban.txt', 'a', encoding='utf-8') as f:
            f.write(f"{data['name']}: {data['score']}\n")
 
    def run(self):
        html = self.get_page(self.url)
        for data in self.parse_index(html):
            print(f"{data['name']}: {data['score']}")
            self.save_to_file(data)
 
if __name__ == '__main__':
    url = 'https://movie.douban.com/tag/#/?sort=T&range=2,10&tags=%E7%83%AD%E9%97%A8'
    DoubanSpider(url).run()

这段代码实现了一个简单的基于Python的豆瓣电影爬虫，它能够抓取豆瓣电影标签页（这里使用的是国产电影标签作为示例）上的电影名称和评分，并将结果输出到控制台和保存到本地文件。这个示例展示了如何组织爬虫的结构，并包含了请求头的添加，以及对HTML内容的解析和提取。

Python 爬虫快速入门

2024-08-12




import requests
from bs4 import BeautifulSoup
 
# 发送HTTP请求
def get_html(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return "页面无法访问"
    except requests.RequestException:
        return "发生错误"
 
# 解析HTML并提取数据
def parse_soup(soup):
    data = []
    for item in soup.select('div.item'):   # 使用CSS选择器选取元素
        title = item.select('a')[0].text  # 提取标题
        link = item.select('a')[0]['href']  # 提取链接
        data.append((title, link))
    return data
 
# 主函数
def main(url):
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')  # 使用HTML解析器
    parsed_data = parse_soup(soup)
    for title, link in parsed_data:
        print(f'标题: {title}, 链接: {link}')
 
# 用例
if __name__ == "__main__":
    url = 'https://example.com'  # 替换为目标网站的URL
    main(url)

这段代码展示了如何使用Python的requests库发送HTTP请求，以及如何使用BeautifulSoup库解析HTML并提取数据。代码中的get_html函数负责发送请求并获取页面内容，parse_soup函数负责解析页面并提取标题和链接，最后在main函数中调用这两个函数来完成整个爬取过程。

2024-08-12

这个报错信息通常出现在浏览器的开发者工具中，当你使用JavaScript在控制台（console）尝试访问某个DOM元素时，如果你使用了$0这个变量，而当前没有任何元素被选中，你会看到这样的提示。$0通常代表最近在元素查看器里选中的DOM元素。

解决方法：

确保在你尝试访问$0之前，你已经在开发者工具的元素查看器里选中了一个元素。
如果你是在编写爬虫脚本，并且遇到了获取不到数据的问题，请检查你的选择器和XPath是否正确，确保你的脚本在正确的DOM节点上执行操作。
如果你在控制台中看到这个提示，但实际上你的爬虫脚本是按预期工作的，那么这个提示只是提醒你在控制台中没有选中任何元素，并不影响你的爬虫数据抓取。

如果你在编写爬虫脚本时遇到实际的获取数据问题，请提供更具体的代码和上下文信息，以便进一步诊断和解决问题。

【爬虫实战】利用代理爬取电商数据

2024-08-12

以下是一个简化的示例代码，展示了如何使用代理服务器爬取电商网站的商品信息。




import requests
from lxml import etree
 
# 设置代理服务器
proxies = {
    'http': 'http://user:password@proxy_address:port',
    'https': 'https://user:password@proxy_address:port'
}
 
# 目标网站URL
url = 'https://www.example.com/shop/products'
 
# 发送HTTP请求
response = requests.get(url, proxies=proxies)
 
# 检查请求是否成功
if response.status_code == 200:
    # 解析网页
    html = etree.HTML(response.text)
    
    # 提取商品信息
    products = html.xpath('//div[@class="product"]')
    for product in products:
        name = product.xpath('.//h3/text()')[0].strip()
        price = product.xpath('.//span[@class="price"]/text()')[0].strip()
        print(f'Name: {name}, Price: {price}')
else:
    print("Failed to retrieve the webpage")

这段代码首先设置了代理服务器的参数，然后向电商网站发送了一个GET请求，并检查了响应状态。如果响应成功，它会使用lxml库解析HTML内容，并提取商品信息（例如名称和价格），然后打印出来。这个例子展示了如何使用代理服务器进行网络爬取，并简单提取了网页中的数据。

2024-08-12




import redis
import requests
from lxml import etree
 
# 连接Redis数据库
redis_conn = redis.StrictRedis(host='localhost', port=6379, db=0)
 
def get_page_source(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        return None
 
def parse_page(html):
    tree = etree.HTML(html)
    news_items = tree.xpath('//div[@class="news-list"]/ul/li')
    for item in news_items:
        title = item.xpath('.//a/@title')[0]
        href = item.xpath('.//a/@href')[0]
        yield {
            'title': title,
            'href': href
        }
 
def save_to_redis(data):
    pipe = redis_conn.pipeline()
    for item in data:
        pipe.sadd('gushi_news:items', item['href'])
    pipe.execute()
 
def main():
    url = 'https://www.guge.name/news/'
    html = get_page_source(url)
    if html:
        for item in parse_page(html):
            print(item)
        save_to_redis(list(parse_page(html)))
 
if __name__ == '__main__':
    main()

这段代码首先定义了连接Redis数据库的函数，然后定义了获取页面源码、解析页面源码、保存数据到Redis的函数。在main函数中，它首先获取了网页的源码，然后解析出新闻标题和链接，并打印出来。最后，它将解析出的每条新闻数据保存到Redis的集合类型数据结构中，以此来实现去重存储的功能。

Python 爬虫 BeautifulSoup(bs4)库

2024-08-12

BeautifulSoup是一个Python库，用于从HTML或XML文件中提取数据。以下是一个使用BeautifulSoup库的简单示例，该示例从一个网页下载HTML内容，并使用BeautifulSoup解析该内容以提取数据。

首先，你需要安装BeautifulSoup库，如果还没有安装，可以使用pip安装：




pip install beautifulsoup4

然后，你可以使用以下代码来提取网页数据：




import requests
from bs4 import BeautifulSoup
 
# 下载网页
url = 'http://example.com'
response = requests.get(url)
 
# 检查网页是否成功下载
if response.status_code == 200:
    # 使用BeautifulSoup解析网页内容
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # 提取数据
    # 例如，提取标题
    title = soup.title.text
    print(title)
    
    # 提取所有的段落
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        print(p.text)
else:
    print("Failed to download the webpage")

这段代码首先使用requests库下载了一个网页，然后使用BeautifulSoup库解析HTML内容，并提取了标题和所有段落文字。你可以根据需要提取其他数据，例如提取所有的链接、图片或表格等。

【易语言】爬虫原理（静态网页）

2024-08-12

在易语言中实现一个简单的网页爬虫来提取静态网页中的信息，可以使用“内置文本解析”功能。以下是一个简单的示例代码：




.版本 3
.局部变量 网页文本, 文本型
.局部变量 解析器, 文本解析器
.局部变量 节点列表, 节点集合
.局部变量 节点, 节点
.局部变量 文本内容, 文本型
.局部变量 URL, 文本型
 
URL ＝ “http://example.com” ' 替换为你要爬取的网页地址
网页文本 ＝ 读入文本 (URL, #GBK) ' 使用GBK编码读取网页内容
创建文本解析器 (解析器, #HTML, 网页文本)
解析器.获取所有节点 (节点列表, “div”) ' 假设我们要提取的信息在div标签中，这里改为相应的标签
 
循环首 (节点列表.计数 (), 节点)
    文本内容 ＝ 节点.内容 () ' 获取节点内容
    输出 (文本内容) ' 输出节点内容
    节点.销毁 () ' 释放节点资源
循环尾 ()
 
解析器.销毁 () ' 释放解析器资源

这段代码创建了一个文本解析器，解析了给定的HTML文档，并获取了所有<div>标签的节点集合。然后遍历这些节点，输出它们的内容。这个例子是一个基本的开始，实际应用中可能需要根据具体网页结构进行调整。

Python：这是一个朴实无华的爬虫教程，就是有点养眼，Python技术图谱

2024-08-12




import requests
from bs4 import BeautifulSoup
 
# 爬取网页的函数
def crawl_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None
 
# 解析网页并提取信息的函数
def parse_soup(soup):
    title = soup.find('h1', class_='post-title').get_text()
    content = soup.find('div', class_='post-content').get_text()
    return title, content
 
# 主函数，组装URL并调用爬取和解析函数
def main(url):
    html = crawl_page(url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')
        title, content = parse_soup(soup)
        print(f"标题: {title}")
        print(f"内容: {content}")
    else:
        print("网页爬取失败")
 
# 示例URL
example_url = 'https://www.example.com/some-post'
 
# 运行主函数
main(example_url)

这段代码使用了requests库来爬取网页，使用BeautifulSoup库来解析HTML，并提取了一个假设的博客文章页面的标题和内容。这个例子简单且直接，适合作为教学使用。

Beautiful Soup4爬虫速成

2024-08-12




import requests
from bs4 import BeautifulSoup
 
# 设置代理服务器
proxies = {
    'http': 'http://user:password@proxy.server.com:port',
    'https': 'https://user:password@proxy.server.com:port'
}
 
# 发送请求
url = 'http://example.com'
response = requests.get(url, proxies=proxies)
 
# 检查请求是否成功
if response.status_code == 200:
    # 使用BeautifulSoup解析HTML内容
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # 提取页面标题
    title = soup.title.text
    print(title)
    
    # 查找所有的段落
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        print(p.text)
else:
    print('请求失败，状态码：', response.status_code)

这段代码使用了requests库来发送一个HTTP GET请求，并使用BeautifulSoup来解析返回的HTML内容。它还演示了如何设置代理服务器，以及如何检查请求是否成功，并处理了请求失败的情况。这是学习如何使用Beautiful Soup进行Web爬虫的一个很好的速成指南。