Articles in the category: Crawlers

2024-08-08

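Below is a basic example of crawling a website with Node.js and the Playwright library. The code launches a browser instance, navigates to a given URL, and takes a screenshot of the page.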

First, make sure the Playwright dependency is installed:




npm install playwright

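Depending on the Playwright version, the browser binaries may also need to be downloaded separately with npx playwright install. Then use the following Node.js script to crawl the site: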




const { chromium } = require('playwright');

async function run() {
  // Launch a browser instance
  const browser = await chromium.launch();
  try {
    // Open a new page
    const page = await browser.newPage();
    // Navigate to the target URL
    await page.goto('https://example.com');
    // Save a screenshot of the page
    await page.screenshot({ path: 'example.png' });
  } finally {
    // Close the browser even if one of the steps above fails
    await browser.close();
  }
}

run().catch(error => console.error('An error occurred:', error));

This code launches the Chromium browser, opens a new page, navigates to https://example.com, and saves a screenshot of the page as example.png.

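If you need to interact with the page further (clicking buttons, filling in forms, and so on), use the corresponding Playwright APIs. For example, page.$ returns a DOM element, page.click clicks an element, and page.fill fills in a form field.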


Design and Implementation of a Stock Quantitative Trading Analysis Platform Based on Python and Web Crawling

System

2024-08-08

All, Crawlers




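import pandas as pd
from jqdata import *

# Initialization: set trading options, costs, the benchmark and the stock to trade
def initialize(context):
    set_option('use_real_price', True)
    set_order_cost(OrderCost(close_tax=0.001, open_commission=0.0003, close_commission=0.0003, min_commission=5), type='stock')
    set_benchmark('000300.XSHG')
    # The single stock this strategy trades
    g.security = '000001.XSHE'

# Called once per bar (once per trading day in a daily backtest)
def handle_data(context, data):
    # Cash currently available
    cash = context.portfolio.cash
    # Latest close and the 20-day average of the close
    # (the original compared the 20-day average with itself, so it never sold)
    price = data[g.security].close
    ma20 = data[g.security].mavg(20, 'close')
    # Shares currently held (total_amount is assumed to be the position field name)
    held = context.portfolio.positions[g.security].total_amount
    # Sell everything when holding shares and the price is below the 20-day average
    if held > 0 and price < ma20:
        order_target(g.security, 0)
    # Otherwise buy 10,000 worth of shares when enough cash is available
    # (order_value takes a cash amount, not a share count)
    elif cash > 10000:
        order_value(g.security, 10000)

# Entry point kept from the original. run_backtest is not defined in this script and
# is not confirmed to exist in jqdata; a hosted platform normally starts the backtest itself.
if __name__ == '__main__':
    run_backtest(initialize=initialize, handle_data=handle_data, start_date='2015-01-01', end_date='2020-01-01')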

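This simple example shows how to backtest a moving-average trading strategy with the jqdata library. It sets a benchmark and one stock to trade, then checks the available cash and the stock price at each daily step. If buying is possible and the available cash exceeds 10,000, it buys the stock. If it holds the stock and the price has fallen below the 20-day average, it closes the position. The example is deliberately simple, but it shows how a trading rule can be embedded in an automated backtesting environment.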
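
The price-versus-average rule described above can also be checked offline, without any backtesting platform. The following is a minimal sketch in plain Python. The closing prices, the 3-day window and the function names moving_average and signals are assumptions made for illustration and are not part of jqdata.

```python
from typing import List, Optional


def moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window of prices is available."""
    result: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(prices[i + 1 - window:i + 1]) / window)
    return result


def signals(prices: List[float], window: int) -> List[str]:
    """'buy' when the price is above its average, 'sell' when below, else 'hold'."""
    out: List[str] = []
    for price, avg in zip(prices, moving_average(prices, window)):
        if avg is None or price == avg:
            out.append("hold")
        elif price > avg:
            out.append("buy")
        else:
            out.append("sell")
    return out


if __name__ == "__main__":
    closes = [10.0, 11.0, 12.0, 11.0, 9.0]  # made-up closing prices
    print(moving_average(closes, 3))
    print(signals(closes, 3))
```

Keeping the signal logic in a pure function like this makes it easy to unit-test before wiring it into a platform callback such as handle_data.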


Web Crawling in Practice with Python: A 58同城 (58.com) Crawler

System

2024-08-08

All, Crawlers




import requests
from lxml import etree
import csv

def get_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None

def first(nodes, default=''):
    # Return the first XPath match, or a default when nothing matched
    return nodes[0] if nodes else default

def parse_content(html):
    html = etree.HTML(html)
    job_list = html.xpath('//ul[@class="item_con_list"]/li')
    for job in job_list:
        job_info = first(job.xpath('./div[2]/div[2]/div[2]/p/text()'))
        yield {
            'title': first(job.xpath('./div[1]/h3/a/@title')),
            'link': first(job.xpath('./div[1]/h3/a/@href')),
            'company': first(job.xpath('./div[2]/div[1]/a[1]/text()')),
            'location': first(job.xpath('./div[2]/div[1]/span[1]/text()')),
            'salary': first(job.xpath('./div[2]/div[2]/div[1]/text()')),
            'info': job_info.strip().replace('\n', '').replace(' ', '')
        }

def save_to_csv(data):
    # Append one record per call; no header row is written
    with open('58_jobs.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link', 'company', 'location', 'salary', 'info'])
        writer.writerow(data)

def main(offset):
    url = f'https://www.58.com/jobs/?&cl=1&cityId=489&offset={offset}'
    html = get_content(url)
    if html is None:
        # The request failed or returned a non-200 status; skip this page
        return
    for job in parse_content(html):
        print(job)
        save_to_csv(job)

if __name__ == '__main__':
    for i in range(0, 90, 30):  # Offsets 0, 30, 60: three pages; raise the upper bound to fetch more
        main(i)

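This code scrapes job listings from 58同城 (58.com) and saves them to a CSV file. It uses the requests library to send HTTP requests, the lxml library to parse the HTML, and XPath expressions to extract specific elements. The extraction logic lives in parse_content, a generator function that yields one job record at a time, and each record is written to the CSV file by save_to_csv. Finally, a loop in the if __name__ == '__main__': block calls main once per page offset to crawl several pages.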
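
Because the save step appends one row per call, the resulting file has no header row. One possible variation, sketched below with only the standard library, consumes a generator of records and writes the header once. The sample_jobs generator, its made-up records and the write_jobs name are assumptions for illustration; in the crawler the records would come from parse_content.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

FIELDS = ["title", "link", "company", "location", "salary", "info"]


def sample_jobs() -> Iterator[Dict[str, str]]:
    """Stand-in for parse_content(); yields made-up records."""
    yield {"title": "Python Developer", "link": "https://example.com/1",
           "company": "Example Ltd", "location": "Beijing",
           "salary": "10-15k", "info": "3 years experience"}
    yield {"title": "Data Engineer", "link": "https://example.com/2",
           "company": "Sample Inc", "location": "Shanghai",
           "salary": "15-20k", "info": "SQL required"}


def write_jobs(rows: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write a header row followed by every record; return the record count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    print(write_jobs(sample_jobs(), buffer))
    print(buffer.getvalue())
```

Passing an open file object instead of a path keeps the function easy to test with io.StringIO and avoids reopening the file for every record.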


Advanced Python: Querying Product Price History (A Crawler for the 慢慢买 Price-Comparison Site)

System

2024-08-08

All, Crawlers




import requests
from bs4 import BeautifulSoup
import pandas as pd

# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def get_price_history(url):
    # Send the GET request and fail early on HTTP errors
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # Parse the page
    soup = BeautifulSoup(response.text, 'lxml')
    # Locate the table that holds the price history
    table = soup.find('table', class_='tb_list')
    if table is None:
        raise ValueError('price history table not found on the page')
    # Extract the cell text of each row
    rows = table.find_all('tr')[1:]  # skip the header row
    data = [[td.text.strip() for td in row.find_all('td')] for row in rows]
    # Convert the rows to a DataFrame
    df = pd.DataFrame(data, columns=['date', 'price', 'seller'])
    return df

# Example URL
url = 'https://www.hh360.com/product/293460.html'
# Fetch and print the price history
price_history = get_price_history(url)
print(price_history)

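This code imports the required modules, sets the request headers, and defines a function, get_price_history, that fetches a product's price history. The function sends a GET request, parses the page, extracts the price history rows, and converts them to a pandas DataFrame, which is then printed. The example shows how to scrape web data with Python and prepare it for analysis.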
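
The row-extraction step can be tried without network access. The sketch below uses the standard library's html.parser in place of BeautifulSoup so that it has no dependencies. The sample HTML, the TableRowParser class and the parse_rows function are assumptions for illustration and do not reflect the real page's markup.

```python
from html.parser import HTMLParser
from typing import List


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row:
            # Rows made only of <th> cells stay empty and are skipped
            self.rows.append(self._row)


# Made-up markup with the same three columns the script expects
SAMPLE_HTML = """
<table class="tb_list">
  <tr><th>Date</th><th>Price</th><th>Seller</th></tr>
  <tr><td>2024-08-01</td><td> 199.00 </td><td>Shop A</td></tr>
  <tr><td>2024-08-02</td><td>189.00</td><td>Shop B</td></tr>
</table>
"""


def parse_rows(html: str) -> List[List[str]]:
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```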

System

2024-08-08

All, Crawlers




import requests

# Path of the file to upload
file_path = '/path/to/your/file.txt'

# URL of the upload API
url = 'http://example.com/api/upload'

# Open the file in a with block so it is closed once the upload finishes
with open(file_path, 'rb') as file:
    # Send the file with requests.post
    response = requests.post(url, files={'file': file})

# Check whether the request succeeded
if response.status_code == 200:
    print('File uploaded successfully')
else:
    print('File upload failed')

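This code shows how to upload a file with Python's requests library. The upload uses the files parameter of requests, which takes a dictionary whose keys are form field names and whose values are open file objects (or tuples such as (filename, file object, content type)). Note that a file path passed as a plain string is sent as the field's content rather than opened. This approach sends a POST request of type multipart/form-data.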
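
To make concrete what a multipart/form-data request carries, the sketch below builds such a body by hand with the standard library. This is only an illustration of the wire format for a single file; requests generates the body and a random boundary for you. The encode_multipart name and the fixed boundary string are assumptions for the example.

```python
from typing import Tuple


def encode_multipart(field: str, filename: str, content: bytes,
                     boundary: str = "example-boundary") -> Tuple[bytes, str]:
    """Build a one-file multipart/form-data body and its Content-Type header value."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    # The closing delimiter is the boundary followed by two extra dashes
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


if __name__ == "__main__":
    body, content_type = encode_multipart("file", "file.txt", b"hello")
    print(content_type)
    print(body.decode("utf-8"))
```

A fixed boundary is used here only to keep the output readable; a real boundary must be a string that cannot occur inside the file content.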


Python Lab Project 9: Web Crawlers and Automation

System

2024-08-08

All, Crawlers




import requests
from bs4 import BeautifulSoup
import re
import os

def download_image(image_url, directory):
    """
    Download an image into the given directory.
    """
    response = requests.get(image_url, timeout=10)
    image_name = image_url.split('/')[-1]
    with open(os.path.join(directory, image_name), 'wb') as file:
        file.write(response.content)

def crawl_images(url, directory):
    """
    Collect the image links on a page and download them to a local directory.
    """
    if not os.path.exists(directory):
        os.makedirs(directory)

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Use .get so <img> tags without a src attribute are skipped instead of raising KeyError
    image_urls = [image['src'] for image in soup.find_all('img') if re.match(r'http[s]?://', image.get('src', ''))]

    for image_url in image_urls:
        print(f"Downloading image: {image_url}")
        download_image(image_url, directory)

if __name__ == '__main__':
    base_url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
    directory = 'images'
    crawl_images(base_url, directory)

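This code implements a simple web crawler that downloads the images on a given page whose src attribute is an absolute http or https URL. First, it defines download_image, which takes an image link and a target directory, fetches the image content with the requests library, and writes it into that directory.

Next, it defines crawl_images, which takes a page link and a target directory, fetches the page with requests, parses it with BeautifulSoup, and uses a regular expression to keep only complete image links. It then iterates over these links and calls download_image for each one.

Finally, the if __name__ == '__main__': block sets the base page link and the image directory, then calls crawl_images to start crawling.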
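
The regular-expression filter skips images whose src is a relative path. One possible alternative, sketched below with the standard library, resolves each src against the page URL and derives the file name from the URL path so that query strings are ignored. The function names absolute_image_urls and image_filename and the sample values are assumptions for illustration.

```python
import posixpath
from typing import List
from urllib.parse import urljoin, urlparse


def absolute_image_urls(page_url: str, srcs: List[str]) -> List[str]:
    """Resolve each src against the page URL; keep only http(s) results."""
    resolved = (urljoin(page_url, src) for src in srcs if src)
    return [u for u in resolved if urlparse(u).scheme in ("http", "https")]


def image_filename(image_url: str) -> str:
    """File name taken from the URL path, so query strings are ignored."""
    return posixpath.basename(urlparse(image_url).path) or "image"


if __name__ == "__main__":
    page = "http://example.com/places/view/1"
    srcs = ["/static/a.png", "b.jpg?v=2", "https://cdn.example.com/c.gif",
            "data:image/png;base64,AAAA", ""]
    for url in absolute_image_urls(page, srcs):
        print(url, "->", image_filename(url))
```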


How to Configure a Proxy IP for a PHP Crawler (cURL Functions)

System

2024-08-08

All, Crawlers

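In PHP, configuring a proxy IP with the cURL functions is straightforward. The following example shows how to set the cURL options needed to send a request through a proxy server: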




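<?php
// Initialize a cURL session
$ch = curl_init();

// Set the URL to request
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");

// Return the response body from curl_exec() instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Set the proxy server
curl_setopt($ch, CURLOPT_PROXY, "http://your-proxy-ip:port");

// Set the proxy credentials if the proxy requires authentication
// curl_setopt($ch, CURLOPT_PROXYUSERPWD, "username:password");

// Execute the request; $result is false on failure
$result = curl_exec($ch);
if ($result === false) {
    echo 'cURL error: ' . curl_error($ch);
}

// Close the cURL session
curl_close($ch);
?>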

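Replace your-proxy-ip and port with the IP address and port of your proxy server. If the proxy requires authentication, supply the username and password with the CURLOPT_PROXYUSERPWD option.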

System

2024-08-08

All, Crawlers




# Import the Django models module
from django.db import models

# Data model for second-hand housing listings in Harbin
class HrljHouseData(models.Model):
    title = models.CharField(max_length=255, verbose_name="Title")
    size = models.CharField(max_length=50, verbose_name="Area")
    floor = models.CharField(max_length=50, verbose_name="Floor")
    direction = models.CharField(max_length=50, verbose_name="Orientation")
    year = models.CharField(max_length=50, verbose_name="Year")
    decorate = models.CharField(max_length=50, verbose_name="Decoration status")
    house_type = models.CharField(max_length=50, verbose_name="House type")
    address = models.CharField(max_length=255, verbose_name="Address")
    total_price = models.CharField(max_length=50, verbose_name="Total price")
    unit_price = models.CharField(max_length=50, verbose_name="Unit price")
    publish_date = models.DateField(verbose_name="Publish date")

    class Meta:
        verbose_name = "Harbin second-hand housing data"
        verbose_name_plural = verbose_name

    def __str__(self):
        return f"{self.title} - {self.publish_date}"

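This code defines a Django model for storing second-hand housing data for Harbin. It has fields for the title, area, floor, orientation, year, decoration status, house type, address, total price, and unit price, plus a publish date. The field and class names in the example should be adjusted to your actual requirements. The model can be used to create the database table and to store the scraped data.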


[Crawler in Practice] Scraping Weather Information with Python

System

2024-08-08

All, Crawlers




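import requests
import json

def get_weather_data(city):
    # Route the request through a proxy so the server sees the proxy's address
    # (the address below comes from the original example; replace it with a proxy you can use)
    proxies = {
        'http': 'http://120.77.252.146:80',
        'https': 'https://120.77.252.146:80',
    }

    # Weather API endpoint
    api_url = 'http://wthrcdn.etouch.cn/weather_mini?city=' + city

    # Send the request
    response = requests.get(api_url, proxies=proxies, timeout=10)

    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the JSON data
        data = json.loads(response.text)
        return data
    else:
        return None

def print_weather(data):
    # Guard against a failed request or a response without forecast data
    info = (data or {}).get('data')
    if not info or not info.get('forecast'):
        print('No weather data available')
        return
    forecast = info['forecast']
    # Print the city and today's weather
    print(f"City: {info.get('city')}")
    print(f"Today's weather: {forecast[0].get('type')}, temperature: {forecast[0].get('low')} ~ {forecast[0].get('high')}")

# Fetch and print the weather for Beijing
weather_data = get_weather_data('北京')
print_weather(weather_data)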

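This code calls a public weather API, sending the city name in a GET request to retrieve the weather data. It then parses the response with the json module and prints the weather for Beijing. The example is simple, but it shows the basic approach to fetching weather information with a Python crawler.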
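
The parsing step is easy to exercise without calling the API. The sketch below works on a made-up payload that has the same shape the script reads (data, city, forecast, type, low, high) and returns None instead of raising when a key is missing. The SAMPLE string and the summarize function are assumptions for illustration, not a real API response.

```python
import json
from typing import Optional

# Made-up payload with the same shape the script reads; not a real API response
SAMPLE = ('{"data": {"city": "Beijing", "forecast": '
          '[{"type": "Sunny", "low": "22", "high": "31"}]}}')


def summarize(payload: str) -> Optional[str]:
    """Return a one-line summary, or None if the expected keys are missing."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    info = data.get("data") if isinstance(data, dict) else None
    if not isinstance(info, dict):
        return None
    forecast = info.get("forecast") or []
    if not forecast:
        return None
    today = forecast[0]
    return f"{info.get('city')}: {today.get('type')}, {today.get('low')} ~ {today.get('high')}"


if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(summarize('{"status": 1002}'))
```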

System

2024-08-07

All, Crawlers




import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SimpleCrawlerExample {
    public static void main(String[] args) {
        String url = "https://www.example.com"; // Replace with the site you want to crawl
        try {
            // Fetch and parse the page
            Document doc = Jsoup.connect(url).get();
            String title = doc.title();
            String bodyText = doc.body().text();

            System.out.println("Page title: " + title);
            System.out.println("Page content:\n" + bodyText);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

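This code uses the jsoup library to implement a simple web crawler. It connects to the given URL, retrieves the page title and text content, and prints them. The example only demonstrates basic page fetching with jsoup. It does not handle more complex concerns such as JavaScript-rendered pages, crawling multiple pages, avoiding repeated crawls, or request rate limits.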
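
Two of the concerns listed above, crawling multiple pages and avoiding repeated crawls, come down to a queue of pending URLs plus a set of URLs already seen. The sketch below shows that idea in Python rather than Java, using an in-memory link graph in place of real pages. The crawl function, its get_links parameter and the FAKE_SITE data are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(start: str, get_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Breadth-first crawl that visits each URL at most once."""
    seen = {start}
    queue = deque([start])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # skip URLs that are already queued or visited
                seen.add(link)
                queue.append(link)
    return order


# Made-up link graph standing in for real pages
FAKE_SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/c", "/a"],
    "/c": [],
}

if __name__ == "__main__":
    print(crawl("/", lambda url: FAKE_SITE.get(url, [])))
```

In a real crawler, get_links would fetch and parse the page, and a delay between requests would be one way to respect rate limits.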
