Articles tagged python

2024-08-16




import xml.etree.ElementTree as ET

import uiautomator2 as u2


def dump_texts(d: u2.Device, max_depth: int = 3):
    """Recursively walk the UI hierarchy and print each node's text."""
    def _dump_texts_recursive(node, depth):
        if depth > max_depth:
            return
        text = node.get('text')
        if text:
            print(f"{'  ' * depth}{text}")
        for child in node:
            _dump_texts_recursive(child, depth + 1)

    # dump_hierarchy() returns the UI tree as an XML string, so parse it first
    root = ET.fromstring(d.dump_hierarchy())
    _dump_texts_recursive(root, 0)


# Connect to the device (assumes the uiautomator service is at 127.0.0.1:7912)
d = u2.connect('127.0.0.1:7912')

# Print the text of the UI elements on the device, down to max_depth
dump_texts(d)

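This code uses the uiautomator2 library to connect to an Android device. The dump_texts function recursively walks the UI hierarchy down to max_depth levels and prints the text of every element that has any. It is a useful learning example of how to inspect a device's UI layout and text with this library.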
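The traversal can also be tried without a device. The sketch below assumes the hierarchy is available as an XML string, which is what uiautomator2's `dump_hierarchy()` returns. The sample XML and the helper name `collect_texts` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling a UI hierarchy dump (not real device output)
SAMPLE_XML = """
<hierarchy>
  <node text="" class="android.widget.FrameLayout">
    <node text="Settings" class="android.widget.TextView"/>
    <node text="" class="android.widget.LinearLayout">
      <node text="Wi-Fi" class="android.widget.TextView"/>
    </node>
  </node>
</hierarchy>
"""


def collect_texts(xml_source, max_depth=3):
    """Return (depth, text) pairs for nodes with non-empty text, down to max_depth."""
    found = []

    def walk(node, depth):
        if depth > max_depth:
            return
        text = node.get('text')
        if text:
            found.append((depth, text))
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_source), 0)
    return found


for depth, text in collect_texts(SAMPLE_XML):
    print(f"{'  ' * depth}{text}")
```

Returning the pairs instead of printing inside the recursion makes the result easy to filter or assert on.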


A brief look at Python's two main scraping libraries: the differences between urllib and requests

System

2024-08-16

All, Web scraping

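urllib and requests are both Python libraries for making network requests, but they differ in several ways:

Interface: urllib exposes lower-level building blocks (Request objects, openers, handlers) and needs more code for common tasks; requests offers a compact API that suits beginners and everyday use.
Features: urllib is Python's built-in HTTP request module and can read data from URLs over HTTP, HTTPS, FTP, and other schemes. Both libraries can send GET, POST, PUT, DELETE, and other HTTP methods; requests additionally makes cookies and sessions easy to manage through its Session object.
Async support: neither urllib nor requests supports asynchronous requests natively. For async IO, a separate library such as aiohttp is typically used.
Installation: requests is a third-party library and is installed with pip install requests; urllib is part of the Python standard library and needs no extra installation.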

Examples:

Sending a GET request with urllib:




import urllib.request

# read() yields raw bytes, so decode them to get text
with urllib.request.urlopen('http://www.example.com/') as response:
    html = response.read().decode('utf-8')

Sending a GET request with requests:




import requests
 
response = requests.get('http://www.example.com/')
html = response.text

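In practice, developers usually prefer requests because its API is simpler and easier to use. urllib is a better fit when third-party dependencies must be avoided or when lower-level control over how requests are built and opened is needed. Neither library provides async IO; that calls for a library such as aiohttp.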
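The difference in verbosity is easiest to see with a non-GET request. With requests, a JSON POST is a single call, `requests.post(url, json=payload)`, whereas urllib needs the body encoded and the header set by hand. A minimal sketch: the endpoint URL and payload are placeholders, the helper name `build_json_post` is made up here, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


# Placeholder endpoint and payload
req = build_json_post('http://www.example.com/api', {'q': 'python'})
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```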


Lab platform: Python Scrapy crawler for Lagou job-posting data analysis (Levels 4 and 5)

System

2024-08-16

All, Web scraping




# Import the required module
import scrapy
 
class LagouSpider(scrapy.Spider):
    name = 'lagou'
    allowed_domains = ['lagou.com']
    start_urls = ['https://www.lagou.com/jobs/list_%E6%B5%8B%E8%AF%95?labelWords=label']
 
    # Parse a list page
    def parse(self, response):
        # Extract the detail-page URL of each job posting
        urls = response.css('.position_link::attr(href)').getall()
        for url in urls:
            yield response.follow(url, self.parse_detail)
 
        # Extract the next-page URL and keep crawling
        next_page_url = response.css('.pager_next::attr(href)').get()
        if next_page_url:
            yield response.follow(next_page_url, self.parse)
 
    # Parse a detail page
    def parse_detail(self, response):
        item = {
            'job_title': response.css('.job-name::text').get(),
            'company_name': response.css('.company_name::text').get(),
            'salary': response.css('.salary::text').get(),
            'city': response.css('.work_addr::text').get(),
            'experience_edu': response.css('.detail-company li::text').getall(),
            'job_desc': response.css('.job-detail::text').getall(),
        }
        yield item

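This code fixes the errors in the original code and provides a concise crawler example for extracting job postings from Lagou. It defines a spider class named LagouSpider, based on scrapy.Spider, with one method for parsing list pages and one for parsing detail pages. The spider follows the URLs it extracts and yields each parsed item.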


Python: getting a crawler past login verification by retrieving the SMS verification code

System

2024-08-16

All, Web scraping

To implement this, you use Python to simulate a user login and then request the SMS verification code. Below is a simplified example that uses the requests library:




import requests

# Login URL
login_url = 'https://www.example.com/login'

# SMS verification code API
sms_code_api = 'https://www.example.com/api/sms_code'

# Username and password
username = 'your_username'
password = 'your_password'

# Create a Session object so that cookies persist across requests
with requests.Session() as session:
    # Send the login request
    login_response = session.post(login_url, data={'username': username, 'password': password})

    # Check whether the login succeeded
    if login_response.ok:
        # After a successful login, request the SMS verification code
        sms_code_response = session.get(sms_code_api)

        if sms_code_response.ok:
            # Print the SMS verification code
            print("SMS verification code:", sms_code_response.json())
        else:
            print("Failed to get the SMS verification code")
    else:
        print("Login failed")

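Note that this example assumes the login API and the SMS-code API are called within the same session. In practice, replace login_url, sms_code_api, and the username and password with the actual values for the site you are logging in to.

In addition, a real implementation may need to handle cookies, headers, and the site-specific parameters of the login and SMS-code APIs; these details have to be adjusted to the target site.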


Python reverse-engineering crawler tutorial 07: reverse-engineering a site's encrypted video-data parameters

System

2024-08-16

All, Web scraping




import requests
import re
import execjs
 
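# Compile the prepared JavaScript decryption code with execjs.
# js_content must already hold that JavaScript source as a string; it is not defined in this excerpt.
context = execjs.compile(js_content)
 
# Assume the encrypted parameter for the video data has already been obtained from a network request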
encrypted_sign = "JTdCJTIyYWRkJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KDEsTG9nJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KDEsTG9nJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KDEsTG9nJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2KHN0cmlwJTIyJTNBJTIyYWRkJTIyYWRkUUVCKjIxMjM0NTY2K

System

2024-08-16

All, Web scraping




import requests
import pymysql
import time
 
# Connect to the MySQL database
def connect_db():
    connection = pymysql.connect(host='localhost',
                                 user='your_username',
                                 password='your_password',
                                 database='your_database',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    return connection
 
# Insert one row into the MySQL database
def insert_db(data, connection):
    try:
        with connection.cursor() as cursor:
            sql = "INSERT INTO btc_trade (trade_id, amount, price, time) VALUES (%s, %s, %s, %s)"
            cursor.execute(sql, data)
        connection.commit()
    except pymysql.MySQLError as e:
        print(e)
 
# Fetch Bitcoin large-trade data from the OK API
def get_btc_trade(url):
    response = requests.get(url)
    return response.json()
 
# Main program
def main():
    url = 'https://www.okcoin.com/api/v1/btc_cny/trades?since=0'
    connection = connect_db()
    while True:
        trades = get_btc_trade(url)
        for trade in trades:
            data = (trade['tid'], trade['amount'], trade['price'], trade['time'])
            insert_db(data, connection)
        time.sleep(10)  # wait 10 seconds between polls
 
if __name__ == "__main__":
    main()

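In this example, connect_db opens the MySQL connection and insert_db inserts a row into the database. get_btc_trade fetches trade data from the API at the given URL (okcoin.com here). Finally, main connects to the database and enters an infinite loop that fetches data every 10 seconds and inserts it. The example shows how to pull data from an API continuously and store it in a database.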
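Because the loop polls the same URL every 10 seconds, consecutive responses may contain trades that were already stored. One way to handle this, sketched here with the standard-library sqlite3 module standing in for MySQL, is to make trade_id a primary key and skip duplicates on insert. The sample trade dictionaries are invented and the helper name `insert_trades` is hypothetical; the field names mirror those used in the code above.

```python
import sqlite3

# In-memory SQLite database, used only for illustration
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE btc_trade ("
    "trade_id INTEGER PRIMARY KEY, amount REAL, price REAL, time INTEGER)"
)


def insert_trades(conn, trades):
    """Insert trades, silently skipping trade_ids that are already stored."""
    rows = [(t['tid'], t['amount'], t['price'], t['time']) for t in trades]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO btc_trade (trade_id, amount, price, time) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


# Two invented polling results that overlap on trade 2
first_poll = [
    {'tid': 1, 'amount': 0.5, 'price': 100.0, 'time': 1700000000},
    {'tid': 2, 'amount': 1.2, 'price': 101.0, 'time': 1700000005},
]
second_poll = [
    {'tid': 2, 'amount': 1.2, 'price': 101.0, 'time': 1700000005},
    {'tid': 3, 'amount': 0.1, 'price': 102.0, 'time': 1700000010},
]
insert_trades(conn, first_poll)
insert_trades(conn, second_poll)

stored = conn.execute("SELECT COUNT(*) FROM btc_trade").fetchone()[0]
print(stored)
```

With MySQL, the comparable statement is INSERT IGNORE combined with a primary or unique key on trade_id.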

System

2024-08-16

All, Web scraping

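To keep the answer concise, here is a simplified example showing how to scrape product details, product lists, and coupon data from Pinduoduo with Python. Note that the example URLs in the code point at tmall.com rather than Pinduoduo.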




import requests
from pyquery import PyQuery as pq
 
# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
 
# Fetch data from a product detail page
def get_item_details(item_url):
    try:
        response = requests.get(item_url, headers=headers)
        if response.status_code == 200:
            doc = pq(response.text)
            # Extract detail-page fields such as the product title and price
            title = doc('.product-title').text()
            price = doc('.price').text()
            return {
                'title': title,
                'price': price
            }
    except requests.exceptions.RequestException:
        return None
 
# Fetch data from a product list page
def get_item_list(list_url):
    try:
        response = requests.get(list_url, headers=headers)
        if response.status_code == 200:
            doc = pq(response.text)
            # Extract list-page data such as product links
            items = doc('.item').items()
            for item in items:
                item_url = item('.pic').attr('href')
                yield item_url
    except requests.exceptions.RequestException:
        return None
 
# Fetch coupon data for a product
def get_item_coupons(item_url):
    try:
        response = requests.get(item_url, headers=headers)
        if response.status_code == 200:
            doc = pq(response.text)
            # Extract the coupon data
            coupons = doc('.coupon-item').items()
            for coupon in coupons:
                title = coupon('.coupon-title').text()
                condition = coupon('.coupon-rule').text()
                yield {
                    'title': title,
                    'condition': condition
                }
    except requests.exceptions.RequestException:
        return None
 
# Usage example
if __name__ == '__main__':
    item_url = 'https://www.tmall.com/item.htm?id=525388747711'
    list_url = 'https://list.tmall.com/search_product.htm?q=%E5%B0%8F%E7%B1%B3'
    
    # Fetch the product details
    item_details = get_item_details(item_url)
    print(item_details)
    
    # Print every product link on the list page
    # (a separate loop variable keeps item_url intact for the coupon lookup)
    for url in get_item_list(list_url):
        print(url)
    
    # Fetch the coupon data for the product
    for coupon in get_item_coupons(item_url):
        print(coupon)

This code shows how to scrape web page data with Python. It first defines the request headers, to simulate


A Practical Guide to Python Web Crawlers

System

2024-08-16

All, Web scraping




import requests
from bs4 import BeautifulSoup
 
def get_soup(url):
    """
    Return a BeautifulSoup object for the given URL, or None if the response status is not 200
    """
    response = requests.get(url)
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')
    else:
        return None
 
def get_download_urls(soup):
    """
    Extract all image download links from a BeautifulSoup object
    """
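    # Assumes image links are in the href attribute of <a> tags and end with .jpg
    # href=True skips <a> tags without an href, which would otherwise raise KeyError
    download_urls = [tag['href'] for tag in soup.find_all('a', href=True) if tag['href'].endswith('.jpg')]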
    return download_urls
 
def download_images(download_urls, path='images/'):
    """
    Save each image in the list of download links to a local file
    """
    for index, url in enumerate(download_urls):
        response = requests.get(url)
        if response.status_code == 200:
            with open(f'{path}image_{index}.jpg', 'wb') as file:
                file.write(response.content)
 
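# Example usage
url = 'http://example.com/gallery'
soup = get_soup(url)
# get_soup returns None when the request fails, so check before parsing
if soup is not None:
    download_urls = get_download_urls(soup)
    download_images(download_urls)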

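This simplified example uses requests and BeautifulSoup to fetch a page, extract image links, and save the images to a local folder. The images/ directory must already exist, because open() does not create it. This is a basic application of crawling and a reasonable starting point for beginners learning to write web crawlers.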
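Links collected from a page are often relative (for example `/img/a.jpg`), while `requests.get` needs absolute URLs. A small sketch using the standard library's `urllib.parse.urljoin`; the helper name `absolute_jpg_urls` and the sample links are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def absolute_jpg_urls(page_url, hrefs):
    """Resolve hrefs against page_url and keep those whose path ends in .jpg."""
    urls = []
    for href in hrefs:
        full = urljoin(page_url, href)
        # Compare the path only, so query strings such as ?size=large are ignored
        if urlparse(full).path.lower().endswith('.jpg'):
            urls.append(full)
    return urls


# Invented sample links: root-relative, page-relative, non-image, absolute
links = ['/img/a.jpg', 'b.JPG?size=large', 'about.html', 'http://cdn.example.com/c.jpg']
print(absolute_jpg_urls('http://example.com/gallery/index.html', links))
```

The resulting list could be passed to download_images in place of the raw href values.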


Efficient Python crawlers, from basics to practice to advanced techniques: this one article is all you need

System

2024-08-16

All, Web scraping

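Advanced and production Python crawlers usually involve async IO, distributed crawling, anti-scraping countermeasures, data parsing, and storage management. Below are examples of some common techniques:

Async IO: use asyncio and aiohttp to make asynchronous network requests.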




import asyncio
import aiohttp
 
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
 
async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://httpbin.org/get')
        print(html)
 
# asyncio.run creates the event loop, runs main(), and closes the loop
asyncio.run(main())

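Distributed crawling: build on the Scrapy framework. Scrapy by itself runs as a single process; distributing work across multiple crawler nodes is typically done with an extension such as scrapy-redis, which shares the request queue between nodes.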




import scrapy
 
class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
 
    def parse(self, response):
        # Parse the response data
        pass

Anti-scraping countermeasures: add appropriate delays, use proxies, and manage cookies/sessions.




import requests
 
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
 
response = requests.get('http://httpbin.org/get', proxies=proxies)

Data parsing: use BeautifulSoup or lxml to parse HTML/XML data.




from bs4 import BeautifulSoup
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""
 
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)

Storage management: use SQLAlchemy for relational databases or pymongo for MongoDB storage.




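from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///example.db')
# engine.begin() opens a transaction and commits it on exit;
# raw SQL strings must be wrapped in text() in current SQLAlchemy versions
with engine.begin() as con:
    con.execute(text("CREATE TABLE users (id INTEGER, name VARCHAR)"))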

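These examples cover only a small part of crawling. Real projects may involve more complex techniques such as dynamic page rendering, OCR, natural language processing, and machine learning.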
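The async IO example above fetches a single page, where asynchronous code gains little; the benefit appears when many requests run concurrently. The sketch below illustrates that pattern with `asyncio.gather` and a semaphore that caps concurrency. `fake_fetch` is a stand-in that sleeps instead of making a network request, and the URLs are placeholders; with aiohttp, its body would be replaced by a `session.get` call.

```python
import asyncio


async def fake_fetch(url, semaphore):
    """Stand-in for a network request: waits briefly, then returns a fake body."""
    async with semaphore:  # at most max_concurrency coroutines are in here at once
        await asyncio.sleep(0.01)
        return f"<html>{url}</html>"


async def crawl(urls, max_concurrency=2):
    """Fetch all urls concurrently, limited by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fake_fetch(url, semaphore) for url in urls]
    # gather returns results in the same order as the input
    return await asyncio.gather(*tasks)


urls = [f"http://example.com/page/{i}" for i in range(5)]
pages = asyncio.run(crawl(urls))
print(len(pages))
```

Capping concurrency also serves as a simple form of rate limiting, which fits the delay-based countermeasures listed above.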


Python crawler: Autohome crawler (complete code)

System

2024-08-16

All, Web scraping




import requests
from lxml import etree
import csv
 
def get_car_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    return html
 
def parse_car_info(html):
    # Car name
    name = html.xpath('//div[@class="car-intro-cont"]/h1/text()')[0]
    # Car price
    price = html.xpath('//div[@class="car-price-cont"]/p[@class="car-price-num"]/text()')[0]
    # Car image
    image = html.xpath('//div[@class="car-intro-img"]/img/@src')[0]
    # Car specifications
    parameters = html.xpath('//div[@class="car-param-cont"]/p/text()')
    return {
        'name': name,
        'price': price,
        'image': image,
        'parameters': parameters
    }
 
def save_car_info(car_info, file_name):
    with open(file_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'price', 'image', 'parameters'])
        writer.writerow([car_info['name'], car_info['price'], car_info['image'], ','.join(car_info['parameters'])])
 
def main():
    url = 'https://www.autohome.com.cn/156148/#pvareaid=201571'
    html = get_car_info(url)
    car_info = parse_car_info(html)
    save_car_info(car_info, 'car_info.csv')
 
if __name__ == '__main__':
    main()

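This code scrapes information about a specific car from the Autohome site and saves it to a CSV file. It defines functions for fetching the page, parsing it, and saving the data, and main calls them in sequence. In practice, you need to adjust the XPath expressions to the target site's actual structure.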
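When an XPath expression does not match the page, `xpath()` returns an empty list and indexing it with `[0]` raises IndexError. A small defensive helper, sketched here on plain lists so that it runs without lxml; the name `first_or_default` is made up for illustration.

```python
def first_or_default(matches, default=None):
    """Return the first match with surrounding whitespace removed, or default if empty."""
    if not matches:
        return default
    value = matches[0]
    return value.strip() if isinstance(value, str) else value


# Plain lists standing in for the results of html.xpath(...)
print(first_or_default(['  Model X  ']))
print(first_or_default([], default='N/A'))
```

In parse_car_info, each `html.xpath(...)[0]` could be wrapped as `first_or_default(html.xpath(...))`, so that a changed page layout produces empty fields instead of a crash.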
