Articles tagged python

2024-08-19




import requests
import execjs

# Request the page
url = 'http://example.com/path/to/page'
response = requests.get(url)

# JS code: locate the encryption function in the page's scripts and call it
js_code = """
function encrypt(data) {
    // The site's encryption logic goes here
    // ... (it must return the encrypted value)
}
"""

# Compile and run the JS code with execjs
ctx = execjs.compile(js_code)
encrypted_data = ctx.call('encrypt', 'your_data_here')

# Send the encrypted data in a POST request
post_url = 'http://example.com/path/to/post/endpoint'
post_data = {
    'encryptedField': encrypted_data
}
post_response = requests.post(post_url, data=post_data)

# Print the result
print(post_response.text)

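This example shows how to use Python's requests library to fetch a page and the execjs library to run a supplied JavaScript encryption function, then send the encrypted data in a POST request. It is a basic example of handling JavaScript-side encryption when writing a web crawler.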
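The step of locating the encryption function inside a fetched page can be sketched with the standard library alone. This is a minimal sketch, assuming the function is defined in an inline script with the plain `function name(` form; the class name `InlineScriptCollector` and the helper `find_script_defining` are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collects the text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script text can arrive in several pieces, so append to the current entry
        if self._in_script:
            self.scripts[-1] += data


def find_script_defining(html, function_name):
    """Return the first inline script that defines function_name, or None."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    marker = f'function {function_name}('
    for script in collector.scripts:
        if marker in script:
            return script
    return None


# Made-up page content for illustration
sample_html = (
    '<html><script src="lib.js"></script>'
    '<script>function encrypt(data) { return data; }</script></html>'
)
js_source = find_script_defining(sample_html, 'encrypt')
print(js_source)
```

The returned string could then be handed to execjs.compile in place of a hand-copied js_code. Scripts loaded through a src attribute have no inline text and would need to be downloaded separately.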


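Python | A Brief Look at the Origins of Web Crawlers

System

2024-08-19

All, Crawlers

A web crawler, also known as a web spider, is a program or script that automatically fetches web pages according to a set of rules.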

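Crawlers trace back to the early days of search engines, which had to collect large numbers of web pages and therefore needed programs that could fetch pages automatically. One of the earliest known examples is the World Wide Web Wanderer, written in 1993 to measure the size of the web. As the internet has grown, crawlers have spread to many other uses, including data analysis, business intelligence, and machine learning.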

Below is a simple Python crawler example using the requests and BeautifulSoup libraries:




import requests
from bs4 import BeautifulSoup
 
def get_html(url):
    response = requests.get(url)
    return response.text
 
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all('a')
 
def print_links(links):
    for link in links:
        print(link.get('href'))
 
url = 'https://www.example.com'
html = get_html(url)
links = parse_html(html)
print_links(links)

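This simple crawler first defines a function that fetches the HTML content, then a function that parses the HTML and finds all <a> tags, and finally a function that prints every link. It can serve as an introductory example for writing crawlers.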
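The href values printed by such a crawler are often relative, duplicated, or missing. A minimal sketch of cleaning them up with the standard library follows; `normalize_links` is a hypothetical helper name and the sample hrefs are made up.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skips None and empty strings
            continue
        # Make the link absolute and drop any #fragment part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # ignore mailto:, javascript:, and similar links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    'https://www.example.com/docs/',
    ['/about', 'page.html#top', None, 'mailto:someone@example.com', 'page.html'],
)
print(links)
```

In the crawler above, the input list could be built as [link.get('href') for link in links] before printing.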


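[Python Crawler] Scraping Ximalaya with Python: a crawler tutorial!

System

2024-08-19

All, Crawlers

Below is a simple Python crawler example that scrapes audio playback links from the Ximalaya website.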




import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Holds the title and URL of one audio item
class AudioInfo:
    def __init__(self, title, url):
        self.title = title
        self.url = url

# Fetch the audio entries listed on a page
def get_audio_info(url):
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        audio_list = soup.find_all('a', class_='audio-title')
        audio_infos = []
        for audio in audio_list:
            title = audio.text.strip()
            # href may be relative, so resolve it against the page URL
            audio_url = urljoin(url, audio['href'])
            audio_infos.append(AudioInfo(title, audio_url))
        return audio_infos
    else:
        return []

# Extract playback URLs from page text with a regular expression
def extract_audio_url(text):
    pattern = re.compile(r'play_url:"(.*?)"')
    return pattern.findall(text)

# Test function
def test_get_audio_info(url):
    audio_infos = get_audio_info(url)
    for info in audio_infos:
        print(info.title, info.url)
        # The pattern matches page content, not the link itself,
        # so fetch the audio page before searching it
        page = requests.get(info.url, timeout=10)
        for play_url in extract_audio_url(page.text):
            print(play_url)

# Main function
def main():
    # URL of a Ximalaya category page
    category_url = 'https://www.ximalaya.com/category/500033/'
    test_get_audio_info(category_url)

if __name__ == '__main__':
    main()

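This code first defines an AudioInfo class to store each audio item's title and URL. The get_audio_info function fetches all audio entries on the given URL and returns a list of AudioInfo objects. The extract_audio_url function uses a regular expression to extract the actual playback URLs from page text. The test_get_audio_info function exercises these functions and prints the results. Finally, main defines the URL of a Ximalaya category page and calls test_get_audio_info to run the crawler.

Note: Ximalaya uses anti-scraping measures, so a real run may need suitable request headers, login handling, proxies, and so on. This code is only an example that demonstrates basic crawler logic.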
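The regular-expression step can be tried offline on a small piece of text. This is a minimal sketch, assuming the field is named play_url as in the example above; the sample string is made up, and the pattern is loosened slightly so that it also accepts the JSON-style quoted key.

```python
import re

# Accepts both play_url:"..." and "play_url": "..."
PLAY_URL_PATTERN = re.compile(r'"?play_url"?\s*:\s*"(.*?)"')


def extract_audio_url(text):
    """Return every play_url value found in text, in order of appearance."""
    return PLAY_URL_PATTERN.findall(text)


sample = (
    'var track = {play_url:"https://example.com/a.m4a", '
    '"play_url": "https://example.com/b.m4a"};'
)
print(extract_audio_url(sample))
```

Testing the pattern on saved page text like this makes it easier to adjust when the site's markup differs from what the example expects.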


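Python crawler: decrypting encrypted m3u8 video files and merging them into MP4

System

2024-08-19

All, Crawlers

To parse an encrypted m3u8 video and merge it into an MP4 file, you can use the pyshaka.hls module to handle the encrypted m3u8 file. Below is a simple Python script that shows how to use pyshaka.hls to download and merge the segments of an encrypted m3u8 video.

First, install the pyshaka library: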




pip install pyshaka

Then use the following code to merge the encrypted m3u8 video:




import os
from pyshaka.hls import HLSMediaPlaylist, MediaSegment
from pyshaka.utils import download_media_segment

# Set the m3u8 URL and the output path
m3u8_url = "your_encrypted_m3u8_file_url"
base_url = os.path.dirname(m3u8_url)
save_path = "output.mp4"

# Download the m3u8 playlist
playlist = HLSMediaPlaylist.from_url(m3u8_url)

# Open the output file for writing
with open(save_path, "wb") as mp4_file:
    # Walk through every media segment in the playlist
    for segment in playlist.segments:
        # Download the media segment
        segment_data = download_media_segment(segment.uri)

        # Append the downloaded data to the output file
        mp4_file.write(segment_data)

# When finished, the output file contains all of the playlist's media segments in order.

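Be sure to replace your_encrypted_m3u8_file_url with the actual URL of your encrypted m3u8 file. The script only concatenates the segments in playlist order; it assumes you already have the key and any other credentials needed to decrypt the video. If you need to handle authentication, extend the script with the relevant logic. The pyshaka import paths and function names are reproduced as given here, so confirm them against the version you install.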
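Before any segment can be decrypted, the playlist itself has to be read for its key information, which HLS declares in an #EXT-X-KEY tag. The following is a minimal sketch using only the standard library, not the pyshaka API; `parse_media_playlist` is a hypothetical helper, the sample playlist is made up, and the sketch keeps only the last key it sees even though a real playlist may switch keys between segments.

```python
import re
from urllib.parse import urljoin

KEY_TAG = '#EXT-X-KEY:'


def parse_media_playlist(text, playlist_url):
    """Return (key_attributes_or_None, absolute_segment_urls) for a media playlist."""
    key = None
    segments = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith(KEY_TAG):
            # Attributes look like NAME=value or NAME="quoted value", separated by commas
            pairs = re.findall(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)', line[len(KEY_TAG):])
            key = {name: value.strip('"') for name, value in pairs}
            if 'URI' in key:
                key['URI'] = urljoin(playlist_url, key['URI'])
        elif line and not line.startswith('#'):
            # Any non-comment, non-empty line is a segment URI
            segments.append(urljoin(playlist_url, line))
    return key, segments


sample_playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-KEY:METHOD=AES-128,URI="key.bin",IV=0x00000000000000000000000000000001
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""
key, segments = parse_media_playlist(sample_playlist, 'https://example.com/video/index.m3u8')
print(key)
print(segments)
```

With METHOD=AES-128, the bytes fetched from the key URI and the IV would then be passed to an AES-CBC implementation to decrypt each segment before it is written to the output file.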


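[Learning Python] Web crawler: scraping Douyu car model videos

System

2024-08-19

All, Crawlers

To scrape car model videos from Douyu with Python, you can use the requests library to download page content and BeautifulSoup to parse it. Below is a simple example that fetches a car model video list page and extracts the video URLs.

First, make sure the required libraries are installed: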




pip install requests beautifulsoup4 lxml

Then you can use the following code to scrape the video list:




import requests
from bs4 import BeautifulSoup

# URL of the car model video list page
url = 'https://www.dajia.com/video/list/1-1-1'

# Send the HTTP request
response = requests.get(url)

# Make sure the request succeeded
if response.status_code == 200:
    # Parse the page
    soup = BeautifulSoup(response.text, 'lxml')

    # Find the elements that hold video information
    videos = soup.find_all('div', class_='video-info')

    # Walk the video entries and read each title and URL
    for video in videos:
        link = video.find('a', class_='video-title')
        if link is None or not link.get('href'):
            continue  # skip entries without a title link
        title = link.text.strip()
        video_url = link['href']
        print(f"Title: {title}")
        print(f"Video URL: {video_url}")
        # Code to download the video can be added here
else:
    print("Failed to retrieve the webpage")

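Note that most websites have anti-crawler measures. Depending on the site's protections, you may need to handle cookies, headers, proxies, login authentication, and similar issues.

In addition, respect the site's copyright and its robots.txt rules when scraping, and avoid large-scale crawling that puts unnecessary load on the server.

The code above is only a simple example; real use may call for more error handling and a more adaptive crawling strategy.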
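Checking robots.txt rules before fetching can be done with the standard library's urllib.robotparser. This is a minimal sketch: the rules text and the `my-crawler` user agent are made up for illustration, and `build_robot_rules` is a hypothetical helper name. A real crawler would first download the site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules: everything is allowed except paths under /private/
rules = build_robot_rules('User-agent: *\nDisallow: /private/\n')

allowed = rules.can_fetch('my-crawler', 'https://example.com/video/list')
blocked = rules.can_fetch('my-crawler', 'https://example.com/private/data')
print(allowed, blocked)
```

A crawler can call can_fetch for each URL and skip the request whenever it returns False.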


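Python notes: request, crawlers, socket, multithreading

System

2024-08-19

All, Crawlers

This is an overview, with code examples, of network requests, a simple crawler, socket programming, and multithreading in Python.

Network requests with the requests library: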




import requests
 
response = requests.get('https://www.example.com')
print(response.text)

A simple crawler that parses HTML with BeautifulSoup:




from bs4 import BeautifulSoup
import requests
 
url = 'https://www.example.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
 
# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))

Socket programming example:




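import socket

# Create a socket object
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Connect to the server
s.connect(('www.example.com', 80))

# Send the request (sendall keeps sending until every byte is written)
s.sendall(b'GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n')

# Receive the response; recv returns b'' once the server closes the connection
chunks = []
while True:
    chunk = s.recv(1024)
    if not chunk:
        break
    chunks.append(chunk)
response = b''.join(chunks)

print(response)

# Close the connection
s.close()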

Multithreading example:




from threading import Thread
 
def task(n):
    print(f'Thread {n} is running')
 
# Create and start the threads
t1 = Thread(target=task, args=(1,))
t2 = Thread(target=task, args=(2,))
 
t1.start()
t2.start()
 
t1.join()
t2.join()
 
print('All threads completed')

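These snippets show how to do network requests, simple crawling, socket programming, and multithreading in Python. These techniques are widely used in areas such as data science, network programming, and distributed systems.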
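When there are more than a couple of tasks, creating each Thread by hand becomes tedious. A minimal sketch of the same idea with a thread pool from the standard library follows; the task body is a stand-in that sleeps instead of doing real I/O, and `run_tasks` is a hypothetical helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def task(n):
    """Simulate a short I/O-bound job and return a result for input n."""
    time.sleep(0.05)
    return n * n


def run_tasks(numbers, max_workers=4):
    # executor.map yields results in input order, even if workers finish out of order
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(task, numbers))


results = run_tasks([1, 2, 3, 4, 5])
print(results)
```

The with block waits for all submitted work to finish, which plays the role of the join calls in the example above.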


Python Crawler 6: High-Performance Asynchronous Crawlers

System

2024-08-19

All, Crawlers




import asyncio
import aiohttp
 
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
 
async def main():
    urls = ['http://httpbin.org/delay/1', 'http://httpbin.org/delay/2'] * 100
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        html_list = await asyncio.gather(*tasks)
        for html in html_list:
            print(len(html))
 
if __name__ == '__main__':
    asyncio.run(main())

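This code uses the asyncio and aiohttp libraries to make high-performance asynchronous network requests. The fetch function requests a single URL and returns the response text. The main function is the program's entry point: it creates a ClientSession and uses it to fetch many URLs concurrently. asyncio.gather runs the tasks concurrently and collects the results once they have all finished. This model can significantly improve throughput when a crawler has to make a large number of network requests.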
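Launching every request at once can overwhelm the target server, so a common refinement is to cap the number of requests in flight with asyncio.Semaphore. This is a minimal sketch that stays runnable without network access: `fetch_limited` sleeps in place of a real session.get call, and the names `fetch_limited`, `crawl`, and the `stats` dictionary are assumptions made for illustration.

```python
import asyncio


async def fetch_limited(semaphore, url, stats):
    """Stand-in for a real request: sleeps instead of doing network I/O."""
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # placeholder for the actual HTTP request
        stats['active'] -= 1
        return url


async def crawl(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    results = await asyncio.gather(
        *(fetch_limited(semaphore, url, stats) for url in urls)
    )
    # peak records the largest number of requests that ran at the same time
    return results, stats['peak']


urls = [f'http://example.com/page/{i}' for i in range(20)]
results, peak = asyncio.run(crawl(urls, limit=5))
print(len(results), peak)
```

In the aiohttp version, the async with semaphore block would wrap the session.get call inside fetch, leaving the rest of main unchanged.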

System

2024-08-19

All, Crawlers




from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the browser driver
# (Selenium 4 takes the driver path through a Service object)
driver_path = 'path/to/your/webdriver'
browser = webdriver.Chrome(service=Service(driver_path))

# Open the Tencent Docs login page
browser.get('https://docs.qq.com/')

# Wait until the login button is clickable
login_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#login-button'))
)
login_button.click()

# Enter the account and password, then log in
username = browser.find_element(By.CSS_SELECTOR, '#switchAccount > div.login-input-con.account-login-input-con > input[type=text]')
password = browser.find_element(By.CSS_SELECTOR, '#switchAccount > div.login-input-con.password-login-input-con > input[type=password]')
username.send_keys('your_qq_account')
password.send_keys('your_password')
password.send_keys(Keys.RETURN)

# Wait for the page to load after login
my_files = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#my-files'))
)

# Clock-in step
# This function is only an illustration; adapt the selector to the real page
def clock_in():
    # Locate the clock-in button and click it
    clock_in_button = browser.find_element(By.CSS_SELECTOR, '#clock-in-button')
    clock_in_button.click()

# Run the clock-in
clock_in()

# Close the browser
browser.quit()

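This example shows how to use Selenium to open the Tencent Docs login page, log in with an account and password, and perform a clock-in. Explicit waits (WebDriverWait) make sure page elements have loaded before the script interacts with them. The browser is closed once the work is done. The code is a basic skeleton that can be adjusted and extended to fit the actual page.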
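Hard-coding the account and password in the script is risky if the file is ever shared. One possible adjustment is to read them from environment variables; in this minimal sketch, `load_credentials` and the variable names QQ_ACCOUNT and QQ_PASSWORD are hypothetical choices, not part of the original script.

```python
import os


def load_credentials(env=os.environ):
    """Read the account and password from a mapping of environment variables.

    QQ_ACCOUNT and QQ_PASSWORD are hypothetical names chosen for this sketch.
    """
    missing = [name for name in ('QQ_ACCOUNT', 'QQ_PASSWORD') if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env['QQ_ACCOUNT'], env['QQ_PASSWORD']


# Example with an explicit mapping; in real use, call load_credentials() with no arguments
account, password = load_credentials({'QQ_ACCOUNT': '10001', 'QQ_PASSWORD': 'secret'})
print(account)
```

The two returned values would replace the literal strings passed to send_keys.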


Python | Sea Surface Temperature (SST) | Long-Term Trend and Anomaly Analysis

System

2024-08-19

All, python




import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

# Persistent anomaly = deviation of each value from its trailing running mean
def calculate_persistent_anomalies(data, window):
    running_mean = data.rolling(time=window).mean()
    return data - running_mean

# Synthetic example data: 20 years of daily SST at 100 stations (a random walk).
# In practice, load a real xarray.Dataset or xarray.DataArray here instead.
time = pd.date_range('2000-01-01', periods=365 * 20, freq='D')
values = np.random.normal(size=(time.size, 100)).cumsum(axis=0)
data = xr.DataArray(values, dims=['time', 'station'],
                    coords={'time': time, 'station': np.arange(values.shape[1])})

# Window size, e.g. 10 years of daily data (it must be shorter than the record)
window = 365 * 10

# Compute the persistent anomalies
persistent_anomalies = calculate_persistent_anomalies(data, window)

# Z-score each station's anomalies over time (NaNs from the warm-up window are skipped)
zscores = (persistent_anomalies - persistent_anomalies.mean('time')) / persistent_anomalies.std('time')

# Flag values more than 2 standard deviations above the mean
anomalies = zscores > 2

# Visualize the result
fig, ax = plt.subplots()
ax.imshow(anomalies.T.values.astype(int), cmap='viridis', aspect='auto', vmin=0, vmax=1)
# One x tick at the first day of each year; a tick per day would be unreadable
years = data.time.dt.year.values
year_starts = np.flatnonzero(np.diff(years, prepend=years[0] - 1))
ax.set_xticks(year_starts)
ax.set_xticklabels([str(year) for year in years[year_starts]], rotation=90)
ax.set_xlabel('Year')
ax.set_ylabel('Station')
ax.set_title('Persistent Anomalies')
plt.show()

# Note: this example assumes the data are stationary and does not show how to load real data.
# In practice, replace the data-loading part and adjust the anomaly analysis to your case.

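This example shows how to compute persistent anomalies in sea surface temperature data and use Z-scores to set the anomaly threshold. It then compares the anomalies against the threshold and displays the result as a plot. The workflow can serve as a basic template for persistent-anomaly analysis.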
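The long-term trend named in the title can be estimated with an ordinary least-squares line through the series. This is a minimal sketch in plain Python, assuming evenly spaced samples; `linear_trend` is a hypothetical helper and the annual-mean values are made up so that the true slope is known.

```python
def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through values.

    The x axis is the sample index 0, 1, 2, ..., so the slope is in units per time step.
    """
    n = len(values)
    if n < 2:
        raise ValueError('need at least two samples')
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical annual-mean SST values rising by 0.02 degrees per year
annual_sst = [15.0 + 0.02 * year for year in range(30)]
slope, intercept = linear_trend(annual_sst)
print(f'trend: {slope:.3f} per year, intercept: {intercept:.2f}')
```

Subtracting the fitted line from the series before computing anomalies is one way to keep a steady warming trend from being flagged as an anomaly.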


Design and Implementation of a Python-Based Movie Box Office Scraping and Visualization System

System

2024-08-19

All, python




import requests
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

# Fetch the movie box office data
def get_movie_boxoffice_data(url):
    response = requests.get(url)
    data = response.json()
    return data['boxoffice']

# Save the box office data to a CSV file
# (assumes data maps each movie title to a box office figure, as the chart code below does)
def save_boxoffice_data_to_csv(data, filename):
    df = pd.DataFrame(list(data.items()), columns=['movie', 'boxoffice'])
    df.to_csv(filename, index=False)

# Build a bar chart from the box office data
def generate_bar_chart(data, title):
    c = (
        Bar()
        .add_xaxis(list(data.keys()))
        .add_yaxis("Box office", list(data.values()))
        .set_global_opts(title_opts=opts.TitleOpts(title=title))
    )
    return c

# Example usage
if __name__ == "__main__":
    # URL of the movie box office data API
    url = 'http://api.example.com/boxoffice'
    # Fetch the data
    boxoffice_data = get_movie_boxoffice_data(url)
    # Save it to CSV
    save_boxoffice_data_to_csv(boxoffice_data, 'boxoffice_data.csv')
    # Build the bar chart
    chart = generate_bar_chart(boxoffice_data, 'Movie box office bar chart')
    # Render the chart to a file
    chart.render('boxoffice_chart.html')

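This example shows how to do simple data fetching, data saving, and data visualization in Python. It first defines a function that fetches movie box office data, then a function that saves the data to a CSV file, and finally a function that builds a bar chart from the data. The API URL in the code should be replaced with the address of a real API service.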
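A bar chart with every movie on it quickly becomes crowded, so it can help to keep only the highest-grossing entries before charting. This is a minimal sketch, assuming the data is a dictionary that maps movie titles to box office figures; `top_n_by_boxoffice` is a hypothetical helper and the sample figures are made up.

```python
def top_n_by_boxoffice(data, n=10):
    """Return the n highest-grossing entries of data, largest first."""
    ranked = sorted(data.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Made-up figures for illustration only
sample = {'Movie A': 120.5, 'Movie B': 980.0, 'Movie C': 310.2}
top_two = top_n_by_boxoffice(sample, n=2)
print(top_two)
```

The returned dictionary keeps its sorted order, so it could be passed to generate_bar_chart in place of the full data set.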
