Articles in the category "Backend Technology"

2024-08-25




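import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd
import matplotlib.pyplot as plt
 
# Fetch the label/value fields of a single listing
def get_source_info(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    info_list = soup.select('.info-list li')
    info_dict = {}
    for info in info_list:
        spans = info.select('span')
        # 'a, span' matches either tag; the original 'a|span' is CSS namespace syntax
        fields = info.select('a, span')
        if not spans:
            continue
        key = spans[0].text.strip()
        value = fields[1].text.strip() if len(fields) > 1 else ''
        info_dict[key] = value
    return info_dict
 
# Fetch the title and detailed parameters from a listing's detail page
# (defined for completeness; it is not called below)
def get_source_details(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select('.title-bar-01')[0].text.strip()
    info_list = soup.select('.house-parameter li')
    info_dict = {}
    for info in info_list:
        spans = info.select('span')
        if len(spans) < 2:
            continue  # skip items without a label/value pair
        info_dict[spans[0].text.strip()] = spans[1].text.strip()
    return title, info_dict
 
# Collect the data for every listing on a list page
def get_source_data(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    data_list = soup.select('.house-list-wrap .house-list-item')
    source_data = []
    for data in data_list:
        # Links may be relative, so resolve them against the list page URL
        detail_url = urljoin(url, data.select('a')[0]['href'])
        info_dict = get_source_info(detail_url)
        info_dict['title'] = data.select('.house-title')[0].text.strip()
        info_dict['price'] = data.select('.price')[0].text.strip()
        source_data.append(info_dict)
    return source_data
 
# Fetch the second-hand housing data
source_data = get_source_data('http://ershou.jilin.cn/ershoufang/')
df = pd.DataFrame(source_data)
 
# Scraped values are strings; keep only the numeric part before plotting.
# Assumes the parsed fields include an 'area' column.
for col in ('area', 'price'):
    df[col] = pd.to_numeric(df[col].astype(str).str.extract(r'(\d+(?:\.\d+)?)')[0], errors='coerce')
 
# Data visualization
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.scatter(df['area'], df['price'])
plt.xlabel('Area (m²)')
plt.ylabel('Price (10,000 CNY)')
plt.title('Second-hand housing: area vs. price')
 
plt.subplot(1, 2, 2)
plt.hist(df['price'].dropna(), bins=50)
plt.xlabel('Price (10,000 CNY)')
plt.ylabel('Count')
plt.title('Second-hand housing: price distribution')
 
plt.show()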

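This code first defines get_source_info, which parses the fields of a single listing, and then get_source_details, which parses the title and detailed parameters from a listing's detail page (it is defined but not called in this snippet). Finally, get_source_data collects the data for every listing on the list page, and the result is stored in a DataFrame for visual analysis. matplotlib.pyplot is used to draw a scatter plot and a histogram, showing the relationship between area and price and the distribution of prices.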
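Scraped fields such as area and price arrive as text, so they usually need to be turned into numbers before they can be plotted. The following is a minimal sketch of that step using only the standard library. The helper name parse_number and the sample strings are hypothetical and are not taken from the target site.

```python
import re

def parse_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    # Drop thousands separators so that '1,200' is read as 1200
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

# Hypothetical sample values in the style of a listing page
print(parse_number('89.5平方米'))  # 89.5
print(parse_number('1,200万'))     # 1200.0
print(parse_number('面议'))        # None
```

Returning None for text without digits lets the caller decide whether to drop the row or keep it as a missing value.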


In Depth: Crawler Techniques for Effectively Handling JavaScript-Rendered Content

System

2024-08-25

All, Crawler




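// Suppose we have a simple web page whose content is rendered by a JavaScript function.
// A crawler may handle such a page incorrectly, so we need a way to detect this situation.
 
// Detect whether the page content is rendered by JavaScript
function detectJavascriptRendering(win) {
    // Try to get an element that should have been rendered by the server
    const serverRenderedElement = win.document.getElementById('server-rendered-content');
 
    // If the element is missing, the content is probably rendered by JavaScript
    if (!serverRenderedElement) {
        console.log('The page content may be rendered by JavaScript.');
        // More detection logic can be added here, such as checking specific events or variables
    } else {
        console.log('The page content is rendered directly by the server.');
    }
}
 
// A mock object standing in for a browser window.
// It is named mockWindow because redeclaring the global `window` fails in a browser script.
const mockWindow = {
    document: {
        getElementById: function(id) {
            if (id === 'server-rendered-content') {
                // Pretend that server-rendered content exists here
                return '<div id="server-rendered-content">...</div>';
            }
            return null;
        }
    }
};
 
// Run the detection against the mock window object
detectJavascriptRendering(mockWindow);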

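This example shows how to check whether a page's content is rendered by JavaScript. It looks for an element that the server is expected to render; if that element is missing, the content was probably generated dynamically by JavaScript. The approach is intended for teaching purposes, to help crawler authors recognize and handle JavaScript-rendered content.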
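On the crawler side, the same idea can be applied to the raw HTML that a plain HTTP request returns. Below is a minimal sketch in Python that uses only the standard library. The names IdFinder and has_static_element and both sample pages are hypothetical, and a missing element is only a hint, not proof, that the page relies on JavaScript.

```python
from html.parser import HTMLParser

class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('id') == self.wanted_id:
            self.found = True

def has_static_element(html, element_id):
    """Return True if the raw (unrendered) HTML already contains the element."""
    finder = IdFinder(element_id)
    finder.feed(html)
    return finder.found

# Hypothetical pages: one rendered by the server, one filled in by a script
static_page = '<html><body><div id="server-rendered-content">Hello</div></body></html>'
js_page = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

print(has_static_element(static_page, 'server-rendered-content'))  # True
print(has_static_element(js_page, 'server-rendered-content'))      # False
```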


[Crawler + Data Cleaning + Data Analysis + Visualization] Text Mining the Reviews of "狂飙" with Python

System

2024-08-25

All, Crawler




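import pandas as pd
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud
 
# Load the data
df = pd.read_csv('data.csv', encoding='utf-8')
 
# Tokenize each comment with jieba (missing comments are treated as empty strings)
df['word_seg'] = df['comment'].fillna('').astype(str).apply(lambda x: ' '.join(jieba.cut(x)))
 
# Build the word-frequency table (value_counts already sorts in descending order)
word_series = pd.Series(' '.join(df['word_seg']).split())
word_df = word_series.value_counts().head(1000).reset_index()
word_df.columns = ['word', 'count']
 
# Word-cloud visualization
# Load the mask as an 8-bit array; WordCloud expects values in 0-255 and treats white as masked out
cloud_mask = np.array(Image.open('star.png'))
# font_path must point to a font with Chinese glyphs; 'simhei.ttf' is an assumed example
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', mask=cloud_mask, contour_width=3, contour_color='steelblue')
word_frequencies = dict(zip(word_df['word'], word_df['count']))
wordcloud = wordcloud.generate_from_frequencies(word_frequencies)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()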

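This code imports the required Python libraries and loads the data. It then tokenizes the comments with jieba and builds a word-frequency table. Finally, it generates a word cloud from the frequencies, showing the most common words in the comments. The snippet covers tokenization, frequency counting, and visualization; it does not perform sentiment analysis.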
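Raw frequency tables for Chinese text tend to be dominated by particles and punctuation, so a filtering step is often added before drawing the cloud. Here is a minimal sketch of that step with collections.Counter. The function name word_frequencies, its parameters, and the sample tokens are hypothetical; in practice the tokens would come from jieba.cut.

```python
from collections import Counter

def word_frequencies(tokens, stop_words=frozenset(), min_length=2, top_n=1000):
    """Count tokens, skipping stop words and tokens shorter than min_length."""
    cleaned = (token.strip() for token in tokens)
    kept = [t for t in cleaned if len(t) >= min_length and t not in stop_words]
    return Counter(kept).most_common(top_n)

# Hypothetical tokens standing in for the output of jieba.cut
tokens = ['剧情', '的', '精彩', '剧情', '演技', '的', '精彩', '剧情']
print(word_frequencies(tokens, stop_words={'的'}))
# [('剧情', 3), ('精彩', 2), ('演技', 1)]
```

The list of (word, count) pairs can be passed to dict() and then to the word cloud in place of the unfiltered table.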


Exploring Python Crawler: An Efficient Data-Scraping Tool

System

2024-08-25

All, Crawler




import requests
from bs4 import BeautifulSoup
 
def crawl_data(url):
    """
    Fetch data from the given URL.
    """
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        # Network errors and timeouts also count as a failed crawl
        return "Failed to crawl data"
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assume the data we want is inside <div id="content"></div>
        content = soup.find('div', {'id': 'content'})
        if content:
            return content.get_text()
    return "Failed to crawl data"
 
# Usage
url = "http://example.com"
print(crawl_data(url))

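This code shows how to scrape data from a web page with Python's requests and BeautifulSoup libraries. The crawl_data function takes a URL, sends an HTTP GET request to it, and parses the returned page with BeautifulSoup. It then looks for a specific HTML element (here, a div whose id is "content") and returns that element's text. If the request fails or the element is missing, it returns an error message.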


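Writing a Python Crawler to Scrape JD.com Product Information

System

2024-08-25

All, Crawler

To write a Python crawler that scrapes product information from JD.com, you can use the requests library to send HTTP requests and BeautifulSoup to parse the HTML. The simple example below shows how to fetch the basic information for one product.

First, make sure the required libraries are installed: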




pip install requests beautifulsoup4

Then you can use the following code as a basic skeleton for the crawler:




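import requests
from bs4 import BeautifulSoup
 
# Product URL
product_url = 'https://item.jd.com/100012043978.html'
 
# Send the HTTP request with a browser-like User-Agent and a timeout
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(product_url, headers=headers, timeout=10)
response.raise_for_status()  # Raise an exception if the request failed
 
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
 
# Extract the product fields. The selectors are assumptions about the page
# structure; find() returns None when an element is missing (for example when
# the price is filled in by JavaScript), so check before reading from it.
name_tag = soup.find('div', class_='sku-name')
price_tag = soup.find('div', class_='pin')
img_tag = soup.find('img', class_='J_Img')
 
product_name = name_tag.get_text(strip=True) if name_tag else None
product_price = price_tag.get_text(strip=True) if price_tag else None
product_img = img_tag.get('src') if img_tag else None
 
# Print the product information
print(f"Product name: {product_name}")
print(f"Product price: {product_price}")
print(f"Product image: {product_img}")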

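Note that to get more information you may need to analyze the HTML structure and extract further elements as required. Many sites also employ anti-crawling measures, such as JavaScript-rendered content, dynamically loaded data, or content that is only available after login. In those cases you may need a more involved approach, such as using a tool like Selenium to simulate a browser.

In addition, follow the site's robots.txt rules and the applicable laws and regulations, and do not abuse the service.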
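Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses a robots.txt body that has already been downloaded and asks whether a URL may be fetched. The helper name is_allowed, the user agent 'my-crawler', and the sample rules are hypothetical and are not JD.com's actual rules.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only
sample_robots = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_robots, 'my-crawler', 'https://example.com/item/1.html'))  # True
print(is_allowed(sample_robots, 'my-crawler', 'https://example.com/private/data'))  # False
```

In a real crawler the robots.txt text would be fetched once per site and the check would run before each request.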


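How to Localize PyCharm into Chinese (Switching PyCharm to the Chinese Version)

System

2024-08-25

All, python

Localizing PyCharm into Chinese is simple: install the PyCharm Chinese language pack and have PyCharm use it. The steps are:

1. Open PyCharm and choose File > Settings, or click the Settings icon (the gear icon).
2. In the navigation menu on the left of the Settings window, select Plugins.
3. In the Plugins view, type Chinese into the search box, select the Chinese (Simplified) Language Pack plugin, and click Install.
4. When the installation finishes, restart PyCharm. The interface should now be in Chinese.

If your PyCharm is already the latest version, it usually ships with Chinese language support built in. If it does not, follow the steps above.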

System

2024-08-25

All, python

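Explanation:

ModuleNotFoundError: No module named 'tensorflow' means that the Python interpreter cannot find a module named tensorflow. This usually happens when you try to import a library that is not installed in the current Python environment.

Solution:

1. Check whether tensorflow is already installed by running:
```
pip show tensorflow
```
If pip reports that the package is not found, it needs to be installed.
2. If tensorflow is not installed, install it with pip:
```
pip install tensorflow
```
The separate tensorflow-gpu package is no longer maintained; current tensorflow releases include GPU support. On Linux, the CUDA libraries can be installed along with it:
```
pip install "tensorflow[and-cuda]"
```
3. If you use a Python virtual environment, make sure tensorflow is installed in the environment you are actually running.
4. If tensorflow is installed but the error persists, your Python interpreter may not be pointing at the right environment. Check your environment variables and your IDE's interpreter setting.
5. If you have several Python versions, make sure the version you installed tensorflow for matches the version that runs your program.
6. If you use a conda environment, you can install with conda:
```
conda install tensorflow
```
or with the conda command specific to your environment.
7. If none of the steps above helps, check your Python environment paths or consider reinstalling Python and pip.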
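Several of the steps above come down to one question: which interpreter is running, and can it see the module? A minimal sketch of that check, using only the standard library, is shown below. The helper name describe_module is hypothetical.

```python
import importlib.util
import sys

def describe_module(name):
    """Report whether the running interpreter can find a top-level module."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name!r} is not installed for {sys.executable}"
    return f"{name!r} found at {spec.origin}"

print(describe_module('tensorflow'))
```

Running this with the same interpreter that raises the error shows its path, which can then be compared with the environment where pip or conda installed the package. Installing with `python -m pip install tensorflow` ties the install to that interpreter.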


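Python: First-Letter Shift (PythonTip)

System

2024-08-25

All, python

First-letter shift is a simple text-scrambling technique in which the leading letters of each word in a string are moved to the end of that word. Here is a simple Python function that implements it: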




def caesar_cipher(text, shift):
    # Despite its name, this rotates each word; it is not a classic Caesar cipher
    words = text.split()
    shifted_words = []
    for word in words:
        # Move the first `shift` characters of the word to its end
        shifted_words.append(word[shift:] + word[:shift])
    return ' '.join(shifted_words)
 
# Example usage
text = "hello world"
shift = 1
encrypted_text = caesar_cipher(text, shift)
print(encrypted_text)  # Output: 'elloh orldw'

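The caesar_cipher function takes two arguments: text is the string to transform, and shift is the number of leading characters to move. The function splits the string into words, moves the first shift characters of each word to the end of that word, and joins the words back into a string. Despite its name, it does not substitute letters the way a Caesar cipher does.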
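The transformation can be undone by rotating each word in the opposite direction. The sketch below pairs a forward rotation with its inverse; the names rotate_words_left and rotate_words_right are hypothetical, and the forward function is assumed to behave like caesar_cipher above for a non-negative shift.

```python
def rotate_words_left(text, shift):
    """Move the first `shift` characters of every word to the end of that word."""
    return ' '.join(word[shift:] + word[:shift] for word in text.split())

def rotate_words_right(text, shift):
    """Undo rotate_words_left for the same non-negative shift."""
    restored = []
    for word in text.split():
        if shift >= len(word):
            # The forward step leaves such short words unchanged
            restored.append(word)
        else:
            cut = len(word) - shift
            restored.append(word[cut:] + word[:cut])
    return ' '.join(restored)

scrambled = rotate_words_left('hello world', 1)
print(scrambled)                         # elloh orldw
print(rotate_words_right(scrambled, 1))  # hello world
```

Because split() collapses runs of whitespace, the round trip restores the words but not the original spacing.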


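[Python] Advanced Learning: Master the resize Method and Resize Images with Ease

System

2024-08-25

All, python

In Python, the Pillow library makes it easy to resize images. Pillow is a powerful image-processing library for opening, manipulating, and saving images in many formats. Its resize() method changes the size of an image.

Here are some examples of Pillow's resize() method:

Resizing an image to a fixed size: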




from PIL import Image
 
img = Image.open('path_to_your_image.jpg')
resized_img = img.resize((128, 128))  # Resize the image to 128x128
resized_img.save('resized_image.jpg')

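Resizing to a given width or height while keeping the aspect ratio:




from PIL import Image
 
img = Image.open('path_to_your_image.jpg')
# resize() does not fill in a missing dimension, so compute the height from the aspect ratio
new_width = 128
new_height = max(1, round(img.height * new_width / img.width))
resized_img = img.resize((new_width, new_height))  # Width 128, height scaled to match
resized_img.save('resized_image.jpg')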

Resizing with the Image.BICUBIC or Image.LANCZOS resampling filter:




from PIL import Image
 
img = Image.open('path_to_your_image.jpg')
resized_img = img.resize((128, 128), Image.BICUBIC)  # Resize using bicubic interpolation
resized_img.save('resized_image.jpg')

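Preserving image quality with resize():




from PIL import Image
 
img = Image.open('path_to_your_image.jpg')
# Image.ANTIALIAS was an alias for LANCZOS and was removed in Pillow 10
resized_img = img.resize((128, 128), Image.LANCZOS)  # Resize using the high-quality Lanczos filter
resized_img.save('resized_image.jpg')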

The code above shows how to resize images with Pillow's resize() method. Choose the resampling filter and target size that suit your needs.
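When an image has to fit inside a bounding box rather than match one fixed dimension, the target size can be worked out before calling resize(). The following sketch does only the arithmetic, so it runs without Pillow. The helper name fit_within and the sample sizes are hypothetical.

```python
def fit_within(size, box):
    """Return the largest (width, height) that fits in box and keeps the aspect ratio of size."""
    width, height = size
    box_width, box_height = box
    # The smaller scale factor is the one that makes both dimensions fit
    scale = min(box_width / width, box_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within((1920, 1080), (128, 128)))  # (128, 72)
print(fit_within((600, 800), (128, 128)))    # (96, 128)
```

The result can be passed straight to img.resize(). Pillow's Image.thumbnail() offers a similar aspect-preserving resize that modifies the image in place.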


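Implementing SM2 Encryption and Decryption in Python with the gmssl Package

System

2024-08-25

All, python

First, make sure the gmssl package is installed. If it is not, install it with pip: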




pip install gmssl

Here is example code for SM2 encryption and decryption with gmssl:




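from gmssl import sm2
 
# SM2 key pair as hex strings, supplied by the caller.
# The values below are the sample pair shown in the gmssl README; never use them in production.
private_key = '00B9AB0B828FF68872F21A837FC303668428DEA11DCD1B24429D0C99E24EED83D5'
public_key = 'B9C9A6E04E9C91F7BA880429273747D7EF5DDEB0BB2FF6317EB00BEF331A83081A6994B8993F3F5D6EADDDB81872266C87C018FB4162F5AF347B483E24620207'
 
# Create the SM2 cipher object from the key pair
sm2_crypt = sm2.CryptSM2(public_key=public_key, private_key=private_key)
 
# Data to encrypt (must be bytes)
data = b"Hello, SM2!"
 
# Encrypt the data with the public key
cipher_text = sm2_crypt.encrypt(data)
cipher_text_hex = cipher_text.hex()
 
# Decrypt the ciphertext bytes with the private key
decrypted_text = sm2_crypt.decrypt(cipher_text)
 
print(f"Original Data: {data}")
print(f"Cipher Text (hex): {cipher_text_hex}")
print(f"Decrypted Data: {decrypted_text}")

This code shows how to create an SM2 cipher object from a hex-encoded key pair, how to encrypt data, how to convert the ciphertext to a hexadecimal string for display, and how to decrypt the ciphertext bytes. In real applications the private key must be kept strictly secret, while the public key can be shared with the other parties that need to communicate.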
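If the ciphertext is stored or transmitted as a hexadecimal string, it has to be turned back into bytes before it is decrypted. A minimal sketch of that round trip with the built-in bytes methods follows. The helper names and the sample bytes are hypothetical stand-ins, not real SM2 output.

```python
def to_hex(raw):
    """Encode bytes as a lowercase hexadecimal string."""
    return raw.hex()

def from_hex(hex_text):
    """Decode a hexadecimal string back into bytes."""
    return bytes.fromhex(hex_text)

# Stand-in bytes for illustration; real ciphertext would come from the encrypt step
ciphertext = bytes([0x04, 0xA1, 0xFF, 0x00, 0x7E])
hex_text = to_hex(ciphertext)
print(hex_text)                          # 04a1ff007e
print(from_hex(hex_text) == ciphertext)  # True
```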
