Web Scraping Techniques: Extracting Data from Static Pages with Regular Expressions

This article was last modified 330 days ago; some of its content may be out of date.




import re

# Sample HTML page content
html_content = """
<html>
<head><title>Example Page</title></head>
<body>
<p>This is a paragraph with <a href="link1.html">link 1</a> and <a href="link2.html">link 2</a>.</p>
</body>
</html>
"""

# Compile a pattern that captures the href value of each <a> tag
links_pattern = re.compile(r'<a\s+href="(.*?)">')
links = links_pattern.findall(html_content)

# Print the extracted links
print(links)  # Output: ['link1.html', 'link2.html']

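This code uses Python's re module to extract the links from an HTML string. The regular expression <a\s+href="(.*?)"> matches an <a> tag whose first attribute is a double-quoted href, and the non-greedy group (.*?) captures that attribute's value. Because the pattern contains exactly one capture group, findall returns a list of the captured values rather than the full matched tags. Note that the pattern is strict: it will not match tags where another attribute precedes href, where the value is single-quoted, or where the tag is written in uppercase.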
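To illustrate the limits of the strict pattern, here is a minimal sketch of a more tolerant variant. The markup in sample_html and the names strict_pattern and tolerant_pattern are made up for this illustration and do not come from the original example. The tolerant pattern assumes that href values are always quoted; it is one possible loosening, not a general-purpose HTML matcher.

```python
import re

# Hypothetical sample markup (not from the original article): href is not
# always the first attribute, quotes vary, and one tag is uppercase.
sample_html = """
<p>
  <a class="nav" href="link1.html">link 1</a>
  <A HREF='link2.html' target="_blank">link 2</A>
  <a href="link3.html">link 3</a>
</p>
"""

# The strict pattern from the example above.
strict_pattern = re.compile(r'<a\s+href="(.*?)">')

# A more tolerant pattern: skips attributes that precede href, accepts
# single or double quotes (the \1 backreference requires the closing quote
# to match the opening one), and ignores case.
tolerant_pattern = re.compile(
    r'<a\s+[^>]*?href\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE,
)

strict_links = strict_pattern.findall(sample_html)

# With two capture groups, findall would return tuples, so iterate over the
# match objects and keep only group 2 (the URL).
tolerant_links = [m.group(2) for m in tolerant_pattern.finditer(sample_html)]

print(strict_links)    # ['link3.html']
print(tolerant_links)  # ['link1.html', 'link2.html', 'link3.html']
```

Even the tolerant pattern can be fooled by unquoted values, comments, or attribute names that merely end in href. For pages whose markup is not under your control, an HTML parser such as the standard library's html.parser module is generally a more robust choice than regular expressions.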
