Implementing a Web Crawler in Java
A web crawler in Java is typically built with Apache HttpClient to send HTTP requests and Jsoup to parse the returned HTML. The following minimal example fetches a page and prints its title:
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SimpleCrawler {
    public static void main(String[] args) {
        HttpClient client = HttpClients.createDefault();
        HttpGet request = new HttpGet("http://example.com"); // replace with the URL you want to crawl
        try {
            // Execute the request and read the response body as a string
            HttpResponse response = client.execute(request);
            String html = EntityUtils.toString(response.getEntity());
            // Parse the HTML with Jsoup and read the <title> element
            Document doc = Jsoup.parse(html);
            String title = doc.title();
            System.out.println("Title of the page: " + title);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
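If all you need is the title, you can also pull it out of the raw HTML string without Jsoup at all. The sketch below uses a regular expression for that; note that this is only a quick shortcut (a regex is not a real HTML parser and will miss edge cases that Jsoup handles), and the class name `TitleExtractor` is made up for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Naive <title> extraction; DOTALL lets the title span line breaks,
    // CASE_INSENSITIVE tolerates <TITLE> etc.
    static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        // Stand-in for the string EntityUtils.toString(...) would return
        String html = "<html><head><title>Example Domain</title></head><body></body></html>";
        System.out.println("Title of the page: " + extractTitle(html));
        // prints "Title of the page: Example Domain"
    }
}
```

For anything beyond this one-off case, prefer Jsoup's `doc.title()` as shown above.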
Before running this code, make sure your project includes the HttpClient and Jsoup dependencies. With Maven, add:
<dependencies>
    <!-- Jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
</dependencies>
This example is only a starting point. A real crawler usually has to deal with more complex concerns: multi-threaded downloading, JavaScript-rendered content, cookies and sessions, redirects, dynamically loaded pages, and so on.
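To illustrate the multi-threading point, here is a sketch of a concurrent crawl frontier: a thread pool of workers plus a shared visited set so each URL is fetched at most once. The fetch step is abstracted as a `Function<String, List<String>>` (URL in, outgoing links out) so the skeleton is independent of HttpClient/Jsoup; the class name `CrawlFrontier` and the fake link graph in `main` are invented for this example:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class CrawlFrontier {
    // Concurrent visited set: add() returns false if the URL was already claimed,
    // which deduplicates work across threads.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger();
    private final ExecutorService pool;
    private final Function<String, List<String>> fetchLinks;

    public CrawlFrontier(int threads, Function<String, List<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetchLinks = fetchLinks;
    }

    public void crawl(String url) {
        if (!visited.add(url)) return; // another thread already took this URL
        pending.incrementAndGet();
        pool.submit(() -> {
            try {
                // In a real crawler this is where you would fetch the page
                // and extract links, e.g. with HttpClient + Jsoup.
                for (String next : fetchLinks.apply(url)) {
                    crawl(next);
                }
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    // Blocks until no tasks are in flight, then returns everything visited.
    public Set<String> awaitAndShutdown() {
        try {
            while (pending.get() > 0) Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical link graph standing in for real HTTP fetches
        Map<String, List<String>> graph = Map.of(
                "http://a", List.of("http://b", "http://c"),
                "http://b", List.of("http://c"),
                "http://c", List.of("http://a"));
        CrawlFrontier frontier = new CrawlFrontier(4,
                url -> graph.getOrDefault(url, List.of()));
        frontier.crawl("http://a");
        System.out.println("Visited: " + frontier.awaitAndShutdown());
    }
}
```

A production crawler would add politeness delays per host, robots.txt handling, and bounded queue depth on top of this skeleton.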