[万神小栈] Building a Java Crawler with Spring Boot and Jsoup

This article was last modified 183 days ago; some of its content may be out of date.

To build a simple crawler with Spring Boot and Jsoup, create a Spring Boot application and use Jsoup to fetch web pages, parse them, and extract data. A simple example follows:

First, add the Jsoup dependency to your Spring Boot project. In pom.xml, add the following:




<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

Create a service that implements the crawling logic:




import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;
 
@Service
public class CrawlerService {
 
    public void crawlAndExtractData(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("div.product-info"); // Adjust the selector to match the actual page structure
 
            for (Element element : elements) {
                Elements productName = element.select("h3.product-name");
                Elements productPrice = element.select("p.price");
 
                // Extract the data
                String name = productName.text();
                String price = productPrice.text();
 
                // Print or store the data
                System.out.println("Product Name: " + name);
                System.out.println("Product Price: " + price);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Create a controller to start the crawl:




import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
 
@RestController
public class CrawlerController {
 
    @Autowired
    private CrawlerService crawlerService;
 
    @GetMapping("/crawl")
    public void startCrawl() {
        crawlerService.crawlAndExtractData("http://example.com/products"); // Replace with the actual URL
    }
}

Finally, create the main class of the Spring Boot application:




import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
 
@SpringBootApplication
public class CrawlerApplication {
 
    public static void main(String[] args) {
        SpringApplication.run(CrawlerApplication.class, args);
    }
}

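Make sure you are permitted to access the target site and that you follow its robots.txt rules. Avoid large-scale scraping, which can disrupt the site's normal operation. This is only a simple example; a real crawler may need more involved handling, such as pagination, login authentication, User-Agent spoofing, and asynchronous crawling.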
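As a rough illustration of the robots.txt point, here is a minimal sketch of a check that could run before a page is fetched. It is not part of the original example: the class name `RobotsRules` and its methods are hypothetical, it uses only the JDK, and it assumes the robots.txt text has already been downloaded. It only understands `User-agent: *` groups and plain `Disallow:` path prefixes; `Allow` rules, wildcards, and `Crawl-delay` are ignored, so treat it as a starting point rather than a complete parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Simplified robots.txt check (hypothetical helper, not from the original article).
 * Handles only "User-agent: *" groups and "Disallow:" prefix rules.
 */
class RobotsRules {

    private final List<String> disallowedPrefixes = new ArrayList<>();

    /** Parses the text of an already downloaded robots.txt file. */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;   // true while inside a group that applies to "*"
        boolean lastWasUserAgent = false;  // consecutive User-agent lines share one group

        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                boolean matches = value.equals("*");
                inWildcardGroup = lastWasUserAgent ? (inWildcardGroup || matches) : matches;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                // An empty Disallow value means "nothing is disallowed", so skip it.
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.disallowedPrefixes.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the URL path starts with any disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

In the service above, one option would be to download the site's /robots.txt once, call `RobotsRules.parse(...)` on its text, and skip any URL whose path fails `isAllowed(...)` before calling `Jsoup.connect(url)`.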
