crawler4j, an open-source web crawler framework for Java (2)

Date: 2021-11-27 Category: Programmer's Life

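shouldVisit: This method decides which URLs get crawled. Return true to crawl a URL and false to skip it. The first parameter, referringPage, holds the page on which the URL was found; the second parameter, url, holds the candidate URL being considered.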

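visit: This method is called after the content of a URL has been downloaded successfully.
From the page passed to it you can easily get the URL, text, links, HTML, and unique ID of the downloaded page.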
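The decision that shouldVisit makes can be sketched without crawler4j on the classpath. The following is a minimal stand-alone version, assuming the crawl should stay under the seed domain used later in this article and skip static resources. The class name UrlFilter, the extension list, and the prefix constant are illustrative assumptions; in a real WebCrawler subclass the same check would sit inside shouldVisit, with the string taken from url.getURL().

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Stand-alone version of the decision a shouldVisit override typically makes. */
final class UrlFilter {
    // Skip common static resources; this extension list is an illustrative assumption.
    private static final Pattern STATIC_RESOURCES =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    // Hypothetical domain prefix; it matches the seed URLs used in this article.
    private static final String ALLOWED_PREFIX = "http://www.ics.uci.edu/";

    /** Returns true when the candidate URL should be fetched. */
    static boolean shouldVisit(String candidateUrl) {
        String href = candidateUrl.toLowerCase(Locale.ROOT);
        return !STATIC_RESOURCES.matcher(href).matches()
                && href.startsWith(ALLOWED_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.ics.uci.edu/~lopes/"));  // true
        System.out.println(shouldVisit("http://www.ics.uci.edu/logo.png")); // false
        System.out.println(shouldVisit("http://example.com/index.html"));   // false
    }
}
```

Keeping the check as a pure function of the URL string makes it easy to unit test separately from the crawl itself.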

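You should also implement a controller class that specifies the crawl seeds, the folder in which crawl data is stored, and the number of concurrent threads: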

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "E:/crawler"; // where the crawler stores its data
        int numberOfCrawlers = 7;                 // 7 crawlers, i.e. 7 threads

        CrawlConfig config = new CrawlConfig();   // crawl configuration
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Instantiate the crawl controller and its collaborators.
        PageFetcher pageFetcher = new PageFetcher(config); // downloads pages
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        // A site's robots.txt file states which pages may be crawled and which may not;
        // RobotstxtServer implements that convention for the target servers.
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Every crawl needs some seed URLs. These are fetched first, and the
        // crawler then follows the links found on those pages.
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/");

        // Start the crawl with the configuration above.
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Result (tested with Eclipse + Maven):

[Image: screenshot of the crawl output]

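Using a factory

A factory makes it convenient to integrate crawler4j into an IoC environment (such as Spring or Guice), or to pass information or collaborators to each WebCrawler instance.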

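import java.util.Map;

import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class CsiCrawlerCrawlerControllerFactory implements CrawlController.WebCrawlerFactory {

    // Collaborators handed to every crawler instance this factory creates.
    // SqlRepository is the application's own class, not part of crawler4j.
    Map<String, String> metadata;
    SqlRepository repository;

    public CsiCrawlerCrawlerControllerFactory(Map<String, String> metadata, SqlRepository repository) {
        this.metadata = metadata;
        this.repository = repository;
    }

    // Called by the controller whenever it needs a new crawler instance.
    @Override
    public WebCrawler newInstance() {
        return new MyCrawler(metadata, repository);
    }
}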
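Why a factory helps can be shown with plain Java. Starting crawlers from a class object, as the controller example does with MyCrawler.class, gives no place to supply constructor arguments, whereas a factory can capture shared collaborators and hand them to each new instance. The sketch below only mirrors the shape of that contract; WorkerFactory, TaggedWorker, and FactoryDemo are hypothetical names and none of this is crawler4j code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in for the factory contract: one fresh worker per call. */
interface WorkerFactory<T> {
    T newInstance() throws Exception;
}

/** Hypothetical worker that needs a collaborator a no-arg constructor cannot supply. */
final class TaggedWorker {
    private final Map<String, String> metadata;

    TaggedWorker(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    String describe() {
        return "worker for " + metadata.get("job");
    }
}

final class FactoryDemo {
    /** Creates one worker per requested slot, the way a controller would per thread. */
    static <T> List<T> createWorkers(WorkerFactory<T> factory, int count) throws Exception {
        List<T> workers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            workers.add(factory.newInstance());
        }
        return workers;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = Map.of("job", "uci-crawl");
        // The lambda captures the shared collaborator and passes it to every new instance.
        List<TaggedWorker> workers = createWorkers(() -> new TaggedWorker(metadata), 3);
        System.out.println(workers.size() + " workers, first: " + workers.get(0).describe());
    }
}
```

Each call to newInstance returns a distinct object, so per-thread state stays separate while the collaborators are shared.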

To use the factory, call the appropriate method on CrawlController (in Spring or Guice you will probably want startNonBlocking):

CsiCrawlerCrawlerControllerFactory factory = new CsiCrawlerCrawlerControllerFactory(metadata, repository);
controller.startNonBlocking(factory, numberOfCrawlers);

More examples

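Basic crawler: the full source code of the example above, with more details.

Image crawler: a simple image crawler that downloads image content from the crawled site and stores it in a folder. This example shows how to fetch binary content with crawler4j.

Collecting data from threads: this example shows how the controller can collect data and statistics from the crawling threads.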
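The idea behind the last example, each thread keeping its own statistics and the controller merging them at the end, can be sketched with java.util.concurrent alone. This is an illustration under stated assumptions, not the linked example: CrawlStats and StatsCollector are hypothetical names, and the "pages" are plain strings standing in for downloaded content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Per-thread statistics; immutable so it can be handed across threads safely. */
final class CrawlStats {
    final int pagesProcessed;
    final long charsSeen;

    CrawlStats(int pagesProcessed, long charsSeen) {
        this.pagesProcessed = pagesProcessed;
        this.charsSeen = charsSeen;
    }
}

final class StatsCollector {
    /** Runs one worker per page list and sums the statistics each worker reports. */
    static CrawlStats collect(List<List<String>> pagesPerWorker) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, pagesPerWorker.size()));
        try {
            List<Future<CrawlStats>> futures = new ArrayList<>();
            for (List<String> pages : pagesPerWorker) {
                // Each worker only touches its own local counters.
                Callable<CrawlStats> worker = () -> {
                    long chars = 0;
                    for (String page : pages) {
                        chars += page.length();
                    }
                    return new CrawlStats(pages.size(), chars);
                };
                futures.add(pool.submit(worker));
            }
            // The controller side: wait for every worker, then aggregate.
            int totalPages = 0;
            long totalChars = 0;
            for (Future<CrawlStats> future : futures) {
                CrawlStats stats = future.get();
                totalPages += stats.pagesProcessed;
                totalChars += stats.charsSeen;
            }
            return new CrawlStats(totalPages, totalChars);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every worker writes only to its own counters and the merge happens after the workers finish, no locking is needed on the hot path.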

When reposting, please credit the source: https://www.heiqu.com/zwggpg.html
