Integrating Nutch 1.7 with Solr 4.6 on Ubuntu 13.10

Date: 2020-07-21  Category: Programmer's Life

1. System preparation
Install Ubuntu 13.10 and configure the package sources, then run sudo apt-get update and sudo apt-get upgrade.

2. Preparing the required software
(1) Install ant
Run sudo apt-get install ant1.7, then check the installation with ant -version, which should print:

Apache Ant version 1.7.1 compiled on September 3 2011

This indicates that the installation succeeded.

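(2) Install and configure the JDK
Download the JDK from the official site and extract it to /opt/jdk.

Environment variables: run sudo gedit /etc/profile and append the following at the end of the file: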

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Save and exit, then run source /etc/profile to apply the configuration.

Check: both java -version and java should print output (omitted here).
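
Because a misspelled variable name in /etc/profile fails silently, it can help to check the JDK location explicitly. The following is a minimal sketch; check_java_home is a hypothetical helper name, not something shipped with the JDK or Ubuntu.

```shell
# Sketch: sanity-check a JDK location before relying on it.
# check_java_home is a hypothetical helper introduced for illustration.
check_java_home() {
    jdk_dir="$1"
    if [ -z "$jdk_dir" ]; then
        echo "JAVA_HOME is empty" >&2
        return 1
    fi
    if [ ! -x "$jdk_dir/bin/java" ]; then
        echo "no executable found at $jdk_dir/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $jdk_dir"
}

# Usage (after `source /etc/profile`):
#   check_java_home "$JAVA_HOME"
```

If the function reports an empty value, the export line in /etc/profile is the first thing to recheck.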

(3) Nutch
Download Nutch 1.7 and extract it to /opt/nutch.

cd /opt/nutch

bin/nutch
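The usage help should now appear, which means the installation succeeded. Next comes the configuration.

step1: Edit conf/nutch-site.xml and set the agent name sent in HTTP requests: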
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>http.agent.name</name>
<value>Friendly Crawler</value>
</property>
</configuration>

step2: Create the seed directory
mkdir -p urls

step3: Write the seed URLs, one per line, to urls/seed.txt: sudo gedit urls/seed.txt
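
The seed file can also be created without an editor. This is only a sketch: write_seed is a hypothetical helper, and the URL in the usage comment is an assumed example based on the site named in the URL filter.

```shell
# Sketch: create the seed list non-interactively.
# write_seed is a hypothetical helper; it overwrites an existing seed.txt.
write_seed() {
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir"
    # One URL per line.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Usage, run from /opt/nutch (the URL is an assumed example):
#   write_seed urls "http://36kr.com/"
```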

step4: Configure conf/regex-urlfilter.txt
# accept anything else
# +.

# added by yoyo
+36kr.com
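
The rule +36kr.com is an unanchored regular expression in which the dot matches any character, so it accepts more URLs than the domain itself. The sketch below compares it with a stricter anchored variant. The STRICT pattern is an assumption about what is wanted, and grep -E only stands in for Nutch's Java regex engine, which should behave the same for patterns this simple.

```shell
# Sketch: compare the original rule with a stricter, anchored variant.
# grep -E approximates Nutch's Java regex matching for these patterns.
LOOSE='36kr.com'
STRICT='^https?://([a-z0-9-]+\.)*36kr\.com/'

# matches PATTERN URL: prints "accept" or "reject"
matches() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo accept
    else
        echo reject
    fi
}

# Usage:
#   matches "$LOOSE"  "http://example.org/?ref=36kr.com"   # accept
#   matches "$STRICT" "http://example.org/?ref=36kr.com"   # reject
#   matches "$STRICT" "http://www.36kr.com/p/1.html"       # accept
```

In regex-urlfilter.txt the stricter rule would be written with the leading plus sign, that is +^https?://([a-z0-9-]+\.)*36kr\.com/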

step5: Edit conf/nutch-site.xml again and add a parser.skip.truncated property:
<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>

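This is needed because packet captures with tcpdump or Wireshark show that the site returns its page content in truncated segments. By default Nutch does not parse such content, so parsing has to be enabled by setting the property to false.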
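
Putting step1 and step5 together, the whole file can be generated in one go. This is a sketch under assumptions: write_nutch_site is a hypothetical helper, it overwrites the target file, and it does not escape XML special characters in the agent name.

```shell
# Sketch: write a complete nutch-site.xml containing both properties.
# write_nutch_site is a hypothetical helper; it overwrites the target file.
write_nutch_site() {
    target="$1"
    agent="$2"
    cat > "$target" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
EOF
}

# Usage, run from /opt/nutch:
#   write_nutch_site conf/nutch-site.xml "Friendly Crawler"
```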

step6: Test crawl

bin/nutch crawl urls -dir crawl

(4) Install Solr
Download Solr 4.6 and extract it to /opt/solr.

cd /opt/solr/example

java -jar start.jar

If the page at :8983/solr/ opens normally in a browser, Solr is running.
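
The same check can be made from a terminal. The sketch below assumes curl is installed and that Solr runs on the same machine, so the host in the usage comment is an assumption; solr_is_up is a hypothetical helper.

```shell
# Sketch: check from the command line whether Solr answers.
# solr_is_up URL: exit status 0 if the URL returns an HTTP success code.
solr_is_up() {
    curl --silent --fail --output /dev/null --max-time 5 "$1"
}

# Usage (localhost is an assumption, Solr started on this machine):
#   solr_is_up "http://localhost:8983/solr/" && echo "Solr is reachable"
```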

3. Integrating Nutch with Solr
(1) Environment variables:
Run sudo gedit /etc/profile and add:

export NUTCH_RUNTIME_HOME=/opt/nutch

export APACHE_SOLR_HOME=/opt/solr

(2) Integration
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
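
In the Solr 4.x example layout, the default core reads its schema from example/solr/collection1/conf, which is also the directory edited further down in this article. As a sketch under that assumption, the copy can be done with a backup of the existing schema; install_schema is a hypothetical helper.

```shell
# Sketch: copy a schema into a core's conf directory, keeping a backup.
# install_schema SRC DEST_DIR: copies SRC to DEST_DIR/schema.xml.
install_schema() {
    src="$1"
    dest_dir="$2"
    if [ ! -f "$src" ]; then
        echo "schema not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir"
    # Keep the previous schema so the change can be undone.
    if [ -f "$dest_dir/schema.xml" ]; then
        cp "$dest_dir/schema.xml" "$dest_dir/schema.xml.bak"
    fi
    cp "$src" "$dest_dir/schema.xml"
}

# Usage (the collection1 path is an assumption about the Solr layout):
#   install_schema "$NUTCH_RUNTIME_HOME/conf/schema.xml" \
#       "$APACHE_SOLR_HOME/example/solr/collection1/conf"
```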

Restart Solr:

java -jar start.jar

Build the index:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr:8983/solr/

This fails with:

Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication

Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
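
"Job failed!" is only the generic wrapper error. When Nutch runs locally, the underlying cause is normally written to logs/hadoop.log in the Nutch directory. A small sketch for pulling out the relevant lines follows; show_index_errors is a hypothetical helper, and the log path assumes the default logging setup.

```shell
# Sketch: show the most recent error lines from a Nutch log file.
# show_index_errors LOGFILE [COUNT]: prints the last COUNT (default 20)
# lines that mention ERROR or Exception.
show_index_errors() {
    log_file="$1"
    count="${2:-20}"
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    grep -E 'ERROR|Exception' "$log_file" | tail -n "$count"
}

# Usage, run from /opt/nutch (default log location assumed):
#   show_index_errors logs/hadoop.log
```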

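The fix is to add the missing fields to the Solr schema. Edit ~/solr-4.4.0/example/solr/collection1/conf/schema.xml and add the following fields inside <fields>…</fields>. Each <field> also needs a name attribute, which the listing does not show: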
<fields>
<field type="string" stored="false" indexed="true"/>
<field type="string" stored="true" indexed="false"/>
<field type="string" stored="true" indexed="false"/>
<field type="float" stored="true" indexed="false"/>
<field type="date" stored="true" indexed="false"/>
</fields>

(3) Verification
rm crawl/ -Rf

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr:8983/solr/
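
The crawl options take the Solr address as a separate argument to -solr. The sketch below builds the full command line for review before running it. It assumes Solr is running on the same machine, so the localhost URL is an assumption, and crawl_command is a hypothetical helper.

```shell
# Sketch: assemble the crawl command with an explicit Solr URL.
# Assumption: Solr is local; override SOLR_URL if it runs elsewhere.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# crawl_command SEED_DIR CRAWL_DIR DEPTH TOPN: prints the command line.
crawl_command() {
    printf 'bin/nutch crawl %s -dir %s -depth %s -topN %s -solr %s\n' \
        "$1" "$2" "$3" "$4" "$SOLR_URL"
}

# Usage, run from /opt/nutch:
#   crawl_command urls crawl 2 5
```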

…………

