Hadoop vs. Spark Performance Comparison (3)

import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    // Spark 0.x constructor: (master URL, job name, Spark home, jars to ship)
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    // Split each line into words, emit (word, 1), and sum the counts per word
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}

 

Package the program as mySpark.jar and upload it to /opt/spark/newProgram on the Master.

Run the program:

root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar

 

Mesos automatically copies the jar to the executor nodes and runs it. Here args(0) is the Mesos master URL (master@master:5050) and args(1) is the jar that gets shipped to the workers.

Memory consumption: 10 GB for the input file + 10 GB from flatMap + 15 GB of intermediate (word, 1) pairs from map. Where some of the remaining memory went could not be determined.
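As a back-of-envelope check on these figures, the gap between the 10 GB of raw words and the 15 GB of (word, 1) records can be explained by per-record tuple overhead. The sketch below is illustrative only; the 1.5x bloat factor is an assumption made here, not a measured value.

object MemoryEstimate {
  def main(args: Array[String]) {
    // Figures reported above; the 1.5x factor is an assumption used to
    // illustrate why (word, 1) pairs outgrow the raw words: each Scala
    // Tuple2 adds an object header plus a boxed Int per record.
    val inputGB   = 10.0                // raw text read from HDFS
    val wordsGB   = 10.0                // flatMap output: the words themselves
    val pairBloat = 1.5                 // assumed per-record tuple overhead
    val pairsGB   = wordsGB * pairBloat // ~15 GB of (word, 1) records
    println("Accounted-for memory: " + (inputGB + wordsGB + pairsGB) + " GB")
  }
}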

Elapsed time: 50 sec (output not sorted).

Hadoop WordCount elapsed time: 120 to 140 sec, also with unsorted results, so Spark is roughly 2.4 to 2.8 times faster on this job.
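For reference, sorting the counts would add a shuffle on top of the numbers above. A minimal sketch of a sorted variant, assuming a Spark release whose pair-RDD API provides sortByKey (the output path here is hypothetical):

// Re-key by count and sort descending before saving (sketch only)
val sorted = counts.map { case (word, n) => (n, word) }
                   .sortByKey(false)
sorted.saveAsTextFile("hdfs://master:9000/user/Output/WikiResultSorted")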

Single node:

[Figure: single-node resource usage during Spark WordCount]

Hadoop KMeans Test

Run the KMeans job that ships with Mahout (-t1 and -t2 are the Canopy distance thresholds, -cd the convergence delta, -k the number of clusters, and -x the maximum number of iterations):

root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6

 

Resource consumption on one of the slaves while the step "Canopy Driver running buildClusters over input: output/data" (320 maps, 1 reduce) was running:

[Figures: slave resource consumption during the buildClusters phase]

Completed Jobs:

Jobid | Name | Map Total | Reduce Total | Time
job_201206050916_0029 | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt | 160 | 0 | 1 min 2 s
job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 min 6 s
job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 min 7 s
job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 min 7 s
job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 min 6 s
job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 min 6 s
job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 min 5 s
job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 s
job_201206050916_0037 | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt | 320 | 0 | 1 min 31 s
job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 min 46 s
job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 min 46 s
job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 min 46 s
job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 min 47 s
job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 min 34 s

Resource consumption across repeated KMeans runs on the 10 GB and 20 GB datasets:

[Figures: resource consumption during repeated 10 GB and 20 GB KMeans runs]

Hadoop WordCount Test

[Figures: resource consumption during the Hadoop WordCount test]
