import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    // args(0) is the Mesos master URL; args(1) is this application's jar,
    // which Spark ships to the executor nodes.
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    // Split each line into words, emit (word, 1) pairs, and sum the counts per word.
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}
Package the program as mySpark.jar and upload it to /opt/spark/newProgram on the Master.
Run the program:
root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar
Mesos automatically copies the jar to each executor node and then runs the job.
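For a quick sanity check without the Mesos cluster, the same logic can be run in Spark's local mode. A minimal sketch against the same old Spark API; the local input and output paths here are hypothetical placeholders:

import spark.SparkContext
import SparkContext._

object WordCountLocal {
  def main(args: Array[String]) {
    // "local[4]" runs Spark in-process with 4 worker threads, so no jar needs to be shipped.
    val sc = new SparkContext("local[4]", "wordcount-local")
    val counts = sc.textFile("/tmp/sample.txt") // hypothetical local input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("/tmp/wordcount-out") // hypothetical output directory
  }
}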
Memory consumption: roughly 10 GB for the input file + 10 GB for the flatMap output + 15 GB for the map-side (word, 1) intermediate results, about 35 GB in total; where the remaining memory was allocated is unclear.
Elapsed time: 50 sec (output not sorted).
Hadoop WordCount elapsed time: 120 to 140 sec; its output is likewise unsorted.
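If sorted output were wanted for the comparison, the counts could be ordered by frequency before saving. A minimal sketch, assuming a Spark version whose pair RDDs provide sortByKey (the output path is a hypothetical example):

// Swap each pair to (count, word) so sortByKey orders by frequency, descending.
val sorted = counts.map { case (word, cnt) => (cnt, word) }.sortByKey(false)
sorted.saveAsTextFile("hdfs://master:9000/user/Output/WikiResultSorted") // hypothetical path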
On a single node:
Hadoop KMeans test: run the KMeans job that ships with Mahout (-k 8 clusters, at most -x 6 iterations, convergence threshold -cd 0.8; -t1 and -t2 are the canopy distance thresholds):
root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
Resource consumption on one slave while the job "Canopy Driver running buildClusters over input: output/data" (320 maps, 1 reduce) was running:
Completed Jobs:

Jobid                  | Name                                                                                   | Map Total | Reduce Total | Time
job_201206050916_0029  | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt                 | 160       | 0            | 1m 02s
job_201206050916_0030  | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160       | 1            | 1m 06s
job_201206050916_0031  | KMeans Driver running runIteration over clustersIn: output/clusters-1                 | 160       | 1            | 1m 07s
job_201206050916_0032  | KMeans Driver running runIteration over clustersIn: output/clusters-2                 | 160       | 1            | 1m 07s
job_201206050916_0033  | KMeans Driver running runIteration over clustersIn: output/clusters-3                 | 160       | 1            | 1m 06s
job_201206050916_0034  | KMeans Driver running runIteration over clustersIn: output/clusters-4                 | 160       | 1            | 1m 06s
job_201206050916_0035  | KMeans Driver running runIteration over clustersIn: output/clusters-5                 | 160       | 1            | 1m 05s
job_201206050916_0036  | KMeans Driver running clusterData over input: output/data                             | 160       | 0            | 55s
job_201206050916_0037  | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt                 | 320       | 0            | 1m 31s
job_201206050916_0038  | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320       | 36           | 1m 46s
job_201206050916_0039  | KMeans Driver running runIteration over clustersIn: output/clusters-1                 | 320       | 36           | 1m 46s
job_201206050916_0040  | KMeans Driver running runIteration over clustersIn: output/clusters-2                 | 320       | 36           | 1m 46s
job_201206050916_0041  | KMeans Driver running runIteration over clustersIn: output/clusters-3                 | 320       | 36           | 1m 47s
(jobid not recorded)   | KMeans Driver running clusterData over input: output/data                             | 320       | 0            | 1m 34s
From the table, each runIteration job takes roughly 66 s on the 10 GB input and roughly 106 s on the 20 GB input, so doubling the input raises the per-iteration time by about 1.6x. Resource consumption across the repeated KMeans runs on 10 GB and 20 GB:
Hadoop WordCount test