A Solution for Filtering Big Data with MapReduce (3)

Date: 2020-09-01 Category: 程序人生 (Programmer's Life)

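Only after running the job did I discover that reduce produced no output. Some googling showed why: the values iterator in reduce cannot be iterated twice. The reduce phase does not buffer all of the map output in memory, which is obvious once you think about it. If everything were cached in memory, a large data set would easily cause an out-of-memory error.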

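public boolean nextKeyValue() throws IOException, InterruptedException
{
    // No more records: clear key and value and signal the end.
    if (!hasMore)
    {
        key = null;
        value = null;
        return false;
    }
    // This record is the first value of a key if the previous record had a different key.
    firstValue = !nextKeyIsSame;
    // Deserialize the current key from the raw input bytes.
    DataInputBuffer next = input.getKey();
    currentRawKey.set(next.getData(), next.getPosition(),
            next.getLength() - next.getPosition());
    buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
    // The existing key object is passed in, so the deserializer may reuse it.
    key = keyDeserializer.deserialize(key);
    // Deserialize the current value the same way.
    next = input.getValue();
    buffer.reset(next.getData(), next.getPosition(), next.getLength());
    value = valueDeserializer.deserialize(value);
    // Advance the underlying input; there is no way to step back afterwards.
    hasMore = input.next();
    if (hasMore)
    {
        // Peek at the next raw key to see whether it belongs to the same group.
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                currentRawKey.getLength(), next.getData(),
                next.getPosition(),
                next.getLength() - next.getPosition()) == 0;
    }
    else
    {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1L);
    return true;
}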
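The single-pass behaviour can be reproduced without Hadoop. The sketch below is only a simulation under stated assumptions: `StreamValues` and `Holder` are hypothetical stand-ins for the reducer's values and for a reused value object, not Hadoop classes. It shows that a second loop over a forward-only source sees nothing, and that copying the values during the first pass is one way to read them again.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class SinglePassValues {

    /** Hypothetical mutable holder, standing in for a reused value object. */
    static final class Holder {
        long value;
    }

    /** Hypothetical forward-only source: every iterator() call shares one read position. */
    static final class StreamValues implements Iterable<Holder> {
        private final long[] stream;
        private int position = 0;                    // never rewound
        private final Holder reused = new Holder();  // same object returned every time

        StreamValues(long... stream) {
            this.stream = stream;
        }

        @Override
        public Iterator<Holder> iterator() {
            return new Iterator<Holder>() {
                @Override
                public boolean hasNext() {
                    return position < stream.length;
                }

                @Override
                public Holder next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    reused.value = stream[position++];
                    return reused;
                }
            };
        }
    }

    /** Counts how many values one full loop over the source yields. */
    static int countPass(Iterable<Holder> values) {
        int n = 0;
        for (Holder ignored : values) {
            n++;
        }
        return n;
    }

    /** Copies the payload of each value, so the copy survives object reuse. */
    static List<Long> copyValues(Iterable<Holder> values) {
        List<Long> copy = new ArrayList<>();
        for (Holder h : values) {
            copy.add(h.value);
        }
        return copy;
    }

    public static void main(String[] args) {
        StreamValues values = new StreamValues(3L, 1L, 2L);
        System.out.println("first pass:  " + countPass(values));
        System.out.println("second pass: " + countPass(values));

        List<Long> cached = copyValues(new StreamValues(3L, 1L, 2L));
        System.out.println("cached copy, readable any number of times: " + cached);
    }
}
```

Copying the payload rather than the holder matters here, because the same holder object is handed back on every call. The copy also brings back the memory cost that streaming avoids, so it only suits keys with a small number of values.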

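Thinking it over again: to cut down the input to the reduce phase, the map phase has to emit less. So in the map phase I split the uids into odd and even, fed each half to reduce separately, and ran the job that way. The takeaway is to keep the reduce input as small as possible, and splitting the map output is one way to do it.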
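The odd/even split can be sketched in plain Java. This is a minimal illustration, not the original job: the post does not show how the parity was chosen, so the sketch assumes each run is given a remainder (0 for even, 1 for odd) and that the map step simply skips uids belonging to the other run. `belongsToRun` and `mapOutput` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UidParitySplit {

    /** True when the uid belongs to the run handling this remainder (0 = even, 1 = odd). */
    static boolean belongsToRun(long uid, int remainder) {
        // floorMod keeps the result in {0, 1} even for negative uids.
        return Math.floorMod(uid, 2L) == remainder;
    }

    /** Hypothetical stand-in for the map phase of one run: emit only that run's uids. */
    static List<Long> mapOutput(List<Long> uids, int remainder) {
        List<Long> out = new ArrayList<>();
        for (long uid : uids) {
            if (belongsToRun(uid, remainder)) {
                out.add(uid);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> uids = Arrays.asList(10L, 11L, 12L, 13L, 15L);
        System.out.println("even run: " + mapOutput(uids, 0));
        System.out.println("odd run:  " + mapOutput(uids, 1));
    }
}
```

Each run then hands reduce only about half of the records, at the cost of reading the input twice. The same idea extends to more than two slices by taking the remainder modulo a larger number.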

Please credit the source when reposting: http://www.heiqu.com/fb2cc95b34ac3e360cfa26bf6ec22eaf.html
