From the output above we can see that 114 started a gRPC server but never shut it down. A solution has already been posted on Stack Overflow (Shut down server in TensorFlow); for more on gRPC, see [^using-grpc-in-python].
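The gist of that Stack Overflow workaround is a shared "done queue": instead of blocking forever in `server.join()`, the ps blocks on dequeuing one token per worker, and each worker enqueues a token when its training finishes, after which the ps process simply exits. Below is a minimal sketch assuming the TF 1.x API used in this article; the helper names and `done_queue0` are illustrative, not taken from the script:

```python
import tensorflow as tf

def create_done_queue(num_workers):
    # One FIFO queue pinned to the ps; shared_name lets every process
    # in the cluster refer to the same underlying queue.
    with tf.device('/job:ps/task:0'):
        return tf.FIFOQueue(num_workers, tf.int32, shared_name='done_queue0')

# ps side: replace server.join() with this, then let main() return.
def wait_for_workers(server, num_workers):
    queue = create_done_queue(num_workers)
    with tf.Session(server.target) as sess:
        for _ in range(num_workers):
            sess.run(queue.dequeue())  # blocks until a worker signals

# worker side: call once at the end of training.
def signal_done(sess, num_workers):
    sess.run(create_done_queue(num_workers).enqueue(1))
```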
Notes:
A ps and a worker can coexist on the same host; this is easy to understand, just as a Hadoop master and slave can coexist. To avoid port conflicts, the ps port and the worker port on the same host must differ (see the sketch after these notes).
There can be more than one ps; write the list the same way as for the workers.
To emphasize again: because `with tf.device(tf.train.replica_device_setter(cluster=...))` is used, starting the workers in an order different from the one written in `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')` will produce an OS Error.
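To make these notes concrete, here is a minimal sketch of the cluster layout once 114 hosts both jobs. The addresses are the ones used throughout this article; treat the exact `ps_hosts` value as an inference from the ps log below (the ps on 114 listens on 22221):

```python
import tensorflow as tf

# Same host (114), different ports: the ps listens on 22221 and
# worker 2 on 22222, so the two gRPC servers do not collide.
cluster = tf.train.ClusterSpec({
    'ps':     ['172.16.60.114:22221'],
    'worker': ['172.16.60.107:22221',    # task_index 0
               '172.16.50.111:22221',    # task_index 1
               '172.16.60.114:22222'],   # task_index 2, shares 114 with the ps
})

# task_index is nothing more than the position of a worker's address in
# the 'worker' list above, which is why the indices and start order must
# match the order written in worker_hosts.
# with tf.device(tf.train.replica_device_setter(
#         worker_device='/job:worker/task:%d' % task_index,
#         cluster=cluster)):
#     ...build the model...
```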
To run a worker process on the ps host as well:
Take line 20: `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')`
and append 114's IP and port, so that it becomes: `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222', 'Comma-separated list of hostname:port pairs')`
Then re-run everything, taking care with the start order; a sketch of the resulting cluster definition follows.
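For reference, this is roughly how the modified flag feeds into the cluster definition (flag names follow the article's script; the `ps_hosts` value is inferred from the ps log below):

```python
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string('ps_hosts', '172.16.60.114:22221',
                    'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts',
                    '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222',
                    'Comma-separated list of hostname:port pairs')
FLAGS = flags.FLAGS

# The comma-separated strings become the cluster definition, so adding
# 172.16.60.114:22222 to worker_hosts is all it takes to create worker 2.
cluster = tf.train.ClusterSpec({
    'ps':     FLAGS.ps_hosts.split(','),
    'worker': FLAGS.worker_hosts.split(','),
})

# Launch order: start the ps on 114 first, then workers 0, 1, 2 in the
# order they appear in worker_hosts, e.g.
#   python TestDistributed.py --job_name=ps --task_index=0
#   CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0
#   CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1
#   CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2
```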
Output:
############## 114 ps ##############
… Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-12 16:38:41.432822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2018-09-12 16:38:41.432830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y
2018-09-12 16:38:41.432835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N
2018-09-12 16:38:41.433475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)
2018-09-12 16:38:41.949217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)
2018-09-12 16:38:42.086615: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}
2018-09-12 16:38:42.086674: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221, 2 -> 172.16.60.114:22222}
2018-09-12 16:38:42.094741: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221
############## 107 worker 0 ##############
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0
1536741807.352432: Worker 0: traing step 3305 dome (global step:9997)
1536741807.388893: Worker 0: traing step 3306 dome (global step:10000)
Training ends @ 1536741807.388980
Training elapsed time:80.524482 s
After 10000 training step(s), validation cross entropy = 1127
############## 111 worker 1 ##############
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1
1536741807.370341: Worker 1: traing step 3222 dome (global step:9998)
1536741807.398533: Worker 1: traing step 3223 dome (global step:10002)
Training ends @ 1536741807.398634
Training elapsed time:79.786702 s
After 10000 training step(s), validation cross entropy = 1127
############## 114 worker 2 ##############
#CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2
1536741807.346162: Worker 2: traing step 3474 dome (global step:9996)
1536741807.359073: Worker 2: traing step 3475 dome (global step:10000)
Training ends @ 1536741807.359174
Training elapsed time:79.858818 s
After 10000 training step(s), validation cross entropy = 1127
Result comparison
A preliminary comparison can be made from the logs:

| Workers | Average elapsed time | Loss (validation cross entropy) |
|---------|----------------------|---------------------------------|
| 2       | 69.975 s             | 1141.94                         |
| 3       | 80.806 s             | 1127                            |

That is, adding the third worker (on the ps host) lowered the cross entropy slightly but increased the wall-clock time.
References
[^using-grpc-in-python]: using-grpc-in-python
[^distributed-tensorflow]: Distributed TensorFlow