From the output above we can see that 114 started a gRPC server but never shut it down. A solution has already been posted on Stack Overflow (Shut down server in TensorFlow); for more on gRPC, see [^using-grpc-in-python].
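The gist of that Stack Overflow workaround is a shared "done queue": instead of blocking forever in `server.join()`, the ps blocks on dequeuing one token per worker, and each worker enqueues a token when its training finishes, after which the ps process simply exits. Below is a minimal sketch assuming the TF 1.x API used in this article; the helper names and `done_queue0` are illustrative, not taken from the script:

```python
import tensorflow as tf

def create_done_queue(num_workers):
    # One FIFO queue pinned to the ps; shared_name lets every process
    # in the cluster refer to the same underlying queue.
    with tf.device('/job:ps/task:0'):
        return tf.FIFOQueue(num_workers, tf.int32, shared_name='done_queue0')

# ps side: replace server.join() with this, then let main() return.
def wait_for_workers(server, num_workers):
    queue = create_done_queue(num_workers)
    with tf.Session(server.target) as sess:
        for _ in range(num_workers):
            sess.run(queue.dequeue())  # blocks until a worker signals

# worker side: call once at the end of training.
def signal_done(sess, num_workers):
    sess.run(create_done_queue(num_workers).enqueue(1))
```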
Notes:
A ps and a worker can coexist on the same host; this is easy to understand, just as a Hadoop master and slave can coexist. To avoid port conflicts, the ps port and the worker port on the same host must differ (see the sketch after these notes).
There can be more than one ps; write the list the same way as for the workers.
To emphasize again: because `with tf.device(tf.train.replica_device_setter(cluster=...))` is used, starting the workers in an order different from the one written in `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')` will produce an OS Error.
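To make these notes concrete, here is a minimal sketch of the cluster layout once 114 hosts both jobs. The addresses are the ones used throughout this article; treat the exact `ps_hosts` value as an inference from the ps log below (the ps on 114 listens on 22221):

```python
import tensorflow as tf

# Same host (114), different ports: the ps listens on 22221 and
# worker 2 on 22222, so the two gRPC servers do not collide.
cluster = tf.train.ClusterSpec({
    'ps':     ['172.16.60.114:22221'],
    'worker': ['172.16.60.107:22221',    # task_index 0
               '172.16.50.111:22221',    # task_index 1
               '172.16.60.114:22222'],   # task_index 2, shares 114 with the ps
})

# task_index is nothing more than the position of a worker's address in
# the 'worker' list above, which is why the indices and start order must
# match the order written in worker_hosts.
# with tf.device(tf.train.replica_device_setter(
#         worker_device='/job:worker/task:%d' % task_index,
#         cluster=cluster)):
#     ...build the model...
```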
To run a worker process on the ps host as well:
Take line 20: `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')`
and append 114's IP and port, so that it becomes: `flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222', 'Comma-separated list of hostname:port pairs')`
Then re-run everything, taking care with the start order; a sketch of the resulting cluster definition follows.
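For reference, this is roughly how the modified flag feeds into the cluster definition (flag names follow the article's script; the `ps_hosts` value is inferred from the ps log below):

```python
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string('ps_hosts', '172.16.60.114:22221',
                    'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts',
                    '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222',
                    'Comma-separated list of hostname:port pairs')
FLAGS = flags.FLAGS

# The comma-separated strings become the cluster definition, so adding
# 172.16.60.114:22222 to worker_hosts is all it takes to create worker 2.
cluster = tf.train.ClusterSpec({
    'ps':     FLAGS.ps_hosts.split(','),
    'worker': FLAGS.worker_hosts.split(','),
})

# Launch order: start the ps on 114 first, then workers 0, 1, 2 in the
# order they appear in worker_hosts, e.g.
#   python TestDistributed.py --job_name=ps --task_index=0
#   CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0
#   CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1
#   CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2
```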
Output:
############## 114 ps ##############
… Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-12 16:38:41.432822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2018-09-12 16:38:41.432830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y
2018-09-12 16:38:41.432835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N
2018-09-12 16:38:41.433475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)
2018-09-12 16:38:41.949217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)
2018-09-12 16:38:42.086615: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}
2018-09-12 16:38:42.086674: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221, 2 -> 172.16.60.114:22222}
2018-09-12 16:38:42.094741: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221
############## 107 worker 0 ##############
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0
1536741807.352432: Worker 0: traing step 3305 dome (global step:9997)
1536741807.388893: Worker 0: traing step 3306 dome (global step:10000)
Training ends @ 1536741807.388980
Training elapsed time:80.524482 s
After 10000 training step(s), validation cross entropy = 1127
############## 111 worker 1 ##############
#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1
1536741807.370341: Worker 1: traing step 3222 dome (global step:9998)
1536741807.398533: Worker 1: traing step 3223 dome (global step:10002)
Training ends @ 1536741807.398634
Training elapsed time:79.786702 s
After 10000 training step(s), validation cross entropy = 1127
############## 114 worker 2 ##############
#CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2
1536741807.346162: Worker 2: traing step 3474 dome (global step:9996)
1536741807.359073: Worker 2: traing step 3475 dome (global step:10000)
Training ends @ 1536741807.359174
Training elapsed time:79.858818 s
After 10000 training step(s), validation cross entropy = 1127
Result comparison
A preliminary comparison can be made from the logs:

| Workers | Average elapsed time | Loss (validation cross entropy) |
|---------|----------------------|---------------------------------|
| 2       | 69.975 s             | 1141.94                         |
| 3       | 80.806 s             | 1127                            |

That is, adding the third worker (on the ps host) lowered the cross entropy slightly but increased the wall-clock time.
References
[^using-grpc-in-python]: using-grpc-in-python
[^distributed-tensorflow]: Distributed TensorFlow