Abstract: TensorFlow, the most popular deep learning library today, is widely used among data scientists; distributed training in particular, which can markedly speed up training, is its killer feature. But actually deploying and running large-scale distributed model training has become a new challenge.
Introduction
This series covers running Kubeflow on Alibaba Cloud Container Service. This article shows how to run distributed model training with TFJob.
TensorFlow Distributed Training and Kubernetes
TensorFlow, the most popular deep learning library today, is widely used among data scientists; distributed training in particular, which can markedly speed up training, is its killer feature. But actually deploying and running large-scale distributed model training has become a new challenge. A user of distributed TensorFlow needs to take care of three things:
1. Finding enough resources to run the training. A distributed training job usually needs a number of workers (computation servers) and ps (parameter servers), and all of these members need computing resources.
2. Installing and configuring the software and applications that support the computation.
3. Configuring the ClusterSpec required by the distributed TensorFlow design. This json-formatted ClusterSpec describes the architecture of the whole distributed training cluster. For example, with two workers and two ps it looks like the snippet below, and every member of the distributed training job needs to use it to initialize a tf.train.ClusterSpec object and establish in-cluster communication:
cluster = tf.train.ClusterSpec({"worker": [":2222", ":2222"],
                                "ps": [":2223", ":2223"]})

The first of these is exactly what Kubernetes resource scheduling excels at: both CPU and GPU scheduling are available out of the box. The second is what Docker is good at, baking fixed, repeatable setup steps into a container image. Automatically building the ClusterSpec is the problem TFJob solves, letting users construct the distributed TensorFlow cluster topology through a simple, centralized configuration.
It is fair to say that the distributed training problem that has long troubled data scientists can be solved rather well by the Kubernetes + TFJob approach.
Deploying Distributed Training with Kubernetes and TFJob
Modifying the TensorFlow distributed training code
The earlier article on trying out TFJob on Alibaba Cloud already introduced the definition of TFJob, so it will not be repeated here. What matters is that the role types in a TFJob are MASTER, WORKER, and PS.
As a concrete example, suppose the TFJob running distributed training is called distributed-mnist, with 1 MASTER, 2 WORKERS, and 2 PS. The corresponding ClusterSpec format is:
{
    "master": [
        "distributed-mnist-master-0:2222"
    ],
    "ps": [
        "distributed-mnist-ps-0:2222",
        "distributed-mnist-ps-1:2222"
    ],
    "worker": [
        "distributed-mnist-worker-0:2222",
        "distributed-mnist-worker-1:2222"
    ]
}

The job of tf_operator is to create the corresponding 5 Pods and pass the environment variable TF_CONFIG into each of them. TF_CONFIG contains three parts: the ClusterSpec of the current cluster, the role type of this node, and its id. For example, for the Pod worker0, the TF_CONFIG it receives is:
{
   "cluster": {
      "master": [
         "distributed-mnist-master-0:2222"
      ],
      "ps": [
         "distributed-mnist-ps-0:2222",
         "distributed-mnist-ps-1:2222"
      ],
      "worker": [
         "distributed-mnist-worker-0:2222",
         "distributed-mnist-worker-1:2222"
      ]
   },
   "task": {
      "type": "worker",
      "index": 0
   },
   "environment": "cloud"
}

Here tf_operator takes care of discovering and configuring the cluster topology, sparing the user the trouble. All the user needs to do is pick up this context from the environment variable TF_CONFIG in the code.
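To see the shape of what tf_operator injects, here is a pure-Python sketch that assembles a TF_CONFIG like the one above from replica counts, following the `<job>-<role>-<index>:<port>` host naming visible in the example. The helper `build_cluster` is hypothetical, for illustration only; it is not tf_operator's actual code.

```python
import json


def build_cluster(job_name, replicas, port=2222):
    """Assemble a ClusterSpec-style dict from replica counts, using the
    <job>-<role>-<index>:<port> naming convention seen above.
    Hypothetical sketch of what tf_operator computes, not its real code."""
    return {role: ["{0}-{1}-{2}:{3}".format(job_name, role, i, port)
                   for i in range(count)]
            for role, count in replicas.items()}


# The distributed-mnist example: 1 MASTER, 2 PS, 2 WORKERS.
cluster = build_cluster("distributed-mnist",
                        {"master": 1, "ps": 2, "worker": 2})
# The value tf_operator would place in TF_CONFIG for worker 0:
tf_config = json.dumps({"cluster": cluster,
                        "task": {"type": "worker", "index": 0},
                        "environment": "cloud"})
print(tf_config)
```

Serializing the dict with json.dumps mirrors how the operator hands the topology to each Pod as a single environment variable.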
This means the user needs to modify the distributed training code according to the TFJob convention:
import json
import os

import tensorflow as tf

# Read the json-formatted data from the environment variable TF_CONFIG
tf_config_json = os.environ.get("TF_CONFIG", "{}")
# Deserialize it into a python object
tf_config = json.loads(tf_config_json)
# Get the Cluster Spec
cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
# Get the role type and id; here job_name is "worker" and task_id is 0
task = tf_config.get("task", {})
job_name = task["type"]
task_id = task["index"]
# Create the TensorFlow training server object
server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)
# If job_name is ps, call server.join()
if job_name == 'ps':
    server.join()
# Check whether the current process is the master; if so, it is responsible
# for creating the session and saving summaries.
is_chief = (job_name == 'master')
# Typical distributed training examples only have the ps and worker roles,
# while TFJob adds the master role, which does not exist in the distributed
# TensorFlow programming model. Code running under TFJob needs to handle this,
# but the handling is simple: just treat master as another worker_device type.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:{0}/task:{1}".format(job_name, task_id),
        cluster=cluster_spec)):
    # ... build the model graph here ...

The full code is available in the sample code.
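The role-dispatch logic above can be exercised locally without a cluster or even TensorFlow installed. The following is a minimal pure-Python sketch with a hypothetical helper, `parse_tf_config`, that mirrors only the TF_CONFIG parsing from the code above (the server creation is omitted):

```python
import json


def parse_tf_config(environ):
    """Extract the cluster spec, role type, index, and chief flag from a
    TF_CONFIG-style environment mapping. Hypothetical helper mirroring the
    parsing in the training code above, minus the TensorFlow server setup."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster_spec = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    job_name = task.get("type")
    task_id = task.get("index")
    # TFJob's extra master role acts as the chief worker.
    is_chief = (job_name == "master")
    return cluster_spec, job_name, task_id, is_chief


# Simulate the environment tf_operator would inject for worker 0.
env = {"TF_CONFIG": json.dumps({
    "cluster": {"master": ["distributed-mnist-master-0:2222"],
                "ps": ["distributed-mnist-ps-0:2222"],
                "worker": ["distributed-mnist-worker-0:2222",
                           "distributed-mnist-worker-1:2222"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"})}
cluster, job, idx, chief = parse_tf_config(env)
print(job, idx, chief)  # worker 0 False
```

Passing a plain dict instead of os.environ makes the role logic easy to unit-test before running it inside a TFJob Pod.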
2. This example demonstrates how to use TFJob to run distributed training, save the training results and logs to NAS storage, and finally read the training logs with TensorBoard.
2.1 Create a NAS data volume and set up a mount point in the same VPC as the current Kubernetes cluster. See the documentation for details.
2.2 Create the /training data folder on the NAS and download the data needed for mnist training.
mkdir -p /nfs
mount -t nfs -o vers=4.0 xxxxxxx.cn-hangzhou.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/training
umount /nfs
2.3 Create the PV for the NAS; nas-dist-pv.yaml below is an example:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-dist-nas-mnist
  labels:
    tfjob: kubeflow-dist-nas-mnist
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  flexVolume:
    driver: "alicloud/nas"
    options:
      mode: "755"
      path: /training
      server: xxxxxxx.cn-hangzhou.nas.aliyuncs.com
      vers: "4.0"
Save this template to nas-dist-pv.yaml and create the PV:
# kubectl create -f nas-dist-pv.yaml
persistentvolume "kubeflow-dist-nas-mnist" created
2.4 Create the PVC with nas-dist-pvc.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kubeflow-dist-nas-mnist
spec:
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfjob: kubeflow-dist-nas-mnist
The command:
# kubectl create -f nas-dist-pvc.yaml
persistentvolumeclaim "kubeflow-dist-nas-mnist" created
2.5 Create the TFJob:
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-simple-gpu-dist
spec:
  replicaSpecs:
    - replicas: 1 # 1 Master
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
              name: tensorflow
              env:
              - name: TEST_TMPDIR
                value: /training
              command: ["python", "/app/main.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - name: kubeflow-dist-nas-mnist
                mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
      tfReplicaType: WORKER
      template:
        spec:
          containers:
          - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
            name: tensorflow
            env:
            - name: TEST_TMPDIR
              value: /training
            command: ["python", "/app/main.py"]
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 Parameter server
      tfReplicaType: PS
      template:
        spec:
          containers:
          - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:cpu
            name: tensorflow
            command: ["python", "/app/main.py"]
            env:
            - name: TEST_TMPDIR
              value: /training
            imagePullPolicy: Always
            volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
Save this template to mnist-simple-gpu-dist.yaml and create the distributed training TFJob:
# kubectl create -f mnist-simple-gpu-dist.yaml
tfjob "mnist-simple-gpu-dist" created
Check all the running Pods:
# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID
NAME                                        READY     STATUS    RESTARTS   AGE
mnist-simple-gpu-dist-master-z5z4-0-ipy0s   1/1       Running   0          31s
mnist-simple-gpu-dist-ps-z5z4-0-3nzpa       1/1       Running   0          31s
mnist-simple-gpu-dist-worker-z5z4-0-zm0zm   1/1       Running   0          31s

Look at the master's logs; you can see that the ClusterSpec has been built successfully:
# kubectl logs -l runtime_id=$RUNTIMEID,job_type=MASTER
2018-06-10 09:31:55.342689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-06-10 09:31:55.342724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-06-10 09:31:55.805747: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2018-06-10 09:31:55.805786: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> mnist-simple-gpu-dist-ps-m5yi-0:2222}
2018-06-10 09:31:55.805794: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-m5yi-0:2222}
2018-06-10 09:31:55.807119: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
...
Accuracy at step 900: 0.9709
Accuracy at step 910: 0.971
Accuracy at step 920: 0.9735
Accuracy at step 930: 0.9716
Accuracy at step 940: 0.972
Accuracy at step 950: 0.9697
Accuracy at step 960: 0.9718
Accuracy at step 970: 0.9738
Accuracy at step 980: 0.9725
Accuracy at step 990: 0.9724
Adding run metadata for 999

2.6 Deploy TensorBoard and view the training results
To make TensorFlow programs easier to understand, debug, and optimize, you can use TensorBoard to observe the training results and understand the training framework and optimization algorithms. TensorBoard obtains runtime information by reading TensorFlow's event logs.
The earlier distributed training example already recorded event logs, saved in files named events.out.tfevents*:
# tree
.
└── tensorflow
    ├── input_data
    │   ├── t10k-images-idx3-ubyte.gz
    │   ├── t10k-labels-idx1-ubyte.gz
    │   ├── train-images-idx3-ubyte.gz
    │   └── train-labels-idx1-ubyte.gz
    └── logs
        ├── checkpoint
        ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
        ├── graph.pbtxt
        ├── model.ckpt-0.data-00000-of-00001
        ├── model.ckpt-0.index
        ├── model.ckpt-0.meta
        ├── test
        │   ├── events.out.tfevents.1528760351.mnist-simple-gpu-dist-master-fziz-0-74je9
        │   └── events.out.tfevents.1528760356.mnist-simple-gpu-dist-worker-fziz-0-9mvsd
        └── train
            ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
            └── events.out.tfevents.1528760355.mnist-simple-gpu-dist-worker-fziz-0-9mvsd

5 directories, 14 files
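As the tree shows, each event file name encodes the Unix timestamp of its creation and the Pod that wrote it. A small pure-Python sketch for attributing event files to Pods (the helper `parse_event_filename` is hypothetical, introduced here for illustration):

```python
def parse_event_filename(name):
    """Split an events.out.tfevents.<timestamp>.<pod-name> file name, as seen
    in the tree above, into its Unix timestamp and originating Pod name.
    Hypothetical helper for illustration."""
    prefix = "events.out.tfevents."
    if not name.startswith(prefix):
        raise ValueError("not an event file: " + name)
    # The timestamp runs up to the first dot after the prefix; the rest
    # is the Pod name.
    timestamp, pod = name[len(prefix):].split(".", 1)
    return int(timestamp), pod


ts, pod = parse_event_filename(
    "events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9")
print(ts, pod)
```

This makes it easy to tell which replica (master or worker) produced each log when several write into the same NAS directory.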
Deploy TensorBoard on Kubernetes, pointing it at the NAS storage used for the training:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
      - name: kubeflow-dist-nas-mnist
        persistentVolumeClaim:
          claimName: kubeflow-dist-nas-mnist
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.7.0
        imagePullPolicy: Always
        command:
        - /usr/local/bin/tensorboard
        args:
        - --logdir
        - /training/tensorflow/logs
        volumeMounts:
        - name: kubeflow-dist-nas-mnist
          mountPath: "/training"
        ports:
        - containerPort: 6006
          protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
Save this template to tensorboard.yaml and create the TensorBoard deployment:
# kubectl create -f tensorboard.yaml
deployment "tensorboard" created
Once TensorBoard is created, access it with the kubectl port-forward command:
PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006

Open http://127.0.0.1:6006 to reach TensorBoard and view the model and results of the distributed training.
Summary
Using tf-operator solves the hard parts of distributed training and simplifies the data scientist's distributed training work. Combining it with TensorBoard for inspecting training results, and NAS or OSS for storing data and models, on the one hand makes it possible to effectively reuse training data and preserve experiment results, and on the other hand lays the groundwork for publishing the model for prediction. How to chain model training, validation, and prediction into a machine learning workflow is also the core value of Kubeflow, and we will cover it in later articles.
This article is original content from the Yunqi Community and may not be reproduced without permission.