TensorFlowOnSpark Overview

Last updated: 2020-07-15 10:40:19

    TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. It supports all types of TensorFlow programs, asynchronous and synchronous training and inference, model parallelism, and parallel data processing. For more information, see the official TensorFlowOnSpark website.

    TensorFlowOnSpark Architecture Diagram


    TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Because the processes communicate directly with each other, TensorFlowOnSpark programs can scale out by adding machines. Because TensorFlowOnSpark does not involve Spark drivers in tensor communication, it can achieve scalability similar to that of standalone TensorFlow clusters.

    Installing TensorFlowOnSpark

    1. Go to the EMR purchase page and select EMR v2.3.0 or later.
    2. Select the tensorflowonspark 1.4.4 component in the Optional Component list.
    3. By default, TensorFlowOnSpark is installed in the /usr/local/service/tensorflowonspark directory.

      Note:

      TensorFlowOnSpark depends on other components, such as Hive and Spark, which are installed together with it.

    Use Cases

    The directory of the installed TensorFlowOnSpark component contains complete sample code. You can try TensorFlowOnSpark with the following steps:

    • Download the test data
      Run the following command in the /usr/local/service/tensorflowonspark directory as the hadoop user:
      sh mnist_download.sh
      To see what the script does, print its contents:
      cat mnist_download.sh
      mkdir ${HOME}/mnist
      pushd ${HOME}/mnist >/dev/null
      curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
      curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
      curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
      curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
      zip -r mnist.zip *
      popd >/dev/null
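      Because curl -O saves whatever the server returns, a failed download can leave a missing or invalid file behind. As a minimal sketch (check_mnist_files is a hypothetical helper, not part of the TensorFlowOnSpark samples), the following function uses gzip -t to confirm that the four archives exist and are valid gzip files before you upload them:
      ```shell
      # Hypothetical helper (not shipped with TensorFlowOnSpark): verify that the
      # four MNIST archives in directory $1 exist and pass a gzip integrity test.
      check_mnist_files() {
        dir="$1"
        rc=0
        for f in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz \
                 t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do
          # gzip -t exits non-zero for a missing, truncated, or non-gzip file.
          if ! gzip -t "$dir/$f" 2>/dev/null; then
            echo "missing or invalid: $f"
            rc=1
          fi
        done
        return "$rc"
      }
      ```
      For example, check_mnist_files "${HOME}/mnist" prints one line per bad file and returns a non-zero status if any file fails the check.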
    • Upload the original data and dependency packages
      hdfs dfs -mkdir -p /mnist/tools/
      hdfs dfs -put ~/mnist/mnist.zip /mnist/tools
      hdfs dfs -mkdir /tensorflow
      hdfs dfs -put TensorFlowOnSpark/tensorflow-hadoop-1.10.0.jar /tensorflow
    • Prepare the feature data
      sh prepare_mnist.sh
      Verify that the feature data has been prepared:
      hdfs dfs -ls /user/hadoop/mnist
      Found 2 items
      drwxr-xr-x - hadoop supergroup 0 2020-05-21 11:40 /user/hadoop/mnist/csv
      drwxr-xr-x - hadoop supergroup 0 2020-05-21 11:41 /user/hadoop/mnist/tfr
    • Train the model based on InputMode.SPARK
      sh mnist_train_with_spark_cpu.sh
      View the trained model:
      ```
      [hadoop@10 tensorflow-on-spark]$ hdfs dfs -ls /user/hadoop/mnist_model
      Found 10 items
      -rw-r--r-- 1 hadoop supergroup 128 2020-05-21 11:46 /user/hadoop/mnist_model/checkpoint
      -rw-r--r-- 1 hadoop supergroup 243332 2020-05-21 11:46 /user/hadoop/mnist_model/events.out.tfevents.1590032704.10.0.0.114
      -rw-r--r-- 1 hadoop supergroup 164619 2020-05-21 11:45 /user/hadoop/mnist_model/graph.pbtxt
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 11:45 /user/hadoop/mnist_model/model.ckpt-0.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 11:45 /user/hadoop/mnist_model/model.ckpt-0.index
      -rw-r--r-- 1 hadoop supergroup 64658 2020-05-21 11:45 /user/hadoop/mnist_model/model.ckpt-0.meta
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 11:46 /user/hadoop/mnist_model/model.ckpt-595.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 11:46 /user/hadoop/mnist_model/model.ckpt-595.index
      -rw-r--r-- 1 hadoop supergroup 64658 2020-05-21 11:46 /user/hadoop/mnist_model/model.ckpt-595.meta
      drwxr-xr-x - hadoop supergroup 0 2020-05-21 11:46 /user/hadoop/mnist_model/train
      ```
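      To find the most recent checkpoint in a listing like the one above, you can extract the step numbers from the file names. A minimal sketch, assuming checkpoint index files are named model.ckpt-<step>.index as shown in the listing (latest_ckpt_step is a hypothetical helper, not part of the sample code):
      ```shell
      # Hypothetical helper: read an `hdfs dfs -ls` listing on stdin and print the
      # highest step number found in file names ending in model.ckpt-<step>.index.
      latest_ckpt_step() {
        sed -n 's/.*model\.ckpt-\([0-9][0-9]*\)\.index$/\1/p' | sort -n | tail -n 1
      }
      ```
      For example, hdfs dfs -ls /user/hadoop/mnist_model | latest_ckpt_step prints 595 for the listing above.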
    • Run prediction based on InputMode.SPARK
      sh mnist_inference_with_spark_cpu.sh
      View the prediction result:
      hdfs dfs -cat /user/hadoop/predictions/part-00000 | more
      2020-05-21T11:49:56.561506 Label: 7, Prediction: 7
      2020-05-21T11:49:56.561535 Label: 2, Prediction: 2
      2020-05-21T11:49:56.561541 Label: 1, Prediction: 1
      2020-05-21T11:49:56.561545 Label: 0, Prediction: 0
      2020-05-21T11:49:56.561550 Label: 4, Prediction: 4
      2020-05-21T11:49:56.561555 Label: 1, Prediction: 1
      2020-05-21T11:49:56.561559 Label: 4, Prediction: 4
      2020-05-21T11:49:56.561564 Label: 9, Prediction: 9
      2020-05-21T11:49:56.561568 Label: 5, Prediction: 6
      2020-05-21T11:49:56.561573 Label: 9, Prediction: 9
      2020-05-21T11:49:56.561578 Label: 0, Prediction: 0
      2020-05-21T11:49:56.561582 Label: 6, Prediction: 6
      2020-05-21T11:49:56.561587 Label: 9, Prediction: 9
      2020-05-21T11:49:56.561603 Label: 0, Prediction: 0
      2020-05-21T11:49:56.561608 Label: 1, Prediction: 1
      2020-05-21T11:49:56.561612 Label: 5, Prediction: 5    
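      Reading the output line by line does not tell you how well the model performs overall. As a minimal sketch, assuming every line has the "<timestamp> Label: <n>, Prediction: <n>" form shown above (prediction_accuracy is a hypothetical helper), the following function summarizes the accuracy:
      ```shell
      # Hypothetical helper: read prediction lines on stdin and print how many
      # predictions match their labels, plus the resulting accuracy.
      prediction_accuracy() {
        awk '
          $2 == "Label:" && $4 == "Prediction:" {
            label = $3
            sub(/,$/, "", label)   # drop the trailing comma after the label
            total++
            if (label == $5) correct++
          }
          END {
            if (total > 0)
              printf "%d/%d correct (%.4f)\n", correct, total, correct / total
          }'
      }
      ```
      For example, hdfs dfs -cat /user/hadoop/predictions/part-00000 | prediction_accuracy reports the accuracy for the whole file. For the 16 lines shown above, it prints 15/16 correct (0.9375).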
    • Train the model based on InputMode.TENSORFLOW
      sh mnist_train_with_tf_cpu.sh
      View the model:
      ```
      hdfs dfs -ls mnist_model
      Found 25 items
      -rw-r--r-- 1 hadoop supergroup 265 2020-05-21 14:58 mnist_model/checkpoint
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:53 mnist_model/events.out.tfevents.1590044017.10.0.0.144
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:57 mnist_model/events.out.tfevents.1590044221.10.0.0.144
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:57 mnist_model/events.out.tfevents.1590044227.10.0.0.144
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:57 mnist_model/events.out.tfevents.1590044232.10.0.0.144
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:57 mnist_model/events.out.tfevents.1590044238.10.0.0.144
      -rw-r--r-- 1 hadoop supergroup 40 2020-05-21 14:58 mnist_model/events.out.tfevents.1590044303.10.0.0.114
      -rw-r--r-- 1 hadoop supergroup 198078 2020-05-21 14:58 mnist_model/graph.pbtxt
      drwxr-xr-x - hadoop supergroup 0 2020-05-21 14:58 mnist_model/inference
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 14:57 mnist_model/model.ckpt-238.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 14:57 mnist_model/model.ckpt-238.index
      -rw-r--r-- 1 hadoop supergroup 76255 2020-05-21 14:57 mnist_model/model.ckpt-238.meta
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 14:57 mnist_model/model.ckpt-277.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 14:57 mnist_model/model.ckpt-277.index
      -rw-r--r-- 1 hadoop supergroup 76255 2020-05-21 14:57 mnist_model/model.ckpt-277.meta
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 14:57 mnist_model/model.ckpt-315.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 14:57 mnist_model/model.ckpt-315.index
      -rw-r--r-- 1 hadoop supergroup 76255 2020-05-21 14:57 mnist_model/model.ckpt-315.meta
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 14:57 mnist_model/model.ckpt-354.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 14:57 mnist_model/model.ckpt-354.index
      -rw-r--r-- 1 hadoop supergroup 76255 2020-05-21 14:57 mnist_model/model.ckpt-354.meta
      -rw-r--r-- 1 hadoop supergroup 814168 2020-05-21 14:58 mnist_model/model.ckpt-393.data-00000-of-00001
      -rw-r--r-- 1 hadoop supergroup 375 2020-05-21 14:58 mnist_model/model.ckpt-393.index
      -rw-r--r-- 1 hadoop supergroup 76255 2020-05-21 14:58 mnist_model/model.ckpt-393.meta
      drwxr-xr-x - hadoop supergroup 0 2020-05-21 14:53 mnist_model/train
      ```
    • Run prediction based on InputMode.TENSORFLOW
      sh mnist_inference_with_tf_cpu.sh
      View the prediction result:
      hdfs dfs -cat predictions/part-00000 |more
      9 4
      9 9
      4 4
      1 1
      4 4
      8 8
      9 9
      2 2
      3 5
      6 6
      9 9
      2 2
      6 6
      0 0
      7 7
      5 5
      3 3
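      In this output each line holds two digits, and a line whose digits differ indicates a wrong prediction. As a minimal sketch, assuming two whitespace-separated columns per line as shown above (the source does not state which column is the label; mismatched_rows is a hypothetical helper), the following function lists the rows where the two values disagree:
      ```shell
      # Hypothetical helper: read two-column prediction output on stdin and print
      # each row whose two values differ, prefixed with its line number.
      mismatched_rows() {
        awk 'NF == 2 && $1 != $2 { print NR ": " $0 }'
      }
      ```
      For example, hdfs dfs -cat predictions/part-00000 | mismatched_rows lists every disagreeing row in the file. For the 17 lines shown above, it prints 1: 9 4 and 9: 3 5.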
