Data Preloading

Last updated: 2021-11-18 12:47:34

    To ensure good data access performance, data preloading pulls data from a remote storage system into the distributed cache engine close to the compute nodes before the application starts. The application that consumes the data then benefits from cache acceleration even on its first run.

    We provide the DataLoad CRD to help you implement and control data preloading with simple configuration.

    This document provides two examples that demonstrate how to use the DataLoad CRD:

    • DataLoad Quick Usage
    • DataLoad Advanced Configuration

    Prerequisites

    You have installed Fluid (version 0.6.0 or later).

    Note:

    For Fluid installation details, please see Installation.

    Setting Up an Environment

    $ mkdir <any-path>/warmup
    $ cd <any-path>/warmup
    

    DataLoad Quick Usage

    Configure the Dataset and Runtime objects to be created (save as dataset.yaml)

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
     name: spark
    spec:
     mounts:
       - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
         name: spark 
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: GooseFSRuntime
    metadata:
     name: spark
    spec:
     replicas: 2
     tieredstore:
       levels:
         - mediumtype: SSD
           path: /mnt/disk1/
           quota: 2G
           high: "0.8"
           low: "0.7"
    
    Note:

    To facilitate testing, mountPoint is set to WebUFS in this example. If you want to mount COS, see Mounting COS (COSN) to GooseFS.

    Here, we create a resource object whose kind is Dataset. Dataset is a Custom Resource Definition (CRD) defined by Fluid that tells Fluid where to find the desired data. Fluid mounts the mountPoint attribute defined in this CRD object to GooseFS.

    In this example, for simplicity, we use WebUFS for demonstration.

    Create the Dataset and Runtime objects

    $ kubectl create -f dataset.yaml
    

    Wait for Dataset and Runtime to be ready

    $ kubectl get datasets spark
    

    If information similar to the following is displayed, Dataset and Runtime are ready:

    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    spark   1.92GiB          0.00B    4.00GiB          0.0%                Bound   4m4s
    

    Configure the DataLoad object to be created (save as dataload.yaml)

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
     name: spark-dataload
    spec:
     loadMetadata: true
     dataset:
       name: spark
       namespace: default
    

    spec.dataset specifies the target dataset that needs to be preloaded. In this example, our target is the dataset named spark under the default namespace. You can modify the configuration above if it doesn't match your actual environment.

    By default, with the above DataLoad configuration, the system attempts to load all the data in the dataset. If you want finer-grained control over data preloading (e.g. preloading data under a specified path only), please see DataLoad Advanced Configuration.

    Create the DataLoad object

    $ kubectl create -f dataload.yaml
    

    Check the status of the created DataLoad object

    $ kubectl get dataload spark-dataload
    

    Information similar to the following will be displayed:

    NAME             DATASET   PHASE     AGE
    spark-dataload   spark     Loading   2m13s
    

    You can also use kubectl describe to get more details about the DataLoad:

    $ kubectl describe dataload spark-dataload
    

    Information similar to the following will be displayed:

    Name:         spark-dataload
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  data.fluid.io/v1alpha1
    Kind:         DataLoad
    ...
    Spec:
     Dataset:
       Name:       spark
       Namespace:  default
    Status:
     Conditions:
     Phase:  Loading
    Events:
     Type    Reason              Age   From      Message
     ----    ------              ----  ----      -------
     Normal  DataLoadJobStarted  80s   DataLoad  The DataLoad job spark-dataload-loader-job started
    

    The data preloading process may take several minutes depending on your network environment.

    Wait for the data preloading to complete

    $ kubectl get dataload spark-dataload
    

    If the data preloading is completed, the value of Phase of the DataLoad changes from Loading to Complete.

    NAME             DATASET   PHASE      AGE
    spark-dataload   spark     Complete   5m17s
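
Rather than re-running kubectl get by hand, the wait can be scripted. The following is a minimal sketch, not part of Fluid: wait_for_phase is a hypothetical helper, and it assumes that the PHASE column shown above is backed by the .status.phase field of the resource. The default timeout and polling interval are arbitrary choices.

```shell
# Binary used to talk to the cluster; override KUBECTL to use another one.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_phase KIND NAME PHASE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls the resource until its phase equals PHASE or the timeout expires.
wait_for_phase() {
  local kind="$1" name="$2" want="$3" timeout="${4:-600}" interval="${5:-5}"
  local waited=0 phase=""
  while [ "$waited" -le "$timeout" ]; do
    # Assumption: the PHASE column is read from .status.phase.
    phase="$("$KUBECTL" get "$kind" "$name" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "$want" ]; then
      echo "$kind/$name is $phase"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out waiting for $kind/$name to reach $want (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For the objects in this example, the call would be wait_for_phase dataload spark-dataload Complete. Under the same assumption, wait_for_phase dataset spark Bound covers the earlier step of waiting for the Dataset to be ready.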
    

    Now, check the cache status of the Dataset object:

    $ kubectl get dataset spark
    

    You'll find that all data in the remote file storage system has already been preloaded into the distributed cache engine:

    NAME    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    spark   1.92GiB          1.92GiB   4.00GiB          100.0%              Bound   7m41s
    

    Clean up the environment

    $ kubectl delete -f .
    

    DataLoad Advanced Configuration

    Besides the basic data preloading shown in the example above, simple configuration changes enable several advanced data preloading features:

    • Preload data under one or more specified dataset subdirectories only
    • Set cache replicas when preloading data
    • Sync metadata before preloading data

    Preload data under one or more specified dataset subdirectories only

    You can preload data under specified subdirectories (or individual files) instead of the whole dataset. For example:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
     name: spark-dataload
    spec:
     dataset:
       name: spark
       namespace: default
     loadMetadata: true
     target:
       - path: /spark/spark-2.4.8
       - path: /spark/spark-3.0.1/pyspark-3.0.1.tar.gz
    

    The above DataLoad preloads only the files under /spark/spark-2.4.8 and the /spark/spark-3.0.1/pyspark-3.0.1.tar.gz file.

    Each path value under spec.target is a path relative to the mount point specified by mountPoint. Assume that the current mount point is cos://test/ and that it contains the following files:

    cos://test/user/sample.txt
    cos://test/data/fluid.tgz
    

    Then, you can define target.path as follows:

    target:
     - path: /user
     - path: /data
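
The mapping from a full storage URL to a target path can be sketched as a small shell helper. This is only an illustration of the rule stated for the cos://test/ example: to_target_path is a hypothetical function, not a Fluid tool, and it simply strips the mount point prefix. It does not account for the mount name, which the /spark/... paths in the earlier DataLoad example appear to include.

```shell
# to_target_path MOUNT_POINT FILE_URL
# Prints FILE_URL as a path relative to MOUNT_POINT, with a leading slash.
to_target_path() {
  local mount rel
  mount="${1%/}"                 # drop one trailing slash, if present
  case "$2" in
    "$mount")
      echo "/"                   # the mount point itself
      ;;
    "$mount"/*)
      rel=${2#"$mount"/}         # strip the "<mount>/" prefix
      echo "/$rel"
      ;;
    *)
      echo "error: $2 is not under $1" >&2
      return 1
      ;;
  esac
}
```

Under that assumption, to_target_path cos://test/ cos://test/user/sample.txt yields /user/sample.txt, which could be used as a path value to preload that single file.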
    

    Set cache replicas when preloading data

    When preloading data, you can set the number of cache replicas for each target. For example:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
     name: spark-dataload
    spec:
     dataset:
       name: spark
       namespace: default
     loadMetadata: true
     target:
       - path: /spark/spark-2.4.8
         replicas: 1
       - path: /spark/spark-3.0.1/pyspark-3.0.1.tar.gz
         replicas: 2
    

    The above DataLoad preloads all the files in the /spark/spark-2.4.8 directory with 1 cache replica, and the /spark/spark-3.0.1/pyspark-3.0.1.tar.gz file with 2 cache replicas, in the distributed cache engine.
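
When the list of targets grows, writing the manifest by hand becomes repetitive. The sketch below generates a DataLoad manifest of the same shape as the one above from PATH=REPLICAS arguments. render_dataload is a hypothetical helper introduced here for illustration; it uses only the fields that appear in this document and always sets loadMetadata to true.

```shell
# render_dataload NAME DATASET NAMESPACE PATH=REPLICAS...
# Prints a DataLoad manifest with one target entry per PATH=REPLICAS argument.
render_dataload() {
  local name="$1" dataset="$2" namespace="$3" entry
  shift 3
  # Validate every argument before printing anything.
  for entry in "$@"; do
    case "$entry" in
      *=*) ;;
      *) echo "error: expected PATH=REPLICAS, got '$entry'" >&2; return 1 ;;
    esac
  done
  cat <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: $name
spec:
  dataset:
    name: $dataset
    namespace: $namespace
  loadMetadata: true
  target:
EOF
  for entry in "$@"; do
    # Split at the last '=': left side is the path, right side the replica count.
    printf '    - path: %s\n      replicas: %s\n' "${entry%=*}" "${entry##*=}"
  done
}
```

For example, render_dataload spark-dataload spark default /spark/spark-2.4.8=1 /spark/spark-3.0.1/pyspark-3.0.1.tar.gz=2 > dataload.yaml would write a manifest equivalent to the one shown above, ready for kubectl create -f dataload.yaml.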

    Sync metadata before preloading data (recommended)

    In many scenarios, files in the remote storage system may have changed, and the distributed cache engine needs to sync the metadata of the files to perceive the changes in the underlying storage system. Therefore, before data preloading, you can configure spec.loadMetadata of the DataLoad object to sync metadata in advance. For example:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
     name: spark-dataload
    spec:
     dataset:
       name: spark
       namespace: default
     loadMetadata: true
     target:
       - path: /
         replicas: 1
    