Data Preloading

Last updated: 2024-03-25 16:04:01
    To ensure good application performance when accessing data, you can use data preloading to pull data from a remote storage system into a distributed cache engine close to the computing nodes before the application starts. The application that consumes the data then enjoys the acceleration brought by the distributed cache even on its first run.
    Fluid provides the DataLoad CRD to help you easily implement and control data preloading with simple configuration.
    This document provides two examples to demonstrate how to use DataLoad CRD:
    DataLoad Quick Usage
    DataLoad Advanced Configuration

    Prerequisites

    You have installed Fluid (version 0.6.0 or later).
    Note:
    For Fluid installation details, please see Installation.
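    To confirm that Fluid is up before proceeding, you can check that its components are running (a quick sanity check, assuming Fluid is installed in its default fluid-system namespace):
    $ kubectl get pods -n fluid-system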

    Setting Up an Environment

    $ mkdir <any-path>/warmup
    $ cd <any-path>/warmup

    DataLoad Quick Usage

    Configure the Dataset and Runtime objects to be created
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: spark
    spec:
      mounts:
        - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
          name: spark
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: GooseFSRuntime
    metadata:
      name: spark
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: SSD
            path: /mnt/disk1/
            quota: 2G
            high: "0.8"
            low: "0.7"
    Note:
    To facilitate testing, mountPoint is set to WebUFS in this example. If you want to mount COS, see Mounting COS (COSN) to GooseFS.
    Here, we create a resource object of kind Dataset. Dataset is a Custom Resource Definition (CRD) defined by Fluid that tells Fluid where to find the desired data. Fluid mounts the mountPoint attribute defined in this CRD object to GooseFS.
    In this example, for simplicity, we use WebUFS for demonstration.
    Save the above configuration as dataset.yaml, then create the Dataset and Runtime objects
    $ kubectl create -f dataset.yaml
    Wait for Dataset and Runtime to be ready
    $ kubectl get datasets spark
    If information similar to the following is displayed, Dataset and Runtime are ready:
    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    spark   1.92GiB          0.00B    4.00GiB          0.0%                Bound   4m4s
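    You can also check the status of the corresponding runtime object (assuming the GooseFSRuntime named spark created above):
    $ kubectl get goosefsruntime spark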
    Configure the DataLoad object to be created
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark-dataload
    spec:
      loadMetadata: true
      dataset:
        name: spark
        namespace: default
    spec.dataset specifies the target dataset that needs to be preloaded. In this example, our target is the dataset named spark under the default namespace. You can modify the configuration above if it doesn't match your actual environment.
    By default, according to the above DataLoad configuration, the system will attempt to load all the data in the entire dataset. If you want to control the data preloading behavior in a more fine-grained way (e.g. preload data under a specified path only), see DataLoad Advanced Configuration below.
    Save the above configuration as dataload.yaml, then create the DataLoad object
    $ kubectl create -f dataload.yaml
    Check the status of the created DataLoad object
    $ kubectl get dataload spark-dataload
    Information similar to the following will be displayed:
    NAME             DATASET   PHASE     AGE
    spark-dataload   spark     Loading   2m13s
    You can also use kubectl describe to get more details about the DataLoad:
    $ kubectl describe dataload spark-dataload
    Information similar to the following will be displayed:
    Name:         spark-dataload
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  data.fluid.io/v1alpha1
    Kind:         DataLoad
    ...
    Spec:
      Dataset:
        Name:       spark
        Namespace:  default
    Status:
      Conditions:
      Phase:  Loading
    Events:
      Type    Reason              Age   From      Message
      ----    ------              ----  ----      -------
      Normal  DataLoadJobStarted  80s   DataLoad  The DataLoad job spark-dataload-loader-job started
    The data preloading process may take several minutes depending on your network environment.
    Wait for the data preloading to complete
    $ kubectl get dataload spark-dataload
    Once the data preloading is complete, the Phase of the DataLoad changes from Loading to Complete.
    NAME             DATASET   PHASE      AGE
    spark-dataload   spark     Complete   5m17s
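    Rather than polling, you can also watch the DataLoad object until its phase changes (the -w flag of kubectl get streams updates):
    $ kubectl get dataload spark-dataload -w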
    Now, check the cache status of the Dataset object:
    $ kubectl get dataset spark
    You'll find that all data in the remote file storage system has already been preloaded into the distributed cache engine:
    NAME    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    spark   1.92GiB          1.92GiB   4.00GiB          100.0%              Bound   7m41s
    Clean up the environment
    $ kubectl delete -f .

    DataLoad Advanced Configuration

    Besides the basic data preloading feature demonstrated in the example above, you can enable several advanced data preloading features with simple configuration, including:
    Preload data under one or more specified dataset subdirectories only
    Set cache replicas when preloading data
    Sync metadata before preloading data

    Preload data under one or more specified dataset subdirectories only

    You can preload data under specified subdirectories (or specific files) instead of the whole dataset. For example:
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark-dataload
    spec:
      dataset:
        name: spark
        namespace: default
      loadMetadata: true
      target:
        - path: /spark/spark-2.4.8
        - path: /spark/spark-3.0.1/pyspark-3.0.1.tar.gz
    The above DataLoad will preload only the files under /spark/spark-2.4.8 and the /spark/spark-3.0.1/pyspark-3.0.1.tar.gz file.
    In spec.target.path, all values are relative paths under the mount point specified by mountPoint. Assume that the current mount point is cos://test/ and the underlying storage contains the following files:
    cos://test/user/sample.txt
    cos://test/data/fluid.tgz
    Then, you can define target.path as follows:
    target:
      - path: /user
      - path: /data

    Set cache replicas when preloading data

    When preloading data, you can set the number of cache replicas with simple configuration. For example:
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark-dataload
    spec:
      dataset:
        name: spark
        namespace: default
      loadMetadata: true
      target:
        - path: /spark/spark-2.4.8
          replicas: 1
        - path: /spark/spark-3.0.1/pyspark-3.0.1.tar.gz
          replicas: 2
    The above DataLoad will preload all the files in the /spark/spark-2.4.8 directory with only 1 cache replica in the distributed cache engine, and preload the /spark/spark-3.0.1/pyspark-3.0.1.tar.gz file with 2 cache replicas in the distributed cache engine.

    Sync metadata before preloading data (recommended)

    In many scenarios, files in the remote storage system may have changed, and the distributed cache engine needs to sync file metadata to become aware of those changes in the underlying storage system. Therefore, before data preloading, you can set spec.loadMetadata in the DataLoad object to sync metadata in advance. For example:
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark-dataload
    spec:
      dataset:
        name: spark
        namespace: default
      loadMetadata: true
      target:
        - path: /
          replicas: 1
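    As in the quick-usage example, save this configuration to a file and create it with kubectl, then watch the DataLoad until it completes (a minimal sketch, assuming the file is named dataload.yaml):
    $ kubectl create -f dataload.yaml
    $ kubectl get dataload spark-dataload -w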