The unified namespace capability of GooseFS integrates the access semantics of different underlying storage systems through a transparent naming mechanism, providing users with a unified interactive view for data management.
Leveraging this capability, GooseFS connects and communicates with different underlying storage systems, such as local file systems, Tencent Cloud Object Storage (COS), and Tencent Cloud HDFS (CHDFS), and provides unified access APIs and file protocols for upper-layer businesses. In this way, the business side only needs to call the access APIs provided by GooseFS to access data stored in different underlying storage systems.
The figure above shows the working principle of the unified namespace. You can use GooseFS's namespace creation instruction `create ns` to mount specified file directories of COS and CHDFS to GooseFS, and then use the `gfs://` unified schema to access data. Details are as follows:
gfs://, and the files are cached in the local file system of GooseFS.
`hadoop fs -ls gfs://BU_A`), and they can also be accessed through the namespace of each remote file system (for example, `hadoop fs -ls cosn://bucket-1/BU_A`).
`gfs://` because the files are not cached in the local file system of GooseFS, but they can still be accessed through the namespace of the underlying storage systems.
You can use the `create ns` instruction to create a namespace in GooseFS and map underlying storage systems to GooseFS. Currently supported underlying storage systems include COS, CHDFS, and local HDFS. Creating a namespace is similar to mounting a volume in a Linux file system. Once the namespace is created, GooseFS can provide clients with a file system with uniform access semantics. The current operation instruction set for GooseFS namespaces is as follows:
$ goosefs ns
Usage: goosefs ns [generic options]
    [create <namespace> <CosN/Chdfs path> <--wPolicy <1-6>> <--rPolicy <1-5>> [--readonly] [--shared] [--secret fs.cosn.userinfo.secretId=<AKIDxxxxxxx>] [--secret fs.cosn.userinfo.secretKey=<xxxxxxxxxx>] [--attribute fs.ofs.userinfo.appid=1200000000][--attribute fs.cosn.bucket.region=<ap-xxx>/fs.cosn.bucket.endpoint_suffix=<cos.ap-xxx.myqcloud.com>]]
    [delete <namespace>]
    [help [<command>]]
    [ls [-r|--sort=option|--timestamp=option]]
    [setPolicy [--wPolicy <1-6>] [--rPolicy <1-5>] <namespace>]
    [setTtl [--action delete|free] <namespace> <time to live>]
    [stat <namespace>]
    [unsetPolicy <namespace>]
    [unsetTtl <namespace>]
The instructions are described as follows:
|create||Creates a namespace and maps a remote storage system (UFS) to the namespace. When creating the namespace, you can set cache read and write policies. You need to pass in an authorized key (
|delete||Deletes a specified namespace.|
|ls||Lists the detailed information of a specified namespace, including the UFS path, creation time, cache policy, and TTL information.|
|setPolicy||Sets the cache policy of a specified namespace.|
|setTtl||Sets TTL for a specified namespace.|
|stat||Provides the description of a specified namespace, including the mount point, UFS path, creation time, cache policy, TTL information, persistence status, user group, ACL, last access time, and modification time.|
|unsetPolicy||Resets the cache policy of a specified namespace.|
|unsetTtl||Resets the TTL of a specified namespace.|
By creating a namespace in GooseFS, you can cache frequently accessed hot data from a remote storage system to a local high-performance storage node to provide high-performance data access for local computing businesses. The following shows how to map the COS bucket `example-bucket`, the `example-prefix` directory in the bucket, and the CHDFS to the `test_cos`, `test_cos_prefix`, and `test_chdfs` namespaces respectively.
# Map the COS bucket `example-bucket` to the `test_cos` namespace
$ goosefs ns create test_cos cosn://example-bucket-1250000000/ --wPolicy 1 --rPolicy 1 \
    --secret fs.cosn.userinfo.secretId=AKIDxxxxxxx \
    --secret fs.cosn.userinfo.secretKey=xxxxxxxxxx \
    --attribute fs.cosn.bucket.region=ap-guangzhou \
    --attribute fs.cosn.bucket.endpoint_suffix=cos.ap-guangzhou.myqcloud.com

# Map the `example-prefix` directory in the COS bucket `example-bucket` to the `test_cos_prefix` namespace
$ goosefs ns create test_cos_prefix cosn://example-bucket-1250000000/example-prefix/ --wPolicy 1 --rPolicy 1 \
    --secret fs.cosn.userinfo.secretId=AKIDxxxxxxx \
    --secret fs.cosn.userinfo.secretKey=xxxxxxxxxx \
    --attribute fs.cosn.bucket.region=ap-guangzhou \
    --attribute fs.cosn.bucket.endpoint_suffix=cos.ap-guangzhou.myqcloud.com

# Map the CHDFS `f4ma0l3qabc-Xy3` to the `test_chdfs` namespace
$ goosefs ns create test_chdfs ofs://f4ma0l3qabc-Xy3/ --wPolicy 1 --rPolicy 1 \
    --attribute fs.ofs.userinfo.appid=1250000000
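The COS examples above embed the key pair directly in the command line. As a minimal sketch, the same `goosefs ns create` call can be wrapped in a shell function that reads the credentials from environment variables. The variable names `COS_SECRET_ID`, `COS_SECRET_KEY`, and `DRY_RUN` and both helper functions are hypothetical and not part of GooseFS. The endpoint suffix is derived from the region using the same pattern as the example above. By default the wrapper only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical wrapper around `goosefs ns create` for a COS path.
# COS_SECRET_ID, COS_SECRET_KEY and DRY_RUN are assumed names, not GooseFS settings.

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: create_cos_namespace <namespace> <cosn path> <region>
create_cos_namespace() {
    ns="$1"; ufs_path="$2"; region="$3"
    if [ -z "${COS_SECRET_ID:-}" ] || [ -z "${COS_SECRET_KEY:-}" ]; then
        echo "COS_SECRET_ID and COS_SECRET_KEY must be set" >&2
        return 1
    fi
    run goosefs ns create "$ns" "$ufs_path" \
        --wPolicy 1 --rPolicy 1 \
        --secret "fs.cosn.userinfo.secretId=$COS_SECRET_ID" \
        --secret "fs.cosn.userinfo.secretKey=$COS_SECRET_KEY" \
        --attribute "fs.cosn.bucket.region=$region" \
        --attribute "fs.cosn.bucket.endpoint_suffix=cos.$region.myqcloud.com"
}

# Example (export the two variables first):
#   create_cos_namespace test_cos cosn://example-bucket-1250000000/ ap-guangzhou
```

Setting `DRY_RUN=0` makes the wrapper execute the command instead of printing it. Note that the dry-run output includes the key, so it should not be written to shared logs.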
After successful creation, you can use the `goosefs fs ls` instruction to view the directory details:
$ goosefs fs ls /test_cos
You can use the `delete` instruction to delete unwanted namespaces:
$ goosefs ns delete test_cos
Delete the namespace: test_cos
You can use `setPolicy` to set the cache policy of a namespace. The instruction set is as follows:
$ goosefs ns setPolicy [--wPolicy <1-6>] [--rPolicy <1-5>] <namespace>
The parameters are described as follows:
The read and write cache policies currently supported by GooseFS are as follows:
Write cache policies
|Policy Name||Behavior||Corresponding Write_Type||Data Security||Write Efficiency|
|MUST_CACHE (1)||Data is stored only in GooseFS and is not written to the remote storage system.||MUST_CACHE||Unreliable||High|
|TRY_CACHE (2)||If the cache has space, data is written to GooseFS. Otherwise, data is written directly to underlying storage systems.||TRY_CACHE||Unreliable||Medium|
|CACHE_THROUGH (3)||Data is cached as much as possible and simultaneously written to remote storage systems.||CACHE_THROUGH||Reliable||Low|
|THROUGH (4)||Data is not stored in GooseFS, but written directly to the remote storage system.||THROUGH||Reliable||Medium|
|ASYNC_THROUGH (5)||Data is written to GooseFS and asynchronously persisted to remote storage systems.||ASYNC_THROUGH||Weak reliability||High|
`Write_Type` indicates the file cache policy specified when the user calls the SDK or API to write data to GooseFS. It takes effect only for a single file.
Read cache policies
|Policy Name||Behavior||Metadata Sync||Corresponding Read_Type||Data Consistency||Read Efficiency||Whether to Cache Data|
|NO_CACHE (1)||Data is not cached and is directly read from remote storage systems instead.||NO||NO_CACHE||Strong||Low||No|
Not hit: low
Not hit: low
Not hit: low
Not hit: low
`Read_Type` indicates the file cache policy specified when the user calls the SDK or API to read data from GooseFS. It takes effect only for a single file.
Based on current big data business practices, we recommend the following combinations of read and write cache policies:
|Write Cache Policy||Read Cache Policy||Policy Combination Performance|
|CACHE_THROUGH (3)||CACHE_CONSISTENT (5)||Strong data consistency between the cache and remote storage systems|
|CACHE_THROUGH (3)||CACHE (2)||Write: strong consistency; read: eventual consistency|
|ASYNC_THROUGH (5)||CACHE_CONSISTENT (5)||Write: eventual consistency; read: strong consistency|
|ASYNC_THROUGH (5)||CACHE (2)||Read/Write: eventual consistency|
|MUST_CACHE (1)||CACHE (2)||Data is read from the cache only.|
The following example shows how to set the read and write cache policies of the `test_cos` namespace to `CACHE_THROUGH` and `CACHE_CONSISTENT` respectively:
$ goosefs ns setPolicy --wPolicy 3 --rPolicy 5 test_cos
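Because `setPolicy` takes numeric codes, a small lookup can make scripts easier to read. The following sketch maps policy names to the codes listed in the tables above and prints the resulting command. The function names are hypothetical, and only the read policies whose codes appear in this document (`NO_CACHE`, `CACHE`, `CACHE_CONSISTENT`) are covered.

```shell
#!/bin/sh
# Hypothetical helpers: translate policy names into the numeric codes
# shown in the write and read cache policy tables of this document.

write_policy_code() {
    case "$1" in
        MUST_CACHE)    echo 1 ;;
        TRY_CACHE)     echo 2 ;;
        CACHE_THROUGH) echo 3 ;;
        THROUGH)       echo 4 ;;
        ASYNC_THROUGH) echo 5 ;;
        *) echo "unknown write policy: $1" >&2; return 1 ;;
    esac
}

read_policy_code() {
    case "$1" in
        NO_CACHE)         echo 1 ;;
        CACHE)            echo 2 ;;
        CACHE_CONSISTENT) echo 5 ;;
        *) echo "unknown read policy: $1" >&2; return 1 ;;
    esac
}

# Usage: set_policy_cmd <namespace> <write policy name> <read policy name>
# Prints the command rather than running it.
set_policy_cmd() {
    w="$(write_policy_code "$2")" || return 1
    r="$(read_policy_code "$3")" || return 1
    echo "goosefs ns setPolicy --wPolicy $w --rPolicy $r $1"
}

set_policy_cmd test_cos CACHE_THROUGH CACHE_CONSISTENT
```

The final line prints the same command as the example above, `goosefs ns setPolicy --wPolicy 3 --rPolicy 5 test_cos`.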
In addition to specifying cache policies when creating namespaces, you can also set `Read_Type` or `Write_Type` for specific files when reading or writing them, or configure global cache policies in the properties configuration file. If multiple policies exist at the same time, they take priority in the following order: custom per-file policy > namespace read and write policies > global cache policy configured in the configuration file. For the read policy, the custom `Read_Type` and the namespace's `DirReadPolicy` take effect in combination: the custom `Read_Type` is used as the data stream read policy, and the namespace policy is used for metadata.
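The priority order can be pictured with a short shell sketch. It only models the resolution rule described in the text and does not reflect how GooseFS implements it. The function names are hypothetical, and the fallback to the namespace policy when no custom `Read_Type` is given is an assumption.

```shell
#!/bin/sh
# Illustration only: models the documented priority order, not GooseFS internals.

# Write side: custom Write_Type > namespace write policy > global default.
# Pass an empty string for a level that is not configured.
effective_write_type() {
    custom="$1"; namespace="$2"; global="$3"
    if [ -n "$custom" ]; then
        echo "$custom"
    elif [ -n "$namespace" ]; then
        echo "$namespace"
    else
        echo "$global"
    fi
}

# Read side: the custom Read_Type drives the data stream and the namespace
# policy drives metadata. Falling back to the namespace policy for the data
# stream when no custom Read_Type is set is an assumption.
effective_read_behavior() {
    custom="$1"; namespace="$2"
    echo "data=${custom:-$namespace} metadata=$namespace"
}

effective_write_type "" CACHE_THROUGH MUST_CACHE
effective_read_behavior CACHE_PROMOTE CACHE_CONSISTENT
```

With no custom `Write_Type`, the first call resolves to the namespace policy `CACHE_THROUGH`. The second call reports `CACHE_PROMOTE` for the data stream and `CACHE_CONSISTENT` for metadata.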
For example, GooseFS contains a COSN namespace whose read policy is `CACHE_CONSISTENT`, and the namespace contains a `test.txt` file. When the client reads the file, `Read_Type` is specified as
CACHE_PROMOTE. Then the entire read behavior is to sync metadata and perform
To reset the read and write cache policies, you can use the `unsetPolicy` instruction. The following shows how to reset the read and write cache policies for the `test_cos` namespace:
$ goosefs ns unsetPolicy test_cos
Time to Live (TTL) is used to manage data cached on the local nodes of GooseFS. Setting a TTL causes a specified operation, such as `delete` or `free`, to be performed on the cached data after a specified period of time. The instruction for setting TTL is as follows:
$ goosefs ns setTtl [--action delete|free] <namespace> <time to live>
The parameters are described as follows:
Currently, the `delete` and `free` operations are supported. The `delete` operation deletes data from the cache and UFS, while the `free` operation deletes data only from the cache.
The following example shows how to configure the `test_cos` namespace so that data is deleted only from the cache after 60 seconds (the TTL value `60000` is in milliseconds):
$ goosefs ns setTtl --action free test_cos 60000
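Since the example passes `60000` for a 60-second TTL, a script can convert from seconds to avoid unit mistakes. The sketch below assumes the `<time to live>` argument is in milliseconds, as that example implies. The helper names are hypothetical, and the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helpers for building `goosefs ns setTtl` commands.
# Assumption: <time to live> is in milliseconds (60 s is written as 60000 above).

# Convert whole seconds to milliseconds.
ttl_ms() {
    echo $(( $1 * 1000 ))
}

# Usage: set_ttl_cmd <namespace> <delete|free> <seconds>
set_ttl_cmd() {
    ns="$1"; action="$2"; seconds="$3"
    case "$action" in
        delete|free) ;;
        *) echo "action must be delete or free" >&2; return 1 ;;
    esac
    echo "goosefs ns setTtl --action $action $ns $(ttl_ms "$seconds")"
}

set_ttl_cmd test_cos free 60
```

Because `delete` also removes data from the UFS, rejecting any action other than the two documented ones guards against a mistyped value reaching the command line.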
This section describes how GooseFS manages metadata, including metadata synchronization and updates. GooseFS provides users with unified namespace capability. Users can access files on different underlying storage systems using a unified `gfs://` path. You only need to specify the paths of the underlying storage systems. We recommend that you use GooseFS as a unified data access layer and read and write all data through GooseFS to ensure metadata consistency.
You can configure the metadata synchronization interval in the `conf/goosefs-site.properties` configuration file:
The metadata synchronization interval parameter supports 3 types of input values:
You can choose an appropriate synchronization interval based on your number of nodes, the I/O distance between your GooseFS cluster and the underlying storage system, and the type of the underlying storage system. Usually:
Configuration via CLI
You can set the metadata synchronization interval in command line interface (CLI) mode:
$ goosefs fs ls -R -Dgoosefs.user.file.metadata.sync.interval=0 <path to sync>
Configuration via the configuration file
For a large-scale GooseFS cluster, you can use the `goosefs-site.properties` configuration file to batch configure the metadata synchronization interval for the master nodes in the cluster, and other nodes will adopt this interval by default.
Many businesses choose to distinguish the purpose of data by directory, and the data access frequencies of different directories are not all the same. You can set different metadata synchronization intervals for different directories. For some directories that change frequently, the metadata synchronization interval can be set to a shorter time (such as 5 minutes). For directories that change little or do not change, the synchronization interval can be set to `-1`, so that GooseFS will not automatically synchronize the metadata of the directories.
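One way to apply different intervals per directory is to keep them in a small table and generate one CLI call per path, reusing the `-Dgoosefs.user.file.metadata.sync.interval` option shown in the CLI example. The sketch below only prints the commands. The directory names are hypothetical, and writing 5 minutes as `300000` assumes that a bare number is read as milliseconds.

```shell
#!/bin/sh
# Sketch: one listing command per directory, each with its own sync interval.
# Paths are hypothetical. 300000 stands for 5 minutes, assuming milliseconds.

# Usage: sync_cmd <path> <interval>
sync_cmd() {
    path="$1"; interval="$2"
    echo "goosefs fs ls -R -Dgoosefs.user.file.metadata.sync.interval=$interval $path"
}

# Table of "<path> <interval>" pairs; -1 disables automatic synchronization.
while read -r path interval; do
    [ -n "$path" ] || continue
    sync_cmd "$path" "$interval"
done <<'EOF'
/test_cos/hot 300000
/test_cos/archive -1
EOF
```

Keeping the pairs in one table makes it easy to review which directories are synchronized aggressively and which are left alone.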
You can set different metadata synchronization intervals based on business access modes:
|Access Mode||UFS Type||Metadata Synchronization Interval||Remarks|
|All file requests go through GooseFS||-||-1||-|
|Most file requests go through GooseFS||HDFS is used as UFS||Hot update or update by path is recommended||If the HDFS updates frequently, you are advised to set the update interval to `-1` to prohibit updates.|
|Most file requests go through GooseFS||COS is used as UFS||Configuring update intervals by path is recommended||Configuring different update intervals for different directories can alleviate the pressure of metadata synchronization.|
|File upload requests generally do not go through GooseFS||HDFS is used as UFS||Configuring update intervals by path is recommended||-|
|File upload requests generally do not go through GooseFS||COS is used as UFS||Configuring update intervals by path is recommended||-|