The unified namespace capability of GooseFS integrates the access semantics of different underlying storage systems through a transparent naming mechanism, providing users with a unified interactive view for data management.
Leveraging this capability, GooseFS connects and communicates with different underlying storage systems, such as local file systems, Tencent Cloud Object Storage (COS), and Tencent Cloud HDFS (CHDFS), and provides unified access APIs and file protocols for upper-layer businesses. In this way, the business side only needs to call the access APIs provided by GooseFS to access data stored in different underlying storage systems.
The figure above shows the working principle of the unified namespace. You can use GooseFS's namespace creation instruction `goosefs ns create` to mount specified file directories of COS and CHDFS to GooseFS, and then use the `gfs://` unified schema to access data. Details are as follows:
- Files in the mounted directories can be accessed through `gfs://`, and the files are cached in the local file system of GooseFS.
- Mounted files can be accessed through the unified schema `gfs://` (for example, `hadoop fs -ls gfs://BU_A`), and they can also be accessed through the namespace of each remote file system (for example, `hadoop fs -ls cosn://bucket-1/BU_A`).
- Files in directories that are not mounted cannot be accessed through `gfs://` because the files are not cached in the local file system of GooseFS, but they can still be accessed through the namespace of the underlying storage systems.

You can use the `goosefs ns create` instruction to create a namespace in GooseFS and map underlying storage systems to GooseFS. Currently supported underlying storage systems include COS, CHDFS, and local HDFS. Creating a namespace is similar to mounting a disk volume in a Linux file system. With the namespace created, GooseFS can provide clients with a file system with uniform access semantics. The current operation instruction set for GooseFS namespaces is as follows:
$ goosefs ns
Usage: goosefs ns [generic options]
[create <namespace> <CosN/Chdfs path> <--wPolicy <1-6>> <--rPolicy <1-5>> [--readonly] [--shared] [--secret fs.cosn.userinfo.secretId=<AKIDxxxxxxx>] [--secret fs.cosn.userinfo.secretKey=<xxxxxxxxxx>] [--attribute fs.ofs.userinfo.appid=1200000000][--attribute fs.cosn.bucket.region=<ap-xxx>/fs.cosn.bucket.endpoint_suffix=<cos.ap-xxx.myqcloud.com>]]
[delete <namespace>]
[help [<command>]]
[ls [-r|--sort=option|--timestamp=option]]
[setPolicy [--wPolicy <1-6>] [--rPolicy <1-5>] <namespace>]
[setTtl [--action delete|free] <namespace> <time to live>]
[stat <namespace>]
[unsetPolicy <namespace>]
[unsetTtl <namespace>]
The instructions are described as follows:
Instruction | Description |
---|---|
create | Creates a namespace and maps a remote storage system (UFS) to the namespace. When creating the namespace, you can set cache read and write policies. You need to pass in an authorized key (secretId and secretKey). |
delete | Deletes a specified namespace. |
ls | Lists the detailed information of a specified namespace, including the UFS path, creation time, cache policy, and TTL information. |
setPolicy | Sets the cache policy of a specified namespace. |
setTtl | Sets TTL for a specified namespace. |
stat | Provides the description of a specified namespace, including the mount point, UFS path, creation time, cache policy, TTL information, persistence status, user group, ACL, last access time, and modification time. |
unsetPolicy | Resets the cache policy of a specified namespace. |
unsetTtl | Resets the TTL of a specified namespace. |
By creating a namespace in GooseFS, you can cache frequently accessed hot data from a remote storage system to a local high-performance storage node to provide high-performance data access for local computing businesses. The following shows how to map the COS bucket `example-bucket`, the `example-prefix` directory in the bucket, and the CHDFS file system to the `test_cos`, `test_cos_prefix`, and `test_chdfs` namespaces respectively.
# Map the COS bucket `example-bucket` to the `test_cos` namespace
$ goosefs ns create test_cos cosn://example-bucket-1250000000/ --wPolicy 1 --rPolicy 1 --secret fs.cosn.userinfo.secretId=AKIDxxxxxxx --secret fs.cosn.userinfo.secretKey=xxxxxxxxxx --attribute fs.cosn.bucket.region=ap-guangzhou --attribute fs.cosn.bucket.endpoint_suffix=cos.ap-guangzhou.myqcloud.com
# Map the `example-prefix` directory in the COS bucket `example-bucket` to the `test_cos_prefix` namespace
$ goosefs ns create test_cos_prefix cosn://example-bucket-1250000000/example-prefix/ --wPolicy 1 --rPolicy 1 --secret fs.cosn.userinfo.secretId=AKIDxxxxxxx --secret fs.cosn.userinfo.secretKey=xxxxxxxxxx --attribute fs.cosn.bucket.region=ap-guangzhou --attribute fs.cosn.bucket.endpoint_suffix=cos.ap-guangzhou.myqcloud.com
# Map the CHDFS `f4ma0l3qabc-Xy3` to the `test_chdfs` namespace
$ goosefs ns create test_chdfs ofs://f4ma0l3qabc-Xy3/ --wPolicy 1 --rPolicy 1 --attribute fs.ofs.userinfo.appid=1250000000
After successful creation, you can use the `goosefs fs ls` instruction to view the directory details:
$ goosefs fs ls /test_cos
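The creation commands above pass the keys inline, which leaves them in shell history and in any script that contains the command. The following is a minimal sketch of one way to avoid that, assuming the keys are exported in the environment variables `COS_SECRET_ID` and `COS_SECRET_KEY`. Both variable names and the `create_cos_namespace` wrapper are hypothetical and not part of GooseFS; the flags are the ones shown in the usage above, and the endpoint suffix is assumed to follow the `cos.<region>.myqcloud.com` pattern of the example.

```shell
# Hypothetical wrapper (not part of GooseFS): creates a namespace for a COS
# path, reading the keys from the environment instead of the command line.
# COS_SECRET_ID and COS_SECRET_KEY are assumed variable names.
create_cos_namespace() {
  # $1 = namespace, $2 = COS path (cosn://...), $3 = bucket region
  : "${COS_SECRET_ID:?COS_SECRET_ID is not set}"
  : "${COS_SECRET_KEY:?COS_SECRET_KEY is not set}"
  goosefs ns create "$1" "$2" --wPolicy 1 --rPolicy 1 \
    --secret "fs.cosn.userinfo.secretId=${COS_SECRET_ID}" \
    --secret "fs.cosn.userinfo.secretKey=${COS_SECRET_KEY}" \
    --attribute "fs.cosn.bucket.region=$3" \
    --attribute "fs.cosn.bucket.endpoint_suffix=cos.$3.myqcloud.com"
}

# Usage (requires a running GooseFS cluster):
#   export COS_SECRET_ID=AKIDxxxxxxx COS_SECRET_KEY=xxxxxxxxxx
#   create_cos_namespace test_cos cosn://example-bucket-1250000000/ ap-guangzhou
```

The keys are still visible in the process arguments while the command runs, so this only keeps them out of history and version-controlled scripts.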
You can use the `delete` instruction to delete unwanted namespaces:
$ goosefs ns delete test_cos
Delete the namespace: test_cos
You can use `setPolicy` and `unsetPolicy` to set and reset the cache policy of a namespace. The syntax of `setPolicy` is as follows:
$ goosefs ns setPolicy [--wPolicy <1-6>] [--rPolicy <1-5>] <namespace>
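The parameters are described as follows:
- `--wPolicy`: the write cache policy of the namespace, given as the policy number.
- `--rPolicy`: the read cache policy of the namespace, given as the policy number.
- `<namespace>`: the namespace to configure.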
The read and write cache policies currently supported by GooseFS are as follows:
Write cache policies
Policy Name | Behavior | Corresponding Write_Type | Data Security | Write Efficiency |
---|---|---|---|---|
MUST_CACHE (1) | Data is stored only in GooseFS and is not written to the remote storage system. | MUST_CACHE | Unreliable | High |
TRY_CACHE (2) | If the cache has space, data is written to GooseFS. Otherwise, data is written directly to underlying storage systems. | TRY_CACHE | Unreliable | Medium |
CACHE_THROUGH (3) | Data is cached as much as possible and simultaneously written to remote storage systems. | CACHE_THROUGH | Reliable | Low |
THROUGH (4) | Data is not stored in GooseFS, but written directly to the remote storage system. | THROUGH | Reliable | Medium |
ASYNC_THROUGH (5) | Data is written to GooseFS and asynchronously persisted to remote storage systems. | ASYNC_THROUGH | Weak reliability | High |
Note: `Write_Type` indicates the file cache policy specified when the user calls the SDK or API to write data to GooseFS. It takes effect only for a single file.
Read cache policies
Policy Name | Behavior | Metadata Sync | Corresponding Read_Type | Data Consistency | Read Efficiency | Whether to Cache Data |
---|---|---|---|---|---|---|
NO_CACHE (1) | Data is not cached and is directly read from remote storage systems instead. | NO | NO_CACHE | Strong | Low | No |
CACHE (2) | `Read_Type` is `CACHE`. | Once | CACHE | Weak | Hit: high; miss: low | Yes |
CACHE_PROMOTE (3) | … `CACHE`. `Read_Type` is `CACHE_PROMOTE`. | Once | CACHE_PROMOTE | Weak | Hit: high; miss: low | Yes |
CACHE_CONSISTENT_PROMOTE (4) | … `Not Exists` exception. `Read_Type` is `CACHE_PROMOTE`. If the cache is hit, data is moved to the hottest cache medium. | Always | CACHE_PROMOTE | Strong | Hit: medium; miss: low | Yes |
CACHE_CONSISTENT (5) | … `CACHE_CONSISTENT_PROMOTE`. `Read_Type` is `CACHE`. That is, when the cache is hit, data is not moved between different media layers. | Always | CACHE | Strong | Hit: medium; miss: low | Yes |
Note: `Read_Type` indicates the file cache policy specified when the user calls the SDK or API to read data from GooseFS. It takes effect only for a single file.
Based on current big data business practices, we recommend the following combinations of read and write cache policies:
Write Cache Policy | Read Cache Policy | Policy Combination Performance |
---|---|---|
CACHE_THROUGH (3) | CACHE_CONSISTENT (5) | Strong data consistency between the cache and remote storage systems |
CACHE_THROUGH (3) | CACHE (2) | Write: strong consistency; read: eventual consistency |
ASYNC_THROUGH (5) | CACHE_CONSISTENT (5) | Write: eventual consistency; read: strong consistency |
ASYNC_THROUGH (5) | CACHE (2) | Read/Write: eventual consistency |
MUST_CACHE (1) | CACHE (2) | Data is read from the cache only. |
The following example shows how to set the read and write cache policies of the `test_cos` namespace to `CACHE_THROUGH` and `CACHE_CONSISTENT` respectively:
$ goosefs ns setPolicy --wPolicy 3 --rPolicy 5 test_cos
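Because `setPolicy` takes numbers rather than names, scripts can be easier to read if the mapping from the policy tables above is written down once. The following is a minimal sketch under that assumption; `wpolicy_id` and `rpolicy_id` are hypothetical helper names, not GooseFS commands, and they cover only the policies listed in the tables.

```shell
# Hypothetical helpers (not part of GooseFS): translate the policy names from
# the tables above into the numbers expected by --wPolicy and --rPolicy.
wpolicy_id() {
  case "$1" in
    MUST_CACHE)    echo 1 ;;
    TRY_CACHE)     echo 2 ;;
    CACHE_THROUGH) echo 3 ;;
    THROUGH)       echo 4 ;;
    ASYNC_THROUGH) echo 5 ;;
    *) echo "unknown write policy: $1" >&2; return 1 ;;
  esac
}

rpolicy_id() {
  case "$1" in
    NO_CACHE)                 echo 1 ;;
    CACHE)                    echo 2 ;;
    CACHE_PROMOTE)            echo 3 ;;
    CACHE_CONSISTENT_PROMOTE) echo 4 ;;
    CACHE_CONSISTENT)         echo 5 ;;
    *) echo "unknown read policy: $1" >&2; return 1 ;;
  esac
}

# Usage (requires a running GooseFS cluster):
#   goosefs ns setPolicy --wPolicy "$(wpolicy_id CACHE_THROUGH)" \
#     --rPolicy "$(rpolicy_id CACHE_CONSISTENT)" test_cos
```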
Note: In addition to specifying cache policies when creating namespaces, you can set `Read_Type` or `Write_Type` for specific files when reading or writing them, or configure global cache policies in the properties configuration file. If multiple policies exist at the same time, their priority order is as follows: custom per-file policy > namespace read and write policies > global cache policy configured in the configuration file. For the read policy, the combination of the custom `Read_Type` and the namespace's `DirReadPolicy` takes effect. That is, the custom `Read_Type` is used as the data stream read policy, and the namespace policy is used for metadata.

For example, GooseFS contains a COSN namespace whose read policy is `CACHE_CONSISTENT`, and the namespace contains a `test.txt` file. When the client reads the `test.txt` file, `Read_Type` is specified as `CACHE_PROMOTE`. The resulting read behavior is to sync metadata and then perform `CACHE_PROMOTE`.
To reset the read and write cache policies, you can use the `unsetPolicy` instruction. The following shows how to reset the read and write cache policies for the `test_cos` namespace:
$ goosefs ns unsetPolicy test_cos
Time to Live (TTL) is used to manage data cached on the local nodes of GooseFS. Setting TTL allows a specified operation, such as `delete` or `free`, to be performed on the cached data after a specified period of time. The instruction for setting TTL is as follows:
$ goosefs ns setTtl [--action delete|free] <namespace> <time to live>
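The parameters are described as follows:
- `--action`: the operation performed on the cached data when the TTL expires. `delete` and `free` are supported. The `delete` operation deletes data from the cache and UFS, while the `free` operation deletes data only from the cache.
- `<namespace>`: the namespace to which the TTL applies.
- `<time to live>`: the TTL, in milliseconds.

The following example shows how to set the policy of the `test_cos` namespace to delete data only from the cache after 60 seconds: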
$ goosefs ns setTtl --action free test_cos 60000
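The example passes `60000` for 60 seconds, which implies the value is in milliseconds. The following is a minimal sketch of a wrapper that accepts seconds and does the conversion, under that assumption; `set_free_ttl_seconds` is a hypothetical name and not a GooseFS command.

```shell
# Hypothetical helper (not part of GooseFS): sets a "free" TTL given in seconds.
# Assumes <time to live> is in milliseconds, as the 60-second example implies.
set_free_ttl_seconds() {
  # $1 = namespace, $2 = TTL in whole seconds (no leading zeros)
  case "$2" in
    ''|*[!0-9]*|0?*)
      echo "TTL must be a whole number of seconds" >&2
      return 1 ;;
  esac
  goosefs ns setTtl --action free "$1" "$(( $2 * 1000 ))"
}

# Usage (requires a running GooseFS cluster):
#   set_free_ttl_seconds test_cos 60
```

Rejecting non-numeric input before calling `goosefs` avoids passing a value such as `1m`, which the source does not document as valid for `setTtl`.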
This section describes how GooseFS manages metadata, including metadata synchronization and updates. GooseFS provides users with unified namespace capability. Users can access files on different underlying storage systems using a unified `gfs://` path. You only need to specify the paths of the underlying storage systems. We recommend that you use GooseFS as a unified data access layer and read and write all data through GooseFS to ensure metadata consistency.
You can configure the metadata synchronization interval in the `conf/goosefs-site.properties` configuration file:
goosefs.user.file.metadata.sync.interval=<INTERVAL>
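The metadata synchronization interval parameter supports 3 types of input values:
- `-1`: metadata is not synchronized automatically.
- `0`: metadata is synchronized on every access.
- A positive value (for example, `1m`): metadata is synchronized again only after the interval has elapsed since the last synchronization.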
You can choose an appropriate synchronization interval based on your number of nodes, the I/O distance between your GooseFS cluster and the underlying storage system, and the type of the underlying storage system.
Configuration via CLI
You can set the metadata synchronization interval for a single command on the command line interface (CLI):
goosefs fs ls -R -Dgoosefs.user.file.metadata.sync.interval=0 <path to sync>
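When several paths need refreshing, the same command can be repeated in a loop. The following is a minimal sketch, assuming (as the command above suggests) that listing a path with the interval set to `0` synchronizes its metadata; `sync_metadata_now` is a hypothetical helper name and not a GooseFS command.

```shell
# Hypothetical helper (not part of GooseFS): triggers a one-off metadata
# synchronization for each given path by listing it with the interval set to 0.
# The listing output is discarded; the loop stops at the first failure.
sync_metadata_now() {
  for target in "$@"; do
    goosefs fs ls -R -Dgoosefs.user.file.metadata.sync.interval=0 "$target" \
      >/dev/null || return 1
  done
}

# Usage (requires a running GooseFS cluster):
#   sync_metadata_now /test_cos /test_chdfs
```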
Configuration via the configuration file
For a large-scale GooseFS cluster, you can use the `goosefs-site.properties` configuration file to batch configure the metadata synchronization interval for the master nodes in the cluster, and other nodes will adopt this interval by default.
goosefs.user.file.metadata.sync.interval=1m
Note: Many businesses choose to distinguish the purpose of data by directory, and the data access frequencies of different directories are not all the same. You can set different metadata synchronization intervals for different directories. For some directories that change frequently, the metadata synchronization interval can be set to a shorter time (such as 5 minutes). For directories that change little or do not change, the synchronization interval can be set to `-1`, so that GooseFS will not automatically synchronize the metadata of the directories.
You can set different metadata synchronization intervals based on business access modes:
Access Mode | UFS | Metadata Synchronization Interval | Remarks |
---|---|---|---|
All file requests go through GooseFS | - | -1 | - |
Most file requests go through GooseFS | HDFS | Hot update or update by path is recommended | If the HDFS updates frequently, you are advised to set the update interval to `-1` to prohibit updates. |
Most file requests go through GooseFS | COS | Configuring update intervals by path is recommended | Configuring different update intervals for different directories can alleviate the pressure of metadata synchronization. |
File upload requests generally do not go through GooseFS | HDFS | Configuring update intervals by path is recommended | - |
File upload requests generally do not go through GooseFS | COS | Configuring update intervals by path is recommended | - |