This document describes how to deploy, configure, and run GooseFS on a single server, in a cluster, and in a Tencent Cloud EMR cluster that is not integrated with GooseFS.
Currently, GooseFS can run on Linux or macOS on a mainstream x86/x64 architecture (note: it has not been verified on M1-powered macOS). Configuration requirements for nodes are described as follows:
In most cases, you are advised to use a dedicated Linux account to deploy and run GooseFS. For example, in the self-built cluster and EMR environment, you can use the Hadoop user to deploy and run GooseFS. If batch deployment is used, the following permissions are also needed:
The pseudo-distributed mode is mainly used for trying out and debugging GooseFS. Beginners can try out and debug GooseFS on a host running Linux or macOS.
Download the GooseFS binary distribution package.
Decompress the package, go to the GooseFS directory, and perform the operations below.
Create the conf/goosefs-site.properties configuration file by copying the template:
$ cp conf/goosefs-site.properties.template conf/goosefs-site.properties
Set goosefs.master.mount.table.root.ufs to a directory in the local file system.
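For reference, a minimal sketch of conf/goosefs-site.properties for pseudo-distributed mode; the local path below is a hypothetical example, not a required value:

```properties
# Hypothetical values for a single-host setup — adjust to your environment.
goosefs.master.hostname=localhost
# Local directory used as the root UFS; any writable local path works.
goosefs.master.mount.table.root.ufs=/tmp/goosefs-underfs
```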
You are advised to configure passwordless SSH login to localhost. Otherwise, you will need to enter the login password for operations such as formatting and starting the cluster.
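If passwordless SSH is not yet set up, a common sketch (assuming OpenSSH with default key paths) is:

```shell
# Create the SSH directory if needed
mkdir -p ~/.ssh
# Generate a key pair without a passphrase, unless one already exists
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize the key for logins to localhost
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

You can then confirm that `ssh localhost` completes without prompting for a password.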
Run the following command to mount a RamFS file system:
$ ./bin/goosefs-mount.sh SudoMount
You can also mount it when you start the GooseFS cluster as follows:
$ ./bin/goosefs-start.sh local SudoMount
After the cluster is started, run the jps command to view all GooseFS processes in pseudo-distributed mode:
$ jps
35990 Jps
35832 GooseFSSecondaryMaster
35736 GooseFSMaster
35881 GooseFSWorker
35834 GooseFSJobMaster
35883 GooseFSJobWorker
35885 GooseFSProxy
After this, you can run the goosefs command to perform operations related to namespaces, file systems, jobs, and tables. For example, you can run the following commands to upload a local file to GooseFS and list the files and directories in the GooseFS root directory:
$ goosefs fs copyFromLocal test.txt /
Copied file:///Users/goosefs/test.txt to /
$ goosefs fs ls /
-rw-r--r-- goosefs staff 0 PERSISTED 04-28-2021 04:00:35:156 100% /test.txt
The GooseFS CLI tool allows you to perform all kinds of operations on namespaces, tables, jobs, file systems, and more to manage and access GooseFS. For more information, see our documentation or run the goosefs -h command to view the help messages.
Cluster deployment and running are mainly intended for production environments: self-built IDC clusters and Tencent Cloud EMR clusters that are not integrated with GooseFS. Deployments are classified into standalone deployment and high-availability (HA) deployment.
In the scripts directory, you can find scripts for configuring passwordless SSH logins and deploying GooseFS in batches, which makes it easy to deploy a large-scale GooseFS cluster. Check the batch deployment requirements mentioned above to see whether batch deployment is applicable to you.
GooseFS provides scripts in the scripts directory for configuring passwordless SSH logins and deploying GooseFS in batches. If the execution conditions are met, you can perform the following steps to deploy GooseFS in batches:
List all worker nodes in conf/workers. In addition, complete all configurations for the final production environment.
Go to the scripts directory and configure the install.properties configuration file. After this, switch to the root account or use sudo to run config_ssh.sh, so that you can configure passwordless logins for the entire cluster.
Run the validate_env.sh tool to validate the configuration of the cluster.
Run the install.sh script as the root account or using sudo to start the installation, and wait for it to complete.
After a successful deployment, run ./bin/goosefs-start.sh all SudoMount to start the entire cluster. By default, all running logs are recorded in the logs directory under the installation directory.
In the standalone framework, only one master node and multiple worker nodes are deployed in the cluster. You can deploy and run the cluster as follows:
Run the tar zxvf goosefs-x.x.x-bin.tar.gz command to decompress the package into the installation directory. You can see Introduction to the batch deployment tool to deploy and run your cluster in batches, or perform the following steps to deploy it manually.
(1) Copy the template file from the conf directory to create a configuration file.
$ cp conf/goosefs-site.properties.template conf/goosefs-site.properties
Configure goosefs-site.properties as follows: set goosefs.master.hostname to the hostname or IP of the master node, and set goosefs.master.mount.table.root.ufs to the URI of the under file system (UFS) mounted to the GooseFS root directory. Note that this URI must be accessible to both the master and worker nodes.
For example, you can mount a COS path to the root directory of GooseFS with a cosn:// URI.
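For instance, the relevant lines in conf/goosefs-site.properties might look like the following sketch; the hostname and bucket name are placeholders, and COS access credentials must also be configured separately:

```properties
# Placeholder hostname and bucket — replace with your own values.
goosefs.master.hostname=cvm1.compute-1.myqcloud.com
goosefs.master.mount.table.root.ufs=cosn://example-bucket-1250000000/goosefs-root
```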
In the conf/masters configuration file, specify the hostname or IP of the master node as follows:
# The multi-master Zookeeper HA mode requires that all the masters can access
# the same journal through a shared medium (e.g. HDFS or NFS).
# localhost
cvm1.compute-1.myqcloud.com
In the conf/workers configuration file, specify the hostnames or IPs of all worker nodes as follows:
# A GooseFS Worker will be started on each of the machines listed below.
# localhost
cvm2.compute-2.myqcloud.com
cvm3.compute-3.myqcloud.com
After the configuration is completed, run ./bin/goosefs copyDir conf/ to sync the configurations to all nodes, and then run ./bin/goosefs-start.sh all to start the GooseFS cluster.
The standalone framework that uses only one master node might lead to a single point of failure (SPOF). Therefore, you are advised to deploy multiple master nodes in the production environment to adopt an HA framework. One of the master nodes will become the leader node that provides services, while other standby nodes will share journals synchronously to maintain the same state as the leader node. If the leader node fails, one of the standby nodes will automatically replace the leader node to continue providing services. In this way, you can avoid SPOFs and make the framework more highly available.
Currently, GooseFS supports using Raft logs or ZooKeeper to ensure the strong consistency of the leader and standby nodes. The deployment of each mode is described below.
First, create a configuration file using a template.
$ cp conf/goosefs-site.properties.template conf/goosefs-site.properties
goosefs.master.hostname=<MASTER_HOSTNAME>
goosefs.master.mount.table.root.ufs=<STORAGE_URI>
goosefs.master.embedded.journal.addresses=<EMBEDDED_JOURNAL_ADDRESSES>
The configuration items are described as follows:
Set goosefs.master.hostname to the IP or hostname of the master node. Ensure that it can be accessed by the clients and the worker nodes.
Set goosefs.master.mount.table.root.ufs to the underlying storage URI that is mounted to the GooseFS root directory.
Set goosefs.master.embedded.journal.addresses to the host:embedded_journal_port addresses of all master nodes. If the port is omitted, the default embedded journal port is used.
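Putting these items together, a three-master configuration could look like the following sketch; the hostnames are placeholders, and the journal port 9202 is an assumption — substitute the embedded journal port configured in your cluster:

```properties
# Hypothetical three-master HA setup (hostnames and port are placeholders).
# On each master node, set goosefs.master.hostname to that node's own hostname.
goosefs.master.hostname=master1.example.com
goosefs.master.mount.table.root.ufs=cosn://example-bucket-1250000000/goosefs-root
goosefs.master.embedded.journal.addresses=master1.example.com:9202,master2.example.com:9202,master3.example.com:9202
```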
The deployment based on Raft embedded journals uses Copycat for leader election. Therefore, if you use Raft for an HA framework, do not mix it with ZooKeeper.
After the configurations are completed, run the following command to sync all configurations:
$ ./bin/goosefs copyDir conf/
Format and then start the GooseFS cluster:
$ ./bin/goosefs format
$ ./bin/goosefs-start.sh all
Run the following command to view the current leader node:
$ ./bin/goosefs fs leader
Run the following command to view the cluster status:
$ ./bin/goosefs fsadmin report
To set up an HA GooseFS framework based on ZooKeeper, you need a ZooKeeper cluster and a shared storage system that all master nodes can access (for example, an HDFS path or a COS path such as cosn://bucket-1250000000/journal) as the shared journal storage.
The configurations are as follows:
goosefs.zookeeper.enabled=true
goosefs.zookeeper.address=<ZOOKEEPER_ADDRESS>
goosefs.master.journal.type=UFS
goosefs.master.journal.folder=<JOURNAL_URI>
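A filled-in sketch, with placeholder ZooKeeper addresses and the example journal path from above:

```properties
# Placeholders — replace with your ZooKeeper quorum and shared journal URI.
goosefs.zookeeper.enabled=true
goosefs.zookeeper.address=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
goosefs.master.journal.type=UFS
goosefs.master.journal.folder=cosn://bucket-1250000000/journal
```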
After this, use ./bin/goosefs copyDir conf/ to sync the configurations to all nodes in the cluster, and use ./bin/goosefs-start.sh all to start the cluster.
After the goosefs-start.sh all script is executed and GooseFS is running, the following processes will run in the cluster:
| Process | Default ports |
| --- | --- |
| GooseFSMaster | RPC port: 9200; web port: 9201 |
| GooseFSWorker | RPC port: 9203; web port: 9204 |
| GooseFSJobMaster | RPC port: 9205; web port: 9206 |
| GooseFSJobWorker | RPC port: 9208; web port: 9210 |
| GooseFSProxy | Web port: 9211 |