Upgrading Cluster
Doris provides the capability for rolling upgrades, enabling step-by-step upgrades of FE and BE nodes, minimizing downtime, and ensuring the system remains operational during the upgrade process.
Version Compatibility
Doris versioning consists of three components: the first digit represents a major milestone version, the second digit indicates a feature version, and the third digit corresponds to a bug fix version. New features are not introduced in bug fix versions. For example, in Doris version 2.1.3, "2" indicates the second milestone version, "1" represents the feature version under this milestone, and "3" denotes the third bug fix for this feature version.
During version upgrades, the following rules apply:
- Three-digit versions: Versions with the same first two digits can be upgraded directly across three-digit versions. For example, version 2.1.3 can be directly upgraded to version 2.1.7.
- Two-digit and one-digit versions: Cross-version upgrades of two-digit versions are not recommended due to compatibility concerns. It is advised to upgrade sequentially through each two-digit version. For example, upgrading from version 3.0 to 3.3 should follow the sequence 3.0 -> 3.1 -> 3.2 -> 3.3.
The detailed version information can be found in the versioning rules.
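Before planning an upgrade path, it helps to confirm which version each node is currently running. A minimal sketch using the MySQL client; the host, port, and credentials are placeholders to adapt to your environment:

```shell
# Both `show frontends` and `show backends` expose a Version column
# that reports the exact version string each node is running.
mysql -uroot -P9030 -h127.0.0.1 -e 'show frontends\G' | grep -E 'Host|Version'
mysql -uroot -P9030 -h127.0.0.1 -e 'show backends\G'  | grep -E 'Host|Version'
```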
Upgrade Precautions
When performing an upgrade, pay attention to the following:
- Behavioral changes between versions: Review the Release Notes before upgrading to identify compatibility issues.
- Add retry mechanisms for tasks in the cluster: Nodes are restarted sequentially during upgrades. Ensure that retry mechanisms are in place for query tasks and Stream Load import jobs to avoid task failures (see the sketch after this list). Routine Load jobs, as well as imports that use flink-doris-connector or spark-doris-connector, already include retry mechanisms in their code and do not require additional logic.
- Disable replica repair and balance functions: Disable these functions during the upgrade process. Regardless of the upgrade outcome, re-enable them after the upgrade is complete.
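As an illustration of a client-side retry for Stream Load, here is a minimal sketch. It is not part of the official tooling; the database, table, file path, and FE address are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Retry a Stream Load so that a BE/FE restart during a rolling upgrade
# does not permanently fail the import.
FE_HOST=127.0.0.1      # placeholder FE host
HTTP_PORT=8030         # placeholder FE http_port
MAX_RETRIES=3

for i in $(seq 1 "$MAX_RETRIES"); do
    # Use a fresh label per attempt; Doris rejects duplicate labels.
    if curl --location-trusted -u root: \
        -H "label:example_load_$(date +%s)_${i}" \
        -H "column_separator:," \
        -T /path/to/data.csv \
        "http://${FE_HOST}:${HTTP_PORT}/api/example_db/example_tbl/_stream_load"; then
        echo "stream load attempt ${i} succeeded"
        break
    fi
    echo "stream load attempt ${i} failed, retrying..."
    sleep 5
done
```

Note that curl's exit code only catches connection-level failures (for example, while a node restarts); production code should also parse the `Status` field in the JSON response that Stream Load returns.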
Metadata Compatibility Testing
In a production environment, it is recommended to configure at least three FE nodes for high availability. If there is only one FE node, metadata compatibility testing must be performed before upgrading. Metadata compatibility is critical as incompatibility may cause upgrade failures and data loss. It is recommended to conduct metadata compatibility tests before each upgrade, keeping in mind the following:
- Perform metadata compatibility testing on a development machine or a BE node whenever possible; avoid using FE nodes.
- If testing must be conducted on an FE node, use a non-Master node and stop the original FE process.

Before upgrading, conduct metadata compatibility testing as follows to prevent failures caused by metadata incompatibility.
- Back up metadata information:

  Before starting the upgrade, back up the metadata of the Master FE node. Use the `show frontends` command and refer to the `IsMaster` column to identify the Master FE node. FE metadata can be hot-backed up without stopping the FE node. By default, FE metadata is stored in the `fe/doris-meta` directory; this can be confirmed via the `meta_dir` parameter in the `fe.conf` configuration file.
- Modify the `fe.conf` configuration file of the test FE node:

  ```shell
  vi ${DORIS_NEW_HOME}/conf/fe.conf
  ```

  Modify the following port information, ensuring all ports are different from those in the production environment, and update the `clusterId` parameter:

  ```
  ...
  ## modify port
  http_port = 18030
  rpc_port = 19020
  query_port = 19030
  arrow_flight_sql_port = 19040
  edit_log_port = 19010
  ## modify clusterId
  clusterId=<a_new_clusterId, such as 123456>
  ...
  ```
- Copy the backed-up Master FE metadata to the new compatibility testing environment:

  ```shell
  cp -r ${DORIS_OLD_HOME}/fe/doris-meta/* ${DORIS_NEW_HOME}/fe/doris-meta
  ```
- Edit the `VERSION` file in the copied metadata directory to update the `cluster_id` to a new cluster ID, for example `123456`, as in the example above:

  ```shell
  vi ${DORIS_NEW_HOME}/fe/doris-meta/image/VERSION
  ```

  ```
  clusterId=123456
  ```
- Start the FE process in the testing environment:

  ```shell
  sh ${DORIS_NEW_HOME}/bin/start_fe.sh --daemon --metadata_failure_recovery
  ```

  For versions earlier than 2.0.2, add the `metadata_failure_recovery` parameter to the `fe.conf` file before starting the FE process:

  ```shell
  echo "metadata_failure_recovery=true" >> ${DORIS_NEW_HOME}/conf/fe.conf
  sh ${DORIS_NEW_HOME}/bin/start_fe.sh --daemon
  ```
- Verify that the FE process has started successfully by connecting to it with the MySQL client, using the query port `19030` configured above:

  ```shell
  mysql -uroot -P19030 -h127.0.0.1
  ```
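The steps above can be strung together into a single script. This is a minimal sketch under the assumptions used above (default `fe/doris-meta` layout, the placeholder ports, and cluster ID `123456`); adapt the paths before running:

```shell
#!/usr/bin/env bash
set -euo pipefail

DORIS_OLD_HOME=/path/to/old/doris   # placeholder: production deployment
DORIS_NEW_HOME=/path/to/new/doris   # placeholder: test deployment

# 1. Copy the hot-backed-up Master FE metadata into the test environment.
mkdir -p "${DORIS_NEW_HOME}/fe/doris-meta"
cp -r "${DORIS_OLD_HOME}/fe/doris-meta/." "${DORIS_NEW_HOME}/fe/doris-meta/"

# 2. Point the test FE at non-production ports (values from the example
#    above). Appending only works if these keys are not already set in
#    fe.conf; otherwise edit the file directly.
cat >> "${DORIS_NEW_HOME}/conf/fe.conf" <<'EOF'
http_port = 18030
rpc_port = 19020
query_port = 19030
arrow_flight_sql_port = 19040
edit_log_port = 19010
EOF

# 3. Give the copied image a new cluster ID so it cannot join production.
sed -i 's/^clusterId=.*/clusterId=123456/' \
    "${DORIS_NEW_HOME}/fe/doris-meta/image/VERSION"

# 4. Start the test FE in metadata-failure-recovery mode and check that
#    it answers on the test query port.
sh "${DORIS_NEW_HOME}/bin/start_fe.sh" --daemon --metadata_failure_recovery
sleep 30
mysql -uroot -P19030 -h127.0.0.1 -e 'show frontends\G'
```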
Upgrade Steps
The detailed process for the upgrade is as follows:
- Disable replica repair and balance functions
- Upgrade BE nodes
- Upgrade FE nodes
- Enable replica repair and balance functions
During the upgrade, follow the principle of upgrading BE nodes first, then FE nodes. When upgrading FE nodes, upgrade the Observer and Follower FE nodes first, and upgrade the Master FE node last.
In general, only the `/bin` and `/lib` directories under the FE directory and the `/bin` and `/lib` directories under the BE directory need to be upgraded.
For versions 2.0.2 and later, a `custom_lib/` directory has been added under the FE and BE deployment paths (if it doesn't exist, it can be created manually). The `custom_lib/` directory stores user-defined third-party jar files, such as `hadoop-lzo-*.jar` or `orai18n.jar`. This directory does not need to be replaced during the upgrade.
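For example, to make sure the directory exists on a node before upgrading (a trivial sketch, using the same `${DORIS_OLD_HOME}` variable as the steps below):

```shell
# custom_lib/ holds user-supplied jars and is left untouched by the upgrade.
mkdir -p "${DORIS_OLD_HOME}/fe/custom_lib" "${DORIS_OLD_HOME}/be/custom_lib"
```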
Step 1: Disable Replica Repair and Balance Functions
During the upgrade process, nodes will be restarted, which may trigger unnecessary cluster balancing and replica repair logic. Disable these functions first using the following commands:

```sql
admin set frontend config("disable_balance" = "true");
admin set frontend config("disable_colocate_balance" = "true");
admin set frontend config("disable_tablet_scheduler" = "true");
```
Step 2: Upgrade BE Nodes
To ensure data safety, use three replicas to store your data to avoid data loss caused by upgrade mistakes or failures.
- In a multi-replica cluster, you can stop one BE node at a time and upgrade gradually:

  ```shell
  sh ${DORIS_OLD_HOME}/be/bin/stop_be.sh
  ```
- Rename the `/bin` and `/lib` directories in the BE directory:

  ```shell
  mv ${DORIS_OLD_HOME}/be/bin ${DORIS_OLD_HOME}/be/bin_back
  mv ${DORIS_OLD_HOME}/be/lib ${DORIS_OLD_HOME}/be/lib_back
  ```
- Copy the new version's `/bin` and `/lib` directories to the original BE directory:

  ```shell
  cp -r ${DORIS_NEW_HOME}/be/bin ${DORIS_OLD_HOME}/be/bin
  cp -r ${DORIS_NEW_HOME}/be/lib ${DORIS_OLD_HOME}/be/lib
  ```
- Start the BE node:

  ```shell
  sh ${DORIS_OLD_HOME}/be/bin/start_be.sh --daemon
  ```
- Connect to the cluster and check the node information:

  ```sql
  show backends\G
  ```

  If the BE node's `Alive` status is `true` and the `Version` value is the new version, the node has been upgraded successfully. For convenience, these steps are consolidated into a single sketch below.
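A minimal consolidated sketch of the BE upgrade for one node, assuming the `${DORIS_OLD_HOME}`/`${DORIS_NEW_HOME}` layout used above; this is an illustration, not an official script:

```shell
#!/usr/bin/env bash
set -euo pipefail

DORIS_OLD_HOME=/path/to/old/doris   # placeholder: current deployment
DORIS_NEW_HOME=/path/to/new/doris   # placeholder: unpacked new version

# Stop this BE, swap in the new bin/ and lib/, keep the old ones as *_back.
sh "${DORIS_OLD_HOME}/be/bin/stop_be.sh"
mv "${DORIS_OLD_HOME}/be/bin" "${DORIS_OLD_HOME}/be/bin_back"
mv "${DORIS_OLD_HOME}/be/lib" "${DORIS_OLD_HOME}/be/lib_back"
cp -r "${DORIS_NEW_HOME}/be/bin" "${DORIS_OLD_HOME}/be/bin"
cp -r "${DORIS_NEW_HOME}/be/lib" "${DORIS_OLD_HOME}/be/lib"
sh "${DORIS_OLD_HOME}/be/bin/start_be.sh" --daemon

# Wait for the node to re-register, then verify Alive and Version by hand:
#   mysql -uroot -P9030 -h127.0.0.1 -e 'show backends\G'
```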
Step 3: Upgrade FE Nodes
- In a multi-FE node setup, select a non-Master node to upgrade and stop it first:

  ```shell
  sh ${DORIS_OLD_HOME}/fe/bin/stop_fe.sh
  ```
- Rename the `/bin`, `/lib`, and `/mysql_ssl_default_certificate` directories in the FE directory:

  ```shell
  mv ${DORIS_OLD_HOME}/fe/bin ${DORIS_OLD_HOME}/fe/bin_back
  mv ${DORIS_OLD_HOME}/fe/lib ${DORIS_OLD_HOME}/fe/lib_back
  mv ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate_back
  ```
- Copy the new version's `/bin`, `/lib`, and `/mysql_ssl_default_certificate` directories to the original FE directory:

  ```shell
  cp -r ${DORIS_NEW_HOME}/fe/bin ${DORIS_OLD_HOME}/fe/bin
  cp -r ${DORIS_NEW_HOME}/fe/lib ${DORIS_OLD_HOME}/fe/lib
  cp -r ${DORIS_NEW_HOME}/fe/mysql_ssl_default_certificate ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate
  ```
- Start the FE node:

  ```shell
  sh ${DORIS_OLD_HOME}/fe/bin/start_fe.sh --daemon
  ```
- Connect to the cluster and check the node information:

  ```sql
  show frontends\G
  ```

  If the FE node's `Alive` status is `true` and the `Version` value is the new version, the node has been upgraded successfully.
- Complete the upgrade of the remaining FE nodes in sequence, upgrading the Master node last. The per-node steps above are consolidated in the sketch below.
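A minimal consolidated sketch for one FE node, under the same layout assumptions as the BE sketch above:

```shell
#!/usr/bin/env bash
set -euo pipefail

DORIS_OLD_HOME=/path/to/old/doris   # placeholder: current deployment
DORIS_NEW_HOME=/path/to/new/doris   # placeholder: unpacked new version

# Stop this FE, then swap in the new bin/, lib/, and certificate
# directories, keeping the old ones as *_back.
sh "${DORIS_OLD_HOME}/fe/bin/stop_fe.sh"
for d in bin lib mysql_ssl_default_certificate; do
    mv "${DORIS_OLD_HOME}/fe/${d}" "${DORIS_OLD_HOME}/fe/${d}_back"
    cp -r "${DORIS_NEW_HOME}/fe/${d}" "${DORIS_OLD_HOME}/fe/${d}"
done
sh "${DORIS_OLD_HOME}/fe/bin/start_fe.sh" --daemon

# Then verify Alive and Version:
#   mysql -uroot -P9030 -h127.0.0.1 -e 'show frontends\G'
```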
Step 4: Enable Replica Repair and Balance Functions
After the upgrade is complete and all BE nodes' `Alive` status is `true`, enable the cluster's replica repair and balance functions:

```sql
admin set frontend config("disable_balance" = "false");
admin set frontend config("disable_colocate_balance" = "false");
admin set frontend config("disable_tablet_scheduler" = "false");
```