Reference
Limitations
Network Constraints
- Syncer needs to be able to communicate with both the upstream and downstream FE (Frontend) and BE (Backend).
- The downstream BE and upstream BE must be able to connect to each other directly through the IP used by the Doris BE process (as seen in show frontends/backends).
Start Syncer
Start Syncer according to the configurations and save a pid file in the default or specified path. The name of the pid file follows the pattern host_port.pid.
Output file structure
The file structure can be seen under the output path after compilation:
output_dir
bin
ccr_syncer
enable_db_binlog.sh
start_syncer.sh
stop_syncer.sh
db
[ccr.db] # Generated after running with the default configurations.
log
[ccr_syncer.log] # Generated after running with the default configurations.
The start_syncer.sh in the following text refers to the start_syncer.sh under its corresponding path.
Start options
--daemon
Run Syncer in the background, set to false by default.
bash bin/start_syncer.sh --daemon
--db_type
Syncer can currently use two databases to store its metadata: sqlite3 (for local storage) and mysql (for local or remote storage).
bash bin/start_syncer.sh --db_type mysql
The default value is sqlite3.
When using MySQL to store metadata, Syncer will use CREATE IF NOT EXISTS to create a database called ccr, where the metadata table related to CCR will be saved.
--db_dir
This option only takes effect when db_type is sqlite3.
It allows you to specify the name and path of the db file generated by sqlite3.
bash bin/start_syncer.sh --db_dir /path/to/ccr.db
The default path is SYNCER_OUTPUT_DIR/db and the default file name is ccr.db.
--db_host & --db_port & --db_user & --db_password
These options only take effect when db_type is mysql.
bash bin/start_syncer.sh --db_host 127.0.0.1 --db_port 3306 --db_user root --db_password "qwe123456"
The default values of db_host and db_port are shown in the example. The default values of db_user and db_password are empty.
--log_dir
Output path of the logs:
bash bin/start_syncer.sh --log_dir /path/to/ccr_syncer.log
The default path is SYNCER_OUTPUT_DIR/log and the default file name is ccr_syncer.log.
--log_level
Used to specify the output level of Syncer logs.
bash bin/start_syncer.sh --log_level info
The log format is as follows. The hooks are only printed when log_level > info:
# time level msg hooks
[2023-07-18 16:30:18] TRACE This is trace type. ccrName=xxx line=xxx
[2023-07-18 16:30:18] DEBUG This is debug type. ccrName=xxx line=xxx
[2023-07-18 16:30:18] INFO This is info type. ccrName=xxx line=xxx
[2023-07-18 16:30:18] WARN This is warn type. ccrName=xxx line=xxx
[2023-07-18 16:30:18] ERROR This is error type. ccrName=xxx line=xxx
[2023-07-18 16:30:18] FATAL This is fatal type. ccrName=xxx line=xxx
Under --daemon, the default value of log_level is info.
When running in the foreground, log_level defaults to trace, and logs are saved to log_dir using the tee command.
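When log_level is trace, the log can be noisy. The sketch below filters a log stream down to WARN, ERROR, and FATAL lines. It assumes the "[time] LEVEL msg hooks" layout shown above; the function name syncer_log_problems is hypothetical and not part of the Syncer scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Syncer): read log lines on stdin and
# keep only WARN, ERROR, and FATAL entries.
# Assumes each line starts with "[timestamp] LEVEL ".
syncer_log_problems() {
  grep -E '^\[[^]]+\] (WARN|ERROR|FATAL) '
}
```

For example, `syncer_log_problems < /path/to/ccr_syncer.log` would print only the lines at WARN level or above.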
--host && --port
Used to specify the host and port of Syncer. The host only serves to distinguish this Syncer from others in the cluster and can be understood as the Syncer's name. The full name of a Syncer in the cluster is host:port.
bash bin/start_syncer.sh --host 127.0.0.1 --port 9190
The default value of host is 127.0.0.1, and the default value of port is 9190.
--pid_dir
Used to specify the storage path of the pid file.
The pid file stores the process number of the corresponding Syncer, and the stop_syncer.sh script relies on it to stop that Syncer. To make Syncers easier to manage, you can specify the storage path of the pid file.
bash bin/start_syncer.sh --pid_dir /path/to/pids
The default value is SYNCER_OUTPUT_DIR/bin.
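To illustrate the host_port.pid naming rule, the sketch below derives the pid file path that a Syncer started with a given --pid_dir, --host, and --port would be expected to use. The function name syncer_pid_file is hypothetical; it is not part of the Syncer scripts, and the defaults simply repeat the ones documented above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of start_syncer.sh): build the expected pid
# file path from the host_port.pid naming rule.
# Arguments: pid_dir [host] [port]; host and port fall back to the documented
# defaults 127.0.0.1 and 9190.
syncer_pid_file() {
  local pid_dir="$1" host="${2:-127.0.0.1}" port="${3:-9190}"
  printf '%s/%s_%s.pid\n' "$pid_dir" "$host" "$port"
}
```

Calling `syncer_pid_file /path/to/pids 127.0.0.1 9190` prints `/path/to/pids/127.0.0.1_9190.pid`.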
Stop Syncer
Stop the Syncer according to the process number in the pid file under the default or specified path. The name of the pid file follows the pattern host_port.pid.
Stop options
Syncer can be stopped in three ways:
- Stop a single Syncer in the directory
Specify the host and port of the Syncer to be stopped. Make sure they match the host and port specified when running start_syncer.sh.
- Batch stop the specified Syncer in the directory
Specify the names of the pid files to be stopped, wrap the names in "" and separate them with spaces.
- Stop all Syncers in the directory
Follow the default configurations.
--pid_dir
Specify the directory where the pid file is located. The above three stopping methods all depend on the directory where the pid file is located for execution.
bash bin/stop_syncer.sh --pid_dir /path/to/pids
The effect of the above example is to stop the Syncers corresponding to all pid files under /path/to/pids (method 3). --pid_dir can be used in combination with any of the three stopping methods above.
The default value is SYNCER_OUTPUT_DIR/bin.
--host && --port
Stop the Syncer corresponding to host:port in the pid_dir path.
bash bin/stop_syncer.sh --host 127.0.0.1 --port 9190
The default value of host is 127.0.0.1, and the default value of port is empty. That is, specifying the host alone will degrade method 1 to method 3. Method 1 will only take effect when neither the host nor the port is empty.
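The fallback rule above can be sketched as a small decision function. This is only an illustration of the documented behavior, not the actual logic of stop_syncer.sh; the function name stop_mode is hypothetical, and the sketch ignores --files.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the documented --host/--port rule.
# Arguments: [host] [port]; host defaults to 127.0.0.1, port defaults to empty.
stop_mode() {
  local host="${1:-127.0.0.1}" port="${2:-}"
  if [ -n "$host" ] && [ -n "$port" ]; then
    # Method 1: stop the single Syncer identified by host and port.
    echo "single ${host}_${port}.pid"
  else
    # Method 3: no port given, so every pid file in pid_dir is targeted.
    echo "all"
  fi
}
```

With these assumptions, `stop_mode 127.0.0.1 9190` reports a single pid file, while `stop_mode 127.0.0.1` reports `all`.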
--files
Stop the Syncer corresponding to the specified pid file name in the pid_dir path.
bash bin/stop_syncer.sh --files "127.0.0.1_9190.pid 127.0.0.1_9191.pid"
The file names should be wrapped in " " and separated with spaces.
Syncer operations
Template for requests
curl -X POST -H "Content-Type: application/json" -d {json_body} http://ccr_syncer_host:ccr_syncer_port/operator
json_body: send operation information in JSON format
operator: different operations for Syncer
The interface returns JSON. If the request succeeds, the "success" field is true. If there is an error, it is false, and the response also carries an error message field (error_msg in the example below).
{"success":true}
or
{"success":false,"error_msg":"job ccr_test not exist"}
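A script that calls these interfaces usually needs to branch on the "success" field. The sketch below does this with a plain pattern match so that it has no extra dependencies; a real JSON parser such as jq would be more robust. The function name ccr_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 only when the response text passed as the first
# argument reports "success": true. This is a pattern match, not a JSON parser.
ccr_ok() {
  printf '%s' "$1" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}
```

A caller could capture a response with `resp="$(curl -s ...)"` and then run `ccr_ok "$resp" || echo "request failed: $resp"`.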
Create Job
curl -X POST -H "Content-Type: application/json" -d '{
"name": "ccr_test",
"src": {
"host": "localhost",
"port": "9030",
"thrift_port": "9020",
"user": "root",
"password": "",
"database": "demo",
"table": "example_tbl"
},
"dest": {
"host": "localhost",
"port": "9030",
"thrift_port": "9020",
"user": "root",
"password": "",
"database": "ccrt",
"table": "copy"
}
}' http://127.0.0.1:9190/create_ccr
- name: the name of the CCR synchronization task, should be unique
- host, port: correspond to the host and mysql (jdbc) port of the cluster's master
- thrift_port: corresponds to the rpc_port of the FE
- user, password: the credentials used by the Syncer to initiate transactions, fetch data, etc.
- database, table:
- If it is a database-level synchronization, fill in the database name and leave the table name empty.
- If it is a table-level synchronization, specify both the database name and the table name.
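For a database-level job, the only change to the request body is that the table fields are left empty. The sketch below prints such a body. The function name db_level_body is hypothetical, the host, port, and credential values simply repeat the placeholders from the example above, and the function does no JSON escaping, so it assumes names without quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a create_ccr request body for database-level
# synchronization (both "table" fields are empty).
# Arguments: job_name source_database target_database
db_level_body() {
  local name="$1" src_db="$2" dest_db="$3"
  cat <<EOF
{
  "name": "${name}",
  "src": {"host": "localhost", "port": "9030", "thrift_port": "9020", "user": "root", "password": "", "database": "${src_db}", "table": ""},
  "dest": {"host": "localhost", "port": "9030", "thrift_port": "9020", "user": "root", "password": "", "database": "${dest_db}", "table": ""}
}
EOF
}
```

The body can then be passed to curl, for example `curl -X POST -H "Content-Type: application/json" -d "$(db_level_body ccr_test demo ccrt)" http://127.0.0.1:9190/create_ccr`.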
Get Synchronization Lag
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/get_lag
The job_name is the name specified when calling create_ccr.
Pause Job
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/pause
Resume Job
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/resume
Delete Job
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/delete
Display Version
curl http://ccr_syncer_host:ccr_syncer_port/version
# > return
{"version": "2.0.1"}
View Job Status
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/job_status
{
"success": true,
"status": {
"name": "ccr_db_table_alias",
"state": "running",
"progress_state": "TableIncrementalSync"
}
}
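When monitoring a job from a script, it can be handy to pull just the "state" value out of this response. The sketch below does so with sed, assuming the response layout shown above; the function name job_state is hypothetical, and a JSON parser such as jq would be the sturdier choice where available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a job_status response on stdin and print the value
# of the "state" field. The pattern requires a quote directly before "state",
# so "progress_state" is not matched.
job_state() {
  sed -n -E 's/.*"state"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/p' | head -n 1
}
```

For the sample response above, piping it into `job_state` prints `running`.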
Desynchronize Job
Stop synchronizing the job. Afterwards, users can swap the source and target clusters.
curl -X POST -H "Content-Type: application/json" -d '{
"name": "job_name"
}' http://ccr_syncer_host:ccr_syncer_port/desync
List All Jobs
curl http://ccr_syncer_host:ccr_syncer_port/list_jobs
{"success":true,"jobs":["ccr_db_table_alias"]}
Open binlog for all tables in the database
Usage
bash bin/enable_db_binlog.sh -h host -p port -u user -P password -d db
High availability of Syncer
The high availability of Syncer relies on MySQL. If MySQL is used as the backend storage, the Syncer can discover other Syncers. If one Syncer crashes, the others will take over its tasks.
Privilege requirements
- select_priv: read-only privileges for databases and tables
- load_priv: write privileges for databases and tables, including load, insert, delete, etc.
- alter_priv: privilege to modify databases and tables, including renaming databases/tables, adding/deleting/changing columns, adding/deleting partitions, etc.
- create_priv: privilege to create databases, tables, and views
- drop_priv: privilege to drop databases, tables, and views
Admin privileges are also required (we are planning to remove this requirement in future versions). They are used to check the enable binlog config.
Feature
Rate limit
BE-side configuration parameter
download_binlog_rate_limit_kbs=1024 # Limits the download speed of Binlog (including Local Snapshot) from the source cluster to 1 MB/s in a single BE node
- The download_binlog_rate_limit_kbs parameter is configured on the BE nodes of the source cluster. Setting this parameter limits the data pull rate.
- The download_binlog_rate_limit_kbs parameter controls the data transfer speed of each single BE node. To calculate the overall cluster rate, multiply the parameter value by the number of nodes in the cluster.
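The multiplication described above can be written as a one-line calculation. In this sketch the function name cluster_rate_limit_kbs and the node count of 3 are made up for illustration; the result is the aggregate upper bound in KB/s.

```shell
#!/usr/bin/env bash
# Hypothetical helper: aggregate rate limit in KB/s, computed as the per-BE
# value of download_binlog_rate_limit_kbs times the number of BE nodes.
cluster_rate_limit_kbs() {
  local per_be_kbs="$1" be_nodes="$2"
  echo $(( per_be_kbs * be_nodes ))
}
```

For example, `cluster_rate_limit_kbs 1024 3` prints `3072`, that is, about 3 MB/s across three BE nodes that are each limited to 1 MB/s.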
IS_BEING_SYNCED
Doris v2.0 "is_being_synced" = "true"
During data synchronization using CCR, replica tables (referred to as target tables) are created in the target cluster for the tables within the synchronization scope of the source cluster (referred to as source tables). However, certain functionalities and attributes need to be disabled or cleared when creating replica tables to ensure the correctness of the synchronization process. For example:
- The source tables may contain information that is not synchronized to the target cluster, such as storage_policy, which may cause the creation of the target table to fail or result in abnormal behavior.
- The source tables may have dynamic functionalities, such as dynamic partitioning, which can lead to uncontrolled behavior in the target table and result in inconsistent partitions.
The attributes that need to be cleared during replication are:
storage_policy
colocate_with
The functionalities that need to be disabled during synchronization are:
- Automatic bucketing
- Dynamic partitioning
Implementation
When creating the target table, the Syncer controls the addition or deletion of the is_being_synced property. In CCR, there are two approaches to creating a target table:
- During table synchronization, the Syncer performs a full copy of the source table using backup/restore to obtain the target table.
- During database synchronization, for existing tables, the Syncer also uses backup/restore to obtain the target table. For incremental tables, the Syncer creates the target table using the CreateTableRecord binlog.
Therefore, there are two entry points for inserting the is_being_synced property: the restore process during full synchronization and the getDdlStmt method during incremental synchronization.
During the restore process of full synchronization, the Syncer initiates a restore operation of the snapshot from the source cluster via RPC. During this process, the is_being_synced property is added to the RestoreStmt and takes effect in the final restoreJob, which executes the relevant logic for is_being_synced.
During incremental synchronization, a boolean getDdlForSync parameter is added to the getDdlStmt method to indicate whether the DDL is being generated as a controlled transformation for the target table, and the relevant isBeingSynced logic is executed when the target table is created.
Regarding the disabling of the functionalities mentioned above:
- Automatic bucketing: Automatic bucketing takes effect when a table is created and calculates the appropriate number of buckets. This may result in a mismatch in the number of buckets between the source and target tables. Therefore, during synchronization, the number of buckets is obtained from the source table, along with whether the source table is an automatic bucketing table, so that the functionality can be restored after synchronization. The current approach is to default the autobucket attribute to false when retrieving distribution information. During table restoration, the _auto_bucket attribute is checked to find out whether the source table is an automatic bucketing table. If it is, the target table's autobucket field is set to true, which bypasses the bucket number calculation and directly applies the number of buckets from the source table to the target table.
- Dynamic partitioning: This is implemented by adding olapTable.isBeingSynced() to the condition for executing add/drop partition operations. This ensures that the target table does not perform periodic add/drop partition operations during synchronization.
Note
The is_being_synced property should be fully controlled by the Syncer, and users should not modify this property manually unless there are exceptional circumstances.