Exporting Overview
Export
is a feature provided by Doris that allows for the asynchronous export of data. This feature allows the user to export the data of specified tables or partitions in a specified file format through the Broker process or S3 protocol/ HDFS protocol, to remote storage such as object storage or HDFS.
Currently, EXPORT
supports exporting Doris local tables / views / external tables and supports exporting to file formats including parquet, orc, csv, csv_with_names, and csv_with_names_and_types.
This document mainly introduces the basic principles, usage, best practices and precautions of Export.
Principles
After a user submits an Export Job
, Doris will calculate all the Tablets involved in this job. Then, based on the parallelism
parameter specified by the user, these tablets will be grouped. Each thread is responsible for a group of tablets, generating multiple SELECT INTO OUTFILE
query plans. The query plan will read the data from the included tablets and then write the data to the specified path in remote storage through S3 protocol/ HDFS protocol/ Broker.
The overall execution process is as follows:
-
The user submits an Export job to FE.
-
FE calculates all the tablets to be exported and groups them based on the
parallelism
parameter. Each group generates multipleSELECT INTO OUTFILE
query plans based on themaximum_number_of_export_partitions
parameter. -
Based on the parallelism parameter, an equal number of
ExportTaskExecutor
are generated, and eachExportTaskExecutor
is responsible for a thread, which is scheduled and executed by FE'sJob scheduler
framework. -
FE's
Job scheduler
schedules and executes theExportTaskExecutor
, and eachExportTaskExecutor
serially executes the multipleSELECT INTO OUTFILE
query plans it is responsible for.
Start Export
For detailed usage of Export, please refer to EXPORT.
Export's detailed commands can be passed through HELP EXPORT;
in mysql client. Examples are as follows:
Export to HDFS
EXPORT TABLE db1.tbl1
PARTITION (p1,p2)
[WHERE [expr]]
TO "hdfs://host/path/to/export/"
PROPERTIES
(
"label" = "mylabel",
"column_separator"=",",
"columns" = "col1,col2",
"parallelusm" = "3"
)
WITH BROKER "hdfs"
(
"username" = "user",
"password" = "passwd"
);
-
label
: The identifier of this export job. You can use this identifier to view the job status later. -
column_separator
: Column separator. The default is\t
. Supports invisible characters, such as'\x07'. -
column
: columns to be exported, separated by commas, if this parameter is not filled in, all columns of the table will be exported by default. -
line_delimiter
: Line separator. The default is\n
. Supports invisible characters, such as'\x07'. -
parallelusm
:Exporting with 3 concurrent threads.
Export to Object Storage (Supports S3 Protocol)
EXPORT TABLE test TO "s3://bucket/path/to/export/dir/"
WITH S3 (
"s3.endpoint" = "http://host",
"s3.access_key" = "AK",
"s3.secret_key"="SK",
"s3.region" = "region"
);
-
s3.access_key
/s3.secret_key
:Is your key to access the object storage API. -
s3.endpoint
:Endpoint indicates the access domain name of object storage external services. -
s3.region
:Region indicates the region where the object storage data center is located.
View Export Status
After submitting a job, the job status can be viewed by querying the SHOW EXPORT command. The results are as follows:
mysql> show EXPORT\G;
*************************** 1. row ***************************
JobId: 14008
State: FINISHED
Progress: 100%
TaskInfo: {"partitions":[],"max_file_size":"","delete_existing_files":"","columns":"","format":"csv","column_separator":"\t","line_delimiter":"\n","db":"default_cluster:demo","tbl":"student4","tablet_num":30}
Path: hdfs://host/path/to/export/
CreateTime: 2019-06-25 17:08:24
StartTime: 2019-06-25 17:08:28
FinishTime: 2019-06-25 17:08:34
Timeout: 3600
ErrorMsg: NULL
OutfileInfo: [
[
{
"fileNumber": "1",
"totalRows": "4",
"fileSize": "34bytes",
"url": "file:///127.0.0.1/Users/fangtiewei/tmp_data/export/f1ab7dcc31744152-bbb4cda2f5c88eac_"
}
]
]
1 row in set (0.01 sec)
-
JobId: The unique ID of the job
-
State: Job status:
-
PENDING: Jobs to be Scheduled
-
EXPORTING: Data Export
-
FINISHED: Operation Successful
-
CANCELLED: Job Failure
-
-
Progress: Work progress. The schedule is based on the query plan. Assuming there are 10 threads in total and 3 have been completed, the progress will be 30%.
-
TaskInfo: Job information in Json format:
-
db: database name
-
tbl: Table name
-
partitions: Specify the exported partition.
empty
Represents all partitions. -
column separator: The column separator for the exported file.
-
line delimiter: The line separator for the exported file.
-
tablet num: The total number of tablets involved.
-
Broker: The name of the broker used.
-
max_file_size: The maximum size of an export file.
-
delete_existing_files: Whether to delete existing files and directories in the specified export directory.
-
columns: Specifies the column names to be exported. Empty values represent exporting all columns.
-
format: The file format for export.
-
-
-
Path: Export path on remote storage.
-
CreateTime/StartTime/FinishTime: Creation time, start scheduling time and end time of jobs.
-
Timeout: Job timeout. The unit is seconds. This time is calculated from CreateTime.
-
Error Msg: If there is an error in the job, the cause of the error is shown here.
-
OutfileInfo: If the export job is successful, specific
SELECT INTO OUTFILE
result information will be displayed here.
Cancel Export Job
After submitting a job, the job can be canceled by using the CANCEL EXPORT command. For example:
CANCEL EXPORT
FROM example_db
WHERE LABEL like "%example%";
Best Practices
Concurrent Export
An Export job can be configured with the parallelism
parameter to concurrently export data. The parallelism
parameter specifies the number of threads to execute the EXPORT Job
. Each thread is responsible for exporting a subset of the total tablets.
The underlying execution logic of an Export Job
is actually the SELECT INTO OUTFILE
statement. Each thread specified by the parallelism
parameter executes independent SELECT INTO OUTFILE
statements.
The specific logic for splitting an Export Job
into multiple SELECT INTO OUTFILE
is, to evenly distribute all the tablets of the table among all parallel threads. For example:
-
If num(tablets) = 40 and parallelism = 3, then the three threads will be responsible for 14, 13, and 13 tablets, respectively.
-
If num(tablets) = 2 and parallelism = 3, then Doris automatically sets the parallelism to 2, and each thread is responsible for one tablet.
When the number of tablets responsible for a thread exceeds the maximum_tablets_of_outfile_in_export
value (default is 10, and can be modified by adding the maximum_tablets_of_outfile_in_export
parameter in fe.conf), the thread will split the tablets which are responsibled for this thread into multiple SELECT INTO OUTFILE
statements. For example:
- If a thread is responsible for 14 tablets and
maximum_tablets_of_outfile_in_export = 10
, then the thread will be responsible for twoSELECT INTO OUTFILE
statements. The firstSELECT INTO OUTFILE
statement exports 10 tablets, and the secondSELECT INTO OUTFILE
statement exports 4 tablets. The twoSELECT INTO OUTFILE
statements are executed serially by this thread.
exec_mem_limit
The query plan for an Export Job
typically involves only scanning and exporting
, and does not involve compute logic that requires a lot of memory. Therefore, the default memory limit of 2GB is usually sufficient to meet the requirements.
However, in certain scenarios, such as a query plan that requires scanning too many tablets on the same BE, or when there are too many data versions of tablets, it may result in insufficient memory. In these cases, you can adjust the session variable exec_mem_limit
to increase the memory usage limit.
Notes
-
It is not recommended to export large amounts of data at one time. The maximum amount of exported data recommended by an Export job is tens of GB. Excessive export results in more junk files and higher retry costs.
-
If the amount of table data is too large, it is recommended to export it by partition.
-
During the operation of the Export job, if FE restarts or cuts the master, the Export job will fail, requiring the user to resubmit.
-
If the Export job fails, the temporary files and directory generated in the remote storage will not be deleted, requiring the user to delete them manually.
-
Export jobs scan data and occupy IO resources, which may affect the query latency of the system.
-
The Export job can export data from
Doris Base tables
,View
, andExternal tables
, but not fromRollup Index
. -
When using the EXPORT command, please ensure that the target path exists, otherwise the export may fail.
-
When concurrent export is enabled, please configure the thread count and parallelism appropriately to fully utilize system resources and avoid performance bottlenecks.
-
When exporting to a local file, pay attention to file permissions and the path, ensure that you have sufficient permissions to write, and follow the appropriate file system path.
-
It is possible to monitor progress and performance metrics in real-time during the export process to identify issues promptly and make optimal adjustments.
-
It is recommended to verify the integrity and accuracy of the exported data after the export operation is completed to ensure the quality and integrity of the data.
Relevant configuration
FE
maximum_tablets_of_outfile_in_export
: The maximum number of tablets allowed for an OutFile statement in an ExportExecutorTask.
More Help
For more detailed syntax and best practices used by Export, please refer to the Export command manual, You can also enter HELP EXPORT
at the command line of the MySql client for more help.
The underlying implementation of the EXPORT
command is the SELECT INTO OUTFILE
statement. For more information about SELECT INTO OUTFILE, please refer to Export Query Result and SELECT INTO OUTFILE.