Overview
Overview
Supported data sourcesβ
Apache Doris provides a variety of data import solutions, and you can choose different data loading methods for different data sources.
By sceneβ
Data Source | Loading Method |
---|---|
Object Storage (s3), HDFS | Loading data using Broker |
Local file | Loading local data |
Kafka | Subscribing to Kafka data |
MySQL, PostgreSQL, Oracle, SQLServer | Sync data via external table |
Loading via JDBC | Sync data using JDBC |
Loading JSON format data | JSON format data Loading |
By loading methodβ
Loading method | Description |
---|---|
Broker Load | Import external storage data via Broker |
Stream Load | Stream import data (local file and memory data) |
Routine Load | Import Kafka data |
Insert Into | External table imports data through INSERT |
S3 Load | Object storage data import of S3 protocol |
MySQL Load | Local data import of MySql protocol |
Supported Data Formatsβ
Different import methods support slightly different data formats.
Import Methods | Supported Formats |
---|---|
Broker Load | parquet, orc, csv, gzip |
Stream Load | csv, json, parquet, orc |
Routine Load | csv, json |
MySql Load | csv |
Import instructionsβ
The data import implementation of Apache Doris has the following common features, which are introduced here to help you better use the data import function
Atomicityβ
Each import job of Doris, whether it is batch import using Broker Load or single import using INSERT statement, is a complete transaction operation. The import transaction can ensure that the data in a batch takes effect atomically, and there will be no partial data writing.
At the same time, an import job will have a Label. This Label is unique under a database (Database) and is used to uniquely identify an import job. Label can be specified by the user, and some import functions will also be automatically generated by the system.
Label is used to ensure that the corresponding import job can only be successfully imported once. A successfully imported Label, when used again, will be rejected with the error Label already used
. Through this mechanism, At-Most-Once
semantics can be implemented in Doris. If combined with the At-Least-Once
semantics of the upstream system, the Exactly-Once
semantics of imported data can be achieved.
For best practices on atomicity, see Loading Transactions and Atomicity.
Synchronous and asynchronous loadingβ
Loading methods are divided into synchronous and asynchronous methods. For the synchronous loading method, the returned result indicates whether the loading succeeds or fails. For the asynchronous loading method, a successful result only means that the job was submitted successfully, but not that the data was loaded successfully. You need to use the corresponding command to check the running status of the loading job.
Load the array-type dataβ
The array function can only be supported in vectorization scenarios, but non-vectorization scenarios are not supported. If you want to apply the array function to load data, enable vectorization engine, and then cast the input parameter column into the array type according to the parameter of the array function. Finally, you can continue to use the array function.
For example, in the following loading job, you need to cast columns b14 and a13 into array<string>
type, and then use the array_union
function.
LOAD LABEL label_03_14_49_34_898986_19090452100 (
DATA INFILE("hdfs://test.hdfs.com:9000/user/test/data/sys/load/array_test.data")
INTO TABLE `test_array_table`
COLUMNS TERMINATED BY "|" (`k1`, `a1`, `a2`, `a3`, `a4`, `a5`, `a6`, `a7`, `a8`, `a9`, `a10`, `a11`, `a12`, `a13`, `b14`)
SET(a14=array_union(cast(b14 as array<string>), cast(a13 as array<string>))) WHERE size(a2) > 270)
WITH BROKER "hdfs" ("username"="test_array", "password"="")
PROPERTIES( "max_filter_ratio"="0.8" );
Execution Engine Selectedβ
The Pipeline engine is turned off by default on import, and is enabled by the following two variables:
enable_pipeline_load
in the FE configuration file. When enabled, import tasks such as Streamload will try to use the Pipeline engine.enable_nereids_dml_with_pipeline
in Session Variable to enable insert into to try to use the Pipeline engine.
When the above variables are turned on, whether and which set of Pipeline engine is used still depends on the settings of the other two Session Variables enable_pipeline_engine
and enable_pipeline_x_engine
. When both are enabled, PipelineX is selected in preference to the Pipeline Engine. If neither is enabled, the loading job will not be executed using the Pipeline engine even if the above variables are set to true
.