# Partial Update
## Overview
To perform partial column updates on your table, Doris provides two models: the Unique Key model and the Aggregate Key model.
### Unique Key Model
Doris's Unique Key model provides default whole-row Upsert semantics. Before version 2.0, if users wanted to update certain columns of some rows, they could only use the `UPDATE` command. However, due to the lock granularity of read-write transactions, the `UPDATE` command is not suitable for high-frequency data write scenarios. Therefore, version 2.0 introduced support for partial column updates in the Unique Key model.
Note:

- Partial column updates are only supported in the Unique Key Merge-on-Write implementation in version 2.0.0.
- Version 2.0.2 introduces support for performing partial column updates using `INSERT INTO`.
- Version 2.1.0 will offer more flexible partial column updates, as described in the "Usage Limitations" section below.
### Aggregate Key Model
Aggregate Key tables are primarily used in pre-aggregation scenarios rather than data update scenarios. However, partial column updates can also be achieved by setting the aggregation function to `REPLACE_IF_NOT_NULL`.
## Applicable Scenarios
- Real-time dynamic column updates, where certain field values in the table need to be updated at high frequency in real time. For example, in a user label table generated by T+1 jobs, some fields containing the latest user behaviors need to be updated in real time to support real-time analysis and decision-making in advertising/recommendation systems.
- Combining multiple source tables into one large wide table.
- Data correction.
## Fundamentals
For more information about the principles of the Unique Key model and Aggregate Key model, please refer to the Data Model introduction.
### Unique Key Model
The Unique Key model currently supports column updates only in the Merge-on-Write implementation.
Users write data for certain columns into Doris's Memtable through the regular load methods. At this point, the Memtable does not contain complete rows of data. When flushing the Memtable, Doris searches for historical data, fills in the entire row, then writes it to the data file, and marks the data rows with the same key in the historical data file as deleted.
In the case of concurrent loads, Doris uses the MVCC mechanism to ensure data correctness. If two batches of loaded data update different columns of rows with the same key, the load task with the higher system version re-fills the rows with the same key written by the lower-version load task after that task succeeds, so that both updates take effect.
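As a concrete sketch of this merge behavior (assuming a hypothetical Merge-on-Write table `t` with key column `k`, value columns `v1` and `v2`, and an existing row `(1, 10, 20)`):

```sql
-- load A (assigned the lower system version) writes columns (k, v1):
--   row 1,11  ->  the stored row becomes (1, 11, 20)
-- load B (assigned the higher system version) writes columns (k, v2):
--   row 1,22  ->  B re-fills v1 from load A's committed data,
--                 so the stored row becomes (1, 11, 22)
-- after both loads commit, a query sees both updates:
SELECT * FROM t WHERE k = 1;   -- (1, 11, 22)
```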
### Aggregate Key Model
You can achieve partial column updates by setting the aggregation function to `REPLACE_IF_NOT_NULL`. For detailed usage examples, please refer to the following section.
### Concurrent Writes and Data Visibility
Partial column updates support high-frequency concurrent writes, and once the write is successful, the data becomes visible. The system automatically ensures the correctness of concurrent writes through the MVCC mechanism.
## Performance
Usage Recommendations:
- For users with high write performance requirements and low query performance requirements, it is recommended to use the Aggregate Key Model.
- For users with high query performance requirements and lower write performance requirements (e.g., data writes and updates are mainly completed during off-peak hours in the early morning), or for users with low write frequency, it is recommended to use the Unique Key Merge-on-Write implementation.
### Unique Key Merge-on-Write Implementation
Since the Merge-on-Write implementation requires filling in entire rows of data during data writes to ensure optimal query performance, performing partial column updates using the Merge-on-Write implementation may lead to noticeable load performance degradation.
Write Performance Optimization Recommendations:

- Use NVMe SSDs or high-speed SSD cloud disks. Filling in row data requires reading historical data in large quantities, which can generate high read IOPS and read throughput.
- Enable row storage to significantly reduce the IOPS generated when filling in row data, resulting in a noticeable improvement in load performance. Row storage can be enabled at table creation with the following property:

```
"store_row_column" = "true"
```
### Aggregate Key Model
The Aggregate Key model does not perform any additional processing during the write process, so its write performance is not affected and is the same as regular data ingestion. However, there is a significant cost associated with aggregation during query execution, with typical aggregation query performance being 5-10 times lower compared to the Merge-on-Write implementation of the Unique Key model.
## Usage Instructions and Examples
### Unique Key Model
#### Table Creation

When creating the table, specify the following property to enable the Merge-on-Write implementation:

```
enable_unique_key_merge_on_write = true
```
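For instance, the `order_tbl` table used in the example below could be created with Merge-on-Write enabled as follows (a sketch; the bucket count and replication settings are assumptions):

```sql
CREATE TABLE `order_tbl` (
  `order_id` int(11) NULL,
  `order_amount` int(11) NULL,
  `order_status` varchar(100) NULL
) ENGINE=OLAP
UNIQUE KEY(`order_id`)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 1
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1",
  "enable_unique_key_merge_on_write" = "true"
);
```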
#### StreamLoad/BrokerLoad/RoutineLoad

If you are using StreamLoad/BrokerLoad/RoutineLoad, add the following header when loading:

```
partial_columns:true
```

Also, specify the columns to be loaded in the `columns` header (it must include all key columns, or else updates cannot be performed).
#### Flink Connector

If you are using the Flink Connector, you need to add the following configuration:

```
'sink.properties.partial_columns' = 'true',
```

Also, specify the columns to be loaded in `sink.properties.columns` (it must include all key columns, or else updates cannot be performed).
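As a sketch of a complete sink definition (assuming Flink SQL with the Doris connector; the FE address, credentials, and table identifier are placeholders, not values from this document):

```sql
CREATE TABLE doris_sink (
    order_id INT,
    order_status STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = '127.0.0.1:8030',           -- placeholder FE address
    'table.identifier' = 'db1.order_tbl',   -- placeholder db.table
    'username' = 'root',
    'password' = '',
    -- enable partial column updates and list the columns to load,
    -- including all key columns:
    'sink.properties.partial_columns' = 'true',
    'sink.properties.columns' = 'order_id,order_status'
);
```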
#### INSERT INTO

In all data models, when you use `INSERT INTO` with a given set of columns, the default behavior is to insert the entire row. To enable partial column updates in the Merge-on-Write implementation, set the following session variable:

```sql
set enable_unique_key_partial_update=true
```

Please note that the session variable `enable_insert_strict`, which controls whether the insert statement operates in strict mode, defaults to true. In strict mode, partial column updates are not allowed to update non-existing keys. Therefore, when using the insert statement for partial column updates and wishing to insert non-existing keys, you need to set `enable_unique_key_partial_update` to true and `enable_insert_strict` to false.
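For example, with both variables set, upserting a key that does not yet exist looks as follows (the key value here is hypothetical):

```sql
set enable_unique_key_partial_update = true;
set enable_insert_strict = false;
-- order_id 2 does not exist yet; with strict mode disabled it is inserted,
-- and the unspecified columns are filled with their default/NULL values
INSERT INTO order_tbl (order_id, order_status) VALUES (2, 'Pending Payment');
```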
#### Example

Suppose there is an order table `order_tbl` in Doris, where the order ID is the key column, and the order status and order amount are value columns. The data is as follows:

```
+----------+--------------+-----------------+
| order_id | order_amount | order_status    |
+----------+--------------+-----------------+
| 1        | 100          | Pending Payment |
+----------+--------------+-----------------+
1 row in set (0.01 sec)
```
Now, when the user clicks to pay, the order status of the order with order ID 1 needs to change to 'Pending Delivery'.
If you are using Stream Load, you can perform the update as follows:

```shell
$ cat update.csv
1,Pending Delivery
$ curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:48037/api/db1/order_tbl/_stream_load
```
If you are using `INSERT INTO`, you can perform the update as follows:

```sql
set enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) values (1,'Pending Delivery');
```
The updated result is as follows:

```
+----------+--------------+------------------+
| order_id | order_amount | order_status     |
+----------+--------------+------------------+
| 1        | 100          | Pending Delivery |
+----------+--------------+------------------+
1 row in set (0.01 sec)
```
### Aggregate Key Model

#### Table Creation

Set the aggregation function for the columns that need partial updates to `REPLACE_IF_NOT_NULL`, as follows:

```sql
CREATE TABLE `order_tbl` (
  `order_id` int(11) NULL,
  `order_amount` int(11) REPLACE_IF_NOT_NULL NULL,
  `order_status` varchar(100) REPLACE_IF_NOT_NULL NULL
) ENGINE=OLAP
AGGREGATE KEY(`order_id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`order_id`) BUCKETS 1
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1"
);
```
#### Write Data

Whether through load tasks or `INSERT INTO`, you can directly write the data for the columns you want to update.
#### Example

Similar to the previous example, the corresponding Stream Load command (no additional header required) is:

```shell
curl --location-trusted -u root: -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:48037/api/db1/order_tbl/_stream_load
```
The corresponding `INSERT INTO` statement (no additional session variable required) is:

```sql
INSERT INTO order_tbl (order_id, order_status) values (1,'Pending Delivery');
```
## Usage Limitations

### Unique Key Merge-on-Write Implementation

All rows in the same batch of a data write task (whether a load task or `INSERT INTO`) can only update the same columns. To update different sets of columns, write them in separate batches.
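For instance, continuing with the `order_tbl` example, updates that touch different column sets would be split into two statements, one per column set (the values here are hypothetical):

```sql
set enable_unique_key_partial_update = true;
-- batch 1: every row in this statement updates order_status only
INSERT INTO order_tbl (order_id, order_status)
VALUES (1, 'Pending Delivery'), (2, 'Pending Payment');
-- batch 2: every row in this statement updates order_amount only
INSERT INTO order_tbl (order_id, order_amount)
VALUES (3, 250);
```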
### Aggregate Key Model

Users cannot change a field from a non-NULL value to NULL; any NULL value written to a column with the `REPLACE_IF_NOT_NULL` aggregation function is automatically ignored during processing.
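A minimal sketch of this behavior, using the Aggregate Key `order_tbl` defined above (the values are hypothetical):

```sql
-- current row: (1, 100, 'Pending Payment')
INSERT INTO order_tbl (order_id, order_amount, order_status)
VALUES (1, NULL, 'Pending Delivery');
-- resulting row: (1, 100, 'Pending Delivery')
-- the NULL written to order_amount is ignored by REPLACE_IF_NOT_NULL,
-- so the existing non-NULL value 100 is preserved
```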