This document introduces the architecture of the TDStore storage engine for TDSQL Boundless, along with its core three-tier metadata model and KV encoding mechanism.
Prerequisites: Architecture Overview - Learn about the overall architecture and the three major core components of TDSQL Boundless.
TDStore Storage Engine Architecture
TDStore is the core storage engine of TDSQL Boundless. Built on RocksDB, it adds a data shard module and a distributed transaction module on top of an underlying Raft consensus protocol module. Together, these modules form a distributed KV storage engine that provides high scalability, high availability, distributed transactions, data rebalancing, and strong consistency across multiple replicas.
System Architecture
The TDStore storage engine includes the following core components:
| Component | Function | Characteristics |
| --- | --- | --- |
| RocksDB | Underlying KV storage | LSM-Tree structure, high compression ratio |
| Multi-Raft | Multi-replica disaster recovery | Strong data consistency, high availability |
| Sharding Module | Data shard management | Flexible scheduling, affinity support |
| Transaction Module | Distributed transactions | 2PC offloading, negotiated commit |
Layered Description
Single-Machine KV Storage Engine (RocksDB)
Receives KV requests forwarded from the compute layer.
Uses the LSM-Tree structure to store data.
Supports scheduling and migration at the data shard level.
Implements flexible scheduling of data between cluster nodes.
Maintains information such as participant context and transaction status for distributed transactions across data shards.
Supports scheduling correlated data to the same data shard.
Multi-Replica Disaster Recovery Layer (Multi-Raft)
Creates multiple replicas for each data shard across different TDStore nodes using the Raft Group approach.
Each data shard is synchronized as an independent log stream.
Low-Cost Mass Storage
The TDStore storage layer stores and manages data using an LSM-Tree + SSTable structure:
Extremely High Compression Ratio: Effectively reduces storage costs for massive data.
PB-Level Support: A single instance can support PB-level storage capacity.
Multi-level Compression: Provides compression algorithms at each data layer.
Three-Level Metadata Model
Note:
Core Concept: TDSQL Boundless implements fine-grained data management and intelligent scheduling through a three-level metadata model.
Distributed architectures face three major challenges: perception gaps, constrained scheduling, and rigid rules. TDSQL Boundless addresses them through a three-level metadata model:
DataObject (Data Object)
Definition: Logical-level conceptual abstractions such as tables, indexes, partitions, and auto-increment values.
Hierarchical Structure:
L0 Level: Database
L1 Level: Table, belonging to a specific Database.
L2 Level: Index or Partition, belonging to a specific Table.
Role:
Defines different types of data structures.
Enables topology-aware data affinity relationships.
Records table structures, secondary indexes, and other metadata.
Example: Object ID 10010 can unambiguously identify a secondary index under the primary partition (id: 1003) of the partitioned table (id: 1001) in the database (id: 1).
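The parent/child relationship in the example above can be pictured with a small sketch. This is only an illustration under stated assumptions: the `DataObject` class, its `parent_id` field, and the `ancestry` helper are hypothetical names, not TDStore's actual metadata schema. It shows how an object ID could be resolved to its full position in the hierarchy by following parent links.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataObject:
    """Hypothetical metadata record; field names are assumptions."""
    object_id: int
    kind: str                  # e.g. "database", "table", "partition", "index"
    parent_id: Optional[int]   # None for a top-level database


def ancestry(catalog: Dict[int, DataObject], object_id: int) -> List[DataObject]:
    """Return the chain from the root database down to the given object."""
    chain = []
    current: Optional[int] = object_id
    while current is not None:
        obj = catalog[current]
        chain.append(obj)
        current = obj.parent_id
    return list(reversed(chain))


# IDs taken from the example in the text.
catalog = {
    o.object_id: o
    for o in [
        DataObject(1, "database", None),
        DataObject(1001, "table", 1),
        DataObject(1003, "partition", 1001),
        DataObject(10010, "index", 1003),
    ]
}

path = ancestry(catalog, 10010)
print(" -> ".join(f"{o.kind}({o.object_id})" for o in path))
# database(1) -> table(1001) -> partition(1003) -> index(10010)
```

Because every object records its parent, a scheduler that knows only an object ID can recover which table and database it belongs to, which is the kind of topology awareness described under Role.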
Replication Group (RG)
Definition: A physical storage unit based on the Raft protocol, with one Leader and N replicas to ensure data consistency.
Role Types:
| Role | Description |
| --- | --- |
| Leader | Primary replica; handles all read and write requests. |
| Follower | Replica; synchronizes data and can participate in elections. |
| Learner | Synchronizes data but does not participate in elections. |
| Witness | Participates in elections but does not store data. |
Characteristics:
Corresponds to a Raft log stream.
Manages data across multiple Regions.
Supports data affinity scheduling.
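The role descriptions above imply two separate properties per member: whether it votes in elections and whether it stores data. The following minimal sketch makes that explicit. It assumes the standard Raft majority rule and assumes the Leader counts as a voting member; the names `Role`, `VOTING_ROLES`, `DATA_ROLES`, `quorum_size`, and `data_copies` are illustrative, not part of TDStore.

```python
from enum import Enum
from typing import Iterable


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"
    LEARNER = "learner"
    WITNESS = "witness"


# Derived from the role descriptions: Learners do not vote, Witnesses hold no data.
# Treating the Leader as a voter is an assumption based on standard Raft.
VOTING_ROLES = {Role.LEADER, Role.FOLLOWER, Role.WITNESS}
DATA_ROLES = {Role.LEADER, Role.FOLLOWER, Role.LEARNER}


def quorum_size(members: Iterable[Role]) -> int:
    """Majority of voting members (standard Raft rule, assumed here)."""
    voters = sum(1 for role in members if role in VOTING_ROLES)
    return voters // 2 + 1


def data_copies(members: Iterable[Role]) -> int:
    """Number of members that hold a copy of the data."""
    return sum(1 for role in members if role in DATA_ROLES)


group = [Role.LEADER, Role.FOLLOWER, Role.WITNESS, Role.LEARNER]
print(quorum_size(group))  # 3 voters, so a majority is 2
print(data_copies(group))  # Leader, Follower, and Learner hold data: 3
```

In this sketch, adding a Learner raises the number of data copies without changing the quorum, while adding a Witness changes the quorum without adding a data copy.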
Region (data shard)
Definition: A contiguous key range, which is the smallest unit of physical data storage.
Capacity Limit:
Up to 256 MB of data
or up to 100,000 rows of data
Rules:
A Replication Group (RG) can contain multiple Regions.
Each Region holds a portion of the actual data for a certain DataObject.
A single Region can contain data from at most one DataObject.
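The capacity limits and the one-DataObject rule can be expressed as two simple checks. The sketch below is hypothetical: the `Region` class, its fields, and the helper functions are illustrative names, "256MB" is interpreted as 256 × 1024 × 1024 bytes, and what the system does once a limit is exceeded (for example, splitting the Region) is not specified in this document.

```python
from dataclasses import dataclass

# Limits from the text; the byte interpretation of "256MB" is an assumption.
MAX_REGION_BYTES = 256 * 1024 * 1024
MAX_REGION_ROWS = 100_000


@dataclass
class Region:
    """Hypothetical in-memory view of a Region's statistics."""
    region_id: int
    data_object_id: int   # a Region holds data of exactly one DataObject
    size_bytes: int = 0
    row_count: int = 0


def exceeds_capacity(region: Region) -> bool:
    """True once either limit is crossed."""
    return (
        region.size_bytes > MAX_REGION_BYTES
        or region.row_count > MAX_REGION_ROWS
    )


def accepts(region: Region, data_object_id: int) -> bool:
    """A Region only accepts rows belonging to its own DataObject."""
    return region.data_object_id == data_object_id


small = Region(region_id=1, data_object_id=10010, size_bytes=64 << 20, row_count=40_000)
full = Region(region_id=2, data_object_id=10010, size_bytes=64 << 20, row_count=100_001)

print(exceeds_capacity(small))  # False
print(exceeds_capacity(full))   # True: the row limit is exceeded
print(accepts(small, 10011))    # False: different DataObject
```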
KV Encoding and Data Space
Encoding Rules
In TDSQL Boundless, all data is encoded in Key-Value format. Encoded keys are mem-comparable (memory-comparable): comparing two encoded keys byte by byte yields the same order as comparing the original values.
Encoding Characteristics:
The system assigns a globally increasing unique ID to each index.
All data of the same index shares the same prefix.
Encoded data is logically contiguous.
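To illustrate the characteristics above, here is a minimal sketch of one common way to build a mem-comparable key. The actual TDStore key layout is not described in this document, so the layout used here (a 4-byte big-endian index ID followed by an 8-byte integer column value) and the function name `encode_key` are assumptions chosen for illustration only.

```python
import struct


def encode_key(index_id: int, value: int) -> bytes:
    """Encode (index_id, signed 64-bit value) so that byte order equals logical order.

    Assumed layout, for illustration only:
      - 4-byte big-endian unsigned index ID (shared prefix for one index)
      - 8-byte big-endian value offset by 2**63, which is equivalent to
        flipping the sign bit, so negative values sort before positive ones
    """
    prefix = struct.pack(">I", index_id)
    body = struct.pack(">Q", value + (1 << 63))
    return prefix + body


values = [-5, 0, 3, 1000]
keys = [encode_key(10010, v) for v in values]

# Byte-wise comparison matches numeric order within the same index.
print(keys == sorted(keys))  # True

# Every key of index 10010 sorts before every key of index 10011,
# so the data of one index occupies one contiguous range.
print(encode_key(10010, 10**6) < encode_key(10011, -(10**6)))  # True
```

Big-endian encoding matters here: with little-endian bytes, a plain byte comparison would no longer agree with numeric order.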
Data Shards and Replication Groups
In the logical data space, each Key corresponds to a discrete point, but physically each Key-Value pair occupies storage space. As the data volume grows, a single node can no longer hold all the data, so the data is partitioned into multiple shards (Regions).
Key Features:
Data of the same index is spatially contiguous.
Different indexes of the same table may be distributed across different, non-contiguous Regions.
By using replication groups, associated Regions can be scheduled to the same node.
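Because keys are ordered and each Region covers a contiguous key range, finding the Region for a key reduces to a binary search over Region start keys. The sketch below assumes each Region covers the half-open range from its start key up to the next Region's start key. `RegionMap`, the readable sample keys, and the group name `rg-A` are hypothetical and do not reflect TDStore's routing implementation.

```python
import bisect
from typing import Dict, List, Tuple


class RegionMap:
    """Maps a key to the Region whose range [start, next_start) contains it."""

    def __init__(self, regions: List[Tuple[bytes, int]]):
        # Each entry is (start_key, region_id); sort by start key.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]

    def locate(self, key: bytes) -> int:
        # Find the last Region whose start key is <= key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(f"no Region covers key {key!r}")
        return self._regions[i][1]


# Readable stand-ins for encoded keys: "idx1:" and "idx2:" play the role
# of two index prefixes.
region_map = RegionMap([(b"idx1:", 1), (b"idx1:m", 2), (b"idx2:", 3)])

# Hypothetical affinity: all three Regions belong to one replication group,
# so they can be scheduled together.
region_to_group: Dict[int, str] = {1: "rg-A", 2: "rg-A", 3: "rg-A"}

for key in (b"idx1:apple", b"idx1:zebra", b"idx2:kiwi"):
    region_id = region_map.locate(key)
    print(key, region_id, region_to_group[region_id])
```

In this example, the two Regions of the first index and the Region of the second index are distinct, but mapping them to the same replication group keeps related data together.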