Data Modeling and Schema Design Specifications

Last updated: 2026-06-24 10:35:23

Scenarios
MongoDB's flexible document model facilitates rapid business iteration. In practice, however, this flexibility must be combined with sound design principles to keep the system stable over the long term. Production experience shows that, without adequate modeling guidance in the initial phase, the following common data modeling pitfalls can cause performance problems during periods of rapid business growth:
Divergent naming conventions: The coexistence of naming styles such as OrderDetail, order_detail, and orderdetail within the same project increases communication costs for later code maintenance and data troubleshooting.
Unbounded array growth: Developers habitually push continuously generated child data, such as a user's complete login history, into an array within the same document. Without a truncation mechanism, the document can reach MongoDB's 16MB limit for a single BSON document, after which writes that would grow it further are rejected.
Field type drift: A single field, such as age, contains a mixture of numeric and string types. This not only increases frontend parsing complexity but may also, under certain conditions, affect the efficiency of underlying indexes, leading to unexpected query results.
Excessive collection fragmentation: If a single cluster contains too many collections (exceeding 5000), it increases the metadata management load on the underlying WiredTiger storage engine, potentially prolonging instance startup time and, in extreme cases, increasing memory overhead.
Over-reliance on collection splitting and joins: Developers habitually apply relational database thinking and split orders into an "order master collection" and an "order detail collection". This practice turns an operation that could have been completed with a single embedded read into multiple network round trips.
Objective of this document: To help you fully leverage the advantages of the MongoDB document model and avoid designing MongoDB with a relational database mindset.
Naming Specifications
Specification 1: Prohibition on Using System Database Names
Core Actions:
Business data is strictly prohibited from occupying or sharing MongoDB's system-reserved databases (including but not limited to admin, local, and config).
Rule	Requirement	Correct Example	Incorrect Example
Red line	Isolate business data from system metadata.	`db_config` (independent business database)	`admin`, `local`, `config`
Business Application:
A team, for convenience, directly created a business collection within the admin database to store system configuration data. Because the admin database handles core functions such as user authentication and cluster management, high-priority system locks may need to be obtained during administrative operations, competing with business reads and writes and causing the entire database's response to slow down dramatically. Performance recovered only after the configuration data was migrated to a separate business database.
Specification 2: Database Naming
Core Actions:
It is recommended to use a database name with the prefix `db_` + lowercase letters + underscores.
Naming Conventions and Examples:
Rule	Requirement	Correct Example	Incorrect Example
Prefix	Start with `db_`	`db_order`	`Order`
Character Set	Lowercase letters + underscores	`db_user_center`	`db.user.center`
Length	< 64 bytes	`db_payment`	`db-payment`
Business Application:
A development team created a database named `UserCenter`. In the production environment (Linux), however, the connection strings for some microservices were mistakenly written as `usercenter`. Because MongoDB database names are case-sensitive on Linux, a brand-new empty database was inadvertently created in production, and core business queries against it returned no data, causing widespread null pointer errors in the application. Furthermore, without a unified prefix, DBAs found it difficult to tell which databases were core business ones when cleaning up abandoned temporary databases in the same cluster. After the naming convention of a `db_` prefix followed by all-lowercase letters (such as `db_user_center`) was fully implemented, case-sensitivity issues in cross-platform deployments were eliminated, and the operational boundary for Ops became clear.
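The rules in Specifications 1 and 2 can be checked mechanically, for example in a provisioning script. The following is a minimal sketch in plain JavaScript; the helper name `checkDatabaseName` and its return shape are illustrative assumptions, not part of MongoDB.

```javascript
// Hypothetical helper: checks a proposed database name against Specifications 1 and 2.
const RESERVED_DATABASES = ["admin", "local", "config"];

function checkDatabaseName(name) {
    const problems = [];
    if (RESERVED_DATABASES.includes(name)) {
        problems.push("system-reserved database");
    }
    if (!name.startsWith("db_")) {
        problems.push("missing db_ prefix");
    }
    // This convention allows only lowercase letters and underscores.
    if (!/^[a-z_]+$/.test(name)) {
        problems.push("characters other than lowercase letters and underscores");
    }
    // MongoDB requires database names to be shorter than 64 bytes.
    if (new TextEncoder().encode(name).length >= 64) {
        problems.push("name is 64 bytes or longer");
    }
    return problems;
}

console.log(checkDatabaseName("db_user_center")); // []
console.log(checkDatabaseName("UserCenter"));     // two problems: prefix and character set
```

An empty array means the name passes every check; each string in a non-empty result names one violated rule.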
Specification 3: Collection Naming
Core Actions:
It is recommended to use a collection name with the t_ prefix + lowercase letters + underscores, following the "module_entity" format.
Naming Conventions and Examples:
Rule	Requirement	Correct Example	Incorrect Example
Prefix	Start with `t_`	`t_order_detail`	`OrderDetail`
Format	Module_Entity	`t_user_address`	`t_user-address`
Prohibited	Not starting with `system.`	`t_system_config`	`system.config`
Time-based splitting	Time suffix	`t_log_202403`	`t_log$202403`
Business Application:
A project used `system.orders` as the order collection name. Because `system.` is a reserved prefix in MongoDB, some underlying management operations mistakenly identified it as a system internal collection and skipped it directly, ultimately causing incomplete auto-backup data. The issue was resolved after it was renamed to `t_orders`.
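The collection naming rules above, including the time suffix used for time-based splitting, can be sketched as two small helpers in plain JavaScript. Both function names are illustrative assumptions, and allowing digits inside a name segment is also an assumption made so that suffixes such as `202403` pass.

```javascript
// Hypothetical helpers illustrating Specification 3.
function isValidCollectionName(name) {
    // t_ prefix, then lowercase segments (letters or digits) joined by single underscores.
    // This also rejects the reserved "system." prefix, "-", "$", and uppercase letters.
    return /^t_[a-z0-9]+(_[a-z0-9]+)*$/.test(name);
}

function monthlyCollectionName(baseName, date) {
    // Builds a time-suffixed name from the UTC year and month.
    const year = date.getUTCFullYear();
    const month = String(date.getUTCMonth() + 1).padStart(2, "0");
    return `${baseName}_${year}${month}`;
}

console.log(monthlyCollectionName("t_log", new Date(Date.UTC(2024, 2, 15)))); // t_log_202403
console.log(isValidCollectionName("system.orders"));                          // false
```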
Specification 4: Field Naming
Core Actions:
It is recommended to adopt a single style for all field names, either camelCase (lower camel case) or snake_case. Field names should be self-explanatory without being overly long. Meaningless abbreviations are strictly prohibited.
Field Naming Comparison:
// Recommended: Clear semantics and consistent style
{
    "_id": ObjectId("..."),
    "userName": "Zhang San",           // camelCase style
    "createTime": ISODate("..."),
    "orderItems": [...],
    "totalAmount": 199.00
}

// Not recommended: Confusing naming
{
    "_id": ObjectId("..."),
    "UN": "Zhang San",                 // Unclear abbreviation
    "Create_Time": ISODate("..."), // Mixed style
    "oi": [...],                  // Unclear meaning
    "_total": 199.00              // Business fields starting with an underscore are prone to conflict with system fields.
}
Business Application:
A team had inconsistent field naming. Within the same collection, four different notations — createTime, Create_Time, create_time, and CT — were used to represent the creation time. Developers frequently misspelled field names, resulting in empty query results and increasing troubleshooting time from minutes to hours. After the naming convention was standardized, development efficiency improved significantly.
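Style drift of this kind can be caught early with a simple check on sample documents. The sketch below, in plain JavaScript, assumes the team has chosen camelCase; the helper name `findNonCamelCaseFields` is hypothetical.

```javascript
// Hypothetical lint helper: lists top-level field names that are not lower camelCase.
// _id is exempt because MongoDB itself defines it.
function findNonCamelCaseFields(doc) {
    const camelCase = /^[a-z][a-zA-Z0-9]*$/;
    return Object.keys(doc).filter((key) => key !== "_id" && !camelCase.test(key));
}

const sample = { _id: 1, UN: "Zhang San", Create_Time: null, oi: [], _total: 199.0 };
console.log(findNonCamelCaseFields(sample)); // [ 'UN', 'Create_Time', '_total' ]
```

A pattern check only enforces style. An unclear abbreviation such as `oi` still passes and has to be caught in review.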
Document Design Specifications
Specification 5: Controlling Single Document Size
Core Action: It is recommended to keep a single document within 100KB; a document must never exceed the 16MB limit. The larger the document, the greater the impact on read/write performance, memory usage, and network transmission.
Large Document Processing Policy:
Scenario	Solution	Example
Excessive Array Elements	Split into multiple documents	User posts: one document per post
Storing Large Files	Using GridFS	Images, videos, and large logs
Large Text Content	Compression at the business layer	Storing HTML content after compression
Oversized Files	Using COS + URL referencing	Store files in COS and store URLs in MongoDB.
Method for Detecting Document Size:
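// View the size of a single document in bytes (replace xxx with a real _id value)
// Legacy mongo shell syntax; in mongosh, use bsonsize(...) instead of Object.bsonsize(...)
Object.bsonsize(db.collection.findOne({ _id: xxx }))

// View the average document size of a collection (unit: bytes)
db.collection.stats().avgObjSize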
Business Application:
A social media platform stored all user posts in the posts array of each user's document. After active users published thousands of posts, the document size exceeded 16MB, preventing new posts from being written and causing users to complain about posting failures. The issue was completely resolved by changing to a design where each post is stored as a separate document linked by the user ID.
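Once a size has been measured, it can be compared against the two thresholds in this specification. The following plain JavaScript sketch shows one way to do that; the helper name `classifyDocumentSize` and its three labels are assumptions for illustration.

```javascript
// Thresholds from Specification 5 (sizes in bytes).
const RECOMMENDED_MAX_BYTES = 100 * 1024;    // 100KB recommended ceiling
const HARD_LIMIT_BYTES = 16 * 1024 * 1024;   // 16MB BSON document limit

// Hypothetical helper: classifies a document size measured elsewhere,
// for example with Object.bsonsize(doc) in the shell.
function classifyDocumentSize(sizeInBytes) {
    if (sizeInBytes > HARD_LIMIT_BYTES) return "exceeds-limit";
    if (sizeInBytes > RECOMMENDED_MAX_BYTES) return "oversized";
    return "ok";
}

console.log(classifyDocumentSize(2048));            // ok
console.log(classifyDocumentSize(5 * 1024 * 1024)); // oversized
```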
Specification 6: Controlling Nesting Levels
Core Actions:
It is recommended to keep the nesting level of documents within 3-5 layers to avoid overly deep nested logic.
// Recommendation: Moderate nesting level (3 layers)
{
    "_id": ObjectId("..."),
    "orderId": "ORD202403001",
    "customer": {                          // Layer 1
        "name": "Zhang San",
        "contact": {                       // Layer 2
            "phone": "13800138000",
            "email": "zhangsan@example.com",
            "address": {                   // Layer 3
                "city": "Xi'an",
                "street": "Keji Road"
            }
        }
    },
    "items": [{ "productId": "P001", "quantity": 2 }]  // Layer 1
}

// Not recommended: Excessive nesting (more than 5 layers)
{
    "level1": {
        "level2": {
            "level3": {
                "level4": {
                    "level5": {
                        "level6": {
                            "data": "Excessive nesting leads to extremely high query and maintenance costs."
                        }
                    }
                }
            }
        }
    }
}
Business Application:
A configuration system was designed with an 8-layer nested configuration structure. Modifying the innermost setting required constructing a long $set path, such as "a.b.c.d.e.f.g.value". Developers frequently wrote incorrect paths, causing configuration updates to fail, and it was difficult to create effective indexes on the deeply nested fields. After the structure was flattened through refactoring, both configuration updates and queries became simple and efficient.
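Nesting depth can be measured on sample documents rather than judged by eye. The sketch below is plain JavaScript with hypothetical helper names; it counts arrays as a layer, matching the "Layer 1" comment on `items` in the example above, and treats dates and other non-plain objects as scalars.

```javascript
// Hypothetical helpers: count how many layers of sub-documents or arrays
// sit below the top-level document.
function isContainer(value) {
    return Array.isArray(value) ||
        (value !== null && typeof value === "object" &&
            Object.getPrototypeOf(value) === Object.prototype);
}

function nestingLayers(value) {
    let deepest = 0;
    for (const child of Object.values(value)) {
        if (isContainer(child)) {
            deepest = Math.max(deepest, 1 + nestingLayers(child));
        }
    }
    return deepest;
}

const order = {
    customer: { contact: { address: { city: "Xi'an" } } },
    items: [{ productId: "P001", quantity: 2 }]
};
console.log(nestingLayers(order)); // 3
```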
Specification 7: Embedded vs. Referenced Design
Core Actions:
Prioritize embedded design. Use a referenced design only in necessary scenarios, such as when large data volumes are involved or when frequent independent updates are required.
Design Decision Table:
Consideration Factor	Prefer [Embedded]	Prefer [Referenced]
Read Pattern	Data is always read together.	Data is frequently read individually.
Total Data Volume	The child data volume is small and its scale is limited.	Large child data volume or an unlimited growth trend.
Update Frequency	Child data is rarely updated independently.	Child data is frequently updated independently.
Relationship Type	One-to-one, one-to-few	One-to-many, many-to-many
Sharing	Child data belongs exclusively to a single parent document.	Child data is frequently shared by multiple documents.
Embedded Design Example:
// Order + Order Items: Embedded (Always queried together; order items do not exist independently)
{
    "_id": ObjectId("..."),
    "orderId": "ORD202403001",
    "customerId": "C001",
    "items": [
        { "productId": "P001", "name": "Product A", "quantity": 2, "price": 99.00 },
        { "productId": "P002", "name": "Product B", "quantity": 1, "price": 199.00 }
    ],
    "totalAmount": 397.00,
    "status": "paid",
    "createTime": ISODate("2024-03-15T10:30:00Z")
}
Referenced Design Example:
// User + Article: Referenced (Articles are frequently queried independently and their number grows indefinitely)
// Note: "user_001" and "article_001" are readable placeholders; a real ObjectId is a 24-character hex string.
// User Document
{
    "_id": ObjectId("user_001"),
    "userName": "Zhang San",
    "email": "zhangsan@example.com"
}

// Article Document (referencing the user via authorId)
{
    "_id": ObjectId("article_001"),
    "title": "MongoDB Best Practices",
    "authorId": ObjectId("user_001"),  // referencing the user
    "content": "...",
    "createTime": ISODate("...")
}
Hybrid Mode redundantly embeds frequently accessed sub-data while preserving reference relationships.
// ✅ Hybrid Mode: Redundantly embeds product names and prices (snapshots) in orders while preserving productId references.
{
    "orderId": "ORD001",
    "items": [
        {
            "productId": ObjectId("..."),  // reference (used to associate with the latest product information)
            "name": "Product A",               // redundant (snapshot at order placement to prevent product renaming from affecting historical orders)
            "price": NumberDecimal("99.00") // redundant (price snapshot at order placement)
        }
    ]
}
Business Application:
An e-commerce system separated orders and order items into two collections (relational thinking). Querying order details required first querying the order, then querying the order items, and finally joining the results at the application layer. During major promotions, API latency surged from 50ms to 800ms. After an embedded design was adopted (embedding order items within the order document), a single query returned the complete data, latency dropped to 30ms, and the code was significantly simplified.
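The Hybrid Mode described earlier in this specification comes down to copying a few fields at write time. The following plain JavaScript sketch shows the idea; the helper name `buildOrderItemSnapshot` and the `stock` field are illustrative assumptions. Prices are plain strings here because plain JavaScript has no decimal type; in the shell they would be `NumberDecimal` values.

```javascript
// Hypothetical helper: builds an embedded order item that keeps the productId
// reference and copies (snapshots) the name and price at order time.
function buildOrderItemSnapshot(product, quantity) {
    return {
        productId: product._id,   // reference to the product document
        name: product.name,       // snapshot: unaffected by later renames
        price: product.price,     // snapshot: unaffected by later price changes
        quantity: quantity
    };
}

const product = { _id: "P001", name: "Product A", price: "99.00", stock: 12 };
const item = buildOrderItemSnapshot(product, 2);
product.name = "Product A (renamed)"; // a later rename does not touch the order item
console.log(item.name); // Product A
```

Only the fields that historical orders need are copied. Volatile product data such as stock stays in the product document and is reached through `productId`.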
Specification 8: Considerations for Array Design
Core Actions:
It is recommended to keep the number of elements in a single array under 1000. Designing arrays that grow indefinitely is strictly prohibited.
Example: Infinitely Growing Array vs. Independent Document Association:
// Not recommended: unbounded, infinitely growing arrays
{
    "userId": "user_10001",
    "orders": [
        { "orderId": "ORD_001", "amount": 99.00, "date": ISODate("...") },
        { "orderId": "ORD_002", "amount": 158.00, "date": ISODate("...") },
        // ... Active users may accumulate tens of thousands of orders, triggering the 16MB limit.
    ]
}

// Recommended: Split array elements into separate documents and associate them via foreign keys.
// User Document
{
    "userId": "user_10001",
    "name": "Zhang San",
    "orderCount": 1024
}

// Order Document (Independent Collection)
{
    "orderId": "ORD_001",
    "userId": "user_10001",   // foreign key association
    "amount": 99.00,
    "date": ISODate("2024-03-15T10:30:00Z")
}
Business Application:
An IoT platform stored all sensor readings in the readings array of a device document. After the platform had been running for a year, the array for an active device contained hundreds of thousands of readings, and the document exceeded 16MB, preventing new data from being written. After switching to the "bucket pattern" (one document per device per hour), the size of a single document stabilized below 100KB.
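The hourly bucket pattern from this case can be sketched as an upsert whose filter identifies the device and the hour. This is a minimal illustration in plain JavaScript, not the platform's actual implementation; the helper name and the field names `bucketStart`, `readings`, `count`, and `ts` are assumptions.

```javascript
// Hypothetical helper: builds the filter and update for appending one sensor
// reading to an hourly bucket document (one document per device per hour).
function buildBucketUpsert(deviceId, reading) {
    // Truncate the reading time to the start of its UTC hour.
    const bucketStart = new Date(reading.ts.getTime());
    bucketStart.setUTCMinutes(0, 0, 0);
    return {
        filter: { deviceId: deviceId, bucketStart: bucketStart },
        update: {
            $push: { readings: reading },  // append to this hour's bucket only
            $inc: { count: 1 }             // track how many readings the bucket holds
        },
        options: { upsert: true }          // create the bucket on its first reading
    };
}

const op = buildBucketUpsert("dev_001", { ts: new Date("2024-03-15T10:42:07Z"), value: 21.5 });
console.log(op.filter.bucketStart.toISOString()); // 2024-03-15T10:00:00.000Z
```

In the shell, the three parts would be passed to `updateOne(op.filter, op.update, op.options)` on the readings collection. Because each bucket covers a single hour, its array size is bounded by the device's reporting rate.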
Note:
For scenarios where only the most recent N records need to be retained (for example, only the last 10 login records), it is recommended to use the $push operator with the $slice modifier during write operations. This automatically maintains a fixed upper limit for the array length at the database level, avoiding multiple reads and truncations at the application layer.
For time-series data such as IoT or monitoring data, MongoDB 5.0 and later provide native Time Series Collections. Prefer them over a manually implemented bucket pattern, because they can achieve higher compression rates and better query performance.
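The `$push` with `$slice` approach from the first note can be sketched as a small builder for the update document. The helper name `buildKeepLastNUpdate` and the field name `recentLogins` are assumptions for illustration.

```javascript
// Hypothetical helper: builds an update that appends one element and keeps
// only the most recent maxLength elements of the array.
function buildKeepLastNUpdate(field, newItem, maxLength) {
    return {
        $push: {
            [field]: {
                $each: [newItem],    // $slice is only valid together with $each
                $slice: -maxLength   // a negative value keeps the last maxLength elements
            }
        }
    };
}

const update = buildKeepLastNUpdate("recentLogins", { ip: "203.0.113.7" }, 10);
console.log(JSON.stringify(update));
```

In the shell, the result would be passed as the update argument of `updateOne`, so the append and the truncation happen in a single write.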
Data Type Specifications
Specification 9: Selecting the Correct Data Type
Core Actions:
Use the Date type for dates, Decimal128 for monetary amounts, and ObjectId or an incrementing Long for IDs.
Data Type Selection Table:
Scenario	Recommended Type	Not Recommended Type	Potential Issue
Date and time	Date	String	Cannot use native date operations and range query optimizations.
Financial amount	Decimal128	Double	Loss of floating-point precision, leading to discrepancies in financial reconciliation.
Document primary key	ObjectId (default)	Random string	Non-incrementing random IDs cause frequent page splits, severely slowing write performance. The first 4 bytes of an ObjectId are a second-level timestamp, so ObjectIds are roughly incrementing. This concentrates B-Tree index writes at the tail and avoids the page splits caused by random insertion.
Large integer ID	NumberLong	String	Cannot perform numerical comparisons and range sorting.
Status flag	String (enumeration value)	Numeric	Magic numbers have unclear meanings, making later maintenance difficult.
Data Type Usage Example:
// Correct Type Usage
{
    "_id": ObjectId("65f3a2b8c1d2e3f4a5b6c7d8"),  // ObjectId
    "orderId": NumberLong("20240315000001"),       // Large integer
    "amount": NumberDecimal("199.99"),             // Use Decimal128 for monetary amounts
    "createTime": ISODate("2024-03-15T10:30:00Z"), // Use Date for dates
    "status": "paid"                               // Use string enumeration for status
}

// Incorrect Type Usage
{
    "_id": "random-uuid-string",          // Random strings impact performance
    "orderId": "20240315000001",          // Strings cannot be sorted numerically
    "amount": 199.99,                     // Double has precision issues
    "createTime": "2024-03-15 10:30:00",  // Strings cannot be used for date arithmetic
    "status": 1                           // Numeric meaning is unclear
}
Business Application:
A financial system stored monetary amounts using the Double type. Calculating 0.1 + 0.2 yielded 0.30000000000000004, and after cumulative calculations, the discrepancy with bank reconciliation amounted to hundreds of CNY. After switching to Decimal128, calculations became precise to the cent, and reconciliation was completely accurate.
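The precision problem in this case is easy to reproduce. The sketch below uses plain JavaScript, which has no Decimal128 type, so it sums integer cents as a stand-in for exact decimal arithmetic; in MongoDB itself the recommendation remains `NumberDecimal`. The helper name `sumAmounts` is hypothetical, and it assumes non-negative amounts with at most two fraction digits.

```javascript
// Binary floating point cannot represent 0.1 or 0.2 exactly.
console.log(0.1 + 0.2 === 0.3); // false

// Hypothetical helper: sums decimal strings exactly by converting to integer cents.
function sumAmounts(amounts) {
    let totalCents = 0;
    for (const amount of amounts) {
        const [whole, fraction = ""] = amount.split(".");
        totalCents += Number(whole) * 100 + Number((fraction + "00").slice(0, 2));
    }
    const cents = String(totalCents % 100).padStart(2, "0");
    return `${Math.floor(totalCents / 100)}.${cents}`;
}

console.log(sumAmounts(["0.10", "0.20"])); // 0.30
```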
Specification 10: _id Field Usage Specifications
Core Actions:
Unless there are specific requirements, use the default ObjectId as the _id. If business needs require a custom _id, it is recommended that it have an incremental characteristic.
Example:
// Recommendation: Use the default ObjectId
{ "_id": ObjectId("65f3a2b8c1d2e3f4a5b6c7d8") }

// Acceptable: Custom incremental ID (must ensure the incremental characteristic)
{ "_id": NumberLong("20240315000001") }

// Prohibited: Random strings (impact write performance)
{ "_id": "550e8400-e29b-41d4-a716-446655440000" }
Business Application:
A system used random UUIDs as primary keys. As data volume grew, random writes to the index tree caused a sharp increase in disk I/O and triggered frequent page splits, reducing write QPS from 10,000 to 3,000. After the switch to ObjectIds, which are roughly increasing over time, new index entries were appended near the tail of the index, the I/O bottleneck was eliminated, and performance returned to normal.
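The reason ObjectIds are roughly increasing is the timestamp in their first 4 bytes. The following plain JavaScript sketch decodes it from the hex string used in the example above; the helper name `objectIdTimestamp` is an assumption.

```javascript
// Hypothetical helper: reads the creation time encoded in an ObjectId's hex string.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function objectIdTimestamp(hexString) {
    const seconds = parseInt(hexString.slice(0, 8), 16);
    return new Date(seconds * 1000);
}

console.log(objectIdTimestamp("65f3a2b8c1d2e3f4a5b6c7d8").toISOString());
// 2024-03-15T01:22:00.000Z
```

Because later IDs carry later timestamps in their leading bytes, they sort after earlier ones, which is what keeps index inserts near the tail.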
Schema Validation
Specification 11: Configuring Schema Validation for Core Collections
Core Actions:
Use the JSON Schema validation feature provided by MongoDB to ensure type and format consistency for data writes at the database engine level.
Note:
validationLevel: "moderate" mode: Validates inserts and updates to existing documents that already satisfy the rules; updates to existing documents that do not yet satisfy the rules are not checked (suitable for legacy data migration scenarios).
Schema validation has a certain impact on write performance (typically < 5%). Collections with high write frequency need to be evaluated.
Use the collMod command to modify the validation rules of an existing collection.
Schema Validation Example:
// Create a collection with validation rules
db.createCollection("t_users", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["userName", "email", "createTime"],
            properties: {
                userName: {
                    bsonType: "string",
                    minLength: 2,
                    maxLength: 50,
                    description: "Username, required, 2-50 characters"
                },
                email: {
                    bsonType: "string",
                    pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
                    description: "Email, required, must conform to email format"
                },
                age: {
                    bsonType: "int",
                    minimum: 0,
                    maximum: 150,
                    description: "Age, optional, an integer from 0 to 150"
                },
                status: {
                    enum: ["active", "inactive", "deleted"],
                    description: "Status, enumeration value"
                },
                createTime: {
                    bsonType: "date",
                    description: "Creation time, required"
                }
            }
        }
    },
    validationLevel: "strict",    // strict: validates all writes
    validationAction: "error"     // error: rejects writes if validation fails
});
Business Application:
In an e-commerce system, the price field lacked type constraints. Some entries stored numbers like 99.00, some stored strings like "99.00", and some even stored objects like {value: 99}. Price sorting and comparison became completely chaotic, and the "spend 100, get 20 off" promotion logic failed. After Schema validation was added, writes of dirty data were rejected. Once the existing data had been cleaned, the feature returned to normal.
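For a collection that already holds data, as in this case, rules are attached with `collMod` rather than `createCollection`. The sketch below builds such a command in plain JavaScript. The helper name, the collection name `t_products`, and the choice of `decimal` for the price are illustrative assumptions, not the system's actual rules.

```javascript
// Hypothetical helper: builds a collMod command that attaches a JSON Schema
// validator to an existing collection while legacy data is still being cleaned.
function buildCollModCommand(collectionName, jsonSchema) {
    return {
        collMod: collectionName,
        validator: { $jsonSchema: jsonSchema },
        validationLevel: "moderate",  // do not check updates to documents that are already invalid
        validationAction: "error"     // reject inserts and checked updates that fail validation
    };
}

const command = buildCollModCommand("t_products", {
    bsonType: "object",
    required: ["price"],
    properties: { price: { bsonType: "decimal" } }
});
console.log(JSON.stringify(command, null, 2));
```

In the shell, the command object would be passed to `db.runCommand(command)`. After the existing data has been cleaned, `validationLevel` can be changed to "strict" in the same way.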
Capacity Planning
Specification 12: Controlling the Number of Collections
Core Actions:
It is recommended to keep the total number of collections per instance under 5000.
Impact of Excessive Number of Collections:
Impact	Description
Long instance startup time	During engine startup, the metadata of all collections must be loaded one by one.
High memory consumption	The metadata of each collection stays resident in the cache, occupying memory that business data could use.
File handle consumption	Each collection corresponds to multiple underlying data files, making it easy to hit the system limit.
Complex Ops	The time required for routine operations such as status monitoring, data migration, and major version upgrades increases sharply.
Backup timeout or failure	Traversing massive metadata can cause physical backup tasks to time out severely, or even fail to produce a physical backup at all.
Business Application:
An IoT platform initially adopted a "one device, one collection" model. As the business grew, over 50,000 collections were generated in a single database. The massive metadata caused a sharp increase in memory overhead, and the instance restart time deteriorated from seconds to tens of minutes. Backup tasks timed out severely or terminated outright because they had to traverse the extensive metadata, so full backups could not be completed. The practice of splitting collections by device was abandoned, and the data was consolidated into a single Time Series Collection. A compound index was created using deviceId + timestamp. After optimization, the number of collections was reduced to single digits, and both startup and backup operations returned to normal.
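The consolidation step in this case can be sketched as the options passed when the time series collection is created. This is a minimal illustration in plain JavaScript; the helper name and the collection name `t_device_readings` are assumptions, and the field names follow the deviceId and timestamp fields mentioned above.

```javascript
// Hypothetical helper: builds createCollection options for a time series
// collection that stores readings from all devices in one collection.
function buildTimeSeriesOptions(timeField, metaField, granularity) {
    return {
        timeseries: {
            timeField: timeField,     // field holding each reading's Date
            metaField: metaField,     // field identifying the source, such as the device
            granularity: granularity  // "seconds", "minutes", or "hours"
        }
    };
}

const options = buildTimeSeriesOptions("timestamp", "deviceId", "minutes");
console.log(JSON.stringify(options));
```

On MongoDB 5.0 or later, the result would be passed as the second argument of `db.createCollection("t_device_readings", options)`.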
Data Modeling Checklist
To ensure compliance with the specifications, it is recommended to check against the following checklist item by item during the project launch review phase:
Check Item	Verification Method	Passing Criteria
1. Naming Convention Consistency	Review all database, collection, and field names in the code.	Fully complies with the naming and prefix rules in this document.
2. Document Size Controllability	Sample documents and run Object.bsonsize(doc).	General documents are recommended to stay within 100KB and must not exceed the 16MB limit. Core complex documents are recommended to stay within 1MB.
3. Reasonable Nesting Depth	Review the JSON structure tree of core documents.	Maximum nesting depth stays within the recommended 3-5 layers.
4. Avoiding Unbounded Arrays	Thoroughly review data write and append logic.	Arrays have a clear business upper limit, or use a bucket pattern or $slice truncation mechanism.
5. Data Type Strictness	Review entity class field type definitions.	Dates use Date, and monetary amounts use Decimal128.
6. Enabling Schema Validation	Check collection creation scripts or validator configurations.	Required fields, field types, and format validations are fully configured for core collections.
7. Collection Count Control	Estimate growth and count existing collections (for example, with show collections).	Total collections per instance under 5000; estimated/actual number of business collections per database ≤ 100.
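The last checklist item can be automated once collection counts have been gathered per database. The following plain JavaScript sketch applies the two thresholds from this document; the helper name `checkCollectionCounts`, its input shape, and the sample database names are assumptions.

```javascript
// Hypothetical helper: checks collection counts (keyed by database name)
// against the per-database and per-instance thresholds in this document.
function checkCollectionCounts(countsByDatabase, perDatabaseMax = 100, perInstanceMax = 5000) {
    const counts = Object.entries(countsByDatabase);
    const total = counts.reduce((sum, [, count]) => sum + count, 0);
    return {
        total: total,
        instanceOk: total < perInstanceMax,
        databasesOverLimit: counts
            .filter(([, count]) => count > perDatabaseMax)
            .map(([name]) => name)
    };
}

console.log(checkCollectionCounts({ db_order: 42, db_iot: 50000 }));
```

In the shell, each count could be obtained with `db.getSiblingDB(name).getCollectionNames().length`.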
