CSV++ (CSV Plus Plus): Extension to RFC 4180 for Hierarchical Data

Introduction

CSV++ extends the CSV format defined in RFC 4180 to support repeating fields (one-to-many relationships) and hierarchical component structures while maintaining backward compatibility with standard CSV parsers.

Motivation

Traditional CSV files represent flat, tabular data. However, real-world data often contains:

Repeated values (e.g., multiple phone numbers for one person)
Structured components (e.g., addresses with street, city, state, zip)
Nested hierarchies (e.g., addresses with multiple address lines)

CSV++ addresses these needs with a straightforward syntax while keeping the simplicity and human readability of CSV.

Design Principles

Backward Compatibility: Standard CSV parsers can read CSV++ files (though they won't interpret the enhanced structure)
Self-Documenting: Structure is defined in column headers
Human Readable: Data remains readable without special tools
Explicit Over Implicit: Delimiters are declared, not assumed
Recursively Composable: Structures can nest to any depth, though practical implementations SHOULD limit nesting to 3-4 levels for readability

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Conformance with RFC 4180

CSV++ files MUST conform to RFC 4180, with these specifications:

Fields are separated by a delimiter (comma by default)
Records are separated by line breaks (CRLF or LF)
Fields containing special characters MUST be enclosed in double-quotes
Double-quotes within quoted fields MUST be escaped by doubling: ""
First record MAY be a header record per RFC 4180. However, CSV++ array and structure features REQUIRE headers to declare field types
MIME type: text/csv

Field Separator Detection

The field separator character is detected using the same rules as . Parsers SHOULD auto-detect the field separator by:

Scanning the first line (header row)
Tracking bracket depth: [] and ()
Identifying characters that appear outside brackets (depth = 0)
Selecting the most common such character as the field separator
Common candidates: , (comma), \t (tab), | (pipe), ; (semicolon)

The comma (,) is the conventional field separator for CSV++ files.
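
The detection steps above can be sketched in Python as follows. This is a minimal sketch rather than a normative algorithm. The names detect_field_separator and CANDIDATES are hypothetical, and the sketch makes three assumptions that the text does not state: only the common candidates listed above are counted, double-quoted header fields are not treated specially, and the comma is returned when no candidate is found.

```python
from collections import Counter

# Assumption: only the common candidates named in the text are considered.
CANDIDATES = (",", "\t", "|", ";")


def detect_field_separator(header_line: str, default: str = ",") -> str:
    """Return the most frequent candidate character found outside [] and ()."""
    depth = 0
    counts = Counter()
    for ch in header_line:
        if ch in "[(":
            depth += 1
        elif ch in "])":
            depth = max(depth - 1, 0)  # tolerate a stray closing bracket
        elif depth == 0 and ch in CANDIDATES:
            counts[ch] += 1
    if not counts:
        return default  # assumption: fall back to the conventional comma
    return counts.most_common(1)[0][0]
```

For a header line such as id,name,phone[|],tags[] the pipe sits inside brackets and is not counted, so the comma is selected.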

Array Fields (Repetitions)

Syntax

A field containing repeated values is declared in the header using square brackets, as column_name[delimiter] or column_name[], where:

column_name - The name of the field
[delimiter] - Optional: The character used to separate repeated values
[] - Empty brackets use the default array delimiter

Delimiter Resolution:

If delimiter is specified: phone[|] uses |
If empty brackets: phone[] uses the tilde (~) as default delimiter

The tilde (~) is recommended as the default array delimiter to avoid conflicts with common data characters and the field separator.
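
As an illustration of the delimiter resolution rules, the following Python sketch parses a single array header. The names parse_array_header and ArrayHeader are hypothetical, and the sketch assumes that a declared delimiter is exactly one character and that the column name contains no brackets or parentheses.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_ARRAY_DELIMITER = "~"  # default for empty brackets, per the text


class ArrayHeader(NamedTuple):
    name: str
    delimiter: str


# Column name, then "[", an optional single delimiter character, then "]".
_ARRAY_HEADER = re.compile(r"^(?P<name>[^\[\]()]+)\[(?P<delim>[^\[\]]?)\]$")


def parse_array_header(header: str) -> Optional[ArrayHeader]:
    """Return the name and resolved delimiter, or None for a non-array header."""
    match = _ARRAY_HEADER.match(header)
    if match is None:
        return None
    delimiter = match.group("delim") or DEFAULT_ARRAY_DELIMITER
    return ArrayHeader(match.group("name"), delimiter)
```

With this sketch, phone[|] resolves to the delimiter | and phone[] resolves to the default tilde, while a plain header such as name is reported as not being an array.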

Examples

Arrays with Explicit Delimiters

Arrays with Default Delimiters

Empty Values

Empty values in repetitions are represented by consecutive delimiters:

This represents three tags: "urgent", "" (empty), "priority"
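
A small Python sketch of this rule follows. It assumes a tags[] column that uses the default tilde delimiter, so that the cell value would be urgent~~priority; both the column and the helper name split_array are hypothetical.

```python
from typing import List


def split_array(value: str, delimiter: str = "~") -> List[str]:
    """Split a cell into its repeated values, keeping empty entries."""
    return value.split(delimiter)


tags = split_array("urgent~~priority")
print(tags)  # ['urgent', '', 'priority']
```

Note that str.split returns a single empty string for an entirely empty cell. Whether that should be read as an empty array or as one empty value is not stated here, so an implementation would need to decide and document its choice.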

Escaping

If the repetition delimiter appears in the data, the entire field MUST be quoted per RFC 4180.

Structured Fields (Components)

Syntax

A field containing structured components is declared in the header using parentheses, optionally preceded by a component delimiter.

Component Delimiter Resolution:

If specified before (: address^(...) uses ^
If omitted: address(...) uses the caret (^) as default delimiter

The caret (^) is recommended as the default component delimiter to avoid conflicts with common data characters.
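
The component delimiter rules can be illustrated with the Python sketch below. The names parse_struct_header and StructHeader are hypothetical. The sketch assumes that column names consist of letters, digits, and underscores, so that any other single character directly before the opening parenthesis is the declared delimiter. The text between the parentheses is returned unparsed, and array-of-structure headers are not handled.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_COMPONENT_DELIMITER = "^"  # default when no delimiter is declared


class StructHeader(NamedTuple):
    name: str
    delimiter: str
    body: str  # raw text between the outer parentheses, left unparsed


# Word-character name, optional one-character delimiter, then "( ... )".
_STRUCT_HEADER = re.compile(
    r"^(?P<name>\w+)(?P<delim>[^\w\s\[\]()]?)\((?P<body>.*)\)$"
)


def parse_struct_header(header: str) -> Optional[StructHeader]:
    """Return name, resolved delimiter, and raw body, or None if no match."""
    match = _STRUCT_HEADER.match(header)
    if match is None:
        return None
    delimiter = match.group("delim") or DEFAULT_COMPONENT_DELIMITER
    return StructHeader(match.group("name"), delimiter, match.group("body"))
```

Under these assumptions, address^(...) and address(...) both resolve to the caret, while a header written with a different character before the parenthesis resolves to that character.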

Examples

Simple Structure

Repeated Structures

Nested Structures

Recursive Composition

Structures can nest arbitrarily deep. Component names can themselves be arrays or structures. Within component names in (...), array and structure syntax applies recursively.

Examples

Array Within Structure

Structure Within Structure

Delimiter Selection Guidelines

To maintain readability and parseability:

REQUIRED: Use different delimiters at each nesting level. Nested structures MUST use different component delimiters than their parent
Use visually distinct delimiters at each level
Recommended progression: ~ -> ^ -> ; -> :
Avoid using the field separator as a component delimiter
Document delimiter choices for complex schemas
Recommendation: Limit nesting to 3-4 levels maximum

Parsing

CSV++ parsers process files in two phases:

Header Parsing: Parse column headers to identify field types (simple, array, or structured) and extract delimiter information
Data Parsing: For each data row, split fields according to their declared type, respecting quoting rules for nested delimiters
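
The two phases can be sketched in Python as shown below. This is a deliberately reduced illustration, not a reference implementation: it handles only simple and array fields, treats any other header (including structured ones) as a simple field, relies on the standard csv module with a comma separator, and assumes no array delimiter is the same character as the field separator. The function names parse_headers and parse_csvpp are hypothetical.

```python
import csv
import io
import re
from typing import Dict, List, Optional, Tuple, Union

_ARRAY = re.compile(r"^([^\[\]()]+)\[([^\[\]]?)\]$")
Column = Tuple[str, Optional[str]]  # (name, array delimiter or None)


def parse_headers(headers: List[str]) -> List[Column]:
    """Phase 1: classify each column as simple or array."""
    columns: List[Column] = []
    for header in headers:
        match = _ARRAY.match(header)
        if match:
            columns.append((match.group(1), match.group(2) or "~"))
        else:
            columns.append((header, None))  # simple (or unsupported) field
    return columns


def parse_csvpp(text: str) -> List[Dict[str, Union[str, List[str]]]]:
    """Phase 2: split each data cell according to its declared type."""
    reader = csv.reader(io.StringIO(text))
    header_row = next(reader, None)
    if header_row is None:
        return []  # empty input
    columns = parse_headers(header_row)
    records = []
    for row in reader:
        record: Dict[str, Union[str, List[str]]] = {}
        for (name, delimiter), cell in zip(columns, row):
            record[name] = cell if delimiter is None else cell.split(delimiter)
        records.append(record)
    return records
```

Parsing all headers before touching any data row means that malformed headers can be rejected early, before data is processed.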

The ABNF grammar in the Grammar (ABNF) section provides a formal specification. Implementations MUST handle arbitrary nesting depth up to their documented limits.

Implementation Considerations

Validation

Implementations SHOULD validate:

Matching number of components across repeated structures
Proper bracket nesting in headers
Delimiter conflicts (same delimiter at multiple levels)
MUST reject: Nested structures using the same component delimiter as their parent
Reasonable nesting depth (recommend warning beyond 3-4 levels)
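
Two of these checks are simple enough to sketch in Python. Both helpers are hypothetical illustrations. The first assumes that bracket characters are never themselves used as declared delimiters. The second assumes that a repeated-structure cell is a sequence of items separated by the array delimiter, each item being a sequence of components separated by the component delimiter.

```python
def brackets_balanced(header: str) -> bool:
    """Check that [] and () in a header are properly nested and closed."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in header:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def component_counts_match(cell: str, array_delim: str, component_delim: str) -> bool:
    """Check that every repeated structure in a cell has the same number of components."""
    counts = {
        len(item.split(component_delim)) for item in cell.split(array_delim)
    }
    return len(counts) <= 1
```

A parser could run the first check once per header during header parsing and the second check per cell during data parsing, reporting a warning or an error according to its own policy.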

Limits

Implementations MAY impose reasonable limits on:

Nesting depth (recommended minimum: 10 levels)
Number of components per structure (recommended minimum: 100)
Number of repetitions per array (recommended minimum: 1000)

MIME Type and File Extension

MIME Type

CSV++ files use the text/csv media type defined in RFC 4180.

File Extensions

.csv - Standard extension (recommended for compatibility)
.csvpp - MAY be used to explicitly indicate CSV++ format
.csvplus - Alternative explicit extension

Security Considerations

Delimiter Injection

Malicious data could attempt to inject delimiters to break parsing. Implementations MUST respect quoting. Quoted fields MUST be parsed as literal values. Delimiters inside quotes MUST NOT be interpreted as separators.

Complexity Attacks

Deeply nested or highly repetitive structures could cause excessive memory consumption or CPU exhaustion during parsing. Mitigations:

Implement depth limits
Implement size limits
Use streaming parsers for large files
Validate headers before processing data

Encoding Issues

Files SHOULD use UTF-8 encoding. Implementations SHOULD detect and handle encoding issues. BOM (Byte Order Mark) MAY be present.
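
One way to handle an optional BOM in Python is the standard utf-8-sig codec, which removes a leading UTF-8 BOM when present and otherwise behaves like plain UTF-8. The helper name decode_csvpp is hypothetical, and letting invalid byte sequences raise an error is only one possible policy.

```python
def decode_csvpp(raw: bytes) -> str:
    """Decode file bytes as UTF-8, dropping a leading BOM if there is one.

    Invalid byte sequences raise UnicodeDecodeError, which the caller can
    catch and report as an encoding problem.
    """
    return raw.decode("utf-8-sig")
```

Stripping the BOM before header parsing prevents it from becoming part of the first column name.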

IANA Considerations

This document has no IANA actions. CSV++ files use the text/csv media type defined in RFC 4180. The format is fully backward compatible with standard CSV parsers.

References

Normative References

Informative References

Grammar (ABNF)

Complete Examples

E-commerce Order

Acknowledgments

This specification was inspired by the HL7 Version 2.x delimiter hierarchy and the need for a simple, human-readable format for hierarchical data that maintains compatibility with existing CSV tools.