Canonicalisation
Methodology Version 1.0 — Effective February 2026
Once a track has been enriched, validated, and confirmed as a Golden Record, its metadata must be transformed into a canonical form — a single, deterministic representation that always produces the same output for the same underlying data, regardless of when or how the data was collected.
Canonicalisation is the prerequisite for hashing. Without a deterministic representation, identical metadata could produce different hashes depending on field ordering, whitespace, or serialisation choices, undermining the entire certification chain.
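This fragility is easy to reproduce with Python's standard library. The sketch below uses illustrative field values; it shows how the same data, inserted in a different order, hashes differently under naive serialisation but identically under deterministic serialisation:

```python
import hashlib
import json

# Two dicts holding identical data, built in different insertion orders.
a = {"isrc": "USRC17607839", "title": "Example Track"}
b = {"title": "Example Track", "isrc": "USRC17607839"}

# Naive serialisation preserves insertion order, so the bytes differ...
naive_a = json.dumps(a)
naive_b = json.dumps(b)
assert naive_a != naive_b

# ...and so do the hashes, even though the underlying data is the same.
assert (hashlib.sha256(naive_a.encode()).hexdigest()
        != hashlib.sha256(naive_b.encode()).hexdigest())

# Deterministic serialisation removes the variation entirely.
canon_a = json.dumps(a, sort_keys=True, separators=(",", ":"))
canon_b = json.dumps(b, sort_keys=True, separators=(",", ":"))
assert canon_a == canon_b
```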
Field selection
Only rights-determinative fields are included in the canonical form. Fields that relate to processing history, engagement metrics, or internal workflow state are excluded.
| Included | Excluded |
|---|---|
| ISRC | Engagement metrics (play counts, listener statistics) |
| ISWC(s) | Source-specific identifiers (Spotify URI, MusicBrainz MBID) |
| Title | Confidence scores from enrichment |
| Artist | Internal workflow state |
| Writers (name, IPI, role, share %) | Enrichment log metadata |
| Performers (name, role) | Processing timestamps (except certification date) |
| Release date | Source provenance records |
| Duration (milliseconds) | |
| Territory registrations | |
| Publisher chains | |
This separation ensures that the hash represents the substance of the rights claim — the data that determines who is owed what — rather than artefacts of how the data was assembled. Two tracks with identical rights-critical data will always produce the same hash, even if they were enriched at different times or through different source orderings.
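As a minimal sketch of the inclusion rule, the filter below keeps only rights-determinative fields and drops null or empty values. The field names and record shape are assumptions for illustration, not the exact TrackForge schema:

```python
# Hypothetical field names; the actual schema may differ.
RIGHTS_CRITICAL_FIELDS = {
    "isrc", "iswcs", "title", "artist", "writers", "performers",
    "release_date", "duration_ms", "territory_registrations",
    "publisher_chains",
}

def extract_rights_critical(record: dict) -> dict:
    """Keep only rights-determinative fields; drop null/empty values."""
    return {
        key: value
        for key, value in record.items()
        if key in RIGHTS_CRITICAL_FIELDS and value not in (None, "", [], {})
    }

record = {
    "isrc": "USRC17607839",
    "title": "Example Track",
    "play_count": 1_204_553,           # engagement metric: excluded
    "spotify_uri": "spotify:track:x",  # source-specific ID: excluded
    "iswcs": [],                       # empty: removed
}
assert extract_rights_critical(record) == {
    "isrc": "USRC17607839",
    "title": "Example Track",
}
```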
Deterministic serialisation
The canonical form is constructed through a strictly defined serialisation process:
Step-by-step process

- Extract rights-critical fields — Only the fields listed in the "Included" column above are retained. All other fields are discarded. Null or empty values are removed.
- Sort all keys alphabetically — The top-level keys of the data structure are sorted in lexicographic order. This eliminates variation caused by different insertion orderings in the source data.
- Normalise writer arrays — Writers are sorted first by IPI number (lexicographic), then by name. Each writer entry is normalised to a consistent structure: `{name, ipi, role, share}`.
- Normalise performer arrays — Performers are sorted by role, then by name.
- Sort nested structures — Territory registrations, publisher chains, and ISWC collections are sorted by their respective keys.
- Serialise as compact JSON — The data is serialised using the following parameters:

```python
json.dumps(data, sort_keys=True, separators=(',', ':'), ensure_ascii=True)
```

This produces compact JSON with no whitespace between elements, no trailing newlines, and all non-ASCII characters escaped to their Unicode code points. The `sort_keys=True` parameter provides a secondary guarantee of key ordering.
The resulting string is encoded as UTF-8 before being passed to the hashing step.
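Putting the steps together, the whole process might be sketched as the function below. The key names and record structure are illustrative assumptions; a real record would carry the full field set:

```python
import json

def canonicalise(record: dict) -> bytes:
    """Produce the canonical UTF-8 byte string for hashing.

    Assumes `record` already contains only rights-critical fields.
    Field names are illustrative, not the exact TrackForge schema.
    """
    data = dict(record)

    # Normalise writer entries: sort by IPI then name, fixed field set.
    if "writers" in data:
        data["writers"] = sorted(
            (
                {"name": w["name"], "ipi": w["ipi"],
                 "role": w["role"], "share": w["share"]}
                for w in data["writers"]
            ),
            key=lambda w: (w["ipi"], w["name"]),
        )

    # Normalise performer entries: sort by role, then by name.
    if "performers" in data:
        data["performers"] = sorted(
            data["performers"], key=lambda p: (p["role"], p["name"])
        )

    # Sort simple collections such as ISWCs.
    if "iswcs" in data:
        data["iswcs"] = sorted(data["iswcs"])

    # Compact, key-sorted, ASCII-escaped JSON, then UTF-8 bytes.
    return json.dumps(
        data, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")
```

With this in place, two records carrying the same writers in different orders canonicalise to identical bytes, which is the property the hashing step depends on.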
Why these choices matter
| Decision | Rationale |
|---|---|
| `sort_keys=True` | Eliminates ordering variation across different systems and languages |
| `separators=(',', ':')` | Removes all optional whitespace, ensuring byte-identical output |
| `ensure_ascii=True` | Escapes all non-ASCII characters to Unicode code points, removing encoding ambiguity from the output |
| UTF-8 encoding | Universally supported encoding standard |
| Writer sort by IPI then name | IPI is the most stable identifier; name provides a tiebreaker |
| Null/empty removal | Prevents spurious hash differences from absent-vs-null distinctions |
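The effect of these parameters can be seen directly on a small example (illustrative data):

```python
import json

data = {"title": "Café", "duration_ms": 215000}

canonical = json.dumps(data, sort_keys=True, separators=(",", ":"),
                       ensure_ascii=True)

# No optional whitespace, keys in sorted order, non-ASCII escaped:
assert canonical == '{"duration_ms":215000,"title":"Caf\\u00e9"}'

# The escaped form is pure ASCII, so encoding it is unambiguous.
assert canonical.encode("utf-8") == canonical.encode("ascii")
```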
Verification
The canonical form is included in the certification proof bundle, allowing any party to:
- Inspect the exact data that was certified
- Re-run the serialisation process independently
- Confirm that the canonical JSON produces the expected SHA-256 hash
Because the serialisation parameters are fully specified and use standard library functions available in every major programming language, independent reproduction does not require any TrackForge software.
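For example, a verifier holding only the canonical JSON string from a proof bundle could reproduce the hash with nothing but the Python standard library (the sample data here is hypothetical):

```python
import hashlib
import json

# Canonical JSON as it might appear in a proof bundle (hypothetical data).
canonical_json = '{"isrc":"USRC17607839","title":"Example Track"}'

# Recompute the record hash independently.
record_hash = hashlib.sha256(canonical_json.encode("utf-8")).hexdigest()

# Optionally confirm the string really is in canonical form:
# re-serialising the parsed data must reproduce it byte for byte.
reserialised = json.dumps(json.loads(canonical_json),
                          sort_keys=True, separators=(",", ":"),
                          ensure_ascii=True)
assert reserialised == canonical_json
```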
The methodology_hash field is included in certification metadata (e.g., in the proof bundle and API responses) but is not part of the canonical JSON or the record_hash. This is intentional: the record hash proves data integrity (the rights-critical metadata has not changed), while the methodology hash proves rule integrity (the evaluation criteria have not changed). These are independent verification dimensions — changing the methodology version does not alter record hashes for unchanged data.
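The independence of the two dimensions can be illustrated as follows; the methodology text shown is a stand-in, not actual rule content:

```python
import hashlib

canonical_json = '{"isrc":"USRC17607839","title":"Example Track"}'

# record_hash covers only the canonical rights data.
record_hash = hashlib.sha256(canonical_json.encode("utf-8")).hexdigest()

# methodology_hash covers only the rule text (stand-in content here).
methodology_v1 = b"Methodology Version 1.0 rules..."
methodology_v2 = b"Methodology Version 1.1 rules..."

# Changing the methodology changes its hash...
assert (hashlib.sha256(methodology_v1).hexdigest()
        != hashlib.sha256(methodology_v2).hexdigest())

# ...but the record hash is untouched, because the methodology hash
# is never an input to the canonical JSON.
assert record_hash == hashlib.sha256(canonical_json.encode()).hexdigest()
```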
Related methodology pages
- Golden Record Selection — The completeness criteria that must be met before canonicalisation
- Hashing — The next step: computing a SHA-256 digest of the canonical form
- Independent Verification — How third parties can reproduce the canonical form and verify the hash