Data Repositories

A data repository is a collection of data published by a single user. Repositories are self-authenticating data structures, meaning each update is signed and can be verified by anyone.

Data Layout

The content of a repository is laid out in a Merkle Search Tree (MST), which reduces the repository's state to a single root hash. It can be visualized as follows:

┌────────────────┐
│     Commit     │  (Signed Root)
└───────┬────────┘
        ↓
┌────────────────┐
│      Root      │
└───────┬────────┘
        ↓
┌────────────────┐
│   Collection   │
└───────┬────────┘
        ↓
┌────────────────┐
│     Record     │
└────────────────┘

Every node is an IPLD object (encoded as dag-cbor) that is referenced by its CID, a hash of the node's content. Each arrow in the diagram above represents a CID reference.

This layout is reflected in the URLs:

Root       | at://alice.com
Collection | at://alice.com/app.bsky.feed.post
Record     | at://alice.com/app.bsky.feed.post/1234
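
As a rough illustration of how the three URL levels map onto the layout, the sketch below splits a URL into its parts. The names `RepoUrl` and `parse_repo_url` are hypothetical and the validation is deliberately minimal; this is not the protocol's actual parser.

```python
from typing import NamedTuple, Optional


class RepoUrl(NamedTuple):
    # Hypothetical container for the three levels shown in the table above.
    root: str                  # repository, e.g. "alice.com"
    collection: Optional[str]  # e.g. "app.bsky.feed.post"
    record: Optional[str]      # record key, e.g. "1234"


def parse_repo_url(url: str) -> RepoUrl:
    """Split an at:// URL into root, collection, and record parts."""
    prefix = "at://"
    if not url.startswith(prefix):
        raise ValueError(f"not an at:// URL: {url!r}")
    parts = url[len(prefix):].split("/")
    # Require a root, at most two further segments, and no empty segments.
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"malformed URL: {url!r}")
    collection = parts[1] if len(parts) > 1 else None
    record = parts[2] if len(parts) > 2 else None
    return RepoUrl(parts[0], collection, record)


print(parse_repo_url("at://alice.com/app.bsky.feed.post/1234"))
```

A URL with only the first segment addresses the Root; each additional segment descends one level in the tree.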

A “commit” to a data repository is simply a keypair signature over a Root node’s CID. Each mutation to the repository produces a new Root node, and every Root node includes the CID of the previous Commit. This produces a linked list that represents the history of changes in the repository.

┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│     Commit     │  │     Commit     │  │     Commit     │
└───────┬────────┘  └───────┬────────┘  └───────┬────────┘
        │     ↑             │     ↑             │
        ↓     └──prev─┐     ↓     └──prev─┐     ↓
┌────────────────┐  ┌─┴──────────────┐  ┌─┴──────────────┐
│      Root      │  │      Root      │  │      Root      │
└────────────────┘  └────────────────┘  └────────────────┘
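
The commit chain can be sketched as a small content-addressed store. This is a toy model under loud assumptions: `toy_cid` hashes canonical JSON with SHA-256 instead of producing a real CID over dag-cbor, and `toy_sign` uses an HMAC purely as a placeholder for a keypair signature. All function and field names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List, Optional

Store = Dict[str, dict]  # maps a stand-in CID to the node it identifies


def toy_cid(node: dict) -> str:
    """Stand-in for a CID: SHA-256 over canonical JSON (assumption, not dag-cbor)."""
    encoded = json.dumps(node, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def toy_sign(key: bytes, message: str) -> str:
    """Placeholder for a keypair signature (assumption: HMAC, to stay in the stdlib)."""
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()


def commit(store: Store, key: bytes, data: dict, prev: Optional[str]) -> str:
    """Store a new Root that points at the previous Commit, then sign the Root's CID."""
    root = {"data": data, "prev": prev}
    root_cid = toy_cid(root)
    store[root_cid] = root
    commit_node = {"root": root_cid, "sig": toy_sign(key, root_cid)}
    commit_cid = toy_cid(commit_node)
    store[commit_cid] = commit_node
    return commit_cid


def history(store: Store, head: Optional[str]) -> List[str]:
    """Walk Commit -> Root -> previous Commit, newest first."""
    commits = []
    while head is not None:
        commits.append(head)
        head = store[store[head]["root"]]["prev"]
    return commits


store: Store = {}
key = b"demo-key"
first = commit(store, key, {"app.bsky.feed.post": {"1234": "hello"}}, None)
second = commit(store, key, {"app.bsky.feed.post": {"1234": "hello, again"}}, first)
assert history(store, second) == [second, first]
```

Because each Root embeds the identifier of the previous Commit, changing any earlier state would change every later identifier, which is what makes the history tamper-evident in this model.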

Identifier Types

Multiple types of identifiers are used within a Personal Data Repository.

DIDs	Decentralized IDs (DIDs) identify data repositories. They are broadly used as user IDs, but because every user has one data repository, a DID can also be considered a reference to that repository. The format of a DID varies by the “DID method” used, but all DIDs ultimately resolve to a keypair and a list of service providers. This keypair can sign commits to the data repository, or it can authorize UCAN keypairs which then sign commits (see “Permissioning”).
CIDs	Content IDs (CIDs) identify content using a fingerprint hash. They are used throughout the repository to reference the objects (nodes) within it. When a node in the repository changes, its CID also changes. Parents which reference the node must then update their reference, which in turn changes the parent’s CID as well. This chains all the way to the Root node, which is then signed to produce a new commit.
TIDs	Timestamp IDs (TIDs) identify records. They are used in Collections as the keys to Records. TIDs are produced from the local device’s monotonic clock, e.g. microseconds since the Unix epoch. To reduce the potential for collisions, a 10-bit clockID is appended. The resulting number is encoded as a 13-character string in a sort-order-invariant base32 encoding (for example `3hgb-r7t-ngir-t4`, where the hyphens are separators and do not count toward the 13 characters).
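
The TID construction described above can be sketched as follows. Several details are assumptions not stated in this document: the alphabet `234567abcdefghijklmnopqrstuvwxyz` (chosen because its characters are in ascending ASCII order, so string order matches numeric order), the 53-bit limit on the timestamp, and the 4-3-4-2 hyphen grouping copied from the example. The function names are hypothetical.

```python
import time

# Assumed alphabet: 32 symbols in ascending ASCII order, so fixed-length
# strings sort the same way as the numbers they encode.
SORTABLE_BASE32 = "234567abcdefghijklmnopqrstuvwxyz"


def encode_tid(micros: int, clock_id: int) -> str:
    """Append a 10-bit clock ID to a microsecond timestamp and encode as 13 chars."""
    if not 0 <= clock_id < 1 << 10:
        raise ValueError("clock_id must fit in 10 bits")
    if not 0 <= micros < 1 << 53:  # assumed limit so the value fits in 63 bits
        raise ValueError("micros out of range")
    value = (micros << 10) | clock_id
    chars = []
    for _ in range(13):  # 13 characters x 5 bits each, least significant first
        chars.append(SORTABLE_BASE32[value & 0b11111])
        value >>= 5
    return "".join(reversed(chars))


def format_tid(tid: str) -> str:
    """Insert hyphens in the 4-3-4-2 grouping used by the example TID."""
    return f"{tid[0:4]}-{tid[4:7]}-{tid[7:11]}-{tid[11:13]}"


def new_tid(clock_id: int) -> str:
    """Build a TID from the current wall-clock time in microseconds."""
    return encode_tid(time.time_ns() // 1000, clock_id)


print(format_tid(new_tid(clock_id=7)))
```

A real implementation would also need to guarantee that successive timestamps on one device never repeat or go backwards; this sketch leaves that out.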
