On Content-Addressable Identifiers
An identifier is content-addressable if it can be derived from the content of an object. This is usually done by computing the cryptographic digest (hash) of the underlying content. With a content-addressable identifier, someone can fetch the object and validate that it is what's expected by recomputing the hash.
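As a minimal sketch, assuming a Docker-style sha256:<hex> identifier format (the function names here are illustrative, not any particular system's API), deriving and verifying such an identifier could look like this in Go:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// identify derives an identifier from the content itself.
func identify(content []byte) string {
    sum := sha256.Sum256(content)
    return "sha256:" + hex.EncodeToString(sum[:])
}

// verify recomputes the digest of fetched content and compares it
// with the identifier that was used to fetch it.
func verify(id string, content []byte) bool {
    return identify(content) == id
}

func main() {
    obj := []byte(`{"a": "b"}`)
    id := identify(obj)
    fmt.Println(id)
    fmt.Println(verify(id, obj)) // true: the content is what the identifier says it is
}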
There are several successful uses of content-based addressing:
- Git uses commit hashes to identify a point in time in a repository. The commit hash is only meaningful with respect to a given repository.
- Docker-based systems use the hash of image layers to address them. When you pull a Docker image, the registry returns a manifest containing hashes for all the layers contained in the image. The client then pulls those layers using those hashes.
Digest Agility
The concept of digest agility refers to the ability to change the digest algorithm when necessary. This can be achieved by encoding the hashing algorithm into the identifier itself. Docker does this by prefixing the digest with the algorithm name:
sha256:19236d74d6a0ffb7bdaee...
Multihash does it by prefixing the hash with a code from a lookup table:
11148a173fd3e32c0fa78b90fe42d305f202244e2739
Prefix: 11 (sha1), followed by 14 (the digest length: 20 bytes)
Either method allows multiple identifiers for the same object. That is, a sha256 identifier and a sha512 identifier can be used to access the same underlying object, if the storage mechanism supports such functionality.
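A rough sketch of what digest agility can look like in code, assuming the Docker-style <algorithm>:<hex> form (the set of supported algorithms and the function names are illustrative):

package main

import (
    "crypto/sha256"
    "crypto/sha512"
    "encoding/hex"
    "fmt"
    "hash"
    "strings"
)

// hashers maps the algorithm prefix of an identifier to a hash constructor.
// Supporting a new algorithm only requires a new entry here.
var hashers = map[string]func() hash.Hash{
    "sha256": sha256.New,
    "sha512": sha512.New,
}

// verify checks content against an identifier of the form "<algorithm>:<hex digest>".
func verify(id string, content []byte) (bool, error) {
    algo, want, ok := strings.Cut(id, ":")
    if !ok {
        return false, fmt.Errorf("malformed identifier: %s", id)
    }
    newHash, ok := hashers[algo]
    if !ok {
        return false, fmt.Errorf("unsupported digest algorithm: %s", algo)
    }
    h := newHash()
    h.Write(content)
    return hex.EncodeToString(h.Sum(nil)) == want, nil
}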
The case for normalization
If the addressed object can have multiple representations, normalization can be desirable. For example, a JSON document can have multiple equivalent representations. The following two JSON objects are identical for most practical purposes, and a normalization process may assign the same identifier to both:
{
  "a": "b",
  "c": "d"
}

{
  "c": "d",
  "a": "b"
}
There is at least one proposed JSON canonicalization standard. Among other things, this proposal specifies how to order keys in an object, so that the resulting JSON representation is "hashable". This scheme requires that JSON data be normalized before a digest can be computed from it.
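To make the ordering point concrete, here is a minimal sketch in Go. It is not an implementation of the canonicalization proposal (which also fixes number and string formatting, among other things); it only illustrates the key-ordering aspect by round-tripping through a map, since encoding/json writes map keys in sorted order:

package main

import (
    "crypto/sha256"
    "encoding/json"
    "fmt"
)

// normalizedDigest parses a JSON object, re-serializes it with sorted keys,
// and hashes the result.
func normalizedDigest(doc []byte) ([32]byte, error) {
    var v map[string]any
    if err := json.Unmarshal(doc, &v); err != nil {
        return [32]byte{}, err
    }
    canon, err := json.Marshal(v) // map keys are emitted in sorted order
    if err != nil {
        return [32]byte{}, err
    }
    return sha256.Sum256(canon), nil
}

func main() {
    a, _ := normalizedDigest([]byte(`{"a":"b","c":"d"}`))
    b, _ := normalizedDigest([]byte(`{"c":"d","a":"b"}`))
    fmt.Println(a == b) // true: both representations get the same digest
}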
The JavaScript standard uses declared property order, but this is not interoperable. When JSON is unmarshaled to construct language-specific data structures, the ordering is lost. Thus, if a JavaScript program sends data to a Go backend that unmarshals the JSON into internal structures and then marshals it back to JSON, the order will be different.
Normalizing a JSON-LD document is even more problematic. Most JSON-LD files are not self-contained. Because of the reference to a (usually external) @context, the meaning of the underlying information cannot be determined from a JSON-LD file alone. This makes hashing and signing JSON-LD data tricky.
Computing the digest of a JSON-LD file containing a context reference is flawed. If the context changes, the underlying meaning of the JSON-LD file also changes, but the hash stays the same.
One can hash the JSON-LD file after expanding the context, thus making it a self-contained JSON file. This brings back the problem with the ordering of JSON objects and the need for canonicalization. JSON-LD specifically states that unless explicitly specified, all collections are sets, i.e. unordered.
An alternative is to convert JSON-LD to RDF, and then to sort the resulting n-quads. One possible problem with this approach is that there is no single RDF representation for the same underlying data (because of the possibility of arbitrary insertion of blank nodes). This can be avoided by using a canonical method to insert and name those blank nodes. Apparently, this is how it is done in the VC community. This approach is not very practical because the RDF representation can be quite verbose.
Similarly, an XML document may have multiple representations. The following XML documents are identical if whitespace behavior is "collapse":

<a>
text
</a>

vs.

<a>text</a>

However, these two documents are different if there is no schema to normalize the document, or if the whitespace behavior is not "collapse".
The whitespace behavior can be modified using an XML schema. Thus, when normalizing XML documents, the document must first be processed using the associated schema, and then normalized. That also brings back the problem of hashing non-self-contained objects: validating the hash for an object requires another object, the schema, which should also be validated by computing its hash.
Self-containment (i.e. no external references)
One point common in the discussion about normalization is that the normalized document is self-contained: even if the original document had external references (such as the @context of a JSON-LD document), the normalized document does not. The normalization process achieves two things:
- Removes all external references
- Removes semantically insignificant differences
I argue that the first point (self-containment) is necessary, but the second is not.
The case against normalization
Normalization ensures that two parties compute identical digests for an object. That is, if A and B are considered equal, then
Digest(Normalize(A)) == Digest(Normalize(B))
Here, what it means for A and B to be "equal" is defined based on the use case.
This approach increases the attack surface for implementations. Consider the following interaction between an application and a storage system:
sequenceDiagram
    App ->> Storage: request(id)
    Storage ->> App: Obj
    App ->> App: Parse
    App ->> App: Normalize
    App ->> App: Verify digest
The application requests an object from the storage, and then validates its digest by first normalizing it. That is, the normalization process runs before verification, which means normalization is attempted on potentially compromised data.
A more secure approach would be:
sequenceDiagram
    App ->> Storage: request(id)
    Storage ->> App: Obj
    App ->> App: Verify digest
    App ->> App: Parse
Also note that the application requests an object with a given identifier, and then uses that identifier to validate the content; the identifier is already given. The object storage is the only entity in this interaction that needs to compute the digest of an object, and that can be done during object creation. If necessary, the object storage can normalize the content as well. All consumers of that object will refer to it using the registry-computed hash (i.e. consumers will not compute a hash themselves except for validation, which is done by treating the object as a blob). So normalization is only necessary for the content-addressable object registry, and only when the object is first created. One or more identifiers will be assigned to the object at that point, and consumers will use those identifiers only to access and validate the contents of the object. A consumer of the identifier or the object does not need to normalize that object for validation purposes.
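A sketch of the safer ordering, where the digest is checked over the raw bytes as received and parsing happens only afterwards. This assumes the digest-agile verify function sketched earlier; ObjectStore is an illustrative interface, not a specific library's API:

// ObjectStore is an illustrative interface for the storage system.
type ObjectStore interface {
    Get(id string) ([]byte, error)
}

// fetchAndParse verifies the digest over the wire representation first,
// and parses only bytes that have passed verification.
func fetchAndParse(store ObjectStore, id string) (map[string]any, error) {
    raw, err := store.Get(id) // raw, untrusted bytes
    if err != nil {
        return nil, err
    }
    if ok, err := verify(id, raw); err != nil || !ok {
        return nil, fmt.Errorf("object %s failed digest verification", id)
    }
    var doc map[string]any
    if err := json.Unmarshal(raw, &doc); err != nil { // parse only verified bytes
        return nil, err
    }
    return doc, nil
}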
Treating content-addressable entities as blobs works well for JSON and other self-contained document models (e.g. XML, YAML). It does not work for JSON-LD, because the meaning of a JSON-LD document can be modified by modifying an external context file without affecting the digest. This can be solved by using expanded contexts: then JSON-LD is simply JSON, and can be treated as a blob. It also does not work for XML documents with an XML schema, because the schema may alter the meaning of an XML document (whitespace behavior primarily, but that can be important for text content).
Manifests
An object manifest can be used to deal with the self-containment problem. If an object has references to other objects (i.e. it is not self-contained), an object manifest lists identifiers for all the necessary pieces, so a consumer can retrieve the object with all its relevant pieces and validate each one of those objects as a blob without any preprocessing. It may look like this:
{
  "id": [
    "sha256:abcdef...",
    "sha512:123455....",
    ...
  ],
  "parts": {
    "part1": "<id for part1>",
    "part2": "<id for part2>",
    ...
  }
}
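In Go, such a manifest might be modeled roughly like this. This is a sketch, not a defined format: the field names mirror the example JSON above, and the id field assumes the multi-identifier form:

// Manifest is one possible in-memory shape for the manifest above.
type Manifest struct {
    // ID lists one or more content-addressable identifiers for the object,
    // possibly computed with different digest algorithms.
    ID []string `json:"id"`
    // Parts maps a part name (or a reference such as a context URL) to the
    // content-addressable identifier of that part.
    Parts map[string]string `json:"parts"`
}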
For example, an application attempts to access a JSON-LD document using its digest:
GET /www.example.org/objects/sha256:abcdef...
The request returns the manifest:
{
  "id": "sha256:abcdef...",
  "parts": {
    "main": "idmain",
    "https://example.org/context": "idcontext"
  }
}
The application can validate the integrity of the manifest using its wire representation, and then parse it. It loads main next:
GET /www.example.org/objects/idmain
which returns:
{
  "@context": "https://example.org/context",
  ...
}
The application can again validate the JSON-LD document using its wire representation. Now the application processes the JSON-LD document using a resolver that maps https://example.org/context to the address given in the manifest:
GET /www.example.org/context/idcontext
The application validates the returned context document, and processes the JSON-LD document.
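Under the same assumptions as the earlier sketches (the illustrative ObjectStore interface and the digest-agile verify function), the whole flow could look roughly like this:

// loadDocument fetches and verifies the manifest as a blob, then fetches and
// verifies each part as a blob. It returns the main document plus a map that a
// JSON-LD processor can use to resolve "https://example.org/context" locally,
// instead of dereferencing it over the network.
func loadDocument(store ObjectStore, manifestID string) (mainDoc []byte, parts map[string][]byte, err error) {
    raw, err := store.Get(manifestID)
    if err != nil {
        return nil, nil, err
    }
    if ok, err := verify(manifestID, raw); err != nil || !ok {
        return nil, nil, fmt.Errorf("manifest %s failed digest verification", manifestID)
    }
    var m struct {
        Parts map[string]string `json:"parts"`
    }
    if err := json.Unmarshal(raw, &m); err != nil {
        return nil, nil, err
    }
    parts = make(map[string][]byte)
    for name, partID := range m.Parts {
        blob, err := store.Get(partID)
        if err != nil {
            return nil, nil, err
        }
        if ok, err := verify(partID, blob); err != nil || !ok {
            return nil, nil, fmt.Errorf("part %q failed digest verification", name)
        }
        parts[name] = blob // verified wire representation, keyed by part name
    }
    return parts["main"], parts, nil
}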
This scheme does not need any normalization: it is self-contained, and validations are performed on the wire representations of objects.