Browse Source

pkg/tarsum: specification on TarSum checksum

Signed-off-by: Vincent Batts <vbatts@redhat.com>
Vincent Batts 10 years ago
parent
commit
f30fee69b1
1 changed files with 228 additions and 0 deletions
  1. 228 0
      pkg/tarsum/tarsum_spec.md

+ 228 - 0
pkg/tarsum/tarsum_spec.md

@@ -0,0 +1,228 @@
+page_title: TarSum checksum specification
+page_description: Documentation for algorithm used in the TarSum checksum calculation
+page_keywords: docker, checksum, validation, tarsum
+
+# TarSum Checksum Specification
+
+## Abstract
+
+This document describes the algorithms used in performing the TarSum checksum
+calculation on file system layers, the need for this method over existing
+methods, and the versioning of this calculation.
+
+
+## Introduction
+
+The transportation of file systems, regarding docker, is done with tar(1)
+archives. Types of transpiration include distribution to and from a registry
+endpoint, saving and loading through commands or docker daemon APIs,
+transferring the build context from client to docker daemon, and committing the
+file system of a container to become an image.
+
+As tar archives are used for transit, but not preserved in many situations, the
+focus of the algorithm is to ensure the integrity of the preserved file system,
+while maintaining a deterministic accountability. This includes neither
+constrain the ordering or manipulation of the files during the creation or
+unpacking of the archive, nor include additional metadata state about the file
+system attributes.
+
+
+## Intended Audience
+
+This document is outlining the methods used for consistent checksum calculation
+for file systems transported via tar archives.
+
+Auditing these methodologies is an open and iterative process. This document
+should accommodate the review of source code. Ultimately, this document should
+be the starting point of further refinements to the algorithm and its future
+versions.
+
+
+## Concept
+
+The checksum mechanism must ensure the integrity and confidentiality of the
+file system payload.
+
+
+## Checksum Algorithm Profile
+
+A checksum mechanism must define the following operations and attributes:
+
+* associated hashing cipher - used to checksum each file payload and attribute
+  information.
+* checksum list - each file of the file system archive has its checksum
+  calculated from the payload and attributes of the file. The final checksum is
+  calculated from this list, with specific ordering.
+* version - as the algorithm adapts to requirements, there are behaviors of the
+  algorithm to manage by versioning.
+* archive being calculated - the tar archive having its checksum calculated
+
+
+## Elements of TarSum checksum
+
+The calculated sum output is a text string. The elements included in the output
+of the calculated sum comprise the information needed for validation of the sum
+(TarSum version and block cipher used) and the expected checksum in hexadecimal
+form.
+
+There are two delimiters used:
+* '+' separates TarSum version from block cipher
+* ':' separates calculation mechanics from expected hash
+
+Example:
+
+	"tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e"
+	|         |       \                                                               |
+	|         |        \                                                              |
+	|_version_|_cipher__|__                                                           |
+	|                      \                                                          |
+	|_calculation_mechanics_|______________________expected_sum_______________________|
+
+
+## Versioning
+
+Versioning was introduced [0] to accommodate differences in calculation needed,
+and ability to maintain reverse compatibility.
+
+The general algorithm will be describe further in the 'Calculation'.
+
+### Version0
+
+This is the initial version of TarSum.
+
+Its element in the checksum "tarsum"
+
+
+### Version1
+
+Its element in the checksum "tarsum.v1"
+
+The notable changes in this version:
+* exclusion of file mtime from the file information headers, in each file
+  checksum calculation
+* inclusion of extended attributes (xattrs. Also seen as "SCHILY.xattr." prefixed Pax
+  tar file info headers) keys and values in each file checksum calculation
+
+### VersionDev
+
+*Do not use unless validating refinements to the checksum algorithm*
+
+Its element in the checksum "tarsum.dev"
+
+This is a floating place holder for a next version. The methods used for
+calculation are subject to change without notice.
+
+## Ciphers
+
+The official default and standard block cipher used in the calculation mechanic
+is "sha256". This refers to SHA256 hash algorithm as defined in FIPS 180-4.
+
+Though the algorithm itself is not exclusively bound to this single block
+cipher, and support for alternate block ciphers was later added [1]. Presently
+use of this is for isolated use-cases and future-proofing the TarSum checksum
+format.
+
+## Calculation
+
+### Requirement
+
+As mentioned earlier, the calculation is such that it takes into consideration
+the life and cycle of the tar archive. In that the tar archive is not an
+immutable, permanent artifact. Otherwise options like relying on a known block
+cipher checksum of the archive itself would be reliable enough.  Since the tar
+archive is used as a transportation medium, and is thrown away after its
+contents are extracted. Therefore, for consistent validation items such as
+order of files in the tar archive and time stamps are subject to change once an
+image is received.
+
+
+### Process
+
+The method is typically iterative due to reading tar info headers from the
+archive stream, though this is not a strict requirement.
+
+#### Files
+
+Each file in the tar archive have their contents (headers and body) checksummed
+individually using the designated associated hashing cipher. The ordered
+headers of the file are written to the checksum calculation first, and then the
+payload of the file body.
+
+The resulting checksum of the file is appended to the list of file sums. The
+sum is encoded as a string of the hexadecimal digest. Additionally, the file
+name and position in the archive is kept as reference for special ordering.
+
+#### Headers
+
+The following headers are read, in this
+order ( and the corresponding representation of its value):
+* 'name' - string
+* 'mode' - string of the base10 integer
+* 'uid' - string of the integer
+* 'gid' - string of the integer
+* 'size' - string of the integer
+* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC
+* 'typeflag' - string of the char
+* 'linkname' - string
+* 'uname' - string
+* 'gname' - string
+* 'devmajor' - string of the integer
+* 'devminor' - string of the integer
+
+For >= Version1, the extented attribute headers ("SCHILY.xattr." prefixed pax
+headers) included after the above list. These xattrs key/values are first
+sorted by the keys.
+
+
+#### Header Format
+
+The ordered headers are written to the hash in the format of
+
+	"{.key}{.value}"
+
+with no newline.
+
+
+#### Body
+
+After the order headers of the file have been added to the checksum for the
+file, then the body of the file is written to the hash.
+
+
+#### List of file sums
+
+The list of file sums is sorted by the string of the hexadecimal digest.
+
+If there are two files in the tar with matching paths, the order of occurrence
+for that path is reflected for the sums of the corresponding file header and
+body.
+
+
+#### Final Checksum
+
+Using an initialize hash of the associated hash cipher, if there is additional
+payload to include in the TarSum calculation for the archive, it is written
+first. Then each checksum from the ordered list of files sums is written to the
+hash. The resulting digest is formatted per the Elements of TarSum checksum,
+including the TarSum version, the associated hash cipher and the hexadecimal
+encoded checksum digest.
+
+
+## Security Considerations
+
+The initial version of TarSum has undergone one update that could invalidate
+handcrafted tar archives. The tar archive format supports appending of files
+with same names as prior files in the archive. The latter file will clobber the
+prior file of the same path. Due to this the algorithm now accounts for 
+
+
+## Footnotes
+
+* [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0
+* [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e
+
+## Acknowledgements
+
+Joffrey F (shin-) and Guillaume J. Charmes (creack) on the initial work of the
+TarSum calculation.
+