Fixity Checks:
Checksums, Message Digests and Digital Signatures
Audrey Novak, ILTS
Digital Preservation Committee
November 2006
Introduction:
Fixity, in preservation terms, means
that the digital object has not been changed between two points in time[klg221] [1]
or events. Technologies such as checksums, message digests and digital
signatures are used to verify [klg222] a
digital object’s fixity. Fixity information, the information created by these
fixity checks, provides evidence for the integrity and authenticity of the
digital objects and is essential to enabling trust. To ensure trust, [klg223] for
example, a repository may generate a checksum, message digest or digital
signature as part of the ingest process for submission of content. Then at
regular intervals using the same algorithm, the repository can regenerate the
values and compare them to the original. If
the values are the same, a trusted repository has evidence that there have been
no unauthorized changes to a digital object.
[klg224] A
fixity check may be used to verify that any action taken upon the digital
resource does not alter the resource. Without such evidence, one is unable to
prove that the action did not alter the digital resource. Many events in the
digital resource’s lifecycle introduce the possibility of unintentional change,
including: submission, retrieval, migration, transfer to other media, network
transmission, or simply the passage of time.
Fixity checks are all used in the same basic way. A value is initially generated and saved[klg226] . It is then recomputed and compared to the original to confirm that the object (file or bitstream) has not changed. Despite this similarity, not all fixity checks are the same. Although the terms are frequently used interchangeably, checksums, message digests and digital signatures are, in fact, very different tools.
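The generate, save, recompute and compare cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `file_digest` and `fixity_unchanged`, the choice of SHA-256, and the temporary file standing in for a digital object are all hypothetical and are not drawn from any repository's actual procedures.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def fixity_unchanged(path, stored_value, algorithm="sha256"):
    """Recompute the value and compare it with the one saved earlier."""
    return file_digest(path, algorithm) == stored_value


# A temporary file stands in for a digital object (illustrative only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example digital object")
    demo_path = tmp.name

stored = file_digest(demo_path)            # value generated and saved at ingest
check_before = fixity_unchanged(demo_path, stored)

with open(demo_path, "ab") as f:           # simulate a later, unintended change
    f.write(b"!")
check_after = fixity_unchanged(demo_path, stored)

os.remove(demo_path)
```

In this sketch `check_before` is true and `check_after` is false: the comparison reports that the object changed, but not how or by how much.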
A checksum is the simplest and least secure method [klg227] of verifying fixity. Checksums [klg228] are typically used in error detection to find accidental problems [klg229] in transmission and storage. The least complicated checksum algorithms do not account for such changes as the re-ordering of bytes or changes that cancel one another out. More robust checksums, such as the cyclic redundancy check (CRC), are hash functions that do detect such changes. Because of the comparative simplicity of their mathematical algorithms, however, checksums are vulnerable to deliberate and malicious tampering[klg2210] .
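The difference between a very simple checksum and a CRC can be illustrated with a small Python sketch. The `additive_checksum` function below is a hypothetical example of a "least complicated" algorithm (a sum of byte values modulo 256), chosen only for illustration; `zlib.crc32` is the CRC-32 implementation in the Python standard library.

```python
import zlib


def additive_checksum(data: bytes) -> int:
    """Sum of all byte values modulo 256 (a deliberately simple checksum)."""
    return sum(data) % 256


original = b"fixity check"
reordered = b"fixity chcek"   # same bytes, with two adjacent bytes swapped

# The additive checksum ignores byte order, so the change goes unnoticed.
same_sum = additive_checksum(original) == additive_checksum(reordered)

# CRC-32 is sensitive to byte position, so the swap is detected.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)
```

Here `same_sum` is true while `same_crc` is false. Neither value is intended to resist a deliberate attacker; both are error-detection tools.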
Unlike checksums, cryptographic hash functions, which produce message digests, are designed to resist deliberate tampering. A message digest is computed by applying an algorithm to a file of any length to produce a short, fixed-length character string that is, for practical purposes, unique to that file. What makes message digests more secure than checksums is the complexity of the algorithm. A message digest is like a fingerprint for a digital object. For example, a message digest for a digital image JPEG file is: 97b3847a4ac1dcb037e2a7914c6f684d. And a message digest for an audio MP3 file is: 93326bff6636655dcd6abff18ed2de997. Change one pixel or one note in the files and their message digests will be completely different. (Note, however, that they will always be the same length when computed using the same algorithm.) Hashes are one-way operations: a hash can be created from a digital object, but the digital object cannot be re-created from the hash. (Encryption[klg2211] , on the other hand, is a two-way operation.)[2] MD5 (http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html) and the Secure Hash Algorithm, SHA-1 (http://www.w3.org/PICS/DSig/SHA1_1_0.html), are commonly used cryptographic hash algorithms.
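The fingerprint behaviour of a message digest can be seen with Python's standard `hashlib` module. The two input strings below are arbitrary illustrations, not data from this report; they differ by a single character.

```python
import hashlib

data_a = b"The quick brown fox jumps over the lazy dog"
data_b = b"The quick brown fox jumps over the lazy cog"   # one character changed

digest_a = hashlib.md5(data_a).hexdigest()
digest_b = hashlib.md5(data_b).hexdigest()

# Both digests have the same length, whatever the size of the input.
lengths = (len(digest_a), len(digest_b))

# Count how many of the hexadecimal positions differ between the two digests.
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)
```

Both MD5 digests are 32 hexadecimal characters long, and a one-character change in the input typically alters most positions of the digest. Nothing in `digest_a` allows the original input to be reconstructed.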
Digital signatures combine a hash message digest with encryption. A digital signature starts with the creation of a message digest from the digital object. “The message digest is then encrypted using asymmetric cryptography. Asymmetric cryptography is based on using a pair of keys: a private key to encrypt and a public key to decrypt. The private key must be held secretly and securely by the signer…The signature can be verified by decrypting the signature with the signer’s public key and comparing the now-decrypted digest with a new digest produced by the same algorithm from the same content.” [3]
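The digest-then-sign sequence in the quotation can be sketched with textbook RSA arithmetic. This is a toy under loud assumptions: the key is built from two small, publicly known primes, the digest is reduced modulo the key so that it fits, and there is no padding scheme, so it offers no real security. The names `sign`, `verify` and `digest_as_int` are hypothetical. A production system would rely on a vetted cryptographic library instead.

```python
import hashlib

# Toy key pair built from two known Mersenne primes. Far too small for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537                                   # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))           # private exponent (Python 3.8+)


def digest_as_int(content: bytes) -> int:
    """SHA-256 digest of the content, reduced modulo N to fit the toy key."""
    return int.from_bytes(hashlib.sha256(content).digest(), "big") % N


def sign(content: bytes) -> int:
    """Transform the message digest with the private key."""
    return pow(digest_as_int(content), D, N)


def verify(content: bytes, signature: int) -> bool:
    """Recover the digest with the public key and compare it to a fresh digest."""
    return pow(signature, E, N) == digest_as_int(content)


signature = sign(b"archival object")
valid = verify(b"archival object", signature)
tampered = verify(b"archival object (altered)", signature)
```

In this sketch `valid` is true and `tampered` is false. Only the holder of `D` can produce a signature that verifies, whereas anyone holding `E` and `N` can check it.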
As the PREMIS Final report outlines, digital signatures are [klg2212] used in preservation repositories in three ways.
• For submission to the repository, an agent (author or submitter) might sign an object to assert that it truly is the author or submitter. [klg2213]
• For dissemination from the repository, the repository may sign an object to assert that it truly is the source of the dissemination. [klg2214]
• For archival storage, a repository may sign an object so that it will be possible to confirm the origin [klg2215] and integrity of the data. In this case, the signature itself and the information needed to validate the signature must be preserved[klg2216] .
Despite this heavy emphasis in PREMIS on digital signatures, in practice most repositories that support some level of preservation rely on message digests, rather than simple checksums or digital signatures, to ensure fixity.
A quick survey of ingest procedures indicates that the MD5 hash algorithm is ubiquitous in preservation repositories. No evidence was found of the creation and validation of checksums alone or of digital signatures.
The Yale University
Library Rescue Repository employs the SHA-1 and MD5 algorithms upon ingest. The resulting
values are saved in a companion file along with the ingested resource. [klg2217]
At Harvard an MD5
hash is required when an object is deposited in a Harvard Repository SFTP drop
box. It is used to validate the integrity of the digital object during
transfer from the depositor's system to the drop box and into the repository
itself. It is also used to check file validity within the repository over time.
The NDIIPP Architecture and the Archive Ingest & Handling Test (AIHT) is designed to test the feasibility of transferring digital archives from one institution to another. Phases of the test will assess the process of digital ingest, document useful practices, maximize automated handling of digital material, and identify areas that require further research or development (Section 2.3.1 of the AIHT Statement of Work (SOW) 26504–01, 12/19/2003). Participants include the Library of Congress, Old Dominion, Johns Hopkins, Stanford and Harvard. The test made a great deal of information available about general ingest procedures. All of these AIHT projects included MD5 for pre- and post-ingest fixity verification. (http://www.digitalpreservation.gov/index.php?nav=3&subnav=14)
Issues to Consider Regarding Fixity Checks
A mismatch between message digests generated over time from the same source file indicates either a change in the source file content or a problem with the message digest itself. Conversely, a match between the digests is not an absolute guarantee that the source file content is unchanged, although the probability of a false match is small. “Digest algorithms are inherently subject to collisions, in
which two different inputs generate the same digest. Digest algorithms are
designed to make collisions unlikely, but some of the assumptions underlying
these designs do not hold in digital preservation applications. For example,
the analysis of the algorithm normally assumes that the input is a random
string of bits, which for digital preservation is unlikely.” (http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html)
Additionally, over time the
hash algorithms used to generate the message digests can become
vulnerable. “Recently, for example, the widely used MD5 and SHA1 algorithms
appear to have been broken. A digital
preservation system that audits against previous message digests must
preemptively replace its digest algorithm with a new one before the current one
is broken. To do so, it should audit against the current digest to confirm that
the item is still good then compute a digest using the replacement algorithm.
This should be appended to the stored list of digests for the item.” (http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html)
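The replacement procedure quoted above (audit against the current digest, compute a digest with the new algorithm, append it to the stored list) might look like the following Python sketch. The function name `migrate_digest`, the list-of-pairs record layout, and the choice of SHA-256 as the replacement algorithm are assumptions made for illustration only.

```python
import hashlib


def migrate_digest(content: bytes, digests: list, new_algorithm: str) -> list:
    """Audit against the newest stored digest, then append a new one.

    `digests` is a list of (algorithm, hex_value) pairs, oldest first.
    """
    current_algorithm, current_value = digests[-1]
    if hashlib.new(current_algorithm, content).hexdigest() != current_value:
        raise ValueError("audit failed: content does not match stored digest")
    new_value = hashlib.new(new_algorithm, content).hexdigest()
    return digests + [(new_algorithm, new_value)]


item = b"preserved content"
history = [("md5", hashlib.md5(item).hexdigest())]   # digest recorded at ingest
history = migrate_digest(item, history, "sha256")    # preemptive replacement
```

Earlier digests are kept rather than overwritten, so the stored list documents an unbroken chain of audits back to ingest.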
Finally, but most importantly, generating, storing, replicating, and verifying message digests is not without significant staff and system costs. A tension exists between the value of the content and the cost of this level of protection[klg2220] . Should message digests be created, periodically verified, and maintained for all digital assets or should different levels of service be established?[klg2221]
Recommendations and Best Practices for Yale University Library
Specifically in the Rescue Repository and any other repository managed by YUL:
· Message digests should be protected with the same rigor that is applied to the original content file in terms of redundancy and replication. Additionally, to protect against system failures, operator error and malicious attacks, message digests should be stored separately from the original source content. More than one message digest, each using a different algorithm, should be generated.
· Using the Rescue Repository environment, a test conversion to a new message digest algorithm should be planned in order to help establish procedures that may be needed for preemptive replacement.
· Also using the Rescue Repository environment, a periodic message digest verification process should be implemented in order to confirm the integrity of the source content over time and, equally importantly, to better understand the level of effort and system resources required for this level of service.
· Message digests should be maintained and managed within the PREMIS framework[klg2223] .
· Best practices within the digital preservation community regarding fixity checks should be continually reviewed. Particular attention should be applied to the use of digital signatures.
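Two of the recommendations above, generating more than one digest per object and verifying digests periodically, can be sketched together in Python. Everything here is hypothetical: the identifiers, the in-memory `objects` dictionary standing in for repository storage, the separate `manifest` dictionary standing in for a digest store kept apart from the content, and the choice of MD5 plus SHA-256.

```python
import hashlib


def multi_digest(content: bytes, algorithms=("md5", "sha256")) -> dict:
    """Compute one digest per algorithm for the same content."""
    return {name: hashlib.new(name, content).hexdigest() for name in algorithms}


def audit(objects: dict, manifest: dict) -> list:
    """Return identifiers whose recomputed digests disagree with the manifest."""
    failures = []
    for identifier, content in objects.items():
        expected = manifest.get(identifier)
        if expected is None or multi_digest(content, tuple(expected)) != expected:
            failures.append(identifier)
    return failures


# Content store and digest manifest are kept as separate structures.
objects = {"img-001": b"image bytes", "aud-002": b"audio bytes"}
manifest = {identifier: multi_digest(content) for identifier, content in objects.items()}

clean_run = audit(objects, manifest)     # first periodic verification
objects["aud-002"] = b"audio bytez"      # simulate silent corruption
second_run = audit(objects, manifest)    # next periodic verification
```

The first run reports no failures and the second reports only `aud-002`. Timing such a loop over a realistic volume of content is one way to estimate the system resources that periodic verification would require.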
PREMIS: Fixity




PREMIS: Digital Signature




[klg221]Perhaps start with a definition of the problem first. According to The Long-term Preservation of Authentic Electronic Records: Findings of the InterPARES Project, The accuracy and authenticity of digital resources are most at risk when they are transmitted across space (that is, when sent between persons, systems, or applications) or time (that is, either when they are stored offline, or when the hardware or software used to process, communicate, or maintain them is upgraded or replaced).
This is important because we can’t keep digital resources on the same media (contrary to what previous YUL practice might lead you to believe). To counteract media decay and hardware obsolescence, digital preservation will cause us to continually refresh and/or migrate the digital resources in the preservation repository. This increases the threat to accuracy and authenticity. [There is no such thing as a “dark archive”]
One way to help ensure the accuracy and authenticity of digital resources being preserved (moving through space and time) is to measure, document, and protect the fixity of the digital resources stored.
[klg222]They don’t really verify, only produce numbers that could be compared to verify fixity.
[klg223]This initial clause is unnecessary and weakens this strong sentence.
[klg224]Matching fixity check outcomes serves as evidence that the digital resource has not changed.
[klg225]You have already given the definition of fixity check, perhaps “Definitions” and even though you define the other terms here, perhaps this isn’t the best title for this section.
[klg226]I think it would be a good idea to include a very basic definition of how this works. [Move up your explanation from the cryptographic hash functions section here] Everybody should understand that each fixity check applies an algorithm to all the bits in a document or digital resource, with a long string being the product. Any change in a document (changing a single letter of text or a single pixel of an image) will result in a change to the subsequent fixity check product. This doesn't tell us how much the digital resource was changed, only that it was changed. This can be used as evidence because the math is so complicated that it is a great deal of work to purposefully change a digital resource in such a way that a fixity check would miss. The most basic difference between many of the different kinds of fixity checks is that they apply different, more complex algorithms. If you want to explain that checksums and cryptographic hash functions are two totally different classes or types of fixity checks, you should create a section titled types of fixity checks and go through each type separately, describing how they differ from each other.
[klg227]Why is security an issue? There should be some explanation of why I need a more secure fixity check.
[klg228]Maybe you should explain that many people incorrectly use all these term interchangeably. This report will present to approved definitions.
[klg229]This is the first point that you bring up malicious attack. This should be explained as a reason for fixity checks earlier on when you talk about trust.
[klg2210]Tampering of the fixity check itself or tampering of the digital resource or both?
[klg2211]Is this a different type of fixity check? Do you even want to get into encryption in this document? If so, perhaps encryption should get its own paragraph or bullet.
[klg2212]I like can be better
[klg2213]This is a different use of the tools described in this report. One might use PKI to sign a document, but the act of proving who it came from is separate from the act of checking fixity. Instead, “for submission to the repository, the Archive might verify that the digital resources received are exactly the same as those submitted by the Producer.
[klg2214]This is also not a fixity check, “for dissemination from the repository, the repository might verify that the digital resources received by the Consumer are exactly the same as those disseminated by the Archive.
[klg2215]This can’t confirm the origin, only the integrity.
[klg2216]This is true in every case.
[klg2217]For completeness, you could describe the rudimentary University Archives practice, which is to manually create MD5 algorithms of each datastream and store them separate from the digital resources along with the inventories. We run the fixity check on the medium that each Producer transfers to us. We don’t have any automated facility to verify the products through time. We don’t give anybody this dissemination fixity checks (most of my stuff is still restricted).
[klg2218]You might want to say something about Fedora. The Fedora Preservation Services Working Group recommended that the ability to undertake fixity checks and store the resulting products be included in future Fedora Core development (this actually came from Eliot and me). Sandy told me recently that they are working to incorporate this service in the next Fedora release in December or January. I attached some messages of interest to this email. I don’t know what this means for the delayed Vital cycle.
[klg2219]It doesn’t ensure this, only adds to a presumption of authenticity as judged by the Consumer.
[klg2220]This isn’t protection, but instead a stronger audit trail or ability to prove if something did or did not happen.
[klg2221]The decision depends on the sliding scale of risk tolerance. If one doesn’t care about the authenticity of digital resources, then maybe one doesn’t require any fixity checks.
[klg2222]The first recommendation should be whether or not to utilize fixity checks and in which situations to do so.
[klg2223]This may not match with the plans of the Fedora team. See the documents attached to this email for more information.