Matching with Mismatches and Assorted Applications

Colin Percival
Wadham College
University of Oxford

A thesis submitted for the degree of Doctor of Philosophy
Hilary 2006

To the Chemist, the Teacher, and the Composer,
The Medic, the Musician, and the Conductor,
Without whom this would not have happened;
And to the City of Oxford,
For reminding me how short three years can be.

Acknowledgements

In no particular order, I would like to thank: Graham Percival, for building and repairing computer systems while I've been thousands of miles distant; Michael Nottebrock, for asking the innocent question on a public mailing list which first made me consider the problem of delta compression; Robert Muth, for searching through four year old backups to retrieve the "upgrade" corpus used in Table 2.1 and Table 3.1; My supervisors, for allowing me the freedom to wander between topics until I decided to write this thesis, yet being available if I ever wished to discuss anything; Richard Brent, Richard Bird, and Peter Jeavons, for reading and commenting on early draft(s) of this work; The Commonwealth Scholarship Commission in the United Kingdom, for funding my years in Oxford; and finally, the dedicatees, who have each helped, in their own way, to keep me sane for the past π years.

Abstract

This thesis consists of three parts, each of independent interest, yet tied together by the problem of matching with mismatches. In the first chapter, we present a motivated exposition of a new randomized algorithm for indexed matching with mismatches which, for constant error (substitution) rates, locates a substring of length m within a string of length n faster than existing algorithms by a factor of O(m/ log(n)).

The second chapter turns from this theoretical problem to an entirely practical concern: delta compression of executable code. In contrast to earlier work, which has either generated very large deltas when applied to executable code, or has generated small deltas by utilizing platform- and processor-specific knowledge, we present a naïve approach — that is, one which does not rely upon any external knowledge — which nevertheless constructs deltas of size comparable to those produced by a platform-specific approach. In the course of this construction, we utilize the result from the first chapter, although it is of primary utility only when producing deltas between very similar executables.

The third chapter lies between the horn and ivory gates, being both highly interesting from a theoretical viewpoint and of great practical value. Using the algorithm for matching with mismatches from the first chapter, combined with error correcting codes, we give a practical algorithm for "universal" delta compression (often called "feedback-free file synchronization") which can operate in the presence of multiple indels and a large number of substitutions.

Contents

List of Symbols for Chapter 1
Preface
0 Introduction
1 Matching with Mismatches
  1.1 Introduction
  1.2 Problem statement
  1.3 Projective searching
  1.4 Searching with fuzzy projections
  1.5 A Bayesian reformulation
  1.6 Final notes
2 Delta Compression of Executable Code
  2.1 Introduction
  2.2 Differences of executable code
  2.3 Matching
  2.4 Block alignment
  2.5 Local alignment
  2.6 Combining local and block alignment
  2.7 Delta encoding
  2.8 Performance
  2.9 Conclusions
3 Universal Delta Compression
  3.1 Introduction
  3.2 Error correcting codes
  3.3 Randomized large-codeword error correction over GF(2^8)
  3.4 Rsync revisited
  3.5 Block alignment
  3.6 Practical improvements
  3.7 Performance
Appendix A: Source code and data
Epilogue: A Rapid String Similarity Metric
Bibliography

List of Symbols for Chapter 1

log  Natural logarithm
e  Base of the natural logarithm
π(x)  Number of positive prime integers less than or equal to x
Σ  Alphabet
S  String over Σ within which T is being searched
S̄  Index generated from S
S′  Substrings of S which approximately match T
n  Length of S
T  String over Σ which is being searched for
m  Length of T
δ(x, y)  Indication of how well symbols x, y match; usually restricted to {0, 1}
V  Match count vector; V_i is the sum of δ(S_{i+j}, T_j)
V̄  Randomized approximation to V
X  Offsets within S where T approximately matches
X̄  Computed estimate of X
t  |X|
p  Probability of symbols from S, T matching at offsets in X
k  Permitted edit distance (Section 1.1)
k  Number of projections used
ε  Permitted probability of error
L  Minimum useful size of groups onto which projections are made
P  Set of primes in the range [L, L(1 + 2/ log L))
p_i  Sizes of groups onto which projections are made
φ_i  Random projections from Σ → {−1, 1}
A^(i)  Projection of S via φ_i, Z_{p_i} from Σ^n → R^{p_i}
B^(i)  Projection of T via φ_i, Z_{p_i} from Σ^m → R^{p_i}
X^(i)  Computed approximation of X mod p_i
σ_{p_i}(n, m, k)  |{(i, j) < (n, m) : i ≡ j + k mod p_i}|
c_{i,j}  |X ∩ (j + p_i Z)|

Preface

This thesis is, in large part, a tribute to the irrepressibility of the human mind. When I started my D.Phil studies in October 2001, I was planning to work on parallel computing — specifically, attacking large computational problems1 over the Internet using spare computing power, storage, and network bandwidth. I was making steady progress towards this when illness intervened, first sending me to hospital in January 2003, and subsequently making it difficult to concentrate on my work for the following few months.

1 2^40-element fast Fourier transforms, for example.

It was during this period that the first seeds which would later become this thesis were planted. Adopting a Feynmanesque attitude — 'if I can't work, I might as well play' — I turned to a problem which had been plaguing the FreeBSD operating system [18] for years: security patches.
Apart from a few abortive attempts, and despite many repeated pleas on the mailing lists, FreeBSD had no existing binary update system; instead, security fixes were distributed as source code patches, and each system administrator would recompile and reinstall the entire operating system. Since this process took several hours, and required manual intervention at half a dozen points, this often meant that important security fixes were ignored entirely.

A few weeks after I finished putting together a working binary security update system for FreeBSD [44], I read an email on a public mailing list, from someone who was unaware of my existing work, asking about the potential for using delta compression in distributing binary security patches; in particular, he asked about using Xdelta [35]. This was something I had been considering, mostly for financial reasons — at 30MB per installation, it wouldn't take many people updating their systems before the bandwidth started costing "real money" — but I hadn't yet found time for this. Some 36 hours and 281 lines of code later, however, I was already producing patches 35% smaller than those from Xdelta; over the following month I (while feeling increasingly guilty about neglecting the work I was "supposed" to be doing) made significant further improvements, which I released to the open source community and used in my binary security update system.2

2 In this manner, FreeBSD became the first commodity operating system, by a margin of over a year, to use delta compression for binary security patches. Microsoft Windows added similar capabilities in the summer of 2004 as part of the XP SP2 service pack, and starting with the OS X 10.3.4 → 10.3.5 upgrade, Apple has been using the code I wrote for FreeBSD to reduce the bandwidth consumed by 10.x.y → 10.x.(y + 1) upgrades.

Over the second half of 2003, this thesis gradually took form: My investigations into delta compression of executable code led me to a new algorithm for matching with mismatches; with this I made further improvements to my delta compression work; and then while walking around Wadham college in October 2003, I stumbled across the surprising algorithm for "universal" delta compression which forms the core of the third chapter of this thesis. From there onwards, for the duration of 2004, all that remained was the tedious process of filling in details.

If a mathematician is a machine for turning coffee into theorems, a computer scientist is a machine for converting caffeine into algorithms. As with mathematicians and theorems, the output of these machines may bear little resemblance to that which was originally sought, but I hope the reader will find this particular body of output to be both interesting and useful.

Chapter 0

Introduction

In the first chapter of this thesis, we introduce a novel approach to string matching problems, namely that of projecting vectors onto subspaces of relatively prime dimension. Providing that we are operating on sufficiently random inputs, we demonstrate that these projections can provide us with useful information about the original strings, while (thanks to their reduced length) allowing for much faster computation. In this manner, we construct a new — and vastly improved — algorithm for matching with mismatches, first in its most obvious form, and later in a more computationally efficient manifestation.
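The flavour of this approach can be conveyed by a toy computation, entirely illustrative and with made-up values of n, the moduli, and the offset set: a small set of offsets is recovered from nothing more than its images modulo two coprime primes, by intersecting the corresponding residue classes.

    # Toy illustration of recovering a set of offsets from its projections
    # modulo coprime primes (all values here are hypothetical, chosen for clarity).
    p1, p2 = 101, 103            # coprime moduli
    n = 10000                    # offsets live in [0, n)
    X = {137, 4242, 9001}        # the "true" offsets, pretended unknown

    image1 = {x % p1 for x in X}     # what a projection modulo p1 reveals
    image2 = {x % p2 for x in X}     # what a projection modulo p2 reveals

    # Intersect the two unions of residue classes over [0, n).
    recovered = {y for y in range(n) if y % p1 in image1 and y % p2 in image2}
    print(recovered == X)            # True unless residues collide by bad luck

In Chapter 1 the images themselves are not given; they must be estimated from noisy projections of the strings, which is where the randomization and the probability bounds enter.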
This first chapter is not for the faint of heart: It draws on a wide array of results, including the Fast Fourier Transform [13] over non-power-of-two lengths [7], the Chinese Remainder Theorem [30, 54], priority queues, the well-known probability bounds of Hoeffding [24] and Chernoff [12], and even results concerning the distribution of prime numbers [15]. The remaining two chapters concern themselves with applications of the algorithm from Chapter 1, and here the reader will likely find an easier road.

In the second chapter, we present a naïve method for delta compression of executable code; that is, a method for constructing patches between two versions of a (compiled) computer program, which takes advantage of the structure of such files while remaining entirely platform-agnostic. This is theoretically interesting — the possibility of a naïve method producing patches of size comparable to those produced by a non-naïve method indicates that knowing the instruction set of a processor is rather less useful than one might expect — but is primarily of practical importance: Because this approach is naïve, it can be applied to any platform desired, allowing highly effective delta compression to be used on esoteric platforms for which it would otherwise be unavailable.

The third chapter extends the concept of delta compression into the concept of universal delta compression: Rather than having two files and attempting to construct a patch between them, we have one file and an estimate of the similarity between the two files. To many people, this result is profoundly counter-intuitive — the idea of constructing a "patch" which allows a given file to be reconstructed from any reference file is combinatorially absurd — but the problem is made feasible by the fact that the party applying the patch already has a file which is similar to the target, even if this file is unknown to the party constructing the patch.

We expect that these two applications will be of significant practical importance in the coming years, and that, consequentially, a large proportion of those reading this thesis will be concerned more with the applications than with the beautiful algorithm for matching with mismatches which underlies them. We strongly recommend that these readers start reading at Chapter 2, and take for the moment as given that an algorithm for matching with mismatches exists with the properties required in the later chapters. Even on second reading, those concerned merely with the applications will not need to read all of Chapter 1; of that chapter, the final two sections alone can be taken as presenting a practical algorithm, even if they fail to provide any justification or motivation. But to those who are more interested in algorithms than applications: Welcome; and let us now begin the first chapter.

Chapter 1

Matching with Mismatches

1.1 Introduction

The problem of approximate matching with respect to edit distance (also known as Levenshtein distance) — that is, given two strings S, T of lengths n, m over an alphabet Σ, n > m (and usually n ≫ m), to find all substrings1 S′ of S such that S′ can be transformed into T via a sequence of at most k substitutions, insertions, and deletions2 — has been studied extensively [39].
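As a point of reference (and not part of the new material in this chapter), the classical dynamic programme solves this problem in O(nm) time; a minimal sketch follows, with the threshold k_max and the example strings chosen arbitrarily.

    # Classical O(nm) dynamic programme for approximate matching under edit
    # distance: D[j] is the fewest edits turning T[0:j] into some substring of
    # S ending at the current position.  Purely a baseline sketch.
    def approx_match_ends(S, T, k_max):
        m = len(T)
        D = list(range(m + 1))        # distances against the empty prefix of S
        ends = []
        for i, c in enumerate(S):
            prev_diag = D[0]          # D[0] stays 0: a match may start anywhere in S
            for j in range(1, m + 1):
                cur = D[j]
                cost = 0 if c == T[j - 1] else 1
                D[j] = min(D[j] + 1,          # skip the current character of S
                           D[j - 1] + 1,      # skip T[j-1]
                           prev_diag + cost)  # match or substitute
                prev_diag = cur
            if D[m] <= k_max:
                ends.append(i)        # an approximate occurrence ends here
        return ends

    print(approx_match_ends("abracadabra", "cadbra", 2))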
Of particular theoretical interest are "filtering" algorithms which, for low error rates (k < αm, for some α depending upon |Σ|), operate in O(n(k + log m)/m) time, which is equal to the proven lower bound for non-indexed algorithms (those which are not permitted to perform any pre-processing of the string S) [11].

The problem of matching with mismatches — a more restricted problem, where the only transformations permitted are substitutions — has received considerably less attention, probably due to its perceived lack of relevance. There are few naturally occurring problems where substitutions occur in the complete absence of insertions and deletions (hereafter, "indels"); however, there are many cases where substitutions are far more common than indels: A visual inspection of a number of related protein sequences suggests that substitutions are 10–20 times as common as indels, while computer programs compiled from closely related source code (e.g., before and after a security patch) often have substitution rates several thousand times their indel rates. In light of this, it seems clear that the problem of matching with mismatches is worth considering.

1 Some authors use the term "substring" to refer to what is more properly termed a "subsequence". In this thesis, we adopt the more conventional definition that a substring of S is a string S_i S_{i+1} S_{i+2} · · · S_j for some i ≤ j.
2 For large values of k, the number of such substrings grows as O(nk); we therefore take "finding" the substrings to mean "enumerating the positions i in S where the substrings start".

The "problem" of matching with mismatches is in fact a number of related problems. Taking δ : Σ × Σ → R as a function which indicates how closely two symbols match, and defining

    V_i = \sum_{j=0}^{m-1} \delta(S_{i+j}, T_j),

the following problems are frequently considered:

1. Compute V_i for all 0 ≤ i ≤ n − m.
2. Given some k ∈ R, find all integers i ∈ [0, n − m] satisfying V_i > k.
3. Given some t ∈ N, find values x_1, . . . , x_t such that V_{x_i} takes on the t largest possible values.
4. Let t ∈ N be given and a set X = {x_1, . . . , x_t} be fixed but unknown, and suppose that S and T are generated by a random process in such a manner that

    E(\delta(S_i, T_j) \mid i - j \in X) = \chi > \bar{\chi} = E(\delta(S_i, T_j) \mid i - j \notin X)

for some constants χ, χ̄ ∈ R. Find X with high probability.

The reader will note that problems 1–4 are in decreasing order of difficulty, i.e., given an algorithm for solving problem i, it is easy to construct an algorithm for solving problem i + 1. In spite of this, we argue that the last problem is often the most natural formulation to consider: In circumstances where the problem of matching with mismatches occurs, the extent to which two strings match is generally not significant in itself, but rather is a marker for an underlying binary property — for example, having evolved from the same original genomic sequence, or having been compiled from the same source code — and thus it is reasonable to expect alignments of the string T against S to either match well or not at all.3

3 This can be viewed as akin to the difference between gold mining and searching for buried treasure — in both cases, we are looking for gold, but in the latter case, there is a clear distinction between gold and non-gold.

In addition to the four problems stated above, there are a number of variations which can be treated using improved algorithms.
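For concreteness, the following sketch computes V directly in O(nm) time (taking δ to be plain character equality) and reads Problems 2 and 3 off from it; all names and example values are illustrative, and the methods discussed next improve upon this.

    # Direct O(nm) computation of the match count vector V, with delta taken
    # to be the zero-one "characters are equal" function (sketch only).
    def match_count_vector(S, T):
        n, m = len(S), len(T)
        return [sum(S[i + j] == T[j] for j in range(m)) for i in range(n - m + 1)]

    S = "the cat sat on the mat while the rat sat still"
    T = "the bat"
    V = match_count_vector(S, T)

    # Problem 2: all offsets i with V_i > k.
    k = 5
    print([i for i, v in enumerate(V) if v > k])

    # Problem 3: the t offsets with the largest match counts.
    t = 2
    print(sorted(range(len(V)), key=lambda i: V[i], reverse=True)[:t])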
Whereas the vector V can be computed for arbitrary δ in O(n |Σ| log m) time by using the FFT [17], it is possible to compute V in O(n √(m log m)) time if δ is a zero-one function and {(α, β) : δ(α, β) = 1} is an equivalence relation, by treating common symbols with the FFT and uncommon symbols with a quadratic-time algorithm [5]. Even more interestingly, the vector V can be estimated for such a zero-one function δ using a randomized algorithm; taking τ to be an integer parameter between 1 and |Σ| as desired, in time O(nτ log m) a vector V′ can be randomly computed such that E(V′_i) = V_i and Var(V′_i) ≤ (m − V_i)²/τ [3, 4]. Finally, and as alluded to earlier, each problem may be considered in both indexed and non-indexed forms, depending upon whether one is permitted to perform some precomputation ("indexing") using the string S before the string T is made available. As we shall see in later chapters, an indexed algorithm can be very useful when attempting to solve many problems with the same S but varying strings T.

In this chapter, we shall consider the last of the problems listed above, with indexing permitted, and with δ a zero-one function representing an equivalence relation as described above.

1.2 Problem statement

In order to allow us to formally prove anything about the algorithms we shall be presenting, we must first give a precise definition of the problem we shall be solving. We therefore define the problem as follows:

Problem space: A problem is determined by a tuple (n, m, t, p, ε, Σ, X) where {n, m, t} ⊂ N, {p, ε} ⊂ R, m < n, 0 < ε, 0 < p, |Σ| is even, and X = {x_1, . . . , x_t} ⊂ {0, . . . , n − m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t.

Construction: Let a string T of length m be constructed by selecting m characters independently and uniformly from the alphabet Σ. Let a string Ŝ be constructed by randomly selecting n characters independently and uniformly from the alphabet Σ. Let a string S of length n be constructed by independently taking S_i = T_{i−x_k} with probability p if ∃ x_k ∈ X ∩ {i − m, . . . , i − 1}, and S_i = Ŝ_i otherwise (with probability 1 − p or 1 depending upon whether X ∩ {i − m, . . . , i − 1} is non-empty).

Problem (indexing): Given (n, m, t, p, ε, Σ, S), generate an (optionally randomized) index S̄.

Problem (matching): Given (n, m, t, p, ε, Σ, S̄, T), find (optionally in a randomized manner) the set X with probability at least 1 − ε for each problem.

The construction above produces strings S and T which are in a sense "as random as possible"4 subject to the requirement that characters from S and T are more likely to match at a set of "good" offsets. The model given above is quite restrictive, and in practice we do not expect inputs to be constructed in this manner5, but as with most theoretical models, it has the important and necessary characteristics while still being simple enough to be useful. At the end of this chapter, we shall describe some changes which will make the algorithms we provide more useful for operating on "real-world" data.

Clearly, for it to be possible to compute the set X with probability > 1/2, the values V_i for i ∉ X must be smaller than the values V_j with j ∈ X. In light of the construction above, the probability of any given pair (i, j) ∈ X̄ × X not satisfying V_i < V_j is asymptotically given by exp(−O(m)), with an implicit constant depending upon p and |Σ| alone6, and thus the set X can only be reconstructed with probability 1 − ε by any algorithm if m = Ω(log(n/ε)), where the implicit constant depends upon p and |Σ|.
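A sketch of this construction, convenient for experimenting with the algorithms below, is given here; the alphabet, the parameter values, and the exact indexing convention within each window are illustrative simplifications rather than part of the formal model.

    # Generate a random instance (S, T, X) in the spirit of the construction
    # above: T uniform over Sigma; S uniform, except that within the window of
    # length m following each offset in X, characters are copied from T with
    # probability p.  Parameters and alphabet are illustrative.
    import random

    def generate_instance(n, m, t, p, sigma="ABCDEFGHIJKLMNOP"):
        T = [random.choice(sigma) for _ in range(m)]
        S = [random.choice(sigma) for _ in range(n)]
        # offsets spaced at least m apart, as the problem space requires;
        # sampling multiples of m is a simple way to guarantee the spacing
        X = sorted(random.sample(range(0, n - m, m), t))
        for x in X:
            for j in range(m):
                if random.random() < p:
                    S[x + j] = T[j]
        return "".join(S), "".join(T), X

    S, T, X = generate_instance(n=100000, m=2000, t=3, p=0.6)
    print(X)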
We will give a O(n) algorithm for constructing S̄, and a   n log (nt/) log m O mp2 4 Formally, whereby the entropy is maximized. In particular, most inputs do not have characters equidistributed across the alphabet: In natural language, certain letters are far more common than others, while computer data formats tend to include large numbers of 0 bytes. 6 The values Vi and Vj can be seen as the sum of m Bernoulli trials; taking the (asymptoticallycorrect) normal approximation, we have two normal distributions with means O(m) apart and √ standard deviations of O( m). 5 8 randomized algorithm for finding X. To our knowledge, this is the first algorithm for any problem in the field of approximate string matching which is sublinear in n for constant error rate k/m ≈ 1 − p and m = Θ(nβ ) for some β > 0. 1.3 Projective searching It is well known that integers can be reconstructed from their images modulo relatively prime values — in other words, their projections onto groups Zm — through use of the Chinese remainder theorem [54]. Reconstructing sets of integers is more difficult, since it is not clear which element of one image corresponds to a given element of another image7,8 . Nevertheless, at the expense of randomization, and for a small number of images, this is possible. First, we present a lemma concerning the distribution of prime numbers. Lemma 1.1. Let π(x) denote the number of primes less than or equal to x. Then π (x (1 + 2/ log x)) − π(x) ≥ x (log x)2 for all x ≥ 5. Proof. Based on computations concerning the location of non-trivial zeroes of the Riemann zeta function [9], it has been shown that     x 0.992 x 1.2762 1+ ≤ π(x) ≤ 1+ log x log x log x log x for all x ≥ 599 [15]. Using these bounds and making the substitutions y = log x, a = 0.992, b = 1.2762 for the benefit of space, we find that for x ≥ 599, 7   x (1 + 2/y) a π (x (1 + 2/ log x)) ≥ · 1+ log (x (1 + 2/y)) log (x (1 + 2/y))   a x (1 + 2/y) · 1+ ≥ y + 2/y y + 2/y  2 x (y + 2) y + ay + 2 = 2 y2 + 2 Indeed, working around this problem is a challenge in a number of number-theoretic algorithms, and for a brief time a related difficulty held up the development of the Number Field Sieve. 8 QOne tempting algorithm would reconstruct a set X by reconstructing and then finding the roots of α∈X (x − α), but this takes time cubic in the size of X, and requires a number of images linear in |X|, so for our purposes it is interesting but not useful. 9  x (y + 2) y 2 + ay + 2 x (y + b) − π (x (1 + 2/ log x)) − π (x) ≥ 2 y2 y2 + 2 x (2 + a − b)y 4 + (2a − 2)y 3 + (4 − 4b)y 2 − 4y − 4b = 2· 2 y y2 + 2 x x ≥ 2 = y (log x)2 where the final inequality holds for y ≥ 3.2, i.e., x ≥ 25. For 5 ≤ x ≤ 599, the result can be verified numerically. The constant 2 can be reduced to 1.2842 +  for all x greater than an easily computable x() by the same argument; using a sufficiently strong form of the prime number theorem [30] it can be shown that 1+ suffices for x greater than some bound x0 (). With that preliminary result, we can now provide a bound on the probability that a set of random projections will have, as their intersection, the correct set, by considering the selection of “unlucky” primes which would be necessary for any value 0 ≤ y < n to be erroneously contained in the intersection. Theorem 1.1. Let n, t, k ∈ N,L ≥ 5, and X = {x1 , . . . , xt } ⊆ {0, . . . , n − 1} be fixed. Let p1 . . . pk be selected uniformly at random from the set of primes in the interval [L, L(1 + 2/ log L)), and let X̄ = {0, . . . 
, n − 1} ∩ (X + p1 Z) ∩ · · · ∩ (X + pk Z) . Then X ⊆ X̄, and equality holds with probability at least   t (log n) (log L) k . 1−n L Proof. That X is a subset of the given intersection follows, by definition, from the observation that it is a subset of each element in the intersection. We turn therefore to the second part of the theorem, of deriving a lower bound on the probability that equality holds. Suppose that equality does not hold, and let y ∈ X̄ − X. Then 0 ≤ y ≤ n − 1, and y ∈ X + pi Z for all i ∈ {1, . . . , k}. Consequently, pi | t Y j=1 y − xj  ∀i ∈ {1, . . . , k}. 10 Since this product is nonzero and bounded from above by nt , and the pi are chosen from the interval [L, L(1 + 2/ log L)), we conclude that for any given y ∈ X̄ − X, all the pi must lie within a set of size at most logL nt = t log n/ log L “unlucky” primes. However, from Lemma 1.1, we know that there are at least L/(log L)2 primes in the interval [L, L(1 + 2/ log L)), so each prime pi lies within that set with probability at most t (log n) (log L) /L. Consequently, each of the n values in {0, . . . , n − 1} lies in X̄ − X with probability at most (t (log n) (log L) /L)k , and the result follows. Note that this bound applies independently of the set X. This is an important distinction, since for random sets X, the probability of inaccurate reconstruction can be trivially bounded by ntk /Lk — this result demonstrates that even for maliciously chosen sets X, we can perform the reconstruction fairly easily. This provides a useful avenue for attacking the problem of matching with mismatches: If we can compute the sets X mod P quickly for arbitrary primes P , then computing X with arbitrarily small probability of error is reduced to the problem of computing the intersection X̄. To see how these sets can be efficiently computed, consider once again the match count vector V , and the restrictions imposed upon it by our formulation of the problem. In most positions, V behaves essentially as Gaussian noise with mean m |Σ|−1 and roughly the same variance; in a few positions (those corresponding to the xi we wish to find), it “spikes” up to around pm. We have a signal which we wish to find, within a background of noise. Now consider what happens if we take the natural projection of V from Rn onto the subspace RZP . The signal remains, although its locations (i.e., the values xi ) are reduced modulo P ; and the level of “background noise” increases. Providing that P is large enough — that is, providing that we don’t allow the noise level to increase too much — we can find the set X mod P (subject to a given probability of error ) by taking the largest values of this projection. Algorithm 1.1. Let S, T, Σ, n, m, p, t, and  be given as specified in the problem description above, with 16 log(4n/) < m < min p2 √ ! √ 32n 8( n + 1) log(4n/) , . t log n p2 11 Then: 1. Set  log(2n/) k= log 8n − log (mt log n) L=  8n log (2kn/) mp2 − 8 log (2kn/) k̂ = dlog n/ log Le X̄ = {}. 2. Compute P = {x ∈ N : L ≤ x < L(1 + 2/ log L), x prime}. 3. For i = 1 . . . k, pick pi ∈ P and Σi ⊂ Σ with |Σi | = 12 |Σ| uniformly at random, and define φi : Σ → {−1, 1} by φi (x) = (−1)|Σi ∩{x}| . 4. For i = 1 . . . k, compute the vectors A(i) , B (i) ∈ Rpi where for 0 ≤ j < pi (i) Aj = l m n−j pi −1 X φi (Sj+λpi ) λ=0 (i) Bj = l m m−j −1 pi X φi (Tj+λpi ). λ=0 5. Compute vectors C (i) ∈ Rpi as the cyclic correlations of respective A(i) and B (i) pi X (i) (i) (i) Cj = Ar+j Br r=0 (i) using the FFT, and compute the set X (i) = {j ∈ N : 0 ≤ j < pi , Cj > mp/2}. 
6. For all k̂-tuples (x1 , . . . , xk̂ ) ∈ X (1) × · · · × X (k̂) , compute the unique 0 ≤ x < p1 p2 . . . pk̂ with the given images modulo the respective primes; if x < n and x mod pi ∈ X (i) ∀i ∈ {k̂ + 1 . . . k}, then set X̄ = X̄ ∪ {x}. 9 7. Output X̄. 9 With the bounds on m as given earlier, the second part of this step simplifies somewhat since we will always have k̂ = k. We claim, without proof, that the algorithm is in fact useful for a wider range of values of m, including values for which we might have k̂ < k, and include this detail here for that reason. 12 Theorem 1.2. The output of Algorithm 1.1 will consist of the elements of X with probability at least 1 − ; further, if m 16 log(4n/) p2 and n  1, then the algorithm, not including the computation of the vectors A (i) , completes in (16 + o(1))Cn log(n/) log n log(n/) mp2 mp2 time, where C is a constant such that a length-L cyclic correlation can be computed in CL log L time using the FFT. Proof. We first observe that the restrictions on m provide that k = k̂ = 2, and √ n < L < n. Now defining σpi (n, m, j) = |{(x, y) ∈ Z × Z : 0 ≤ x < n, 0 ≤ y < m, x ≡ y + j mod pi }| , we have 1 2 (i)  Cj + σpi (n, m, j) = =  1 (i) (i) Ar+j Br + σpi (n, m, j) 2 r∈Zpi     X X  X   1   φi (Ty ) + σpi (n, m, j) φ (S ) i x     2 r∈Zpi =  X 0≤x (1 + δ)µ) < eδ (1 + δ)(1+δ) !µ where µ = E(X), δ > 0, and X is the sum of independent random 0-1 variables is equivalent to P (X > Y ) < exp(Y − µ + Y (log(µ) − log(Y ))) where Y is a constant greater than µ and X is the sum of independent random 0-1 variables. Proof. Taking Y = (1 + δ)µ, P (X > Y ) = P (X > (1 + δ)µ) !µ eδ < (1 + δ)(1+δ) = exp(µ(δ − (1 + δ) log (1 + δ))) = exp(µδ − µ(1 + δ) log (µ(1 + δ)/µ)) = exp(Y − µ − Y log (Y /µ)) = exp(Y − µ − Y (log Y − log µ)) = exp(Y − µ + Y (log(µ) − log(Y ))) as required. Lemma 1.3. Let X and Y be random variables and F, G : R → [0, 1] be monotonic differentiable functions such that P (X ≤ γ) ≤ F (γ) P (Y ≥ γ) ≤ G(γ) F (−∞) = G(∞) = 0 F (∞) = G(−∞) = 1 16 Then for any γ0 , P (X < Y ) ≤ F (γ0 ) + Z ∞ γ0 G(γ)F 0 (γ)dγ Proof. P (X < Y ) = EY (PX (X > Y )) ≤ EY (F (Y )) ! Z Y = EY F 0 (γ)dγ = ≤ ≤ Z ∞ Z−∞ ∞ Z−∞ γ0 −∞ −∞ P (Y ≤ γ)F 0 (γ)dγ G(γ)F 0 (γ)dγ 1 · F 0 (γ)dγ + = F (γ0 ) + Z ∞ γ0 Z ∞ γ0 G(γ)F 0 (γ)dγ G(γ)F 0 (γ)dγ We now provide a theorem concerning the probability that the “spikes” we seek to find come “close enough” to rising above the level of the noise. Theorem 1.4. Let xj be independent random variables (0 ≤ j ≤ n), which are respectively the sums of σ(j) independent Bernoulli trials (where the trials involved in xi are unrelated to the trials involved in xj for i 6= j). Further let the expected value of xj be µσ(j) for 1 ≤ j ≤ n and the expected value of x0 be µσ(0) + α. Then 2 −2α2 −1 for 16α σn < β, σ < log β < 2 , P ( {j : 1 ≤ j ≤ n, xj − µσ(j) > x0 − µσ(0)} > βn) < 2e −  2 √ α√ 2 √ − − log β σ , where σ ≥ max σ(j). Proof. From the Hoeffding bounds, we note that for 1 ≤ j ≤ n and 0 ≤ δ ≤ 1, P (xj − µσ(j) ≥ δα) ≤ e P (x0 − µσ(0) ≤ δα) ≤ −2δ 2 α2 σ(j) ≤e −2(1−δ)2 α2 e σ(0) 17 −2δ 2 α2 σ ≤ −2(1−δ)2 α2 σ e . 
p Making the substitution δ̂ = α1 −σ log(β)/2 (note that 0 < δ̂ < 1 from the bounds assumed on log β above), applying the Chernoff bound in the form given in Lemma 2 1 provides: 1.2, and noting that e−cx is convex for x2 > 2c P ( {j : 1 ≤ j ≤ n, xj − µσ(j) ≥ δα} ≥ βn) ≤ e =e   −2δ 2 α2 2 2 σ +nβ −2δσ α −log β    −2α2 δ̂ 2 −2α2 δ 2 2 2 2 n e σ −e σ δ̂ −δ + 2nβα σ ≤e =e nβ−ne n  4α2 δ̂ β σ − 2nβα σ  2 δ−δ̂ δ−δ̂  + 2nβα σ 2  2 2 δ̂ −δ 2 for all δ > δ̂. For clarity11 we note that the above is a Chernoffian bound on the (βn)th largest of the values x1 . . . xn . If we had σ(1) = . . . = σ(n) = σ, then the “typical” value of p the (βn)th largest value would be approximately µσ+ −σ log β/2 as this is the value which individual xi will exceed with probability roughly β; where Chernoff bounded the probability of a random variable diverging from its mean by a value exponential in the square of the divergence, we have done the same to the probability of the (βn)th largest value from a set of random variables diverging from the typical value of the (βn)th largest value (but with a different exponential constant). Now if we define X by Xα = x0 − µσ(0), Y to be the largest value such that {j : 1 ≤ j ≤ n, xj − µσ(j) ≥ Y α} ≥ βn, and F (δ) = G(δ) =   exp(−2(1 − δ)2 α2 /σ) δ ≤ 1 1 δ≥1 1 δ ≤ δ̂ 2 2 exp(−2nβα (δ − δ̂) /σ) δ ≥ δ̂ then the conditions of Lemma 1.3 hold. 11 ... and to use up space on a page which would otherwise be half-empty... 18 Taking γ0 = δ̂, we therefore have P ( {j : 1 ≤ j ≤ n, xj − µσ(j) > x0 − µσ(0)} > βn) = P (Y > X)   2 Z 1 2α2 (1−δ̂ ) −2(1−δ)2 α2 −2nβα2 (δ−δ̂)2 d − σ σ σ e dδ + e · ≤e dδ δ̂   2 Z 1 2 −2α2 nβ(δ−δ̂)2 +(1−δ)2 2α2 (1−δ̂ ) 4(1 − δ)α σ dδ = e− + e σ σ δ̂ !   2 2(1−δ̂)2 α2 Z 1 2 −2α2 (1+nβ) δ− nβ δ̂+1 2 2α2 (1−δ̂ ) 4(1 − δ)α σ 1+nβ σ = e− dδ e 1 + e σ(1+nβ) σ δ̂ !   2 2(1−δ̂)2 α2 2 Z 1 −2α2 (1+nβ) δ− nβ δ̂+1 2 2α2 (1−δ̂ ) 4(1 − δ̂)α σ 1+nβ σ < e− e 1 + e σ(1+nβ) dδ σ δ̂   2 Z nβ(1−δ̂) 2 2 2(1−δ̂)2 α2 2 2α2 (1−δ̂ ) −2α (1+nβ)x 1+nβ 1 + e σ(1+nβ) 4(1 − δ̂)α σ σ dx e = e− δ̂−1 σ 1+nβ < 2α2 (1−δ̂ ) σ e− = e− ≤ e− 2α2 (1−δ̂ ) σ 1+ − 2 2α2 (1−δ̂ ) σ  2(1−δ̂)2 α2 e σ(1+nβ) 4(1 − δ̂)α2 σ Z 0 δ̂−1 1+nβ 1dx + Z ∞ −2α2 (1+nβ)x2 σ e dx 0 !! !! √ 2(1−δ̂)2 α2 2 α2 4(1 − δ̂) 2πα(1 − δ̂) 1 + e σ(1+nβ) + p σ(1 + nβ) σ(1 + nβ) !! √ 2 4 2π 1 + e 16 +√ 16 16 2 2α2 (1−δ̂ ) σ < 2e− = 2e 2 2 2 √ α√ 2 √ − − log β σ , α2 (1−δ̂)2 2 α < 1 on the third line from the end. as desired, using the fact that σ(1+nβ) < σnβ 16 We now present an improved algorithm which is significantly faster than Algorithm 1.1, at the expense of increased complexity: Algorithm 1.2. Let S, T, Σ, n, m, p, t, and  be given as specified in the problem description above. Assume further that: (p 3 m < min n2 /2 , t(log n)2 Then: 19 ) p n/2 . 8p2 1. Set  log(2n/) k= log n − log mt(log n)2 1p β = k /2n 2 p 2 p x= − log β + log(4kt/)  L= 2nx mp2 − 2x k̂ = dlog n/ log Le X̄ = {}. 2. Compute P, pi , φi , A(i) , B (i) , C (i) as in Algorithm 1.1. (i) 3. Let X (i) be the set of size βpi + t of values j for which Cj values. takes the largest 4. For all k̂-tuples (x1 , . . . , xk̂ ) ∈ X (1) × · · · × X (k̂) , compute the unique integer 0 ≤ x < p1 p2 . . . pk̂ with the given images modulo the respective primes; if x < n and x mod pi ∈ X (i) ∀i ∈ {k̂ + 1 . . . k}, then set X̄ = X̄ ∪ {x}. 5. Output X̄. Theorem 1.5. The output of Algorithm 1.2 will consist of the elements of X with probability at least 1 − . 
Further, if p 3 n2  mp2 , n  1, and the A(i) are precomputed, then the algorithm operates in (2 + o(1)) p log(n/) + p 3 log(t/) mp2 2 Cn log n log(n/) mp2 time, where C is a constant such that a length-L cyclic correlation can be computed in CL log L time using the FFT. 20   (i) Proof. We consider first the sets X (i) . As in Theorem 1.2, 12 Cj + σpi (n, m, j) is the sum of σpi (n, m, j) Bernoulli trials, and has expected value σpi (n, m, j)/2 if j 6∈ X mod pi and expected value at least σpi (n, m, j)/2 + mp 2 if j ∈ X mod pi . Taking α = mp/2, σ = (n + L)m/L, we note that 1 −1 −2α2 = −x < log β < log < σ 2 2 16α2 4mp2 1p 16α2 < = < /2n ≤ β, σpi σL n+L 2 so Theorem 1.4 applies with n = pi . Consequently, for any j ∈ X mod pi we find: P (j 6∈ X (i) ) < 2e = 2e − −  2 √ α√ 2 √ − − log β σ r √ √ m 2 p2 L − − log β 2(n+L)m !2 √ 2 = 2e−( x− − log β ) = 2e− log 4kt/  = 2kt Since |X| ≤ t, we conclude that X mod pi ⊆ X (i) for all 1 ≤ i ≤ k with probability at least 1 − /2. We can now apply Theorem 1.3 and (including the probability that the X (i) are incorrect, and noting on the second line that L > 2n/m) find that  t log n log L k − /2 P (X̄ = 6 X) < 1 − n β + L  k mt(log n)2 − /2 <1−n β+ 2n  k 1p 1p k k ≤1−n /2n + /2n − /2 2 2 =1−  as desired. To establish the time bound, we first note that p √ n/2 n < m< 8p2 p2 21 L= √ 2nx n n, > > mp2 − 2x mp2 so we have k̂ = 2. Similarly, p n n 3 p > = 2n/, 3 2 mt(log n)2 n /2 so k ∈ {2, 3}. The only steps in the algorithm which take more than O(L) time are the computation of the C (i) , which takes CkL log L time using the FFT (by the definition of C), and step 4, which takes O((βL + t)2 ) time. However, (βL + t)2 ≤ 2β 2 L2 + 2t2 √ 3 n 2n2 2nx 1   2/k ·L+ 2 ≤ 2 2n mp2 − 2x m (log n)4 √ √ 3 n2 x n32 ≤ (1 + o(1)) · L + m mp2 = o(1) (xL + L) = o(kL log L), so the entire algorithm takes (1 + o(1))kL log L = = = (2 + o(1))Cnkx log L mp2 − 2x q 2 p 1 (2 + o(1))k Cn log L k log(n/) + log(t/) mp2 p 2 p n log(n/) (2 + o(1)) log(n/) + 3 log(t/) Cn log mp2 mp2 time. Essentially, this algorithm trades increased complexity in the process of reconstructing X for shorter FFT lengths pi ; by performing more processing of the vectors C (i) , it extracts more information, thereby reducing the necessary signal-to-noise ratio. As we will see in the next section, there is yet more information which can be extracted. 22 1.5 A Bayesian reformulation Bayesian analysis combines a “prior distribution” (the distribution expected in the absence of observations) of a set of parameters with the probability that the observations which were made would have been made given each possible value of said parameters, to obtain a “posterior distribution”, which indicates the likelihood of the parameters having any given value. Of particular recent interest, Bayesian analysis has been used to filter electronic mail, by decomposing messages into small pieces (usually words) and using past history to estimate the probability of each piece appearing in spam or non-spam messages [21]. It seems reasonable to consider this approach for filtering “good offsets” (those in X) from “bad offsets” (the rest). We have a well-defined prior distribution — we know that out of {0, 1, . . . n−m} there are t values in X and n−m−t+1 values not in X — and we are essentially making k observations about how well the strings match at each offset. 
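In symbols, and treating the k observations at a given offset as independent (as we shall do below), Bayes' rule combines the prior odds with a product of per-projection likelihood ratios:

    \[
    \frac{P\bigl(j \in X \mid C^{(1)}_j, \ldots, C^{(k)}_j\bigr)}
         {P\bigl(j \notin X \mid C^{(1)}_j, \ldots, C^{(k)}_j\bigr)}
    = \frac{t}{n-m-t+1}
      \prod_{i=1}^{k}
      \frac{P\bigl(C^{(i)}_j \mid j \in X\bigr)}
           {P\bigl(C^{(i)}_j \mid j \notin X\bigr)} .
    \]

The remaining work is to write down usable approximations for the two conditional densities.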
Subject to the (false) assumptions that the sum of Bernoulli trials is exactly normal, and that values j 6∈ X never fall into the same class (modulo pi ) as (i) any values k ∈ X, we would have (noting that (Cj + σpi (n, m, j))/2 is distributed as the sum of σpi (n, m, j) Bernoulli trials) −x2 2σpi (n,m,j) e (i) dx P (Cj = qσpi (n, m, j) + x|j 6∈ X) ≈ p πσpi (n, m, j)/2 −(x−pm)2 2σpi (n,m,j) e (i) P (Cj = qσpi (n, m, j) + x|j ∈ X) ≈ p dx πσpi (n, m, j)/2 and thus, to a first approximation: (1) (k) P (j ∈ X|Cj . . . Cj ) P (j 6∈ X|Cj . . . Cj ) (1) (k) = Ye t n−m−t+1 «2 „ (i) − Cj −pm 2σpi (n,m,j) i ≈  e „ « (i) 2 − Cj 2σpi (n,m,j) t exp 2mp n−m−t 23 X i  (i) Cj − mp 2   σpi (n, m, j)  where j is reduced modulo pi as appropriate. This hints at a O(kn) algorithm, by computing k, pi , C (i) as in Algorithm 1.2, computing the sum (i) X Cj − mp/2 i σpi (n, m, j) for all j and finding the t largest values. As noted, however, this fails both theoretically (the distribution is close to, but not exactly, normal), and in practice: Offsets j 6∈ X will have their “scores” pulled upwards by the proximity (in terms of being congruent modulo primes in P ) of elements of X. In particular, if X = {r, 2r . . . (t − 1)r, xt }, and r is one of the pi randomly chosen, then the sum above will have larger values at j ∈ rZ − X than it has at j = xt ; while this is not a problem for random sets X, this (and similar constructions) makes it impossible to sufficiently bound the probability of error for arbitrary X. (i) This problem can be avoided, however, by filtering the vectors Cj somewhat; if (i) (i) (i) we take Dj = max(Cj , δ), for some appropriate δ, and consider the sums of Dj (i) instead of Cj , then we can guarantee a small probability of error regardless of X. Furthermore, it is not necessary to compute the sum for all j; rather, since we are only interested in the largest t values, we can start by considering the largest elements of each vector D (i) and stop once it is clear that the remaining sums will be less than all of the t largest values found so far. Algorithm 1.3. Let S, T, Σ, n, m, p, q, t, and  be given as specified in the problem description above. Then: 1. Set tmp2 log n δ= n   log(nt/) k= log(1/4δ)  p √  −8n log 2k /nt − δ  p L= √ . mp2 + 8 log 2k /nt − δ 24 (i) (i) 2. Compute P, pi , φi , Aj , Bj and C (i) as in Algorithm 1.1, and σpi (n, m, j) as defined in the proof of Theorem 1.2. 3. Compute the vectors D (i) such that (i) exp (−Dj ) = √ δ + exp (i) −mp(Cj − mp/2)    2σpi (n, m, j)  . 4. Using a priority queue, enumerate and output the t values of j for which takes the largest values and 0 ≤ j < n.12 Theorem 1.6. Provided that √ P (i) i Dj n < L < n, the output of Algorithm 1.3 will be the elements of X with probability at least 1 − . If n2 > Lk , the algorithm operates in asymptotic order 4Cn log(nt/) log n log(nt/) mp2 mp2 time, where C is a constant such that a length-L cyclic correlation can be computed in CL log L time. Proof. We note that since step 4 finds the values 0 ≤ j < n such that P (i) i Dj takes on the largest values, it suffices to prove that the sum attains larger values at the t values j ∈ X than at any of the n − t values j 6∈ X. 
Using the fact that, for x a sum of n Bernoulli trials,     2 α(x − E(x)) α E exp ≤ exp , n 8n we note that for j ∈ X, (i) E(exp(−Dj )) = √   δ + E exp  (i) −mp(Cj − mp/2)   2σpi (n, m, j)     √ −m2 p2 (mp)2 · exp ≤ δ + exp 4σpi (n, m, j) 8σpi (n, m, j)   √ −m2 p2 = δ + exp 8σpi (n, m, j)   √ −mp2 pi ≤ δ + exp 8(n + pi ) 12 This algorithm is slightly non-trivial but can found in many places, e.g., Knuth[31]. 25 Similarly, for 0 ≤ j 0 < n, j 0 ∈ 6 X, noting that P (j 0 ∈ X mod pi ) < δ, we have (i) (i) (i) E(exp(Dj 0 )) < δ · E(exp(Dj 0 )|x ∈ X mod pi ) + E(exp(Dj 0 )|x 6∈ X mod pi )     p 2 p2 2 −m (mp) · exp < δ · δ −1 + exp 4σpi (n, m, j 0 ) 8σpi (n, m, j 0 )   √ −mp2 pi ≤ δ + exp 8(n + pi ) (i) (i) Consequently, observing that since since Dj and Dj 0 (for j 6= j 0 ) derive from the sums of distinct subsets of the elements of the match count vector V , they will be independent, we have for all pairs j ∈ X, j 0 6∈ X, X (i) X (i) X (i) (i) Dj 0 − Dj )) Dj 0 ) ≤ E(exp( P( Dj ≤ i i i = Y i (i) (i) E(exp(Dj 0 ))E(exp(−Dj )) 2 −mp2 pi ≤ δ + exp 8(n + pi ) i   2k √ −mp2 L δ + exp ≤ 8(n + L)  = nt Y √  and, noting that there are fewer than nt such pairs (j, j 0 ), we obtain the required probability of error. To establish the time bound, we note that step 4 can be performed using a heap in time equal to log L times the number of positions j which must be considered; this number is easily bounded by L2 n−2/k in the same manner as the error bound. It is important to note that this filtering relies upon knowledge of p; if the value used is too low, useful “signal” will be filtered away, while if the value used is too high, the probability of obtaining incorrect results will be higher than desired. In some cases, it may be desirable to use a somewhat less powerful but more robust (i) (i) approach, by computing the ranks rj of the elements Cj within their respective q (i) C (i) , and considering the sum of log pi − log rj ; alternatively, one may choose to entirely ignore the danger of an increased error rate for worst-case inputs, and work (i) with the Cj directly. 26 1.6 Final notes In order to avoid over-complicating the exposition above, we have omitted some details which, while worthy of note, do not effect the main arguments. We present them here, without proof: 1. Since in the above algorithms we always have m < L ≤ {pi }, it is not necessary to perform the general correlation of two length-L vectors to compute the vectors C (i) ; rather, it is necessary to perform a length-L correlation of a length-L vector with a length-m vector. This reduces the factor of log L to a factor of log m, reducing the algorithms presented to run in order n log(nt/) log m mp2 time. 2. In some cases, it may be preferable to compute the number of matching characters directly, rather than applying mappings φi onto {−1, 1}. This increases the cost of computing the correlations by a factor of |Σ|, but reduces the necessary correlation length L by a factor of |Σ| /4; more significantly, however, it allows for general goodness-of-match functions δ to be used, which is important in some contexts. 3. As in the paper by Atallah et al. [3], it is possible to use mappings φ : Σ → {ω i } instead of mapping onto {−1, 1}. This has the effect of reducing the necessary correlation length L by a factor of two; but it also makes it necessary to work in C instead of R, which in most cases will result in the correlation taking (at least) twice as long. 4. 
We have required that, subject to the requirement of approximately matching in positions given by the set X, the two strings S, T are random and have characters taken uniformly from Σ. In some contexts, this is not reasonable: In particular, DNA sequences often have short strings which repeat a large number of times, while computer binaries often have large regions containing mostly zero 27 bytes. To resolve these problems, we can borrow a technique which is often used in the field of DNA sequence alignment: If a sequence is repeated several times, the repeating characters can be replaced by “ignore” symbols which φ maps to 0. Alternatively, and somewhat better in the context of computer binaries, commonly-occurring characters (such as the “0” byte) can be partially masked, by weighting them (or rather, the mapping φ) by some factor depending upon the character frequency. In practice, we find that weighting by the inverse square root of character frequency seems to produce good results13 . 5. If the values n, m, p, t,  are fixed in advance, the string S can be indexed simply by precomputing (or choosing) L, P, pi , φi , A(i) . If, however, some of these parameters are unknown, values P, pi , φi , A(i) may be precomputed corresponding to values Li = 2−i n for various i, taking O(n log n) time and producing an index of size O(n); once the necessary parameters are known, the “correct” value of L can be computed and the portion of the index corresponding to the smallest Li ≥ L can be used (at the expense of a slight time cost due to the larger than necessary vector lengths). Indeed, if implementation details make this desirable, specific convenient pi can be chosen for the index — with the caveat that the most convenient values 2i are likely poor choices if the source data is computer-generated. 6. In the last algorithm above, the values σpi (n, m, j) were used. For the values of n, m, L which can occur in the above algorithms, however, we have σpi (n, m, j) ≈ nm/pi , and computations involving σpi (n, m, j) can be replaced – at a very slight cost in accuracy – by corresponding computations involving nm/pi . 7. The restrictions placed upon the input parameters, and the values assigned to L, have naturally erred on the side of caution, and practical implementations will 13 Another tempting possibility is to deflate the input strings, and weight each individual character according to the inverse of the length of the string represented by the symbol in which it is found; by virtue of deflate’s effectiveness as a compression mechanism, the weights so obtained would most likely be a good estimate of the “information” transmitted by each character, and thus of the importance of it correctly matching. 28 generally be able to reduce the o(1) term by adjusting the choice of parameters. 8. Given that, in practice, most input strings will be non-random, it may be useful to apply several different mappings φ for each prime pi and add their resulting vectors C (i) together, since this will reduce the probability of choosing an “unlucky” φ. 9. In practical applications, the critical element for determining whether these al(i) gorithms can be used is the distribution of the values Cj . 
For random inputs (as constructed earlier), these values will be approximately normally distributed apart from outliers corresponding to values in X (these outliers might not necessarily appear to lie outside of the obvious distribution); however random or (i) non-random the process is which produces the inputs used, if the values Cj are distributed in an approximately normal manner then the algorithms described here are likely to succeed (although it may be necessary to make L larger by a multiplicative constant in order to compensate for non-random behaviour of the inputs). With these final notes, we have a practical algorithm for indexed matching with mismatches which operates in sublinear time for constant error rates. In the following chapters, we shall turn our attention to two applications of this algorithm, first in the delta compression of compiled programs, and second in the remote synchronization of files. 29 Chapter 2 Delta Compression of Executable Code 2.1 Introduction Historically, delta compression — that is, generating “patches” encoding the difference between two files1 — has been used most often in the context of program source code. Revision control systems typically keep the most recent version of a file along with reverse deltas, in order that older versions can be reconstructed [55], while updates are normally distributed as forward deltas — better known as patches. These source code patches have advantages beyond that of mere delta compression; they are also human-readable and (usually) commutative and reversible, which makes them especially useful for collaborative software development. Given that executable files are typically not human-readable, and that people rarely have cause to examine an old version of an executable file, delta compression of executable code is only used for one major purpose: software updating. In the past few years, however, this purpose has become increasingly important: As software security flaws are discovered and exploited increasingly rapidly — often by autonomous worms — it becomes vitally important that end-user systems be kept up to date. In this context, delta compression of software updates has two major advantages; not only can it dramatically reduce transmission costs, but it allows software to be updated more quickly, reducing both the potential for security flaws to be exploited before the affected software can be updated, and increasing the likelihood that patches will be 1 Or, equivalently, compressing a file by using a second “reference” file. 30 applied promptly: With security updates often in the range of 5–10 MB, and many users still connecting to the Internet via modems running at 33.6 kbps,2 it can easily take half an hour for uncompressed patches to be downloaded, exceeding the patience of most users. Delta compression of text files has usually been performed with two basic operations: copying and insertion3 . Typically, lines common to the old and new versions of a file will be copied, while lines which only appear in the new file will be recorded as part of the patch. This works well with text files because changes tend to be localized; a few lines might change from one version to the next, but most of the lines would remain unchanged. 
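As a hypothetical illustration of this copy-and-insert model (not the format of any particular tool), a delta for line-based text can be expressed as a short list of COPY and INSERT commands; the helper names below are invented for the sketch.

    # Toy line-based delta in the copy/insert style (illustrative only):
    # runs of lines shared with the old file become COPY commands, and lines
    # appearing only in the new file are stored verbatim as INSERT commands.
    import difflib

    def make_delta(old_lines, new_lines):
        delta = []
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                delta.append(("COPY", i1, i2 - i1))         # offset, length in old file
            else:
                delta.append(("INSERT", new_lines[j1:j2]))  # literal new content
        return delta

    def apply_delta(old_lines, delta):
        out = []
        for cmd in delta:
            if cmd[0] == "COPY":
                _, off, length = cmd
                out.extend(old_lines[off:off + length])
            else:
                out.extend(cmd[1])
        return out

    old = ["a", "b", "c", "d"]
    new = ["a", "b", "x", "d", "e"]
    assert apply_delta(old, make_delta(old, new)) == new

The size of such a patch is governed by how much of the new file falls into INSERT commands, which is precisely what goes wrong for executable code.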
Unfortunately, this method fails when applied to executable files: Unlike the situation with text files, when executable files change there are usually differences spread throughout; as a result, only small regions of the old and new versions can be matched together, making it necessary for a large proportion of the new file to be stored in the patch. One solution to this problem relies upon knowledge of the structure of executable files, and the machine code, of the files’ target platform. By disassembling the old and new executable files and comparing the resulting code, it is possible to generate very compact patches [6]; however, since this approach requires a detailed understanding of the target platform, it is both highly complex and completely non-portable. Throughout this chapter, we shall refer to the “new” and “old” files; these are also sometimes known as the “current” and “reference” files. 2.2 Differences of executable code To construct compressed deltas of executable files in a portable manner, it is important to consider how executable files change at the byte level from one version to another. We group these changes into three categories: zeroth-order changes, first-order changes, and second-order changes. 2 Most users have modems capable of running at 56 kbps, but poor quality wiring tends to limit potential speeds. 3 Some programs are in fact even more restricted, using only deletion and insertion; in the context of text files, where blocks of text normally do not change order, this is equivalent. 31 Zeroth-order changes are those which are innate to the compilation process — changes which take place even when the same source code is compiled twice. While in most cases there are no such changes, some UNIX programs contain human-readable time, date, and host stamps [44], and Microsoft’s Portable Executable format [37] contains a timestamp in the file header. These zeroth-order changes can be problematic for some applications — they can result in files being spuriously identified as “modified”, causing unnecessary patches to be distributed or setting off intrusiondetection systems — but since they are normally very small relative to the files as a whole, they have little impact on patch generation. First-order changes are those which can be directly attributed to changes in source code. While there is generally some expansion when compared to the source code delta — when optimizing compilers reorder instructions, they will inevitably interleave modified instructions with unmodified instructions, and the precise allocation of processor registers may be modified over an even larger region — these changes will be localized and of an extent proportional to that of the source code changes. Since the bytes of executable code affected by first-order changes belong to instructions which are essentially new, they are best compressed by well-understood (non-delta) compression algorithms. Second-order changes are those which are induced indirectly by first-order changes. Every time bytes of code are added or removed, the absolute addresses of everything thereafter, and any relative addresses which extend across the modified region, are changed. As a result, a small first-order change will result in addresses being modified throughout an executable file. 
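A toy model of this effect (with an entirely made-up "code" consisting of nothing but 4-byte absolute addresses) shows how a single small insertion perturbs bytes far beyond the region that actually changed:

    # Toy model of second-order changes: a "program" that is just a table of
    # 4-byte little-endian absolute addresses pointing into the same file.
    # Inserting 16 bytes near the start shifts every later target, so bytes
    # differ throughout even though only one spot genuinely changed.
    import struct

    addresses = list(range(1000, 60000, 700))            # pretend branch targets
    old = b"".join(struct.pack("<I", a) for a in addresses)

    insert_at, inserted = 1200, 16
    new = b"".join(struct.pack("<I", a + inserted if a >= insert_at else a)
                   for a in addresses)

    changed = sum(x != y for x, y in zip(old, new))
    print(changed, "of", len(old), "bytes differ")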
This explosion of differences, where a single line of modified source code can cause up to 5–10% of the bytes of executable code in a file to be modified, is what causes conventional delta compression algorithms to perform so poorly on executable files. As a result, efficient delta compression of executable code can be largely considered to be the problem of locating and compactly encoding these second-order changes. 32 2.3 Matching Since first-order and second-order changes are fundamentally different in nature, it is important to distinguish between them, so that they can be handled differently. The natural approach to this task is to attempt to match portions of the new file against the old file; differences between closely matched regions should be treated as second-order differences (i.e., as differences between regions of binary code generated from the same source code), while regions of the new file which cannot be matched to any part of the old file should be treated as first-order differences4 . In conventional delta compression methods, this matching is required to be exact — two strings of bytes match if and only if they are equal. Two commonly used methods used for this matching are suffix trees and hashing; depending upon the application, these are sometimes limited to fixed windows, in order to minimize memory usage. For matching in executable files, however, exact matching is inappropriate; instead, we want to find code which matches apart from certain mutable elements (such as addresses and possibly register allocations). In platform-specific tools (e.g., [6, 43]), this is easy: The code can be disassembled, the mutable elements stripped out, and the remaining instruction listings compared. This problem becomes similarly easy given access to the source code and a (possibly modified) copy of the compiler used: By having the compiler record the lines of source code responsible for each region of binary code in the two files, the problem is reduced to one of comparing the source code for the two files. If either source code or platform-specific knowledge is unavailable, however, the problem becomes one of matching with mismatches — we wish to find regions from the two files which match in their “opcode” bytes, but not necessarily in the bytes encoding memory addresses or register allocations. We note that this problem is related to the problem of DNA sequence alignment; in both problems, two strings are given from which approximately matched substrings 4 Providing that the encodings used for the first-order and second-order differences are reasonably “stable” (i.e., not overly influenced by the presence of a small number of bytes which do not follow the same model as the majority), it not necessary that this matching be exactly correct. 33 must be found. There is, however, a quantitative difference: In DNA sequence alignment, the percentage of matched base pairs is usually quite high (often 90% or more) while the length of matched regions is quite low (tens of base pairs) due to mutations which add or delete characters; when matching executable code, in contrast, the percentage of matched characters can be lower (often down to 50%) but the matched regions tend to be longer (hundreds or thousands of characters). A trivial quadratic algorithm can be obtained by dynamic programming (or equivalently, by finding the shortest path through a directed graph); each byte within the new file can be aligned against any position in the old file, or left unmatched. 
Scores are computed based on the number of matched bytes (bytes in the new file which are equal to those against which they are aligned), the number of mismatched bytes (those which are not equal to the bytes against which they are matched), the number of unmatched bytes, and the number of realignments (positions where two successive bytes in the new file are not aligned against successive bytes in the old file). Unfortunately, such a quadratic algorithm is far too slow to be useful in practice 5 . This algorithm is closely related to the well-known Needleman-Wunsch algorithm for DNA sequence alignment [40] and its derivatives. 2.4 Block alignment Recognizing that small changes in source code tend to leave large portions of the resulting binaries unchanged apart from internal addresses, we apply the algorithm for matching with mismatches from Chapter 1. Taking the old file S, we construct an √ index S̄ with k = 2 and L = 4 n log n in time linear in the file size6 . Next, we divide √ the new file T into blocks of length n log n, and using the index S̄ locate where in the old file each such block matches best. If we are lucky, every sufficiently large region which matches (with mismatches) between the two files will contain at least one block with the correct alignment; but 5 In delta compression of text files, O(n2 ) algorithms are not uncommon; but in such cases, n is usually on the order of 103 lines, in contrast to the 105 –107 bytes typical of binary files. 6 The constant 4 is to some extent arbitrary; the FFT and block lengths can be adjusted depending upon the amount of time available, the importance of obtaining correct matchings, and the importance of identifying small matched blocks. 34 there will almost certainly be blocks which were aligned in incorrect places, and the boundaries between adjacent blocks with different alignment offsets will usually not correspond to the boundaries between the underlying regions. Consequently, we now “tweak” this alignment: We scan, first forwards, then backwards, through the new file, considering in turn each of the boundaries between aligned blocks. Based on the contents of the new and old files, we move these boundaries in order to reduce the number of mismatches; where a range of possible boundaries would result in the same number of mismatches, we place the boundary on a multiple of the largest power-of-two possible7 . We also remove blocks which, as a result of this process, shrink below some threshold; this is done because such blocks would be too small to have been found by our algorithm for matching with mismatches, so we have no reason to think that they are correctly aligned. At this point, most or all large regions which match (with mismatches, but without indels) between the two files should have been found. We now make a final pass through the list of blocks, counting the number of matching characters, and remove any blocks or parts of blocks which fail to match in at least 50% of their characters, on the assumption that if the best matching we can find for a region is that poor, then it probably doesn’t match at all, but is instead a completely new region (i.e., a first-order change). Each application of our algorithm for matching with mismatches takes O(L log L) √ √ = O( n log n log n) time, and we apply the algorithm m/ n log n times, for an operation count of O(m log n). The other stages (the initial indexing, and the “tweaking”) run in O(n + m) time, for a total operation count of O(m log n + n). 
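A much-simplified sketch of this block alignment pass follows. The function find_best_offset is a brute-force stand-in for the matching-with-mismatches index of Chapter 1 (so that the sketch is self-contained), and the boundary “tweak” is the greedy scan just described; the power-of-two tie-breaking and the removal of undersized blocks are omitted.

    import math

    def find_best_offset(old, block):
        # Brute-force stand-in for the FFT-based index of Chapter 1: score every
        # alignment of `block` against `old` and return the best starting offset.
        best_score, best_off = -1, 0
        for off in range(max(1, len(old) - len(block) + 1)):
            score = sum(x == y for x, y in zip(old[off:off + len(block)], block))
            if score > best_score:
                best_score, best_off = score, off
        return best_off

    def block_align(old, new):
        bs = max(1, int(math.sqrt(len(new) * math.log(len(new) + 2))))
        bounds = list(range(0, len(new), bs)) + [len(new)]
        # One alignment offset ([old position] minus [new position]) per block.
        deltas = [find_best_offset(old, new[bounds[i]:bounds[i + 1]]) - bounds[i]
                  for i in range(len(bounds) - 1)]

        def mismatches(lo, hi, delta):
            return sum(not (0 <= j + delta < len(old)) or old[j + delta] != new[j]
                       for j in range(lo, hi))

        # Greedy tweak: slide each internal boundary to wherever the two adjacent
        # alignments produce the fewest mismatches.
        for k in range(1, len(bounds) - 1):
            lo, hi = bounds[k - 1], bounds[k + 1]
            bounds[k] = min(range(lo, hi + 1),
                            key=lambda b: mismatches(lo, b, deltas[k - 1])
                                        + mismatches(b, hi, deltas[k]))
        return list(zip(bounds, bounds[1:], deltas))

The real implementation performs the per-block search with the index S̄ rather than by brute force; only the boundary adjustment is as cheap as it appears here.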
In practice, this method is quite fast — especially on processors with vector8 floating-point arithmetic9 . The majority of the time consumed is spent performing single-precision FFTs, and since these are among the first complex operations which 7 Due to issues involving instruction prefetching and code caches, compilers usually attempt to align blocks of code on 8-, 16-, or 32-byte boundaries. 8 Or, as it is more popularly known these days, SIMD. 9 This includes all of the most widely used processor architectures: Intel processors have “SSE”, PowerPC processors have “Altivec”, and AMD processors have “3DNow”, all of which support simultaneous operations on four single-precision floating-point values. 35 processor (and library) vendors optimize, block alignment tends to perform even better than a theoretical operation count might suggest. Block alignment also performs quite well from the perspective of memory usage: The index, and the resulting alignment, fit within O(n1/2+ ) bytes of memory. For pairs of binaries which are quite similar, block alignment also tends to result in very small patches. 2.5 Local alignment While block alignment is very effective when applied to two binaries with very small source code differences (e.g., security updates), it tends to perform poorly in the presence of more extensive changes. Because it starts by aligning blocks of length √ O( n log n) (typically a few thousand bytes), it is incapable of detecting and aligning smaller regions, even if those regions match extremely well. For these, we return to the methods used for delta compression of binary data files: We search for regions from the two files which match perfectly. First we suffix sort the string S#T #, where the character ‘#’ is usually called the “EOF” character and is sorted before any characters in Σ; this can be done in O((n + m) log(n + m)) time [34], or even in O(n + m) time [26, 29, 32] with a significantly larger implicit constant. We then compute the longest common prefix vector (in linear time [27]), and then scan through the list of sorted suffices in both directions in order to obtain, for each position in T , the offset within S from which the largest number of matching characters can be found, and the length of this matching substring10 (also in linear time). Given these locally-optimal alignments, we can generate a set of alignments for the entire file as follows: Starting at the beginning of the file, we set the “current” alignment to be the offset (between the new and old files) where the string starting at byte zero matches the furthest, and allow the “next” alignment to iterate forwards through our array of locally optimal alignments. Any time that the “current” alignment, extended forward to the end of the matching block associated with 10 An alternative, and more commonly used approach, involves hashing fixed-length blocks [35, 57], which is faster but often fails to find the longest matching substrings. Suffix trees are also commonly used, but in light of recent algorithmic improvements have no advantages over suffix sorting. 36 the “next” alignment, contains enough mismatches (typically 8 mismatches is a reasonable threshold), the “current” alignment is output and replaced by the “next” alignment. This produces a list of non-overlapping alignments; starting from these “seeds”, we now extend the alignments forwards and backwards to the extent that they continue to match at least 50% of characters. 
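The seed-extension step can be sketched as follows; the seed is given as a half-open range of the new file together with its alignment offset, and the 50% rule is interpreted as extending for as long as at least half of the newly covered characters match — a plausible reading of the description above rather than the exact heuristic of any released tool.

    def extend_seed(old, new, start, end, offset, threshold=0.5):
        # Extend the alignment new[start:end] <-> old[start+offset:end+offset]
        # forwards and backwards while at least `threshold` of the extension
        # continues to match.
        def is_match(j):
            i = j + offset
            return 0 <= i < len(old) and old[i] == new[j]

        covered = good = 0
        for j in range(end, len(new)):          # forward extension
            covered += 1
            good += is_match(j)
            if good >= threshold * covered:
                end = j + 1
        covered = good = 0
        for j in range(start - 1, -1, -1):      # backward extension
            covered += 1
            good += is_match(j)
            if good >= threshold * covered:
                start = j
        return start, end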
In practice, the time taken is dominated by the construction of the suffix array, and reasonably good results are produced for pairs of executables produced from significantly different source code; but for binaries with only very small differences, the performance is limited by the inability of locally optimal alignments to reflect large blocks which match with significant numbers of mismatches11 . This local alignment √ is also slower than block alignment, and consumes O( n) times as much memory. 2.6 Combining local and block alignment In order to gain the advantages of both block alignment and local alignment, we return to dynamic programming, but avoid the O(nm) running time which would result from allowing arbitrarily alignments by first pruning the graph using block and local alignment. The block alignment works as described in section 2.4, with the exception that the final filtering stage is removed: The old file is indexed, the new file is split into blocks which are individually aligned against the old file, and the boundaries between blocks are moved (or removed entirely) in order to maximize the number of matching characters. The local alignment proceeds up to the point of finding the longest matching substring starting from each position in the new file: The two files are suffix sorted together, the longest common prefix vector is computed, and the suffix array is scanned forwards and backwards. We now iteratively construct a pruned graph, and find the shortest path through it, as follows: Instead of including nm vertices corresponding to the n positions in 11 One case where this occurs is in tables of addresses, where there may be over a thousand bytes which mismatch in one or two out of every four bytes (the low order byte(s) of each 32-bit address). Since the largest perfect matches within such a block are only three bytes long, the block will remain entirely undetected by a method which relies only upon locally optimal alignments. 37 the old file against which each of the m bytes in the new file could be aligned, we include only 64m. The 64 vertices associated with each position pos in the new file comprise: 31 computed from the offsets ([position in old file] minus [position in new file]) associated with the longest matching substrings starting at positions immediately following pos; 31 from offsets which have the shortest distances from the origin to position pos − 1; one corresponding to the matching predicted by block alignment; and one which is the “not matched” position12 . At byte zero in the new file, all the vertices are initialized to distance zero; subsequently, the distance from (x, y) to (x + 1, y + 1) is 0 if the corresponding bytes from the old and new files match, and 2 if they mismatch; the distance from the “not matched” position for pos − 1 to the “not matched” position for pos is 1; and a distance of 20 is assigned to the remaining paths from one step to the next13 . Once the last byte of the new file is reached, we find the vertex from that final step with the minimum distance, and follow its path backwards; in so doing, we have constructed an alignment of the new file against the old file. 2.7 Delta encoding Once the old and new files have been matched against each other, we turn to encoding their differences. The matching is easily encoded by listing, in order of the new file, the offsets and sizes of blocks which are copied from the old file and the sizes of new blocks. 
We encode these integers in little-endian, base-128 format, and use the most significant bit of each byte to identify the most significant byte14 ; we refer to the resulting sequence of bytes as the “control string”. Those regions of the new file which are not matched against any part of the old file (and which, therefore, are believed to be zeroth-order and first-order differences) are extracted and concatenated; this forms the “extra string”. 12 Like many other numbers in this chapter, the values 64 and 31 are chosen simply because they work well without being excessively slow. 13 These values are obtained by experimental observation, and will not be ideal for all input data. 14 This commonly used encoding allows for small integers to be expressed compactly, while not imposing any upper limit on the integers to be encoded. 38 What remains is the problem of encoding the second-order changes identified. As noted before, platform-specific information can make this task much easier; one approach [43] involving complete disassembly of the executables into assembly language removes these second-order differences completely, since upon re-assembly, the new addresses are used. Another approach uses extra data emitted by the compiler to zero all such internal addresses and store them separately [53]; this emasculated file is then delta compressed using well understood mechanisms [25, 52, 56]. In order to efficiently encode the second-order changes in a naı̈ve manner, i.e., without platform-specific knowledge, the original source code, or a modified compiler, we recall three points: First, these changes are primarily the result of addresses (relative and absolute) being modified. Second, references to the same, or nearby, data or code targets will normally be modified in the same manner (code and data tend to be moved around in blocks). Third, data and code targets which are referenced close together have a good chance of being located together (locality of reference). Together, these points suggest that when matched blocks of code from the new and old files are compared, not only will most of the differences occur in addresses, but the numerical differences between the addresses in the new and old files will take on certain values far more commonly than others — in short, those differences will be highly compressible. Now, without being able to disassemble the executable code into its constituent instructions, knowing that the differences represent a small number of commonlyoccurring differences is immaterial in itself; but since computer binaries invariably encode addresses as binary integers, we can obtain a very compressible difference string by simply taking the bytewise differences between corresponding bytes of the new and old files. The presence of multi-byte integers diminishes the potential compression somewhat, however, since a bytewise subtraction of 1200 - 11FC yields 0104, while 1210 - 120C yields 0004; to avoid this, we turn to multi-precision arithmetic subtraction, and write the difference in balanced notation (i.e., with digits in the range [-128 . . . 127]); thus 1200 - 11FC = 1210 - 120C = (00)(04), while 1280 - 11C8 = 12B8 - 1200 = (01)(-48). 
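Both kinds of difference are easily computed; a short sketch, with the byte order of the multi-precision subtraction left as a parameter, reproduces the worked example above:

    def bytewise_diff(new, old):
        # Independent subtraction of each pair of corresponding bytes, mod 256.
        return [(n - o) % 256 for n, o in zip(new, old)]

    def balanced_diff(new, old, little_endian=True):
        # Multi-precision subtraction with balanced digits in [-128, 127].
        order = range(len(new)) if little_endian else range(len(new) - 1, -1, -1)
        out, carry = [0] * len(new), 0
        for i in order:                 # proceed from the least significant byte
            d = new[i] - old[i] + carry
            carry = 0
            if d > 127:
                d, carry = d - 256, 1
            elif d < -128:
                d, carry = d + 256, -1
            out[i] = d
        return out

    # The worked example above, with bytes stored most-significant first:
    print(bytewise_diff([0x12, 0x80], [0x11, 0xC8]))                       # [1, 184], i.e. (01)(B8)
    print(balanced_diff([0x12, 0x80], [0x11, 0xC8], little_endian=False))  # [1, -72], i.e. (01)(-48)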
Noting that this depends upon machine byte order, 39 we compute not one “difference” between each pair of matched regions, but instead four: First, the bytewise differences, which perform reasonably well regardless of machine byte order; second, the Lilliputian multi-precision difference; third, the Blefuscuan multi-precision difference; and fourth, a “correction map” simply containing the values from the new file in each place that the new and old files differ15 . Each of these four constitutes a “difference string”; we will decide later which one shall be used. Now, we note that a difference string is in fact a union of two entirely different data sets. It specifies where in the executable file addresses have been modified, and it also specifies by what amount they have been modified. Just as combining similar data before compression tends to improve compression ratios, combining dissimilar data before compression tends to reduce compression ratios16 , so we split this string into two parts: First, a “difference map”, which is of the same length but contains bytes equal to zero or one, depending upon whether the corresponding byte is nonzero, and second, a “non-zero difference string”, containing all the non-zero bytes without the intervening zeroes17 . In this manner, we separate locally repetitive data (as noted before, in any region, the differences tend to take a small number of values repeatedly) from globally structured data (as a result of instruction encoding lengths and the positions within instructions where addresses are encoded, the locations where corrections must be made tend to form distinctive, and compressible, patterns). The control string, non-zero difference string, and extra string are compressed independently, either with zlib [2] (a Lempel-Ziv compressor) or libbzip2 [51] (a block sorting compressor); in general, the control and non-zero difference strings compress most efficiently with zlib, since they contain local repetition, while the extra string compresses best with libbzip2, since executable code contains significant structure, which a block sorting compressor can utilize. 15 We have yet to find any files (not specifically constructed for this purpose) for which the “correction map” results in smaller patch sizes; but it seems a natural item to include. 16 Birds of a feather compress better together. 17 The “correction map” is handled slightly differently: It is transformed into a difference map marking all the positions where the two files differ, and a non-zero difference string containing the values from the new file at all these locations. 40 The difference map is compressed in a different manner: Recognizing that it contains entirely structural redundancies, we first take the Burrows-Wheeler transform [10] of the data, which serves to cluster related data together — in this case, causing the 1s to cluster together; next we enumerate the positions of the 1s, thus forming a strictly increasing sequence; finally, we recursively divide this sequence in half, and encode the value at the midpoint using arithmetic compression. We now construct the patch file in five parts. 
The first part is a header containing a “magic” string; a byte specifying the differencing mode used (correction, bytewise, Lilliputian, or Blefuscuan) and the compression method used (none, zlib, or libbzip2) for each of the control, non-zero difference, and extra strings; the sizes of the new and old files; the MD5 hashes [50] of the new and old files¹⁸; the sizes of the compressed and uncompressed control, non-zero difference, and extra strings; the size of the compressed difference map; and the position of the EOF character after the Burrows-Wheeler transform has been performed. The other four parts, in order, are the compressed control string, the compressed non-zero difference string, the compressed difference map, and the compressed extra string. We have implemented this method in a tool named ‘bsdiff 6’¹⁹.

¹⁸ It has recently been shown [58] that it is quite easy to find collisions in the MD5 hash function, but this is irrelevant to the purpose for which we are using it: We include these hashes simply as a safeguard against error (whether human, in the form of attempting to apply a patch to the wrong file, or computer, in the form of incorrectly applying the patch).
¹⁹ bsdiff versions 0.8 through 4.2 were previously published by the author, and ranged from the traditional copy-and-insert approach to a method similar to what we have described here, but using only local alignment, and using a more primitive delta encoding. Version 5 never existed except as a descriptor for experimental work which later became version 6.

2.8 Performance

In order to evaluate the delta compression performance of bsdiff 6 when generating patches between two different versions of an executable, we use 15 pairs of DEC UNIX Alpha binaries from the exposition of Exediff, the working of which is specific to that platform [6] (five pairs from that paper are not included here; four of these were artificial, while one pair was unavailable). In Table 2.1 we list for each of these pairs the original size of the new version of the file, both uncompressed and bzip2-compressed, along with the size of patches produced by three freely available binary patch tools (Xdelta [35], Vcdiff [33], and zdelta [56]), a widely used commercial tool (.RTPatch [45]), Exediff, and bsdiff 6²⁰. In the interest of a fair comparison, and after communication with one of the authors of that tool (Robert Muth), we re-compressed Exediff’s patches with bzip2 rather than gzip where this resulted in a reduction in patch size.

Program | Original | bzip2 | Xdelta | Vcdiff | zdelta | .RTPatch | Exediff | bsdiff 6
agrep: 4.0 → 4.1 | 262144 | 114338 | 14631 | 10886 | 7162 | 5910 | 3531 | 4265
glimpse: 4.0 → 4.1 | 524288 | 222548 | 109252 | 93935 | 64608 | 37951 | 23200 | 24642
glimpseindex: 4.0 → 4.1 | 442368 | 193883 | 98632 | 80325 | 51723 | 25764 | 18473 | 16240
wgconvert: 4.0 → 4.1 | 368640 | 157536 | 75230 | 60658 | 38544 | 20712 | 15688 | 12432
agrep: 3.6 → 4.0 | 262144 | 114502 | 80346 | 79962 | 63282 | 58124 | 41554 | 44327
glimpse: 3.6 → 4.0 | 524288 | 222178 | 177434 | 189926 | 147594 | 140549 | 106154 | 109680
glimpseindex: 3.6 → 4.0 | 442368 | 193892 | 144927 | 144746 | 115980 | 105510 | 80799 | 80447
netscape: 3.01 → 3.04 | 6250496 | 2396661 | 1100430 | 1013581 | 2519221 | 351759 | 284608 | 212032
gimp: 0.99.19 → 1.00.00 | 1646592 | 642725 | 463878 | 462588 | 345385 | 301879 | 185962 | 219684
iconx: 9.1 → 9.3 | 548864 | 233056 | 139409 | 119510 | 80017 | 51195 | 38121 | 31632
gcc: 2.8.0 → 2.8.1 | 2899968 | 708301 | 549250 | 422288 | 274652 | 140284 | 76072 | 88022
rcc (lcc): 4.0 → 4.1 | 811008 | 221826 | 889 | 667 | 373 | 265 | 303 | 187
apache: 1.3.0 → 1.3.1 | 679936 | 180708 | 111421 | 103611 | 69895 | 48033 | 42038 | 25927
apache: 1.2.4 → 1.3.0 | 671744 | 179369 | 191920 | 242292 | 200511 | 216867 | 231357 | 163249
rcc (lcc): 3.2 → 3.6 | 434176 | 155090 | 84456 | 76227 | 52324 | 34098 | 22019 | 22691
Average Compression | 100% | 36.22% | 20.83% | 19.66% | 19.52% | 10.88% | 8.41% | 7.67%

Table 2.1: Sizes of updates produced by bzip2, Xdelta, Vcdiff, zdelta, .RTPatch, Exediff, and bsdiff 6

The average compression factor shown is computed as the arithmetic mean of the individual compression factors [patch size]/[original size], weighted by the square root of the file size, in order to avoid over-weighting either the largest or the smallest files used²¹. (This can be seen as a compromise between using a constant weight and a weight linear in the file size; it might also be possible to make an argument that this weighting is appropriate by considering the sizes of patches as being produced by random walks, but no such considerations were in fact made in choosing this weighting.)

There is a very clear trend; from the largest (worst) patches to the smallest (best) patches, the order is usually bzip2, Xdelta, Vcdiff, zdelta, .RTPatch, {Exediff, bsdiff 6}, with Exediff and bsdiff 6 being optimal in 7 and 8 cases respectively; overall bsdiff 6 performs slightly better than Exediff. There are two notable exceptions to this trend: First, zdelta performs exceptionally poorly on the netscape 3.01 → 3.04 upgrade; we believe that this results from the large file size (6.5 MB) exceeding the capabilities of zdelta and causing it to effectively revert to regular (non-delta) compression²². The second exception is the apache 1.2.4 → 1.3.0 upgrade, where bsdiff 6 was the only delta compressor to produce a smaller “patch” than bzip2. Examining the source code for these two versions of Apache, it is not surprising that there is little similarity for the various delta compressors to grasp: The two versions share less than half of their source code.
This is reflected in the patch produced by bsdiff 6: Out of the 671744 bytes in apache 1.3.0, only 263509 were delta compressed; the remaining 408235 were compressed with bzip2 and stored as “extra” bytes.

²⁰ It would have been useful to compare the performance of other platform-specific tools, but since those tools are both specific to a different platform, and commercially licensed, this was not possible.
²¹ Different authors express the performance of compression algorithms in a variety of manners; what we refer to as a “compression factor of 10%”, many authors would refer to as a “compression ratio of 10x”, while yet others would refer to “90% compression” (presumably on the hypothesis that 100% compression would eliminate the file entirely).
²² Note that zdelta uses the deflate algorithm, which is less effective at compressing executable code than bzip2.

More important than the above consideration of upgrading between versions (“feature updates”), however, are security updates. These are fundamentally different, in that the source code modifications are typically extremely small — of the first 13 security advisories issued for the FreeBSD operating system in 2004, 5 were corrected by patches of less than ten lines, and only one exceeded 50 lines in length [19, 20]. We take the i386 build of FreeBSD 4.7-RELEASE and a snapshot of the RELENG_4_7 security branch as a corpus for comparison here. In Table 2.2 we show the performance of Xdelta, Vcdiff, zdelta, .RTPatch, bsdiff 6, and bsdiff 6 using block alignment only, grouped according to the file type being patched (Exediff is not considered here, because it is exclusive to a different platform).

Note that the 3 ‘text’ files (one C header file and 2 configuration files) all have very small differences; the performance of the various tools here reflects little more than the overhead in their respective patch formats. For the 79 ‘binary’ files (69 executables, 7 library archives, and 3 shared object files), the tools perform in the same order as before; in order from the largest to the smallest patches: Xdelta, Vcdiff, zdelta, .RTPatch, bsdiff 6. We note, however, that the gap between best and worst performance is much larger for security patches; while the various tools produced ‘upgrade’ patches averaging from 7.67% to 20.83% of the original file sizes — a spread of less than a factor of three — the ‘security’ patches averaged from 1.27% to 9.28% — a difference of more than a factor of seven. This reflects the greater relative importance of second-order changes in the files; where the source code changes — and thus the first-order changes in the executable files — are small, the resulting patch sizes depend almost totally upon the efficiency of encoding second-order changes.

In light of the vastly reduced memory consumption, it is useful to note the delta compression performance of bsdiff 6 when operating in “block alignment only” mode: Given that the patches are on average only 8% larger, this may be a method worth considering further.
File type | # files | Total size | bzip2 | Xdelta | Vcdiff | zdelta | .RTPatch | bsdiff 6 | block alignment only
Executables (static) | 53 | 15056508 | 6725178 | 1779523 | 1275057 | 773574 | 373788 | 250509 | 263676
Executables (dynamic) | 16 | 8002788 | 3399743 | 1009139 | 664869 | 422074 | 182843 | 128859 | 133907
Library archives | 7 | 7600906 | 1635094 | 93480 | 57902 | 38797 | 30661 | 16476 | 31240
Shared Object | 3 | 1438072 | 599937 | 96651 | 77386 | 49174 | 20323 | 13338 | 13716
C headers | 1 | 11402 | 3810 | 238 | 66 | 55 | 169 | 126 | 129
Configuration | 2 | 10183 | 4407 | 401 | 81 | 62 | 280 | 214 | 313
Total | 82 | 32119859 | 12368169 | 2979432 | 2075361 | 1283736 | 608064 | 409522 | 442981
Average Compression | | 100% | 38.51% | 9.28% | 6.46% | 4.00% | 1.89% | 1.27% | 1.38%

Table 2.2: Sizes of security patches produced by bzip2, Xdelta, Vcdiff, zdelta, .RTPatch, bsdiff 6, and bsdiff 6 using block alignment only

2.9 Conclusions

By considering how executable files change at the byte level from one version to another, and by utilizing the algorithm for matching with mismatches from Chapter 1 and (optionally) suffix sorting, we can generate patches which are equal to or smaller than those produced by a carefully tuned platform-specific tool. In light of the great advantages of a naïve approach — greater simplicity, less potential for correctness and/or security errors, and portability across platforms — we believe this is a very useful approach to the distribution of software updates in general, and security updates in particular.

Unfortunately, in the context of “open source” software, binary patches have one critical flaw: If the “old” file is not exactly as expected, it will be impossible to correctly apply the patch, meaning that the entire “new” file must instead be transmitted²³. We will turn to this problem in the next chapter.

²³ By using checksums or cryptographic hashes, failing to correctly apply a patch can be assumed to be detected.

Chapter 3
Universal Delta Compression

3.1 Introduction

As noted at the end of Chapter 2, while delta compression can provide considerable advantages, it also has a significant limitation: Each delta is generated from a pair of “new” and “old” files, and any attempt to apply a patch to the wrong “old” file is very unlikely to result in the correct “new” file being produced. As a consequence, normal delta compression is entirely unusable for updating files which have been modified (or in the case of ‘Open Source’ software, compiled) on the destination machine; it requires that copies of all published binaries be kept (either intact, or as reverse deltas) in order for future delta compression to be performed¹; it can be impractically slow to build patches if there are a large number of old versions — in the context of security patches, even adding a couple of hours may be problematic — and finally, even once all the necessary patches have been generated, it is necessary either that each system is examined in order that the appropriate patch may be downloaded (which may be impossible for systems not connected to the Internet, or in other high security environments², and is often undesirable even if it is possible), or that several different patches are distributed in a single package — which would severely diminish the advantages of delta compression.

¹ This is a particular concern at present, since delta compression is only starting to become widely used; many vendors find that they would like to use delta compression, but did not keep copies of older binaries.
While it would be possible to make a gradual transition towards using delta compression — using deltas whenever possible, and downloading the entire “new” files when necessary — this isn’t a particularly useful approach, since the majority of security patches usually affect files which have never been updated before. 2 Banking systems, for example. 47 In light of these disadvantages, we consider the problem of universal delta compression with respect to a distance function d: Problem: Let a distance function d be chosen, let a file S be fixed, and let parameters R (known as the “radius”) and  (the acceptable probability of failure) be given. Generate a (possibly randomized) file S 0 , with S 0 < |S|, such that for any file T with d(S, T ) < R chosen independent of the randomization used in constructing S 0 , S can be computed from {T, S 0 } with probability at least 1 − . In short, generate a single patch which can be distributed to everyone, rather than needing to generate individual patches for each old version of the file being updated. Note that the probability  of error arises out of the randomized construction of S 0 — we are not concerned by the possibility that, given S 0 , one can construct files T within the given radius for which the algorithm fails. Some authors refer to this as a “one-round synchronization protocol” or a “zero-feedback synchronization protocol”; given that a zero-feedback protocol is hardly a protocol at all, we opine that this falls more appropriately into the category of delta compression than that of protocols for file synchronization3 . While there are theoretical solutions to this problem involving colouring hypergraphs [41], these require exponential computational time; as such, they are of no practical value. Throughout this chapter, we will refer to the “sending machine” and the “receiving machine”, these being the parties holding the strings S and T respectively. 3.2 Error correcting codes If the distance function d is the Hamming distance, there is a well known algorithm for this problem [1]: Take some error correcting code C with the property that 3 Nevertheless, there is a natural way of converting an algorithm for universal delta compression into a protocol for file synchronization: One participant computes and transmits deltas S 0 computed with increasing values of R, and the second participant attempts to apply these patches in turn and sends a single bit back after each patch to indicate if the most recent patch was sufficient. (We assume that a checksum can be transmitted in order that the second participant can recognize if S has been reconstructed correctly.) 48 C(x) = x.P (x) for some parity function P , with distance at least 2R+1, and compute and send the parity block S 0 = P (S). At the receiving machine, on receipt of P (S), form the string T.P (S) and decode according to the error correcting code. Given that |{i : Si 6= Ti }| ≤ R and the code C has distance at least 2R + 1, T.P (S) will be decoded to the codeword S.P (S), from which the string S is extracted. 
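As a toy demonstration of this parity-block construction (for strings short enough to fit into a single codeword), the following sketch assumes the third-party Python package reedsolo; since its decode() return type differs between versions, the result is unpacked defensively. This illustrates the idea only, and is not the construction used later in this chapter.

    # pip install reedsolo  (a pure-Python Reed-Solomon codec over GF(2^8))
    from reedsolo import RSCodec

    R = 4                        # number of substitutions to tolerate
    rsc = RSCodec(2 * R)         # 2R parity bytes give distance 2R + 1

    S = b"the quick brown fox jumps over the lazy dog"
    parity = bytes(rsc.encode(S)[len(S):])     # P(S): the parity block alone

    # The receiving machine holds T, a copy of S with up to R substituted bytes,
    # and reconstructs S by decoding T.P(S) as a codeword with errors.
    T = bytearray(S)
    T[4:8] = b"slow"             # four substitutions, no insertions or deletions
    decoded = rsc.decode(bytes(T) + parity)
    recovered = bytes(decoded[0] if isinstance(decoded, tuple) else decoded)
    assert recovered == S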
In contexts where the transmission of the parity block is not subject to errors (as is the case when constructing compressed deltas — we trust patch integrity to lower level mechanisms), this can be improved slightly if we restrict ourselves to using cyclic codes: Rather than taking S as the data block of an error correcting code and transmitting the parity block, we can take S as a codeword with errors, and compute and transmit S′ = syndrome(S) = S(x) mod g(x), where g(x) is the generator polynomial for the code. Since S − S′ is a codeword⁴, we can compute and decode the string T − S′ to find that codeword, and then add S′ to retrieve S. This reduces the problem of error correction from one with codewords of length |S| + |S′| to one with codewords of length |S|.

⁴ Note that although S − S′ is a codeword, it is typically not the closest codeword to S.

For computer files, which are naturally interpreted as a sequence of symbols in GF(2^8), an obvious choice of codes is the Reed-Solomon codes [48]: Operating on symbols from GF(2^n), a Reed-Solomon code exists of length 2^n − 1 with distance δ + 1 and δ parity symbols for any 0 ≤ δ < 2^n − 1. If the necessary block length is not one less than a power of two, these codes can easily be shortened (effectively by padding with zeroes) to provide any desired length; this construction is used quite commonly, most notably in the error correcting codes used on CD-ROM disks.

If S can be formed from T by a sequence of Ns substitutions of total length Ls bytes (e.g., if S = “ABCDEF” and T = “ABCXYF” then Ns and Ls can be taken to be 1 and 2 respectively) and if M bytes are grouped into each symbol, then the distance d(S, T) can be bounded from above by min(Ls, 2Ns + (Ls − 2Ns)/M), by considering the number of symbols which can be corrupted by a run of k consecutive errored bytes. Consequently, since the Reed-Solomon codes need a length-2d parity block to correct d errors, and each symbol is M bytes in length, a Reed-Solomon code can correct Ns substitutions of total length Ls bytes by transmitting

min(2 Ls M, 2 Ls + 4 Ns (M − 1)) = min(2 Ls ⌈log_256 |S|⌉, 2 Ls + 4 Ns (⌈log_256 |S|⌉ − 1))

bytes. Comparing this to the trivial asymptotic combinatorial lower bound (for 1 ≪ Ns ≪ Ls ≪ |S|) of

Ls + log_256 [ (|S| − Ls + 1 choose Ns) · (Ls − 1 choose Ns − 1) ] ≈ Ls + Ns log_256( Ls (|S| − Ls) e^2 / Ns^2 )

obtained by considering the number of ways that Ns blocks of total length Ls can be located within a total length of |S|, and using the approximations log n! ≈ n log(n) − n and (1 + 1/n)^n ≈ e, we note that this performs fairly well for Ns ≪ Ls and for small files, but transmits considerably more data than necessary when the changes occur in the form of isolated bytes in a large file.

3.3 Randomized large-codeword error correction over GF(2^8)

Where there are a significant number of isolated byte-substitutions, the above approach can be improved, at the expense of randomization and a small probability of error, as follows: Rather than considering the entire string S as a codeword in a Reed-Solomon code over GF(2^k) (for strings of up to (2^k − 1)⌊k/8⌋ bytes) and performing the error correction in a single step, we first apply an error-reducing code. Dividing the string into ⌈|S|/255⌉ random distinct subsequences⁵ of length 255 (this can be done in a variety of ways, but the simplest approach is to construct a random permutation function φ — for example, by generating |S| random numbers and sorting them — and taking the sequences φ(255i) . . .
φ(255i + 254)), the sending machine computes and transmits a 2R-byte Reed-Solomon syndrome for each subsequence. Given these, the receiving machine uses Reed-Solomon decoding over GF (28 ) to attempt to correct the substitutions. For each 255-byte random subsequence, if R or fewer substitutions were present, the resulting codeword will be correct; otherwise, at most R new errors 5 Naturally, the selection of distinct subsequences would be performed in a pseudo-random manner, with the initial PRNG state either fixed or transmitted as part of the error-correcting block. The purpose of this randomization is simply to ensure that “bad” inputs, where the errors are placed in a deliberately inconvenient manner, are not likely to be encountered. 50 will have been introduced by the “decoding”6 . As a result, providing that R is chosen sufficiently large, the number of errors left after this randomized decoding will be significantly reduced; a single large-field Reed-Solomon code can then correct the remaining errors. Suppose that strings S, T of length n differ by Ns substitutions (in any positions). Then each random 255-byte subsequence of T will contain an average of µ = 255Ns /n bytes which differ from the corresponding bytes in S. Suppose further that n > 216 and 2 exp(−µ(2 log(2) − 1)) < 1/2 − 1/ log 256 n (for our purposes, we require that the number of errors is at least 2–5% of the file length; if there are fewer errors, then we take R = 0 and proceed directly to the error-correction stage, since the errorreduction would be counterproductive). Then for some appropriate 0 < δ < 1, let R = µ(1 + δ) and compute syndromes as above. These will allow for individual 255-byte subsequences to be decoded with a probability of error less than the Chernoff bound of exp(δµ)(1 + δ)−R , which for δ < 1 is less than exp(−µδ 2 (2 log(2) − 1)). Consequently, the number of errors remaining after this first phase will be 2Ns exp(δµ)(1 + δ)−R + √ O( n) (where the “implicit constant” can be effectively computed given a desired maximum probability  of errors remaining), and (by our assumed bound above) the second error-correcting phase can correct these while still achieving a reduction in the number of bytes transmitted. We note that these bounds, while algebraically complex, can be computed very efficiently numerically; in practical scenarios, one would take the values n, Ns , and compute for each R = 0, 1, . . . , 127 the probability of any individual 255-byte sequence being decoded correctly, and thereby the size of the syndrome necessary in the second phase in order to obtain the desired probability of error. To take a simple example, suppose we wish to correct a string of length |S| = 106 bytes having Ls = Ns = 105 errors, and we are permitted a probability  = 10−3 of error. Using a single Reed-Solomon code, we would need to operate over GF (219 ) (since we can group two bytes into each symbol), and we would need a parity block of 2 · 105 symbols, thus we would need to transmit 475000 bytes. If we instead divide 6 In fact, for large values of R, codewords with too many errors are most likely to simply be returned as “undecodable”, since they are likely to not fall within an R-error radius of any codeword. 
51 S into 3922 random subsequences of length 255 (padding the final subsequence with zero bytes), and take R = 37 (i.e., send a syndrome of 74 bytes for each 255-byte subsequence), then each individual subsequence can be correctly decoded with probability 99.14%, and the number of remaining errors is less than 4359 with probability 1 − . Now we can send a much smaller parity block over GF (219 ), and correct these remaining errors using 20701 bytes, for a total of 310929 bytes transmitted, less than the original requirement of 475000 (at the expense of potentially getting the wrong answer if the error-reducing stage left too many errors remaining). 3.4 Rsync revisited The rsync algorithm, patented in 1995 [46, 47] and independently invented and popularized7 by Andrew Tridgell [36, 57] is a two-round protocol for file synchronization 8 . For a parameter B, known as the block size and typically in the range of 300–700 bytes, the receiving machine computes and transmits checksums of bytes iB up to (i + 1)B − 1 of T for all 0 ≤ i < |T | /B. The sending machine, upon receiving these checksums, computes the checksums of bytes i up to i + B − 1 of S for all 0 ≤ i < |S| and searches for collisions between the two lists. The sending machine then encodes S as a sequence of raw bytes and blocks from T and transmits this, allowing the receiving machine to reconstruct S 9 . The key observation which makes this algorithm practical is that a “rolling” checksum can be used, allowing the checksums of |S| blocks of length B to be computed in O(|S|) time, combined with a stronger checksum10 which is only computed if the rolling checksum matches. This makes it difficult to construct files which “break” rsync, although it is still easy to 7 The name of the algorithm comes from the free implementation written by Andrew Tridgell. In his PhD thesis [57], Andrew Tridgell refers to the patent [46] and describes it as “a somewhat similar algorithm”; however, I have been unable to discern how rsync differs from the algorithm described in that patent. Having no legal expertise, I naturally cannot comment on the validity of the patent, the claims it makes, et cetera. 9 In essence, the receiving machine sends an index to the sending machine, which then uses this in constructing a binary patch; however, due to the large block size (typically 300–700 bytes), the resulting patches are much larger than those constructed by other methods. 10 Here an unfortunate implementation decision was made in rsync: The “strong checksum” chosen is the MD4 [49] algorithm, which is now considered to be entirely broken [14, 58]. Although the author maintains that the cryptographic security is irrelevant [57], we believe this assessment is rather optimistic, since the existence of files which rsync is unable to synchronize breaks one of the widely-held assumptions about the tool. 8 52 construct files which force rsync to take O(B |S|) computational time. While such files are unlikely to occur in practice, they pose a security risk, since anyone who can convince an rsync server to distribute such a file can thereafter execute a very easy denial-of-service attack against the server, simply by using rsync to fetch that file 11 . To see the relevance of rsync to our problem, we first convert it into a three-round protocol known as “reverse rsync”, which was patented [22, 23] and independently reinvented a couple of years later [8]. 
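Before turning to that protocol, note that the “rolling” property above is what keeps the checksum pass linear: the weak checksum of a window can be updated in constant time as the window slides. A sketch in the spirit of rsync’s weak checksum (not its exact definition) is:

    def rolling_checksums(data, B):
        # Weak checksum of every length-B window of `data`, computed in
        # O(len(data)) total time by sliding the window one byte at a time.
        assert len(data) >= B
        M = 1 << 16
        a = sum(data[:B]) % M
        b = sum((B - i) * data[i] for i in range(B)) % M
        sums = [(b << 16) | a]
        for i in range(len(data) - B):
            a = (a - data[i] + data[i + B]) % M      # drop one byte, add the next
            b = (b - B * data[i] + a) % M
            sums.append((b << 16) | a)
        return sums

    # The sending machine looks each value up in a table of the receiver's block
    # checksums, and only computes a strong checksum when the weak one matches.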
Under this protocol, the sending machine first splits S into blocks of length B and computes hashes for each of them (these can be precomputed and stored for later use); these hashes are transmitted, and the receiving machine now computes and compares |T | checksums against this list. Any blocks for which the receiving machine can identify matching blocks from T are copied; a list of the remaining (missing) blocks is sent back. Finally, the sending machine, having received this list of blocks, sends the appropriate bytes. While this has the disadvantage of requiring three rounds rather than two, it has several significant advantages: First, some checksums can be precomputed and stored on disk; second, the most cpu-intensive part of the protocol — comparing the checksums received to the local file — is moved to the client, dramatically increasing the number of connections which can be handled by a single server; and third, the final part of the protocol — sending the requested blocks of S — is in fact a subset of the HTTP/1.1 protocol [16], which provides for a “Content-Range” header allowing individual parts of a file to be requested. Consequently, the “reverse rsync” protocol can be implemented at the sending side by generating a file containing the block checksums, and thereafter using any of the many well-tested, efficient, and secure HTTP servers available. This three-round protocol can be converted into an algorithm for universal delta compression as follows: The checksums are sent as before, but rather than waiting for the receiving machine to provide a list of missing blocks, the sending machine also computes and sends the parity block for S for some convenient cyclic error correcting code. Upon receiving these two components, the receiving machine decodes as follows: Using the checksums as before, it finds blocks in T and places them in the appropriate 11 It is always possible to execute a denial-of-service attack through imposing heavy load on a server by making a large number of requests; such attempts are limited, however, by the attackers’ available bandwidth. This attack is far more serious because it does not require significant bandwidth. 53 positions; once these are in place, the parity block is appended, the missing blocks are labeled as “erasures”, and the error correcting code is decoded, filling in all the gaps between the matched blocks. Since the block size B is much larger than log |S|, this error correcting can be done with a shortened Reed-Solomon code over GF (256B ) in a number of bytes equal to the number of bytes not matched from T . For reasons of efficiency, each B-byte block would best be split into smaller blocks — decoding a length |S| /B Reed-Solomon codeword over GF (256B ) is far slower than decoding B/k independent length |S| /B Reed-Solomon codewords over GF (256k ) — but this does not affect the size of deltas produced12 . For typical files, there is however a difference between the bandwidth used by this one-way rsync and the traditional algorithm: When transmitting the unmatched blocks, rsync normally applies deflate compression, which often reduces the bandwidth used by 50–60%, while the error correcting parity block (or syndrome) will in almost all cases not be compressible in such a manner. This effect could be somewhat mitigated by splitting S not into blocks of constant length B, but rather into blocks which can be individually deflated to the same size. 
The error correcting code would then be applied to these deflated blocks, and S would be reconstructed by re-inflating. While an interesting possibility, this would probably be too complex to be of practical value.

¹² Theoretically, operations in GF(2^k) can be performed in O(k^(1+ε)) time using the FFT; but for k less than several hundred the best implementations are either O(1) after computation of a lookup table of size 2^k (for k up to around 20) or O(k^2) time (for k from around 20 up to several hundred).

Suppose that S can be formed from T by a sequence of Ns substitutions of total length Ls, Ni insertions of total length Li, and Nd deletions (of any length). Then out of the |S|/B length-B blocks in S, at most min(Ls, 2Ns + (Ls − 2Ns)/B) are not in T as a result of substitutions; at most min(Li, 2Ni + (Li − 2Ni)/B) are unmatchable due to insertions; and at most Nd are unmatchable due to deletions. For Ns, Ni, Nd ≪ |S|, the expected number of mismatching blocks is approximately Ns + Ni + Nd + (Ls + Li)/B. Consequently, the transmitted parity block should be of length at least (Ns + Ni + Nd)B + Ls + Li bytes in order to accommodate a typical placement of insertions and substitutions, and if worst-case placement of the insertions and substitutions is to be accommodated, the parity block should be of length at least (2Ns + 2Ni + Nd)B + Ls + Li bytes.

The checksum block needs to contain checksums for |S|/B blocks. In order to avoid excessive computation, the rolling checksum should be of length at least log_2 |T| bits; this will keep the computation of the strong checksum to a reasonable time limit for random files. The strong checksum should contain an additional log_2(|S|/(εB)) bits, where ε is the accepted probability of failure (due to accidental hash collision); together, these two hashes consume log_2(|S||T|/(εB)) bits, for a total of

(|S|/B) · log_256(|S||T|/(εB))

bytes in the checksum block. We note that the danger of malicious inputs causing an increase in the required computational time can be reduced somewhat by using a keyed checksum function (and including the key in the patch file); more significantly, however, the computational burden is entirely at the receiving side, so there is no potential for denial of service.

Making the substitution N = Ns + Ni + Nd for the benefit of clarity, if we take

B = √( (|S|/N) · log_256( |T| √(|S| N) / ε ) )

then we can construct a patch of size

Ls + Li + √( 2 |S| N · log_256( |T|^2 |S| N / ε^2 ) )

which will apply to most files T which differ from S by at most Ns substitutions of total length Ls, Ni insertions of total length Li, and Nd deletions. With a patch of size

Ls + Li + 2 √( |S| N · log_256( |T|^2 |S| N / ε^2 ) )

bytes and a random checksum function, any T (even those with inconveniently placed edits) within that radius can be “patched” with probability at least 1 − ε; the extra factor of √2 arises from the possibility that the insertions and substitutions are located in such a manner as to maximize the number of blocks modified. For random inputs, or if a random keyed hash is used, this operates in O(|S|^(1+ε)) time if fast algorithms are used, or O((|S| N + (Ls + Li) √(|S| N))^(1+ε)) time if classical algorithms are used; in both cases, the implicit constants are much larger on the receiving side than on the sending side, since for the Reed-Solomon codes (which dominate the running time) decoding is significantly more expensive than encoding.
Practical implementations may find a different (and faster) choice of error correcting code more suitable. It is interesting to contrast this against the previous section; while error correcting codes, by themselves, are quite efficient at correcting substitutions but are entirely unable to correct insertions or deletions, this “one-way rsync” approach is able to correct insertions, deletions, and substitutions equally well. For many files (e.g., text, source code, archives) this is sufficient; but some files types, such as executable code, tend to have far more substitutions than indels, and for such files neither of these two approaches will produce very good results. 3.5 Block alignment To fill this gap — that is, to construct universal patches for files which contain some indels, but differ most often by substitutions — we turn once more to the randomized algorithm for matching with mismatches presented in Chapter 1. In addition to the asymptotically superior computational running time of the algorithm, this algorithm has another very important advantage: It operates using an index of size only O(n log(n)/m) floating-point values. In order to take advantage of this, we proceed as follows: Taking B as fixed for the moment, we compute L = 4 |S| log |S|/B. Following the approach taken in Chapter 1, we select random primes p1 , p2 in the interval [L, L(1 + 2/ log L)), and random mappings φ1 , φ2 : GF (256) → {−1, 1}. In “real-world” usage, the strings S, T are most likely non-random; consequently we “mask” characters which occur more often than others: We count the number of times nx that each byte value x √ appears in S, and form φ0i (x) = φi (x)/ nx . We now apply these two mappings and P (i) projections and compute A(1) ∈ RZp1 , A(2) ∈ RZp2 with Aj = k φ0i (Sj + kpi ). (i) Since each Aj is the sum of only |S| /pi + O(1) terms, each coming from a single 56 character, we don’t need to transmit these with arbitrary precision13 ; instead, we find (i) the minimum and maximum values, and scale and round all the Aj to integers in the range [0, 2b ), where b = dlog2 (|S| /L)/2e14 . The sending machine now transmits S̄ = (p1 , p2 , φ1 , φ2 , A(1) , A(2) ) using B )(1 + o(1)) |S| log |S| log2 ( 4 log|S| 2B bytes. Note that, for any given block size B, this index is larger than the checksum block needed by rsync by a factor of B ) 4 log |S| log( 4 log|S| |T ||S| log( B ) — typically around a factor of ten — but has the advantage of being able to match blocks containing substitutions. Upon receipt of this index, the receiving machine performs length-pi forward FFTs on the A(i)15 , splits T into blocks of length B bytes, and for each block computes the vectors B (i) , their Fourier transforms, and thereby the vectors C (i) . Using priority (1) (2) queues, the values 0 ≤ j < n are computed for which Cj +Cj is maximized for each block, indicating that each particular length B block of T probably approximately (1) (2) matches S starting at offset j. Furthermore, the sums Cj + Cj for each block indicate roughly how well the blocks match — not perfectly, but accurately enough to distinguish between a block from T which approximately matches in S and a block from T which does not exist anywhere in S (and therefore obtains an essentially random value j). After computing where each length B block from T best matches against S, we sort these matchings according to their positions in S 16 . 
Where two blocks overlap, (1) (2) we compare the values Cj + Cj which were computed for the two overlapping blocks; the block with the higher score is retained intact, while the other block is 13 Recall that, due to the masking used, these values are no longer integers. This range ensures that the rounding errors are of the same magnitude as the contribution from a single character in S, i.e., negligible. 15 This pre-computation is naturally faster than recomputing the forward FFT each time we wish to compute the convolution of A(i) with B (i) . 16 This allows us to take advantage of blocks which have been re-ordered; if we are concerned solely with edit distance (which does not permit re-orderings) then this step may be omitted. 14 57 shortened (to start or end where the higher scoring matching finishes) or entirely removed (in the case of low-scoring blocks which are completely covered by other blocks). As a result of the parameters chosen (k = 2, L = 4 |S| log |S|/B), any of the length-B blocks from |T | which do not contain any indels (with respect to S) and contain at most B/2 substitutions will be found. As such, placing these blocks in their respective positions will construct a string which differs from S by at most Nd + Ni + Ns substitutions of total length (Nd + Ni )B + Li + 2Ls , since at most Nd + Ni blocks are lost due to indels. The sending machine now transmits a Reed-Solomon parity block containing 2((Nd + Ni )B + Li + Ls )/M + 2Ns symbols (i.e., capable of correcting half that number of errors) over GF (256M ), where M = dlog256 |S|e; this requires 2(Nd + Ni )B + 2Li + 2Ls + 2 dlog256 |S|e Ns bytes. The receiving machine then uses this to correct the errors in the string it had previously constructed: The substitutions (of total length Ls ) are, in the worst case, isolated and thus corrupt one symbol each, while the regions lost due to insertions (of total length Li ) and indels (which damage at most (Nd + Ni )B bytes) are in large blocks and therefore corrupt entire symbols at once17 . p Taking B = |S| log 256/N log |S|, we can therefore construct a patch of size  2Ls + 2Li + 2 log256 |S|(Ns + 2 p  |S| (Nd + Ni ) log 256) (1 + o(1)) bytes which can be applied with probability approaching 1 to any string T which differs from S by a sequence of at most Ns substitutions of total length Ls , Ni deletions of total length Li , and Nd insertions. 3.6 Practical improvements While the above is quite satisfying theoretically — it reduces the “cost” of a bytesubstitution from O((|S| /N )1/2+ ) (where N denotes the total number of indels and substitutions) to O(log |S|) — we can make significant practical improvements. 17 Note that the cost of reconstructing missing data (e.g., insertions) has doubled compared to rsync: In this case, we have not identified such regions as erasures. 58 First, rather than taking L = 4 |S| log |S|/B, we take L = 16 |S| log |S|/B. This allows us to correctly align not only blocks which are indel-free and contain up to 50% mismatches, but also to align part of most blocks containing a single indel and up to 50% mismatches18 : Considering the indel as dividing the block into two parts, at least one of the parts will be large enough (i.e., provide enough “signal”) to be found with high probability. 
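Before continuing, the construction of the index S̄ on the sending machine can be sketched as follows. This is only a sketch of the description above: the prime selection and quantization are simplified, the multiplier on L is left as a parameter (4 in Section 3.5, 16 with the improvement just described), and the receiving side (the FFTs, priority queues, and block matching) is omitted entirely.

    import math, random
    from collections import Counter

    def next_prime(n):
        # Simplified stand-in for picking a random prime in [L, L(1 + 2/log L)).
        def is_prime(m):
            return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))
        while not is_prime(n):
            n += 1
        return n

    def build_index(S, B, c=16):
        # S: the old file (bytes); B: block length used for the new file;
        # c: the multiplier on L (4 in Section 3.5, 16 in Section 3.6).
        n = len(S)
        L = max(16, int(c * n * math.log(n) / B))
        counts = Counter(S)
        index = []
        for _ in range(2):                       # two projections (k = 2)
            p = next_prime(L + random.randrange(max(1, int(2 * L / math.log(L)))))
            # Random +/-1 mapping, masked by 1/sqrt(frequency) so that common
            # byte values do not dominate the projection.
            phi = {x: random.choice((-1.0, 1.0)) / math.sqrt(cnt)
                   for x, cnt in counts.items()}
            A = [0.0] * p
            for j, x in enumerate(S):            # A[j mod p] accumulates phi'(S_j)
                A[j % p] += phi[x]
            # Scale and round to roughly log2(|S|/L)/2 bits per entry.
            lo, hi = min(A), max(A)
            bits = max(1, math.ceil(math.log2(max(2.0, n / L)) / 2))
            scale = ((1 << bits) - 1) / ((hi - lo) or 1.0)
            index.append((p, phi, lo, scale, [round((a - lo) * scale) for a in A]))
        return index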
Conversely, any indel-free region of length at least B is likely to have at least one sub-region correctly aligned — that is, providing that no two indels are located within B bytes of each other, we will have an alignment which is correct apart from the positions of the boundaries between aligned regions, which may be incorrectly placed by up to B/2 bytes in either direction. To improve this alignment, we take each such boundary in turn, and adjust it in order to maximize the number of matching characters19 . While the receiving machine does not have a copy of S and therefore cannot perform this adjustment perfectly, it has A(1) and A(2) , and can act to maximize the dot product of these with the result of mapping φi onto the potentially matching characters. Since each (i) Aj is the sum of contributions from at most |S| /L + O(1) = B/(16 log S̄) + O(1) characters, this corrects the boundaries from being incorrect by an average of B/4 √ bytes to being incorrect by an average of B/(8 2π log S̄) bytes, which is roughly a 50-fold improvement for the sizes of strings which concern us. We20 now have an alignment between S and T which covers S and at least approximately reflects the “correct” alignment between the two strings: Inter-indel regions of length less than B may be missing, and the boundaries between such regions (i.e., the positions of indels) will be not be placed in exactly the correct positions, but the alignment is at least fairly close. We now make one final improvement: Noting that in some areas this alignment will be entirely incorrect (either because the correct alignment has too many indels, or due to insertions), we take each block in turn and shorten it in an attempt to replace probable errors with erasures. This is performed 18 For that matter, an indel-free block containing 75% mismatches is likely to be correctly aligned, but this isn’t useful to us, since the cost of correcting those errors would exceed the cost of reconstructing an erased block. 19 cf. section 2.4. 20 I excuse the (grammatically questionable) use of the first person plural on the basis that I write on behalf of both myself and the computer which is carrying out the algorithm described. 59 by considering again the dot product of the vectors A(i) with the result of mapping φi onto the potentially matching characters, but this time we estimate the contribution which would be obtained from each character if the matching were correct. This cannot be done with any degree of precision, as a result of the weighting applied to each character in constructing the A(i) , but it nevertheless is somewhat useful when there are large regions of S which do not correspond to any part of T . After the adjustment has been performed to the alignment, the errors and erasures are corrected; rather than using a single Reed-Solomon code for this purpose, we refer back to section 3.3 and use a two-phase randomized code, where the number of parity bytes is calculated based on the desired maximum probability of error  and the radius R within which we expect the string T to lie (or, equivalently, the number of errored and erased bytes we expect to find at this point). Finally, we note that for practical purposes, the optimal value of B given above (even if it could be computed in advance) is likely unsuitable: while the method described here operates in almost-linear time at the sender (and is dominated by the cost of Reed-Solomon encoding), it takes O((|S| |T | /B 2 )1+ ) = O((|T | (Ni +Nd ))1+ ) p time at the receiving side. 
Because of this receiver-side cost we instead take, as in Chapter 2, a fixed B = 2√(|S| log |S|), unless we know that the two files are extremely closely related²¹, in which case a larger value may be preferred.

²¹ i.e., unless we have reason to believe that N_i + N_d < log 4 log |S|.

3.7 Performance

To demonstrate the performance of the methods described here, we refer back to the two corpora used in Chapter 2, the first consisting of 15 "upgrades" and the second of 82 "security patches". For each of these pairs of files, we compute:

1. The total number of bytes transmitted (in either direction) by rsync -z, i.e., the normal two-phase rsync algorithm using deflate compression on the unmatched blocks.

2. The minimum size of patch needed by the "one-way rsync" algorithm described in Section 3.4 in order to correctly apply the update with probability of failure ε < 0.001²².

3. The size of the index, the number of errors and erasures, the number of parity bytes needed per 255-byte subsequence (R1) in the error-reduction phase, the number of parity characters needed in the error-correction phase (R2), and finally the minimum necessary total patch size needed by our block matching delta compressor with the improvements listed in Section 3.6 (again, subject to an acceptable probability of failure of 0.001).

²² A probability of failure is inevitable in any practical universal delta compression algorithm; but by sending a strong hash (e.g., SHA1 [38]) along with the patch, failures can normally be recognized and resolved by other mechanisms.

In the case of rsync and one-way rsync, the block size was fixed at 300 bytes; for our block matching delta compressor, where a constant block size would lead to quadratic running-time, the block size was 2√(|S| log |S|). Note that for the two universal delta compressors, the patch sizes computed are minimum sizes — patches which included more parity data would apply with a lower probability of error.
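In this connection, "minimum" is governed by the usual Reed-Solomon decoding bound, the same fact used above when transmitting parity "capable of correcting half that number of errors": with R parity symbols one can correct e symbol errors and f erasures provided 2e + f ≤ R. A trivial sketch follows; the helper names are hypothetical, and the actual accounting (which also includes the index and the two-phase code of section 3.3) is not reproduced here.

    def rs_parity_sufficient(errors, erasures, parity_symbols):
        # Reed-Solomon bound: a codeword with R parity symbols can be decoded
        # provided 2e + f <= R, where e is the number of symbol errors and
        # f the number of erasures.
        return 2 * errors + erasures <= parity_symbols

    def rs_minimum_parity(errors, erasures):
        # Smallest number of parity symbols satisfying the bound above.
        return 2 * errors + erasures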
The exact sizes of patches used will necessarily depend upon the manner in which this is being used: If one has all of the potential files T readily available, and the purpose of using universal delta compression is simply to allow a single file to be broadcast instead of a large number of different patches, then the computed minimum patch sizes would likely be used; if, on the other hand, one has no knowledge of the files T, one would probably send a sequence of patches computed using increasing amounts of parity data (a sketch of this strategy appears below). We expect that, in practice, most uses would fall between these two extremes — that the files T would be unknown to the sending machine, but that it could generate a number of "typical" files T, and use these to estimate the amount of parity data needed.

As in Chapter 2, we compute the average compression as the average value of [patch size]/[original size].

In Table 3.1, we show the results for the 15-element "upgrade" corpus.

                                                                                       ------------------- Block matching -------------------
Program                    Original     bzip2  bsdiff 6  rsync -z  one-way rsync    index  erasures   errors   R1     R2     Total
agrep: 4.0 → 4.1             262144    114388      4265     45721          82752    11848      2823     5020   23   1615     39149
glimpse: 4.0 → 4.1           524288    222548     24642    220324         447340    18299     20769    38380   69   4818    171675
glimpseindex: 4.0 → 4.1      442368    193883     16240    193090         385975    16458     19760    28319   65   3753    137678
wgconvert: 4.0 → 4.1         368640    157536     12432    153151         313492    13725      9775    19873   53   2992     97095
agrep: 3.6 → 4.0             262144    114502     44327    120856         244152    11953    145889    22201  214   3149    239245
glimpse: 3.6 → 4.0           524288    222178    109680    236643         492940    16011    371685    38708  244   4036    527505
glimpseindex: 3.6 → 4.0      442368    193892     80447    205472         412375    15452    268397    47393  238   4533    438582
netscape: 3.01 → 3.04       6250496   2396661    212032   1996455        4349706    86789    817578   439258   93  37019   2368208
gimp: 0.99.19 → 1.00.00     1646592    642725    219684    634918        1390304    31875    964476   122590  214  11781   1443340
iconx: 9.1 → 9.3             548864    233056     31632    240298         503250    18087    106942    36012  108   5358    253337
gcc: 2.8.0 → 2.8.1          2899968    708301     88022    826093        2644877    42320    392902   170194   87  18653   1080736
rcc (lcc): 4.0 → 4.1         811008    221826       187     18482          24058    20850         0       80    0    231     21399
apache: 1.3.0 → 1.3.1        679936    180708     25927    180729         581319    19677    159302    53692  127   6385    373551
apache: 1.2.4 → 1.3.0        671744    179369    163249    212517         662780    19636    555078    93948  255      0    691561
rcc (lcc): 3.2 → 3.6         434176    155090     22691    152800         362140    17628    103265    31103  124   4319    238518
Average Compression             100%    36.22%     7.67%    33.54%         78.03%                                           52.51%

Table 3.1: Bandwidth used by bzip2, (two-way) rsync, one-way rsync, and our block matching universal delta compressor for "upgrade" patches.

Two points are immediately evident: First, as we expect from the structure of executable files, even with a block size of 300 bytes there are very few matching blocks for rsync to utilize. As shown by the size of "one-way rsync" patches, only about 25% of the blocks can be matched; the remaining 75% are encoded literally. These few matching blocks allow rsync -z to narrowly beat bzip2 overall, but the vast majority of the compression it achieves arises from the deflation of the literal data. Second, while our block matching universal delta compressor performs quite poorly overall (and in two cases fails to achieve any compression at all), it performs relatively well in cases where the differences are modest: On the first four pairs of files, it generates smaller patches than rsync, and in a few other cases it is fairly close to rsync in performance. Here again we note that rsync has the advantage of being able to deflate literal data — were it not for that ability, our block matching delta compressor would produce smaller patches in almost all cases. Consequently, the best approach for "universal" delta compression in the case of such significant differences is a depressing solution: Apply normal bzip2 compression, and ignore the old files entirely. Finally, as expected, neither rsync nor either of the universal delta compressors achieves performance even remotely approaching that of bsdiff 6.

In Table 3.2, we show the results for the 82-element "security patch" corpus, and the situation is quite changed. Both rsync and one-way rsync perform significantly better, matching roughly 50% of blocks. While one-way rsync still performs on average worse than bzip2, it produces smaller patches on three file types — library archives, C header files, and (textual) configuration files — which all share the property of not being affected by the "cascade of differences" introduced by linking together executable code²³. Of greater note is the performance of our block matching universal delta compressor: On all the file types, it performs roughly 3 times better than bzip2, and with the exception of the text files (where rsync has an advantage as a result of using a much smaller index) it uses considerably less bandwidth than rsync, despite lacking the advantages of feedback and deflation of literal data.

²³ Address changes are still spread throughout any individual modified object file, but not into other object files contained in the same library archive.
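Returning to the case in which the files T are unknown to the sender, the strategy of sending "a sequence of patches computed using increasing amounts of parity data" might be sketched as follows; every name here (make_patch, deliver, applied_ok) is hypothetical, and this illustrates the strategy rather than the software described in this chapter.

    def send_with_increasing_parity(make_patch, deliver, applied_ok,
                                    base_parity, max_parity):
        # Send patches with successively more parity data until the receiver
        # reports success (e.g. by checking a strong hash of S, cf. the footnote
        # above), or until it becomes cheaper to send the whole file.
        parity = base_parity
        while parity <= max_parity:
            deliver(make_patch(parity))
            if applied_ok():
                return parity
            parity *= 2
        return None   # give up: transmit the new file in its entirety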
Again, however, none of the universal delta compressors approach the performance of bsdiff 6; in fact (referring back to Table 2.2), even such a poor delta compressor as Xdelta produces smaller patches.

File type                # files  Total size       bzip2   bsdiff 6   rsync -z  one-way rsync  block matching
Executables (static)          53    15056508     6725178     250509    5356778       10106284         2273899
Executables (dynamic)         16     8002788     3399743     128859    2549330        4999430         1100731
Library archives               7     7600906     1635094      16476     502267        1288456          489293
Shared Object                  3     1438072      599937      13338     396485         752056          180050
C headers                      1       11402        3810        126        706           1042            1581
Configuration                  2       10183        4407        214        989           1619            1421
Total                         82    32119859    12368169     409522    8806555       17348887         4046975
Average Compression                      100%      38.51%      1.27%     27.42%         54.01%          12.60%

Table 3.2: Bandwidth used by bzip2, (two-way) rsync, one-way rsync, and our block matching universal delta compressor for "security" patches.

While the compression performance makes this approach at the very least interesting, it is not yet clear whether the computational performance will be adequate for widespread usage. Although the computation necessary for aligning the two files is not excessive, it is vastly more expensive than that involved in rsync or the application of normal deltas; furthermore, while the error-reducing step (requiring only Reed-Solomon decoding of 255-byte codewords) is quite fast²⁴, the final error-correcting stage is likely to be too slow unless it is replaced by a different error-correcting code. In the end, the practicality of this algorithm will most likely be determined more by future advances in processors and networks than by its own merits. If the present trend where networks are increasing in speed faster than processors continues, it is possible that the entire field of delta compression will become obsolete, as it becomes increasingly simple to retransmit entire files; if, on the other hand, computing capacity starts to increase at a greater rate²⁵, then this algorithm, regardless of its current computational requirements, may become entirely practical.

²⁴ Commercially available libraries cite performance of several megabytes per second on commodity processors.
²⁵ While recent developments seem to indicate a slowing down of advances in processor speed, the trend towards increased internal parallelism may, at least for some problems, reverse this trend.

Appendix A: Source code and data

The "upgrade" and "security" corpora used in Chapter 2 and Chapter 3 were obtained from the paper by Baker, Manber and Muth [6], via Robert Muth, and from FreeBSD Update [44] respectively, and may be obtained from the author¹.

¹ We note that the binaries were originally distributed under a variety of licenses; but we believe that their inclusion in delta compression corpora, and their use for the purpose of evaluating the effectiveness of delta compression algorithms, falls well within the legal definitions of "fair dealing" or "fair use", thus exempting them from normal copyright restrictions.

An earlier version of the software described in Chapter 2, namely bsdiff 4.2, has been published under an open source license and is available (at least at the time of publication) from the following web site:

http://www.daemonology.net/bsdiff/

Under the statutes of Oxford University [42], the remaining software may be claimed by the University if it 'may reasonably be considered to possess commercial potential'. At present, it is not yet clear if the University will so elect. If the University does not claim ownership of the remaining software, it will also be released under an open source license after it has been "cleaned up" somewhat and prepared for widespread usage.
Epilogue: A Rapid String Similarity Metric

In Chapter 1 we introduced the approach of projecting vectors onto subspaces of relatively prime dimensions and used it to construct an algorithm for matching with mismatches. Lest the reader be left with the impression that this approach is useful solely for this single problem, we present one final "late-breaking" algorithm.

Recall from Chapter 1 that the match count vector V ∈ R^n is defined by

    V_i = sum_{j=0}^{m-1} δ(S_{i+j}, T_j)

for some appropriate function δ(x, y). Consider how this vector behaves when the two strings S, T differ by a small number of indels (and up to a constant proportion of substitutions). The vector V will have a few very large values, at positions corresponding to the offsets of the indel-free blocks which match between the two strings, and will otherwise have values which cluster around µm, where µ is the mean value of δ when applied to random inputs. Now consider the variance

    σ²V = sum V_i² / (m − 1) − (sum V_i)² / (m(m − 1)).

For pairs of strings which are similar (or rather, which share large indel-free regions), the positions in V which have unusually large values will translate directly into a large variance, whereas for strings which are dissimilar, the variance will be comparatively small. Now project V onto a subspace R^{Z_p}. The "noise threshold" is increased, but again for sufficiently large p, similar strings will result in a larger variance σ²(R^n → R^{Z_p})V than dissimilar strings.

But wait! The vector V can be estimated as the cyclic correlation of two vectors A = φ(S) and B = φ(T). The cyclic correlation is computed as the inverse Fourier transform of the pointwise product of the Fourier transformed inputs¹. And the variance of an inverse Fourier transform is the sum of the squared norms of the non-DC components². We thus have the following:

¹ Subject to the second string being reversed, naturally.
² Recall that the Fourier transform is a rotation in R^n, i.e., it preserves 2-norm length.

Algorithm. Let some φ : Σ → {−1, 1}, p ∈ N be fixed. Then given a string S, we compute:

1. For j = 0 . . . p − 1, A_j = sum_k φ(S_{j+kp})

2. For j = 0 . . . p − 1, Ā_j = sum_k A_k exp(2ijkπ/p)

3. For j = 1 . . . (p − 1)/2, S̄_j = Ā_j² / sqrt( sum_{j=1}^{(p−1)/2} Ā_j⁴ )

Then the dot product S̄ · T̄ of the "digests" of two random strings S, T will be approximately equal to 1/2, the dot product S̄ · S̄ of the digest of a string with itself is equal to 1, and other pairs of strings will lie in between, in accordance with their similarity.

Proof. It suffices to take the first half of the Fourier transform (i.e., the terms Ā_1 . . . Ā_{(p−1)/2}), since for real inputs (such as we have) the remaining terms are simply the conjugates with the same (Euclidean) norms. That random strings result in a dot product of 1/2 follows from consideration of the Ā as being random vectors of fixed norm; the remaining result follows from the argument above concerning the influence of large indel-free regions on the match count vector V.
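As a concrete rendering of the three steps above, the following is a minimal sketch in Python (using numpy); the particular choices of φ and of p here are placeholder assumptions, and since only the magnitudes of the Ā_j are used, the sign convention of the DFT is immaterial.

    import numpy as np

    def digest(S, p):
        # Similarity digest of a byte string S, following steps 1-3 above.
        # phi maps each byte to +/-1; the low-bit choice below, like the value
        # of p, is an arbitrary placeholder.
        phi = np.where(np.frombuffer(bytes(S), dtype=np.uint8) & 1, 1.0, -1.0)
        # Step 1: fold the string onto Z_p:  A_j = sum_k phi(S_{j+kp})
        A = np.zeros(p)
        np.add.at(A, np.arange(len(phi)) % p, phi)
        # Step 2: discrete Fourier transform of A
        Abar = np.fft.fft(A)
        # Step 3: non-DC half, squared magnitudes, normalized to a unit vector
        half = np.abs(Abar[1:(p - 1) // 2 + 1]) ** 2
        return half / np.sqrt(np.sum(half ** 2))

    def similarity(S, T, p=8191):
        # Dot product of digests: ~1 for identical strings, ~1/2 for unrelated ones.
        return float(np.dot(digest(S, p), digest(T, p)))

    # e.g. similarity(open("old.bin", "rb").read(), open("new.bin", "rb").read())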
Bibliography

[1] K.A.S. Abdel-Ghaffar and A. El Abbadi. An optimal strategy for comparing file copies. IEEE Transactions on Parallel and Distributed Systems, 5:87–93, 1994.

[2] M. Adler and J. Gailly. zlib compression library, 2003. http://www.zlib.org.

[3] M.J. Atallah, F. Chyzak, and P. Dumas. A randomized algorithm for approximate string matching. Algorithmica, 29(3):468–486, 2001.

[4] K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A note on randomized algorithm for string matching with mismatches. In Proceedings of the Prague Stringology Conference '02 (PSC'02), pages 9–17, 2002.

[5] R.A. Baeza-Yates and C.H. Perleberg. Fast and practical approximate string matching. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching (CPM '92), LNCS 644, pages 185–192. Springer-Verlag, 1992.

[6] B.S. Baker, U. Manber, and R. Muth. Compressing differences of executable code. In ACM SIGPLAN Workshop on Compiler Support for System Software (WCSS), pages 1–10, 1999.

[7] L.I. Bluestein. A linear filtering approach to the computation of the discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-18(4):451–455, 1970.

[8] G. Brederlow. reverse checksumming. rsync@lists.samba.org mailing list, April 2001.

[9] R.P. Brent, J. van de Lune, H.J.J. te Riele, and D.T. Winter. On the zeros of the Riemann zeta function in the critical strip. II. Mathematics of Computation, 39(160):681–688, 1982.

[10] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. SRC Research Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.

[11] W. Chang and T. Marr. Approximate string matching and local similarity. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM '94), LNCS 807, pages 259–273. Springer-Verlag, 1994.

[12] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.

[13] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.

[14] H. Dobbertin. Alf swindles Ann. RSA Laboratories CryptoBytes, 1(3), 1995.

[15] P. Dusart. The kth prime is greater than k(ln k + ln ln k − 1) for k ≥ 2. Mathematics of Computation, 68(225):411–415, 1999.

[16] R. Fielding, J. Gettys, J. Mogul, H. Nielsen, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext transfer protocol — HTTP/1.1. RFC 2616, 1999.

[17] M.J. Fischer and M.S. Paterson. String matching and other products. In R. Karp, editor, Complexity of Computation — SIAM-AMS Proceedings, volume 7, pages 113–125, 1974.

[18] FreeBSD Project. The FreeBSD operating system. http://www.freebsd.org/.

[19] FreeBSD Project. FreeBSD CVS repository, 2004. http://www.freebsd.org/cgi/cvsweb.cgi/.

[20] FreeBSD Project. FreeBSD security information, 2004. http://www.freebsd.org/security/.

[21] P. Graham. A plan for spam, 2002. http://www.paulgraham.com/spam.html.

[22] C.J. Heath and P. Hughes. Data file synchronization. Australian Patent 763524, 1999.

[23] C.J. Heath and P. Hughes. Data file synchronization. United States Patent 6,636,872, 2000.

[24] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[25] J.J. Hunt, K.-P. Vo, and W.F. Tichy. Delta algorithms: an empirical analysis. ACM Transactions on Software Engineering and Methodology, 7(2):192–214, 1998.

[26] J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages, and Programming, LNCS 2719, pages 943–955. Springer-Verlag, 2003.
[27] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM '01), LNCS 2089, pages 181–192. Springer-Verlag, 2001.

[28] V.J. Katz. A history of mathematics. Harper Collins, 1993.

[29] D.K. Kim, J.S. Sim, H. Park, and K. Park. Linear-time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, LNCS 2676, pages 186–199. Springer-Verlag, 2003.

[30] D.E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison Wesley, third edition, 1997.

[31] D.E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison Wesley, third edition, 1997.

[32] P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, LNCS 2676, pages 200–210. Springer-Verlag, 2003.

[33] D.G. Korn and K.-P. Vo. Engineering a differencing and compression data format. In Proceedings of the 2002 USENIX Annual Technical Conference, pages 219–228, 2002.

[34] N.J. Larsson and K. Sadakane. Faster suffix sorting. Technical Report LU-CS-TR:99-214, Department of Computer Science, Lund University, 1999.

[35] J.P. MacDonald. File system support for delta compression. Master's thesis, University of California at Berkeley, 2000.

[36] P. Mackerras and A. Tridgell. The rsync algorithm. Technical Report TR-CS-96-05, Department of Computer Science, Australian National University, 1996.

[37] Microsoft Corporation. Microsoft Portable Executable and Common Object File Format Specification, Revision 6.0, 1999.

[38] National Institute of Standards and Technology. The secure hash algorithm (SHA-1). NIST FIPS PUB 180-1, U.S. Department of Commerce, 1995.

[39] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.

[40] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.

[41] A. Orlitsky. Interactive communication: Balanced distributions, correlated files, and average-case complexity. In IEEE Symposium on Foundations of Computer Science, pages 228–238, 1991.

[42] Oxford University. University statutes and congregation regulations, Statute XVI, Part B: Intellectual property, 2003. http://www.admin.ox.ac.uk/statutes/.

[43] S. Peleg. Difference extraction between two versions of data-tables containing intra-references. United States Patent 6,546,552, 2003.

[44] C. Percival. An automated binary security update system for FreeBSD. In Proceedings of BSDCon '03, pages 29–34, 2003.

[45] Pocket Soft Inc. .RTPatch, 2001. http://www.pocketsoft.com.

[46] C.F. Pyne. Remote file transfer method and apparatus. United States Patent 5,446,888, 1995.

[47] C.F. Pyne. Remote file transfer method and apparatus. United States Patent 5,721,907, 1998.

[48] I.S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8(2):300–304, 1960.

[49] R. Rivest. The MD4 message-digest algorithm. RFC 1320, 1992.

[50] R. Rivest. The MD5 message-digest algorithm. RFC 1321, 1992.

[51] J. Seward. bzip2, 2002. http://sources.redhat.com/bzip2/.
[52] M.V. Sliger, T.D. McGuire, and J.A. Forbes. File update performing comparison and compression as a single process. United States Patent 6,496,974¹, 2002.

[53] M.V. Sliger, T.D. McGuire, and R.M. Shupak. Preprocessing a reference data stream for patch generation and compression. United States Patent 6,466,999, 2002.

[54] Sun Tsu, Suan-Ching, 4th century AD².

[55] W.F. Tichy. RCS – a system for version control. Software – Practice & Experience, 15(7):637–654, 1985.

[56] D. Trendafilov, N. Memon, and T. Suel. zdelta: An efficient delta compression tool. Technical Report TR-CIS-2002-02, Polytechnic University, CIS Department, 2002.

[57] A. Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, The Australian National University, 1999.

[58] X. Wang, D. Feng, X. Lai, and H. Yu. Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD. Cryptology ePrint Archive, Report 2004/199, 2004.

¹ We are surprised that this patent was granted, given that it does not appear to cover any methods not previously published in [25].
² A more accessible description of this algorithm is presented in [30]. For more historical background, the reader is encouraged to consult [28].