Git’s “similarity index” computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.
To compute the similarity index for two files S (source) and D (destination), Git:
- reads both files
- computes a hash table of all of the chunks of file S
- computes a second hash table of all of the chunks of file D
Each entry in these two hash tables is simply a count of the occurrences of that hash value (plus, as noted below, the length of the chunk).
The hash value for a file chunk is computed by:
- start at the current file offset (initially zero)
- read 64 bytes or until a '\n' character, whichever occurs first
- if the file is claimed to be text and there is a '\r' before the '\n', discard the '\r'
- hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file
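The chunking steps above can be sketched roughly as follows. This is an illustration only, not Git's code: the names `sketch_next_chunk` and `sketch_hash` are hypothetical, the hash is a stand-in (FNV-1a) rather than the algorithm in diffcore-delta.c, and the sketch assumes the terminating '\n' stays part of the chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_MAX 64 /* maximum chunk length, per the description above */

/* Stand-in hash (FNV-1a). This is NOT the hash Git uses. */
static uint32_t sketch_hash(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Copy the next chunk of buf[*pos .. size) into out[] (room for CHUNK_MAX
 * bytes) and advance *pos. Returns the chunk length, or 0 at end of input.
 */
static size_t sketch_next_chunk(const unsigned char *buf, size_t size,
                                size_t *pos, int is_text, unsigned char *out)
{
    size_t n = 0;

    assert(buf && pos && out);
    while (*pos < size && n < CHUNK_MAX) {
        unsigned char c = buf[(*pos)++];

        /* Text files only: drop a CR that is immediately followed by LF. */
        if (is_text && c == '\r' && *pos < size && buf[*pos] == '\n')
            continue;
        out[n++] = c;
        if (c == '\n')
            break; /* a newline ends the chunk early */
    }
    return n;
}
```

Calling `sketch_next_chunk` in a loop until it returns 0, and passing each chunk to `sketch_hash`, yields the per-chunk hash values that would be counted in the two tables.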
Now that there are hash tables for both S and D, each possible hash value h_i appears n_S times in S and n_D times in D (either may be zero, though the code skips right over hash values where both are zero). If the number of occurrences in D is less than or the same as the number of occurrences in S (i.e., n_D ≤ n_S), then D “copies from S” n_D times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a “literal add” of n_D - n_S occurrences of the hashed chunk, and D also copies all n_S original occurrences.
Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of “chunks” to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only an indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)
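As a rough sketch of the copy/add accounting just described (not Git's actual data structures): the hypothetical `struct sketch_entry` below holds, for one hash value, the occurrence counts in S and D plus the chunk length, and `sketch_count_changes` turns a list of such entries into byte totals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-hash record; Git's real hash table is laid out differently. */
struct sketch_entry {
    unsigned long n_src;     /* occurrences of this chunk in S */
    unsigned long n_dst;     /* occurrences of this chunk in D */
    unsigned long chunk_len; /* bytes in the chunk */
};

static void sketch_count_changes(const struct sketch_entry *e, size_t n,
                                 unsigned long *src_copied,
                                 unsigned long *literal_added)
{
    unsigned long copied = 0, added = 0;

    assert(src_copied && literal_added);
    for (size_t i = 0; i < n; i++) {
        if (!e[i].n_src && !e[i].n_dst)
            continue; /* hash occurs in neither file */
        if (e[i].n_dst <= e[i].n_src) {
            /* D copies from S once per occurrence in D */
            copied += e[i].n_dst * e[i].chunk_len;
        } else {
            /* D copies every occurrence in S and adds the surplus */
            copied += e[i].n_src * e[i].chunk_len;
            added += (e[i].n_dst - e[i].n_src) * e[i].chunk_len;
        }
    }
    *src_copied = copied;
    *literal_added = added;
}
```

Note that an entry with `n_dst == 0` (a pure deletion) contributes nothing to either total, which matches the parenthetical remark about deletions.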
These two values (src_copied and literal_added), computed in diffcore_count_changes, are handed over to the function estimate_similarity in diffcore-rename.c. It completely ignores the literal_added count (this count is used in deciding how to build packfile deltas, but not in rename scoring). Instead, only the src_copied number matters:
score = (int)(src_copied * MAX_SCORE / max_size);
where max_size is the size in bytes of the larger of the two input files S and D.
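To make the scaling concrete, here is a minimal sketch of that one-line formula as a standalone function. The name `sketch_score` and the macro `SKETCH_MAX_SCORE` are hypothetical; the value 60000.0 is an assumption chosen to match the "30000 for 50%" figure in this answer.

```c
#include <assert.h>

#define SKETCH_MAX_SCORE 60000.0 /* assumed value of MAX_SCORE */

/* Scale the copied-byte count against the larger file's size. */
static int sketch_score(unsigned long src_copied, unsigned long max_size)
{
    assert(max_size > 0); /* avoid dividing by zero for two empty files */
    return (int)(src_copied * SKETCH_MAX_SCORE / max_size);
}
```

Under that assumption, 750 copied bytes against a larger file of 1000 bytes give a score of 45000, i.e. a similarity index of 75%.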
Note that there is an earlier computation:
max_size = ((src->size > dst->size) ? src->size : dst->size);
base_size = ((src->size < dst->size) ? src->size : dst->size);
delta_size = max_size - base_size;
and if the two files have changed size “too much”:
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE) return 0;
we never even get into the diffcore-delta.c code to hash them. The minimum_score here is the argument to -M or --find-renames, converted to a scaled number. MAX_SCORE is 60000.0 (type double), so the default minimum_score, when you use the default -M50%, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.
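The size shortcut can be tried out in isolation with a sketch like the one below. The helper names are hypothetical, the percent-to-score conversion is a simplification of whatever parsing Git does for -M, and 60000.0 is again assumed for MAX_SCORE.

```c
#include <assert.h>

#define SKETCH_MAX_SCORE 60000.0 /* assumed value of MAX_SCORE */

/* Simplified conversion of a whole-number percentage to a scaled score. */
static int sketch_percent_to_score(int percent)
{
    assert(percent >= 0 && percent <= 100);
    return (int)(percent * SKETCH_MAX_SCORE / 100);
}

/* Returns 1 if the size check alone rejects the pair, 0 otherwise. */
static int sketch_size_rejects(unsigned long src_size, unsigned long dst_size,
                               int minimum_score)
{
    unsigned long max_size = (src_size > dst_size) ? src_size : dst_size;
    unsigned long base_size = (src_size < dst_size) ? src_size : dst_size;
    unsigned long delta_size = max_size - base_size;

    return max_size * (SKETCH_MAX_SCORE - minimum_score) <
           delta_size * SKETCH_MAX_SCORE;
}
```

With a threshold of 50%, the inequality reduces to delta_size > max_size / 2, so a 1000-byte file paired with a 400-byte file is rejected without hashing, while 1000 bytes paired with 500 bytes still goes on to the full comparison.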
[Edit: this is now obsolete:] git status always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were, the code would go here, setting the rename_score field of the diff options.
Until Git version 2.18.0, there was no way to control this for git status. In Git 2.18.0 and later, git status has the same --find-renames option as git diff. The status.renames option in the Git configuration enables any default detection, and if unset, git status obeys the diff.renames setting; see the git config documentation and the git status documentation.