Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Colliding with the SHA prefix of Linux's initial Git commit (kernel.org)
227 points by 2bluesc on Dec 30, 2024 | hide | past | favorite | 64 comments


Great example of Hyrum's Law, https://www.hyrumslaw.com/.

Comments about SHA256 are irrelevant - you can misuse the prefix of a SHA256 hash just as easily. The issue is that people got used to human-readable hash prefixes of 10-12 characters as "unique" for all intents and purposes, despite the fact that there were never any uniqueness guarantees for prefixes and git has always handled collisions with short object IDs as ambiguous - it's just that it's so rare to happen in the real world that lots of script writers treated that "mostly unique" as a guarantee.

IMO support for short object IDs is a mistake, as is any behavior that "works this way 99.999% of the time, but hey developer don't forget you need to also code for that .001% edge case". I'm always just copying and pasting things around anyway, so it really doesn't make much difference to me if I'm copying 12 chars or 64.


>so it really doesn't make much difference to me if I'm copying 12 chars or 64.

It doesn't have to be 12 characters. I often type short git hashes - maybe 5 characters - when jumping between commits.


First twelve and last twelve characters are the same:

    $ echo -n retr0id_662d970782071aa7a038dce6 | sha256sum
    307e0e71a409d2bf67e76c676d81bd0ff87ee228cd8f991714589d0564e6ea9a  -
    
    $ echo -n retr0id_430d19a6c51814d895666635 | sha256sum
    307e0e71a4098e7fb7d72c86cd041a006181c6d8e29882b581d69d0564e6ea9a  -
* Via: https://news.ycombinator.com/item?id=38668893


Oh hey, it's me. Shortly after then I did a write-up on how I crafted them, which was discussed here: https://news.ycombinator.com/item?id=38718314


Excellent article. Thanks for resharing it here.


There were some plans to migrate to SHA256, but somehow it still hasn't happened.

The practical upshot is a git commit hash is not enough l to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger. This is particularly important for tools that rely on it for dependency management, local caches, etc.


TFA has nothing to do with SHA-1 or SHA-1 collisions. It's about abbreviated hash values introduced for readability by humans. Now these values are used by auxiliary scripts. Which again has little to do with git proper. It's just what the kernel community writes into commit messages and what scripts they use to parse those messages.


> The practical upshot is a git commit hash is not enough l to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger

Really now? Mind if I challenge you?

I have on my machine a git repo with commit '75eb4e3b1369706a4dcd61cc80e49660ac341ea4'.

If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.


> If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.

Calculating that SHA1 collision is going to be a bit more expensive than $10k, by a couple of orders of magnitude.

Finding it in the wild is improbable, but calculating it is definitely possible, and has been done before. http://shattered.io/


Shattered didn't produce a collision for an arbitrary hash, it produced two documents with the same hash (which is a slightly easier problem, about 100,000x faster).

SHA1 is certainly insecure at this point, but not even close to trivially so.


Indeed. We can't even do this for md5, let alone sha1.

Preimage attacks are very different from collision attacks.


That is enough to distribute malicious code though, at least in certain scenarios. Someone might create a setup where reviewers check/sign one version of the source code, and what gets distributed is another version with the same hash.


Code review in the Linux kernel still happens by email to a large degree.

Further up in contribution tree there is additional signing. Would that further complicate the insertion of a false commit? I am not convinced that signing is used all the way down to every contribution.


Linux probably has enough eyeballs on its source to make attacks like that unlikely anyway, but Git isn't just used by Linux.


Can you create a proof of concept and show it here?


What's your point?


My point is that you need to put the money where your mouth is.


You're talking about hundreds of millions of dollars to calculate that. "Put your money where your mouth is".

We know it's theoretically possible. And we also know that this theoretical possibility is within the reach of a couple of countries.


If your goal was to prove that SHA1 collisions are unimportant, far too hard for any group to exploit within the next X years of processing improvements... That means math.

In contrast, this "challenge" stuff is just chasing outage endorphins and internet points.

Think it through, and it's pointless. Any refusal or negative result is utterly compromised and confounded by things like: How trustworthy you appear; whether the amount is reasonable; whether the random commenter has the skillset, free time, and financial assets to try; whether they're part of a larger group they can recruit; etc.


My goal is to see the actual proof of concept that whatever the person I replied to is feasible. Not the daily BS from security wannabes that start with "In certain scenarios it is possible to X and Y" and then never show proof.

"In certain scenarios I could be a ninja": it means absolutely nothing without proving that I actually have the skills and I could actually use them.

It is not pointless, but if you claim something show the proof.


The math is the proof of concept when an attack costs that much money to pull off. Or the various papers that show successful attacks on reduced-round versions of the hash.

Do you not accept those? What would you accept as a proof of concept?


I expected a proof of concept for this statement:

_That is enough to distribute malicious code though, at least in certain scenarios. Someone might create a setup where reviewers check/sign one version of the source code, and what gets distributed is another version with the same hash._


Well the proof of concept without actually having two colliding files is really simple, so I thought it was generally understood.

Here's the easiest to explain way: Upload the malicious version of the file to github. Send an innocuous patch to the kernel devs that creates a file with the same hash. It gets accepted, and anyone that downloads the kernel from github gets the malicious version. Done. That's a small fraction of linux downloaders, but this is just the proof of concept.


A proof of concept became much easier with C11 unicode identifiers, and email patch review. You can trivially hide Cyrillic chars eg. between whitespace changes or other trivial "optimizations". Even without collisions.

And with the current surge of GPU's even collisions are realistic now. The H100's are not doing much when not in training.


>which is a slightly easier problem, about 100,000x faster

Where did you get this number from? I was under impression that this is completely infeasible (just like we can generate a collision good md5 in seconds, but we still can't do a preimage attack).


The is not a full git commit hash collision. It has to do with a git note which only needs to matche a 12 character prefix of the git commit.


While you corrected one mistake you added a new one:)

Those are git trailers, see git-interpret-trailers(1).

git-notes(1) is something completely different and not used by the kernel.


people don't really care, because current collision methods are mitigated.

Git does not actually use "sha1", despite what all the docs say, it uses "sha1dc", which is just like sha1 except for inputs which can cause collisions, in which case it either fails with clear error message or returns completely different value.

https://news.ycombinator.com/item?id=17825441

so don't worry, git hashes _are_ enough to know you are distributing/executing the legitimate code.

(not to mention you need a preimage attack to replace known commit, and this is not yet possible with sha1)


From your link:

>In this case "hash" will be the same as SHA1(input) in all cases, except those where the input is detected to be malicious (as in the SHAttered attack)

I don't see how this can be more than a fundamentally forward-incompatible sticking plaster over the problem. The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?) -- it's that today's SHA1DC() implementation can only detect and optionally correct today's known attacks, so each new attack necessitates a new, incompatible version of SHA1DC().


Each new _unrelated_ SHA1 attack will need an update in SHA1DC. But it has to be truly unrelated, as the collision detection method is fairly robust. I recommend reading the original "Counter-cryptanalysis" paper [0] for details on how attacks work and how they can be mitigated (there is certain internal state in SHA1 that is used in all known attacks). BTW, this paper has an interesting anecdote: apparently Flame malware had exploited MD5 collisions using novel unpublished attack method... and yet it was detected by collision detector (section 3.2). Another example is that SHA-mbles attack, published 3 years after "Counter-cryptanalysis", was detected as well, with no required code changes.

No, there is nothing "fundamentally incompatible" in the new SHA1DC method. After all, git came out in 2005, 11 years before SHA1 attacks were known, so it used regular SHA1. The collision detector was added in 2017 and nothing broke, because false positive chance is 2^-90 [1].

I have not heard of any new SHA1 collision results, but if they are based on no-difference differential paths, git has nothing to worry about. And if they are not, it may be possible to extend DC detector to seamlessly detect and prevent those attacks, and then only upgrade git clients, keeping backward and forward compatibility for data.

Of course there is always a chance that someone will come out with all-new SHA1 preimage attack that cannot be detected without high rate of false positive, so it's prudent to switch git to sha256. There is a lot of work being done: git's sha256 mode went out of beta in 2.42 (2023), but neither github nor gitlab support it.

But since the current state of git's sha is that there is nothing broken, and git git commit hash _is_ "enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger", there is no real pressure.

[0] https://marc-stevens.nl/research/papers/C13-S.pdf

[1] https://github.com/cr-marcstevens/sha1collisiondetection


> The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?)

Its not detecting "intent" it is detecting that the hash is one vulnerable to the attack, which is extremely unlikely to happen by accident, so if you see it you can assume malice.

It might be a band-aid, and sha256 is certainly a much better solution, but its more robust than it sounds at first glance (since it sounds crazy at first glance)


Sure it's plaster, but plaster can last a good while. Sufficiently new attacks don't come around all that often, and every hash has a risk of new attacks showing up.


fwiw, I wasn't too familiar with this usage, looked it up and sticking plaster is in the sense of bandaid here https://en.wikipedia.org/wiki/Adhesive_bandage

But agreed that it's a rather tougher patch than I'd originally thought, reading elsewhere in this thread.


Switching to SHA-256 and switching to longer substrings of hashes for identification are basically orthogonal problems. The former is hardly going to help with the latter, except in the we already broke everything so why not take the chance to break some more sense.


Compatibility between remotes using one or the other hasn't arrived yet, and git doesn't want to break compatibility. But you can create SHA256 one's today. [0]

[0] https://lwn.net/Articles/898522/


The hash space is atoms-in-the-universe range; this is a collision in a much, much smaller subset of that space


I think multiple hashes is the way to go to avoid collisions. it can even be something simple like md5. the chances of finding a collision that matches two or more algorithms is near impossible. Obviously that doesn’t work for passwords, but for verifying that data hasn’t been tampered with, it works.


Relatedly: Kees's keynote on Linux security from a month ago was great: https://www.youtube.com/watch?v=orO8czP5Bxw


Presumably the problem is that these tools only take the abbreviated hash into account. Not also the subject:

    <abbrev. hash> ("<subject>")
You also have another data point. You only need to search in the history from the commit that you are reading. Assuming that the "Fixes" commit is an ancestor of the commit whose commit footer you are reading.

I always just assumed that tools would take all the data into account. Which means that you both need to collide with the abbreviated hash as well as the subject. Now I don't do that since I just copy-paste the hash, but I would quickly notice in case the subject is different (and likely the commit message and the diff just look irrelevant).

I don't understand why the Linux Kernel has this hard-coded rule[2] -- again, you were going to get collisions eventually, so the tools should have just taken all the data into account (at least the subject) from the start. The recommendation in the Git project is to use `git show -s --pretty=reference`, without any fiddling with the abbreviation:

    <abbrev. hash> (subject, ISO date)
Although the Git maintainer uses `--abbrev=8` since git-show will just use a longer abbreviation in case the output would be ambiguous[1].

They could have used this instead if they wanted simpler, future-proof tooling:

    Fixes: <full hash>
Just like tools like git-revert and git-cherry-pick do.

[1]: https://lore.kernel.org/git/xmqq34j5h7v9.fsf@gitster.g/

[2]: Edit: hard-coded as opposed to Git just figuring out how long the abbreviation should be based on how many objects there are.


> Presumably the problem is that these tools only take the abbreviated hash into account. Not also the subject:

Well the first mentioned script:

> > Tools like linux-next's “Fixes tag checker”,

has `get_full_hash`[1] which uses the subject to search through the abbreviated matches.

Edit: And that check was added two weeks ago by Kees [2].

[1]: https://github.com/kees/kernel-tools/blob/trunk/helpers/chec...

[2]: https://github.com/kees/kernel-tools/commit/5bf6a1e71df59a23...


git itself does not use a fixed size abbreviation but determines the length necessary with the birthday paradox formula. It just happens to be seven characters most of the time, because most repos are small.

This is just for the abbreviated hash which git uses only for specific cases like display in the git log and similar, where it is a user experience improvement and relatively safe.


I know. I don’t understand why the Linux Kernel uses a hard-coded abbreviation instead of just letting Git figure out how long it should be (edited my comment now).


The length of the required abbreviation grows over time, while content you put in commit messages sticks around. Meaning a commit done in year 2005 may have gotten away with "Fixes: abcd" and was unambiguous at that time, whereas after 20 years of growth, there could now be multiple other commits with that prefix.

By that logic, any length of abbreviation will eventually fail if stored for a long time, those abbreviations should be seen as temporary for user interaction only. For the "Fixes:" footer they should have gone with full hash, it's hidden away in the footer so it doesn't disrupt anyone. Aanyone interacting with it will double click to copy it regardless if it is 8 or 40 characters long. Abbreviating it simply adds no value.


Well a fix has to be for a parent commit so future commits and future growth aren't nearly as important.


> with the birthday paradox formula.

At what probability? Can that probability be configured as a global Git option?


So as noted you can configure the length, but you can't actually adjust the probability used by the auto-length formula.

The formula calculates how many bits are needed to store the approximate number of objects, then uses twice that many bits, rounding up.

For example, at 16000 objects git is still using the minimum length of 7 characters. But at 17000 objects it now takes 15 bits to store the object count, so it wants 30 bits of hash, which means 8 characters.



I love when the word "hopefully" is used in technical documentation ))


All technical documentation is wishful thinking, they're just being explicit about it.


Cute. I've seen CPU commit prefix brute-forcers but not GPU ones. On the CPU, you can generate chosen 7-8 digit prefixes pretty easily (within a few minutes, I forget exactly). 12 digits is, of course, a massively bigger (factor of 2^16) space.


Would tooling break if the radix was changed from 16 to something higher? You could double your resistance by just changing to base 32 representation in the same length


So much would break if you changed the character set. And there would be pretty much no benefit. Making it a hundred times harder* is not nearly enough to prevent short collisions.

* The exact amount would depend on repo and method. Using base 32 and changing nothing else would slow collisions by 2^length.


> would slow collisions by 2^length.

It would add a bit to each position, going from the current 48 bits, to 60 bits. Or 4,096 more resistant to _accidental_ collision. You can have as many bits as you want you just make it more time consuming to create one, but if you're entire infrastructure blithely relies on one never _intentionally_ being submitted, you've solved nothing really.

Moreover, if you find a commit with a collision, it's very easy to slightly alter it's contents to alleviate the problem, presuming we're only having to contend with unintentional conflicts.


The length of the short hash automatically adjusts to keep the chance of accidental collisions low. There's no need to lower the risk of accidental collisions.

So changing the character set doesn't notably improve the accidental situation, and it doesn't notably improve the deliberate situation. And it would be a huge pain to do.


In the kernel case the length of the short hash is fixed at 12 characters. This is part of the article.


Okay, sorry, I was talking about the normal short hash rather than this specific tool. And I first looked before the link was changed.

Still, they already figured out a solution. They don't need multiple mechanisms for the same thing.


We could just encode hashes on base64 instead of hex and get several times more entropy with the same size text


base64 only gets you 1.5x more entropy per symbol (measuring in bits)


Oh, right. It's logarithmic. That'll teach me to post before coffee.


In other words, 6 out of 20 bytes match. Good luck colliding the other 112 bits.


The problem is some (broken and wrong but in the wild) tooling relies on substrings like that


Somebody ought to tell them not to do that! [Spoken in the tone of 'You can't park there sir!']


https://people.kernel.org/kees/colliding-with-the-sha-prefix... is the actual link; the lwn article is just a link to it with a one sentence summary and a short quote.


Changed. Thanks!

(Submitted URL was https://lwn.net/Articles/1003797/)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: