Incorrect hashes for all hash algorithms beyond 4 GiB of input
nh2 opened this issue · comments
I found a problem together with @chpatrick:
Imports:
:set -XTypeApplications
import qualified Crypto.Hash as Hash
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
> let hash = show . Hash.hash @BS.ByteString @Hash.SHA1
> let a = BS8.replicate 5000000000 'a' <> BS8.pack "hello"
> let b = BS8.replicate 5000000000 'a' <> BS8.pack "world"
> a == b
False
> hash a == hash b
True
Two different ByteStrings hash to the same SHA1.
Trying SHA256, same issue:
> let hash = show . Hash.hash @BS.ByteString @Hash.SHA256
> hash a == hash b
True
Conclusion

`fromIntegral` is bad. Until we deprecate `fromIntegral`, Haskell code will always be subtly wrong and never be secure.
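A minimal, base-only demonstration of the trap: `fromIntegral` into a fixed-width type silently reduces modulo 2^32, with no warning or error.

```haskell
import Data.Word (Word32)

-- fromIntegral into Word32 silently wraps modulo 2^32:
-- 2^32 + 5 "becomes" 5, with no indication anything was lost.
overflowDemo :: Word32
overflowDemo = fromIntegral (2 ^ (32 :: Int) + 5 :: Integer)

main :: IO ()
main = print overflowDemo  -- prints 5
```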
I proposed this 3 years ago (easier-to-read threaded view here).
Concretely:
- If we don't fix this, people will shy away from using Haskell for serious work. I found this bug in production, and you can imagine how happy I am about that. Rust and C both do this better.
- If the authors of key crypto libraries fall for these traps (no blame on them), who can get it right? We should remove the traps.
Details
Demonstrating that the overflow occurs at the `uint32` boundary:
Good:
> show $ Hash.hash @BS.ByteString @Hash.SHA1 (BS8.replicate (2^32 - 2) 'a' <> BS8.pack "b")
"78c1411ef0ae4e0328660b7dddc08c06000eb9f0"
> show $ Hash.hash @BS.ByteString @Hash.SHA1 (BS8.replicate (2^32 - 2) 'a' <> BS8.pack "c")
"21f556aba357dd19d7bf01a2a7f05482162ed3fc"
Bad:
> show $ Hash.hash @BS.ByteString @Hash.SHA1 (BS8.replicate (2^32 - 1) 'a' <> BS8.pack "b")
"da39a3ee5e6b4b0d3255bfef95601890afd80709"
> show $ Hash.hash @BS.ByteString @Hash.SHA1 (BS8.replicate (2^32 - 1) 'a' <> BS8.pack "c")
"da39a3ee5e6b4b0d3255bfef95601890afd80709"
Location of the problem:

For SHA1 (note `uint32_t len`):
cryptonite/cbits/cryptonite_sha1.c, line 159 in 10dc63c

For all hash algorithms (note `Word32` for the length parameter):
cryptonite/Crypto/Hash/Types.hs, lines 59 to 60 in 10dc63c

This is where the silent overflow occurs due to `fromIntegral`:
Line 91 in 10dc63c
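To connect this to the digests above: `da39a3ee5e6b4b0d3255bfef95601890afd80709` is the well-known SHA1 of the empty string. The "bad" inputs are exactly 2^32 bytes long, and that length truncates to 0 through `fromIntegral`, so the C code hashes an empty message. A sketch of the truncation (assuming a 64-bit `Int`):

```haskell
import Data.Word (Word32)

main :: IO ()
main = do
  -- A 2^32-byte length becomes 0: the C side sees an empty message.
  print (fromIntegral (4294967296 :: Int) :: Word32)
  -- The 5 GB example: only 5000000005 mod 2^32 = 705032709 bytes are "seen".
  print (fromIntegral (5000000005 :: Int) :: Word32)
```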
This analysis is out of my league of expertise, but if I follow correctly, we are in jeopardy of weakened blockchain security if these changes are not made soon. However, what is the likelihood of a blockchain with a block size of 4 GB?
The whole `Num` class is a cesspit of evil; `fromIntegral` is just one of the paper cuts, and there are hundreds of other issues in there (silent overflow, silent underflow, questionable functions for certain types, ...).

Cryptonite is sadly pre-foundation, and some of these issues would be addressed by moving to basement/foundation, where some of the `Num` evilness is addressed (or at least attempted; there are limitations inherent in the language). For example: through `TryFrom`, `From`, `IntegralUpsize`, `IntegralDownsize` ([1] [2]: you cannot convert to a "smaller" type without going through a Maybe type), better signed vs. unsigned separation [3], and moving other operations to their own classes (Add, Sub, ...).
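For readers stuck on base: since base 4.8, `Data.Bits.toIntegralSized` offers a similar checked narrowing to the basement approach described above, returning `Nothing` instead of wrapping. A small sketch (`safeLen32` is just an illustrative name):

```haskell
import Data.Bits (toIntegralSized)
import Data.Word (Word32)

-- Checked narrowing: Nothing when the value does not fit the target type.
safeLen32 :: Int -> Maybe Word32
safeLen32 = toIntegralSized

main :: IO ()
main = do
  print (safeLen32 4096)        -- Just 4096
  print (safeLen32 5000000005)  -- Nothing (with a 64-bit Int)
```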
About removing the trap from the language: good luck, but I have stopped holding my breath about Haskell correctness and its ability to run safely in production (laziness traps, memory leaks, partial functions). Instead my recommendation is to use Rust for serious production work: the numeric hierarchy is sensible, overflow/underflow are detected, `TryFrom` and `From` are available by default, conversions can be linted off (e.g. preventing `as` casts), and I don't need a magic ball to predict performance and/or the runtime profile.
Related to this function itself:

- You really shouldn't be using the one-shot interface for such big contiguous data; it is practical, but the usual way is to use the streaming interface to make sure you don't pin more than 4 GB of contiguous RAM. It's also more practical to use the streaming interface than a 3-minute C call, depending on the case (runtime lock, interruptibility, etc.).
- Fixing it:
  1. Make the function return something optional (`Maybe Context` or `Either ContextError Context`): this breaks the interface for the 99% of people who use this function with small data, where this will never happen; it also has a speed impact, and suddenly the correct API looks "complicated" to people.
  2. Make the function partial: people will complain that their production code is now exploding.
  3. Throw an exception: same as 2.
  4. Add some documentation.
  5. Auto chunking? That costs speed; should we expose the chunking parameter?
  6. Move the hash interface to `uint64_t`. This would need some creative thinking for the old hash algorithms, which are limited to 32-bit lengths; at the moment we paper over this limitation in a non-correct way.

Probably 5 is the only sensible way, considering that we don't break APIs (unless required).
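A sketch of the guard that option 1 would need, independent of which API shape wraps it. `checkedLen32` is a hypothetical helper, not part of cryptonite; the `Maybe`-returning `hash` (or the erroring one from options 2 and 3) would call it before handing the length to the C layer:

```haskell
import Data.Word (Word32)

-- Hypothetical guard for option 1: refuse buffer lengths that would
-- silently wrap when handed to a C function taking a uint32_t length.
checkedLen32 :: Int -> Maybe Word32
checkedLen32 n
  | n >= 0 && toInteger n <= toInteger (maxBound :: Word32) = Just (fromIntegral n)
  | otherwise = Nothing

main :: IO ()
main = do
  print (checkedLen32 4096)        -- Just 4096
  print (checkedLen32 5000000005)  -- Nothing: would wrap in uint32_t
```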
[1] https://hackage.haskell.org/package/basement-0.0.11/docs/Basement-From.html
[2] https://hackage.haskell.org/package/basement-0.0.11/docs/Basement-IntegralConv.html
[3] https://hackage.haskell.org/package/basement-0.0.11/docs/Basement-Numerical-Number.html
Why not make all sizes of type `size_t`?
IMHO, (vi) above is the closest to being right.
Thanks for the quick merging of our fix in #331!
Note though, as stated in there, I think `hashFinalizePrefix` could have the same issue if you give it a huge buffer, and this doesn't fix it.
Maybe @ocheron (author of that recently added functionality) could chime in whether a similar change is needed there?
@mouse07410 it would be in C, but the boundary between C and Haskell is badly defined:

- `CSize` in Haskell is defined as a `Word64` (on 64-bit arch) and `Word32` (on 32-bit arch): http://hackage.haskell.org/package/base-4.14.0.0/docs/Foreign-C-Types.html
- The bytestring interface returns an `Int` for the size, which was (mistakenly) copied to the bytearray interface for compatibility.
- bytestring uses `fromIntegral` to convert from `CSize` to `Int`. Probably incorrect in edge cases (like a, currently impossible, > 2^63 data size, or a 30-bit `Int`).
- bytestring allocation used to be done by a C call to `malloc` (which takes a `size_t`), but the size was converted from/to `Int`.
- bytestring memory allocation is now done by the Haskell RTS, which works in terms of `Int#` (the underlying type behind `Int`).
- Haskell `Int` has very loose guarantees: it has to be at least a 30-bit signed integer. It doesn't have any guaranteed relation to `size_t`, although in practice on GHC it does fit (by luck?). Also `size_t` is unsigned, whereas `Int#` is signed.

It's quite likely that Haskell's `Word` would be closer to `size_t`, but that's also not guaranteed.
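To make those loose guarantees concrete, here is a small check of the actual bounds. The exact values are platform-dependent; the comments assume a 64-bit GHC, and the Word/Int width match is GHC practice rather than a language guarantee:

```haskell
import Data.Word (Word, Word32)

main :: IO ()
main = do
  -- On 64-bit GHC, Int is 64-bit: far wider than the 30-bit minimum
  -- the Haskell report requires.
  print (maxBound :: Int)
  -- In GHC, Word and Int have the same width, so maxBound Word is
  -- exactly 2 * maxBound Int + 1.
  print (toInteger (maxBound :: Word) == 2 * toInteger (maxBound :: Int) + 1)
  -- Meanwhile the C hash layer's uint32_t length tops out at:
  print (toInteger (maxBound :: Word32))  -- 4294967295
```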
> You really shouldn't be using the one-shot interface for such big contiguous data; it is practical, but the usual way is to use the streaming interface to make sure you don't pin more than 4 GB of contiguous RAM.
@vincenthz I'm not sure that pinning would make a difference for my data; I need to have that ByteString in memory anyway.
> a 3-minute C call, depending on the case (runtime lock, interruptibility, etc.)
Is there any runtime lock? Hashing is a `safe` foreign import.

Regarding interruptibility, that's right, it cannot be interrupted. What do you suggest for streaming a strict `ByteString`, `cryptonite-conduit` like this?
```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Conduit
import qualified Crypto.Hash as Hash
import Crypto.Hash.Conduit (sinkHash)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Conduit.List (sourceList)

-- | Chunk a `BS.ByteString` into a list of smaller `BS.ByteString`s of no more
-- than /n/ elements.
--
-- Naturally /O(n)/ to create the full output list, but
-- /O(1)/ concerning the input ByteString length, because it uses `BS.splitAt`,
-- which provides a view into the original ByteString (but keeps it alive).
chunkedBy :: Int -> BS.ByteString -> [BS.ByteString]
chunkedBy n bs = if BS.length bs == 0
  then []
  else case BS.splitAt n bs of
    (as, zs) -> as : chunkedBy n zs
{-# INLINEABLE chunkedBy #-}

main :: IO ()
main = do
  let a = BS8.replicate 5000000000 'a' <> BS8.pack "hello" -- 5 GB ByteString
  digest :: Hash.Digest Hash.SHA1 <-
    runConduit $
      sourceList (chunkedBy (4096 * 1024) a)
        .| sinkHash
  print digest
```
I took `chunkedBy` from https://hackage.haskell.org/package/hw-prim-0.6.3.0/docs/src/HaskellWorks.Data.ByteString.html#chunkedBy; is there a more accessible package that provides it conveniently?
`hashFinalizePrefix` of course has the same issue, but doesn't need a similar change IMO.
I hope no one uses that with a large buffer.
Whereas changing the mutable hash interface is highly welcome. #331 only covered 50% of the hash API.
> I hope no one uses that with a large buffer.
Then I think it should be changed to at least `error` out, because hope is not something we should build software on, especially not security software.
> #331 did only 50% of the hash API.
Then this ticket should be re-opened (I cannot do that). Which parts are missing?