sts10 / tidy

Combine and clean word lists

Home Page: https://sts10.github.io/2021/12/09/tidy-0-2-0.html


Suggestion for Attributes: Shannon & Brute Force Thresholds Revisited

bwbug opened this issue · comments

commented

Your treatment of entropy per letter in the attributes (and as explained in your blogs) is interesting, but I think that to ensure a passphrase will not be too short, one must consider the number of words in the passphrase (as Reinhold did when he came up with a 19-letter minimum for a 6-word passphrase). Basically, the character-based entropy must be at least as large as the word-based entropy. This leads to a required lower bound for the minimum average word length in a given passphrase, from which one can determine the minimum number of letters required in a passphrase that has a specified number of words (by multiplying the minimum average word length by the number of words). So my suggestion is that the attributes table should report this lower bound for the required average word length.

If the actual entropy per letter is E (e.g., 2.62 bits for Shannon, 4.70 bits for brute-force), and the number of entries in the word list is W (e.g., 7776 for a diceware list using 5 dice), then it can be shown that the required lower bound for the average word length in a generated passphrase is given by the expression log2(W) / E.

The above expression is for the case of no separators, but if separators are included, the resulting increase in the character-based entropy relative to the word-based entropy will be minimal. The calculation becomes much more complex, so I think it would make most sense to just report the conservative value obtained for the assumption of no separators.
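For concreteness, here is a minimal Rust sketch of that expression (the function name is purely illustrative, not anything that exists in Tidy):

fn min_avg_word_length(list_length: usize, entropy_per_letter: f64) -> f64 {
    // Required lower bound on the average word length in a passphrase: log2(W) / E
    (list_length as f64).log2() / entropy_per_letter
}

fn main() {
    // 7776-word list, brute-force (4.70 bits/letter) and Shannon (2.62 bits/letter) estimates
    println!("{:.2}", min_avg_word_length(7776, 4.70)); // ~2.75
    println!("{:.2}", min_avg_word_length(7776, 2.62)); // ~4.93
}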

One could compare the average word length for the entire word list to the computed minimum required average word length, but to guarantee that none of the generated passphrases have an insufficient number of characters, the shortest word length in the word list would have to be greater than or equal to the lower bound for the average word length.

It may be possible to compute or estimate the overall entropy reduction caused by rejecting passphrases that are too short, but I would have to think more about how this could be done.

Ah, I did this work a few years ago, so this is a bit of a test for me. Very cool to have someone interested in this stuff!

Basically, the character-based entropy must be at least as large as the word-based entropy.

Agree.

I think you're suggesting the attributes table include a value called "minimum average word length". Let's assume 4.7 bits of entropy for a moment (putting Shannon aside for now). I think the meaning of this value would be saying to the user "Hey, whenever you generate a passphrase using this list, make sure the average length of the words in your passphrase is above X". Do I have that right?

If so, I object to that metric for a few reasons. First of all, for the EFF long list, this new "minimum average word length" would be 12.925/4.7 or 2.75. Since the shortest word is 3 characters, there's no possible way to generate a passphrase below the minimum length.

But secondly, I have intended to make Tidy a tool to make word lists that are good and safe with wide room for user error. By this I mean, I do not want to pass caveats to the user. "Ensure your passphrase's average word length is above X" would be one of these caveats. Basically, I want to encourage list-makers to make lists that are as "fool-proof" as possible. (Making it easy to remove prefix or suffix words is another example of this: that way, the passphrase generator need not require separators between words).

My method for preventing these caveats is to take conservative approaches.

Let's look at the metrics that Tidy currently displays. Here they are for the EFF long list:

List length               : 7776 words
Mean word length          : 6.99 characters
Length of shortest word   : 3 characters (aim)
Length of longest word    : 9 characters (zoologist)
Free of prefix words      : true
Free of suffix words      : false
Entropy per word          : 12.925 bits
Efficiency per character  : 1.849 bits
Assumed entropy per char  : 4.308 bits
Above brute force line    : true
Above Shannon line        : false

The key lines for our discussion are the last three. I'll use W as word list length (7,776 in the case of the EFF long list).

"Assumed entropy per char(acter)" is log2(W) divided by length of the shortest word on the list (here's the code). For the EFF list, it's log2(7776)/3 which is 4.308.

I use the shortest word on the list to be conservative. Users could get a passphrase completely made up of words that are of the shortest length.

Once we've calculated "assumed entropy per character", we compare it to two constants to make the metric more human-understandable to users. If "assumed entropy per character" <= 4.7, Tidy reports that the list is "above" the "brute force line". For the EFF list, 4.308 is indeed less than 4.7, so we're safely "above" the brute force line (reported in the attribute table as "Above...": "true"). Likewise, if it's less than or equal to 2.62, the list is "above" the "Shannon line". Since 4.308 is greater than 2.62, the EFF list does not clear this line.
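In code, that check is roughly the following (a simplified sketch, not the exact function Tidy uses):

fn main() {
    let list_length: usize = 7776;
    let shortest_word_length: usize = 3;

    // "Assumed entropy per character" = log2(W) / length of the shortest word
    let assumed_entropy_per_char = (list_length as f64).log2() / shortest_word_length as f64;

    // Compare against the two constants discussed above
    let above_brute_force_line = assumed_entropy_per_char <= 4.7;
    let above_shannon_line = assumed_entropy_per_char <= 2.62;

    println!("{:.3}", assumed_entropy_per_char); // 4.308
    println!("{} {}", above_brute_force_line, above_shannon_line); // true false
}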

In general, I want Tidy to encourage the creation of lists that are as safe as possible, with as few caveats as possible. I hope that makes sense in connection to your comment.

commented

Thank you for your thoughtful response. Your reasoning makes sense, and it's easy enough to compute the "minimum required average word length" just using a calculator, if one is interested in this metric.

There is a benefit to delving into a more in-depth analysis when one wants to optimize the trade-off between passphrase entropy and length (for example, for a 7776-word list to clear the "Shannon line" using your conservative approach, the minimum word length would have to be 5, so a user will have to type a large number of characters to enter their passphrase; the more nuanced analysis would allow shorter words on the list, as long as the words in a generated passphrase have an average length of at least 4.9).

Yep, I gotcha. The "lines" are admittedly strict/unforgiving in the current set-up... maybe too strict to be useful? Haha I also haven't gotten much feedback from tidy users who are not me, so this is a good discussion.

I guess we could do like "Chance that a 6-word passphrase is below the 'Shannon line'", expressed as a percentage? For the EFF list, we'd count the number of words with fewer than 5 characters and go from there?

commented

I guess we could do like "Chance that a 6-word passphrase is below the 'Shannon line'", expressed as a percentage?

Yes, something like this is what I meant when I wrote "It may be possible to compute or estimate the overall entropy reduction caused by rejecting passphrases that are too short", although I was thinking of some kind of metric in terms of entropy (e.g., perhaps an effective entropy per word, taking into account passphrases that would be rejected for being too short).

However, I believe that whether this is expressed as a probability (percentage) or effective entropy, the result will be dependent on the number of words assumed for the passphrase. Thus, it may not be possible to calculate a universal metric for this, unless you are open to using a 6-word passphrase as a representative benchmark.

For the EFF list, we'd count the number of words with fewer than 5 characters and go from there?

I have a feeling that the calculation will not be trivial. My initial thought is that it may involve a 6-dimensional joint probability distribution...

I have a feeling that the calculation will not be trivial. My initial thought is that it may involve a 6-dimensional joint probability distribution...

Ah, maybe you're right. If you want to work out some kind of formula, I might be able to implement it for Tidy in Rust. Maybe we could create a sort of general 0 to 100 "score"?

commented

I'll think about how to do this. But the bottom line is that the "score" (whether expressed as a 0-100 score, or in terms of some entropy metric) will be different depending on the assumed number of words in the passphrase. So unless you are OK with presenting the results in a tabular form (e.g., for word counts in the range 4-8), we would have to see whether a conservative (worst-case) metric can provide meaningful/actionable information, or else, abandon the idea.

Not exactly on-topic, but I did add two formulas and two tables to the bottom of the readme that might be of interest here (and would welcome a double-check of my math).

Maximum word list lengths to clear the Brute Force Line

Formula:

Where S is the length of the shortest word on the list, 26 is the number of letters in the English alphabet, and M is the max list length: 2^(S * log2(26)) = M

(or in Python: max_word_list_length = 2**(S*4.7))

shortest word length    max list length
2                       675
3                       17559
4                       456419
5                       11863283

Maximum word list lengths to clear the Shannon Line

Formula:

Where S is the length of the shortest word on the list and M is the max list length: 2^(S * 2.62) = M

(or in Python: max_word_list_length = 2**(S*2.62))

shortest word length    max list length
2                       37
3                       232
4                       1428
5                       8779
6                       53974

Before creating this table, I hadn't realized just how "strict" the Shannon line is. Makes me wonder if it's so strict as to not be helpful.

commented

would welcome a double-check of my math

For the first table, your max list length values are only accurate to 2-3 significant digits, as a result of rounding errors in taking the logarithm. You can make the max list length values more accurate by taking advantage of the fact that S·log2(26) = log2(26^S) and that for any base B, by definition of the logarithm, we get B^(log_B(X)) = X. As a result, we get M = 26^S, which can be exactly evaluated without rounding errors; thus, your table entries should be 676, 17576, 456976, and 11881376.
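A quick way to double-check those exact values in Rust is to use integer exponentiation, which sidesteps floating-point rounding entirely (just a sketch):

fn main() {
    // Max list length to clear the brute force line is 26^S,
    // where S is the length of the shortest word on the list
    for s in 2u32..=5 {
        println!("{} {}", s, 26u64.pow(s)); // 676, 17576, 456976, 11881376
    }
}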

The Shannon value of 2.62 bits/letter was apparently computed by taking an empirically estimated "entropy per word" of 11.82 bits and dividing it by 4.5 letters/word, which Shannon calls "the average word length in English" without documenting any source for this claim. Now, 11.82/4.5 = 2.6266666..., so rounded to 3 significant digits, it would actually be more appropriate to call it 2.63 than 2.62. However, since the claimed average word length is only given to 2 significant digits, the ratio can also be known only to 2 significant digits — i.e., 2.6 bits/letter. A more detailed uncertainty propagation analysis based on assumed Type B uncertainties of ±0.005 in the 11.82 value and ±0.05 in the 4.5 value indicates that the confidence interval for the ratio corresponds to the range 2.60-2.66 (i.e., 2.63±0.03 bits/letter). Thus, the corresponding values of the maximum list length should not be given to a precision greater than 2 or 3 digits.

Footnote:
I found some oddities in the math of that 1951 Shannon paper. For example, in deriving the value of 11.82 bits/word, he claims that the critical word rank n for which the sum p1 + p2 + p3 + ... + pn =1 (where each term pk=0.1/k) is n=8727. However, by my calculations, the sum of the first 8727 terms of this harmonic series is only 0.9651, and that one must include words up to rank n=12366 in order for the sum to evaluate to unity. Second, my attempts to reproduce the calculation of Equation (7) did not yield a value of 11.82 (instead, the summation was found to equal 9.14 for n=8727 and 9.72 for n=12366). Admittedly, the linked PDF contains some scanning/OCR artefacts, so some of the mathematical expressions may not be correctly represented. I should probably check out the original version of the 1951 article, and also read the 1948 Shannon paper that is cited as Reference 1. But the methods described in the 1951 paper do not appear to be so complex, so I was very surprised that I could not reproduce the published numbers!

For the first table, your max list length values are only accurate to 2-3 significant digits, as a result of rounding errors in taking the logarithm. You can make the max list length values more accurate by taking advantage of the fact that S·log2(26) = log2(26^S) and that for any base B, by definition of the logarithm, we get B^(log_B(X)) = X. As a result, we get M = 26^S, which can be exactly evaluated without rounding errors; thus, your table entries should be 676, 17576, 456976, and 11881376.

AMAZING! I've updated the brute force table in the readme.

Fixing the actual brute force computation in the code itself is a bit trickier, thanks to some rounding tomfoolery in Rust.

2.0_f64.powf(26_f64.log2()) // evaluates to 25.999999999999993, rather than 26
26_f64.log2() // evaluates to 4.700439718141092 ; I think it should be 4.700439718141093

(Code playground, for those interested.)

So, unless you object, I'm just going to hard-code the decimal in the test: assumed_entropy_per_character <= 4.700439718141093

However, since the claimed average word length is only given to 2 significant digits, the ratio can also be known only to 2 significant digits — i.e., 2.6 bits/letter... Thus, the corresponding values of the maximum list length should not be given to a precision greater than 2 or 3 digits.

Fascinating work! Going to change the Shannon test to 2.6 for now. Maybe we should just... do a bit of our own analysis with some more modern word corpus?

I found some oddities in the math of that 1951 Shannon paper... Admittedly, the linked PDF contains some scanning/OCR artefacts, so some of the mathematical expressions may not be correctly represented. I should probably check out the original version of the 1951 article, and also read the 1948 Shannon paper that is cited as Reference 1.

Again, super impressed you went this far here. Hope it's just a scanning issue? Let me know if I can help. The PDF I linked to was just a first search result. I should read more of those papers myself at some point.

commented

Ha! Thank you for the kind words, but I was really just trying to figure out if I could get a little better precision on the "2.62" value. Either there's something screwy with Shannon's numbers (not impossible -- this is what an electronic calculator looked like in 1951!), or I have completely misunderstood the methods used in his paper.

Maybe we should just... do a bit of our own analysis with some more modern word corpus?

I think we may have to either look for a different analysis in the literature, or come up with our own. It seems to me that Shannon's value is derived entirely from data on frequency of words in grammatically correct written text (sentences, paragraphs, etc.). What we need instead is some kind of analysis based on the frequency of letters appearing in those words. Perhaps the act of dividing the "entropy per word" by the average word length does give an appropriate value, but that seems like mathemagic to me, so I would like to understand the reasoning behind the analysis before I trust it. In addition, it is not yet clear to me that Shannon's use of the word entropy is identical to our own use of this term. Thus, I would personally take the Shannon entropy value with a grain of salt until I have better understood the underpinnings of his methods.

Fixing the actual brute force computation in the code itself is a bit trickier, thanks to some rounding tomfoolery in Rust.

Can you explain why the limited precision of Rust's floating point calculation causes a problem in the code? Not sure I understand why there is a need to calculate 2^(log2(26)) in the code, or why an error on the order of 10^-15 will cause issues.

Can you explain why the limited precision of Rust's floating point calculation causes a problem in the code? Not sure I understand why there is a need to calculate 2^(log2(26)) in the code, or why an error on the order of 10^-15 will cause issues.

I was confused as well until just now -- have resolved the rounding issue.

Think the culprit was this function:

pub fn calc_entropy_per_word(list_size: usize) -> f64 {
    (list_size as f64).ln() / (2_f64.ln() as f64)
}

If I change it to

pub fn calc_entropy_per_word(list_size: usize) -> f64 {
    (list_size as f64).log2()
}

the rounding issue I found resolves itself.

Here's a Rust code playground that shows the original issue:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e8234d2494140adf09272282a7c1ee20

commented

I think you can simplify the evaluation of

assumed_entropy_per_character <= brute_force_line_max_entropy_per_character

by first calculating a value G that represents the number of guesses required per letter, given an assumed entropy per letter:

G = 26 (randomly drawn letters from [a-z])
G = 2^2.6 ≈ 6.1 (randomly drawn letters from written English)

If W is the length of the word list, and M is the number of letters in the shortest word, then your boolean test above becomes equivalent to checking whether

W ≤ G^M

For the brute force case, this should require no floating-point arithmetic, and therefore would be expected to yield a more precise result.
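For example, for the brute-force case (G = 26), the check can stay entirely in integer arithmetic; a rough sketch (the function name and overflow handling are just illustrative):

fn above_brute_force_line(list_length: usize, shortest_word_length: u32) -> bool {
    // The list clears the line if W <= G^M, with G = 26 guesses per letter
    match 26u128.checked_pow(shortest_word_length) {
        Some(guesses) => (list_length as u128) <= guesses,
        None => true, // 26^M overflowed u128, so it certainly exceeds W
    }
}

fn main() {
    println!("{}", above_brute_force_line(7776, 3)); // true: 7776 <= 17576
    println!("{}", above_brute_force_line(7776, 2)); // false: 7776 > 676
}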

For the brute force case, this should require no floating-point arithmetic, and therefore would be expected to yield a more precise result.

Very smart. Got it working and haven't gotten any surprises yet.

G = 2^2.6 ≈ 6.1 (randomly drawn letters from written English)

Just want to confirm you like 6.1 here, rather than having more digits. Guessing we're set on limiting Shannon's calculations to two significant digits, but just checking.

NOTE: Using your new method of calculating the Shannon line disrupts the table in the readme, as it's currently calculated. Now, a list with a shortest-word length of 3 can be 226 words and still be above the Shannon line. Does that sound right?

For Shannon line max length:
Using 6.1^M, then rounding down to the nearest integer (can't have fractions of words on the list), I get

shortest word length    max list length
2                       37
3                       226
4                       1384
5                       8445
6                       51520
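For reference, a small sketch that reproduces this table, assuming we treat 6.1 guesses per letter as the Shannon-line constant:

fn main() {
    for m in 2u32..=6 {
        // Max list length to clear the Shannon line: floor(6.1^M)
        let max_len = 6.1f64.powi(m as i32).floor() as u64;
        println!("{} {}", m, max_len); // 37, 226, 1384, 8445, 51520
    }
}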
commented

Yes, 6.1 is limiting the result to two significant digits, because of the corresponding uncertainty in the 2-significant-digit value of 4.5 letters/word that was used in Shannon's calculation. If you want to use 3 significant digits, that wouldn't be terrible, either.

And, yes, because of imprecision due to rounding, you will get slightly different results if you use 2.6 vs. the equivalent of log2(6.1) (≈2.6088). Since Shannon's reported value is not precise, if you want a max list length cut-off to the nearest word, you'll have to decide whether you consider 2.6 or log2(6.1) (or even something else, like 2.63, or 11.82/4.5) to be the canonical representation of Shannon's limit.

you'll have to decide whether you consider 2.6 or log2(6.1) (or even something else, like 2.63, or 11.82/4.5) to be the canonical representation of Shannon's limit.

Glad we finally got to "We're stuck cuz we can't tag Claude Shannon"!

I guess I'll leave the Shannon calculations using 6.1 for now. Version 0.2.66 has this math in it -- we can make that a "release" if we don't find more things to fix in next few hours/days. Thanks again for your insight and encouragement.

Related: I just ordered two information theory textbooks: Elements of Information Theory 2nd Edition by Cover and Thomas; and Information and Coding Theory by Jones and Jones. The former gets better reviews it seems like. Maybe they'll have some clues for us, both on these entropy calculations and on uniquely decodable stuff.

commented

I played around a little with the entropy definition. If a corpus contains N words that appear with equal probability, then the Shannon entropy per word is log2(N), which kind of makes sense. But Shannon's calculation of entropy per letter takes into account only the average word length in the corpus, and completely ignores the relative frequency of individual letters in the corpus — to me, this suggests that Shannon's notion of "entropy per letter" is not suited to characterizing the vulnerability of word lists to letter-based brute-force attacks.

For example, consider the following corpus:

"Come here, Spot," said Jane.
"Bow wow," said Spot.
"I will not come."

It contains 4 words of frequency 2 (probability 1/7), and 6 words of frequency 1 (probability 1/14). From this, the Shannon entropy per word is -4·(1/7)·log2(1/7) - 6·(1/14)·log2(1/14) = 3.236 bits/word, and the average word length in the corpus is 3.571 letters/word. Thus, we would get an entropy per letter of 3.236/3.571 = 0.9062 bits/letter.
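Taking those frequency counts as given, here is a small Rust sketch of the entropy-per-word calculation:

fn main() {
    // (word frequency, number of distinct words with that frequency); 14 word occurrences total
    let counts = [(2u32, 4u32), (1, 6)];
    let total: u32 = counts.iter().map(|(f, n)| f * n).sum();

    // Shannon entropy per word: minus the sum, over distinct words, of p * log2(p)
    let entropy_per_word: f64 = counts
        .iter()
        .map(|&(f, n)| {
            let p = f as f64 / total as f64;
            -(n as f64) * p * p.log2()
        })
        .sum();

    println!("{:.3} bits/word", entropy_per_word); // ~3.236
    println!("{:.4} bits/letter", entropy_per_word / 3.571); // ~0.9062
}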

Note that this calculation makes no use of the knowledge that the corpus contains only 17 distinct letters, and that the relative probabilities of these letters range from 41% (o) to 6% (b,h,j,r). If the corpus was altered so that the dog's name was Fuzz instead of Spot, then the entropy per word and entropy per letter would not be changed, in spite of the resulting changes to the letter frequency distribution. If the corpus was altered only to change the girl's name, then the calculated "entropy per letter" would increase if the name was shorter (e.g., Jan), or decrease if the name was longer (e.g., Janice).

Look forward to hearing about what you learn in your reading.

to me, this suggests that Shannon's notion of "entropy per letter" is not suited to characterizing the vulnerability of word lists to letter-based brute-force attacks.

I think you're asking and answering the right question here.

TBH when I picked that 2.62 number out of that Shannon paper (without reading far beyond that section), I wasn't attuned to all these ideas and Shannon's limitations. My goal was to provide a response to a (hypothetical) question of: "OK, this list is above the brute-force line that assumes all letters are equally likely in random order, but what about the different, slightly more enlightened attack of brute-force-guessing words rather than letters?" (And I think I figured it'd be cool to cite a 70-year-old paper from the father of information theory.)

I've now got enough of a handle on this whole entropy thing that it makes sense to me that the goals and methods Shannon used are distinct from what I was/am looking for when setting up this second, "tougher" line calculation. (Not to mention changes in the English language over the last 70 years.)

Thus, I'm open to removing the Shannon line concept from Tidy entirely, at least for the time being.

commented

I think that the concept of an attacker requiring (on average) fewer than 26 guesses per character if they know your passphrase is in English is a very important one. The value of 2.62 from Shannon's 1951 paper may not be directly relevant, but I would not be surprised if it is close to the "correct" value.

So, you may elect to remove the references to Shannon's 2.62 value if you don't feel comfortable that this value is rigorously correct, but I think it is important to not abandon this line of analysis completely.

Until we can find a better number to use, perhaps one of the reported attributes can be the number of required guesses per letter (in a character-level brute force attack) that would make the character-level attack break even with the word-level attack?

It will probably be a formula like G^M = W, then solve for G:

G = W^(1/M)

But it's late, and I should re-check this math, and also determine if the result represents an upper bound or lower bound.

commented

So my math from last night seems to check out. Thus, if you (temporarily?) decide to 86 the "Shannon Line", I would recommend that you report the following value in your attributes table instead:

G = W^(1/M)

where W is the word list size, and M the length of the shortest word on the list.

The best descriptor I can come up with for G right now is the minimum guesses per character. Thus, if an attacker must generate more than G guesses for each character in the passphrase in order to crack it using a character-level attack, then the passphrase could be considered safe from this type of attack (because it would be more efficient to do a word-based brute force attack). Conversely, if the number of guesses required to generate each character in the passphrase is lower than G, then the passphrase is susceptible to a character-level attack (at least if the passphrase contains any words of length M).

Some examples for W=7776:

  • For M=2, we get G=88.2. If we know that the attacker is not going to need to try more than 26 characters before they find the correct one, then this combination of word list size and minimum word length is unsafe (because 26 < 88.2).

  • For M=3, we get G=19.8. If our threat model includes only brute force attacks that randomly select from every character in the English alphabet, with equal probability, then the number of guesses required for each character would be 26. Thus, for this threat model, the word list is safe (because 26 > 19.8). However, not every letter in the alphabet is equally likely to occur in a passphrase, so a clever attacker could make use of letter frequency statistics to greatly reduce the number of guesses attempted for each character (for example, in the EFF long word list, half of the words use a subset of only 4-6 characters at each position in the word); in the face of this type of attack, the word list is not safe if the number of guesses required per character falls below 19.8.
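Those two G values check out with a quick computation (the helper name here is hypothetical, just to illustrate the formula):

fn min_guesses_per_char(list_length: f64, shortest_word_length: f64) -> f64 {
    // G = W^(1/M)
    list_length.powf(1.0 / shortest_word_length)
}

fn main() {
    println!("{:.1}", min_guesses_per_char(7776.0, 2.0)); // ~88.2
    println!("{:.1}", min_guesses_per_char(7776.0, 3.0)); // ~19.8
}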

The proposed measure is less intuitive than the "Shannon line", but if you remove the Shannon entropy analysis from tidy -A, then I think it is best to provide some guidance for those concerned about attacks that don't make use of the full alphabet.

I think I follow you, but I want to learn/know more about G, with an eye toward hopefully (a) improving its name or (b) mathematically converting into something more intuitive without losing any of its descriptive power or accuracy.

Two questions for now:

  1. Is G more accurately described as assumed guesses per character?
  2. Is G related to a value I label as "assumed entropy per character"? If so how?

Notes on assumed entropy per character

Tidy currently calculates a variable named assumed_entropy_per_character to measure a list against the brute force line.

assumed_entropy_per_character = log2(list_length) / shortest_word_length
or
assumed_entropy_per_character = log2(W) / M

As an example, for the EFF long list (7,776 words, shortest word is 3 characters), "assumed entropy per character" comes out to 4.308 bits. This is the number Tidy formerly compared against 4.7001 to see if the list is above or below the brute force line.

Thanks to your help, Tidy no longer uses this method, due to rounding inconsistencies, but in principle, "assumed entropy per character" still is tightly associated with the brute force line, hence my wondering if it's useful in our discussion of G and creation of a replacement for the Shannon Line.


A separate, wild thought

In the assumptions behind our construction of the brute force line, are we wrong to assume a brute-force attack that must go through all 26 characters? More generally: Won't, on average, an attacker guess correctly about halfway through the given space? As a contrived example, if I tell you I'm thinking of a randomly chosen (lowercase) letter from a to z, won't you, on average, guess it on or around your 13th guess? If so, does this mean we need to re-think the brute force line?

commented

Is G related to a value I label as "assumed entropy per character"? If so how?

Yes:

assumed_entropy_per_character = log2(G)

G = 2^assumed_entropy_per_character

Is G more accurately described as assumed guesses per character?

Not sure if I would say this is any more or less "accurate", but if you prefer this descriptor because it captures the relationship to the assumed_entropy_per_character quantity, then I don't think I would object to that.

The bottom line is that because we (currently) do not have a reliable entropy threshold (for comparing against assumed_entropy_per_character) for the case in which an attacker is leveraging prior information about letter frequency, I wanted to transform the somewhat abstract entropy measure into something with a more concrete interpretation (number of guesses). Basically, if we have to hand-wave some approximation of the "Shannon entropy" for now, let's hand-wave something more concrete (number of guesses) rather than something more abstract (entropy).

More generally: Won't, on average, an attacker guess correctly about halfway through the given space?

I had the same thought when I was typing up my post about G this morning. There's some validity to this concern, but at the same time there are pitfalls that one has to be careful about avoiding. For example, if we assume a lower case letter can be guessed (on average) in 13 tries, it is not true that we can guess a three-letter word (on average) in 13^3 = 2197 tries, since this will only cover one-eighth of the search space. Instead, the average number of tries for a 3-letter word is 26^3/2 = 8788. I think this amounts to basically subtracting 1 bit of entropy, for the case of uniform frequency distributions (every letter equally probable).
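One way to see the "subtracting 1 bit" point: the expected position of the target in a shuffled keyspace of size N is (N + 1)/2, and log2((N + 1)/2) is almost exactly log2(N) - 1. A tiny sketch:

fn main() {
    let n = 26u64.pow(3); // keyspace of 3-letter strings over [a-z]
    let avg_guesses = (n + 1) as f64 / 2.0;
    println!("{}", avg_guesses); // 8788.5, roughly the 26^3 / 2 figure above
    println!("{:.3} vs {:.3}", (n as f64).log2(), avg_guesses.log2()); // 14.101 vs 13.101
}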

Unfortunately, the larger problem we are tackling here involves decidedly non-uniform frequency distributions, so it gets more complicated. A lot of it comes down to how you define your variables. For example, does G represent an average number of guesses, or a maximum or minimum? Only after you define the variables can you derive equations to evaluate those variables.

I think it's best to defer further ruminations on this until after we have consulted some of the literature. For example, Shannon's entropy calculation -Σ p_i·log2(p_i) could be interpreted as some sort of weighted average of the entropies associated with each term i. So it would be worthwhile to check what the premises are for Shannon's definition of entropy.

commented

Haven't delved any further into the primary literature yet, but I found a nice synopsis of Shannon entropy in a post by Abhimanyu Dubey on Quora. Some excerpts:

the entropy of an event is the average...value of information conveyed in that event

In this context, "event" refers to

any discrete probabilistic event with a fixed number of possible outcomes, such as a coin toss or a roll of a die

...or the random selection of a word from a larger corpus.

Also, in the above definition of entropy, "information" has a specific interpretation:

Note that we are representing information here by the amount of space it takes to store the information that an event has occured. This is only one representation, alternate representations form the larger family of Rényi entropies

So the Shannon entropy for a language corpus seems to be the average amount of space (in bits) required to store the information that a specific word from the corpus came out in a random draw. So for a corpus with 2^n unique words that appear with equal probability, we'll need n bits to store information about which word was picked (e.g., for 8 words, we can represent each word using an integer in the range 0-7, which requires 3 bits — 000 through 111). What I'm less clear on right now, is how to interpret all of this when there is a non-uniform frequency distribution in the corpus.

What I'm less clear on right now, is how to interpret all of this when there is a non-uniform frequency distribution in the corpus.

I've seen some references to entropy definitions other than Shannon's, e.g. Tsallis Entropy and Renyi Entropy. While I have no idea if either of those two concepts are helpful to our problem(s), I'm encouraged that there are other conceptions of entropy out there. Wonder if something fits our needs just right.

commented

My philosophy is to think from the perspective of the adversary. I think the most likely approach that would be used to reduce the number of guesses required per character is a Markov chain. Thus, we need an analysis of the computational complexity or the performance of such algorithms. Or even an entropy...

Here's a quick Google Scholar search that yields 188 hits:

https://scholar.google.com/scholar?q=markov+chain+entropy+%22passphrase%22

There's even more published work if you change the keyword "passphrase" to "password".

commented

The more I think about these issues, the more confused I make myself. To make things more clear, I think it is best if we, in turn, consider three different assumptions regarding the attacker:

  1. The attacker knows our word list.
  2. The attacker does not know our word list, but has some prior knowledge about the exact or approximate letter frequency statistics (e.g., the list comprises English words only).
  3. The attacker has no prior knowledge about our word list, other than the set of allowed characters (e.g., [a-z]).

In addition, we may occasionally choose to assume that the attacker knows the number of words in our passphrase, and whether or not a separator character is used (although it should not be too difficult to relax these two assumptions when needed).

I am starting to believe that under Assumption 1, a character-level brute force attack will never be more efficient than a word-level brute force attack, if one wants a 100% guarantee of success. I'll have to think more about whether or not this is also true if the attacker is satisfied with a lower probability of success.

To help me think about some of these issues, I created a toy word list, which contains 41 words all made from 4 letters of the alphabet (a, e, s, t):

aest41.txt

Under Assumption 2, consider an attacker who knows that the list's words can only contain 4 possible letters. Let's assume that they also know the number of words, and the fact that no separator is used (the word list is uniquely decodable).

If only a single word is used in the passphrase, then naïvely, a character-level brute-force attack seems like it would be worthwhile if the number of characters in the passphrase (n) satisfies

4^n < 41

This requires n<3, and because there is only a single word in the phrase, n is also equal to the word length. Thus, we conclude that "unlucky" passphrases are those that consist of a 2-letter word. Based on our list of 41 words, only a single passphrase (2%) is vulnerable:

aa (a type of lava)

If two words are used in the passphrase, then the potentially vulnerable passphrases would be those consisting of n ≤ 5 letters (because 4^5 < 41^2 < 4^6), of which there are 9 (0.5%):

aaaa
aasat
aasea
aasee
aatea
sataa
seaaa
teaaa

Under Assumption 3, the number of randomly generated passphrases to check would be 26^5 = 11,881,376 for a 5-letter passphrase (or 12,356,630 if considering all passphrase lengths in the range 1-5); clearly, this approach confers no advantage over simply guessing which of the 41^2 = 1681 possible 2-word passphrases is correct.

Under Assumption 2, if the attacker knows only that all letters except four (a, e, s, t) have a zero probability of appearing in the passphrase, then the number of generated passphrases would be 4^5 = 1024 (or 1364, if also testing shorter passphrases). This seems* advantageous compared to the keyspace size of 1681 2-word phrases.
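For reference, a quick enumeration of the keyspace sizes mentioned above (assuming the 41-word, four-letter-alphabet toy list):

fn main() {
    // Assumption 3: all 26 letters possible, passphrase lengths 1 through 5
    let full_alphabet: u64 = (1u32..=5).map(|k| 26u64.pow(k)).sum();
    println!("{}", full_alphabet); // 12356630 (26^5 alone is 11881376)

    // Assumption 2: only the letters a, e, s, t are possible
    let four_letters: u64 = (1u32..=5).map(|k| 4u64.pow(k)).sum();
    println!("{}", four_letters); // 1364 (4^5 alone is 1024)

    // Word-level keyspace for two words drawn from the 41-word toy list
    println!("{}", 41u64.pow(2)); // 1681
}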

The question (or at least one question) is how much more efficient can the attack get by leveraging Markov chain statistics (probabilities about what letters follow previous letters or letter sequences)?

*A second important question is whether we should compare the number of required brute-force guesses generated at the character-level to the full set of possible passphrases (1681), or only to the size of the vulnerable subset of the keyspace (e.g., the 9 two-word passphrases with no more than 5 characters).