1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of clear text passwords.

Objectives:

Train a generative model.
Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.

Disclaimer: for research purposes only.

In the press

Get the data

Download any Torrent client.
Here are some magnet links you can find on Reddit:
- magnet:?xt=urn:btih:09250e1953e5a7fefeaa6206e81d02e53b5b374a&dn=BreachCompilation.tar.bz2
- magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337

Deep Learning

Stay tuned!

Map the password list for each email

Generate the JSON files containing emails <-> list of passwords. Output folder is ~/BreachCompilationAnalysis.

python3 read.py --breach_compilation_folder ~/BreachCompilation

Make sure you have enough free memory (8GB should be enough).
It took 1h30m to run on a Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
Uncompressed output is 13G.

Output is of the form:

> less ReducePasswordsOnSimilarEmailsCallback-z-b.json # emails starting with zb.
{
    "zb-email1@yahoo.com": [
        "pass1",
        "pass2"
    ],
    "zb-email2@yahoo.com": [
        "pass1",
        "pass2",
        "pass3"
    ],
    [...]
}

About

Deep Learning model to analyze a large corpus of clear text passwords.

Languages

Language:Python 100.0%