rushmorem / publicsuffix

An implementation of Mozilla's Public Suffix List in Rust

Improving performance of your publicsuffix library?

d33tah opened this issue · comments

Hi!

I'd really love to use your publicsuffix library, but it's currently very slow (even compared to a Python implementation, 74s vs 5s per 1M domains). Any chance there are any easy fixes that could improve that?

Take a look at the Docker image I created and those few lines from results.txt:

https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831#file-results-txt-L36

Cheers,
d33tah

Hi,

I've never actually tried to optimise the performance of this library but this is something I plan to do when I get more free time.

Having said that, I also noticed a couple of issues with the benchmark code itself. For example:-

cargo run --release public_suffix_list.dat

You are including the overhead of cargo run. You are not supposed to run Rust programs like that in production. You should just run the binary directly:

./target/release/publicsuffixtest public_suffix_list.dat

Also, println! is slow; it takes the stdout lock on every call.

Those two are what jump out at me when looking at that code. We would need to profile to know exactly why it's that slow.
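The println! cost is real: each call takes the stdout lock, and stdout is line-buffered, so every printed line becomes its own write. A minimal sketch of the usual fix, locking once and buffering writes (the write_lines helper is illustrative, not from the benchmark code):

```rust
use std::io::{self, BufWriter, Write};

// Lock stdout once and buffer writes, instead of calling println!
// per line (which re-locks stdout and flushes line by line).
fn write_lines<W: Write>(out: W, lines: &[&str]) -> io::Result<()> {
    let mut out = BufWriter::new(out);
    for line in lines {
        out.write_all(line.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush() // a single flush at the end
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    write_lines(stdout.lock(), &["example.com", "example.org"])
}
```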

Lastly, I took a quick look at the documentation of that Python module and it says:-

Please note that the host part of an URL can contain strings that are not plain DNS domain names (IP addresses, Punycode-encoded names, name in combination with a port number or an username, etc.). It is up to the caller to ensure only domain names are passed to the get_public_suffix() method.

So parse_domain and get_public_suffix aren't exactly identical. parse_domain does more work to ensure that the input is actually a domain name as per DNS standards.

@rushmorem I tried to address your comments in the revision no. 2:

https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831/revisions

Is there a way to disable the checks in your library in order to get a fairer comparison?

Well, from their README it sounds to me like the Python code is only doing this part:-

Domain::find_possible_matches(&domain, list)
            .and_then(|res| Domain::find_match(input, &domain, res))

However, both find_possible_matches and find_match are private, so if you want to try it you could clone this repo and add a public method that does only that. Something like:-

impl List {
    pub fn public_suffix(&self, domain: &str) -> Result<Domain> {
        Domain::find_possible_matches(domain, self)
            .and_then(|res| Domain::find_match(domain, domain, res))
    }
}

But even that part of the code still does a few allocations so there is definitely room for improvement.

I don't quite understand the jump from 74.80 to 103.39. Those changes you made have nothing to do with this library, yet we still see such a slowdown. Don't get me wrong though, this library itself needs quite a bit of performance optimisations. For starters, I collect into a Vec in a number of places, which introduces allocations. I've always wanted to get rid of those but I haven't had time to do it yet.

Ah, I see. No problem. Thanks for the explanation.

Consider revision 3 - run on my PC like rev 1, as opposed to rev 2, which was on my laptop:

https://gist.github.com/d33tah/25db3f7e00970b9bd70cf0529fe77831/revisions

Looks like I shaved some time off by using my fork that has some functionality removed:

master...d33tah:master

But still the thing is 10x as slow. I looked at your code and while I'm not good at figuring out allocation schemes in Rust, I think that it's actually about data structures. You seem to be using a flat list while Python library uses a nested dictionary, split by dot, like this:

[14:52:12]>>> import pprint, publicsuffix
[14:52:14]>>> psl  = publicsuffix.PublicSuffixList()
[14:53:54]>>> pprint.pprint(psl.root[1]['uk'])
(0,
 {'ac': 0,
  'co': (0, {'blogspot': 0}),
  'gov': (0, {'service': 0}),
  'ltd': 0,
  'me': 0,
  'net': 0,
  'nhs': 0,
  'org': 0,
  'plc': 0,
  'police': 0,
  'sch': (0, {'*': 0})})

I looked at your code and while I'm not good at figuring out allocation schemes in Rust, I think that it's actually about data structures.

I'm pretty sure it's both, among other things.

You seem to be using a flat list while Python library uses a nested dictionary, split by dot

Yes, but it's not totally flat. The rules are stored as HashMap<String, Vec<Suffix>> with the first label being the key. Printing the value for "uk"

println!("{:#?}", list.rules.get("uk").unwrap());

gives us

[
    Suffix {
        rule: "uk",
        typ: Icann
    },
    Suffix {
        rule: "ac.uk",
        typ: Icann
    },
    Suffix {
        rule: "co.uk",
        typ: Icann
    },
    Suffix {
        rule: "gov.uk",
        typ: Icann
    },
    Suffix {
        rule: "ltd.uk",
        typ: Icann
    },
    Suffix {
        rule: "me.uk",
        typ: Icann
    },
    Suffix {
        rule: "net.uk",
        typ: Icann
    },
    Suffix {
        rule: "nhs.uk",
        typ: Icann
    },
    Suffix {
        rule: "org.uk",
        typ: Icann
    },
    Suffix {
        rule: "plc.uk",
        typ: Icann
    },
    Suffix {
        rule: "police.uk",
        typ: Icann
    },
    Suffix {
        rule: "*.sch.uk",
        typ: Icann
    },
    Suffix {
        rule: "service.gov.uk",
        typ: Private
    },
    Suffix {
        rule: "homeoffice.gov.uk",
        typ: Private
    },
    Suffix {
        rule: "blogspot.co.uk",
        typ: Private
    },
    Suffix {
        rule: "glug.org.uk",
        typ: Private
    },
    Suffix {
        rule: "lug.org.uk",
        typ: Private
    },
    Suffix {
        rule: "lugs.org.uk",
        typ: Private
    },
    Suffix {
        rule: "barsy.co.uk",
        typ: Private
    },
    Suffix {
        rule: "barsyonline.co.uk",
        typ: Private
    },
    Suffix {
        rule: "barsy.uk",
        typ: Private
    },
    Suffix {
        rule: "nh-serv.co.uk",
        typ: Private
    },
    Suffix {
        rule: "no-ip.co.uk",
        typ: Private
    },
    Suffix {
        rule: "wellbeingzone.co.uk",
        typ: Private
    },
    Suffix {
        rule: "gwiddle.co.uk",
        typ: Private
    }
]

So for foobar.gwiddle.co.uk it will scan through all the other uk extensions before it gets to gwiddle.co.uk. Using a faster implementation like IndexMap instead of HashMap should also help.
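One way to see the difference: the nested-dictionary layout is effectively a label trie, so lookup cost is bounded by the depth of the match rather than by how many rules share a TLD. A rough std-only sketch of that idea (it ignores wildcard and exception rules, and is not the library's actual data structure):

```rust
use std::collections::HashMap;

// A label trie: each node maps one label to its child node.
#[derive(Default)]
struct Node {
    is_suffix: bool,
    children: HashMap<String, Node>,
}

impl Node {
    // Insert a rule like "co.uk", one label per trie level.
    fn insert(&mut self, rule: &str) {
        let mut node = self;
        for label in rule.rsplit('.') {
            node = node.children.entry(label.to_owned()).or_default();
        }
        node.is_suffix = true;
    }

    // Length (in labels) of the longest matching suffix, walking
    // labels right to left; O(match depth), not O(rules per TLD).
    fn longest_suffix(&self, domain: &str) -> usize {
        let mut node = self;
        let mut best = 0;
        for (depth, label) in domain.rsplit('.').enumerate() {
            match node.children.get(label) {
                Some(child) => {
                    node = child;
                    if node.is_suffix {
                        best = depth + 1;
                    }
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut root = Node::default();
    for rule in ["uk", "co.uk", "gov.uk", "service.gov.uk"] {
        root.insert(rule);
    }
    // "foobar.gwiddle.co.uk" matches "co.uk" (2 labels) without
    // scanning any of the sibling .uk rules.
    assert_eq!(root.longest_suffix("foobar.gwiddle.co.uk"), 2);
    assert_eq!(root.longest_suffix("service.gov.uk"), 3);
}
```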

@rushmorem .com has a pretty long list, I can imagine that backfiring. It's still surprisingly slow though.

@rushmorem btw I might have commercial interest in those optimizations. Contact me over GMail if having this MR sponsored is an option.

I ran callgrind on 1K domains, got this - any idea how to explain do_lookup_x calls?

[callgrind screenshot]

@d33tah Kindly clone https://github.com/rushmorem/psl and run the benchmarks against master and let me know the results. I haven't had a chance to set up benchmarks against the libraries you mentioned yet.

Heads up:-

  1. The API has changed. Please refer to the README of the linked repo for instructions.
  2. psl-compiled can take a while to compile (about 20 minutes on my laptop), as it compiles the list into native Rust code.
  3. I haven't pushed the new crates to crates.io yet so please compile against master.

If you only care about particular TLDs, for example .com, .net and .org, you can pass them in as an environment variable called PSL_TLDS before running cargo build like so:-

PSL_TLDS="com, net, org" cargo build --release

This will not only boost your build times, it will also filter out all the other TLDs, including only the ones you care about in your binary, thereby making your binary smaller too. If you only care about one TLD, you can use PSL_TLD, which is an alias for PSL_TLDS.

You can also pass in your own list or lists using the environment variables PSL_URL or PSL_PATH. As with their TLD counterparts, they have aliases for their plural versions.

Lastly, you can derive your own list using the psl-codegen crate. For example:-

extern crate psl;
#[macro_use]
extern crate psl_codegen;

// import the trait
use psl::Psl;

#[derive(Psl)]
// You can also use `url` instead of `path`. If you don't include this attribute,
// it will be downloaded from the official site during the build.
#[psl(path = "/path/to/list.dat")]
struct List;

// use the list

So psl-compiled is only there for convenience. However, if you want to use the entire list rather than trim it down to improve build times using the methods above, then I highly recommend using a crate like psl-compiled, which embeds the list outside of your own crate. That way only the initial build is expensive, because subsequent ones will be cached.

I have now pushed v0.1.0 of the psl crate to crates.io. Never mind about psl-compiled. I ended up pulling it into the psl crate because I needed it to implement serde::Deserialize for Domain and Suffix. It never made it to crates.io.

After updating your benchmark code to use psl v0.1.0 as follows

extern crate psl;

use std::io::{self, BufRead, Write};
use psl::{Psl, List};

fn main() {
    let stdout = io::stdout();
    let mut handle = stdout.lock();

    let list = List::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let domain_str = line.unwrap();
        let domain = match list.registrable_domain(&domain_str) {
            Some(x) => x,
            None => continue
        };

        handle.write(domain.as_str().as_bytes()).unwrap();
        handle.write(b"\n").unwrap();
    }
}

and using your Python code as is, here are the results after running your benchmark on my laptop...

~> time ./target/release/psltest <  domains.txt | wc -l
2.87user 1.04system 0:03.91elapsed 99%CPU (0avgtext+0avgdata 2932maxresident)k
0inputs+0outputs (0major+142minor)pagefaults 0swaps
1036595

~> time python main.py public_suffix_list.dat < domains.txt | wc -l
5.82user 0.04system 0:05.86elapsed 99%CPU (0avgtext+0avgdata 26176maxresident)k
0inputs+0outputs (0major+9273minor)pagefaults 0swaps
1048576

The psl crate is cheating though because it compiles the list in when building. The Python library first has to parse the list and load the data before it can start processing queries. Having said that, we can measure the time the Python library takes to parse and load the data by removing the for loop and running it again. That overhead is

~> time python main.py public_suffix_list.dat
0.35user 0.01system 0:00.37elapsed 100%CPU (0avgtext+0avgdata 26072maxresident)k
0inputs+0outputs (0major+9244minor)pagefaults 0swaps

So indeed the psl crate is now faster.

I was curious to see how a slightly more idiomatic version like

extern crate psl;

use std::{env, fs};
use psl::{Psl, List};

fn main() {
    let list = List::new();

    let filename = env::args().nth(1).expect("missing arg: filename");
    let domains = fs::read_to_string(filename).expect("file not found");

    for input in domains.lines() {
        if let Some(domain) = list.registrable_domain(input) {
            println!("{}", domain);
        }
    }
}

would perform. The answer was

~> time ./target/release/psltest domains.txt | wc -l
2.69user 1.12system 0:03.82elapsed 99%CPU (0avgtext+0avgdata 37808maxresident)k
0inputs+0outputs (0major+9040minor)pagefaults 0swaps
1036595

Hi @rushmorem!

Thanks for working on this! Looks like we came up with similar solutions - I experimented with a similar approach here:

https://gitlab.com/d33tah/psl-rewrite

I didn't really manage to beat Python though, so it looks like I did something quite wrong. Your implementation has better performance, but it makes me wonder - is there a way to speed up compilation time? Would it be expensive to load the list on startup? I'm OK trading half a second for that.

Also, here's an error I get in the Dockerfile:

error: non-reference pattern used to match a reference (see issue #42640)
   --> root/.cargo/registry/src/github.com-1ecc6299db9ec823/psl-codegen-0.1.0/src/lib.rs:150:17
    |
150 |                 Type::Icann => syn::parse_str::<syn::Type>("Icann").unwrap(),
    |                 ^^^^^^^^^^^ help: consider using a reference: `&Type::Icann`

error: non-reference pattern used to match a reference (see issue #42640)
   --> root/.cargo/registry/src/github.com-1ecc6299db9ec823/psl-codegen-0.1.0/src/lib.rs:151:17
    |
151 |                 Type::Private => syn::parse_str::<syn::Type>("Private").unwrap(),
    |                 ^^^^^^^^^^^^^ help: consider using a reference: `&Type::Private`

error: non-reference pattern used to match a reference (see issue #42640)
   --> root/.cargo/registry/src/github.com-1ecc6299db9ec823/psl-codegen-0.1.0/src/lib.rs:224:9
    |
224 |         Lit::Str(s) => {
    |         ^^^^^^^^^^^ help: consider using a reference: `&Lit::Str(s)`

error: aborting due to 3 previous errors

error: Could not compile `psl-codegen`.

To learn more, run the command again with --verbose.
Command exited with non-zero status 101
210.13user 3.98system 1:19.51elapsed 269%CPU (0avgtext+0avgdata 332824maxresident)k
351480inputs+376888outputs (13589major+829854minor)pagefaults 0swaps

BTW, have you compared this against PyPy as well?

Hi @d33tah

Thanks for working on this!

It's my pleasure. I really enjoyed working on this. A nice side effect is that the library is now no_std, like I always planned to make it one day. Thanks for inspiring me to work on this. I had never run any sort of benchmarks against it before so I wasn't aware how slow it was.

I experimented with a similar approach here

Awesome! Thanks for the link, I will take a look.

Your implementation has better performance, but makes me wonder - is there a way to speed compilation time? Would it be expensive to load the list on startup?

Currently, I think you can only improve subsequent builds by caching the build artifacts. Unfortunately, the current design compiles the list into an actual Rust match expression like

let mut suffix = Info::Incomplete;
let mut index = 1;
match labels.next() {
    Some(label) => match label {
        // ... snip
        "com" => {
            suffix = Info::Suffix(index, Type::Icann);
            index += 1;
            match labels.next() {
                // ... snip
                None => Some(suffix),
            }
        }
        // ... snip
    },
    None => None,
}

so a lazy static won't cut it. Having said that though, this shouldn't affect your development experience. When developing, you can set PSL_TLD to some TLD like com or io. The fewer suffixes the TLD has, the better, but even something with a lot of suffixes like com still compiles pretty quickly because that's still substantially less code to compile.

I'm OK trading half a second for that.

The nice thing about the new design is that the list is now a trait so it supports multiple implementations easily. I will try to implement a runtime list and see if the performance will be acceptable. If it goes well both implementations can live side by side.

Also, here's an error I get in the Dockerfile

Your compiler doesn't have the improved match ergonomics. I have forgotten in which version that RFC landed on stable but this does work on the latest stable Rust as shown by Travis. Is updating to latest stable an option for you? If not, I can update the code to support older compilers.

BTW, have you compared this against PyPy as well?

I hadn't tried with PyPy yet. Here are the results from PyPy:

~> time pypy main.py public_suffix_list.dat < domains.txt | wc -l
2.24user 0.06system 0:02.32elapsed 99%CPU (0avgtext+0avgdata 93272maxresident)k
0inputs+0outputs (0major+12805minor)pagefaults 0swaps
1048576

So we haven't beaten that one yet but we are very close. There is also the C based libpsl. Here are its results:-

~> time psl --load-psl-file public_suffix_list.dat --print-unreg-domain < domains.txt | wc -l
1.32user 0.03system 0:01.36elapsed 100%CPU (0avgtext+0avgdata 5264maxresident)k
0inputs+0outputs (0major+362minor)pagefaults 0swaps
1048576

So we still have a bit of work ahead of us if we plan to beat it as well.

@rushmorem

Your compiler doesn't have the improved match ergonomics.

It's the default one on Ubuntu 18.04, and since it's LTS, I think it might hamper adoption - unless you aim for people who mostly use rustup. If it's a big deal, I would probably not bother.

And once again, thanks for working on beating PyPy! It would be amazing to get there. Possibly ignorant question, but perhaps it's a good idea to try to trim the data structures down so that they'd fit in CPU cache if possible?

After removing a couple of unnecessary checks, since we are already assuming that the domain itself is valid, our performance is now up to:-

1.81user 1.02system 0:02.84elapsed 99%CPU (0avgtext+0avgdata 37892maxresident)k
0inputs+0outputs (0major+9038minor)pagefaults 0swaps
1048575

Note that the number of domains we are returning is now much closer to the numbers returned by both Python and C. We are probably performing at least one extra check that the other libraries are not doing.

I've been looking at your code

pub struct Node<T: 'static> {
    pub name: T,
    pub children: &'static [Node<T>],
}

Nice! As for why it's slower, both Vec and String allocate. That's obviously eating into your performance. The psl crate doesn't allocate at all, which is why we are now able to support no_std.
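For a sense of what "doesn't allocate" means in practice, compare an approach that collects owned labels with one that only borrows slices of the input. Both functions below are illustrative sketches, not the psl crate's API:

```rust
// Allocating approach: one heap String per label, plus the Vec itself.
fn labels_owned(domain: &str) -> Vec<String> {
    domain.split('.').map(str::to_owned).collect()
}

// Allocation-free approach: return the last `n` labels of `domain`
// as a borrowed slice of the input; no heap allocation at all.
fn last_labels(domain: &str, n: usize) -> Option<&str> {
    if n == 0 {
        return None;
    }
    // Walk the dots right to left without allocating.
    match domain.rmatch_indices('.').map(|(i, _)| i).nth(n - 1) {
        Some(dot) => Some(&domain[dot + 1..]),
        // Fewer than n dots: match only if the input has exactly n labels.
        None => (domain.split('.').count() == n).then_some(domain),
    }
}

fn main() {
    assert_eq!(labels_owned("foo.co.uk").len(), 3);
    assert_eq!(last_labels("foo.co.uk", 2), Some("co.uk"));
    assert_eq!(last_labels("foo.co.uk", 1), Some("uk"));
    assert_eq!(last_labels("com", 2), None);
}
```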

It's the default one on Ubuntu 18.04

What's rustc's version so I can target that as the minimum requirement?

EDIT: It looks like it's 1.25.0.

I have just pushed version 0.1.1 to crates.io. If your compiler version is indeed 1.25.0, it will compile just fine for you. That version is also a bit faster than the PyPy version for me. At this point I would like to actually configure these benchmarks into the psl crate so we can easily run them all and track regressions.

perhaps it's a good idea to try to trim the data structures down so that they'd fit in CPU cache if possible?

Are you referring to the giant match statement? I don't know how much LLVM optimises it but I purposefully made it very simple so the compiler can easily understand what it's trying to do. At some point I would also like to try rust-phf. Depending on how the match statement is being optimised, it might give us faster lookups at the expense of more complex lookup code.

@rushmorem can't figure it out yet, but I'm still getting lower performance on 0.1.1:

6.20user 1.49system 0:07.69elapsed 99%CPU (0avgtext+0avgdata 2504maxresident)k
71224inputs+0outputs (0major+94minor)pagefaults 0swaps
1036595
Removing intermediate container a04ffd727d88
 ---> 2724d05e9676
Step 12/12 : RUN time python main.py public_suffix_list.dat < domains.txt | wc -l
 ---> Running in 742a5e80a3a2
4.65user 0.03system 0:04.69elapsed 99%CPU (0avgtext+0avgdata 21448maxresident)k
408inputs+0outputs (0major+3914minor)pagefaults 0swaps
1048576
Removing intermediate container 742a5e80a3a2
 ---> 3b0e965b7c0c
Successfully built 3b0e965b7c0c

The "71224inputs" looks interesting.

I updated the Gist so you can compare the test code.

Surprisingly, on the second run it's faster:

Step 11/12 : RUN time ./target/release/publicsuffixtest public_suffix_list.dat < domains.txt | wc -l
 ---> Running in e2e184277a01
3.46user 1.23system 0:04.70elapsed 99%CPU (0avgtext+0avgdata 2504maxresident)k
0inputs+0outputs (0major+94minor)pagefaults 0swaps
1036595
Removing intermediate container e2e184277a01
 ---> 6c7d0c0db412
Step 12/12 : RUN time python main.py public_suffix_list.dat < domains.txt | wc -l
 ---> Running in 33f63437179b
4.78user 0.04system 0:04.83elapsed 99%CPU (0avgtext+0avgdata 21280maxresident)k
0inputs+0outputs (0major+3908minor)pagefaults 0swaps
1048576
Removing intermediate container 33f63437179b
 ---> be261ae10029
Successfully built be261ae10029

Updated the Gist again - I suggest you clone it using git@gist.github.com:25db3f7e00970b9bd70cf0529fe77831.git if you have trouble keeping up with timestamps/changes. The latest Gist compares against PyPy and it's surprisingly much slower (4.03s vs 1.42s). I added an instruction that's supposed to rule out cache not being warm, but it looks like it's not that. Could you run the Dockerfile? Alternatively, I can run it on a few other machines to see if I can reproduce your timing.

Updated the Gist again - I suggest you clone it using

Thanks, I have done that. It's building now.

Alternatively, I can run it on a few other machines to see if I can reproduce your timing.

It's OK, I will use the Dockerfile. Trying to reproduce my results might be a bit tricky because our environments are so different. I run NixOS 18.03 and I typically develop using the nightly compiler. I'm currently using rustc 1.28.0-nightly (e3bf634e0 2018-06-28). We are most likely using different LLVM versions too so the optimizations being done might be different.

I have pushed version 0.2 to crates.io. It has a couple of breaking changes to make the API a bit more pleasant and consistent. You can use the following new script:-

extern crate psl;

use std::io::{self, BufRead, Write};
use psl::{Psl, List};

fn main() {
    let stdout = io::stdout();
    let mut handle = stdout.lock();

    let list = List::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let domain_str = line.unwrap();
        if let Some(domain) = list.suffix(&domain_str) {
            handle.write(domain.as_str().as_bytes()).unwrap();
            handle.write(b"\n").unwrap();
        };
    }
}

Also note that I changed the script to get the public suffix rather than the registrable domain name. That behavior matches what the Python code is doing. As you will see, they will now return the same results.

Here are the results of the build for me after running it on a Linode server:-

Removing intermediate container 8f286c7ee49b
 ---> 15a0df77eaf0
Step 8/15 : ADD ./Cargo.toml .
 ---> 5d8f6370fa7c
Step 9/15 : ADD ./main.rs .
 ---> 1e842bfb5ac9
Step 10/15 : RUN time cargo build --release --quiet
 ---> Running in 050a59353578
1908.31user 10.31system 31:11.92elapsed 102%CPU (0avgtext+0avgdata 1442296maxresident)k
358528inputs+395792outputs (1441major+2422274minor)pagefaults 0swaps
Removing intermediate container 050a59353578
 ---> 2ed41b06b303
Step 11/15 : ADD ./main.py .
 ---> 4f5448271e9e
Step 12/15 : ADD ./domains.txt .
 ---> 648490ff725a
Step 13/15 : RUN cat domains.txt > /dev/null
 ---> Running in b9afa7033a20
Removing intermediate container b9afa7033a20
 ---> da9c9acb372d
Step 14/15 : RUN time ./target/release/publicsuffixtest public_suffix_list.dat < domains.txt | wc -l
 ---> Running in d8cc8a2a6eaa
2.43user 1.31system 0:03.87elapsed 96%CPU (0avgtext+0avgdata 2372maxresident)k
73720inputs+0outputs (9major+94minor)pagefaults 0swaps
1048576
Removing intermediate container d8cc8a2a6eaa
 ---> 8caa81173ea6
Step 15/15 : RUN time pypy main.py public_suffix_list.dat < domains.txt | wc -l
 ---> Running in 3c91a81de36d
1048576
2.50user 0.16system 0:02.72elapsed 97%CPU (0avgtext+0avgdata 117560maxresident)k
212664inputs+0outputs (252major+18746minor)pagefaults 0swaps
Removing intermediate container 3c91a81de36d
 ---> b17756176543
Successfully built b17756176543
Successfully tagged pslbench:latest

Note that I used a local copy of domains.txt because the file wasn't getting downloaded. I didn't limit to the first 1 million domains, I just used the entire file.

@rushmorem updated the gist again, this version pulls latest public rDNS snapshot since Rapid7 clearly makes URLs expire. Test was on my laptop:

Step 15/16 : RUN time ./target/release/publicsuffixtest public_suffix_list.dat < domains.txt | wc -l
---> Running in 068609c698c5
2.42user 1.15system 0:03.57elapsed 99%CPU (0avgtext+0avgdata 2568maxresident)k
0inputs+0outputs (0major+94minor)pagefaults 0swaps
1048576
---> 53451f385480
Removing intermediate container 068609c698c5
Step 16/16 : RUN time pypy main.py public_suffix_list.dat < domains.txt | wc -l
---> Running in e39ca5cdc142
1.58user 0.05system 0:01.66elapsed 98%CPU (0avgtext+0avgdata 88852maxresident)k
472inputs+0outputs (2major+11245minor)pagefaults 0swaps

@d33tah I've just released version 0.2.2 with more optimisations. I've also made some adjustments to your Gist, mainly to enable LTO via Cargo.toml as well as decouple Docker builds from runs. Decoupling makes it easier to build once and run many times. You can find the changes at https://gist.github.com/rushmorem/858094a283e17a28d020ca9cb5af497a. Kindly try that and let me know your results.

@d33tah Version 0.3.0 is now up on crates.io with even more optimisations. Unfortunately, the optimisation I took advantage of in this release is only available on rustc 1.27.0 and newer so I had to bump the minimum version. I have pushed a sample benchmark to https://github.com/addr-rs/pslbench. For me, both 0.2.2 and 0.3.0 are consistently faster than PyPy. I also included libpsl in the benchmark against 0.3.0. Here is one of the results as per the linked repo:-

~/bench]# docker run --rm -it pslbench
Running PyPy benchmark
2.20user 0.12system 0:02.38elapsed 97%CPU (0avgtext+0avgdata 116076maxresident)k
143392inputs+0outputs (238major+18498minor)pagefaults 0swaps
1048576

Running Rust benchmark
0.40user 0.95system 0:01.36elapsed 99%CPU (0avgtext+0avgdata 2796maxresident)k
2400inputs+0outputs (10major+133minor)pagefaults 0swaps
1048576

Running C benchmark
1.70user 0.06system 0:01.77elapsed 99%CPU (0avgtext+0avgdata 2884maxresident)k
1752inputs+0outputs (8major+266minor)pagefaults 0swaps
1048576

@d33tah I've added back support for rustc 1.25.0 in the latest release, v0.3.2. rustc 1.27.0 produces faster code than rustc 1.25.0 but I expect them both to beat the performance of the PyPy library. However, since we have been getting very different results so far, I'm very curious to see your numbers.

By default, the debug versions will now only include suffixes under the .com TLD. This makes debug builds very fast while still giving you a lot of suffixes to test against.

Thanks! Apologies for not responding - had some work to sort out first. I'll test it soon and let you know, OK?

Niice! I changed head to 10M because it was actually too fast:

Running PyPy benchmark
17.31user 0.26system 0:17.60elapsed 99%CPU (0avgtext+0avgdata 92508maxresident)k
416inputs+0outputs (4major+12286minor)pagefaults 0swaps
10485760

Running Rust benchmark
5.08user 10.97system 0:16.11elapsed 99%CPU (0avgtext+0avgdata 2808maxresident)k
0inputs+0outputs (0major+133minor)pagefaults 0swaps
10485759

Running C benchmark
16.69user 0.46system 0:17.17elapsed 99%CPU (0avgtext+0avgdata 2900maxresident)k
120inputs+0outputs (1major+268minor)pagefaults 0swaps
10485759

Can't wait to give it a bigger try!

Actually, it doesn't look correct - I changed ENTRYPOINT to target/release/pslbench and here's what I got:

·> echo google.com | sudo docker run -i pslbench 
com

Which script are you using? This one?

echo google.com | sudo docker run -i pslbench

That is correct. com is the public suffix. The code is doing list.suffix(&domain_str) rather than list.domain(&domain_str) because that's closest to what the Python and C libraries are doing, as evidenced by the similar number of results returned. I found the Python library's terminology a bit confusing: it calls the method get_public_suffix, which should return a suffix and not a registrable domain, yet it actually prints a registrable domain while the number of results it returns is that of suffixes.

EDIT: if you try the same thing for libpsl you will see that it prints google.com: com, com being the public suffix.

Niice! I changed head to 10M because it was actually too fast

Thanks. Those results are pretty close though. Both the C and PyPy libraries are within 2 seconds of our execution time. As usual, PyPy is using a lot more memory than both C and Rust. I also wonder what the extra result it returned could be? Both the C and Rust libraries found 10485759 suffixes. For 1M domains, all 3 libraries return the same results.

EDIT: I suspect that the extra result that the Python library is returning is totally invalid since it doesn't even try to check that.

I have taken a look at the Python library's source code. I can see that it only exports that one method, get_public_suffix. Unfortunately, what it returns is neither a "public suffix" nor a "registrable domain" as defined by the spec. Instead, if the input has enough labels, it returns a registrable domain otherwise it returns a suffix. This behaviour is wrong and misleading.
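For reference, the spec-level distinction can be sketched with a hand-rolled toy list (this is purely illustrative, not the psl crate's API, and it ignores wildcard and exception rules): the public suffix is the longest matching rule, and the registrable domain is that suffix plus exactly one more label, so an input that is itself a suffix has no registrable domain.

```rust
// Toy rule check; a real list also has wildcard (*) and exception (!) rules.
fn is_suffix(s: &str) -> bool {
    matches!(s, "com" | "uk" | "co.uk")
}

// Longest matching public suffix, scanning candidate suffixes from
// the leftmost label inward (the first hit is the longest).
fn public_suffix(domain: &str) -> Option<&str> {
    let mut start = Some(0);
    while let Some(i) = start {
        if is_suffix(&domain[i..]) {
            return Some(&domain[i..]);
        }
        start = domain[i..].find('.').map(|dot| i + dot + 1);
    }
    None
}

// Registrable domain = public suffix plus exactly one more label.
fn registrable_domain(domain: &str) -> Option<&str> {
    let suffix = public_suffix(domain)?;
    let prefix = &domain[..domain.len() - suffix.len()];
    let prefix = prefix.strip_suffix('.')?; // None when the input IS a suffix
    let label_start = prefix.rfind('.').map(|i| i + 1).unwrap_or(0);
    Some(&domain[label_start..])
}

fn main() {
    assert_eq!(public_suffix("google.com"), Some("com"));
    assert_eq!(registrable_domain("google.com"), Some("google.com"));
    assert_eq!(registrable_domain("foobar.gwiddle.co.uk"), Some("gwiddle.co.uk"));
    assert_eq!(registrable_domain("com"), None); // a bare suffix has none
}
```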

In light of the bug I highlighted in my previous comment, I have adjusted the benchmarks on my machine to make both psl and libpsl return a registrable domain instead of a public suffix for 10M domains. Here are the results:-

Running PyPy benchmark
20.79user 0.84system 0:21.78elapsed 99%CPU (0avgtext+0avgdata 142040maxresident)k
141224inputs+0outputs (230major+24985minor)pagefaults 0swaps
10485760

Running Rust benchmark
4.16user 12.51system 0:16.75elapsed 99%CPU (0avgtext+0avgdata 2784maxresident)k
2464inputs+0outputs (10major+134minor)pagefaults 0swaps
10366973

Running C benchmark
16.99user 0.73system 0:17.79elapsed 99%CPU (0avgtext+0avgdata 2672maxresident)k
560inputs+0outputs (3major+263minor)pagefaults 0swaps
10485759

Note that libpsl is still returning 10485759 results. That's because for invalid domains, it still prints the input before printing the result as null. For example invalid: (null).

After looking at those results again, I noticed that we were issuing way more system calls than PyPy and C, so I did an strace. The problem was that, while we were buffering reads in our benchmark code, main.rs, we were not buffering writes to stdout! I'm rebuilding the code now. Will post the results once it's done.

I took the opportunity to make the psl crate print invalid domains as (None) to make the benchmark a bit more fair since it wasn't incurring the overhead of printing for non-valid domains. Here are the results:-

Running PyPy benchmark
19.34user 0.73system 0:20.21elapsed 99%CPU (0avgtext+0avgdata 141944maxresident)k
143392inputs+0outputs (238major+24978minor)pagefaults 0swaps
10485760

Running Rust benchmark
2.21user 0.28system 0:02.50elapsed 99%CPU (0avgtext+0avgdata 2852maxresident)k
2472inputs+0outputs (10major+136minor)pagefaults 0swaps
10485760

Running C benchmark
17.91user 1.00system 0:18.98elapsed 99%CPU (0avgtext+0avgdata 2864maxresident)k
1752inputs+0outputs (8major+265minor)pagefaults 0swaps
10485759

So a lot of the overhead was just printing the domains to stdout. We were actually processing 10,485,760 domains in less than 3 seconds!
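In benchmark terms the fix is essentially one line: wrap the locked stdout handle in a BufWriter so writes are batched instead of flushed per newline (Rust's StdoutLock is line-buffered). A sketch of that shape, written against a generic reader/writer rather than the exact benchmark file:

```rust
use std::io::{self, BufRead, BufWriter, Write};

// Echo lines from `input` to `output` through a BufWriter, so each
// newline no longer triggers its own write syscall.
fn pump<R: BufRead, W: Write>(input: R, output: W) -> io::Result<()> {
    let mut out = BufWriter::new(output);
    for line in input.lines() {
        out.write_all(line?.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    // Without the BufWriter inside pump, the line-buffered
    // StdoutLock would flush on every newline.
    pump(stdin.lock(), stdout.lock())
}
```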

I've just rebuilt the docker container. It took about 8 and a half minutes.

Step 13/16 : RUN time cargo build --release --quiet
 ---> Running in 00866f167b8e
679.31user 14.31system 8:21.43elapsed 138%CPU (0avgtext+0avgdata 531956maxresident)k
415144inputs+553592outputs (1531major+3350157minor)pagefaults 0swaps

The latest version now builds in about 1 minute in debug mode.

@rushmorem nice! Could you also update the gist? Here are my changes:

diff --git a/Cargo.toml b/Cargo.toml
index 6d61810..5c2f337 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -8,7 +8,7 @@ name = "publicsuffixtest"
 path = "main.rs"
 
 [dependencies]
-psl = "0.2.2"
+psl = "*"
 
 [profile.release]
-lto = true
\ No newline at end of file
+lto = true
diff --git a/Dockerfile b/Dockerfile
index 672e552..4669d08 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -20,8 +20,15 @@ ADD ./main.rs .
 ENV RUSTFLAGS "-Ctarget-cpu=native"
 RUN time cargo build --release --quiet
 
+RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
+RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
+    grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
+RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
+    | pigz -dc | head -n 10M | jq -r .value > domains.txt
+
+
 ADD ./main.py .
 
 ADD ./bench.sh .
 
-ENTRYPOINT ./bench.sh
\ No newline at end of file
+ENTRYPOINT ./bench.sh
diff --git a/bench.sh b/bench.sh
old mode 100644
new mode 100755
diff --git a/main.rs b/main.rs
index 92a3a25..012a1ee 100644
--- a/main.rs
+++ b/main.rs
@@ -12,7 +12,7 @@ fn main() {
     for line in stdin.lock().lines() {
         let domain_str = line.unwrap();
         if let Some(suffix) = list.suffix(&domain_str) {
-            handle.write(suffix.as_str().as_bytes()).unwrap();
+            handle.write(suffix.as_bytes()).unwrap();
             handle.write(b"\n").unwrap();
         };
     }

And here's a rather unexpected benchmark:

Running PyPy benchmark
10.90user 0.20system 0:11.11elapsed 99%CPU (0avgtext+0avgdata 88000maxresident)k
0inputs+0outputs (0major+11245minor)pagefaults 0swaps
10485760


Running Rust benchmark
13.39user 10.51system 0:23.91elapsed 99%CPU (0avgtext+0avgdata 2452maxresident)k
0inputs+0outputs (0major+89minor)pagefaults 0swaps
10485759

In your update you didn't include write buffering, so the benchmark script itself issues a system call for each line printed to stdout. I have updated the Gist to match what's in https://github.com/addr-rs/pslbench. It also prints the registrable domain rather than the suffix, as per #15 (comment). Computing the suffix is faster, but using the registrable domain makes the benchmark fairer to the Python library because of the bug it has (#15 (comment)).
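The distinction matters: for `www.example.co.uk` the public suffix is `co.uk`, while the registrable domain is the suffix plus one more label, `example.co.uk`. A toy illustration of that relationship (a hardcoded suffix set for demonstration, not the psl crate's API or the real Public Suffix List):

```rust
use std::collections::HashSet;

// Toy lookup: find the longest public suffix in the name, then include
// one extra label to its left to get the registrable domain.
fn registrable_domain<'a>(domain: &'a str, suffixes: &HashSet<&str>) -> Option<&'a str> {
    let mut start = 0;
    loop {
        let rest = &domain[start..];
        if suffixes.contains(rest) {
            if start == 0 {
                return None; // the whole input is itself a public suffix
            }
            // walk back one label to the left of the suffix
            let prev = domain[..start - 1].rfind('.').map(|i| i + 1).unwrap_or(0);
            return Some(&domain[prev..]);
        }
        match rest.find('.') {
            Some(i) => start += i + 1, // drop the leftmost label and retry
            None => return None,       // no matching suffix at all
        }
    }
}

fn main() {
    let suffixes: HashSet<&str> = ["com", "uk", "co.uk"].iter().copied().collect();
    // prints Some("example.co.uk")
    println!("{:?}", registrable_domain("www.example.co.uk", &suffixes));
}
```

The real list also has wildcard and exception rules, which this sketch ignores; the point is only that computing the registrable domain requires the suffix lookup plus one more label.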

@rushmorem sorry for not responding lately. I tried this modification to your gist:

diff --git a/Dockerfile b/Dockerfile
index 8794e95..f31b3ac 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,12 +1,6 @@
 FROM ubuntu:18.04
 
-RUN apt-get update && apt-get install -y cargo curl jq libssl-dev pkg-config time pigz && apt-get clean
-
-RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
-RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
-    grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
-RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
-    | pigz -dc | head -n 1M | jq -r .value > domains.txt
+RUN apt-get update && apt-get install -y cargo curl jq libssl-dev pkg-config time pigz psl && apt-get clean
 
 RUN apt-get update && apt-get -y install pypy && apt-get clean
 RUN curl -O https://bootstrap.pypa.io/get-pip.py && pypy get-pip.py
@@ -20,10 +14,14 @@ ADD ./main.rs .
 ENV RUSTFLAGS "-Ctarget-cpu=native"
 RUN time cargo build --release --quiet
 
+RUN curl https://publicsuffix.org/list/public_suffix_list.dat -o public_suffix_list.dat
+RUN curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | \
+    grep 'href="/sonar.rdns_v2/' | cut -d'"' -f2 > url.txt
+RUN curl --location https://opendata.rapid7.com/`cat url.txt` \
+    | pigz -dc | head -n 10M | jq -r .value > domains.txt
+
 ADD ./main.py .
 
 ADD ./bench.sh .
 
-RUN apt-get install -y psl
-
 ENTRYPOINT ./bench.sh

And these are the results:

Running PyPy benchmark
11.93user 0.27system 0:12.24elapsed 99%CPU (0avgtext+0avgdata 88680maxresident)k
432inputs+0outputs (3major+11175minor)pagefaults 0swaps
10485760

Running Rust benchmark
10.40user 0.16system 0:10.57elapsed 99%CPU (0avgtext+0avgdata 2512maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps
10485760

Running C benchmark
10.94user 0.25system 0:11.21elapsed 99%CPU (0avgtext+0avgdata 3136maxresident)k
1400inputs+0outputs (4major+276minor)pagefaults 0swaps
10485760

No problem. We are all busy with other things. The Gist is using Ubuntu 18.04, which ships Rust 1.25 in its repos. Rust 1.25 does not support slice patterns, so the crate falls back to much slower string matching there. To take advantage of slice patterns you need at least Rust 1.27. Kindly run the benchmark in https://github.com/addr-rs/pslbench and let me know your results. That one uses Rust 1.27.

EDIT: Having said that, I'm glad that we are still beating both the C and PyPy libraries even on Rust 1.25.

For future reference, publicsuffix v2 is super fast.