emo-crab / tldextract-rs

tldextract-rs

Summary

tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that splits a domain name into its subcomponents: subdomain, domain, suffix, and registered domain.

Try the example code

All of the following examples can be found in examples/example.rs. To run the demo, use the following command:

# `git clone` and `cd` to the tldextract-rs repository folder first
cargo run --example=example

Hostname

  • Cargo.toml:
tldextract = { git = "https://github.com/emo-cat/tldextract-rs" }
  • example code
use tldextract::TLDExtract;

fn main() {
  let source = tldextract::Source::Hardcode;
  let suffix = tldextract::SuffixList::new(source, false, None);
  let mut extract = TLDExtract::new(suffix, true).unwrap();
  let e = extract.extract("  mirrors.tuna.tsinghua.edu.cn");
  println!("{e:#?}");
}
  • ExtractResult
Ok(
    ExtractResult {
        subdomain: Some(
            "mirrors.tuna",
        ),
        domain: Some(
            "tsinghua",
        ),
        suffix: Some(
            "edu.cn",
        ),
        registered_domain: Some(
            "tsinghua.edu.cn",
        ),
    },
)

Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element works only for single-label eTLDs such as com. It fails for multi-label eTLDs such as oseto.nagasaki.jp, where it would return only jp.
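The limitation described above can be seen in a minimal sketch. The function name `naive_suffix` is hypothetical and is not part of tldextract-rs; it only illustrates the naive approach.

```rust
/// Naive approach: treat everything after the last "." as the suffix.
/// (Hypothetical helper for illustration, not part of tldextract-rs.)
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Single-label eTLD: the naive result happens to be correct.
    println!("{}", naive_suffix("www.example.com"));
    // Multi-label eTLD: the naive result is only the last label,
    // although the full eTLD is "oseto.nagasaki.jp".
    println!("{}", naive_suffix("example.oseto.nagasaki.jp"));
}
```

The second call returns only the last label, so the registered domain derived from it would be wrong as well.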

eTLD tries

tldextract-rs stores eTLDs in compressed tries.

Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label in reverse order, so the rightmost label of each eTLD sits closest to the root.

Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the path gives the extracted eTLD nsw.edu.au.
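The right-to-left lookup can be sketched as follows. This is a simplified illustration, not the crate's actual implementation: it uses a plain (uncompressed) trie keyed by label, and the names `Node`, `SuffixTrie`, `insert`, and `longest_suffix` are hypothetical.

```rust
use std::collections::HashMap;

/// One trie node per domain label (plain trie, no compression, for brevity).
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    /// True when the path from the root to this node is a valid eTLD (🚩).
    is_suffix: bool,
}

#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Insert an eTLD label by label, right to left:
    /// "nsw.edu.au" becomes the path au -> edu -> nsw.
    fn insert(&mut self, etld: &str) {
        let mut node = &mut self.root;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }

    /// Walk the host from right to left and return the longest valid eTLD.
    fn longest_suffix(&self, host: &str) -> Option<String> {
        let labels: Vec<&str> = host.split('.').collect();
        let mut node = &self.root;
        // Number of trailing labels in the longest valid eTLD seen so far.
        let mut best = None;
        for (depth, label) in labels.iter().rev().enumerate() {
            match node.children.get(*label) {
                Some(child) => {
                    node = child;
                    if node.is_suffix {
                        best = Some(depth + 1);
                    }
                }
                // No matching node: stop walking.
                None => break,
            }
        }
        best.map(|n| labels[labels.len() - n..].join("."))
    }
}

fn main() {
    let mut trie = SuffixTrie::default();
    for etld in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        trie.insert(etld);
    }
    println!("{:?}", trie.longest_suffix("example.nsw.edu.au"));
}
```

Recording the deepest flagged node, rather than the deepest matched node, matters for hosts such as example.edu.au: the walk reaches edu, but only au is a valid eTLD in this list.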

License:GNU General Public License v3.0

