Summary
tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that extracts the subcomponents of a domain name.
Try the example code
All of the following examples can be found at examples/example.rs. To run the demo, use the following command:
# `git clone` and `cd` to the tldextract-rs repository folder first
cargo run --example=example
Hostname
- Cargo.toml:
tldextract = { git = "https://github.com/emo-cat/tldextract-rs" }
- example code
use tldextract::TLDExtract;

fn main() {
    let source = tldextract::Source::Hardcode;
    let suffix = tldextract::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    let e = extract.extract(" mirrors.tuna.tsinghua.edu.cn");
    println!("{e:#?}");
}
- ExtractResult
Ok(
    ExtractResult {
        subdomain: Some(
            "mirrors.tuna",
        ),
        domain: Some(
            "tsinghua",
        ),
        suffix: Some(
            "edu.cn",
        ),
        registered_domain: Some(
            "tsinghua.edu.cn",
        ),
    },
)
Implementation details
Why not split on "." and take the last element instead?
Splitting on "." and taking the last element only works for simple eTLDs like com, but not for more complex ones like oseto.nagasaki.jp.
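To make the failure concrete, here is a minimal sketch of the naive approach. The function name `naive_suffix` is hypothetical and is not part of tldextract-rs; it only illustrates why a plain split is not enough.

```rust
/// Naive approach: treat everything after the last "." as the suffix.
/// (Hypothetical helper for illustration, not part of tldextract-rs.)
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Works for a single-label eTLD such as "com".
    assert_eq!(naive_suffix("example.com"), "com");

    // Falls short for a multi-label eTLD: the suffix should be
    // "oseto.nagasaki.jp", but the naive split only sees "jp".
    assert_eq!(naive_suffix("example.oseto.nagasaki.jp"), "jp");
}
```

Because the number of labels in an eTLD varies, the extractor needs a list of known suffixes rather than a fixed split position.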
eTLD tries
tldextract-rs stores eTLDs in compressed tries.
Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label in reverse order, rightmost label first.
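The reverse-order insertion can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual code: the `Node` type and its `insert` method are hypothetical, and a plain `HashMap`-based trie with one label per node is used instead of a compressed trie.

```rust
use std::collections::HashMap;

/// One trie node per domain label (hypothetical type, not the crate's own).
#[derive(Default)]
struct Node {
    /// True when the path from the root to this node spells a complete eTLD.
    is_suffix: bool,
    children: HashMap<String, Node>,
}

impl Node {
    /// Insert an eTLD label by label, rightmost label first.
    fn insert(&mut self, etld: &str) {
        let mut node = self;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }
}

fn main() {
    let mut root = Node::default();
    for etld in ["au", "nsw.edu.au", "com.ac"] {
        root.insert(etld);
    }

    // "nsw.edu.au" is stored along the path au -> edu -> nsw.
    let au = &root.children["au"];
    assert!(au.is_suffix); // "au" is itself an eTLD
    assert!(!au.children["edu"].is_suffix); // "edu.au" was not inserted
    assert!(au.children["edu"].children["nsw"].is_suffix);
}
```

Storing labels right to left means that eTLDs sharing a parent, such as au and nsw.edu.au, share a prefix of the path from the root.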
Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac
and the example URL host `example.nsw.edu.au`, the compressed trie is structured as follows:
START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩
=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`
The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.
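The right-to-left walk can be sketched as below. This is a simplified illustration, not the crate's implementation: `Node`, `build`, and `longest_suffix` are hypothetical names, the trie is an uncompressed `HashMap`-based one, and the wildcard and exception rules of the Public Suffix List are left out.

```rust
use std::collections::HashMap;

/// One trie node per domain label (hypothetical type, not the crate's own).
#[derive(Default)]
struct Node {
    /// True when the path from the root to this node spells a complete eTLD.
    is_suffix: bool,
    children: HashMap<String, Node>,
}

/// Build the trie by inserting each eTLD rightmost label first.
fn build(etlds: &[&str]) -> Node {
    let mut root = Node::default();
    for etld in etlds {
        let mut node = &mut root;
        for label in etld.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }
    root
}

/// Walk the host's labels right to left and return the longest matching eTLD.
fn longest_suffix<'a>(root: &Node, host: &'a str) -> Option<&'a str> {
    let mut node = root;
    let mut start = host.len(); // byte offset where the matched path begins
    let mut best = None;
    for label in host.rsplit('.') {
        match node.children.get(label) {
            Some(child) => node = child,
            None => break, // no more matching nodes
        }
        start -= label.len();
        if node.is_suffix {
            // Slicing the host yields the labels in their original order.
            best = Some(&host[start..]);
        }
        // Step over the "." that separates this label from the next one.
        start = start.saturating_sub(1);
    }
    best
}

fn main() {
    let root = build(&["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"]);
    assert_eq!(longest_suffix(&root, "example.nsw.edu.au"), Some("nsw.edu.au"));
    // "edu.au" is not flagged as an eTLD here, so the match falls back to "au".
    assert_eq!(longest_suffix(&root, "example.edu.au"), Some("au"));
    assert_eq!(longest_suffix(&root, "example.org"), None);
}
```

Tracking the last flagged node rather than the last matched node matters: a path such as au -> edu matches the host but is only an eTLD if a node along the way is marked as one.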