Extract Domain From URL in Python

The figure below shows the main parts of a URL. In this article, we will explore methods that can be used to extract the domain from a URL in Python.

There are three tools/packages/methods we will use to accomplish this:

  • Method 1: Using the tldextract module,
  • Method 2: Using the tld library, and
  • Method 3: Using urlparse() from urllib.parse.

The first two methods depend on the Public Suffix List (PSL), whereas urllib.parse takes a generic approach and simply splits the URL without consulting the PSL.

Method 1: Using the tldextract package

The package does not come pre-installed with Python, so you may need to install it before using it. You can do that by running the following command in the terminal:

pip install tldextract

The package separates a URL into its subdomain, domain, and public suffix (TLD) using the Public Suffix List (PSL).

Example
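
Below is a minimal sketch of such an example, reconstructed to match the output shown; the exact ExtractResult repr can differ slightly between tldextract versions (newer releases also report an is_private field).

import tldextract

urls = ["http://forums.news.cnn.com/",
        "http://forums.bbc.co.uk/",
        "http://www.worldbank.org.kg/"]

# Print the full ExtractResult (subdomain, domain, suffix) for each URL
for url in urls:
    print(tldextract.extract(url))

# Print only the registered (second-level) domain
for url in urls:
    print(tldextract.extract(url).domain)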

Output:

ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
cnn
bbc
worldbank

Another example
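
A sketch of what this loop might look like, assuming the same list of URLs used in the later examples (as before, the exact ExtractResult repr depends on the tldextract version):

import tldextract

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    # extract() splits the URL into subdomain, domain, and suffix
    parts = tldextract.extract(url)
    print("Parts: ", parts, "--> Domain: ", parts.domain)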

Output:

Parts:  ExtractResult(subdomain='', domain='asciimath', suffix='org') --> Domain:  asciimath
Parts:  ExtractResult(subdomain='', domain='todoist', suffix='com') --> Domain:  todoist
Parts:  ExtractResult(subdomain='forums.news', domain='cnn', suffix='com') --> Domain:  cnn
Parts:  ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk') --> Domain:  bbc
Parts:  ExtractResult(subdomain='www', domain='amazon', suffix='de') --> Domain:  amazon
Parts:  ExtractResult(subdomain='', domain='google', suffix='com') --> Domain:  google
Parts:  ExtractResult(subdomain='www.example', domain='test', suffix='') --> Domain:  test
Parts:  ExtractResult(subdomain='sandbox', domain='evernote', suffix='com') --> Domain:  evernote

Method 2: Using the tld module

The tld module extracts the top-level domain (TLD) from a given URL. The list of TLD names is taken from the Public Suffix List. You can install tld with pip using the command “pip install tld”.

It optionally raises an exception on non-existent TLDs or fails silently (if the fail_silently argument is set to True).

Note: The module supports Python 2.7 and 3.5 through 3.9.

Example
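
A minimal sketch that reproduces the output below. Note that in recent versions of tld, get_tld() returns the TLD string by default and a Result object when as_object=True (older versions behaved differently).

from tld import get_tld

url = "http://forums.bbc.co.uk/"

# Default call: returns the TLD as a string, e.g. 'co.uk'
print("Response of get_tld() TLD: ", get_tld(url))

# as_object=True: returns a Result object exposing the individual URL parts
res = get_tld(url, as_object=True)
print("Subdomain: ", res.subdomain)
print("Domain: ", res.domain)
print("Top Level Domain: ", res.tld)
print("Full-level Domain: ", res.fld)   # fld = domain + suffix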

Output:

Response of get_tld() TLD:  co.uk
Subdomain:  forums
Domain:  bbc
Top Level Domain:  co.uk
Full-level Domain:  bbc.co.uk

If you try to get the parts of a URL whose suffix is not in the Public Suffix List (PSL), tld will raise a TldDomainNotFound exception unless fail_silently is set to True (in which case None is returned). For example,
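
a call like the following sketch (using the default fail_silently=False) fails because .test is not a real suffix:

from tld import get_tld

# .test is not in the Public Suffix List, so this call raises
# tld.exceptions.TldDomainNotFound
res = get_tld("http://www.example.test/foo/bar", as_object=True)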

Output:

tld.exceptions.TldDomainNotFound: Domain www.example.test didn't match any existing TLD name!

Another example

from tld import get_tld

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    response = get_tld(url, as_object=True, fail_silently=True)
    if response is not None:
        # The URL matched the Public Suffix List (PSL):
        # print the full domain (fld = domain + suffix) and the domain alone
        print("Full Domain: ", response.fld, "--> Domain: ", response.domain, "--> URL: ", url)
    else:
        # fail_silently=True returns None for URLs whose suffix is not in the PSL
        print(f"The URL {url} is not in the Public Suffix List (PSL).")

Output:

Full Domain:  asciimath.org --> Domain:  asciimath --> URL:  http://asciimath.org/
Full Domain:  todoist.com --> Domain:  todoist --> URL:  https://todoist.com/app/today
Full Domain:  cnn.com --> Domain:  cnn --> URL:  http://forums.news.cnn.com/
Full Domain:  bbc.co.uk --> Domain:  bbc --> URL:  http://forums.bbc.co.uk/
Full Domain:  amazon.de --> Domain:  amazon --> URL:  https://www.amazon.de/
Full Domain:  google.com --> Domain:  google --> URL:  https://google.com/
The URL http://www.example.test/foo/bar is not in the Public Suffix List (PSL).
Full Domain:  evernote.com --> Domain:  evernote --> URL:  https://sandbox.evernote.com/Home.action

Method 3: Using urlparse() from urllib.parse
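
urlparse() is part of the Python standard library, so nothing needs to be installed. It splits a URL into its components (scheme, netloc, path, query, and so on), and the netloc attribute holds the hostname. A minimal sketch, reusing the same list of URLs as above:

from urllib.parse import urlparse

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    # netloc is the network location: subdomain + domain + TLD
    print("Domain: ", urlparse(url).netloc, "--> URL: ", url)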

Output:

Domain:  asciimath.org --> URL:  http://asciimath.org/
Domain:  todoist.com --> URL:  https://todoist.com/app/today
Domain:  forums.news.cnn.com --> URL:  http://forums.news.cnn.com/
Domain:  forums.bbc.co.uk --> URL:  http://forums.bbc.co.uk/
Domain:  www.amazon.de --> URL:  https://www.amazon.de/
Domain:  google.com --> URL:  https://google.com/
Domain:  www.example.test --> URL:  http://www.example.test/foo/bar
Domain:  sandbox.evernote.com --> URL:  https://sandbox.evernote.com/Home.action

Notice that urlparse() returns the parts of the URL, and the network location part (netloc) contains the subdomain, domain, and TLD together; it does not separate the registered domain from the subdomain.

Conclusion

We have discussed three tools you can use to extract a domain from a URL in Python. Unlike urlparse(), the first two modules, tldextract and tld, provide all the URL parts, so you can easily control what you want to extract: the second-level domain alone or the entire domain name.