Extract Domain From URL in Python

The figure below shows the main parts of a URL. In this article, we will explore methods for extracting the domain from a URL in Python.

There are three tools/packages/methods we will use to accomplish this:

  • Method 1: Using the tldextract module,
  • Method 2: Using the tld library, and
  • Method 3: Using urlparse() from urllib.parse.

The first two methods depend on the Public Suffix List (PSL), whereas urllib.parse takes a generic approach and simply splits the URL into its components.

Method 1: Using the tldextract package

The package does not come pre-installed with Python, so you may need to install it before using it. You can do that by running the following command in the terminal:

pip install tldextract

The package separates a URL into its subdomain, domain, and public suffix (TLD), using the Public Suffix List (PSL).

Example

import tldextract
tld1 = tldextract.extract('http://forums.news.cnn.com/')
tld2 = tldextract.extract('http://forums.bbc.co.uk/')
tld3 = tldextract.extract('http://www.worldbank.org.kg/')
# Prints the separated parts
print(tld1, tld2, tld3, sep="\n")
# Get the domains of the sites
print(tld1.domain, tld2.domain, tld3.domain, sep="\n")

Output:

ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
cnn
bbc
worldbank
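
If you need the full registered domain (domain plus suffix) rather than just the second-level label, tldextract exposes it directly. A minimal sketch, assuming the registered_domain and fqdn attributes available in current tldextract releases:

import tldextract

parts = tldextract.extract('http://forums.bbc.co.uk/')
# Second-level domain only
print(parts.domain)             # bbc
# Registered domain: domain + public suffix
print(parts.registered_domain)  # bbc.co.uk
# Fully qualified domain name: subdomain + domain + suffix
print(parts.fqdn)               # forums.bbc.co.uk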

Another example

import tldextract
urls = ["http://asciimath.org/", "https://todoist.com/app/today", "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/", "https://www.amazon.de/", "https://google.com/", "http://www.example.test/foo/bar", "https://sandbox.evernote.com/Home.action"]
for url in urls:
	parts = tldextract.extract(url)
	print("Parts: ", parts, "--> Domain: ", parts.domain)

Output:

Parts:  ExtractResult(subdomain='', domain='asciimath', suffix='org') --> Domain:  asciimath
Parts:  ExtractResult(subdomain='', domain='todoist', suffix='com') --> Domain:  todoist
Parts:  ExtractResult(subdomain='forums.news', domain='cnn', suffix='com') --> Domain:  cnn
Parts:  ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk') --> Domain:  bbc
Parts:  ExtractResult(subdomain='www', domain='amazon', suffix='de') --> Domain:  amazon
Parts:  ExtractResult(subdomain='', domain='google', suffix='com') --> Domain:  google
Parts:  ExtractResult(subdomain='www.example', domain='test', suffix='') --> Domain:  test
Parts:  ExtractResult(subdomain='sandbox', domain='evernote', suffix='com') --> Domain:  evernote

Method 2: Using the tld module

The tld module extracts the top-level domain (TLD) from a given URL. The list of TLD names is taken from the Public Suffix List. You can install tld with pip using the command “pip install tld”.

By default, it raises an exception for non-existent TLDs; if the fail_silently argument is set to True, it fails silently and returns None instead.

Note: The module supports Python 2.7 and 3.5 through 3.9.

Example

from tld import get_tld
response = get_tld('http://forums.bbc.co.uk/', as_object=True)
# get_tld() returns the Top-Level Domain (TLD),
# therefore response = "co.uk"
print("Response of get_tld() TLD: ", response)
# The Subdomain
print("Subdomain: ", response.subdomain)
# The Domain
print("Domain: ", response.domain)
# Top-Level Domain - same as the response of get_tld()
print("Top Level Domain: ", response.tld)
# Full-level Domain (fld) - that is Domain + TLD
print("Full-level Domain: ", response.fld)

Output:

Response of get_tld() TLD:  co.uk
Subdomain:  forums
Domain:  bbc
Top Level Domain:  co.uk
Full-level Domain:  bbc.co.uk
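
If all you need is the registered (first-level) domain as a string, the tld package also provides a get_fld() helper. A minimal sketch:

from tld import get_fld

# get_fld() returns the first-level domain (domain + TLD) as a plain string
print(get_fld('http://forums.bbc.co.uk/'))  # bbc.co.uk
print(get_fld('https://www.amazon.de/'))    # amazon.de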

If you try to get the parts of a URL whose suffix is not in the Public Suffix List (PSL), tld will raise a TldDomainNotFound exception unless fail_silently is set to True (in that case, None is returned). For example,

from tld import get_tld
response = get_tld("http://www.example.test/foo/bar" , as_object=True)
print(response)

Output:

tld.exceptions.TldDomainNotFound: Domain www.example.test didn't match any existing TLD name!

With fail_silently=True, get_tld() returns None instead of raising the exception:

from tld import get_tld
response = get_tld("http://www.example.test/foo/bar", as_object=True, fail_silently=True)
print(response)  # None
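
Alternatively, you can keep the default behavior and handle the exception yourself. A minimal sketch catching tld's TldDomainNotFound:

from tld import get_tld
from tld.exceptions import TldDomainNotFound

try:
    response = get_tld("http://www.example.test/foo/bar", as_object=True)
    print("Domain: ", response.domain)
except TldDomainNotFound:
    # The suffix ".test" is not in the Public Suffix List (PSL)
    print("The URL does not match any existing TLD.")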

Another example

from tld import get_tld
urls = ["http://asciimath.org/", "https://todoist.com/app/today", "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/", "https://www.amazon.de/", "https://google.com/", "http://www.example.test/foo/bar", "https://sandbox.evernote.com/Home.action",
]
for url in urls:
    response = get_tld(url, as_object=True, fail_silently=True)
    if response is not None:
        # The URL's suffix is in the Public Suffix List (PSL)
        # Get the full domain (domain + TLD) and the domain
        print("Full Domain: ", response.fld,"--> Domain: ", response.domain, "--> URL: ", url)
    else:
        print(f"The URL {url} is not in the Public Suffix List (PSL).")

Output:

Full Domain:  asciimath.org --> Domain:  asciimath --> URL:  http://asciimath.org/
Full Domain:  todoist.com --> Domain:  todoist --> URL:  https://todoist.com/app/today
Full Domain:  cnn.com --> Domain:  cnn --> URL:  http://forums.news.cnn.com/
Full Domain:  bbc.co.uk --> Domain:  bbc --> URL:  http://forums.bbc.co.uk/
Full Domain:  amazon.de --> Domain:  amazon --> URL:  https://www.amazon.de/
Full Domain:  google.com --> Domain:  google --> URL:  https://google.com/
The URL http://www.example.test/foo/bar is not in the Public Suffix List (PSL).
Full Domain:  evernote.com --> Domain:  evernote --> URL:  https://sandbox.evernote.com/Home.action

Method 3: Using urlparse() from urllib.parse

urlparse() ships with Python's standard library, so no installation is needed. It splits a URL into its components (scheme, netloc, path, and so on); the network location part (netloc) holds the full hostname:

from urllib.parse import urlparse
urls = ["http://asciimath.org/", "https://todoist.com/app/today", "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/", "https://www.amazon.de/", "https://google.com/", "http://www.example.test/foo/bar", "https://sandbox.evernote.com/Home.action"]
for url in urls:
	domain = urlparse(url).netloc
	print("Domain: ", domain, "--> URL: ",  url)

Output:

Domain:  asciimath.org --> URL:  http://asciimath.org/
Domain:  todoist.com --> URL:  https://todoist.com/app/today
Domain:  forums.news.cnn.com --> URL:  http://forums.news.cnn.com/
Domain:  forums.bbc.co.uk --> URL:  http://forums.bbc.co.uk/
Domain:  www.amazon.de --> URL:  https://www.amazon.de/
Domain:  google.com --> URL:  https://google.com/
Domain:  www.example.test --> URL:  http://www.example.test/foo/bar
Domain:  sandbox.evernote.com --> URL:  https://sandbox.evernote.com/Home.action

Notice that urlparse() only splits the URL into its parts; the network location (netloc) contains the subdomain, domain, and TLD together, so it cannot tell you where the registrable domain begins.
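
To see why the PSL-based packages are still useful, consider recovering the domain from netloc by naive label splitting. A small sketch of the pitfall (the helper naive_domain() is a hypothetical illustration, not part of urllib):

from urllib.parse import urlparse

def naive_domain(url):
    # Naive assumption: the domain is always the second-to-last label
    labels = urlparse(url).netloc.split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]

print(naive_domain("http://forums.news.cnn.com/"))  # cnn (correct)
print(naive_domain("http://forums.bbc.co.uk/"))     # co  (wrong - the domain is bbc)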

Conclusion

We have discussed three packages you can use to extract a domain from a URL in Python. Unlike urlparse(), the first two modules, tldextract and tld, split the URL using the Public Suffix List, so you can easily control what you want to extract: the second-level domain or the full registered domain name.