The figure below shows the main parts of a URL. In this article, we will explore methods that can be used to extract the domain from a URL in Python.
There are three tools/packages/methods we will use to accomplish this:
- Method 1: Using the tldextract module,
- Method 2: Using the tld library, and
- Method 3: Using urlparse() from urllib.parse.
The first two methods depend on the Public Suffix List (PSL), whereas urlparse() takes a generic approach.
Method 1: Using the tldextract package
The package does not come pre-installed with Python. For that reason, you might have to install it before using it. You can do that by running the following command on the terminal:
pip install tldextract
The package separates a URL into its subdomain, domain, and public suffix (TLD) using the Public Suffix List (PSL).
Example
import tldextract

tld1 = tldextract.extract('http://forums.news.cnn.com/')
tld2 = tldextract.extract('http://forums.bbc.co.uk/')
tld3 = tldextract.extract('http://www.worldbank.org.kg/')

# Print the separated parts
print(tld1, tld2, tld3, sep="\n")

# Get the domains of the sites
print(tld1.domain, tld2.domain, tld3.domain, sep="\n")
Output:
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
cnn
bbc
worldbank
Another example
import tldextract

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    parts = tldextract.extract(url)
    print("Parts: ", parts, "--> Domain: ", parts.domain)
Output:
Parts: ExtractResult(subdomain='', domain='asciimath', suffix='org') --> Domain: asciimath
Parts: ExtractResult(subdomain='', domain='todoist', suffix='com') --> Domain: todoist
Parts: ExtractResult(subdomain='forums.news', domain='cnn', suffix='com') --> Domain: cnn
Parts: ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk') --> Domain: bbc
Parts: ExtractResult(subdomain='www', domain='amazon', suffix='de') --> Domain: amazon
Parts: ExtractResult(subdomain='', domain='google', suffix='com') --> Domain: google
Parts: ExtractResult(subdomain='www.example', domain='test', suffix='') --> Domain: test
Parts: ExtractResult(subdomain='sandbox', domain='evernote', suffix='com') --> Domain: evernote
Method 2: Using the tld library
The tld library extracts the top-level domain (TLD) from the given URL. The list of TLD names is taken from the Public Suffix List. You can install tld with pip using the command "pip install tld".
It raises an exception on non-existing TLDs, or fails silently and returns None if the fail_silently argument is set to True.
Note: The module supports Python 2.7 and 3.5 through 3.9.
Example
from tld import get_tld

response = get_tld('http://forums.bbc.co.uk/', as_object=True)

# get_tld() returns the Top-Level Domain (TLD),
# therefore response = "co.uk"
print("Response of get_tld() TLD: ", response)

# The Subdomain
print("Subdomain: ", response.subdomain)

# The Domain
print("Domain: ", response.domain)

# Top-Level Domain - same as the response of get_tld()
print("Top Level Domain: ", response.tld)

# Full-level Domain (fld) - that is Domain + TLD
print("Full-level Domain: ", response.fld)
Output:
Response of get_tld() TLD: co.uk
Subdomain: forums
Domain: bbc
Top Level Domain: co.uk
Full-level Domain: bbc.co.uk
If you try to get the parts of a URL whose suffix is not in the Public Suffix List (PSL), tld will throw a TldDomainNotFound exception unless fail_silently is set to True (in which case, None is returned). For example,
from tld import get_tld

response = get_tld("http://www.example.test/foo/bar", as_object=True)
print(response)
Output:
tld.exceptions.TldDomainNotFound: Domain www.example.test didn't match any existing TLD name!
from tld import get_tld

response = get_tld("http://www.example.test/foo/bar", as_object=True, fail_silently=True)
print(response)  # None
Another example
from tld import get_tld

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    response = get_tld(url, as_object=True, fail_silently=True)
    if response is not None:
        # Get the full domain - Domain + TLD
        print("Full Domain: ", response.fld, "--> Domain: ", response.domain, "--> URL: ", url)
    else:
        # This captures URLs whose domains are not in the Public Suffix List (PSL)
        print(f"The URL {url} is not in the Public Suffix List (PSL).")
Output:
Full Domain: asciimath.org --> Domain: asciimath --> URL: http://asciimath.org/
Full Domain: todoist.com --> Domain: todoist --> URL: https://todoist.com/app/today
Full Domain: cnn.com --> Domain: cnn --> URL: http://forums.news.cnn.com/
Full Domain: bbc.co.uk --> Domain: bbc --> URL: http://forums.bbc.co.uk/
Full Domain: amazon.de --> Domain: amazon --> URL: https://www.amazon.de/
Full Domain: google.com --> Domain: google --> URL: https://google.com/
The URL http://www.example.test/foo/bar is not in the Public Suffix List (PSL).
Full Domain: evernote.com --> Domain: evernote --> URL: https://sandbox.evernote.com/Home.action
Method 3: Using urlparse() from urllib.parse
from urllib.parse import urlparse

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    domain = urlparse(url).netloc
    print("Domain: ", domain, "--> URL: ", url)
Output:
Domain: asciimath.org --> URL: http://asciimath.org/
Domain: todoist.com --> URL: https://todoist.com/app/today
Domain: forums.news.cnn.com --> URL: http://forums.news.cnn.com/
Domain: forums.bbc.co.uk --> URL: http://forums.bbc.co.uk/
Domain: www.amazon.de --> URL: https://www.amazon.de/
Domain: google.com --> URL: https://google.com/
Domain: www.example.test --> URL: http://www.example.test/foo/bar
Domain: sandbox.evernote.com --> URL: https://sandbox.evernote.com/Home.action
Notice that urlparse() returns the URL parts, and the network location part (netloc) contains Subdomain + Domain + TLD. Because it does not consult the PSL, it cannot tell you where the subdomain ends and the registrable domain begins.
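Two pitfalls are worth noting: urlparse() only fills in netloc when the URL has a scheme (or at least a leading //), and netloc may also carry a port. A small stdlib-only sketch:

```python
from urllib.parse import urlparse

# Without a scheme, the host ends up in path, not netloc
print(urlparse("www.example.com/foo").netloc)    # '' (empty)

# Prepending '//' (or a scheme) fixes this
print(urlparse("//www.example.com/foo").netloc)  # www.example.com

# netloc keeps the port; hostname strips it
parsed = urlparse("https://example.com:8080/path")
print(parsed.netloc)    # example.com:8080
print(parsed.hostname)  # example.com
```

So if your input URLs come from user text, normalize them (or use .hostname) before treating netloc as a bare domain.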
Conclusion
We have discussed three packages you can use to extract a domain from a URL in Python. Unlike urlparse(), the first two modules, tldextract and tld, provide all URL parts separately, so you can easily control what you want to extract: the second-level domain or the entire domain name.