The figure below shows the main parts of a URL. In this article, we will explore methods that can be used to extract the domain from a URL in Python.
There are three tools/packages/methods we will use to accomplish this:
- Method 1: Using the tldextract module,
- Method 2: Using the tld library, and
- Method 3: Using urlparse() from urllib.parse.
The first two methods depend on the Public Suffix List (PSL), whereas urlparse() takes a generic approach.
Method 1: Using the tldextract package
The package does not come pre-installed with Python. For that reason, you might have to install it before using it. You can do that by running the following command on the terminal:
pip install tldextract
The package separates a URL into its subdomain, domain, and public suffix (TLD) using the Public Suffix List (PSL).
Example
import tldextract

tld1 = tldextract.extract('http://forums.news.cnn.com/')
tld2 = tldextract.extract('http://forums.bbc.co.uk/')
tld3 = tldextract.extract('http://www.worldbank.org.kg/')

# Print the separated parts
print(tld1, tld2, tld3, sep="\n")

# Get the domains of the sites
print(tld1.domain, tld2.domain, tld3.domain, sep="\n")
Output:
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
cnn
bbc
worldbank
Another example
import tldextract

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    parts = tldextract.extract(url)
    print("Parts: ", parts, "--> Domain: ", parts.domain)
Output:
Parts: ExtractResult(subdomain='', domain='asciimath', suffix='org') --> Domain: asciimath
Parts: ExtractResult(subdomain='', domain='todoist', suffix='com') --> Domain: todoist
Parts: ExtractResult(subdomain='forums.news', domain='cnn', suffix='com') --> Domain: cnn
Parts: ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk') --> Domain: bbc
Parts: ExtractResult(subdomain='www', domain='amazon', suffix='de') --> Domain: amazon
Parts: ExtractResult(subdomain='', domain='google', suffix='com') --> Domain: google
Parts: ExtractResult(subdomain='www.example', domain='test', suffix='') --> Domain: test
Parts: ExtractResult(subdomain='sandbox', domain='evernote', suffix='com') --> Domain: evernote
Method 2: Using the tld library
The tld library extracts the top-level domain (TLD) from the given URL. The list of TLD names is taken from the Public Suffix List. You can install tld with pip using the command "pip install tld".
It raises an exception on non-existing TLDs, or fails silently and returns None if the fail_silently argument is set to True.
Note: The module supports Python 2.7 and 3.5 through 3.9.
Example
from tld import get_tld

response = get_tld('http://forums.bbc.co.uk/', as_object=True)

# get_tld() returns the Top-Level Domain (TLD),
# therefore response = "co.uk"
print("Response of get_tld() TLD: ", response)

# The Subdomain
print("Subdomain: ", response.subdomain)

# The Domain
print("Domain: ", response.domain)

# Top-Level Domain - same as the response of get_tld()
print("Top Level Domain: ", response.tld)

# Full-level Domain (fld) - that is Domain + TLD
print("Full-level Domain: ", response.fld)
Output:
Response of get_tld() TLD: co.uk
Subdomain: forums
Domain: bbc
Top Level Domain: co.uk
Full-level Domain: bbc.co.uk
If you try to get the parts of a URL whose suffix is not in the Public Suffix List (PSL), tld will throw a TldDomainNotFound exception unless fail_silently is set to True (in which case, None is returned). For example,
from tld import get_tld

response = get_tld("http://www.example.test/foo/bar", as_object=True)
print(response)
Output:
tld.exceptions.TldDomainNotFound: Domain www.example.test didn't match any existing TLD name!
from tld import get_tld

response = get_tld("http://www.example.test/foo/bar", as_object=True, fail_silently=True)
print(response)  # None
Another example
from tld import get_tld

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    response = get_tld(url, as_object=True, fail_silently=True)
    if response is not None:
        # Get the full domain - Domain + TLD
        print("Full Domain: ", response.fld, "--> Domain: ", response.domain, "--> URL: ", url)
    else:
        # This captures URLs whose domains are not in the Public Suffix List (PSL)
        print(f"The URL {url} is not in the Public Suffix List (PSL).")
Output:
Full Domain: asciimath.org --> Domain: asciimath --> URL: http://asciimath.org/
Full Domain: todoist.com --> Domain: todoist --> URL: https://todoist.com/app/today
Full Domain: cnn.com --> Domain: cnn --> URL: http://forums.news.cnn.com/
Full Domain: bbc.co.uk --> Domain: bbc --> URL: http://forums.bbc.co.uk/
Full Domain: amazon.de --> Domain: amazon --> URL: https://www.amazon.de/
Full Domain: google.com --> Domain: google --> URL: https://google.com/
The URL http://www.example.test/foo/bar is not in the Public Suffix List (PSL).
Full Domain: evernote.com --> Domain: evernote --> URL: https://sandbox.evernote.com/Home.action
Method 3: Using urlparse() from urllib.parse
from urllib.parse import urlparse

urls = ["http://asciimath.org/", "https://todoist.com/app/today",
        "http://forums.news.cnn.com/", "http://forums.bbc.co.uk/",
        "https://www.amazon.de/", "https://google.com/",
        "http://www.example.test/foo/bar",
        "https://sandbox.evernote.com/Home.action"]

for url in urls:
    domain = urlparse(url).netloc
    print("Domain: ", domain, "--> URL: ", url)
Output:
Domain: asciimath.org --> URL: http://asciimath.org/
Domain: todoist.com --> URL: https://todoist.com/app/today
Domain: forums.news.cnn.com --> URL: http://forums.news.cnn.com/
Domain: forums.bbc.co.uk --> URL: http://forums.bbc.co.uk/
Domain: www.amazon.de --> URL: https://www.amazon.de/
Domain: google.com --> URL: https://google.com/
Domain: www.example.test --> URL: http://www.example.test/foo/bar
Domain: sandbox.evernote.com --> URL: https://sandbox.evernote.com/Home.action
Notice that urlparse() returns the URL parts, and the network location part (netloc) contains Subdomain + Domain + TLD. Because it does not consult the PSL, it cannot tell you where the subdomain ends and the registrable domain begins.
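Two pitfalls are worth noting: urlparse() only fills in netloc when the URL has a scheme (or at least a leading //), and netloc may also carry a port. A small stdlib-only sketch:

```python
from urllib.parse import urlparse

# Without a scheme, the host ends up in path, not netloc
print(urlparse("www.example.com/foo").netloc)    # '' (empty)

# Prepending '//' (or a scheme) fixes this
print(urlparse("//www.example.com/foo").netloc)  # www.example.com

# netloc keeps the port; hostname strips it
parsed = urlparse("https://example.com:8080/path")
print(parsed.netloc)    # example.com:8080
print(parsed.hostname)  # example.com
```

So if your input URLs come from user text, normalize them (or use .hostname) before treating netloc as a bare domain.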
Conclusion
We have discussed three packages you can use to extract a domain from a URL in Python. Unlike urlparse(), the first two modules, tldextract and tld, provide all URL parts separately, so you can easily control what you want to extract: the second-level domain or the entire domain name.