Read a CSV File From a URL in Python

Python can read CSV data directly from the web. This post shows three ways to load a CSV file from a URL: with pandas, with urllib, and with the requests package.

Method A: Loading CSV from URL using Pandas

The read_csv() function can read CSV files directly from an online source. In the following example, the function loads CSV data from GitHub and stores it in a DataFrame df.

import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

Output (description):

A DataFrame of 891 rows and 12 columns.

When reading data from GitHub, make sure you use the raw URL. In the example above, loading the data from https://github.com/datasciencedojo/datasets/blob/master/titanic.csv will fail, because that address serves the HTML page that displays the file rather than the CSV itself. When you land on such a URL, open the raw content using the “Raw” button at the top of the file view; the address of that page is the correct URL, like the one used in the example above.
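The blob-to-raw conversion can also be done in code. The sketch below assumes the common https://github.com/<user>/<repo>/blob/<branch>/<path> layout with a branch name that contains no slashes. The helper name github_blob_to_raw is hypothetical and is not part of pandas or any GitHub library.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Turn a GitHub "blob" page URL into the matching raw-content URL.

    Hypothetical helper: assumes the path looks like
    /<user>/<repo>/blob/<branch>/<path/to/file>.
    URLs that do not match this shape are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    # Drop the "blob" segment and switch to the raw-content host
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


blob_url = "https://github.com/datasciencedojo/datasets/blob/master/titanic.csv"
print(github_blob_to_raw(blob_url))
```

The returned string can then be passed to pd.read_csv() in place of the blob URL.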

CSV stands for Comma-Separated Values, but the values in a CSV file are not always comma-delimited. Some files use other characters, such as a semicolon (“;”) or a tab (“\t”). To load a CSV that is not comma-separated, pass the “sep” argument to the read_csv() function. For example, the following call omits “sep” for a semicolon-separated file and fails:

import pandas as pd
# csv not comma-delimited
df1 = pd.read_csv("https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv") 

Output:

ParserError: Error tokenizing data.

Passing sep=";" tells read_csv() to split on semicolons instead:

import pandas as pd
# data is semicolon-separated
df1 = pd.read_csv("https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv", sep=";")

Output (description):

A data table of 78 rows and 16 columns
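When the delimiter is not known in advance, one option is to let the standard library's csv.Sniffer guess it from a small sample of the file. The sketch below uses a made-up three-line sample in place of downloaded content, and the candidate delimiters passed to sniff() are an assumption about what the file might use.

```python
import csv
import io

# Made-up sample standing in for the first lines of a downloaded file
sample = "name;mfr;calories\nAll-Bran;K;70\nCheerios;G;110\n"

# Restrict the guess to a few likely delimiters
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print(repr(dialect.delimiter))

# The detected dialect can be handed straight to csv.reader()
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[0])
```

The detected character can then be passed as the “sep” argument of read_csv(). Sniffing is a heuristic, so it is worth checking the result on files with irregular rows.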

Method B: Reading CSV from URL using urllib

Python’s urllib package is used to open and read URLs over protocols such as HTTP and HTTPS. To connect to a URL and read its contents, use the urllib.request.urlopen() function.

Once the response is received, we can utilize the csv.reader() function to parse the received content. The reader allows us to iterate through the CSV row by row.

# Load packages
from urllib.request import urlopen
import csv
import codecs
# the URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
# fetch the source using urlopen
response = urlopen(url)
# parse the fetched data using csv.reader
# codecs.iterdecode() decodes the byte response into strings, line by line
csvfile = csv.reader(codecs.iterdecode(response, "utf-8"))
# Loop through the rows
# enumerate() allows us to index the iterable
for index, row in enumerate(csvfile):
    print(index, row)  # do something with row - note: the first row is the header

Output (description):

892 rows of data, including the header row

Note that the first row is the header in most cases.
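Since the header usually comes first, it can be pulled off the reader with next() before looping over the data rows. The sketch below uses an in-memory io.BytesIO object as a stand-in for the urlopen() response, because both yield byte lines when iterated. The column names and values are made up for illustration.

```python
import codecs
import csv
import io

# Stand-in for the urlopen() response: an object that yields byte lines.
# The data is invented for this example.
response = io.BytesIO(b'id,name,score\n1,"Doe, Jane",90\n2,"Roe, Richard",85\n')

reader = csv.reader(codecs.iterdecode(response, "utf-8"))

# Consume the header row first, so the loop below only sees data rows
header = next(reader)

# Pair each value with its column name
records = [dict(zip(header, row)) for row in reader]

print(header)
print(records[0])
```

csv.DictReader offers the same pairing out of the box; the explicit next() call is shown here because it makes the header handling visible.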

Method C: Use requests and csv to load CSV from an Online Source

Like urllib, the requests module can fetch CSV data from a URL. It is a straightforward HTTP library with enhanced error handling.

The get() function in this module retrieves the response from a URL, and the content can then be iterated line by line using the response’s iter_lines() method.

The retrieved data is then parsed using the csv.reader() function, which allows us to iterate through the rows. Here is an example.

import requests
import csv
import codecs
# url - it is a long URL, so we break it across lines using \ for better viewing
url = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/\
Annual-enterprise-survey-2021-financial-year-provisional/Download-data/\
annual-enterprise-survey-2021-financial-year-provisional-csv.csv"
# fetch page source using requests.get()
res = requests.get(url)
# create an iterator for all lines
lines_iterator = res.iter_lines()
# create a CSV reader object and decode the content using the codecs module
data = csv.reader(codecs.iterdecode(lines_iterator, encoding="utf-8"), delimiter=",")
# loop through the rows yielded by the "data" reader - note: the first row is the header
for index, row in enumerate(data):
    print(index, row)

Output (description):

41715 rows of data, including the header row
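The error handling mentioned earlier can be made explicit with a timeout and a status check. The sketch below is one possible arrangement, not the only one. The helper names parse_csv_lines and fetch_csv_rows are hypothetical, and the 10-second timeout is an arbitrary assumption.

```python
import codecs
import csv


def parse_csv_lines(byte_lines, encoding="utf-8"):
    """Decode an iterable of byte lines and parse it into a list of rows."""
    return list(csv.reader(codecs.iterdecode(byte_lines, encoding)))


def fetch_csv_rows(url, timeout=10):
    """Download a CSV file and return its rows (hypothetical helper)."""
    # Imported here so parse_csv_lines() stays usable without requests installed
    import requests

    res = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    res.raise_for_status()
    return parse_csv_lines(res.iter_lines())


# Offline check with made-up lines shaped like the output of iter_lines()
print(parse_csv_lines([b"year,value", b"2021,42"]))
```

Calling fetch_csv_rows(url) with the URL from the example above would return the same rows as the loop, as a list. Without raise_for_status(), a "404 Not Found" page would be parsed as if it were CSV data.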

Conclusion

This article covered three ways of loading CSV data from an online source. If you are loading data from a GitHub repository, use the URL for the raw content. For other sources, make sure to get the correct link as well. A simple way to test a link is to click it: if the CSV starts downloading, the link is correct, and you can copy it by right-clicking the link and choosing “Copy Link”.