Python provides capabilities to read CSV data directly from the web. This post shows how to load a CSV file from a URL in three ways: using the pandas, urllib, and requests packages in Python.
Method A: Loading CSV from URL using Pandas
The read_csv() function can read CSV files directly from an online source. In the following example, the function loads CSV data from GitHub and stores it in a DataFrame df.
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
Output (description):
A DataFrame of 891 rows and 12 columns.
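To confirm that the data loaded as expected, a quick sanity check is to print the DataFrame's shape and preview the first few rows. This is a minimal sketch using the df loaded above and standard pandas attributes and methods.

print(df.shape)   # expected: (891, 12)
print(df.head())  # preview the first five rows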
When reading data from GitHub, make sure you use the raw URL. For example, if you try loading the data directly from https://github.com/datasciencedojo/datasets/blob/master/titanic.csv, the download will fail. When you land on such a URL, click the “Raw” button at the top of the file preview to open the raw content; the resulting address is the correct URL, like the one used in the example above (see the sketch below for a programmatic shortcut).
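If you only have the regular GitHub page URL, you can also rewrite it into the raw form programmatically. The helper below is a hypothetical sketch (not part of pandas or GitHub's API); it simply swaps the github.com host for raw.githubusercontent.com and drops the /blob/ path segment.

def to_raw_github_url(blob_url):
    # hypothetical helper: convert a GitHub "blob" page URL to its raw-content URL
    return blob_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

print(to_raw_github_url("https://github.com/datasciencedojo/datasets/blob/master/titanic.csv"))
# https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv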
CSV stands for Comma-Separated Values, but the values of a CSV file are not always comma-delimited; other characters, such as a semicolon (“;”) or a tab (“\t”), are sometimes used instead. If you attempt to load a CSV that is not comma-separated, pass the “sep” argument to the read_csv() function. For example,
import pandas as pd

# CSV that is not comma-delimited
df1 = pd.read_csv("https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv")
Output:
ParserError: Error tokenizing data.
import pandas as pd

# data is semicolon-separated
df1 = pd.read_csv("https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv", sep=";")
Output (description):
A DataFrame of 78 rows and 16 columns.
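If you do not know the delimiter in advance, pandas can also infer it: passing sep=None together with engine="python" makes read_csv() use Python's csv.Sniffer to detect the separator. This is slower than specifying the delimiter explicitly, so treat the snippet below as a convenience sketch for exploratory work.

import pandas as pd

# let pandas sniff the delimiter (requires the slower Python parsing engine)
df1 = pd.read_csv(
    "https://perso.telecom-paristech.fr/eagan/class/igr204/data/cereal.csv",
    sep=None,
    engine="python",
)
print(df1.shape)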
Method B: Reading CSV from URL using urllib
Python’s urllib package is used to interact with and fetch URLs over various protocols. To connect to a URL and read its contents, use the urllib.request.urlopen() function.
Once the response is received, we can utilize the csv.reader() function to parse the received content. The reader allows us to iterate through the CSV row by row.
# Load packages
from urllib.request import urlopen
import csv
import codecs

# the URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# fetch the source using urlopen()
response = urlopen(url)

# parse the fetched data using csv.reader()
# codecs.iterdecode() decodes the byte response into strings
csvfile = csv.reader(codecs.iterdecode(response, "utf-8"))

# loop through the rows
# enumerate() lets us index the iterable
for index, row in enumerate(csvfile):
    print(index, row)  # do something with row - note: the first row is the header
Output (description):
892 rows printed (891 data rows plus the header row).
Note that the first row is the header in most cases.
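If you want to process only the data rows, a common pattern is to pull the header off the reader with next() before looping. Here is a small, self-contained sketch of that approach using the same Titanic URL.

from urllib.request import urlopen
import csv
import codecs

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
reader = csv.reader(codecs.iterdecode(urlopen(url), "utf-8"))

# pull the header off the iterator, then loop over the data rows only
header = next(reader)
print(header)  # column names

for row in reader:
    print(row)  # each row is a list of strings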
Method C: Use requests and csv to load CSV from an Online Source
Like urllib, the requests module can fetch CSV data from a URL. It is a straightforward HTTP library with enhanced error handling.
The get() function in this module retrieves the response from a URL, and the response content can then be iterated line by line using the iter_lines() method. The fetched lines are then parsed using the csv.reader() function, which allows us to iterate through the rows. Here is an example.
import requests
import csv
import codecs

# the URL - it is long, so we break it across lines using \ for better viewing
url = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/\
Annual-enterprise-survey-2021-financial-year-provisional/Download-data/\
annual-enterprise-survey-2021-financial-year-provisional-csv.csv"

# fetch the page source using requests.get()
res = requests.get(url)

# create an iterator over all lines of the response
lines_iterator = res.iter_lines()

# create a CSV reader object, decoding the content using the codecs module
data = csv.reader(codecs.iterdecode(lines_iterator, encoding="utf-8"), delimiter=",")

# loop through the rows in "data"
for index, row in enumerate(data):
    print(index, row)  # iterate through rows - note: the first row is the header
Output (description):
41715 rows of data, including the header row
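Since requests is noted above for its error handling, it is worth mentioning that a response object exposes a status_code attribute and a raise_for_status() method, which raises an HTTPError for 4xx/5xx responses. A minimal sketch of guarding the download before parsing:

import requests

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

res = requests.get(url)
res.raise_for_status()  # raises requests.HTTPError if the server returned 4xx/5xx

print(res.status_code)  # 200 on success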
Conclusion
This article covered three ways of loading CSV data from an online source. If you are loading data from a GitHub repository, use the URL for the raw content. For other sources, make sure you have the correct link as well. A simple way to test a link is to click it: if the CSV starts downloading, right-click the link and choose “Copy Link” to get the URL to use in your code.