The pandas and csv libraries are the most common modules for handling CSV data. This article will discuss how to use these libraries to update a CSV file.
We will see how to edit a CSV file by inserting a row, updating a row (or rows) based on a column value(s), or deleting a row (or rows) based on the values of a given column. We will also see how to insert a column into a CSV file.
The examples shown in this article will use a CSV file named “employees.csv” with the following contents:
ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 2006,Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,HR Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC
The file has six columns and five rows and is saved in the same directory as the script containing our code.
Note: Each code snippet shown will start with “employees.csv” containing the content above.
Inserting a Row into a CSV File
In csv library, we can easily write a row into a CSV file using csv.writer() object. In this case, we need to know the columns on the original data and their order.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
# Import csv library import csv # Open "employees.csv" file on append mode ("a") # It means the line that will be added will go to the last row. with open("employees.csv", "a") as infile: # Create a writer object for csv writer = csv.writer(infile) # Data we want to write to the CSV file line = [2008, "Procurement officer",6100,"2/8/2011","CA"] # Write the row the CSV file. # Note that this line updates the file in-place writer.writerow(line) |
Output (contents of employees.csv after update):
ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 2006,Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,HR Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC 2008,Procurement officer,6100,2/8/2011,CA
The above code snippet opens the CSV file in appending mode (“a”). That means the row we are writing is added as the last row.
Alternatively, you can use pandas to add a row to a CSV file. There are two ways to do that.
- Opening CSV file in append mode, then update it, or,
- Using pandas.DataFrame.loc[ ] function.
The first method works like open() and csv.writer() in the previous example.
We need to open the CSV file in appending mode and update it with a new DataFrame containing the row(s) we wish to add. Here is an example.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
import pandas as pd # Lines we want to add to the CSV file line = [[1901, "Procurement officer 2",6100,"12/8/2010","CA"], [1902, "HR",5700,"12/8/2010","DA"]] # Create a new DataFrmae with the new data we want to add to the CSV file # Note that we need to pass the lists matching the CSV file's columns. # Alternatively, we can load the CSV file as df and use the columns as columns = df.columns, that is, # df = pd.read_csv("employees.csv") # print(df.columns) df_new_data = pd.DataFrame(line, columns = ["ID","Position" ,"Salary" ,"Joining Date", "Grade"]) print(df_new_data) # Open employees.csv in append mode ("a"), and append the DataFrame with the new data # to the CSV file. This will replace the CSV file with the updated version. df_new_data.to_csv('employees.csv', mode='a', header=False, index=False) |
Output:
ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 2006,Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,HR Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC 1901,Procurement officer 2,6100,12/8/2010,CA 1902,HR,5700,12/8/2010,DA
In the second method, it is essential to get the concept of indexing correctly—the function pandas.DataFrame.loc uses the index to identify the position of the new row. If the index label does not exist, it is created; otherwise, the existing row values are replaced.
1 2 3 4 5 6 7 8 9 10 |
df = pd.read_csv("employees.csv", index_col="ID") # Adding row at a given index (ID) df.loc[1963] = ["Operations Manager",17000,"10/3/2007", "AS"] # Drop "ID" index df = df.reset_index() # Write the updated DataFrame into employees.csv. df.to_csv('employees.csv', index=False) |
Output:
ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 2006,Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,HR Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC 1963,Operations Manager,17000,10/3/2007,AS
Editing a Row Based on Conditions
This section discusses how to update a given row based on column values. Suppose we want to update the row where ID=1947 using csv module. We need to read the file row by row, update the row that matches the condition (ID=1947), and write the changes to the same file.
Reading and writing back the update to the same opened file is not recommended. Instead, it is a best practice to write the update into a temp file and use the contents of this file to update the original file after the update is complete. Here is a good example.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
from tempfile import NamedTemporaryFile import shutil import csv # Create a temporary file in write mode ("w")to hold new values tempfile = NamedTemporaryFile(mode="w", delete=False) # Fields are the columns for the CSV file we wish to update fields = ["ID", "Position", "Salary", "Joining Date", "Grade"] # Open the CSV file in read mode ("r"). # We will write into the temporary file first. with open("employees.csv", "r") as csvfile, tempfile: # Create reader and writer objects using csv library. reader = csv.DictReader(csvfile, fieldnames=fields) writer = csv.DictWriter(tempfile, fieldnames=fields) # Loop trow the rows of the CSV file for row in reader: # Based on the value of the "ID" column, create edits on the temporary file if row["ID"] == str(1947): print("Updating row with ID=", row["ID"]) # New data new_data = ["Managing Director", 22000, "9/24/2009"] # Update the dictionary data with the new data, new_data. row["Position"], row["Salary"], row["Joining Date"] = new_data # create a complete row of data to write into the temporary file row = { "ID": row["ID"], "Position": row["Position"], "Salary": row["Salary"], "Joining Date": row["Joining Date"], } # write the row with the new data writer.writerow(row) # Move the temporary file to the original CSV file. This replaces employees.csv. shutil.move(tempfile.name, "employees.csv") |
Output (stdout and the updated CSV file):
Updating row with ID= 1947 ID,Position,Salary,Joining Date, 2001,Junior web developer,5600,8/27/2020, 2006,Accountant,7200,7/22/2019, 3014,Senior Data Scientist,7800,5/1/2013, 1946,HR Manager,6200,9/9/2021, 1947,Manging Director,22000,9/24/2009,
And if you want to update a cell using pandas, then pandas.DataFrame.at should suffice. Here is an example.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
import pandas as pd # Reading the CSV file and set the index to the "ID" column df = pd.read_csv("employees.csv", index_col="ID") # updating a cell based on the index (ID) and column. df.at[2006, 'Position'] = 'Senior Accountant' # Reset inde to 0,1,2,... df = df.reset_index() # writing the changes into the file. df.to_csv("employees.csv", index=False) |
Output:
ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 2006,Senior Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,HR Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC
Deleting a Row in a CSV File Using pandas
We can delete a row in pandas by simply filtering a DataFrame based on some conditions. We need to load the CSV file into a pandas DataFrame, apply the filters, and then save the resulting DataFrame back to a CSV file. In the following example, we want to keep the rows with ID=1946 or 2001 only and discard the rest.
1 2 3 4 5 6 7 8 9 10 11 |
import pandas as pd df = pd.read_csv('employees.csv') # pick the rows with ID=1946 or 2001 df = df[df.ID.isin([1946, 2001])] # if you want to do the opposite, then use the negation operator (~), that is, # df[~df.ID.isin([1946, 2001])] # Write the updated DataFrame into employees.csv'. df.to_csv('employees.csv', index=False) |
Output: ID,Position,Salary,Joining Date,Grade 2001,Junior web developer,5600,8/27/2020,AB 1946,HR Manager,6200,9/9/2021,DB
Replace Cell Values in a CSV
The pandas.DataFrame.replace() function is used to replace cell values. The example below shows how to replace two values
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd # Reading the CSV file df = pd.read_csv("employees.csv") # Replace every occurrence of " HR Manager " with Human Resource Manager and "Junior web developer" with Web Developer # Note that this works at the cell level. For example, a cell with "HR Manager 3" will not # be replaced. Cells with " HR Manager " only are replaced. df = df.replace({"HR Manager":"Human Resource Manager", "Junior web developer": "Web Developer"}) # You can replace values on a column with this syntax: # df[column] = df[column].replace(...) # Write the results into the CSV file. df.to_csv("employees.csv", index=False) |
Output:
ID,Position,Salary,Joining Date,Grade 2001,Web Developer,5600,8/27/2020,AB 2006,Accountant,7200,7/22/2019,BA 3014,Senior Data Scientist,7800,5/1/2013,AA 1946,Human Resource Manager,6200,9/9/2021,DB 1947,HR Manager 3,6250,1/3/2016,DC
As mentioned above, the replace() function replaces cell values and not substrings of the cell values. That is why the cell value “HR Manager” is replaced, but the substring “HR Manager 3” in the last row is not.
Insert Column Based on Condition(s)
In this case, we want to create a new column with values derived based on another column’s values. In particular, we want to create a column called “Above7000” which has the value “Yes” if Salary>7000 and “No” otherwise. We use NumPy to leverage the vectorization of the computation property.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
import pandas as pd import numpy as np df = pd.read_csv("employees.csv") # State the conditions conditions = [df["Salary"]>7000, df["Salary"]<=7000] # Outputs to use for the conditions -one-to-one outputs = ["Yes", "No"] # Use np.select to create a new column based on the conditions stated. # If no condition is satisfied, the default value of "DDD" is used df['Above_7000'] = pd.Series(np.select(conditions, outputs, default='DDD')) df.to_csv("employees.csv", index=False) |
Output:
ID,Position,Salary,Joining Date,Grade,Above_7000 2001,Junior web developer,5600,8/27/2020,AB,No 2006,Accountant,7200,7/22/2019,BA,Yes 3014,Senior Data Scientist,7800,5/1/2013,AA,Yes 1946,HR Manager,6200,9/9/2021,DB,No 1947,HR Manager 3,6250,1/3/2016,DC,No
The numpy.select(conditions, choicelist, default=default_value) returns an array drawn from elements in choicelist, depending on conditions. If no condition is met, the function returns the default_value given, otherwise 0.
Conclusion
This article discussed different ways of updating a CSV file using csv and pandas modules.
In most cases, we had to load the original CSV file, make the updates, and write the updated data into the file. When dealing with a CSV with a few rows, most methods discussed here are fast enough, but as the CSV file grows, some methods become slow.
In such a case, storing data in a database and updating it from there is recommended.