Reading and Writing CSV Files in Python
CSV File Introduction
CSV (Comma Separated Values) is a simple, universal file format that is widely used for importing and exporting data in applications (databases, spreadsheets, etc.) and for data exchange between heterogeneous systems. Because CSV is a plain text file, regardless of the operating system and programming language, plain text can be processed, and many programming languages provide support for reading and writing CSV files. Therefore, the CSV format is widely used in data processing and data science.
CSV files have the following characteristics:
- Plain text, using a certain character set (such as ASCII, Unicode, GB2312, etc.);
- Composed of records (typically one record per line);
- Each record is separated into fields (columns) by delimiters (such as commas, semicolons, tabs, etc.);
- Each record has the same field sequence.
CSV files can be opened and edited using text editors or tools like Excel spreadsheets. When opening CSV files with Excel spreadsheets, you may not even feel the difference between CSV and Excel files. Many database systems support exporting data to CSV files and, of course, also support reading data from CSV files and saving it to databases, but these are not the focus of our current discussion.
Writing Data to CSV Files
Suppose we have five students’ exam scores for three courses that need to be saved to a CSV file. To achieve this goal, we can use the csv module in Python’s standard library. The writer function of this module returns a csvwriter object. Through this object’s writerow or writerows methods, we can write data to a CSV file. The specific code is shown below.
import csv
import random
with open('scores.csv', 'w') as file:
writer = csv.writer(file)
writer.writerow(['Name', 'Chinese', 'Math', 'English'])
names = ['Guan Yu', 'Zhang Fei', 'Zhao Yun', 'Ma Chao', 'Huang Zhong']
for name in names:
scores = [random.randrange(50, 101) for _ in range(3)]
scores.insert(0, name)
writer.writerow(scores)
Content of the generated CSV file.
Name,Chinese,Math,English
Guan Yu,98,86,61
Zhang Fei,86,58,80
Zhao Yun,95,73,70
Ma Chao,83,97,55
Huang Zhong,61,54,87
It should be noted that in addition to passing in the file object to write data to, the writer function above can also take a dialect parameter, which represents the CSV file’s dialect, with a default value of excel. In addition, you can specify the delimiter (default is comma), the character that surrounds values (default is double quotes), and the surrounding method through the delimiter, quotechar, and quoting parameters. Among them, the character that surrounds values is mainly used when there are special symbols in fields. By adding surrounding characters, ambiguity can be avoided. You can try modifying line 5 of the above code to the following code and then check the generated CSV file.
writer = csv.writer(file, delimiter='|', quoting=csv.QUOTE_ALL)
Content of the generated CSV file.
"Name"|"Chinese"|"Math"|"English"
"Guan Yu"|"88"|"64"|"65"
"Zhang Fei"|"76"|"93"|"79"
"Zhao Yun"|"78"|"55"|"76"
"Ma Chao"|"72"|"77"|"68"
"Huang Zhong"|"70"|"72"|"51"
Reading Data from CSV Files
If you want to read the CSV file just created, you can use the following code. The reader function of the csv module can create a csvreader object, which is an iterator. You can read the data in the file through the next function or a for-in loop.
import csv
with open('scores.csv', 'r') as file:
reader = csv.reader(file, delimiter='|')
for data_list in reader:
print(reader.line_num, end='\t')
for elem in data_list:
print(elem, end='\t')
print()
Note: When doing a
forloop on thecsvreaderobject in the above code, each time it takes out a list object, which contains all the fields in a row.
Summary
In the future, if you use Python for data analysis, you will most likely use a third-party library called pandas, which is one of Python’s data analysis tools. pandas encapsulates functions named read_csv and to_csv for reading and writing CSV files. Among them, read_csv will turn the read data into a DataFrame object, and DataFrame is the most important type in the pandas library. It encapsulates a series of methods for data processing (cleaning, transformation, aggregation, etc.). The to_csv function writes data from a DataFrame object to a CSV file, completing data persistence. The read_csv and to_csv functions are far more powerful than the native csvreader and csvwriter.