Read Multiple Csv Files Into Separate Dataframes Python
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is accomplished using the "read_csv" function in Pandas:

    # Load the Pandas library with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your Python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first 5 lines of the loaded data
    data.head()

While this code seems simple, an understanding of the following key concepts is required to fully grasp and debug the operation of the data loading procedure if you run into issues:
- Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented within CSV files – if you open a CSV file, what does the data actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers determine how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
- project1.DOCX – a Microsoft Word file called Project1.
- shanes_file.TXT – a simple text file called shanes_file
- IMG_5673.JPG – An image file called IMG_5673.
- Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete list of extensions here.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
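To make the idea of a file extension concrete, here is a minimal sketch using the standard library's os.path.splitext function. The filenames are the ones from the examples above, and the csv_files variable is just an illustrative name.

```python
import os

# Filenames taken from the examples above.
filenames = ["project1.DOCX", "shanes_file.TXT", "IMG_5673.JPG", "data.csv"]

for filename in filenames:
    # splitext splits the name at the last dot, e.g. ("data", ".csv").
    name, extension = os.path.splitext(filename)
    print(name, extension.lower())

# A simple check for names that carry a CSV extension (case-insensitive).
csv_files = [f for f in filenames if os.path.splitext(f)[1].lower() == ".csv"]
print(csv_files)
```

Note that the extension is only part of the file's name; it does not guarantee what the file actually contains.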
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will take on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.
To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > In the menu, click Finder > Preferences, click Advanced, and select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / pressing enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data are shown in the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
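As a small illustration of this layout, the sketch below parses CSV text held in a Python string. The column names and values are made up for the example, and io.StringIO simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first row holds the column names,
# commas separate columns, and newlines separate rows.
csv_text = (
    "name,age,city\n"
    "Alice,34,Dublin\n"
    "Bob,27,Cork\n"
)

# read_csv accepts any file-like object, so StringIO can stand in for a file.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names taken from the first row
```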
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
The choice of the ',' comma character to delimit columns, however, is arbitrary, and it can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
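A minimal sketch of the sep parameter, assuming two made-up snippets of text that hold the same table with different delimiters:

```python
import io

import pandas as pd

# Hypothetical tab-separated content (what a .tsv file would contain).
tsv_text = "name\tscore\nAlice\t9.5\nBob\t7.0\n"

# The same hypothetical table, separated by semicolons instead.
ssv_text = "name;score\nAlice;9.5\nBob;7.0\n"

tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should produce the same two-column table.
print(tsv_data.equals(ssv_data))
```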
Delimiters in Text Fields – Quotechar
One complication in creating CSV files arises if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation mark ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.
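The behaviour can be sketched with a few lines of made-up data. The names and nicknames below are hypothetical; only the "NickName" column name comes from the example described above.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited content. The NickName field contains
# semicolons, so it is wrapped in double quotes.
text = (
    'Name;NickName;Age\n'
    'Robert;"Bob; Bobby; Rob";30\n'
    'Anna;"Annie";25\n'
)

data = pd.read_csv(io.StringIO(text), sep=";", quotechar='"')

print(data.shape)               # still three columns
print(data.loc[0, "NickName"])  # the quoted field is kept whole
```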
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.
Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module provides operating system dependent functionality to Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
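Building on os.listdir(), one possible way to read several CSV files into separate DataFrames is to loop over the directory listing and store each result in a dictionary keyed by filename. This is only a sketch: the temporary directory, the filenames, and the frames dictionary are all assumptions made so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

# Create a temporary directory with a few small, made-up files.
work_dir = tempfile.mkdtemp()
example_files = [
    ("sales.csv", "item,qty\npen,3\n"),
    ("stock.csv", "item,qty\npen,10\nink,4\n"),
    ("notes.txt", "not a csv file\n"),
]
for filename, text in example_files:
    with open(os.path.join(work_dir, filename), "w") as f:
        f.write(text)

# Load every .csv file in the directory into its own DataFrame.
frames = {}
for filename in sorted(os.listdir(work_dir)):
    if filename.lower().endswith(".csv"):
        frames[filename] = pd.read_csv(os.path.join(work_dir, filename))

print(list(frames))  # one entry per CSV file found
```

A dictionary keeps each DataFrame separate while still letting you look one up by its filename, e.g. frames["stock.csv"].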
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux)
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
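The difference between the two kinds of path can be checked with the standard library's os.path functions. In this sketch the path data/test_file.csv is the illustrative one from above, and the file does not need to exist for the path functions to work.

```python
import os

# Relative path to a file in a subdirectory of the working directory.
relative_path = os.path.join("data", "test_file.csv")

# abspath resolves a relative path against the current working directory.
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # relative paths are not absolute
print(os.path.isabs(absolute_path))
print(absolute_path)

# '..' moves up to the parent directory; normpath collapses it.
sibling_path = os.path.normpath(os.path.join("data", "..", "other.csv"))
print(sibling_path)
```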
4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors (reported as pandas.errors.ParserError in recent Pandas versions) can be caused in unusual circumstances to do with your data format – try adding the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
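The first two errors can be reproduced and handled in a few lines. This is a sketch under stated assumptions: the missing filename and the Latin-1 encoded file are made up for the example, and retrying with an explicit encoding argument is one possible fix alongside re-saving the file as UTF-8.

```python
import os
import tempfile

import pandas as pd

# 1. A missing file raises FileNotFoundError.
try:
    pd.read_csv("this_file_does_not_exist.csv")
    missing_file_error = None
except FileNotFoundError as error:
    missing_file_error = error
    print("Check the path and working directory:", os.getcwd())

# 2. A file saved in another encoding can raise UnicodeDecodeError.
# Write a made-up file in Latin-1 so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "latin1_example.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("name,city\nJosé,Málaga\n")

try:
    data = pd.read_csv(path)  # read_csv assumes UTF-8 by default
except UnicodeDecodeError:
    # Retry with the encoding the file was actually saved in.
    data = pd.read_csv(path, encoding="latin-1")

print(data.loc[0, "name"])
```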
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data types are inferred by examining the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters (date_parser is deprecated from Pandas 2.0 in favour of date_format).
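A short sketch of dtype and parse_dates together, using made-up data; the "joined" column is a hypothetical date column added for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up data for illustration.
text = "name,age,joined\nAlice,34,2019-03-01\nBob,27,2020-11-15\n"

data = pd.read_csv(
    io.StringIO(text),
    dtype={"name": str, "age": np.int32},  # explicit column types
    parse_dates=["joined"],                # parse this column as datetimes
)

print(data.dtypes)
```

Without the dtype argument, the age column would be read with Pandas' default integer type rather than np.int32.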
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful for taking a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Finally, the usecols parameter can be used to specify which columns in the data to load.
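The three parameters can be compared side by side on a small, made-up table (the column names and values are assumptions for the example):

```python
import io

import pandas as pd

# Made-up file content: a header row followed by five data rows.
text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n5,e,50\n"

# Read only the first two data rows.
sample = pd.read_csv(io.StringIO(text), nrows=2)

# Skip specific lines by index. Line 0 is the header, so lines 1 and 2
# are the first two data rows.
skipped = pd.read_csv(io.StringIO(text), skiprows=[1, 2])

# Load only two of the three columns.
subset = pd.read_csv(io.StringIO(text), usecols=["id", "score"])

print(sample["id"].tolist())
print(skipped["id"].tolist())
print(list(subset.columns))
```

Keep in mind that passing an int to skiprows counts lines from the very top of the file, so the header line is skipped too.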
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
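As a sketch of the effect, the made-up data below uses '.' and '??' as missing-value tokens alongside the standard 'NA':

```python
import io

import pandas as pd

# Made-up data with three different missing-value tokens.
text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

# With the defaults, only 'NA' is treated as missing.
default = pd.read_csv(io.StringIO(text))

# Custom tokens are added to the default list.
custom = pd.read_csv(io.StringIO(text), na_values=[".", "??"])

print(default["salary"].isna().sum())
print(custom["salary"].isna().sum())
```

With the custom tokens, the salary column can be parsed as numbers because no stray text values remain.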

    # Advanced CSV loading example
    data = pd.read_csv(
        "data/files/complex_data_example.tsv",     # relative python path to subdirectory
        sep='\t',                                  # Tab-separated value file.
        quotechar="'",                             # single quote allowed as quote character
        dtype={"salary": int},                     # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified.
        parse_dates=['birth_date'],                # Interpret the birth_date column as a date
        skiprows=10,                               # Skip the first 10 rows of the file
        na_values=['.', '??']                      # Take any '.' or '??' values as NA
    )

CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will come across as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor
- CSV files are quick to create and load into memory before analysis.
However, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
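The last point can be checked with a quick experiment. The sketch below uses gzip rather than zip (both are accepted by the compression argument of to_csv), and the table of numbers is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up, repetitive numeric data for illustration.
df = pd.DataFrame({"id": range(10000), "value": [i % 7 for i in range(10000)]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "numbers.csv")
gz_path = os.path.join(out_dir, "numbers.csv.gz")

# Write the same table as plain CSV and as gzip-compressed CSV.
df.to_csv(csv_path, index=False)
df.to_csv(gz_path, index=False, compression="gzip")

csv_size = os.path.getsize(csv_path)
gz_size = os.path.getsize(gz_path)
print(csv_size, gz_size)

# read_csv infers gzip compression from the ".gz" extension by default.
restored = pd.read_csv(gz_path)
print(restored.shape)
```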
As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the os module.
- Datacamp Tutorial on loading CSV files, including some additional Bone commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/