2.3. Importing data into DataFrames and exploring its attributes#
pandas provides many functions to import data into dataframes, such as read_csv() to read delimited text files, or read_excel() for Excel or OpenDocument spreadsheets. read_csv() provides options that allow you to filter the data, such as specifying the separator/delimiter, the lines that form the headers, which rows to skip, etc. Let’s analyze the mineral_properties.txt. Below a screenshot of it:
below we import the .txt:
we indicate that the separator is the comma
"sep=','"we indicate the header (what should be the columns names) is in the second line
"header=[1]"we indicate to not skip any rows
"skiprows=None"we indicate the first column should be the index of the rows
"index_col=0"
import pandas as pd
file_location = ("mineral_properties.txt")
df4 = pd.read_csv(file_location, sep=',', header=[1],
skiprows=None, index_col=0)
df4
| hardness | sp. gr. | cleavage | |
|---|---|---|---|
| name | |||
| Amphibole | 5.50 | 2.800 | Two |
| Biotite | 2.75 | 3.000 | One |
| Calcite | 3.00 | 2.720 | Three |
| Dolomite | 3.00 | 2.850 | Three |
| Feldspars | 6.00 | 2.645 | Two |
| Garnet | 7.00 | 3.900 | Fracture |
| Graphite | 1.50 | 2.300 | One |
| Kyanite | 6.00 | 4.010 | One |
| Muscovite | 2.25 | 2.930 | One |
| Pyroxene | 5.50 | 3.325 | Two |
| Quartz | 7.00 | 2.650 | Fracture |
| Sillimanite | 6.50 | 3.230 | One |
Note that if we try to call any of the columns from df4 we will get an error.
df4['hardness']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'hardness'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[67], line 1
----> 1 df4['hardness']
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\frame.py:4090, in DataFrame.__getitem__(self, key)
4088 if self.columns.nlevels > 1:
4089 return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
4091 if is_integer(indexer):
4092 indexer = [indexer]
File c:\Users\gui-win10\AppData\Local\pypoetry\Cache\virtualenvs\learn-python-rwbttIgo-py3.11\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 'hardness'
Do you know why?
Answer …
In case you were not able to answer the above question, let’s look into df4.columns
df4.columns
Index([' hardness', ' sp. gr.', ' cleavage'], dtype='object')
You see there are spaces at the beginning of each column name… this happens because that’s how people usually type, with commas followed by a space. We could use the skipinitialspace = True from the pd.read_csv() function to avoid this. Let’s try it out:
df4 = pd.read_csv(file_location + 'mineral_properties.txt',sep=',',header=[1],
skiprows=None, index_col=0, skipinitialspace=True)
print(df4.columns)
Index(['hardness', 'sp. gr.', 'cleavage'], dtype='object')
Ok, much better!