Pandas: Working with Structured Data in Python#
After NumPy, the next natural step in scientific Python workflows is Pandas. While NumPy focuses on numerical arrays and mathematical operations, many real-world datasets are tabular, labeled, and heterogeneous. Pandas is designed specifically to work with such data.
Pandas is one of the most widely used third-party Python libraries in scientific computing, data science, and engineering, and it is often the first library that makes environment management unavoidable.
Why Pandas?#
Basic Python data structures such as lists and dictionaries are sufficient for small tasks. However, they become impractical when working with:
Large datasets
Tabular data (rows and columns)
Missing values
Mixed data types
Labeled data that must be filtered or aggregated
Pandas provides high-level abstractions that make these tasks simpler, safer, and more expressive.
Installing Pandas#
Pandas is a third-party package and must be installed in the active environment.
# conda install pandas
As discussed in earlier chapters, this installs Pandas only in the currently active environment.
1. Import Pandas#
By convention, Pandas is imported as pd:
import pandas as pd
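A quick way to confirm that the import works in the active environment is to print the installed version. This is only a minimal check; the variable name `version` is chosen for illustration.

```python
import pandas as pd

# If this runs without an ImportError, Pandas is available
# in the environment that is executing the code.
version = pd.__version__
print(version)
```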
Core Pandas Data Structures#
Pandas introduces two fundamental data structures: Series and DataFrame.
2. Series#
A Series is a one-dimensional array with an associated index.
s = pd.Series([10, 20, 30, 40])
s
0 10
1 20
2 30
3 40
dtype: int64
You can explicitly define the index:
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2
a 10
b 20
c 30
dtype: int64
A Series behaves similarly to a labeled NumPy array.
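The comparison with a labeled NumPy array can be sketched as follows. The Series name `prices` and the other variable names are illustrative and do not appear elsewhere in this chapter.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = prices["b"]         # access by index label
by_position = prices.iloc[1]   # access by integer position
doubled = prices * 2           # element-wise arithmetic, index is preserved
as_array = prices.to_numpy()   # the values as a plain NumPy array

print(by_label, by_position)
print(doubled)
print(as_array)
```

Arithmetic works element by element as it does in NumPy, but the labels travel with the values.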
3. DataFrame#
A DataFrame is a two-dimensional table consisting of rows and columns. Each column is a Series.
Creating a DataFrame from a dictionary
data = {
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"score": [85, 90, 95],
}
df = pd.DataFrame(data)
df
| | name | age | score |
|---|---|---|---|
| 0 | Alice | 25 | 85 |
| 1 | Bob | 30 | 90 |
| 2 | Charlie | 35 | 95 |
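To see that each column really is a Series, you can pull one out and inspect it. This is a small sketch that rebuilds the same example table so it can run on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

age = df["age"]

print(type(age).__name__)          # Series
print(age.name)                    # age
print(age.index.equals(df.index))  # True: the column shares the row index
```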
4. Inspecting a DataFrame#
Before doing any analysis, inspect the dataset's structure and data types with the operations below.
df.head()
| | name | age | score |
|---|---|---|---|
| 0 | Alice | 25 | 85 |
| 1 | Bob | 30 | 90 |
| 2 | Charlie | 35 | 95 |
df.tail()
| | name | age | score |
|---|---|---|---|
| 0 | Alice | 25 | 85 |
| 1 | Bob | 30 | 90 |
| 2 | Charlie | 35 | 95 |
df.columns
Index(['name', 'age', 'score'], dtype='object')
df.shape
(3, 3)
df.dtypes
name object
age int64
score int64
dtype: object
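With only three rows, `head()` and `tail()` return the whole table, because both default to five rows. The sketch below passes an explicit row count and also calls `info()`, which prints a compact summary of columns, non-null counts, and dtypes. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

first_two = df.head(2)  # first two rows
last_one = df.tail(1)   # last row only

print(first_two)
print(last_one)

# Prints a summary of the DataFrame (columns, non-null counts, dtypes).
df.info()
```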
5. Reading Data from Files (CSV)#
If you have a CSV file on disk, you can load it with pd.read_csv.
Update the path to point to your own file before running this in your project.
df_from_csv = pd.read_csv("data.csv")
df_from_csv.head()
| | name | age | score |
|---|---|---|---|
| 0 | Alice | 25 | 85.0 |
| 1 | Bob | 30 | 90.0 |
| 2 | Charlie | 35 | 95.0 |
| 3 | Diana | 28 | NaN |
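If you do not have a `data.csv` at hand, the following sketch writes a small table to a temporary file and reads it back. The temporary directory and the names `original` and `csv_path` are assumptions made for this example; only `to_csv` and `read_csv` are the point.

```python
import tempfile
from pathlib import Path

import pandas as pd

original = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "age": [25, 30, 35, 28],
        "score": [85.0, 90.0, 95.0, None],  # None becomes NaN
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "data.csv"
    # index=False avoids writing the row labels as an extra column.
    original.to_csv(csv_path, index=False)
    df_from_csv = pd.read_csv(csv_path)

print(df_from_csv.head())
```

The missing score is written as an empty field and comes back as `NaN` when the file is read.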
How Pandas Thinks About a DataFrame (Objects, Attributes, and Methods)#
When data is loaded using Pandas, the result is not just a table of values. It is a Python object with its own attributes and methods.
For example:
df = pd.read_csv("data.csv")
The variable df now refers to a DataFrame object.
DataFrames Are Python Objects#
In Python, objects have:
Attributes → information stored in the object
Methods → functions that act on the object
Pandas follows this object-oriented design strictly.
You interact with a DataFrame using dot notation (.).
DataFrame Attributes (No Parentheses)#
Attributes describe properties of the DataFrame. They are accessed without parentheses.
Examples:
df.columns
df.shape
df.dtypes
df.index
RangeIndex(start=0, stop=4, step=1)
These return information such as:
Column names
Number of rows and columns
Data types of each column
Row labels
Rule of thumb:
If you are asking for information, it is usually an attribute.
DataFrame Methods (With Parentheses)#
Methods are functions attached to the DataFrame object. They are called with parentheses.
Examples:
df.head()
df.describe()
df[["age", "score"]].mean()
df.dropna()
| | name | age | score |
|---|---|---|---|
| 0 | Alice | 25 | 85.0 |
| 1 | Bob | 30 | 90.0 |
| 2 | Charlie | 35 | 95.0 |
These perform actions such as:
Displaying part of the data
Computing statistics
Transforming or cleaning data
Rule of thumb:
If something does work or changes data, it is usually a method.
Why Parentheses Matter#
This is a very common beginner mistake:
df.shape() # ❌ incorrect
df.shape # ✅ correct
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[15], line 1
----> 1 df.shape() # ❌ incorrect
2 df.shape # ✅ correct
TypeError: 'tuple' object is not callable
Explanation:
df.shape is an attribute (stored information)
df.head() is a method (a function call)
Adding parentheses to an attribute raises a TypeError, as shown above, while omitting them from a method returns the method object instead of calling it.
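When you are unsure whether a name is an attribute or a method, Python's built-in `callable()` can tell you. This is just one possible check, shown on a small illustrative DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35]})

print(callable(df.shape))  # False: a plain tuple, use it without parentheses
print(callable(df.head))   # True: a method, call it with parentheses

# The attribute can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```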
How read_csv Fits This Model#
pd.read_csv() is not a DataFrame method.
df = pd.read_csv("data.csv")
Here:
pd.read_csv() is a function provided by the Pandas module
It creates a new DataFrame object
After this, all interaction happens through df
What Pandas Does Automatically When Reading Files#
When you load data from a CSV file, Pandas automatically:
Parses column names from the header
Infers data types for each column
Assigns a default index
Represents missing values using NaN
All of this information becomes part of the DataFrame object and is accessible through its attributes and methods.
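The four automatic steps can be observed on a small in-memory CSV. The sketch below uses `io.StringIO` in place of a file on disk, which is an assumption made so that the example is self-contained; the CSV text mirrors the table shown earlier.

```python
import io

import pandas as pd

csv_text = """name,age,score
Alice,25,85
Bob,30,90
Charlie,35,95
Diana,28,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))             # column names taken from the header row
print(df.dtypes)                    # inferred data type of each column
print(df.index)                     # default integer index
print(df["score"].isna().tolist())  # the empty field was read as NaN
```

Note that `score` is inferred as a floating-point column: the missing value is stored as `NaN`, which is a float.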
Seeing This in Practice#
You can inspect the DataFrame immediately after loading:
df.columns
df.dtypes
df.head()
df.shape
This should always be your first step after reading a file.
Mental Model to Remember#
A Pandas DataFrame is not just data; it is a Python object that knows about its structure and provides methods to work with it.
Once students understand this model:
.loc, .iloc, .mean(), .dropna() make sense
Errors become easier to debug
Pandas feels consistent rather than magical
6. Selecting Columns#
Select one column (returns a Series) or multiple columns (returns a DataFrame).
df["age"]
df[["name", "score"]]
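The difference between single and double brackets is easy to miss, so here is a minimal sketch that checks the returned types. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

ages = df["age"]                # single brackets: a Series
ages_frame = df[["age"]]        # a list with one name: a one-column DataFrame
subset = df[["name", "score"]]  # a list of names: a DataFrame

print(type(ages).__name__, type(ages_frame).__name__)
print(subset)
```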
7. Selecting Rows#
Use .iloc for position-based selection and .loc for label-based selection.
df.iloc[0] # Select one row by position
df.iloc[0:2] # Select the first two rows by position
df.loc[0] # Select one row by label
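With the default integer index, positions and labels coincide, which hides the difference between `.iloc` and `.loc`. The sketch below uses string labels (an assumption made for illustration) to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "score": [85, 90, 95]},
    index=["alice", "bob", "charlie"],
)

first_row = df.iloc[0]               # row at position 0
first_two = df.iloc[0:2]             # positions 0 and 1 (end is excluded)
by_label = df.loc["bob"]             # row with label "bob"
label_slice = df.loc["alice":"bob"]  # label slices include both endpoints
one_value = df.loc["charlie", "score"]  # row label and column label

print(first_two)
print(label_slice)
print(one_value)
```

Position slices follow the usual Python rule and exclude the end, whereas label slices with `.loc` include it.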
8. Filtering Rows by Conditions#
Filtering is one of the most useful Pandas features.
df[df["age"] > 30] # Filter by one condition
df[(df["age"] > 25) & (df["score"] >= 90)] # Filter by multiple conditions
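A condition such as `df["age"] > 25` produces a boolean Series, often called a mask, and indexing with it keeps the rows where it is True. The following sketch shows a few common ways to build and combine masks; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

mask = df["age"] > 25  # boolean Series, one value per row
older = df[mask]

# Each condition needs its own parentheses when combined.
either = df[(df["age"] < 30) | (df["score"] >= 95)]  # | means "or"
not_older = df[~mask]                                # ~ negates a mask
selected = df[df["name"].isin(["Alice", "Bob"])]     # membership test

print(older)
print(either)
```

Use `&`, `|`, and `~` rather than the Python keywords `and`, `or`, and `not`, which do not work element by element on a Series.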
9. Basic Statistics and Descriptive Analysis#
Pandas provides built-in methods for quick exploration.
df.mean(numeric_only=True)
df.describe()
df["score"].mean(), df["score"].max() # Individual columns
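One detail worth knowing is that these statistics skip missing values by default. The sketch below rebuilds the four-row table with one missing score to illustrate this; the variable names are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "age": [25, 30, 35, 28],
        "score": [85.0, 90.0, 95.0, None],
    }
)

mean_score = df["score"].mean()    # NaN is skipped: (85 + 90 + 95) / 3
n_scores = df["score"].count()     # counts only non-missing values
column_means = df.mean(numeric_only=True)
summary = df.describe()            # numeric columns only by default

print(mean_score, n_scores)
print(column_means)
print(summary)
```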
10. Handling Missing Data#
Missing values are common in real datasets. Pandas represents missing values as NaN in many cases.
Here we create a small example with missing values and demonstrate typical operations. Understanding how missing data is handled is critical for correct analysis.
df_missing = df.copy()
df_missing.loc[1, "score"] = None
df_missing
df_missing.isna()
df_missing.dropna()
df_missing.fillna(0)
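A slightly fuller sketch of the same operations is shown below: counting missing values per column and filling with the column mean instead of 0. Filling with the mean is only one possible strategy, used here as an assumption for illustration.

```python
import pandas as pd

df_missing = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "age": [25, 30, 35, 28],
        "score": [85.0, 90.0, 95.0, None],
    }
)

# Number of missing values in each column.
missing_per_column = df_missing.isna().sum()

# Both calls return new DataFrames; df_missing itself is unchanged.
dropped = df_missing.dropna()
filled = df_missing.fillna({"score": df_missing["score"].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```

Because `dropna` and `fillna` return new objects, assign the result to a variable if you want to keep it.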
11. Modifying Data#
Create new columns and update values using .loc.
df2 = df.copy()
df2["passed"] = df2["score"] >= 60
df2
You can update existing values:
df2.loc[df2["score"] < 60, "passed"] = False
df2
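In the example data every score is above 60, so the conditional update has no visible effect. The sketch below uses a failing score (an assumption made for illustration, as is the `grade` column) to show the update in action.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [55, 90, 95]}
)

# New column from a boolean comparison.
df2["passed"] = df2["score"] >= 60

# New column with a default value, then overwrite only the matching rows.
df2["grade"] = "pass"
df2.loc[df2["score"] < 60, "grade"] = "fail"

# Updating a value does not refresh columns that were derived from it.
df2.loc[df2["name"] == "Alice", "score"] = 60
passed_before_refresh = df2["passed"].tolist()

# Recompute the derived column explicitly.
df2["passed"] = df2["score"] >= 60

print(df2)
```

Derived columns are ordinary data, not formulas, so they must be recomputed after the values they depend on change.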
12. Jupyter Tip: Verify Which Python Executable Runs This Notebook#
This is helpful for confirming that your notebook is using the intended Conda environment.
import sys
sys.executable
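The same idea extends to the library itself: you can check which copy of Pandas was imported. This is a rough sketch; the printed paths depend entirely on your machine, and the variable name `pandas_location` is illustrative.

```python
import sys
from pathlib import Path

import pandas as pd

pandas_location = Path(pd.__file__).parent

print("Python executable:", sys.executable)
print("Environment prefix:", sys.prefix)
print("Pandas loaded from:", pandas_location)
```

If the Pandas location lies outside the environment prefix, the notebook is probably not using the environment you intended.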
Pandas and Environment Management#
Pandas depends on NumPy and system-level libraries. This means:
Version mismatches can cause installation issues
Different projects may require different Pandas versions
Isolated environments are strongly recommended
Pandas is often the first library that exposes students to real dependency management challenges.
Common Beginner Mistakes#
Installing Pandas in one environment and running code in another
Treating DataFrames like Python lists
Forgetting that selecting a single column returns a Series, not a DataFrame
Ignoring missing values
Modifying data unintentionally due to chained indexing
Most Pandas errors are conceptual rather than syntactic.
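The chained indexing mistake deserves a concrete illustration. In the sketch below the problematic form is left as a comment, because its behavior depends on the Pandas version: it may emit a warning and may leave `df` unchanged. The `.loc` form and the explicit `.copy()` are the usual ways to make the intent unambiguous.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

# Chained indexing: the assignment targets an intermediate result,
# so the original DataFrame may not be modified.
# df[df["age"] > 30]["score"] = 0

# A single .loc call selects rows and column together and updates df itself.
df.loc[df["age"] > 30, "score"] = 0

# To work on a subset independently, take an explicit copy first.
subset = df[df["age"] <= 30].copy()
subset["score"] = 100

print(df)
print(subset)
```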
Summary#
Use Series for 1D labeled data
Use DataFrame for 2D tabular data
Inspect with head, shape, dtypes
Select with [], .iloc, .loc
Filter with boolean conditions
Handle missing values with isna, dropna, fillna