Pandas: Working with Structured Data in Python#

After NumPy, the next natural step in scientific Python workflows is Pandas. While NumPy focuses on numerical arrays and mathematical operations, many real-world datasets are tabular, labeled, and heterogeneous. Pandas is designed specifically to work with such data.

Pandas is one of the most widely used third-party Python libraries in scientific computing, data science, and engineering, and it is often the first library that makes environment management unavoidable.

Why Pandas?#

Basic Python data structures such as lists and dictionaries are sufficient for small tasks. However, they become impractical when working with:

  • Large datasets

  • Tabular data (rows and columns)

  • Missing values

  • Mixed data types

  • Labeled data that must be filtered or aggregated

Pandas provides high-level abstractions that make these tasks simpler, safer, and more expressive.

Installing Pandas#

Pandas is a third-party package and must be installed in the active environment.

# conda install pandas

As discussed in earlier chapters, this installs Pandas only in the currently active environment.
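One quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check, and the version string you see depends on your own environment.

```python
import pandas as pd

# If this import succeeds, pandas is available in the active environment.
version = pd.__version__
print(version)
```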

1. Import Pandas#

By convention, Pandas is imported as pd:

import pandas as pd

Core Pandas Data Structures#

Pandas introduces two fundamental data structures: Series and DataFrame.

2. Series#

A Series is a one-dimensional array with an associated index.

s = pd.Series([10, 20, 30, 40])
s
0    10
1    20
2    30
3    40
dtype: int64

You can explicitly define the index:

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2
a    10
b    20
c    30
dtype: int64

A Series behaves similarly to a labeled NumPy array.
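The comparison with a labeled NumPy array can be sketched as follows. The variable names other than s2 are illustrative.

```python
import pandas as pd

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Arithmetic is vectorized, as with a NumPy array.
doubled = s2 * 2

# Values can be looked up by label or by position.
by_label = s2["b"]         # label-based lookup
by_position = s2.iloc[1]   # position-based lookup

# The underlying values are available as a NumPy array.
values = s2.to_numpy()

print(doubled)
print(by_label, by_position, values)
```

The index labels are preserved through the arithmetic, which is the main practical difference from a plain NumPy array.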

3. DataFrame#

A DataFrame is a two-dimensional table consisting of rows and columns. Each column is a Series.

Creating a DataFrame from a dictionary

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
}

df = pd.DataFrame(data)
df
      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95
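The statement that each column is a Series can be checked directly. The sketch below rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
})

# Selecting one column gives back a Series named after that column.
age = df["age"]
print(type(age).__name__)
print(age.name)

# The column shares the row index of the DataFrame it came from.
print(age.index.equals(df.index))
```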

4. Inspecting a DataFrame#

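Before doing any analysis, inspect the DataFrame's structure and data types. head() and tail() show the first and last rows (five by default), while columns, shape, and dtypes report the column names, the dimensions, and the type of each column.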

df.head()
      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95
df.tail()
      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95
df.columns
Index(['name', 'age', 'score'], dtype='object')
df.shape
(3, 3)
df.dtypes
name     object
age       int64
score     int64
dtype: object

5. Reading Data from Files (CSV)#

If you have a CSV file on disk, you can load it with pd.read_csv.

Update the path so that it points to a CSV file in your project.

df_from_csv = pd.read_csv("data.csv")
df_from_csv.head()
      name  age  score
0    Alice   25   85.0
1      Bob   30   90.0
2  Charlie   35   95.0
3    Diana   28    NaN
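If you do not have a data.csv file at hand, the same call can be tried on in-memory text. The CSV contents below are an assumption, written to match the table shown above; io.StringIO simply makes a string behave like an open file.

```python
import io

import pandas as pd

# Hypothetical file contents; the last row has an empty score field.
csv_text = """name,age,score
Alice,25,85
Bob,30,90
Charlie,35,95
Diana,28,
"""

# read_csv accepts a file path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv.head())
```

Because one score is missing, the whole score column is read as floating point, which is why the values display as 85.0 rather than 85.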

How Pandas Thinks About a DataFrame (Objects, Attributes, and Methods)#

When data is loaded using Pandas, the result is not just a table of values. It is a Python object with its own attributes and methods.

For example:

df = pd.read_csv("data.csv")

The variable df now refers to a DataFrame object.

DataFrames Are Python Objects#

In Python, objects have:

  • Attributes → information stored in the object

  • Methods → functions that act on the object

Pandas follows this object-oriented design strictly.

You interact with a DataFrame using dot notation (.).

DataFrame Attributes (No Parentheses)#

Attributes describe properties of the DataFrame. They are accessed without parentheses.

Examples:

df.columns
df.shape
df.dtypes
df.index
RangeIndex(start=0, stop=4, step=1)

These return information such as:

  • Column names

  • Number of rows and columns

  • Data types of each column

  • Row labels

Rule of thumb:

If you are asking for information, it is usually an attribute.
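As a small illustration, attribute values can be stored and used like any other Python value. The DataFrame here is built by hand (an assumption, mirroring the four-row example) so that the snippet runs without a file.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "score": [85.0, 90.0, 95.0, None],
})

n_rows, n_cols = df.shape          # a plain tuple: (rows, columns)
column_names = list(df.columns)    # Index converted to a list
row_labels = list(df.index)        # default RangeIndex: 0, 1, 2, 3

print(n_rows, n_cols)
print(column_names)
print(row_labels)
print(df.dtypes)
```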

DataFrame Methods (With Parentheses)#

Methods are functions attached to the DataFrame object. They are called with parentheses.

Examples:

df.head()
df.describe()
df[["age", "score"]].mean()
df.dropna()
      name  age  score
0    Alice   25   85.0
1      Bob   30   90.0
2  Charlie   35   95.0

These perform actions such as:

  • Displaying part of the data

  • Computing statistics

  • Transforming or cleaning data

Rule of thumb:

If something does work or changes data, it is usually a method.
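One detail worth seeing in code: methods such as dropna() return a new object by default and leave the original DataFrame unchanged. The sketch below uses a hand-built DataFrame (an assumption) with one missing score.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "score": [85.0, 90.0, 95.0, None],
})

# dropna() returns a new DataFrame without the incomplete row.
cleaned = df.dropna()

# mean() returns a Series with one value per selected column.
means = df[["age", "score"]].mean()

print(len(df), len(cleaned))
print(means)
```

To keep the result of a method, assign it to a name, as done with cleaned above.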

Why Parentheses Matter#

This is a very common beginner mistake:

df.shape()   # ❌ incorrect
df.shape     # ✅ correct
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 df.shape()   # ❌ incorrect
      2 df.shape     # ✅ correct

TypeError: 'tuple' object is not callable

Explanation:

  • df.shape is an attribute (stored information)

  • df.head() is a method (a function call)

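Adding parentheses to an attribute raises a TypeError, as in the traceback above. Omitting them from a method does not raise an error, but it returns the method object itself instead of calling it.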
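Both directions of the mistake can be reproduced in a few lines. The names not_called, called, and message are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35]})

# Without parentheses the method is not run; you get the method object.
not_called = df.head
print(callable(not_called))

# With parentheses the method runs and returns a DataFrame.
called = df.head()
print(type(called).__name__)

# df.shape is a tuple, and a tuple cannot be called like a function.
try:
    df.shape()
except TypeError as exc:
    message = str(exc)
    print(message)
```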

How read_csv Fits This Model#

pd.read_csv() is not a DataFrame method.

df = pd.read_csv("data.csv")

Here:

  • pd.read_csv() is a function provided by the Pandas module

  • It creates a new DataFrame object

  • After this, all interaction happens through df

What Pandas Does Automatically When Reading Files#

When you load data from a CSV file, Pandas automatically:

  • Parses column names from the header

  • Infers data types for each column

  • Assigns a default index

  • Represents missing values using NaN

All of this information becomes part of the DataFrame object and is accessible through its attributes and methods.
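Each of these automatic steps can be observed on a tiny example. The CSV text is hypothetical and is read from memory rather than from disk.

```python
import io

import pandas as pd

# Hypothetical CSV text; the last score field is left empty.
csv_text = "name,age,score\nAlice,25,85\nBob,30,90\nDiana,28,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))     # the header row became the column names
print(df.dtypes)            # a type was inferred for each column
print(df.index)             # a default RangeIndex was assigned
print(df["score"].isna())   # the empty field was stored as NaN
```

Note that the single missing value is enough to make score a floating-point column, while age stays an integer column.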

Seeing This in Practice#

You can inspect the DataFrame immediately after loading:

df.columns
df.dtypes
df.head()
df.shape

This should always be your first step after reading a file.

Mental Model to Remember#

A Pandas DataFrame is not just data; it is a Python object that knows about its structure and provides methods to work with it.

Once students understand this model:

  • .loc, .iloc, .mean(), .dropna() make sense

  • Errors become easier to debug

  • Pandas feels consistent rather than magical

6. Selecting Columns#

Select one column (returns a Series) or multiple columns (returns a DataFrame).

df["age"]
df[["name", "score"]]
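The difference between the two forms comes down to the brackets: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry. A short sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
})

one = df["age"]                # a name -> Series
many = df[["name", "score"]]   # a list of names -> DataFrame
one_as_frame = df[["age"]]     # a one-item list still gives a DataFrame

print(type(one).__name__, type(many).__name__, type(one_as_frame).__name__)
```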

7. Selecting Rows#

Use .iloc for position-based selection and .loc for label-based selection.

df.iloc[0]      # Select the first row by position
df.iloc[0:2]    # Select the first two rows by position
df.loc[0]       # Select the row whose label is 0
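With the default index, positions and labels are both 0, 1, 2, so .iloc and .loc look interchangeable. The difference becomes visible once the index holds other labels. The sketch below assumes a hypothetical string index for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "score": [85, 90, 95]},
    index=["alice", "bob", "charlie"],  # hypothetical row labels
)

first_by_position = df.iloc[0]       # first row, whatever its label
first_by_label = df.loc["alice"]     # the row labeled "alice"

# Position slices exclude the end point; label slices include it.
two_by_position = df.iloc[0:2]          # positions 0 and 1
two_by_label = df.loc["alice":"bob"]    # labels "alice" through "bob"

# .loc also accepts a column label to reach a single cell.
cell = df.loc["bob", "score"]

print(two_by_position)
print(two_by_label)
print(cell)
```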

8. Filtering Rows by Conditions#

Filtering is one of the most useful Pandas features.

df[df["age"] > 30]                            # Filter by one condition
df[(df["age"] > 25) & (df["score"] >= 90)]    # Filter by multiple conditions
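Under the hood, a condition produces a boolean Series (often called a mask), and indexing with it keeps the rows where the mask is True. The sketch below also shows | for "or" and ~ for "not". Each condition is wrapped in parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
})

mask = df["age"] > 25          # boolean Series, one value per row
older = df[mask]               # rows where the mask is True
not_older = df[~mask]          # ~ inverts the mask

# | combines conditions with "or".
either = df[(df["age"] < 30) | (df["score"] >= 95)]

print(mask.tolist())
print(older["name"].tolist())
print(either["name"].tolist())
```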

9. Basic Statistics and Descriptive Analysis#

Pandas provides built-in methods for quick exploration.

df.mean(numeric_only=True)
df.describe()
df["score"].mean(), df["score"].max()    # Individual columns
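When a column contains missing values, these statistics skip them by default. The sketch below, on a hand-built DataFrame (an assumption), shows the default behavior next to the stricter skipna=False option.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "score": [85.0, 90.0, 95.0, None],
})

mean_default = df["score"].mean()               # NaN is skipped
mean_strict = df["score"].mean(skipna=False)    # NaN propagates
n_valid = df["score"].count()                   # non-missing values only

# describe() summarizes the numeric columns.
summary = df.describe()

print(mean_default, mean_strict, n_valid)
print(summary)
```

Comparing count() with the number of rows is a simple way to notice that a statistic was computed on fewer values than expected.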

10. Handling Missing Data#

Missing values are common in real datasets. Pandas represents missing numeric values as NaN; some data types use other markers, such as NaT for dates and times.

Here we create a small example with missing values and demonstrate typical operations. Understanding how missing data is handled is critical for correct analysis.

df_missing = df.copy()
df_missing.loc[1, "score"] = None
df_missing
df_missing.isna()
df_missing.dropna()
df_missing.fillna(0)
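Filling with 0 is not always appropriate, since it changes the column's statistics. As one alternative among several, the sketch below fills the gap with the column mean and restricts dropna to a chosen column. The data is illustrative.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85.0, None, 95.0],
})

# Count the missing entries in one column.
n_missing = df_missing["score"].isna().sum()

# Drop rows only when the score itself is missing.
dropped = df_missing.dropna(subset=["score"])

# Fill the score column with its mean; a dict targets specific columns.
filled = df_missing.fillna({"score": df_missing["score"].mean()})

print(n_missing)
print(dropped)
print(filled)
```

Both dropna and fillna return new DataFrames here, so df_missing still contains the missing value afterwards.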

11. Modifying Data#

Create new columns and update values using .loc.

df2 = df.copy()
df2["passed"] = df2["score"] >= 60
df2

You can update existing values:

df2.loc[df2["score"] < 60, "passed"] = False
df2
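The general pattern is df.loc[row_condition, column] = value, which changes only the rows that match. The sketch below uses made-up scores so that one student starts below the pass mark.

```python
import pandas as pd

df2 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [55, 90, 95],   # illustrative values
})

df2["passed"] = df2["score"] >= 60
passed_before = df2["passed"].tolist()

# Update the score for the matching row only.
df2.loc[df2["name"] == "Alice", "score"] = 65

# Derived columns are not recomputed automatically, so refresh it.
df2["passed"] = df2["score"] >= 60
passed_after = df2["passed"].tolist()

print(passed_before)
print(passed_after)
```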

12. Jupyter Tip: Verify Which Python Executable Runs This Notebook#

This is helpful for confirming that your notebook is using the intended Conda environment.

import sys
sys.executable
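A slightly fuller check also prints where the imported Pandas package lives. If Pandas was installed into the active environment, its path typically sits under the environment root reported by sys.prefix, though the exact layout varies by platform.

```python
import sys

import pandas as pd

print(sys.executable)   # the Python interpreter running this code
print(sys.prefix)       # root directory of the active environment
print(pd.__file__)      # location of the imported pandas package
```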

Pandas and Environment Management#

Pandas depends on NumPy and system-level libraries. This means:

  • Version mismatches can cause installation issues

  • Different projects may require different Pandas versions

  • Isolated environments are strongly recommended

Pandas is often the first library that exposes students to real dependency management challenges.

Common Beginner Mistakes#

  • Installing Pandas in one environment and running code in another

  • Treating DataFrames like Python lists

  • Forgetting that selecting a single column returns a Series, not a DataFrame

  • Ignoring missing values

  • Modifying data unintentionally due to chained indexing

    Most Pandas errors are conceptual rather than syntactic.
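Two of these mistakes can be illustrated briefly. Indexing with df[0] looks for a column named 0 rather than the first row, and chained indexing may assign to a temporary copy. The sketch below shows the recommended single .loc call; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35], "score": [85, 90, 95]})

# List-style indexing fails: there is no column labeled 0.
try:
    df[0]
except KeyError:
    list_style_failed = True

# Chained form (avoid): df[df["age"] > 30]["score"] = 100
# The first [] may return a copy, so the assignment can be lost.

# A single .loc call names the rows and the column together,
# so the assignment applies to df itself.
df.loc[df["age"] > 30, "score"] = 100

print(list_style_failed)
print(df)
```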

Summary#

  • Use Series for 1D labeled data

  • Use DataFrame for 2D tabular data

  • Inspect with head, shape, dtypes

  • Select with [], .iloc, .loc

  • Filter with boolean conditions

  • Handle missing values with isna, dropna, fillna