Pandas: Working with Structured Data in Python

After NumPy, the next natural step in scientific Python workflows is Pandas. While NumPy focuses on numerical arrays and mathematical operations, many real-world datasets are tabular, labeled, and heterogeneous. Pandas is designed specifically to work with such data.

Pandas is one of the most widely used third-party Python libraries in scientific computing, data science, and engineering, and it is often the first library that makes environment management unavoidable.

Why Pandas?#

Basic Python data structures such as lists and dictionaries are sufficient for small tasks. However, they become impractical when working with:

Large datasets
Tabular data (rows and columns)
Missing values
Mixed data types
Labeled data that must be filtered or aggregated

Pandas provides high-level abstractions that make these tasks simpler, safer, and more expressive.

Installing Pandas#

Pandas is a third-party package and must be installed in the active environment.

conda install pandas

As discussed in earlier chapters, this installs Pandas only in the currently active environment.

1. Import Pandas#

By convention, Pandas is imported as pd:

import pandas as pd

Core Pandas Data Structures#

Pandas introduces two fundamental data structures: Series and DataFrame.

2. Series#

A Series is a one-dimensional array with an associated index.

s = pd.Series([10, 20, 30, 40])
s

0    10
1    20
2    30
3    40
dtype: int64

You can explicitly define the index:

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2

a    10
b    20
c    30
dtype: int64

A Series behaves similarly to a labeled NumPy array.
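The comparison with a labeled NumPy array can be made concrete with a minimal sketch. It reuses the s2 Series defined above; the variable names value_b, doubled, and values are illustrative only.

```python
import pandas as pd

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Access a value by its label rather than by its position
value_b = s2["b"]

# Arithmetic is applied element-wise, as with a NumPy array,
# and the labels are carried along in the result
doubled = s2 * 2

# The underlying values can be extracted as a plain NumPy array
values = s2.to_numpy()

print(value_b)
print(doubled)
print(values)
```

The labels stay attached to the result of the arithmetic, which is the main practical difference from a bare NumPy array.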

3. DataFrame#

A DataFrame is a two-dimensional table consisting of rows and columns. Each column is a Series.

Creating a DataFrame from a dictionary

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
}

df = pd.DataFrame(data)
df

	name	age	score
0	Alice	25	85
1	Bob	30	90
2	Charlie	35	95

4. Inspecting a DataFrame#

Before doing any analysis, inspect the DataFrame's structure and data types. The operations below show its first and last rows, column names, dimensions, and per-column types.

df.head()

	name	age	score
0	Alice	25	85
1	Bob	30	90
2	Charlie	35	95

df.tail()

	name	age	score
0	Alice	25	85
1	Bob	30	90
2	Charlie	35	95

df.columns

Index(['name', 'age', 'score'], dtype='object')

df.shape

(3, 3)

df.dtypes

name     object
age       int64
score     int64
dtype: object

5. Reading Data from Files (CSV)#

If you have a CSV file on disk, you can load it with pd.read_csv.

Update the path so that it points to a CSV file in your project.

df_from_csv = pd.read_csv("data.csv")
df_from_csv.head()

	name	age	score
0	Alice	25	85.0
1	Bob	30	90.0
2	Charlie	35	95.0
3	Diana	28	NaN

How Pandas Thinks About a DataFrame (Objects, Attributes, and Methods)#

When data is loaded using Pandas, the result is not just a table of values. It is a Python object with its own attributes and methods.

For example:

df = pd.read_csv("data.csv")

The variable df now refers to a DataFrame object.

DataFrames Are Python Objects#

In Python, objects have:

Attributes → information stored in the object
Methods → functions that act on the object

Pandas follows this object-oriented design strictly.

You interact with a DataFrame using dot notation (.).

DataFrame Attributes (No Parentheses)#

Attributes describe properties of the DataFrame. They are accessed without parentheses.

Examples:

df.columns
df.shape
df.dtypes
df.index

RangeIndex(start=0, stop=4, step=1)

These return information such as:

Column names
Number of rows and columns
Data types of each column
Row labels

Rule of thumb:

If you are asking for information, it is usually an attribute.

DataFrame Methods (With Parentheses)#

Methods are functions attached to the DataFrame object. They are called with parentheses.

Examples:

df.head()
df.describe()
df[["age", "score"]].mean()
df.dropna()

	name	age	score
0	Alice	25	85.0
1	Bob	30	90.0
2	Charlie	35	95.0

These perform actions such as:

Displaying part of the data
Computing statistics
Transforming or cleaning data

Rule of thumb:

If something does work or changes data, it is usually a method.

Why Parentheses Matter#

This is a very common beginner mistake:

df.shape()   # ❌ incorrect
df.shape     # ✅ correct

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 df.shape()   # ❌ incorrect
      2 df.shape     # ✅ correct

TypeError: 'tuple' object is not callable

Explanation:

df.shape is an attribute (stored information)
df.head() is a method (a function call)

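Adding parentheses to an attribute raises a TypeError, as shown above. Omitting them from a method raises no error, but returns the method object itself instead of its result.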
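The difference between an attribute and a method can be checked directly. The following is a small sketch using a made-up two-row DataFrame; the variable names are illustrative.

```python
import pandas as pd

# Small made-up DataFrame, for illustration only
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# An attribute is stored information: no call is needed
shape = df.shape            # a tuple: (rows, columns)

# A method must be called to do its work
first_rows = df.head()      # a DataFrame

# Without parentheses you get the method object itself, not its result
not_called = df.head

print(shape)
print(type(first_rows).__name__)
print(callable(not_called))   # the method object can still be called later
print(callable(shape))        # a tuple cannot be called
```

If a notebook cell displays something like "bound method" instead of a table, the parentheses are most likely missing.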

How read_csv Fits This Model#

pd.read_csv() is not a DataFrame method.

df = pd.read_csv("data.csv")

Here:

pd.read_csv() is a function provided by the Pandas module
It creates a new DataFrame object
After this, all interaction happens through df

What Pandas Does Automatically When Reading Files#

When you load data from a CSV file, Pandas automatically:

Parses column names from the header
Infers data types for each column
Assigns a default index
Represents missing values using NaN

All of this information becomes part of the DataFrame object and is accessible through its attributes and methods.
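These automatic steps can be observed without a file on disk. The sketch below assumes CSV contents similar to the data.csv example and passes them to pd.read_csv through io.StringIO, which acts as a stand-in for a real file; the text in csv_text is made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (assumed contents, for illustration only)
csv_text = """name,age,score
Alice,25,85
Bob,30,90
Charlie,35,95
Diana,28,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())   # column names taken from the header row
print(df.dtypes)             # data types inferred per column
print(df.index)              # default index: 0, 1, 2, 3
print(df["score"].isna())    # the empty field is read as a missing value
```

Because one score is missing, the score column is stored as floating point rather than integer, which is why the values appear as 85.0, 90.0, and so on.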

Seeing This in Practice#

You can inspect the DataFrame immediately after loading:

df.columns
df.dtypes
df.head()
df.shape

This should always be your first step after reading a file.

Mental Model to Remember#

A Pandas DataFrame is not just data. It is a Python object that knows about its structure and provides methods to work with it.

Once students understand this model:

.loc, .iloc, .mean(), .dropna() make sense
Errors become easier to debug
Pandas feels consistent rather than magical

6. Selecting Columns#

Select one column (returns a Series) or multiple columns (returns a DataFrame).

df["age"]

df[["name", "score"]]
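The return type depends on whether you pass a single label or a list of labels. A short sketch, rebuilding the example DataFrame so that it runs on its own (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

one_column = df["age"]                # single label -> Series
one_column_df = df[["age"]]           # list with one label -> DataFrame
two_columns = df[["name", "score"]]   # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(two_columns.columns.tolist())
```

Note that the double brackets are not special syntax: the outer pair performs the selection, and the inner pair is an ordinary Python list of column names.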

7. Selecting Rows#

Use .iloc for position-based selection and .loc for label-based selection.

df.iloc[0]      # Select by position: first row

df.iloc[0:2]    # Select by position: first two rows

df.loc[0]       # Select by label
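With the default integer index, positions and labels happen to coincide, which can hide the difference between the two. The sketch below illustrates it in two ways: slicing behaves differently, and the distinction becomes obvious once the index holds names. Using the name column as the index is an assumption made here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

# Position-based slices exclude the end; label-based slices include it
two_rows = df.iloc[0:2]     # positions 0 and 1
three_rows = df.loc[0:2]    # labels 0, 1, and 2

# With a non-integer index, labels and positions are clearly different things
labeled = df.set_index("name")
bob_by_label = labeled.loc["Bob"]      # row whose label is "Bob"
bob_by_position = labeled.iloc[1]      # row at position 1

print(len(two_rows), len(three_rows))
print(bob_by_label.equals(bob_by_position))
```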

8. Filtering Rows by Conditions#

Filtering is one of the most useful Pandas features.

df[df["age"] > 30]     # Filter by one condition

df[(df["age"] > 25) & (df["score"] >= 90)]      # Filter by multiple conditions
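A filter works in two steps: the comparison produces a boolean Series (a mask), and indexing with that mask keeps the rows where it is True. The sketch below makes the mask explicit and also shows the "or" (|) and "not" (~) operators; variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

# Step 1: the comparison yields one True/False value per row
mask = df["age"] > 30

# Step 2: indexing with the mask keeps only the True rows
older = df[mask]

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
either = df[(df["age"] > 30) | (df["score"] < 90)]
not_older = df[~mask]

print(mask.tolist())
print(older["name"].tolist())
print(either["name"].tolist())
print(not_older["name"].tolist())
```

Python's and, or, and not keywords do not work element-wise on a Series, which is why the symbolic operators are used instead.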

9. Basic Statistics and Descriptive Analysis#

Pandas provides built-in methods for quick exploration.

df.mean(numeric_only=True)

df.describe()

df["score"].mean(), df["score"].max()    # Individual columns

10. Handling Missing Data#

Missing values are common in real datasets. Pandas represents missing numeric values as NaN; other markers, such as None, NaT, or pd.NA, can appear for some data types.

Here we create a small example with missing values and demonstrate typical operations. Understanding how missing data is handled is critical for correct analysis.

df_missing = df.copy()
df_missing.loc[1, "score"] = None
df_missing

df_missing.isna()

df_missing.dropna()

df_missing.fillna(0)
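Two details are easy to miss here: dropna and fillna return new objects and leave the original unchanged, and filling with 0 is only one possible choice. The sketch below uses a small made-up table and fills with the column mean instead; this is an illustration of the mechanics, not a recommendation for any particular dataset.

```python
import pandas as pd

# Made-up data with one missing score, for illustration only
df_missing = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "score": [85.0, None, 95.0],
    }
)

# Count missing values in a column
n_missing = df_missing["score"].isna().sum()

# dropna returns a new DataFrame without the incomplete rows
dropped = df_missing.dropna()

# mean skips missing values by default, so it can be used as a fill value
mean_score = df_missing["score"].mean()
filled_scores = df_missing["score"].fillna(mean_score)

print(n_missing)
print(len(dropped))
print(filled_scores.tolist())
print(df_missing["score"].isna().sum())   # the original is unchanged
```

To keep the result, assign it to a variable, as done above with dropped and filled_scores.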

11. Modifying Data#

Create new columns by assignment, and update existing values using .loc.

df2 = df.copy()
df2["passed"] = df2["score"] >= 60
df2

You can update existing values:

df2.loc[df2["score"] < 60, "passed"] = False
df2
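One consequence of this pattern is that a derived column is a snapshot: it is not recomputed when the source column changes. The sketch below uses made-up scores, including one below 60 so that the effect is visible.

```python
import pandas as pd

# Made-up scores, including one below the threshold, for illustration only
df2 = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "score": [85, 55, 95],
    }
)

# Create a derived column by assignment
df2["passed"] = df2["score"] >= 60
passed_before = df2["passed"].tolist()

# Update one value with .loc: rows are chosen by condition, the column by label
df2.loc[df2["name"] == "Bob", "score"] = 65

# The derived column still holds the old result
passed_stale = df2["passed"].tolist()

# Recompute it explicitly to bring it up to date
df2["passed"] = df2["score"] >= 60
passed_after = df2["passed"].tolist()

print(passed_before)
print(passed_stale)
print(passed_after)
```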

12. Jupyter Tip: Verify Which Python Executable Runs This Notebook#

This is helpful for confirming that your notebook is using the intended Conda environment.

import sys
sys.executable

Pandas and Environment Management#

Pandas depends on NumPy and system-level libraries. This means:

Version mismatches can cause installation issues
Different projects may require different Pandas versions
Isolated environments are strongly recommended

Pandas is often the first library that exposes students to real dependency management challenges.
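When a version mismatch is suspected, a quick first check is to print the interpreter path together with the installed versions. This is a minimal sketch; it only reports what the current environment contains.

```python
import sys

import numpy as np
import pandas as pd

# Which Python interpreter is running this code
print(sys.executable)

# Which library versions are installed in that interpreter's environment
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```

Running this in both the terminal and the notebook shows whether they use the same environment.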

Common Beginner Mistakes#

Installing Pandas in one environment and running code in another
Treating DataFrames like Python lists
Forgetting that selecting a single column returns a Series, not a DataFrame
Ignoring missing values
Modifying data unintentionally due to chained indexing

Most Pandas errors are conceptual rather than syntactic.
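The chained-indexing mistake from the list above deserves a concrete illustration. In the sketch below, the problematic form is shown only as a comment, because its exact behavior depends on the Pandas version; the single .loc call is the form that reliably modifies the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35], "score": [85, 90, 95]})

# Chained indexing performs two separate selections, so the assignment
# may be applied to a temporary copy instead of df:
#
#     df[df["age"] > 30]["score"] = 100
#
# Depending on the Pandas version, this emits a warning and typically
# leaves df unchanged.

# A single .loc call selects the rows and the column in one step
df.loc[df["age"] > 30, "score"] = 100

print(df["score"].tolist())
```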

Summary#

Use Series for 1D labeled data
Use DataFrame for 2D tabular data
Inspect with head, shape, dtypes
Select with [], .iloc, .loc
Filter with boolean conditions
Handle missing values with isna, dropna, fillna