Python NumPy and Pandas

This time, we’ll dive into two powerful Python libraries: NumPy and Pandas. You’ll see these libraries everywhere – they’re essential for data manipulation and scientific computing in Python.

They provide efficient data structures for numerical analysis of large datasets.

Introduction to NumPy

NumPy (Numerical Python) is the core library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

Why Use NumPy?

Offers highly efficient operations on arrays and matrices.
Provides built-in mathematical functions, making computations faster.
Uses less memory than Python lists for large numeric datasets.
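
As a rough illustration of the points above, the sketch below doubles the same 1,000 values once with a list comprehension and once with a single array expression, then compares how much memory each container uses. The variable names are made up for this example, and the exact byte count for the list depends on your platform and Python version.

```python
import sys
import numpy as np

values = list(range(1000))
array = np.arange(1000, dtype=np.int64)

# Pure Python: visit every element explicitly
doubled_list = [v * 2 for v in values]

# NumPy: one vectorized expression over the whole array
doubled_array = array * 2

# The array stores 8 bytes per element in one contiguous buffer
print(array.nbytes)  # 8000

# A list stores references, and every int is a separate Python object
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes)
```

Both approaches produce the same numbers; the difference is in how the work is done and how the data is stored.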

Installing NumPy

Before using NumPy, you need to install it. Use the following command:

pip install numpy

Creating Arrays in NumPy

NumPy’s primary data structure is the ndarray (n-dimensional array). You can create an array using the np.array() function.

Example:

import numpy as np

# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])

# Creating a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(array_1d)
print(array_2d)

Basic Array Operations

NumPy arrays support element-wise operations. You can add, subtract, multiply, and divide arrays directly.

Example:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise addition
print(a + b)

# Element-wise multiplication
print(a * b)

Common NumPy Functions

NumPy provides a bunch of useful functions for numerical analysis.

zeros() – Creates an array of all zeros:

   np.zeros((2, 3))

ones() – Creates an array of all ones:

   np.ones((3, 3))

arange() – Generates a sequence of numbers:

   np.arange(0, 10, 2)  # From 0 up to (but not including) 10 with a step of 2

linspace() – Generates evenly spaced numbers over a specified range:

   np.linspace(0, 1, 5)  # 5 evenly spaced numbers from 0 to 1, both endpoints included

reshape() – Gives an array a new shape without changing its data:

   array = np.arange(9)
   reshaped_array = array.reshape(3, 3)
   print(reshaped_array)

You can learn more in the official NumPy documentation.

Introduction to Pandas

Pandas is a library built on top of NumPy, designed for data manipulation and analysis. It introduces two key data structures:

Series: A one-dimensional labeled array.
DataFrame: A two-dimensional labeled array (like a table or spreadsheet).

Installing Pandas

Use the following command to install Pandas:

pip install pandas

Creating a Series

A Series is similar to a NumPy array but has labels (called the index) associated with each element.

Example:

import pandas as pd

# Creating a Series from a list
data = [10, 20, 30]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series)

Creating a DataFrame

A DataFrame is a two-dimensional table, where each column can have different data types.

Example:

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
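
To check the mixed column types for yourself, you can inspect the DataFrame's dtypes attribute. This small sketch rebuilds the same table so that it runs on its own; the exact dtype shown for text columns may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Each column carries its own data type
print(df.dtypes)

# Number of rows and columns
print(df.shape)  # (3, 3)
```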

Accessing Data in DataFrames

You can access data in DataFrames using column names or indices.

Example:

# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'Age']])

# Accessing rows by index
print(df.iloc[1])  # Second row (index 1)

Basic Data Manipulation with Pandas

Adding a New Column:

   df['Salary'] = [50000, 60000, 70000]
   print(df)

Filtering Rows:
You can filter rows based on column values.

   filtered_df = df[df['Age'] > 28]
   print(filtered_df)

Removing Rows or Columns:
Use the drop() method to remove rows or columns.

   df_without_age = df.drop(columns=['Age'])
   print(df_without_age)
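
The same method removes rows when you pass index labels instead of column names. Here is a minimal sketch, using a small stand-in table rather than the df built earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Remove the row whose index label is 1 (Bob)
df_without_bob = df.drop(index=[1])
print(df_without_bob)

# drop() returns a new DataFrame; the original still has 3 rows
print(len(df))  # 3
```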

Handling Missing Data:
Pandas has built-in methods for handling missing data:

   df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing values with the mean

Descriptive Statistics with Pandas

Pandas allows you to quickly compute summary statistics:

print(df.describe())  # Summary statistics for numerical columns

You can learn more in the official pandas documentation.

Combining NumPy and Pandas

NumPy and Pandas work well together. You can use NumPy functions to manipulate data inside a Pandas DataFrame.

Example:

# Adding a new column with NumPy operations
df['Double Age'] = np.array(df['Age']) * 2
print(df)

Using NumPy Arrays in Pandas DataFrames

You can also insert NumPy arrays directly into Pandas DataFrames.

Example:

random_data = np.random.rand(3, 2)
df['Random A'] = random_data[:, 0]
df['Random B'] = random_data[:, 1]
print(df)

Key Concepts Recap

We explored two fundamental Python libraries, NumPy and Pandas, that are widely used in data manipulation and scientific computing. You learned how to:

Create and manipulate NumPy arrays.
Use Pandas Series and DataFrames to store and process data.
Perform basic data manipulations using both libraries.

These skills form the foundation for working with data in Python. Next, we will build on this knowledge and introduce more advanced techniques in data analysis and visualization.

Exercises

NumPy Exercise:

Create a 3×3 NumPy array filled with random numbers between 0 and 1.
Find the sum of each row and each column.

Pandas Exercise:

Create a Pandas DataFrame with the following data:
- Name: ['John', 'Jane', 'Jim', 'Jill']
- Age: [22, 29, 34, 42]
- City: ['New York', 'London', 'Berlin', 'Tokyo']
Add a column called “Salary” with random values.
Filter out rows where the “Age” is less than 30.

FAQ

Q1: What is the difference between a NumPy array and a Python list?

A1: A NumPy array is more efficient for numerical operations than a Python list. Arrays store data of the same type and are optimized for calculations. Python lists can hold elements of different types but lack the optimized mathematical operations and are slower for large datasets.
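
A quick way to see the difference is to apply the same operator to both. In this sketch (names are illustrative), multiplication means repetition for a list but arithmetic for an array, and mixed numeric input is converted to one common type:

```python
import numpy as np

numbers = [1, 2, 3]
array = np.array(numbers)

# Multiplying a list repeats it
print(numbers * 2)  # [1, 2, 3, 1, 2, 3]

# Multiplying an array scales every element
print(array * 2)  # [2 4 6]

# Mixed int and float input is stored as a single common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```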

Q2: Why should I use Pandas if I can manipulate data using NumPy?

A2: Pandas adds functionality for structured data, like tables. It provides easy-to-use methods for:

Handling labeled data.
Dealing with missing values.
Filtering and manipulating data intuitively.
Pandas is more suited for real-world datasets, which often come in table formats with mixed data types (e.g., numbers, text).

Q3: How do I choose between a Pandas Series and a DataFrame?

A3: Use a Series for one-dimensional data (like a single column or list). Use a DataFrame for two-dimensional data (tables), where each column can have different data types.

Q4: What does `inplace=True` mean when using Pandas methods like `fillna()` or `drop()`?

A4: When you use inplace=True, the DataFrame is modified directly, without creating a new copy. If inplace=False (the default), a new DataFrame is returned, and the original remains unchanged.
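
The sketch below shows both behaviours side by side with drop() on a small stand-in DataFrame. Many people prefer the default form and simply reassign the result, since it makes the change explicit.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Default (inplace=False): a new DataFrame is returned
result = df.drop(columns=['Age'])
print(list(df.columns))      # ['Name', 'Age']
print(list(result.columns))  # ['Name']

# inplace=True: df itself is changed
df.drop(columns=['Age'], inplace=True)
print(list(df.columns))  # ['Name']
```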

Q5: Can I use NumPy functions directly on Pandas DataFrames?

A5: Yes, many NumPy functions work on Pandas DataFrames because Pandas is built on top of NumPy. You can apply functions like np.mean() or np.sum() to DataFrame columns.
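
For instance, assuming a small DataFrame with two numeric columns, aggregate functions return a single value while element-wise functions return a Series that keeps the original index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]})

# Aggregate functions on a column return a single value
print(np.mean(df['Age']))    # 30.0
print(np.sum(df['Salary']))  # 180000

# Element-wise functions return a Series with the same index
roots = np.sqrt(df['Age'])
print(roots)
```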

Q6: How do I handle missing data in a DataFrame?

A6: Pandas provides several methods:

fillna(): Replaces missing values with a specified value.
dropna(): Removes rows or columns with missing data.
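
Here is a minimal sketch of both methods on a made-up table with one missing age. It also uses isna() to count the gaps first, which is often a useful first step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25.0, np.nan, 35.0]
})

# Count missing values per column
print(df.isna().sum())

# Replace the missing age with the column mean (30.0)
filled = df.fillna({'Age': df['Age'].mean()})
print(filled)

# Or drop every row that contains a missing value
dropped = df.dropna()
print(dropped)
```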

Q7: How can I select multiple columns from a Pandas DataFrame?

A7: Pass a list of column names to the DataFrame:

df[['Name', 'Age']]

Q8: How do I filter rows in a DataFrame based on a condition?

A8: You can filter rows by applying conditions directly to columns. For example, to select rows where “Age” is greater than 30:

filtered_df = df[df['Age'] > 30]

Q9: What is the difference between `.loc[]` and `.iloc[]` in Pandas?

A9:

.loc[]: Accesses rows and columns by label (e.g., column names or row index labels).
.iloc[]: Accesses rows and columns by integer position, like with NumPy arrays.
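
The difference is easiest to see when the row labels are not plain integers. This sketch assumes a small DataFrame indexed by the letters 'a', 'b', and 'c':

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c']
)

# .loc uses labels: row 'b', column 'Age'
print(df.loc['b', 'Age'])  # 30

# .iloc uses positions: second row, second column
print(df.iloc[1, 1])  # 30

# Label slices include the end label; position slices exclude the end
print(len(df.loc['a':'b']))  # 2
print(len(df.iloc[0:1]))     # 1
```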

Q10: How do I generate random numbers in NumPy for my array?

A10: You can use the np.random module:

np.random.rand(): Generates random floats between 0 and 1.

np.random.randint(): Generates random integers within a range.
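
A short sketch of both functions follows. The seed value is arbitrary and only there so that repeated runs give the same numbers. Newer NumPy code often uses np.random.default_rng() instead, but the functions below match the ones named in this answer.

```python
import numpy as np

# Seed the global generator so results repeat between runs
np.random.seed(0)

# 2x3 array of floats in the half-open interval [0, 1)
floats = np.random.rand(2, 3)
print(floats)

# Five integers from 1 up to, but not including, 10
ints = np.random.randint(1, 10, size=5)
print(ints)
```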

Q11: Can I concatenate or merge two Pandas DataFrames?

A11: Yes, use:

pd.concat(): Concatenates DataFrames along rows or columns.
pd.merge(): Merges DataFrames on a common column, similar to SQL joins.
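
As a rough sketch with three made-up tables, concat() stacks rows that share the same columns, while merge() matches rows from two tables on a shared key:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
more_people = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})
cities = pd.DataFrame({'Name': ['Alice', 'Bob'],
                       'City': ['New York', 'Los Angeles']})

# Stack rows; ignore_index=True renumbers the result from 0
stacked = pd.concat([people, more_people], ignore_index=True)
print(stacked)

# Join on the shared 'Name' column (an inner join by default)
merged = pd.merge(people, cities, on='Name')
print(merged)
```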

Q12: How do I export a Pandas DataFrame to a CSV file?

A12: Use the to_csv() method:

df.to_csv('filename.csv', index=False)

Q13: How do I reshape a NumPy array?

A13: Use reshape() to change the array shape without altering data:

array = np.arange(12).reshape(3, 4)

Q14: Can I apply custom functions to a Pandas DataFrame column?

A14: Yes, use the apply() method to apply custom functions:

df['Age'] = df['Age'].apply(lambda x: x * 2)

Q15: How can I quickly get summary statistics of my Pandas DataFrame?

A15: Use describe() to get summary statistics (mean, count, standard deviation, etc.) for numerical columns:

df.describe()