
Python NumPy and Pandas

This time, we’ll dive into two powerful Python libraries: NumPy and Pandas. You’ll see these libraries everywhere – they’re essential for data manipulation and scientific computing in Python.

They provide efficient data structures for numerical analysis of large datasets.

Introduction to NumPy

NumPy (Numerical Python) is the core library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

Why Use NumPy?

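  • Offers highly efficient operations on arrays and matrices.
  • Provides built-in mathematical functions that run in compiled code, making computations faster than equivalent Python loops.
  • Stores elements of a single type in one contiguous block, reducing memory overhead compared to Python lists.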
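To make the efficiency point concrete, here is a small sketch that squares the same numbers with a Python list and with a NumPy array. The sample data and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
values = list(range(1_000))

# Pure-Python approach: visit every element in a loop
squares_list = [v * v for v in values]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(values)
squares_arr = arr * arr

print(squares_arr[:5])  # the first five squares
print(arr.nbytes)       # bytes used by the array's data buffer
```

Both approaches give the same numbers. The NumPy version simply hands the loop to compiled code, which tends to matter more as the data grows.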

Installing NumPy

Before using NumPy, you need to install it. Use the following command:

pip install numpy

Creating Arrays in NumPy

NumPy’s primary data structure is the ndarray (n-dimensional array). You can create an array using the np.array() function.

Example:

import numpy as np

# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])

# Creating a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(array_1d)
print(array_2d)
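Once an array exists, a few attributes describe its layout. The sketch below rebuilds the 2D array from the example and inspects it.

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(array_2d.shape)  # (2, 3): 2 rows, 3 columns
print(array_2d.ndim)   # 2
print(array_2d.size)   # 6
print(array_2d.dtype)  # an integer dtype; the exact width can vary by platform
print(array_2d[1, 2])  # 6 (row 1, column 2, counting from 0)
```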

Basic Array Operations

NumPy arrays support element-wise operations. You can add, subtract, multiply, and divide arrays directly.

Example:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise addition
print(a + b)

# Element-wise multiplication
print(a * b)
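Subtraction and division follow the same element-wise pattern, and a single number is applied to every element. A short sketch using the same two arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

difference = b - a  # values 3, 3, 3
quotient = b / a    # values 4.0, 2.5, 2.0 (division always yields floats)
scaled = a * 10     # values 10, 20, 30 (the scalar is applied to each element)

print(difference)
print(quotient)
print(scaled)
```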

Common NumPy Functions

NumPy provides a bunch of useful functions for numerical analysis.

  1. zeros() – Creates an array of all zeros:
   np.zeros((2, 3))
  2. ones() – Creates an array of all ones:
   np.ones((3, 3))
  3. arange() – Generates a sequence of numbers:
   np.arange(0, 10, 2)  # From 0 up to (but not including) 10 with a step of 2
  4. linspace() – Generates evenly spaced numbers over a specified range:
   np.linspace(0, 1, 5)  # 5 evenly spaced numbers from 0 to 1, both ends included
  5. reshape() – Gives an array a new shape without changing its data:
   array = np.arange(9)
   reshaped_array = array.reshape(3, 3)
   print(reshaped_array)
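Here is a runnable sketch that calls these functions together and shows what comes back. The variable names are only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))
evens = np.arange(0, 10, 2)
points = np.linspace(0, 1, 5)
grid = np.arange(9).reshape(3, 3)

print(zeros.shape)      # (2, 3)
print(evens.tolist())   # [0, 2, 4, 6, 8] (the stop value 10 is excluded)
print(points.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0] (both ends included)
print(grid.shape)       # (3, 3)
```

Note the difference between the two sequence functions: arange() takes a step and leaves out the stop value, while linspace() takes a count and includes both ends by default.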

You can learn more about NumPy here.

Introduction to Pandas

Pandas is a library built on top of NumPy, designed for data manipulation and analysis. It introduces two key data structures:

  • Series: A one-dimensional labeled array.
  • DataFrame: A two-dimensional labeled array (like a table or spreadsheet).

Installing Pandas

Use the following command to install Pandas:

pip install pandas

Creating a Series

A Series is similar to a NumPy array but has labels (called the index) associated with each element.

Example:

import pandas as pd

# Creating a Series from a list
data = [10, 20, 30]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series)
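The labels are what set a Series apart from a plain array. This sketch looks up values by label and by position, then filters and scales the same Series.

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(series['b'])     # 20 (lookup by label)
print(series.iloc[0])  # 10 (lookup by position)

# Filtering keeps the labels of the surviving elements: 'b' and 'c'
large = series[series > 15]
print(large)

# Arithmetic is element-wise, just like a NumPy array
doubled = series * 2
print(doubled)
```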

Creating a DataFrame

A DataFrame is a two-dimensional table, where each column can have different data types.

Example:

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
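After building a DataFrame it is worth checking its shape and column types. A quick sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.shape)             # (3, 3): 3 rows, 3 columns
print(df.columns.tolist())  # ['Name', 'Age', 'City']
print(df.dtypes)            # Age is an integer column; Name and City hold text
print(df.head(2))           # the first two rows
```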

Accessing Data in DataFrames

You can access data in DataFrames using column names or indices.

Example:

# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'Age']])

# Accessing rows by index
print(df.iloc[1])  # Second row (index 1)

Basic Data Manipulation with Pandas

  1. Adding a New Column:
   df['Salary'] = [50000, 60000, 70000]
   print(df)
  2. Filtering Rows:
    You can filter rows based on column values.
   filtered_df = df[df['Age'] > 28]
   print(filtered_df)
  3. Removing Rows or Columns:
    Use the drop() method to remove rows or columns.
   df_without_age = df.drop(columns=['Age'])
   print(df_without_age)
  4. Handling Missing Data:
    Pandas has built-in methods for handling missing data:
   df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing values with the mean and assign the result back
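The sample DataFrame has no gaps, so the missing-data step is easier to see on data that does. The sketch below uses a hypothetical table with one missing age and assigns the filled column back, which avoids relying on in-place modification of a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age, for illustration only
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25.0, np.nan, 35.0],
})

print(people['Age'].isna().sum())  # 1 missing value

# mean() skips NaN by default, so the mean here is (25 + 35) / 2 = 30
people['Age'] = people['Age'].fillna(people['Age'].mean())
print(people['Age'].tolist())  # [25.0, 30.0, 35.0]
```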

Descriptive Statistics with Pandas

Pandas allows you to quickly compute summary statistics:

print(df.describe())  # Summary statistics for numerical columns
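The result of describe() is itself a DataFrame, so you can pull out individual statistics by label. A small sketch using just the ages from the example:

```python
import pandas as pd

ages = pd.DataFrame({'Age': [25, 30, 35]})
summary = ages.describe()

print(summary.loc['count', 'Age'])  # 3.0
print(summary.loc['mean', 'Age'])   # 30.0
print(summary.loc['std', 'Age'])    # 5.0 (sample standard deviation)

# Single statistics are also available as methods on a column
print(ages['Age'].median())  # 30.0
```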

You can learn more about pandas here.

Combining NumPy and Pandas

NumPy and Pandas work well together. You can use NumPy functions to manipulate data inside a Pandas DataFrame.

Example:

import numpy as np

# Adding a new column with NumPy operations
df['Double Age'] = np.array(df['Age']) * 2
print(df)
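As a further sketch, column arithmetic is already vectorized, so the conversion to an array is optional. NumPy functions such as np.where() can also build a new column from a condition. The "Group" column and its labels are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Arithmetic on a column works directly, element by element
df['Double Age'] = df['Age'] * 2

# np.where picks one of two values for each row based on a condition
df['Group'] = np.where(df['Age'] >= 30, '30+', 'under 30')

print(df)
```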

Using NumPy Arrays in Pandas DataFrames

You can also insert NumPy arrays directly into Pandas DataFrames.

Example:

random_data = np.random.rand(3, 2)
df['Random A'] = random_data[:, 0]
df['Random B'] = random_data[:, 1]
print(df)

Key Concepts Recap

We explored two fundamental Python libraries, NumPy and Pandas, that are widely used in data manipulation and scientific computing. You learned how to:

  • Create and manipulate NumPy arrays.
  • Use Pandas Series and DataFrames to store and process data.
  • Perform basic data manipulations using both libraries.

These skills form the foundation for working with data in Python. Next, we will build on this knowledge and introduce more advanced techniques in data analysis and visualization.

Exercises

NumPy Exercise:

  • Create a 3×3 NumPy array filled with random numbers between 0 and 1.
  • Find the sum of each row and each column.

Pandas Exercise:

  • Create a Pandas DataFrame with the following data:
    • Names: ['John', 'Jane', 'Jim', 'Jill']
    • Ages: [22, 29, 34, 42]
    • Cities: ['New York', 'London', 'Berlin', 'Tokyo']
  • Add a column called “Salary” with random values.
  • Filter out rows where the “Age” is less than 30.


FAQ

Q1: What is the difference between a NumPy array and a Python list?

A1: A NumPy array is more efficient for numerical operations than a Python list. Arrays store data of the same type and are optimized for calculations. Python lists can hold elements of different types but lack the optimized mathematical operations and are slower for large datasets.

Q2: Why should I use Pandas if I can manipulate data using NumPy?

A2: Pandas adds functionality for structured data, like tables. It provides easy-to-use methods for:

  • Handling labeled data.
  • Dealing with missing values.
  • Filtering and manipulating data intuitively.

Pandas is better suited to real-world datasets, which often come in table formats with mixed data types (e.g., numbers, text).

Q3: How do I choose between a Pandas Series and a DataFrame?

A3: Use a Series for one-dimensional data (like a single column or list). Use a DataFrame for two-dimensional data (tables), where each column can have different data types.
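The two structures are closely related: selecting one column gives you a Series, while selecting a list of columns gives you a DataFrame. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

single = df['Age']   # single brackets: a Series
table = df[['Age']]  # double brackets: a one-column DataFrame

print(type(single).__name__)  # Series
print(type(table).__name__)   # DataFrame
print(single.shape)           # (3,)
print(table.shape)            # (3, 1)
```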

Q4: What does inplace=True mean when using Pandas methods like fillna() or drop()?

A4: With inplace=True, the method modifies the existing DataFrame instead of handing back a new one. With inplace=False (the default), the method returns a new DataFrame and leaves the original unchanged, so you need to assign the result to keep it.
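Here is a minimal sketch of both behaviours using drop() on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Default (inplace=False): a new DataFrame comes back and df is unchanged
trimmed = df.drop(columns=['Age'])
columns_after_default = df.columns.tolist()
print(columns_after_default)     # ['Name', 'Age']
print(trimmed.columns.tolist())  # ['Name']

# inplace=True: df itself is changed
df.drop(columns=['Age'], inplace=True)
print(df.columns.tolist())       # ['Name']
```

Assigning the returned object, as in the first half, is usually the easier style to follow because each step has a named result.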

Q5: Can I use NumPy functions directly on Pandas DataFrames?

A5: Yes, many NumPy functions work on Pandas DataFrames because Pandas is built on top of NumPy. You can apply functions like np.mean() or np.sum() to DataFrame columns.
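A short sketch of what that looks like in practice, with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

print(np.mean(df['Age']))    # 30.0
print(np.sum(df['Salary']))  # 180000

# Element-wise functions such as np.sqrt return a Series and keep the index
roots = np.sqrt(df['Age'])
print(roots)
```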

Q6: How do I handle missing data in a DataFrame?

A6: Pandas provides several methods:

  • fillna(): Replaces missing values with a specified value.
  • dropna(): Removes rows or columns with missing data.

Q7: How can I select multiple columns from a Pandas DataFrame?

A7: Pass a list of column names to the DataFrame:

df[['Name', 'Age']]

Q8: How do I filter rows in a DataFrame based on a condition?

A8: You can filter rows by applying conditions directly to columns. For example, to select rows where “Age” is greater than 30:

filtered_df = df[df['Age'] > 30]
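Conditions can also be combined. The sketch below uses & for "and" and | for "or"; each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Both conditions must hold: only Charlie matches
older_in_chicago = df[(df['Age'] > 28) & (df['City'] == 'Chicago')]
print(older_in_chicago)

# Either condition may hold: Alice (under 30) and Bob (Los Angeles) match
young_or_la = df[(df['Age'] < 30) | (df['City'] == 'Los Angeles')]
print(young_or_la)
```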

Q9: What is the difference between .loc[] and .iloc[] in Pandas?

A9:

  • .loc[]: Accesses rows and columns by label (e.g., column names and row index labels).
  • .iloc[]: Accesses rows and columns by integer position, like with NumPy arrays.
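The difference is easiest to see when the row labels are not plain numbers. This sketch assumes a small DataFrame indexed by name, made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],
)

print(df.loc['bob', 'Age'])  # 30 (row label, column label)
print(df.iloc[1, 0])         # 30 (row position, column position)

# Label slices include both ends; positional slices exclude the end
by_label = df.loc['alice':'bob']
by_position = df.iloc[0:1]
print(len(by_label))     # 2
print(len(by_position))  # 1
```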

Q10: How do I generate random numbers in NumPy for my array?

A10: You can use the np.random module:

  • np.random.rand(): Generates random floats in the range 0 (inclusive) to 1 (exclusive).

  • np.random.randint(): Generates random integers within a range.
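NumPy also offers a newer Generator interface through np.random.default_rng(), which is a different API from the two functions listed above. A sketch, with an arbitrary seed so the numbers are repeatable:

```python
import numpy as np

# The seed value is arbitrary; any integer gives a repeatable sequence
rng = np.random.default_rng(seed=42)

floats = rng.random((3, 3))        # floats from 0 (inclusive) to 1 (exclusive)
rolls = rng.integers(1, 7, size=5) # integers from 1 to 6; the upper bound is excluded

print(floats)
print(rolls)
```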

Q11: Can I concatenate or merge two Pandas DataFrames?

A11: Yes, use:

  • pd.concat(): Concatenates DataFrames along rows or columns.
  • pd.merge(): Merges DataFrames on a common column, similar to SQL joins.
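A minimal sketch of both, using small made-up tables:

```python
import pandas as pd

first = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
second = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})

# Stack the rows; ignore_index=True renumbers them 0, 1, 2
stacked = pd.concat([first, second], ignore_index=True)
print(stacked)

cities = pd.DataFrame({
    'Name': ['Alice', 'Charlie'],
    'City': ['New York', 'Chicago'],
})

# Inner join on 'Name': Bob has no city, so his row is dropped
joined = pd.merge(stacked, cities, on='Name', how='inner')
print(joined)
```

Changing how='inner' to how='left' would keep Bob's row and leave his City as a missing value.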

Q12: How do I export a Pandas DataFrame to a CSV file?

A12: Use the to_csv() method:

df.to_csv('filename.csv', index=False)
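To see what the output looks like without touching the disk, this sketch writes to an in-memory text buffer and reads the result back with pd.read_csv(). The buffer stands in for a real file.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With index=False, the row numbers are left out, so the first line of the output is just the column names.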

Q13: How do I reshape a NumPy array?

A13: Use reshape() to change the array shape without altering data:

array = np.arange(12).reshape(3, 4)
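Two details are worth a quick sketch: passing -1 lets NumPy work out one dimension for you, and a shape that does not match the number of elements raises an error.

```python
import numpy as np

array = np.arange(12)

table = array.reshape(3, 4)
auto = array.reshape(-1, 6)  # -1 is inferred as 2, since 12 / 6 = 2
flat = table.reshape(-1)     # back to one dimension

print(table.shape)  # (3, 4)
print(auto.shape)   # (2, 6)
print(flat.shape)   # (12,)

# 12 elements cannot fill a 5x5 grid, so this raises ValueError
try:
    array.reshape(5, 5)
except ValueError as err:
    print('reshape failed:', err)
```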

Q14: Can I apply custom functions to a Pandas DataFrame column?

A14: Yes, use the apply() method to apply custom functions:

df['Age'] = df['Age'].apply(lambda x: x * 2)
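apply() is most useful when the logic does not fit a simple arithmetic expression. In this sketch, age_group is a hypothetical helper written for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

def age_group(age):
    """Hypothetical helper: label an age as under 30 or not."""
    return 'under 30' if age < 30 else '30 and over'

# apply() calls the function once for each value in the column
df['Group'] = df['Age'].apply(age_group)

# For plain arithmetic, the vectorized form is shorter and usually faster
df['Double Age'] = df['Age'] * 2

print(df)
```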

Q15: How can I quickly get summary statistics of my Pandas DataFrame?

A15: Use describe() to get summary statistics (mean, count, standard deviation, etc.) for numerical columns:

df.describe()
