Vishal Bakshi

July 6, 2023

In this blog post, I work through the book Effective Pandas by Matt Harrison. I’ll take notes, work through examples and end-of-chapter exercises.

Represent the following data in pure python:

Artist | Data |
---|---|
0 | 145 |
1 | 142 |
2 | 38 |
3 | 13 |

series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}
series
{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}
The `get` function below can pull items out of this data structure based on the index:
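A minimal sketch of such a `get` function, assuming the `series` dictionary structure shown above:

```python
def get(series, idx):
    # find the position of idx in the index list, then return
    # the data value stored at that same position
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}
get(series, 1)  # 142
```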
The `.index` method on a list returns the position of the provided value, which we can then use to look up the corresponding element of the data list.
Below is an example that has string values for the index:
Create a `Series` object from a list:
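The output below is consistent with a construction like this (the variable name `songs2` is a stand-in; the name `'counts'` is taken from the output):

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')
songs2
```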
0 145
1 142
2 38
3 13
Name: counts, dtype: int64
The series is one-dimensional. The leftmost column is the index, also called the axis. The data (145, 142, 38, 13) is also called the values of the series. A `DataFrame` has two axes, one for the rows and another for the columns.

The default values for an index are monotonically increasing integers. The index can be string-based as well (the datatype for the index is then `object`).
songs3 = pd.Series([145, 142, 38, 13],
name = 'counts',
index = ['Paul', 'John', 'George', 'Ringo'])
songs3
Paul 145
John 142
George 38
Ringo 13
Name: counts, dtype: int64
We can insert Python objects into a series:
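The output below suggests a cell along these lines (the `Foo` class is a stand-in for any Python object):

```python
import pandas as pd

class Foo:
    pass

# mixed types (strings, an int, an arbitrary object) force the object dtype
ringo = pd.Series(['Richard', 'Starkey', 13, Foo()], name='ringo')
```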
0 Richard
1 Starkey
2 13
3 <__main__.Foo object at 0x135016ce0>
Name: ringo, dtype: object
The `object` data type is also used for a series with string values and for series with heterogeneous or mixed types.
Here is a series that has `NaN` in it:
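The output below matches a series like this; mixing an int with `np.nan` promotes the dtype to `float64`:

```python
import pandas as pd
import numpy as np

nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])
```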
Ono 2.0
Clapton NaN
dtype: float64
`float64` supports `NaN` while `int64` does not. As of pandas 0.24, `Int64` (the nullable integer type) supports `NaN`.
`.count` ignores `NaN`s; `.size` does not.
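A quick sketch of that difference on the same two-element series:

```python
import pandas as pd
import numpy as np

nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])
nan_series.count()  # 1 -- the NaN is ignored
nan_series.size     # 2 -- the NaN is included
```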
Ono 2
Clapton <NA>
dtype: Int64
The `Series` object behaves similarly to a NumPy array:

- They both have methods in common.
- They both have a notion of a boolean array.
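The boolean output below is consistent with comparing `songs3` against its median:

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13], name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# the comparison broadcasts, producing a boolean series
mask = songs3 > songs3.median()
```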
Paul True
John True
George False
Ringo False
Name: counts, dtype: bool
You can indicate that data is categorical.
Categorical values:
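The output below matches constructing the series with `dtype='category'`:

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
```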
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
By default categories don’t have an ordering.
Convert non-categorical series to an ordered category:
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
categories=['s', 'm', 'l'], ordered = True)
s3 = s2.astype(size_type)
s3
0 m
1 l
2 NaN
3 s
4 NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
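With an ordered categorical you can compare against a scalar; the boolean output below is consistent with `s3 > 's'` (values outside the declared categories became `NaN` and compare as `False`):

```python
import pandas as pd

s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)
s3 = s2.astype(size_type)

# ordered categories support comparisons
s3 > 's'
```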
0 True
1 True
2 False
3 False
4 False
dtype: bool
# add ordering information to categorical data
s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
For strings and dates converted to categorical types, we can still use the `.str` or `.dt` attributes on them:
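A small sketch (the `cat_s` series is a stand-in):

```python
import pandas as pd

cat_s = pd.Series(['m', 'l', 's'], dtype='category')
# .str still works on categorical data backed by strings
cat_s.str.upper()
```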
Method | Description |
---|---|
pd.Series(data=None, index=None, dtype=None, name=None, copy=False) | Create a series from data (sequence, dictionary or scalar). |
s.index | Access index of series. |
s.astype(dtype, errors='raise') | Cast a series to dtype. To ignore errors (and return original object) use errors='ignore'. |
s[boolean_array] | Return values from s where boolean_array is True. |
s.cat.ordered | Determine if a categorical series is ordered. |
s.cat.reorder_categories(new_categories, ordered=False) | Add categories (potentially ordered) to the series. new_categories must include all categories. |
# analyze the US Fuel Economy data
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
/var/folders/5q/_bn7l90s177_2gq7rnhssjxm0000gn/T/ipykernel_52057/221626492.py:4: DtypeWarning: Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(url)
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
0 25
1 14
2 33
3 12
4 23
..
41139 26
41140 28
41141 24
41142 24
41143 21
Name: highway08, Length: 41144, dtype: int64
Because the type is `int64` we know that none of the values are missing.

The `dir` function lists the attributes of an object. A series has 400+ attributes:
Functionality of series attributes:

- Dunder methods (.__add__, etc.): the protocols that determine how the Python language reacts to operations.
- Conversion methods that start with .to_ and export the data to other formats.
- Manipulation methods that return Series objects with the same index.
- Indexing and accessor methods that return Series or scalars.
- String manipulation with .str.
- Date manipulation with .dt.
- Plotting with .plot.
- Categorical manipulation with .cat.
- Attributes such as .index and .dtype.

Exercises:

- How many attributes are found on the .str attribute? Look at the documentation for three of them.
- How many attributes are found on the .dt attribute? Look at the documentation for three of them.
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
...
41139 22.5
41140 24.0
41141 21.0
41142 21.0
41143 18.5
Length: 41144, dtype: float64
When you operate with two series, pandas will align the index before performing the operation. Because of index alignment, you will want to make sure that the indexes:

- are unique
- are common to both series
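The output below is consistent with adding two small series with duplicate and non-overlapping index values, e.g.:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

# each 2 in s1 pairs with each 2 in s2 (a cartesian product);
# index values 1 and 4 have no partner, so they produce NaN
s1 + s2
```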
1 NaN
2 55.0
2 64.0
2 65.0
2 74.0
4 NaN
dtype: float64
When you perform math operations with a scalar, pandas broadcasts the operation to all values. A numeric pandas series is a block of memory, and modern CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply a math operation to the block of memory.
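The output below matches the same addition with a `fill_value`, which substitutes 0 for the missing side instead of producing `NaN`:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

total = s1.add(s2, fill_value=0)
```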
1 10.0
2 55.0
2 64.0
2 65.0
2 74.0
4 53.0
dtype: float64
Chaining makes the code easy to read and understand
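A sketch of the same average calculation written as a chain, on small stand-in series:

```python
import pandas as pd

city = pd.Series([19, 9, 23, 10, 17])
highway = pd.Series([25, 14, 33, 12, 23])

# each method returns a new series, so calls chain left to right
avg = city.add(highway).div(2)
```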
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
...
41139 22.5
41140 24.0
41141 21.0
41142 21.0
41143 18.5
Length: 41144, dtype: float64
Method | Operator | Description |
---|---|---|
s.add(s2) | s + s2 | Adds series |
s.radd(s2) | s2 + s | Adds series |
s.sub(s2) | s - s2 | Subtracts series |
s.rsub(s2) | s2 - s | Subtracts series |
s.mul(s2) | s * s2 | Multiplies series |
s.multiply(s2) | s * s2 | Multiplies series |
s.rmul(s2) | s2 * s | Multiplies series |
s.div(s2) | s / s2 | Divides series |
s.truediv(s2) | s / s2 | Divides series |
s.rdiv(s2) | s2 / s | Divides series |
s.rtruediv(s2) | s2 / s | Divides series |
s.mod(s2) | s % s2 | Modulo of series division |
s.rmod(s2) | s2 % s | Modulo of series division |
s.floordiv(s2) | s // s2 | Floor divides series |
s.rfloordiv(s2) | s2 // s | Floor divides series |
s.pow(s2) | s ** s2 | Exponential power of series |
s.rpow(s2) | s2 ** s | Exponential power of series |
s.eq(s2) | s2 == s | Elementwise equals of series |
s.ne(s2) | s2 != s | Elementwise not equals of series |
s.gt(s2) | s > s2 | Elementwise greater than of series |
s.ge(s2) | s >= s2 | Elementwise greater than or equals of series |
s.lt(s2) | s < s2 | Elementwise less than of series |
s.le(s2) | s <= s2 | Elementwise less than or equals of series |
np.invert(s) | ~s | Elementwise inversion of boolean series (no pandas method) |
np.logical_and(s, s2) | s & s2 | Elementwise logical and of boolean series (no pandas method) |
np.logical_or(s, s2) | s \| s2 | Elementwise logical or of boolean series (no pandas method) |
With a dataset of your choice:

1. Add a numeric series to itself.
2. Add 10 to a numeric series.
3. Add a numeric series to itself using the `.add` method.

0 38
1 18
2 46
3 20
4 34
..
41139 38
41140 40
41141 36
41142 36
41143 32
Name: city08, Length: 41144, dtype: int64
0 29
1 19
2 33
3 20
4 27
..
41139 29
41140 30
41141 28
41142 28
41143 26
Name: city08, Length: 41144, dtype: int64
0 38
1 18
2 46
3 20
4 34
..
41139 38
41140 40
41141 36
41142 36
41143 32
Name: city08, Length: 41144, dtype: int64
Aggregate methods collapse the values of a series down to a scalar.
0.1 13.0
0.5 17.0
0.9 24.0
Name: city08, dtype: float64
If you want the count of values that meet some criteria, you can use the `.sum` method:
24.965973167412017
Observe the `.mul(100).mean()` calculation on a simpler `Series`:

If you sum up a series of boolean values, the result is the count of `True` values. If you take the mean of a series of boolean values, the result is the fraction of values that are `True`.
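A minimal illustration:

```python
import pandas as pd

s = pd.Series([True, True, False, False])
s.sum()            # 2 -- the count of True values
s.mul(100).mean()  # 50.0 -- the percentage of True values
```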
`.agg` can perform multiple operations:
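The output below mixes built-in aggregation names with a custom function; a sketch of the pattern on a small stand-in series (`second_to_last` is the custom function):

```python
import pandas as pd

def second_to_last(s):
    return s.iloc[-2]

s = pd.Series([10, 20, 30, 40])
# strings name built-in aggregations; callables are applied directly
s.agg(['mean', 'max', second_to_last])
```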
mean 18.369045
var 62.503036
max 150.000000
second_to_last 18.000000
Name: city08, dtype: float64
Aggregation strings and descriptions:
Method | Description |
---|---|
'all' | Returns True if every value is truthy. |
'any' | Returns True if any value is truthy. |
'autocorr' | Returns Pearson correlation of series with shifted self. Can override lag as keyword argument (default is 1). |
'corr' | Returns Pearson correlation of series with other series. Need to specify other. |
'count' | Returns count of non-missing values. |
'cov' | Returns covariance of series with other series. Need to specify other. |
'dtype' | Type of the series. |
'dtypes' | Type of the series. |
'empty' | True if there are no values in the series. |
'hasnans' | True if there are missing values in the series. |
'idxmax' | Returns index value of maximum value. |
'idxmin' | Returns index value of minimum value. |
'is_monotonic' | True if values always increase. |
'is_monotonic_decreasing' | True if values always decrease. |
'is_monotonic_increasing' | True if values always increase. |
'kurt' | Returns "excess" kurtosis (0 is normal distribution). Values greater than 0 have more outliers than normal. |
'mad' | Returns the mean absolute deviation. |
'max' | Returns the maximum value. |
'mean' | Returns the mean value. |
'median' | Returns the median value. |
'min' | Returns the minimum value. |
'nbytes' | Returns the number of bytes of the data. |
'ndim' | Returns the number of dimensions (1) of the data. |
'nunique' | Returns the count of unique values. |
'quantile' | Returns the 50% quantile by default. Can override q to specify other quantile. |
'sem' | Returns the unbiased standard error. |
'size' | Returns the size of the data. |
'skew' | Returns the unbiased skew of the data. Negative indicates tail is on the left side. |
'std' | Returns the standard deviation of the data. |
'sum' | Returns the sum of the series. |
Aggregation methods and properties:
Method | Description |
---|---|
s.agg(func=None, axis=0, *args, **kwargs) | Returns a scalar if func is a single aggregation function. Returns a series if a list of aggregations are passed to func. |
s.all(axis=0, bool_only=None, skipna=True, level=None) | Returns True if every value is truthy. Otherwise False. |
s.any(axis=0, bool_only=None, skipna=True, level=None) | Returns True if at least one value is truthy. Otherwise False. |
s.autocorr(lag=1) | Returns Pearson correlation between s and shifted s. |
s.corr(other, method='pearson') | Returns correlation coefficient for 'pearson', 'spearman', 'kendall', or a callable. |
s.cov(other, min_periods=None) | Returns covariance. |
s.max(axis=None, skipna=None, level=None, numeric_only=None) | Returns maximum value. |
s.min(axis=None, skipna=None, level=None, numeric_only=None) | Returns minimum value. |
s.mean(axis=None, skipna=None, level=None, numeric_only=None) | Returns mean value. |
s.median(axis=None, skipna=None, level=None, numeric_only=None) | Returns median value. |
s.prod(axis=None, skipna=None, level=None, numeric_only=None, min_count=0) | Returns product of s values. |
s.quantile(q=0.5, interpolation='linear') | Returns 50% quantile by default. Returns Series if q is a list. |
s.sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns unbiased standard error of mean. |
s.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns sample standard deviation. |
s.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns unbiased variance. |
s.skew(axis=None, skipna=None, level=None, numeric_only=None) | Returns unbiased skew. |
s.kurtosis(axis=None, skipna=None, level=None, numeric_only=None) | Returns unbiased kurtosis. |
s.nunique(dropna=True) | Returns count of unique items. |
s.count(level=None) | Returns count of non-missing items. |
s.size | Number of items in series. (Property) |
s.is_unique | True if all values are unique. |
s.is_monotonic | True if all values are increasing. |
s.is_monotonic_increasing | True if all values are increasing. |
s.is_monotonic_decreasing | True if all values are decreasing. |
With a dataset of your choice:
- Use the `.agg` method to find all of the above.

`.convert_dtypes` tries to convert a `Series` to a type that supports `pd.NA`. In the case of our `city_mpg` series it will change the type from `int64` to `Int64`.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: Int64
`.astype` works more explicitly. The maximum 8-bit integer is 127, so we need a 16-bit integer for `city_mpg` since its max is 150.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: Int16
If you can use a narrower type, you can cut back on memory usage, giving you memory to process more data.
Use NumPy to inspect limits on integer and float types:
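For example, `np.iinfo` and `np.finfo` report those limits:

```python
import numpy as np

np.iinfo('int8')     # min=-128, max=127
np.iinfo('int16')    # min=-32768, max=32767
np.finfo('float16')  # limits of the 16-bit float
```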
Use the `.nbytes` property or the `.memory_usage` method to calculate the memory usage of the `Series`.

Pass `deep=True` to `.memory_usage` when dealing with `object` types in the `Series`.

To get the amount of memory that includes strings in the `Series` (like the `make` column), we need to use the `.memory_usage` method:
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: object
`.memory_usage` includes the index memory and can include the contribution from `object` types. `.nbytes` is just the memory that the data is using and not the ancillary parts of the `Series`.
Converting to categorical will save a lot of memory for the `make` data:
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: object
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]
When you convert to categorical data, pandas no longer uses Python strings for each value but optimizes it. Potentially large memory savings if you have many duplicate values.
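A sketch of the savings on synthetic data with many repeated strings:

```python
import pandas as pd

s = pd.Series(['Subaru', 'Dodge', 'Ford'] * 1000)
cat = s.astype('category')

# the categorical stores small integer codes plus one lookup table,
# instead of a Python string object per row
s.memory_usage(deep=True), cat.memory_usage(deep=True)
```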
To create ordered categories you need to define your own `CategoricalDtype`:
values = pd.Series(sorted(set(city_mpg)))
city_type = pd.CategoricalDtype(categories=values, ordered=True)
city_mpg.astype(city_type)
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
String or Type | Description |
---|---|
str 'str' 'string' | Convert to Python string. |
'string' | Convert type to pandas string (supports pd.NA). |
int 'int' 'int64' | Convert type to NumPy int64. |
'int32' 'uint32' | Convert to 32-bit signed or unsigned NumPy integer (can also use 16 and 8). |
'Int64' | Convert type to pandas Int64 (supports pd.NA). Might complain when you convert floats or strings. |
float 'float' 'float64' | Convert type to NumPy float64 (can also support 32 or 16). |
'category' | Convert type to categorical (supports pd.NA). Can also use an instance of CategoricalDtype. |
dates | Don't use .astype for date conversion; use pd.to_datetime. |
Using Python lists will slow down your code significantly.
You can convert a `Series` into a `DataFrame`:
 | city08 |
---|---|
0 | 19 |
1 | 9 |
2 | 23 |
3 | 10 |
4 | 17 |
... | ... |
41139 | 19 |
41140 | 20 |
41141 | 18 |
41142 | 18 |
41143 | 16 |
41144 rows × 1 columns
Method | Description |
---|---|
s.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True) | Convert types to appropriate pandas 1 types (that support NA). Doesn't try to reduce size of integer or float types. |
s.astype(dtype, copy=True, errors='raise') | Cast series into particular type. If errors='ignore' then return original series on error. |
pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True) | Convert arg (a series) into datetime. Use format to specify strftime string. |
s.to_numpy(dtype=None, copy=False, na_value=object, **kwargs) | Convert the series to a NumPy array. |
s.values | Convert the series to a NumPy array. |
s.to_frame(name=None) | Return a dataframe representation of the series. |
pd.CategoricalDtype(categories=None, ordered=False) | Create a type for categorical data. |
With a dataset of your choice:
1. Convert a numeric column to a smaller type.
0 15.695714
1 29.964545
2 12.207778
3 29.964545
4 17.347895
...
41139 14.982273
41140 14.330870
41141 15.695714
41142 15.695714
41143 18.311667
Name: barrels08, Length: 41144, dtype: float64
0 15.695312
1 29.968750
2 12.210938
3 29.968750
4 17.343750
...
41139 14.984375
41140 14.328125
41141 15.695312
41142 15.695312
41143 18.312500
Name: barrels08, Length: 41144, dtype: float16
2. Calculate the memory savings by converting to smaller numeric types.
3. Convert a string column into a categorical type.
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: object
values = pd.Series(sorted(set(df.make)))
make_type = pd.CategoricalDtype(categories=values, ordered=False)
df.make.astype(make_type)
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']
4. Calculate the memory savings by converting to a categorical type.
Comparing the non-broadcasted `.apply` method with vectorized code:
7.08 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The broadcasted `.gt` method is 40 times faster than the non-broadcasted `.apply` function.
Show the top 5 makes and label everything else as Other:
48.2 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A (10 times) faster and more idiomatic manner of doing this uses the `.where` method, which keeps values from the series it is called on where the boolean array is true. Where the boolean array is false, it uses the value of the second parameter, `other`:
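A sketch of the pattern on a tiny stand-in series (here keeping only the single most frequent make):

```python
import pandas as pd

make = pd.Series(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru'])
top = make.value_counts().index[:1]

# keep values where the condition holds; use `other` everywhere else
limited = make.where(make.isin(top), other='Other')
```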
3.68 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
The complement of `.where` is `.mask`: wherever the condition is `False` it keeps the original values; where it is `True` it replaces the value with the `other` parameter.
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
3.36 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The tilde (`~`) performs an inversion of the boolean array, switching all true values to false and vice versa.
A single call to `.where` cannot do the following: keep the top five makes, use Top10 for the remainder of the top ten makes, and Other for the rest.
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
76.4 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
To replicate this in pandas, chain calls to `.where`:
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
6.05 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The pandas approach is still about 13 times faster.
The `select` function in NumPy works with pandas series. The interface takes a list of boolean arrays and a list with corresponding replacement values.
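A minimal sketch of `np.select` on a pandas series:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 15, 25])
# the first matching condition wins; `default` covers the rest
result = np.select([s < 10, s < 20], ['low', 'mid'], default='high')
```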
array(['Other', 'Other', 'Dodge', ..., 'Other', 'Other', 'Other'],
dtype=object)
19.8 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can wrap the result in a `Series`. For longer if-else logic I like this syntax better than chaining `.where` calls because I think it is easier to understand.
Count the number of missing items with `.isna().sum()`:
Let’s index where the values are missing in the cylinders column and then show what those makes are:
It seems like cylinders are missing for cars that are electric (they have zero cylinders).
7138 NaN
7139 NaN
8143 NaN
8144 NaN
8146 NaN
..
34563 NaN
34564 NaN
34565 NaN
34566 NaN
34567 NaN
Name: cylinders, Length: 206, dtype: float64
0 32.0
1 40.0
2 NaN
3 42.0
4 39.0
5 32.0
dtype: float64
0 19
1 9
2 23
3 10
4 17
..
442 15
443 15
444 15
445 15
446 31
Name: city08, Length: 447, dtype: int64
0 19
1 11
2 23
3 11
4 17
..
442 15
443 15
444 15
445 15
446 27
Name: city08, Length: 447, dtype: int64
`.clip` uses `.where` under the hood.
The `.sort_values` method will sort the values in ascending order and also rearrange the index accordingly.
7901 6
34557 6
37161 6
21060 6
35887 6
...
34563 138
34564 140
32599 150
31256 150
33423 150
Name: city08, Length: 41144, dtype: int64
Because of index alignment, you can still do math operations on a sorted series:
Below we unsort the index by sorting the values, then essentially revert that:
`keep='first'` is the default value and keeps the first duplicate value found.

`keep='last'` keeps the last duplicate value found.

`keep=False` will remove any duplicated values (including the initial value).

`drop_duplicates` keeps the original index.
The `.rank` method will return a series that keeps the original index but uses the ranks of values from the original series. By default, if two values are the same, their rank will be the average of the positions they take. You can specify `method='min'` to put equal values in the same rank and `method='dense'` to not skip any positions:
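A sketch of the three tie-handling behaviors on a tiny series:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])
s.rank()                # ties average their positions: 1.0, 2.5, 2.5, 4.0
s.rank(method='min')    # ties share the lowest position: 1, 2, 2, 4
s.rank(method='dense')  # like 'min' but no gaps: 1, 2, 2, 3
```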
0 27060.5
1 235.5
2 35830.0
3 607.5
4 19484.0
...
41139 27060.5
41140 29719.5
41141 23528.0
41142 23528.0
41143 15479.0
Name: city08, Length: 41144, dtype: float64
0 25555.0
1 136.0
2 35119.0
3 336.0
4 17467.0
...
41139 25555.0
41140 28567.0
41141 21502.0
41142 21502.0
41143 13492.0
Name: city08, Length: 41144, dtype: float64
0 14.0
1 4.0
2 18.0
3 5.0
4 12.0
...
41139 14.0
41140 15.0
41141 13.0
41142 13.0
41143 11.0
Name: city08, Length: 41144, dtype: float64
The `.replace` method allows you to map values to new values.
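The output below is consistent with a dictionary-based replace, sketched here on a short stand-in series:

```python
import pandas as pd

make = pd.Series(['Alfa Romeo', 'Subaru', 'Dodge'])
# a dictionary maps old values to new values
make.replace({'Subaru': 'SUBARU'})
```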
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 SUBARU
...
41139 SUBARU
41140 SUBARU
41141 SUBARU
41142 SUBARU
41143 SUBARU
Name: make, Length: 41144, dtype: object
Using the `cut` function, you can create bins of equal width:
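A minimal sketch; the bin edges are derived from the series minimum and maximum:

```python
import pandas as pd

s = pd.Series([1, 5, 10, 15, 20])
# two bins of equal width covering the observed range
binned = pd.cut(s, 2)
```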
0 (5.856, 20.4]
1 (5.856, 20.4]
2 (20.4, 34.8]
3 (5.856, 20.4]
4 (5.856, 20.4]
...
41139 (5.856, 20.4]
41140 (5.856, 20.4]
41141 (5.856, 20.4]
41142 (5.856, 20.4]
41143 (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]
You can also specify the bin edges yourself. In the following, 5 bins are created (so you need to provide 6 edges):
0 (10, 20]
1 (0, 10]
2 (20, 40]
3 (0, 10]
4 (10, 20]
...
41139 (10, 20]
41140 (10, 20]
41141 (10, 20]
41142 (10, 20]
41143 (10, 20]
Name: city08, Length: 41144, dtype: category
Categories (5, interval[int64, right]): [(0, 10] < (10, 20] < (20, 40] < (40, 70] < (70, 150]]
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
Note the bins have a half-open interval: they do not include the start value but do include the end value. If the `city_mpg` series had values of 0 or values above 150, they would be missing after binning the series.
If you want 10 bins that have approximately the same number of entries in each bin (rather than each bin having the same width), use the `qcut` function:
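A minimal sketch; with 100 evenly spread values, `qcut` puts 10 in each bin:

```python
import pandas as pd

s = pd.Series(range(100))
# 10 bins with (approximately) equal counts rather than equal widths
counts = pd.qcut(s, 10).value_counts()
```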
0 (18.0, 20.0]
1 (5.999, 13.0]
2 (21.0, 24.0]
3 (5.999, 13.0]
4 (16.0, 17.0]
...
41139 (18.0, 20.0]
41140 (18.0, 20.0]
41141 (17.0, 18.0]
41142 (17.0, 18.0]
41143 (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]
Both allow you to set the labels to use instead of the categorical intervals they generate:
0 7
1 1
2 9
3 1
4 5
..
41139 7
41140 7
41141 6
41142 6
41143 4
Name: city08, Length: 41144, dtype: category
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]
Manipulation methods and properties:
Method | Description |
---|---|
s.apply(func, convert_dtype=True, args=(), **kwds) | Pass in a NumPy function that works on the series, or a Python function that works on a single value. args and kwds are arguments for func. Returns a series, or dataframe if func returns a series. |
s.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False) | Pass in a boolean series/dataframe, list, or callable as cond. If the value is True, keep it, otherwise use other value. If it is a function, it takes a series and should return a boolean sequence. |
np.select(condlist, choicelist, default=0) | Pass in a list of boolean arrays for condlist. If the value is true use the corresponding value from choicelist. If multiple conditions are True, only use the first. Returns a NumPy array. |
s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) | Pass in a scalar, dict, series or dataframe for value. If it is a scalar, use that value, otherwise use the index from the old value to the new value. |
s.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs) | Perform interpolation with missing values. method may be linear, time among others. |
s.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs) | Return a new series with values clipped to lower and upper. |
s.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) | Return a series with values sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'. |
s.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None) | Return a series with index sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'. |
s.drop_duplicates(keep='first', inplace=False) | Drop duplicates. keep may be 'first', 'last', or False. (If False, it removes all values that were duplicated.) |
s.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) | Return a series with numerical ranks. method allows you to specify tie handling: 'average', 'min', 'max', 'first' (uses order they appear in series), 'dense' (like 'min', but rank only increases by one after tie). na_option allows you to specify NaN handling: 'keep' (stay at NaN), 'top' (move to smallest), 'bottom' (move to largest). |
s.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad') | Return a series with new values. to_replace can be many things. If it is a string, number or regular expression, you can replace it with a scalar value. It can also be a list of those things which requires value to be a list of the same size. Finally, it can be a dictionary mapping old values to new values. |
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True) | Bin values from x (a series). If bins is an integer, use equal-width bins. If bins is a list of numbers (defining minimum and maximum positions) use those for the edges. right defines whether the right edge is open or closed. labels allows you to specify the bin names. Out of bounds values will be missing. |
pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise') | Bin values from x (a series) into q equal sized bins. Alternatively, can pass in a list of quantile edges. Out of bounds values will be missing. |
1. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `.apply`.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
def generalize_mean(x, mean_val):
    if x >= mean_val:
        return 'high'
    return 'low'

city_mpg.apply(generalize_mean, mean_val=city_mpg.mean())
0 high
1 low
2 high
3 low
4 low
...
41139 high
41140 high
41141 low
41142 low
41143 low
Name: city08, Length: 41144, dtype: object
2. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `np.select`.
0 high
1 low
2 high
3 low
4 low
...
41139 high
41140 high
41141 low
41142 low
41143 low
Length: 41144, dtype: object
3. Time the differences between the previous two solutions to see which is faster.
22.3 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.45 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
`np.select` is around 5 times as fast as `.apply` in this exercise.
4. Replace the missing values of a numeric series with the median value.
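A sketch of the solution with stand-in cylinder values:

```python
import pandas as pd
import numpy as np

cyl = pd.Series([4.0, 6.0, np.nan, 8.0, 6.0, np.nan])
# fill the missing entries with the median of the non-missing ones
filled = cyl.fillna(cyl.median())
```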
7136 False
7137 False
7138 True
7139 True
7140 False
Name: cylinders, dtype: bool
7136 6.0
7137 6.0
7138 6.0
7139 6.0
7140 6.0
Name: cylinders, dtype: float64
5. Clip the values of a numeric series to between the 10th and 90th percentiles.
clip_s = pd.Series([1,2,3,4,5,6,7,8,9,10])
(clip_s
.clip(lower=clip_s.quantile(0.1),
upper=clip_s.quantile(0.9))
)
0 1.9
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 9.1
dtype: float64
6. Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Gasoline or natural gas',
'Gasoline or E85', 'Electricity', 'Gasoline or propane',
'Premium or E85', 'Midgrade', 'Premium Gas or Electricity',
'Regular Gas and Electricity', 'Premium and Electricity',
'Regular Gas or Electricity'], dtype=object)
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity'], dtype='object', name='fuelType')
0 Regular
1 Regular
2 Regular
3 Regular
4 Premium
...
41139 Regular
41140 Regular
41141 Regular
41142 Regular
41143 Premium
Name: fuelType, Length: 41144, dtype: object
7. Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity',
'Premium or E85', 'Midgrade', 'CNG', 'Premium and Electricity',
'Regular Gas and Electricity'],
dtype='object', name='fuelType')
0 Regular
1 Regular
2 Regular
3 Regular
4 Premium
...
41139 Regular
41140 Regular
41141 Regular
41142 Regular
41143 Premium
Name: fuelType, Length: 41144, dtype: object
8. Make a function that takes a categorical series and a number (`n`) and returns a replacement series that replaces any value that is not in the top `n` most frequent values with 'Other'.
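One way to write such a function (a sketch; the name `limit_top_n` is my own):

```python
import pandas as pd

def limit_top_n(s, n, other='Other'):
    # keep the n most frequent values; replace everything else
    top = s.value_counts().index[:n]
    return s.where(s.isin(top), other)

fuel = pd.Series(['Regular', 'Regular', 'Premium', 'Diesel'])
limit_top_n(fuel, 1)
```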
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Other', 'Gasoline or E85',
'Electricity', 'Premium or E85', 'Midgrade',
'Regular Gas and Electricity', 'Premium and Electricity'],
dtype=object)
array(['Regular', 'Premium', 'Diesel', 'Other', 'Gasoline or E85',
'Electricity'], dtype=object)
9. Using a numeric column, bin it into 10 groups that have the same width.
0 (5.856, 20.4]
1 (5.856, 20.4]
2 (20.4, 34.8]
3 (5.856, 20.4]
4 (5.856, 20.4]
...
41139 (5.856, 20.4]
41140 (5.856, 20.4]
41141 (5.856, 20.4]
41142 (5.856, 20.4]
41143 (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]
10. Using a numeric column, bin it into 10 groups that have equal sized bins.
0 (18.0, 20.0]
1 (5.999, 13.0]
2 (21.0, 24.0]
3 (5.999, 13.0]
4 (16.0, 17.0]
...
41139 (18.0, 20.0]
41140 (18.0, 20.0]
41141 (17.0, 18.0]
41142 (17.0, 18.0]
41143 (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]
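Equal-sized (quantile) binning is `pd.qcut`. With evenly spread data each bin ends up with the same number of rows:

```python
import pandas as pd

s = pd.Series(range(1, 101))
binned = pd.qcut(s, 10)        # 10 bins with (roughly) equal counts
binned.value_counts()
```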
This Stack Overflow response addresses the error below:
ValueError: Bin edges must be unique: array([ 2., 4., 4., 4., 5., 6., 6., 6., 8., 8., 16.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
If I decrease the number of bins to 5, the error is not raised:
0 (1.999, 4.0]
1 (8.0, 16.0]
2 (1.999, 4.0]
3 (6.0, 8.0]
4 (1.999, 4.0]
...
41139 (1.999, 4.0]
41140 (1.999, 4.0]
41141 (1.999, 4.0]
41142 (1.999, 4.0]
41143 (1.999, 4.0]
Name: cylinders, Length: 41144, dtype: category
Categories (5, interval[float64, right]): [(1.999, 4.0] < (4.0, 5.0] < (5.0, 6.0] < (6.0, 8.0] < (8.0, 16.0]]
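The `duplicates` kwarg fix from that answer can be sketched like this on a hypothetical series with many repeated values (so that several quantile edges coincide):

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 4, 4, 4, 5, 6, 8, 16])
# pd.qcut(s, 10) raises "ValueError: Bin edges must be unique" here
binned = pd.qcut(s, 10, duplicates='drop')   # drop the duplicate edges instead
len(binned.cat.categories)                   # fewer than 10 bins survive
```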
Both a series and a dataframe have an index. Both types support the Python indexing operator ([]
). Both have attributes .loc
and .iloc
that you can index against.
Use the .rename
method to change the index labels. We can pass in a dictionary to map the previous index label to the new label:
{0: 'Alfa Romeo',
1: 'Ferrari',
2: 'Dodge',
3: 'Dodge',
4: 'Subaru',
5: 'Subaru',
6: 'Subaru',
7: 'Toyota',
8: 'Toyota',
9: 'Toyota'}
Alfa Romeo 19
Ferrari 9
Dodge 23
Dodge 10
Subaru 17
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 41144, dtype: int64
Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
'Toyota', 'Toyota', 'Toyota',
...
'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
'Subaru', 'Subaru', 'Subaru'],
dtype='object', length=41144)
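A small self-contained sketch of the same rename-with-a-dictionary pattern:

```python
import pandas as pd

s = pd.Series([19, 9, 23], name='city08')
s2 = s.rename({0: 'Alfa Romeo', 1: 'Ferrari', 2: 'Dodge'})
list(s2.index)   # the dictionary maps old labels to new ones
```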
The .rename
method also accepts a series, a scalar, a function that takes an old label and returns a new label, or a sequence. When we pass in a series and the index values are the same, the values from the series that we passed in are used as the index.
Alfa Romeo 19
Ferrari 9
Dodge 23
Dodge 10
Subaru 17
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 41144, dtype: int64
If you pass a scalar value (a single string) into .rename
the index will stay the same but the .name
attribute of the series will update:
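A sketch (the new name is illustrative):

```python
import pandas as pd

s = pd.Series([19, 9], name='city08')
s2 = s.rename('city_mpg')      # scalar: updates .name, leaves the index alone
s2.name, list(s2.index)
```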
.reset_index
by default will return a dataframe, moving the current index into a new column:
index | city08 | |
---|---|---|
0 | Alfa Romeo | 19 |
1 | Ferrari | 9 |
2 | Dodge | 23 |
3 | Dodge | 10 |
4 | Subaru | 17 |
... | ... | ... |
41139 | Subaru | 19 |
41140 | Subaru | 20 |
41141 | Subaru | 18 |
41142 | Subaru | 18 |
41143 | Subaru | 16 |
41144 rows × 2 columns
drop=True
drops the current index and returns a Series.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
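Both behaviors in one small sketch:

```python
import pandas as pd

s = pd.Series([19, 9], index=['Alfa Romeo', 'Ferrari'], name='city08')
df = s.reset_index()                # DataFrame: old index moves into a column
s2 = s.reset_index(drop=True)       # Series with a fresh integer index
```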
Note that .sort_values
and .sort_index
keep the same index but just rearrange the order so they do not impact operations that align on the index.
.loc Attribute
The .loc attribute deals with index labels. You can pass the following into an index operation on .loc:
If there are duplicate labels in the index and you pass a scalar label into .loc, it will return a series. If there is only one value for that label, it will return a scalar.
Subaru 17
Subaru 21
Subaru 22
Subaru 19
Subaru 20
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 885, dtype: int64
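The two cases side by side on a tiny hypothetical series:

```python
import pandas as pd

s = pd.Series([17, 21, 9], index=['Subaru', 'Subaru', 'Ferrari'])
s.loc['Subaru']    # duplicated label -> a Series with both values
s.loc['Ferrari']   # unique label -> the scalar 9
```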
If you want to guarantee that a series is returned, pass in a list rather than passing in a scalar value:
Ferrari 9
Ferrari 12
Ferrari 11
Ferrari 10
Ferrari 11
..
Lamborghini 6
Lamborghini 8
Lamborghini 8
Lamborghini 8
Lamborghini 8
Name: city08, Length: 357, dtype: int64
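Wrapping the label in a list removes the ambiguity:

```python
import pandas as pd

s = pd.Series([9, 17], index=['Ferrari', 'Subaru'])
s.loc['Ferrari']     # scalar
s.loc[['Ferrari']]   # always a Series, even for a unique label
```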
Sort the index if you are slicing with duplicate index labels:
KeyError: "Cannot get left slice bound for non-unique label: 'Ferrari'"
Ferrari 10
Ferrari 13
Ferrari 13
Ferrari 9
Ferrari 10
..
Lamborghini 12
Lamborghini 9
Lamborghini 8
Lamborghini 13
Lamborghini 8
Name: city08, Length: 11210, dtype: int64
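The failure and the fix, sketched on a small series:

```python
import pandas as pd

s = pd.Series([23, 9, 17, 12], index=['Dodge', 'Ferrari', 'Subaru', 'Ferrari'])
# s.loc['Ferrari':'Subaru'] raises a KeyError on this unsorted index
s2 = s.sort_index()
s2.loc['Ferrari':'Subaru']   # works once the duplicate labels are grouped
```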
Slicing with .loc
follows the closed interval: it includes both the start index and the final index.
If you have a sorted index, you can slice with strings that are not actual labels.
Federal Coach 15
Federal Coach 13
Federal Coach 13
Federal Coach 14
Federal Coach 13
..
Isuzu 15
Isuzu 15
Isuzu 15
Isuzu 27
Isuzu 18
Name: city08, Length: 9040, dtype: int64
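For example, on a sorted index the slice bounds don't need to exist as labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['Dodge', 'Ford', 'Subaru'])  # sorted index
s.loc['E':'G']   # neither 'E' nor 'G' is a label; the slice is lexical
```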
You can also pass an Index
to .loc
:
Dodge 23
Dodge 10
Dodge 12
Dodge 11
Dodge 11
..
Dodge 18
Dodge 17
Dodge 14
Dodge 14
Dodge 11
Name: city08, Length: 2583, dtype: int64
If we duplicate 'Dodge'
in the Index
, the previous operation returns twice as many values, a combinatorial explosion:
Dodge 23
Dodge 10
Dodge 12
Dodge 11
Dodge 11
..
Dodge 18
Dodge 17
Dodge 14
Dodge 14
Dodge 11
Name: city08, Length: 5166, dtype: int64
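The doubling is easy to see on a small series: each label requested in the Index pulls out every matching row.

```python
import pandas as pd

s = pd.Series([23, 10], index=['Dodge', 'Dodge'])
len(s.loc[pd.Index(['Dodge'])])            # 2: both matches for one request
len(s.loc[pd.Index(['Dodge', 'Dodge'])])   # 4: both matches for each request
```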
You can also pass a boolean array to .loc
:
Alfa Romeo False
Ferrari False
Dodge False
Dodge False
Subaru False
...
Subaru False
Subaru False
Subaru False
Subaru False
Subaru False
Name: city08, Length: 41144, dtype: bool
Nissan 81
Toyota 81
Toyota 81
Ford 74
Nissan 84
...
Tesla 140
Tesla 115
Tesla 104
Tesla 98
Toyota 55
Name: city08, Length: 236, dtype: int64
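A minimal sketch of the boolean-array pattern:

```python
import pandas as pd

s = pd.Series([81, 40, 140], index=['Nissan', 'Ford', 'Tesla'])
mask = s > 50   # boolean array, aligned on the index
s.loc[mask]     # keeps only the rows where the mask is True
```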
You can use a function with .loc
. If I calculate the boolean array before taking into account the inflation, I get the wrong answer:
Melon 4.389
Carrots 3.069
dtype: float64
There is an implicit return statement in a lambda function. You can only put an expression in it, not a statement, and it is limited to a single line of code.
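A sketch of the callable pattern (the prices are illustrative): the lambda receives the series being indexed, so the mask is computed on the inflated values rather than the original ones.

```python
import pandas as pd

cost = pd.Series({'Melon': 3.99, 'Carrots': 2.79})
inflated = cost * 1.1
# the lambda is called with `inflated`, so the comparison uses inflated prices
inflated.loc[lambda s_: s_ > 3]
```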
.iloc Attribute
The .iloc attribute supports indexing with the following:
Because index positions are unique, we will always get the scalar value when indexing with .iloc
at a position:
If we want to return a series object, we can index it with a list of positions:
We can also use slices with .iloc
(they follow the half-open interval):
Saturn 21
Saturn 24
Saturn 21
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, dtype: int64
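The three cases in one small sketch:

```python
import pandas as pd

s = pd.Series([21, 24, 19], index=['Saturn', 'Saturn', 'Subaru'])
s.iloc[0]        # positions are unique -> always a scalar
s.iloc[[0, 2]]   # list of positions -> a Series
s.iloc[-2:]      # slice: half-open interval, here the last two rows
```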
You can use a NumPy array of booleans (or a Python list), but if you use what pandas calls a boolean array (a series of booleans), this will fail:
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
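A sketch of which mask types .iloc accepts:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.iloc[np.array([True, False, True])]   # NumPy boolean array: fine
s.iloc[[True, False, True]]             # plain Python list: fine
# s.iloc[pd.Series([True, False, True])]  # ValueError: cannot use an indexable as a mask
```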
The code below randomly pulls out six values:
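Something like this (the seed is mine, for reproducibility):

```python
import pandas as pd

s = pd.Series(range(100))
s.sample(6, random_state=42)   # six values drawn at random
```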
The filter
method will filter index labels by exact match (items
), substring (like
), or regex (regex
).
ValueError: cannot reindex on an axis with duplicate labels
Ford 18
Ford 16
Ford 17
Ford 17
Ford 15
..
Ford 26
Ford 19
Ford 21
Ford 18
Ford 19
Name: city08, Length: 3371, dtype: int64
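All three filter modes on a tiny hypothetical series (note that items, unlike like and regex, raises the reindex error above when the index has duplicate labels):

```python
import pandas as pd

s = pd.Series([18, 23, 19], index=['Ford', 'Dodge', 'Subaru'])
s.filter(items=['Ford'])   # exact label match
s.filter(like='odg')       # substring match -> Dodge
s.filter(regex='^S')       # regular expression -> Subaru
```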