Effective Pandas - Notes and Exercises

data analysis
python
An update on my progress in the book “Effective Pandas” by Matt Harrison.
Author

Vishal Bakshi

Published

July 6, 2023

In this blog post, I work through the book Effective Pandas by Matt Harrison. I’ll take notes, work through examples and end-of-chapter exercises.

Chapter 4: Series Introduction

Represent the following data in pure python:

Artist Data
0 145
1 142
2 38
3 13
series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}

series
{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}

The get function below can pull items out of this data structure based on the index:

def get(series, idx):
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]
get(series, 1)
142

The index method on a list returns the position of the first occurrence of the provided value:

[0, 1, 2, 3].index(1)
1

Below is an example that has string values for the index:

songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}
get(songs, 'John')
142

Create a Series object from a list:

import pandas as pd
songs2 = pd.Series([145, 142, 38, 13], name = 'counts')
songs2
0    145
1    142
2     38
3     13
Name: counts, dtype: int64

The series is one-dimensional. The leftmost column is the index, also called the axis. The data (145, 142, 38, 13) is also called the values of the series. A DataFrame has two axes, one for the rows and another for the columns.
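
As a quick sketch of the two axes, we can wrap songs2 into a one-column DataFrame (using .to_frame, covered in a later chapter) and inspect .axes:

songs2.to_frame().axes
# [RangeIndex(start=0, stop=4, step=1), Index(['counts'], dtype='object')]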

songs2.index
RangeIndex(start=0, stop=4, step=1)

The default values for an index are monotonically increasing integers. The index can be string-based as well (datatype for the index is object).

songs3 = pd.Series([145, 142, 38, 13],
                   name = 'counts',
                   index = ['Paul', 'John', 'George', 'Ringo'])
songs3
Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64
songs3.index
Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')

We can insert Python objects into a series:

class Foo:
    pass

ringo = pd.Series(
    ['Richard', 'Starkey', 13, Foo()],
    name = 'ringo')

ringo
0                                 Richard
1                                 Starkey
2                                      13
3    <__main__.Foo object at 0x135016ce0>
Name: ringo, dtype: object

The object data type is used for a series with string values, as well as for series with heterogeneous or mixed types.

Here is a series that has NaN in it:

import numpy as np
nan_series = pd.Series([2, np.nan],
                       index = ['Ono', 'Clapton'])
nan_series
Ono        2.0
Clapton    NaN
dtype: float64

float64 supports NaN while int64 does not. As of pandas 0.24, Int64 (nullable integer type) supports NaN.

.count ignores NaN values; .size does not.

nan_series.count()
1
nan_series.size
2
nan_series2 = pd.Series([2, None],
                        index = ['Ono', 'Clapton'],
                        dtype = 'Int64')
nan_series2
Ono           2
Clapton    <NA>
dtype: Int64
nan_series2.count()
1
# convert data type
nan_series.astype('Int64')
Ono           2
Clapton    <NA>
dtype: Int64

The Series object behaves similarly to a NumPy array.

numpy_ser = np.array([145, 142, 38, 13])
songs3[1], numpy_ser[1]
(142, 142)

They both have methods in common:

songs3.mean(), numpy_ser.mean()
(84.5, 84.5)

They both have a notion of a boolean array.

mask = songs3 > songs3.median()
mask
Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool
# use mask as a filter
songs3[mask]
Paul    145
John    142
Name: counts, dtype: int64
# NumPy equivalent
numpy_ser[numpy_ser > np.median(numpy_ser)]
array([145, 142])

You can indicate that data is categorical.

Categorical values:

  • Use less memory than strings
  • Improve performance
  • Can have an ordering
  • Can perform operations on categories
  • Enforce membership on values
s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype = 'category')
s
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

By default categories don’t have an ordering.

s.cat.ordered
False

Convert non-categorical series to an ordered category:

s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])

size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered = True)

s3 = s2.astype(size_type)
s3
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
# can perform comparisons on ordered categories
s3 > 's'
0     True
1     True
2    False
3    False
4    False
dtype: bool
# add ordering information to categorical data
s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']

For strings and dates converted to categorical types, we can still use the str or dt attributes on them:

s3.str.upper()
0      M
1      L
2    NaN
3      S
4    NaN
dtype: object
Method Description
pd.Series(data=None, index=None, dtype=None, name=None, copy=False) Create a series from data (sequence, dictionary or scalar)
s.index Access index of series.
s.astype(dtype, errors='raise') Cast a series to dtype. To ignore errors (and return original object) use errors='ignore'
s[boolean_array] Return values from s where boolean_array is True
s.cat.ordered Determine if a categorical series is ordered
s.cat.reorder_categories(new_categories, ordered=False) Add categories (potentially ordered) to the series. new_categories must include all categories.

Exercises

  1. Using Jupyter, create a series with the temperature values for the last seven days. Filter out the values below the mean.
  2. Using Jupyter, create a series with your favorite colors. Use a categorical type.
# temperature series
temps = pd.Series([88, 84, 84, 84, 88, 95, 97, 88])

temps[temps >= temps.mean()]
5    95
6    97
dtype: int64
# favorite colors
colors_series = pd.Series(['orange', 'coral', 'midnight green'], dtype = 'category')
colors_series
0            orange
1             coral
2    midnight green
dtype: category
Categories (3, object): ['coral', 'midnight green', 'orange']

Chapter 5: Series Deep Dive

# analyze the US Fuel Economy data
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

df = pd.read_csv(url)

city_mpg = df.city08
highway_mpg = df.highway08
/var/folders/5q/_bn7l90s177_2gq7rnhssjxm0000gn/T/ipykernel_52057/221626492.py:4: DtypeWarning: Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(url)
city_mpg
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64
highway_mpg
0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

Because the type is int64 we know that none of the values are missing.
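
A quick confirmation; since int64 cannot represent NaN, this must be False:

city_mpg.hasnans
# False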

The dir function lists the attributes of an object. A series has 400+ attributes:

len(dir(city_mpg))
412
len(dir(highway_mpg))
412

Functionality of series attributes:

  • Dunder methods provide many numeric operations, looping, attribute access, and index access. For the numeric operations, these return Series.
  • Corresponding operator methods for many of the numeric operations allow us to tweak the behavior.
  • Aggregate methods and properties which reduce or aggregate the values in a series down to a single scalar value.
  • Conversion methods. Some of these start with .to_ and export the data to other formats.
  • Manipulation methods that return Series objects with the same index.
  • Indexing and accessor methods and attributes that return Series or scalars.
  • String manipulation methods using .str.
  • Date manipulation methods using .dt.
  • Plotting methods using .plot.
  • Categorical manipulation methods using .cat.
  • Transformation methods.
  • Attributes such as .index and .dtype.
  • A bunch of private attributes (130 of them) that we’ll ignore.

Exercises

  1. Explore the documentation for five attributes of a series from Jupyter.
  2. How many attributes are found on the .str attribute? Look at the documentation for three of them.
  3. How many attributes are found on the .dt attribute? Look at the documentation for three of them.
city_mpg.values
array([19,  9, 23, ..., 18, 18, 16])
city_mpg.axes
[RangeIndex(start=0, stop=41144, step=1)]
city_mpg.empty
False
city_mpg.at[4]
17
city_mpg.loc[1:4]
1     9
2    23
3    10
4    17
Name: city08, dtype: int64
# 98 string attributes
len(dir(s2.str))
98
s2.str.cat(sep = ".")
'm.l.xs.s.xl'
s2.str.capitalize()
0     M
1     L
2    Xs
3     S
4    Xl
dtype: object
s2.str.endswith('l')
0    False
1     True
2    False
3    False
4     True
dtype: bool
dt_series = pd.Series(['2023-01-01', '2023-04-05', '2023-07-06'])

dt_series = pd.to_datetime(dt_series)
dt_series
0   2023-01-01
1   2023-04-05
2   2023-07-06
dtype: datetime64[ns]
len(dir(dt_series.dt))
83
dt_series.dt.day
0    1
1    5
2    6
dtype: int32
dt_series.dt.day_of_year
0      1
1     95
2    187
dtype: int32
dt_series.dt.daysinmonth
0    31
1    30
2    31
dtype: int32

Chapter 6: Operators (& Dunder Methods)

These are the protocols that determine how the Python language reacts to operations.

2 + 4
6
# under the covers is
(2).__add__(4)
6
(city_mpg + highway_mpg) / 2
0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

When you operate with two series, pandas will align the index before performing the operation. Because of index alignment, you will want to make sure that the indexes:

  • are unique
  • are common to both series

# example of series with repeated and non-common indexes
s1 = pd.Series([10, 20, 30], index=[1,2,2])
s2 = pd.Series([35, 44, 53], index=[2,2,4], name = 's2')
s1
1    10
2    20
2    30
dtype: int64
s2
2    35
2    44
4    53
Name: s2, dtype: int64
# index 1 and 4 have NaN
# index 2 has four results
s1 + s2
1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

When you perform math operations with a scalar, pandas broadcasts the operation to all values. A numeric pandas series is a block of memory, and modern CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply a math operation to the block of memory.
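
For example, multiplying the s1 series from above by a scalar broadcasts the operation to every value (a quick sketch):

s1 * 2
# 1    20
# 2    40
# 2    60
# dtype: int64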

# use `fill_value` parameter to replace missing operands
s1.add(s2, fill_value = 0)
1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64

Chaining makes the code easy to read and understand:

(city_mpg
    .add(highway_mpg)
    .div(2))
0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
Method Operator Description
s.add(s2) s + s2 Adds series
s.radd(s2) s2 + s Adds series
s.sub(s2) s - s2 Subtracts series
s.rsub(s2) s2 - s Subtracts series
s.mul(s2) s * s2 Multiplies series
s.multiply(s2) s * s2 Multiplies series
s.rmul(s2) s2 * s Multiplies series
s.div(s2) s / s2 Divides series
s.truediv(s2) s / s2 Divides series
s.rdiv(s2) s2 / s Divides series
s.rtruediv(s2) s2 / s Divides series
s.mod(s2) s % s2 Modulo of series division
s.rmod(s2) s2 % s Modulo of series division
s.floordiv(s2) s // s2 Floor divide series
s.rfloordiv(s2) s2 // s Floor divide series
s.pow(s2) s ** s2 Exponential power of series
s.rpow(s2) s2 ** s Exponential power of series
s.eq(s2) s2 == s Elementwise equals of series
s.ne(s2) s2 != s Elementwise not equals of series
s.gt(s2) s > s2 Elementwise greater than of series
s.ge(s2) s >= s2 Elementwise greater than or equals of series
s.lt(s2) s < s2 Elementwise less than of series
s.le(s2) s <= s2 Elementwise less than or equals of series
np.invert(s) ~s Elementwise inversion of boolean series (no pandas method)
np.logical_and(s, s2) s & s2 Elementwise logical and of boolean series (no pandas method)
np.logical_or(s, s2) s | s2 Elementwise logical or of boolean series (no pandas method)

Exercises

With a dataset of your choice:

  1. Add a numeric series to itself.
  2. Add 10 to a numeric series.
  3. Add a numeric series to itself using the .add method.
  4. Read the documentation for the .add method.
city_mpg + city_mpg
0        38
1        18
2        46
3        20
4        34
         ..
41139    38
41140    40
41141    36
41142    36
41143    32
Name: city08, Length: 41144, dtype: int64
city_mpg + 10
0        29
1        19
2        33
3        20
4        27
         ..
41139    29
41140    30
41141    28
41142    28
41143    26
Name: city08, Length: 41144, dtype: int64
city_mpg.add(city_mpg)
0        38
1        18
2        46
3        20
4        34
         ..
41139    38
41140    40
41141    36
41142    36
41143    32
Name: city08, Length: 41144, dtype: int64
# experimenting with fill_value parameter
nan_series3 = pd.Series([2, None])
nan_series4 = pd.Series([3, None])
nan_series3
0    2.0
1    NaN
dtype: float64
nan_series4
0    3.0
1    NaN
dtype: float64
# two corresponding NaN values stay NaN
# even with fill_value = 0
nan_series3.add(nan_series4, fill_value=0)
0    5.0
1    NaN
dtype: float64

Chapter 7: Aggregate Methods

Aggregate methods collapse the values of a series down to a scalar.

# calculate the mean
city_mpg.mean()
18.369045304297103
city_mpg.is_unique
False
pd.Series([1,2,3]).is_unique
True
city_mpg.is_monotonic_increasing
False
pd.Series([1,2,3]).is_monotonic_increasing
True
# default is median (50% quantile)
city_mpg.quantile()
17.0
city_mpg.quantile(0.9)
24.0
# multiple quantiles returns a Series
city_mpg.quantile([0.1, 0.5, 0.9])
0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: float64

If you want the count of values that meet some criteria, you can use the .sum method:

# count of cars with mileage greater than 20
(city_mpg
     .gt(20)
     .sum()
)
10272
# percentage of cars with mileage greater than 20
(city_mpg
     .gt(20)
     .mul(100)
     .mean()
)
24.965973167412017

Observe the .mul(100).mean() calculation on a simpler Series:

(pd.Series([1,2,3,4])
    .gt(2)
    .mul(100)
)
0      0
1      0
2    100
3    100
dtype: int64
(pd.Series([1,2,3,4])
     .gt(2)
     .mul(100)
     .mean()   
)
50.0

If you sum up a series of boolean values, the result is the count of True values. If you take the mean of a series of boolean values, the result is the fraction of values that are True.
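
A minimal sketch on a hand-built boolean series:

bools = pd.Series([True, False, True, True])
bools.sum()   # 3 (count of True values)
bools.mean()  # 0.75 (fraction of True values)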

.agg can perform multiple operations.

city_mpg.agg('mean')
18.369045304297103
def second_to_last(s):
    return s.iloc[-2]
city_mpg.agg(['mean', np.var, max, second_to_last])
mean               18.369045
var                62.503036
max               150.000000
second_to_last     18.000000
Name: city08, dtype: float64

Aggregation strings and descriptions:

Method Description
'all' Returns True if every value is truthy.
'any' Returns True if any value is truthy.
'autocorr' Returns Pearson correlation of series with shifted self. Can override lag as keyword argument (default is 1).
'corr' Returns Pearson correlation of series with other series. Need to specify other
'count' Returns count of non-missing values.
'cov' Returns covariance of series with other series. Need to specify other
'dtype' Type of the series.
'dtypes' Type of the series.
'empty' True if no values in series.
'hasnans' True if missing values in series.
'idxmax' Returns index value of maximum value.
'idxmin' Returns index value of minimum value.
'is_monotonic' True if values always increase.
'is_monotonic_decreasing' True if values always decrease.
'is_monotonic_increasing' True if values always increase.
'kurt' Returns “excess” kurtosis (0 is normal distribution). Values greater than 0 have more outliers than normal.
'mad' Returns the mean absolute deviation.
'max' Returns the maximum value.
'mean' Returns the mean value.
'median' Returns the median value.
'min' Returns the minimum value.
'nbytes' Returns the number of bytes of the data.
'ndim' Return the number of dimensions (1) of the data.
'nunique' Returns the count of unique values.
'quantile' Returns the median value. Can override q to specify other quantile.
'sem' Returns the unbiased standard error.
'size' Returns the size of the data.
'skew' Returns the unbiased skew of the data. Negative indicates tail is on the left side.
'std' Returns the standard deviation of the data.
'sum' Returns the sum of the series.

Aggregation methods and properties:

Method Description
s.agg(func=None, axis=0, *args, **kwargs) Returns a scalar if func is a single aggregation function. Returns a series if a list of aggregations are passed to func.
s.all(axis=0, bool_only=None, skipna=True, level=None) Returns True if every value is truthy. Otherwise False.
s.any(axis=0, bool_only=None, skipna=True, level=None) Returns True if at least one value is truthy. Otherwise False.
s.autocorr(lag=1) Returns Pearson correlation between s and shifted s.
s.corr(other, method='pearson') Returns correlation coefficient for 'pearson', 'spearman', 'kendall', or a callable.
s.cov(other, min_periods=None) Returns covariance.
s.max(axis=None, skipna=None, level=None, numeric_only=None) Returns maximum value.
s.min(axis=None, skipna=None, level=None, numeric_only=None) Returns minimum value.
s.mean(axis=None, skipna=None, level=None, numeric_only=None) Returns mean value.
s.median(axis=None, skipna=None, level=None, numeric_only=None) Returns median value.
s.prod(axis=None, skipna=None, level=None, numeric_only=None, min_count=0) Returns product of s values.
s.quantile(q=0.5, interpolation='linear') Returns 50% quantile by default. Returns Series if q is a list.
s.sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) Returns unbiased standard error of mean.
s.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) Returns sample standard deviation.
s.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) Returns unbiased variance.
s.skew(axis=None, skipna=None, level=None, numeric_only=None) Returns unbiased skew.
s.kurtosis(axis=None, skipna=None, level=None, numeric_only=None) Returns unbiased kurtosis.
s.nunique(dropna=True) Returns count of unique items.
s.count(level=None) Returns count of non-missing items.
s.size Number of items in series. (Property)
s.is_unique True if all values are unique.
s.is_monotonic True if all values are increasing.
s.is_monotonic_increasing True if all values are increasing.
s.is_monotonic_decreasing True if all values are decreasing.

Exercises

With a dataset of your choice:

  1. Find the count of non-missing values of a series.
  2. Find the number of entries of a series.
  3. Find the number of unique entries of a series.
  4. Find the mean value of a series.
  5. Find the maximum value of a series.
  6. Use the .agg method to find all of the above.
city_mpg.count()
41144
city_mpg.size
41144
city_mpg.nunique()
105
city_mpg.mean()
18.369045304297103
city_mpg.max()
150
city_mpg.agg(['count', 'size', 'nunique', 'mean', 'max'])
count      41144.000000
size       41144.000000
nunique      105.000000
mean          18.369045
max          150.000000
Name: city08, dtype: float64

Chapter 8: Conversion Methods

8.1 Automatic Conversion

.convert_dtypes tries to convert a Series to a type that supports pd.NA. In the case of our city_mpg series it will change the type from int64 to Int64.

city_mpg.convert_dtypes()
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64

.astype works more explicitly. The maximum 8-bit integer is 127, so we need a 16-bit integer for city_mpg since its max is 150.

city_mpg.max()
150
city_mpg.astype('Int16')
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int16
city_mpg.astype('Int8')
TypeError: cannot safely cast non-equivalent int64 to int8

If you can use a narrower type, you can cut back on memory usage, giving you memory to process more data.

Use NumPy to inspect limits on integer and float types:

np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
np.iinfo('uint8')
iinfo(min=0, max=255, dtype=uint8)
np.finfo('float16')
finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)
np.finfo('float64')
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)

8.2 Memory Usage

Use the .nbytes property or the .memory_usage method to calculate memory usage of the Series.

Pass deep=True to .memory_usage when dealing with object types in the Series.

city_mpg.nbytes
329152
city_mpg.astype('Int16').nbytes
123432

To get the amount of memory that includes strings in the Series (like the make column), we need to use the .memory_usage method:

df.make
0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
df.make.nbytes
329152
df.make.memory_usage()
329280
df.make.memory_usage(deep=True)
2606395

.memory_usage includes the index memory and can include the contribution from object types. .nbytes is just the memory that the data is using and not the ancillary parts of the Series.
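
A tiny sketch of the distinction on a hand-built object series (the names here are my own):

tiny = pd.Series(['a', 'bb'])
tiny.nbytes                   # just the buffer of values (object pointers)
tiny.memory_usage()           # values plus the index
tiny.memory_usage(deep=True)  # also counts the Python string objects themselves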

Converting to categorical will save a lot of memory for the make data:

(df.make
 .astype('category')
 .memory_usage(deep=True)
)
95888

8.3 String and Category Types

city_mpg.astype(str)
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object
city_mpg.astype('category')
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]

When you convert to categorical data, pandas no longer uses a Python string for each value but optimizes the storage, so columns with many duplicate values can see large memory savings.

8.4 Ordered Categories

To create ordered categories you need to define your own CategoricalDtype:

values = pd.Series(sorted(set(city_mpg)))
city_type = pd.CategoricalDtype(categories=values, ordered=True)
city_mpg.astype(city_type)
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
String or Type Description
str 'str' Convert to Python string.
'string' Convert type to pandas string (supports pd.NA).
int 'int' 'int64' Convert type to NumPy int64.
'int32' 'uint32' Convert type to 32-bit signed or unsigned NumPy integer (can also use 16 and 8).
'Int64' Convert type to pandas Int64 (supports pd.NA). Might complain when you convert floats or strings.
float 'float' 'float64' Convert type to NumPy float64 (can also use 32 or 16).
'category' Convert type to categorical (supports pd.NA). Can also use an instance of CategoricalDtype.
dates Don't use astype for date conversion; use pd.to_datetime.

8.5 Converting to Other Types

Converting to Python lists (and operating on them) will slow down your code significantly.

You can convert a Series into a DataFrame:

city_mpg.to_frame()
city08
0 19
1 9
2 23
3 10
4 17
... ...
41139 19
41140 20
41141 18
41142 18
41143 16

41144 rows × 1 columns

Method Description
s.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True) Convert types to the appropriate pandas 1.x types (that support pd.NA). Doesn't try to reduce the size of integer or float types.
s.astype(dtype, copy=True, errors='raise') Cast series into particular type. If errors='ignore' then return original series on error.
pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True) Convert arg (a series) into datetime. Use format to specify strftime string.
s.to_numpy(dtype=None, copy=False, na_value=object, **kwargs) Convert the series to a NumPy array.
s.values Convert the series to a NumPy array.
s.to_frame(name=None) Return a dataframe representation of the series.
pd.CategoricalDtype(categories=None, ordered=False) Create a type for categorical data.

8.7 Exercises

With a dataset of your choice:

1. Convert a numeric column to a smaller type.

# currently a float64 type
df.barrels08
0        15.695714
1        29.964545
2        12.207778
3        29.964545
4        17.347895
           ...    
41139    14.982273
41140    14.330870
41141    15.695714
41142    15.695714
41143    18.311667
Name: barrels08, Length: 41144, dtype: float64
# convert to float16
df.barrels08.astype('float16')
0        15.695312
1        29.968750
2        12.210938
3        29.968750
4        17.343750
           ...    
41139    14.984375
41140    14.328125
41141    15.695312
41142    15.695312
41143    18.312500
Name: barrels08, Length: 41144, dtype: float16

2. Calculate the memory savings by converting to smaller numeric types.

df.barrels08.memory_usage() - df.barrels08.astype('float16').memory_usage()
246864

3. Convert a string column into a categorical type.

df.make
0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object
values = pd.Series(sorted(set(df.make)))
make_type = pd.CategoricalDtype(categories=values, ordered=False)
df.make.astype(make_type)
0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

4. Calculate the memory savings by converting to a categorical type.

df.make.memory_usage(deep=True) - df.make.astype(make_type).memory_usage(deep=True)
2510507

Chapter 9: Manipulation Methods

Comparing the non-broadcasted .apply method with vectorized code:

# non-vectorized function to check if value is greater than 20
def gt20(val): 
    return val > 20
%%timeit
city_mpg.apply(gt20)
7.08 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
city_mpg.gt(20)
174 µs ± 44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The broadcasted .gt method is about 40 times faster than the non-broadcasted .apply call.

Show the top 5 makes and label everything else as Other:

# top 5 makes
top5 = df.make.value_counts().index[:5]

# function to use in apply
def generalize_top5(val):
    if val in top5:
        return val
    return 'Other'
%%timeit
df.make.apply(generalize_top5)
48.2 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

A (10 times) faster and more idiomatic manner of doing this uses the .where method, which keeps values from the series it is called on where the boolean array is true. If the boolean array is false, it uses the value of the second parameter, other:

%%timeit
df.make.where(df.make.isin(top5), other='Other')
3.68 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.make.where(df.make.isin(top5), other='Other')
0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object

The complement of .where is .mask: wherever the condition is False it keeps the original values; where it is True it replaces the value with the other parameter.

df.make.mask(~df.make.isin(top5), other='Other')
0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object
%%timeit
df.make.mask(~df.make.isin(top5), other='Other')
3.36 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The tilde ~ performs an inversion of the boolean array, switching all true values to false and vice versa.
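
For example:

~pd.Series([True, False, True])
# 0    False
# 1     True
# 2    False
# dtype: bool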

9.2 If Else with Pandas

There is no direct pandas construct for if/elif/else logic like the following: keep the top five makes, use Top10 for the remainder of the top ten makes, and Other for the rest. With .apply it looks like this:

vc = df.make.value_counts()
top5 = vc.index[:5]
top10 = vc.index[:10]
def generalize(val):
    if val in top5:
        return val
    elif val in top10:
        return 'Top10'
    else:
        return 'Other'
df.make.apply(generalize)
0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object
%%timeit
df.make.apply(generalize)
76.4 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

To replicate in pandas, chain calls to .where:

(df.make
 .where(df.make.isin(top5), 'Top10')
 .where(df.make.isin(top10), 'Other')
)
0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Name: make, Length: 41144, dtype: object
%%timeit
(df.make
 .where(df.make.isin(top5), 'Top10')
 .where(df.make.isin(top10), 'Other')
)
6.05 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The pandas approach is still about 13 times faster.

The select function in NumPy works with pandas series. The interface takes a list of boolean arrays and a list with corresponding replacement values.

np.select([df.make.isin(top5), df.make.isin(top10)], [df.make, 'Top10'], 'Other')
array(['Other', 'Other', 'Dodge', ..., 'Other', 'Other', 'Other'],
      dtype=object)
%%timeit
np.select([df.make.isin(top5), df.make.isin(top10)], [df.make, 'Top10'], 'Other')
19.8 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

You can wrap it in a Series. For longer if statements I like this syntax better than chaining .where calls because I think it is easier to understand.

pd.Series(np.select([df.make.isin(top5), df.make.isin(top10)], [df.make, 'Top10'], 'Other'), index=df.make.index)
0        Other
1        Other
2        Dodge
3        Dodge
4        Other
         ...  
41139    Other
41140    Other
41141    Other
41142    Other
41143    Other
Length: 41144, dtype: object

9.3 Missing Data

Count the number of missing items with .isna().sum():

(df.cylinders
 .isna()
 .sum()
)
206

Let’s index where the values are missing in the cylinders column and then show what those makes are:

missing = df.cylinders.isna()
df.make.loc[missing]
7138     Nissan
7139     Toyota
8143     Toyota
8144       Ford
8146       Ford
          ...  
34563     Tesla
34564     Tesla
34565     Tesla
34566     Tesla
34567     Tesla
Name: make, Length: 206, dtype: object

9.4 Filling in Missing Data

It seems like cylinders are missing for cars that are electric (they have zero cylinders).

df.cylinders[df.cylinders.isna()]
7138    NaN
7139    NaN
8143    NaN
8144    NaN
8146    NaN
         ..
34563   NaN
34564   NaN
34565   NaN
34566   NaN
34567   NaN
Name: cylinders, Length: 206, dtype: float64
df.cylinders.fillna(0).loc[7136:7141]
7136    6.0
7137    6.0
7138    0.0
7139    0.0
7140    6.0
7141    6.0
Name: cylinders, dtype: float64

9.5 Interpolating Data

temp = pd.Series([32, 40, None, 42, 39, 32])
temp
0    32.0
1    40.0
2     NaN
3    42.0
4    39.0
5    32.0
dtype: float64
temp.interpolate()
0    32.0
1    40.0
2    41.0
3    42.0
4    39.0
5    32.0
dtype: float64

9.6 Clipping Data

city_mpg.loc[:446]
0      19
1       9
2      23
3      10
4      17
       ..
442    15
443    15
444    15
445    15
446    31
Name: city08, Length: 447, dtype: int64
(city_mpg
     .loc[:446]
     .clip(lower=city_mpg.quantile(0.05),
           upper=city_mpg.quantile(0.95))
)
0      19
1      11
2      23
3      11
4      17
       ..
442    15
443    15
444    15
445    15
446    27
Name: city08, Length: 447, dtype: int64

.clip uses .where under the hood.
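
A rough sketch of that equivalence using two chained .where calls (my approximation of the idea, not pandas' actual implementation; the result may come back as float64 since the quantiles are floats):

lo = city_mpg.quantile(0.05)
hi = city_mpg.quantile(0.95)
sub = city_mpg.loc[:446]

# raise anything below lo up to lo, then cap anything above hi at hi
(sub
     .where(sub.ge(lo), lo)
     .where(sub.le(hi), hi)
)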

9.7 Sorting Values

The .sort_values method will sort the values in ascending order and also rearrange the index accordingly.

city_mpg.sort_values()
7901       6
34557      6
37161      6
21060      6
35887      6
        ... 
34563    138
34564    140
32599    150
31256    150
33423    150
Name: city08, Length: 41144, dtype: int64

Because of index alignment, you can still do math operations on a sorted series:

(city_mpg.sort_values() + highway_mpg) / 2
0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

9.8 Sorting the Index

Below we unsort the index by sorting the values, then essentially revert that:

city_mpg.sort_values().sort_index()
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

9.9 Dropping Duplicates

keep='first' is the default value and keeps the first duplicate value found.

keep='last' keeps the last duplicate value found.

keep=False will remove any duplicated values (including the initial value).

drop_duplicates keeps the original index.

city_mpg.drop_duplicates()
0         19
1          9
2         23
3         10
4         17
        ... 
34364    127
34409    114
34564    140
34565    115
34566    104
Name: city08, Length: 105, dtype: int64
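
A small sketch of the keep options on a short series:

dup = pd.Series([1, 1, 2])
dup.drop_duplicates()             # keeps index 0 and 2
dup.drop_duplicates(keep='last')  # keeps index 1 and 2
dup.drop_duplicates(keep=False)   # keeps only index 2 (the value 2)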

9.10 Ranking Data

The .rank method will return a series that keeps the original index but uses the ranks of values from the original series. By default, if two values are the same, their rank will be the average of the positions they take. You can specify method='min' to put equal values in the same rank and method='dense' to not skip any positions:

city_mpg.rank()
0        27060.5
1          235.5
2        35830.0
3          607.5
4        19484.0
          ...   
41139    27060.5
41140    29719.5
41141    23528.0
41142    23528.0
41143    15479.0
Name: city08, Length: 41144, dtype: float64
city_mpg.rank(method='min')
0        25555.0
1          136.0
2        35119.0
3          336.0
4        17467.0
          ...   
41139    25555.0
41140    28567.0
41141    21502.0
41142    21502.0
41143    13492.0
Name: city08, Length: 41144, dtype: float64
city_mpg.rank(method='dense')
0        14.0
1         4.0
2        18.0
3         5.0
4        12.0
         ... 
41139    14.0
41140    15.0
41141    13.0
41142    13.0
41143    11.0
Name: city08, Length: 41144, dtype: float64
# a simpler example
pd.Series([1,1,2,3]).rank()
0    1.5
1    1.5
2    3.0
3    4.0
dtype: float64
pd.Series([1,1,2,3]).rank(method='min')
0    1.0
1    1.0
2    3.0
3    4.0
dtype: float64
pd.Series([1,1,2,3]).rank(method='dense')
0    1.0
1    1.0
2    2.0
3    3.0
dtype: float64

9.11 Replacing Data

The .replace method allows you to map values to new values.

df.make.replace('Subaru', 'SUBARU')
0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            SUBARU
            ...    
41139        SUBARU
41140        SUBARU
41141        SUBARU
41142        SUBARU
41143        SUBARU
Name: make, Length: 41144, dtype: object
# you can also use regex
df.make.replace(r'(Fer)ra(r.*)', value=r'\2-other-\1', regex=True)
0          Alfa Romeo
1        ri-other-Fer
2               Dodge
3               Dodge
4              Subaru
             ...     
41139          Subaru
41140          Subaru
41141          Subaru
41142          Subaru
41143          Subaru
Name: make, Length: 41144, dtype: object

9.12 Binning Data

Using the cut function, you can create bins of equal width:

pd.cut(city_mpg, 10)
0        (5.856, 20.4]
1        (5.856, 20.4]
2         (20.4, 34.8]
3        (5.856, 20.4]
4        (5.856, 20.4]
             ...      
41139    (5.856, 20.4]
41140    (5.856, 20.4]
41141    (5.856, 20.4]
41142    (5.856, 20.4]
41143    (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]

You can specify the bin edges yourself. In the following, 5 bins are created (so you need to provide 6 edges):

pd.cut(city_mpg, [0, 10, 20, 40, 70, 150])
0        (10, 20]
1         (0, 10]
2        (20, 40]
3         (0, 10]
4        (10, 20]
           ...   
41139    (10, 20]
41140    (10, 20]
41141    (10, 20]
41142    (10, 20]
41143    (10, 20]
Name: city08, Length: 41144, dtype: category
Categories (5, interval[int64, right]): [(0, 10] < (10, 20] < (20, 40] < (40, 70] < (70, 150]]
city_mpg
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

Note the bins form half-open intervals: they do not include the start value but do include the end value. If the city_mpg series had values of 0 or values above 150, they would be missing after binning the series.
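
A quick sketch of values falling outside the edges:

pd.cut(pd.Series([0, 5, 200]), [0, 10, 150])
# 0 is excluded (the left edge is open) and 200 is out of bounds,
# so both become NaN; 5 lands in (0, 10]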

If you wanted 10 bins that had approximately the same number of entries in each bin (rather than each bin width being the same) use the qcut function:

pd.qcut(city_mpg, 10)
0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]

Both allow you to set the labels to use instead of the categorical intervals they generate:

pd.qcut(city_mpg, 10, labels=list(range(1,11)))
0        7
1        1
2        9
3        1
4        5
        ..
41139    7
41140    7
41141    6
41142    6
41143    4
Name: city08, Length: 41144, dtype: category
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]

Manipulation methods and properties:

Method Description
s.apply(func, convert_dtype=True, args=(), **kwds) Pass in a NumPy function that works on the series, or a Python function that works on a single value. args and kwds are arguments for func. Returns a series, or dataframe if func returns a series.
s.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False) Pass in a boolean series/dataframe, list, or callable as cond. If the value is True, keep it, otherwise use other value. If it is a function, it takes a series and should return a boolean sequence.
np.select(condlist, choicelist, default=0) Pass in a list of boolean arrays for condlist. If the value is true use the corresponding value from choicelist. If multiple conditions are True, only use the first. Returns a NumPy array.
s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) Pass in a scalar, dict, series or dataframe for value. If it is a scalar, use that value, otherwise use the index from the old value to the new value.
s.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs) Perform interpolation with missing values. method may be linear, time among others.
s.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs) Return a new series with values clipped to lower and upper.
s.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) Return a series with values sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'.
s.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None) Return a series with index sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'.
s.drop_duplicates(keep='first', inplace=False) Drop duplicates. keep may be 'first', 'last', or False. (If False, it removes all values that were duplicated).
s.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) Return a series with numerical ranks. method allows you to specify tie handling. 'average', 'min', 'max', 'first' (uses order they appear in series), 'dense' (like 'min', but rank only increases by one after tie). na_option allows you to specify NaN handling. 'keep' (stay at NaN), 'top' (move to smallest), 'bottom' (move to largest).
s.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad') Return a series with new values. to_replace can be many things. If it is a string, number or regular expression, you can replace it with a scalar value. It can also be a list of those things which requires value to be a list of the same size. Finally, it can be a dictionary mapping old values to new values.
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True) Bin values from x (a series). If bins is an integer, use equal-width bins. If bins is a list of numbers (defining minimum and maximum positions) use those for the edges. right defines whether the right edge is open or closed. labels allows you to specify the bin names. Out of bounds values will be missing.
pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise') Bin values from x (a series) into q equal sized bins. Alternatively, can pass in a list of quantile edges. Out of bounds values will be missing.

9.14 Exercises

1. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using .apply.

city_mpg
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64
city_mpg.mean()
18.369045304297103
def generalize_mean(x, mean_val):
    if x >= mean_val:
        return 'high'
    return 'low'

city_mpg.apply(generalize_mean, mean_val=city_mpg.mean())
0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Name: city08, Length: 41144, dtype: object

2. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using np.select.

pd.Series(np.select([city_mpg.gt(city_mpg.mean())], ['high'], 'low'))
0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Length: 41144, dtype: object

3. Time the differences between the previous two solutions to see which is faster.

%%timeit
city_mpg.apply(generalize_mean, mean_val=city_mpg.mean())
22.3 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.Series(np.select([city_mpg.gt(city_mpg.mean())], ['high'], 'low'))
4.45 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.select is around 5 times as fast as .apply in this exercise.

4. Replace the missing values of a numeric series with the median value.

df.cylinders.isna()[7136:7141]
7136    False
7137    False
7138     True
7139     True
7140    False
Name: cylinders, dtype: bool
df.cylinders.fillna(df.cylinders.median())[7136:7141]
7136    6.0
7137    6.0
7138    6.0
7139    6.0
7140    6.0
Name: cylinders, dtype: float64

5. Clip the values of a numeric series to between the 10th and 90th percentiles.

clip_s = pd.Series([1,2,3,4,5,6,7,8,9,10])
(clip_s
     .clip(lower=clip_s.quantile(0.1),
           upper=clip_s.quantile(0.9))
)
0    1.9
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    9.0
9    9.1
dtype: float64

6. Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.

df.fuelType.unique()
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Gasoline or natural gas',
       'Gasoline or E85', 'Electricity', 'Gasoline or propane',
       'Premium or E85', 'Midgrade', 'Premium Gas or Electricity',
       'Regular Gas and Electricity', 'Premium and Electricity',
       'Regular Gas or Electricity'], dtype=object)
top5 = df.fuelType.value_counts().index[:5]
top5
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity'], dtype='object', name='fuelType')
df.fuelType.where(df.fuelType.isin(top5), other='Other')
0        Regular
1        Regular
2        Regular
3        Regular
4        Premium
          ...   
41139    Regular
41140    Regular
41141    Regular
41142    Regular
41143    Premium
Name: fuelType, Length: 41144, dtype: object

7. Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.

top10 = df.fuelType.value_counts().index[:10]
top10
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity',
       'Premium or E85', 'Midgrade', 'CNG', 'Premium and Electricity',
       'Regular Gas and Electricity'],
      dtype='object', name='fuelType')
df.fuelType.where(df.fuelType.isin(top10), other='Other')
0        Regular
1        Regular
2        Regular
3        Regular
4        Premium
          ...   
41139    Regular
41140    Regular
41141    Regular
41142    Regular
41143    Premium
Name: fuelType, Length: 41144, dtype: object

8. Make a function that takes a categorical series and a number (n) and returns a series that replaces any value that is not in the top n most frequent values with 'Other'.

def top_n_categorical(s, n):
    top_n = s.value_counts().index[:n]
    return s.where(s.isin(top_n), other='Other')
s = top_n_categorical(df.fuelType, 10)
s.unique()
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Other', 'Gasoline or E85',
       'Electricity', 'Premium or E85', 'Midgrade',
       'Regular Gas and Electricity', 'Premium and Electricity'],
      dtype=object)
s = top_n_categorical(df.fuelType, 5)
s.unique()
array(['Regular', 'Premium', 'Diesel', 'Other', 'Gasoline or E85',
       'Electricity'], dtype=object)

9. Using a numeric column, bin it into 10 groups that have the same width.

pd.cut(city_mpg, 10)
0        (5.856, 20.4]
1        (5.856, 20.4]
2         (20.4, 34.8]
3        (5.856, 20.4]
4        (5.856, 20.4]
             ...      
41139    (5.856, 20.4]
41140    (5.856, 20.4]
41141    (5.856, 20.4]
41142    (5.856, 20.4]
41143    (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]

10. Using a numeric column, bin it into 10 groups that have approximately the same number of entries in each bin.

pd.qcut(city_mpg, 10)
0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]

This Stack Overflow response addresses the error below:

pd.qcut(df.cylinders, 10)
ValueError: Bin edges must be unique: array([ 2.,  4.,  4.,  4.,  5.,  6.,  6.,  6.,  8.,  8., 16.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
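
The error message itself suggests the fix: passing duplicates='drop' collapses the repeated edges, so the call should succeed, at the cost of fewer than 10 bins:

pd.qcut(df.cylinders, 10, duplicates='drop')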

If I decrease the number of bins to 5, the error is not raised:

pd.qcut(df.cylinders, 5)
0        (1.999, 4.0]
1         (8.0, 16.0]
2        (1.999, 4.0]
3          (6.0, 8.0]
4        (1.999, 4.0]
             ...     
41139    (1.999, 4.0]
41140    (1.999, 4.0]
41141    (1.999, 4.0]
41142    (1.999, 4.0]
41143    (1.999, 4.0]
Name: cylinders, Length: 41144, dtype: category
Categories (5, interval[float64, right]): [(1.999, 4.0] < (4.0, 5.0] < (5.0, 6.0] < (6.0, 8.0] < (8.0, 16.0]]

Chapter 10: Indexing Operations

Both a series and a dataframe have an index. Both types support the Python indexing operator ([]). Both have attributes .loc and .iloc that you can index against.

10.1 Prepping the Data and Renaming the Index

Use the .rename method to change the index labels. We can pass in a dictionary to map the previous index label to the new label:

import itertools

dict(itertools.islice(df.make.to_dict().items(), 10))
{0: 'Alfa Romeo',
 1: 'Ferrari',
 2: 'Dodge',
 3: 'Dodge',
 4: 'Subaru',
 5: 'Subaru',
 6: 'Subaru',
 7: 'Toyota',
 8: 'Toyota',
 9: 'Toyota'}
city2 = city_mpg.rename(df.make.to_dict())
city2
Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64
# view the index
city2.index
Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
       'Toyota', 'Toyota', 'Toyota',
       ...
       'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
       'Subaru', 'Subaru', 'Subaru'],
      dtype='object', length=41144)

The .rename method also accepts a series, a scalar, a function that takes an old label and returns a new label, or a sequence. When we pass in a series and the index values are the same, the values from the series that we passed in are used as the index.

city2 = city_mpg.rename(df.make)
city2
Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: city08, Length: 41144, dtype: int64

If you pass a scalar value (a single string) into .rename the index will stay the same but the .name attribute of the series will update:

city2.rename('citympg')
Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
              ..
Subaru        19
Subaru        20
Subaru        18
Subaru        18
Subaru        16
Name: citympg, Length: 41144, dtype: int64
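
The function form also works; a small sketch that receives each old label and returns a new one:

# the callable gets each index label and returns the replacement
city2.rename(lambda make: make.upper())
# Alfa Romeo -> ALFA ROMEO, Ferrari -> FERRARI, ...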

10.2 Resetting the Index

.reset_index by default will return a dataframe, moving the current index into a new column:

city2.reset_index()
index city08
0 Alfa Romeo 19
1 Ferrari 9
2 Dodge 23
3 Dodge 10
4 Subaru 17
... ... ...
41139 Subaru 19
41140 Subaru 20
41141 Subaru 18
41142 Subaru 18
41143 Subaru 16

41144 rows × 2 columns

drop=True drops the current index and returns a Series.

city2.reset_index(drop=True)
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

Note that .sort_values and .sort_index keep the same index but just rearrange the order so they do not impact operations that align on the index.

10.3 The .loc Attribute

The .loc attribute deals with index labels. You can pass the following into an index operation on .loc:

  • A scalar value of one of the index labels.
  • A list of index labels.
  • A slice of labels (closed interval so it includes the stop value).
  • An index.
  • A boolean array (same index labels as the series, but with True and False values).
  • A function that accepts a series and returns one of the above.

If there are duplicate labels in the index, and you pass in a scalar with the label of an index, it will return a series. If there is only one value for that label it will return a scalar.

city2.loc['Subaru']
Subaru    17
Subaru    21
Subaru    22
Subaru    19
Subaru    20
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 885, dtype: int64
city2.loc['Fisker']
20

If you want to guarantee that a series is returned, pass in a list rather than passing in a scalar value:

city2.loc[['Fisker']]
Fisker    20
Name: city08, dtype: int64
city2.loc[['Ferrari', 'Lamborghini']]
Ferrari         9
Ferrari        12
Ferrari        11
Ferrari        10
Ferrari        11
               ..
Lamborghini     6
Lamborghini     8
Lamborghini     8
Lamborghini     8
Lamborghini     8
Name: city08, Length: 357, dtype: int64

Sort the index if you are slicing with duplicate index labels:

city2.loc['Ferrari':'Lamborghini']
KeyError: "Cannot get left slice bound for non-unique label: 'Ferrari'"
city2.sort_index().loc['Ferrari':'Lamborghini']
Ferrari        10
Ferrari        13
Ferrari        13
Ferrari         9
Ferrari        10
               ..
Lamborghini    12
Lamborghini     9
Lamborghini     8
Lamborghini    13
Lamborghini     8
Name: city08, Length: 11210, dtype: int64

Slicing with .loc follows the closed interval: it includes both the start index and the final index.

If you have a sorted index, you can slice with strings that are not actual labels.

city2.sort_index().loc["F":"J"]
Federal Coach    15
Federal Coach    13
Federal Coach    13
Federal Coach    14
Federal Coach    13
                 ..
Isuzu            15
Isuzu            15
Isuzu            15
Isuzu            27
Isuzu            18
Name: city08, Length: 9040, dtype: int64

You can also pass an Index to .loc:

idx = pd.Index(['Dodge'])
city2.loc[idx]
Dodge    23
Dodge    10
Dodge    12
Dodge    11
Dodge    11
         ..
Dodge    18
Dodge    17
Dodge    14
Dodge    14
Dodge    11
Name: city08, Length: 2583, dtype: int64

If we duplicate 'Dodge' in the Index, the previous operation has twice as many values, a combinatoric explosion:

idx = pd.Index(['Dodge', 'Dodge'])
city2.loc[idx]
Dodge    23
Dodge    10
Dodge    12
Dodge    11
Dodge    11
         ..
Dodge    18
Dodge    17
Dodge    14
Dodge    14
Dodge    11
Name: city08, Length: 5166, dtype: int64

You can also pass a boolean array to .loc:

mask = city2 > 50
mask
Alfa Romeo    False
Ferrari       False
Dodge         False
Dodge         False
Subaru        False
              ...  
Subaru        False
Subaru        False
Subaru        False
Subaru        False
Subaru        False
Name: city08, Length: 41144, dtype: bool
city2.loc[mask]
Nissan     81
Toyota     81
Toyota     81
Ford       74
Nissan     84
         ... 
Tesla     140
Tesla     115
Tesla     104
Tesla      98
Toyota     55
Name: city08, Length: 236, dtype: int64

You can use a function with .loc. If I calculate the boolean array before taking into account the inflation, I get the wrong answer:

cost = pd.Series([1.00, 2.25, 3.99, .99, 2.79],
                 index=['Gum', 'Cookie', 'Melon', 'Roll', 'Carrots'])

inflation = 1.10

mask = cost > 3
# wrong answer
(cost
     .mul(inflation)
     .loc[mask]
)
Melon    4.389
dtype: float64
# right answer
(cost
     .mul(inflation)
     .loc[lambda s_: s_ > 3]
)
Melon      4.389
Carrots    3.069
dtype: float64

There is an implicit return statement in the lambda function. You can only put an expression in it; you cannot have a statement. It is limited to a single line of code.
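
If the logic doesn't fit in one expression, a named function passed to .loc works the same way (a sketch; over_three is my own name):

def over_three(s_):
    return s_ > 3

(cost
     .mul(inflation)
     .loc[over_three]
)
# same result as the lambda version above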

10.4 The .iloc Attribute

The .iloc attribute supports indexing with the following:

  • A scalar index position (an integer).
  • A list of index positions.
  • A slice of positions (half-open interval so it does not include stop value).
  • A NumPy array (or Python list) of boolean values.
  • A function that accepts a series and returns one of the above.

Because index positions are unique, we will always get the scalar value when indexing with .iloc at a position:

city2.iloc[0]
19
city2.iloc[-1]
16

If we want to return a series object, we can index it with a list of positions:

city2.iloc[[0]]
Alfa Romeo    19
Name: city08, dtype: int64
city2.iloc[[0, 1, -1]]
Alfa Romeo    19
Ferrari        9
Subaru        16
Name: city08, dtype: int64

We can also use slices with .iloc (they follow the half-open interval):

city2.iloc[0:5]
Alfa Romeo    19
Ferrari        9
Dodge         23
Dodge         10
Subaru        17
Name: city08, dtype: int64
city2.iloc[-8:]
Saturn    21
Saturn    24
Saturn    21
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64

You can use a NumPy array of booleans (or a Python list), but if you use what we have been calling a boolean array (a pandas series of booleans), it will fail:

mask = city2 > 50
city2.iloc[mask]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
city2.iloc[mask.to_numpy()]
Nissan     81
Toyota     81
Toyota     81
Ford       74
Nissan     84
         ... 
Tesla     140
Tesla     115
Tesla     104
Tesla      98
Toyota     55
Name: city08, Length: 236, dtype: int64

10.5 Heads and Tails

city2.head(3)
Alfa Romeo    19
Ferrari        9
Dodge         23
Name: city08, dtype: int64
city2.tail(3)
Subaru    18
Subaru    18
Subaru    16
Name: city08, dtype: int64

10.6 Sampling

The code below randomly pulls out six values:

city2.sample(6, random_state=42)
Volvo         16
Mitsubishi    19
Buick         27
Jeep          15
Land Rover    13
Saab          17
Name: city08, dtype: int64

10.7 Filtering Index Values

The filter method will filter index labels by exact match (items), substring (like), or regex (regex).

# exact match fails with duplicate index labels
city2.filter(items=['Ford', 'Subaru'])
ValueError: cannot reindex on an axis with duplicate labels
city2.filter(like='rd')
Ford    18
Ford    16
Ford    17
Ford    17
Ford    15
        ..
Ford    26
Ford    19
Ford    21
Ford    18
Ford    19
Name: city08, Length: 3371, dtype: int64
city2.filter(regex='(Ford)|(Subaru)')
Subaru    17
Subaru    21
Subaru    22
Ford      18
Ford      16
          ..
Subaru    19
Subaru    20
Subaru    18
Subaru    18
Subaru    16
Name: city08, Length: 4256, dtype: int64