Vishal Bakshi

July 6, 2023

In this blog post, I work through the book Effective Pandas by Matt Harrison. I’ll take notes, work through examples and end-of-chapter exercises.

Represent the following data in pure python:

Artist | Data |
---|---|
0 | 145 |
1 | 142 |
2 | 38 |
3 | 13 |

series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}
series
{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}
The `get` function below can pull items out of this data structure based on the index:
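A minimal sketch of such a `get` function, assuming the `series` dictionary structure shown above:

```python
def get(series, idx):
    # find the position of idx in the index list, then return
    # the data value stored at that same position
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

series = {
    'index': [0, 1, 2, 3],
    'data': [145, 142, 38, 13],
    'name': 'songs'
}
get(series, 1)  # 142
```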
The `.index` method on a list returns the position of the provided value, which we can then use to look up the corresponding element of the data list.
Below is an example that has string values for the index:
Create a `Series` object from a list:
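The output below is consistent with a construction like this (the variable name `songs2` is a stand-in; the name `'counts'` is taken from the output):

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')
songs2
```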
0 145
1 142
2 38
3 13
Name: counts, dtype: int64
The series is one-dimensional. The leftmost column is the index, also called the axis. The data (145, 142, 38, 13) is also called the values of the series. A `DataFrame` has two axes, one for the rows and another for the columns.

The default values for an index are monotonically increasing integers. The index can be string-based as well (the datatype for the index is then `object`).
songs3 = pd.Series([145, 142, 38, 13],
name = 'counts',
index = ['Paul', 'John', 'George', 'Ringo'])
songs3
Paul 145
John 142
George 38
Ringo 13
Name: counts, dtype: int64
We can insert Python objects into a series:
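The output below suggests a cell along these lines (the `Foo` class is a stand-in for any Python object):

```python
import pandas as pd

class Foo:
    pass

# mixed types (strings, an int, an arbitrary object) force the object dtype
ringo = pd.Series(['Richard', 'Starkey', 13, Foo()], name='ringo')
```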
0 Richard
1 Starkey
2 13
3 <__main__.Foo object at 0x135016ce0>
Name: ringo, dtype: object
The `object` data type is also used for a series with string values and for series with heterogeneous or mixed types.
Here is a series that has `NaN` in it:
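The output below matches a series like this; mixing an int with `np.nan` promotes the dtype to `float64`:

```python
import pandas as pd
import numpy as np

nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])
```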
Ono 2.0
Clapton NaN
dtype: float64
`float64` supports `NaN` while `int64` does not. As of pandas 0.24, `Int64` (the nullable integer type) supports `NaN`.
`.count` ignores `NaN`s; `.size` does not.
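A quick sketch of that difference on the same two-element series:

```python
import pandas as pd
import numpy as np

nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])
nan_series.count()  # 1 -- the NaN is ignored
nan_series.size     # 2 -- the NaN is included
```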
Ono 2
Clapton <NA>
dtype: Int64
The `Series` object behaves similarly to a NumPy array:

- They both have methods in common.
- They both have a notion of a boolean array.
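The boolean output below is consistent with comparing `songs3` against its median:

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13], name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# the comparison broadcasts, producing a boolean series
mask = songs3 > songs3.median()
```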
Paul True
John True
George False
Ringo False
Name: counts, dtype: bool
You can indicate that data is categorical.
Categorical values:
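The output below matches constructing the series with `dtype='category'`:

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
```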
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
By default categories don’t have an ordering.
Convert non-categorical series to an ordered category:
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
categories=['s', 'm', 'l'], ordered = True)
s3 = s2.astype(size_type)
s3
0 m
1 l
2 NaN
3 s
4 NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
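With an ordered categorical you can compare against a scalar; the boolean output below is consistent with `s3 > 's'` (values outside the declared categories became `NaN` and compare as `False`):

```python
import pandas as pd

s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)
s3 = s2.astype(size_type)

# ordered categories support comparisons
s3 > 's'
```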
0 True
1 True
2 False
3 False
4 False
dtype: bool
# add ordering information to categorical data
s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)
0 m
1 l
2 xs
3 s
4 xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
For strings and dates converted to categorical types, we can still use the `.str` or `.dt` attributes on them:
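A small sketch (the `cat_s` series is a stand-in):

```python
import pandas as pd

cat_s = pd.Series(['m', 'l', 's'], dtype='category')
# .str still works on categorical data backed by strings
cat_s.str.upper()
```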
Method | Description |
---|---|
pd.Series(data=None, index=None, dtype=None, name=None, copy=False) | Create a series from data (sequence, dictionary or scalar). |
s.index | Access index of series. |
s.astype(dtype, errors='raise') | Cast a series to dtype. To ignore errors (and return original object) use errors='ignore'. |
s[boolean_array] | Return values from s where boolean_array is True. |
s.cat.ordered | Determine if a categorical series is ordered. |
s.cat.reorder_categories(new_categories, ordered=False) | Add categories (potentially ordered) to the series. new_categories must include all categories. |
# analyze the US Fuel Economy data
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08
/var/folders/5q/_bn7l90s177_2gq7rnhssjxm0000gn/T/ipykernel_52057/221626492.py:4: DtypeWarning: Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(url)
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
0 25
1 14
2 33
3 12
4 23
..
41139 26
41140 28
41141 24
41142 24
41143 21
Name: highway08, Length: 41144, dtype: int64
Because the type is `int64` we know that none of the values are missing.

The `dir` function lists the attributes of an object. A series has 400+ attributes:
Functionality of series attributes:

- Dunder methods (.__add__, etc.): the protocols that determine how the Python language reacts to operations.
- Conversion methods that start with .to_ and export the data to other formats.
- Manipulation methods that return Series objects with the same index.
- Indexing and accessor methods that return Series or scalars.
- String manipulation with .str.
- Date manipulation with .dt.
- Plotting with .plot.
- Categorical manipulation with .cat.
- Attributes such as .index and .dtype.

Exercises:

- How many attributes are found on the .str attribute? Look at the documentation for three of them.
- How many attributes are found on the .dt attribute? Look at the documentation for three of them.
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
...
41139 22.5
41140 24.0
41141 21.0
41142 21.0
41143 18.5
Length: 41144, dtype: float64
When you operate with two series, pandas will align the index before performing the operation. Because of index alignment, you will want to make sure that the indexes:

- are unique
- are common to both series
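The output below is consistent with adding two small series with duplicate and non-overlapping index values, e.g.:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

# each 2 in s1 pairs with each 2 in s2 (a cartesian product);
# index values 1 and 4 have no partner, so they produce NaN
s1 + s2
```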
1 NaN
2 55.0
2 64.0
2 65.0
2 74.0
4 NaN
dtype: float64
When you perform math operations with a scalar, pandas broadcasts the operation to all values. A numeric pandas series is a block of memory, and modern CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply a math operation to the block of memory.
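The output below matches the same addition with a `fill_value`, which substitutes 0 for the missing side instead of producing `NaN`:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

total = s1.add(s2, fill_value=0)
```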
1 10.0
2 55.0
2 64.0
2 65.0
2 74.0
4 53.0
dtype: float64
Chaining makes the code easy to read and understand
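A sketch of the same average calculation written as a chain, on small stand-in series:

```python
import pandas as pd

city = pd.Series([19, 9, 23, 10, 17])
highway = pd.Series([25, 14, 33, 12, 23])

# each method returns a new series, so calls chain left to right
avg = city.add(highway).div(2)
```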
0 22.0
1 11.5
2 28.0
3 11.0
4 20.0
...
41139 22.5
41140 24.0
41141 21.0
41142 21.0
41143 18.5
Length: 41144, dtype: float64
Method | Operator | Description |
---|---|---|
s.add(s2) | s + s2 | Adds series |
s.radd(s2) | s2 + s | Adds series |
s.sub(s2) | s - s2 | Subtracts series |
s.rsub(s2) | s2 - s | Subtracts series |
s.mul(s2) | s * s2 | Multiplies series |
s.multiply(s2) | s * s2 | Multiplies series |
s.rmul(s2) | s2 * s | Multiplies series |
s.div(s2) | s / s2 | Divides series |
s.truediv(s2) | s / s2 | Divides series |
s.rdiv(s2) | s2 / s | Divides series |
s.rtruediv(s2) | s2 / s | Divides series |
s.mod(s2) | s % s2 | Modulo of series division |
s.rmod(s2) | s2 % s | Modulo of series division |
s.floordiv(s2) | s // s2 | Floor divides series |
s.rfloordiv(s2) | s2 // s | Floor divides series |
s.pow(s2) | s ** s2 | Exponential power of series |
s.rpow(s2) | s2 ** s | Exponential power of series |
s.eq(s2) | s2 == s | Elementwise equals of series |
s.ne(s2) | s2 != s | Elementwise not equals of series |
s.gt(s2) | s > s2 | Elementwise greater than of series |
s.ge(s2) | s >= s2 | Elementwise greater than or equals of series |
s.lt(s2) | s < s2 | Elementwise less than of series |
s.le(s2) | s <= s2 | Elementwise less than or equals of series |
np.invert(s) | ~s | Elementwise inversion of boolean series (no pandas method) |
np.logical_and(s, s2) | s & s2 | Elementwise logical and of boolean series (no pandas method) |
np.logical_or(s, s2) | s \| s2 | Elementwise logical or of boolean series (no pandas method) |
With a dataset of your choice:

1. Add a numeric series to itself.
2. Add 10 to a numeric series.
3. Add a numeric series to itself using the `.add` method.

0 38
1 18
2 46
3 20
4 34
..
41139 38
41140 40
41141 36
41142 36
41143 32
Name: city08, Length: 41144, dtype: int64
0 29
1 19
2 33
3 20
4 27
..
41139 29
41140 30
41141 28
41142 28
41143 26
Name: city08, Length: 41144, dtype: int64
0 38
1 18
2 46
3 20
4 34
..
41139 38
41140 40
41141 36
41142 36
41143 32
Name: city08, Length: 41144, dtype: int64
Aggregate methods collapse the values of a series down to a scalar.
0.1 13.0
0.5 17.0
0.9 24.0
Name: city08, dtype: float64
If you want the count of values that meet some criteria, you can use the `.sum` method:
24.965973167412017
Observe the `.mul(100).mean()` calculation on a simpler `Series`:

If you sum up a series of boolean values, the result is the count of `True` values. If you take the mean of a series of boolean values, the result is the fraction of values that are `True`.
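A minimal illustration:

```python
import pandas as pd

s = pd.Series([True, True, False, False])
s.sum()            # 2 -- the count of True values
s.mul(100).mean()  # 50.0 -- the percentage of True values
```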
`.agg` can perform multiple operations:
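The output below mixes built-in aggregation names with a custom function; a sketch of the pattern on a small stand-in series (`second_to_last` is the custom function):

```python
import pandas as pd

def second_to_last(s):
    return s.iloc[-2]

s = pd.Series([10, 20, 30, 40])
# strings name built-in aggregations; callables are applied directly
s.agg(['mean', 'max', second_to_last])
```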
mean 18.369045
var 62.503036
max 150.000000
second_to_last 18.000000
Name: city08, dtype: float64
Aggregation strings and descriptions:
Method | Description |
---|---|
'all' | Returns True if every value is truthy. |
'any' | Returns True if any value is truthy. |
'autocorr' | Returns Pearson correlation of series with shifted self. Can override lag as keyword argument (default is 1). |
'corr' | Returns Pearson correlation of series with other series. Need to specify other. |
'count' | Returns count of non-missing values. |
'cov' | Returns covariance of series with other series. Need to specify other. |
'dtype' | Type of the series. |
'dtypes' | Type of the series. |
'empty' | True if there are no values in the series. |
'hasnans' | True if there are missing values in the series. |
'idxmax' | Returns index value of maximum value. |
'idxmin' | Returns index value of minimum value. |
'is_monotonic' | True if values always increase. |
'is_monotonic_decreasing' | True if values always decrease. |
'is_monotonic_increasing' | True if values always increase. |
'kurt' | Returns "excess" kurtosis (0 is normal distribution). Values greater than 0 have more outliers than normal. |
'mad' | Returns the mean absolute deviation. |
'max' | Returns the maximum value. |
'mean' | Returns the mean value. |
'median' | Returns the median value. |
'min' | Returns the minimum value. |
'nbytes' | Returns the number of bytes of the data. |
'ndim' | Returns the number of dimensions (1) of the data. |
'nunique' | Returns the count of unique values. |
'quantile' | Returns the 50% quantile by default. Can override q to specify other quantile. |
'sem' | Returns the unbiased standard error. |
'size' | Returns the size of the data. |
'skew' | Returns the unbiased skew of the data. Negative indicates tail is on the left side. |
'std' | Returns the standard deviation of the data. |
'sum' | Returns the sum of the series. |
Aggregation methods and properties:
Method | Description |
---|---|
s.agg(func=None, axis=0, *args, **kwargs) | Returns a scalar if func is a single aggregation function. Returns a series if a list of aggregations are passed to func. |
s.all(axis=0, bool_only=None, skipna=True, level=None) | Returns True if every value is truthy. Otherwise False. |
s.any(axis=0, bool_only=None, skipna=True, level=None) | Returns True if at least one value is truthy. Otherwise False. |
s.autocorr(lag=1) | Returns Pearson correlation between s and shifted s. |
s.corr(other, method='pearson') | Returns correlation coefficient for 'pearson', 'spearman', 'kendall', or a callable. |
s.cov(other, min_periods=None) | Returns covariance. |
s.max(axis=None, skipna=None, level=None, numeric_only=None) | Returns maximum value. |
s.min(axis=None, skipna=None, level=None, numeric_only=None) | Returns minimum value. |
s.mean(axis=None, skipna=None, level=None, numeric_only=None) | Returns mean value. |
s.median(axis=None, skipna=None, level=None, numeric_only=None) | Returns median value. |
s.prod(axis=None, skipna=None, level=None, numeric_only=None, min_count=0) | Returns product of s values. |
s.quantile(q=0.5, interpolation='linear') | Returns 50% quantile by default. Returns Series if q is a list. |
s.sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns unbiased standard error of mean. |
s.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns sample standard deviation. |
s.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None) | Returns unbiased variance. |
s.skew(axis=None, skipna=None, level=None, numeric_only=None) | Returns unbiased skew. |
s.kurtosis(axis=None, skipna=None, level=None, numeric_only=None) | Returns unbiased kurtosis. |
s.nunique(dropna=True) | Returns count of unique items. |
s.count(level=None) | Returns count of non-missing items. |
s.size | Number of items in series. (Property) |
s.is_unique | True if all values are unique. |
s.is_monotonic | True if all values are increasing. |
s.is_monotonic_increasing | True if all values are increasing. |
s.is_monotonic_decreasing | True if all values are decreasing. |
With a dataset of your choice:
- Use the `.agg` method to find all of the above.

`.convert_dtypes` tries to convert a `Series` to a type that supports `pd.NA`. In the case of our `city_mpg` series it will change the type from `int64` to `Int64`.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: Int64
`.astype` works more explicitly. The maximum 8-bit integer is 127, so we need a 16-bit integer for `city_mpg` since its max is 150.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: Int16
If you can use a narrower type, you can cut back on memory usage, giving you memory to process more data.
Use NumPy to inspect limits on integer and float types:
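For example, `np.iinfo` and `np.finfo` report those limits:

```python
import numpy as np

np.iinfo('int8')     # min=-128, max=127
np.iinfo('int16')    # min=-32768, max=32767
np.finfo('float16')  # limits of the 16-bit float
```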
Use the `.nbytes` property or the `.memory_usage` method to calculate the memory usage of the `Series`.

Pass `deep=True` to `.memory_usage` when dealing with `object` types in the `Series`.

To get the amount of memory that includes strings in the `Series` (like the `make` column), we need to use the `.memory_usage` method:
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: object
`.memory_usage` includes the index memory and can include the contribution from `object` types. `.nbytes` is just the memory that the data is using and not the ancillary parts of the `Series`.
Converting to categorical will save a lot of memory for the `make` data:
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: object
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]
When you convert to categorical data, pandas no longer uses Python strings for each value but optimizes it. Potentially large memory savings if you have many duplicate values.
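A sketch of the savings on synthetic data with many repeated strings:

```python
import pandas as pd

s = pd.Series(['Subaru', 'Dodge', 'Ford'] * 1000)
cat = s.astype('category')

# the categorical stores small integer codes plus one lookup table,
# instead of a Python string object per row
s.memory_usage(deep=True), cat.memory_usage(deep=True)
```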
To create ordered categories you need to define your own `CategoricalDtype`:
values = pd.Series(sorted(set(city_mpg)))
city_type = pd.CategoricalDtype(categories=values, ordered=True)
city_mpg.astype(city_type)
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
String or Type | Description |
---|---|
str 'str' 'string' | Convert to Python string. |
'string' | Convert type to pandas string (supports pd.NA). |
int 'int' 'int64' | Convert type to NumPy int64. |
'int32' 'uint32' | Convert to 32-bit signed or unsigned NumPy integer (can also use 16 and 8). |
'Int64' | Convert type to pandas Int64 (supports pd.NA). Might complain when you convert floats or strings. |
float 'float' 'float64' | Convert type to NumPy float64 (can also support 32 or 16). |
'category' | Convert type to categorical (supports pd.NA). Can also use an instance of CategoricalDtype. |
dates | Don't use .astype for date conversion; use pd.to_datetime. |
Using Python lists will slow down your code significantly.
You can convert a `Series` into a `DataFrame`:
 | city08 |
---|---|
0 | 19 |
1 | 9 |
2 | 23 |
3 | 10 |
4 | 17 |
... | ... |
41139 | 19 |
41140 | 20 |
41141 | 18 |
41142 | 18 |
41143 | 16 |
41144 rows × 1 columns
Method | Description |
---|---|
s.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True) | Convert types to appropriate pandas 1 types (that support NA). Doesn't try to reduce size of integer or float types. |
s.astype(dtype, copy=True, errors='raise') | Cast series into particular type. If errors='ignore' then return original series on error. |
pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True) | Convert arg (a series) into datetime. Use format to specify strftime string. |
s.to_numpy(dtype=None, copy=False, na_value=object, **kwargs) | Convert the series to a NumPy array. |
s.values | Convert the series to a NumPy array. |
s.to_frame(name=None) | Return a dataframe representation of the series. |
pd.CategoricalDtype(categories=None, ordered=False) | Create a type for categorical data. |
With a dataset of your choice:
1. Convert a numeric column to a smaller type.
0 15.695714
1 29.964545
2 12.207778
3 29.964545
4 17.347895
...
41139 14.982273
41140 14.330870
41141 15.695714
41142 15.695714
41143 18.311667
Name: barrels08, Length: 41144, dtype: float64
0 15.695312
1 29.968750
2 12.210938
3 29.968750
4 17.343750
...
41139 14.984375
41140 14.328125
41141 15.695312
41142 15.695312
41143 18.312500
Name: barrels08, Length: 41144, dtype: float16
2. Calculate the memory savings by converting to smaller numeric types.
3. Convert a string column into a categorical type.
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: object
values = pd.Series(sorted(set(df.make)))
make_type = pd.CategoricalDtype(categories=values, ordered=False)
df.make.astype(make_type)
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 Subaru
...
41139 Subaru
41140 Subaru
41141 Subaru
41142 Subaru
41143 Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']
4. Calculate the memory savings by converting to a categorical type.
Comparing the non-broadcasted `.apply` method with vectorized code:
7.08 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The broadcasted `.gt` method is 40 times faster than the non-broadcasted `.apply` function.
Show the top 5 makes and label everything else as Other:
48.2 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A (10 times) faster and more idiomatic manner of doing this uses the `.where` method, which keeps values from the series it is called on where the boolean array is true. Where the boolean array is false, it uses the value of the second parameter, `other`:
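A sketch of the pattern on a tiny stand-in series (here keeping only the single most frequent make):

```python
import pandas as pd

make = pd.Series(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru'])
top = make.value_counts().index[:1]

# keep values where the condition holds; use `other` everywhere else
limited = make.where(make.isin(top), other='Other')
```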
3.68 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
The complement of `.where` is `.mask`: wherever the condition is `False` it keeps the original values; where it is `True` it replaces the value with the `other` parameter.
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
3.36 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The tilde (`~`) performs an inversion of the boolean array, switching all true values to false and vice versa.
A single call to `.where` cannot do the following: keep the top five makes, use Top10 for the remainder of the top ten makes, and Other for the rest.
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
76.4 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
To replicate this in pandas, chain calls to `.where`:
0 Other
1 Other
2 Dodge
3 Dodge
4 Other
...
41139 Other
41140 Other
41141 Other
41142 Other
41143 Other
Name: make, Length: 41144, dtype: object
6.05 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The pandas approach is still about 13 times faster.
The `select` function in NumPy works with pandas series. The interface takes a list of boolean arrays and a list with corresponding replacement values.
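A minimal sketch of `np.select` on a pandas series:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 15, 25])
# the first matching condition wins; `default` covers the rest
result = np.select([s < 10, s < 20], ['low', 'mid'], default='high')
```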
array(['Other', 'Other', 'Dodge', ..., 'Other', 'Other', 'Other'],
dtype=object)
19.8 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can wrap the result in a `Series`. For longer if-else logic I like this syntax better than chaining `.where` calls because I think it is easier to understand.
Count the number of missing items with `.isna().sum()`:
Let’s index where the values are missing in the cylinders column and then show what those makes are:
It seems like cylinders are missing for cars that are electric (they have zero cylinders).
7138 NaN
7139 NaN
8143 NaN
8144 NaN
8146 NaN
..
34563 NaN
34564 NaN
34565 NaN
34566 NaN
34567 NaN
Name: cylinders, Length: 206, dtype: float64
0 32.0
1 40.0
2 NaN
3 42.0
4 39.0
5 32.0
dtype: float64
0 19
1 9
2 23
3 10
4 17
..
442 15
443 15
444 15
445 15
446 31
Name: city08, Length: 447, dtype: int64
0 19
1 11
2 23
3 11
4 17
..
442 15
443 15
444 15
445 15
446 27
Name: city08, Length: 447, dtype: int64
`.clip` uses `.where` under the hood.
The `.sort_values` method will sort the values in ascending order and also rearrange the index accordingly.
7901 6
34557 6
37161 6
21060 6
35887 6
...
34563 138
34564 140
32599 150
31256 150
33423 150
Name: city08, Length: 41144, dtype: int64
Because of index alignment, you can still do math operations on a sorted series:
Below we unsort the index by sorting the values, then essentially revert that:
`keep='first'` is the default value and keeps the first duplicate value found.

`keep='last'` keeps the last duplicate value found.

`keep=False` will remove any duplicated values (including the initial value).

`drop_duplicates` keeps the original index.
The `.rank` method will return a series that keeps the original index but uses the ranks of values from the original series. By default, if two values are the same, their rank will be the average of the positions they take. You can specify `method='min'` to put equal values in the same rank and `method='dense'` to not skip any positions:
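A sketch of the three tie-handling behaviors on a tiny series:

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])
s.rank()                # ties average their positions: 1.0, 2.5, 2.5, 4.0
s.rank(method='min')    # ties share the lowest position: 1, 2, 2, 4
s.rank(method='dense')  # like 'min' but no gaps: 1, 2, 2, 3
```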
0 27060.5
1 235.5
2 35830.0
3 607.5
4 19484.0
...
41139 27060.5
41140 29719.5
41141 23528.0
41142 23528.0
41143 15479.0
Name: city08, Length: 41144, dtype: float64
0 25555.0
1 136.0
2 35119.0
3 336.0
4 17467.0
...
41139 25555.0
41140 28567.0
41141 21502.0
41142 21502.0
41143 13492.0
Name: city08, Length: 41144, dtype: float64
0 14.0
1 4.0
2 18.0
3 5.0
4 12.0
...
41139 14.0
41140 15.0
41141 13.0
41142 13.0
41143 11.0
Name: city08, Length: 41144, dtype: float64
The `.replace` method allows you to map values to new values.
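The output below is consistent with a dictionary-based replace, sketched here on a short stand-in series:

```python
import pandas as pd

make = pd.Series(['Alfa Romeo', 'Subaru', 'Dodge'])
# a dictionary maps old values to new values
make.replace({'Subaru': 'SUBARU'})
```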
0 Alfa Romeo
1 Ferrari
2 Dodge
3 Dodge
4 SUBARU
...
41139 SUBARU
41140 SUBARU
41141 SUBARU
41142 SUBARU
41143 SUBARU
Name: make, Length: 41144, dtype: object
Using the `cut` function, you can create bins of equal width:
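A minimal sketch; the bin edges are derived from the series minimum and maximum:

```python
import pandas as pd

s = pd.Series([1, 5, 10, 15, 20])
# two bins of equal width covering the observed range
binned = pd.cut(s, 2)
```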
0 (5.856, 20.4]
1 (5.856, 20.4]
2 (20.4, 34.8]
3 (5.856, 20.4]
4 (5.856, 20.4]
...
41139 (5.856, 20.4]
41140 (5.856, 20.4]
41141 (5.856, 20.4]
41142 (5.856, 20.4]
41143 (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]
You can also specify the bin edges yourself. In the following, 5 bins are created (so you need to provide 6 edges):
0 (10, 20]
1 (0, 10]
2 (20, 40]
3 (0, 10]
4 (10, 20]
...
41139 (10, 20]
41140 (10, 20]
41141 (10, 20]
41142 (10, 20]
41143 (10, 20]
Name: city08, Length: 41144, dtype: category
Categories (5, interval[int64, right]): [(0, 10] < (10, 20] < (20, 40] < (40, 70] < (70, 150]]
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
Note the bins have a half-open interval: they do not include the start value but do include the end value. If the `city_mpg` series had values of 0 or values above 150, they would be missing after binning the series.
If you want 10 bins that have approximately the same number of entries in each bin (rather than each bin having the same width), use the `qcut` function:
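A minimal sketch; with 100 evenly spread values, `qcut` puts 10 in each bin:

```python
import pandas as pd

s = pd.Series(range(100))
# 10 bins with (approximately) equal counts rather than equal widths
counts = pd.qcut(s, 10).value_counts()
```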
0 (18.0, 20.0]
1 (5.999, 13.0]
2 (21.0, 24.0]
3 (5.999, 13.0]
4 (16.0, 17.0]
...
41139 (18.0, 20.0]
41140 (18.0, 20.0]
41141 (17.0, 18.0]
41142 (17.0, 18.0]
41143 (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]
Both allow you to set the labels to use instead of the categorical intervals they generate:
0 7
1 1
2 9
3 1
4 5
..
41139 7
41140 7
41141 6
41142 6
41143 4
Name: city08, Length: 41144, dtype: category
Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]
Manipulation methods and properties:
Method | Description |
---|---|
s.apply(func, convert_dtype=True, args=(), **kwds) | Pass in a NumPy function that works on the series, or a Python function that works on a single value. args and kwds are arguments for func. Returns a series, or dataframe if func returns a series. |
s.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False) | Pass in a boolean series/dataframe, list, or callable as cond. If the value is True, keep it, otherwise use other value. If it is a function, it takes a series and should return a boolean sequence. |
np.select(condlist, choicelist, default=0) | Pass in a list of boolean arrays for condlist. If the value is true use the corresponding value from choicelist. If multiple conditions are True, only use the first. Returns a NumPy array. |
s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) | Pass in a scalar, dict, series or dataframe for value. If it is a scalar, use that value, otherwise use the index from the old value to the new value. |
s.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs) | Perform interpolation with missing values. method may be linear, time among others. |
s.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs) | Return a new series with values clipped to lower and upper. |
s.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) | Return a series with values sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'. |
s.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None) | Return a series with index sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'. |
s.drop_duplicates(keep='first', inplace=False) | Drop duplicates. keep may be 'first', 'last', or False. (If False, it removes all values that were duplicated.) |
s.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) | Return a series with numerical ranks. method allows you to specify tie handling: 'average', 'min', 'max', 'first' (uses order they appear in series), 'dense' (like 'min', but rank only increases by one after tie). na_option allows you to specify NaN handling: 'keep' (stay at NaN), 'top' (move to smallest), 'bottom' (move to largest). |
s.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad') | Return a series with new values. to_replace can be many things. If it is a string, number or regular expression, you can replace it with a scalar value. It can also be a list of those things which requires value to be a list of the same size. Finally, it can be a dictionary mapping old values to new values. |
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True) | Bin values from x (a series). If bins is an integer, use equal-width bins. If bins is a list of numbers (defining minimum and maximum positions) use those for the edges. right defines whether the right edge is open or closed. labels allows you to specify the bin names. Out of bounds values will be missing. |
pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise') | Bin values from x (a series) into q equal sized bins. Alternatively, can pass in a list of quantile edges. Out of bounds values will be missing. |
1. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `.apply`.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
def generalize_mean(x, mean_val):
    if x >= mean_val:
        return 'high'
    return 'low'

city_mpg.apply(generalize_mean, mean_val=city_mpg.mean())
0 high
1 low
2 high
3 low
4 low
...
41139 high
41140 high
41141 low
41142 low
41143 low
Name: city08, Length: 41144, dtype: object
2. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using `np.select`.
0 high
1 low
2 high
3 low
4 low
...
41139 high
41140 high
41141 low
41142 low
41143 low
Length: 41144, dtype: object
3. Time the differences between the previous two solutions to see which is faster.
22.3 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.45 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
`np.select` is around 5 times as fast as `.apply` in this exercise.
4. Replace the missing values of a numeric series with the median value.
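A sketch of the solution with stand-in cylinder values:

```python
import pandas as pd
import numpy as np

cyl = pd.Series([4.0, 6.0, np.nan, 8.0, 6.0, np.nan])
# fill the missing entries with the median of the non-missing ones
filled = cyl.fillna(cyl.median())
```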
7136 False
7137 False
7138 True
7139 True
7140 False
Name: cylinders, dtype: bool
7136 6.0
7137 6.0
7138 6.0
7139 6.0
7140 6.0
Name: cylinders, dtype: float64
5. Clip the values of a numeric series to between the 10th and 90th percentiles.
clip_s = pd.Series([1,2,3,4,5,6,7,8,9,10])
(clip_s
.clip(lower=clip_s.quantile(0.1),
upper=clip_s.quantile(0.9))
)
0 1.9
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 9.1
dtype: float64
6. Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Gasoline or natural gas',
'Gasoline or E85', 'Electricity', 'Gasoline or propane',
'Premium or E85', 'Midgrade', 'Premium Gas or Electricity',
'Regular Gas and Electricity', 'Premium and Electricity',
'Regular Gas or Electricity'], dtype=object)
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity'], dtype='object', name='fuelType')
0 Regular
1 Regular
2 Regular
3 Regular
4 Premium
...
41139 Regular
41140 Regular
41141 Regular
41142 Regular
41143 Premium
Name: fuelType, Length: 41144, dtype: object
7. Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.
Index(['Regular', 'Premium', 'Gasoline or E85', 'Diesel', 'Electricity',
'Premium or E85', 'Midgrade', 'CNG', 'Premium and Electricity',
'Regular Gas and Electricity'],
dtype='object', name='fuelType')
0 Regular
1 Regular
2 Regular
3 Regular
4 Premium
...
41139 Regular
41140 Regular
41141 Regular
41142 Regular
41143 Premium
Name: fuelType, Length: 41144, dtype: object
8. Make a function that takes a categorical series and a number (`n`) and returns a replacement series that replaces any value that is not in the top `n` most frequent values with 'Other'.
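One way to write such a function (a sketch; the name `limit_top_n` is my own):

```python
import pandas as pd

def limit_top_n(s, n, other='Other'):
    # keep the n most frequent values; replace everything else
    top = s.value_counts().index[:n]
    return s.where(s.isin(top), other)

fuel = pd.Series(['Regular', 'Regular', 'Premium', 'Diesel'])
limit_top_n(fuel, 1)
```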
array(['Regular', 'Premium', 'Diesel', 'CNG', 'Other', 'Gasoline or E85',
'Electricity', 'Premium or E85', 'Midgrade',
'Regular Gas and Electricity', 'Premium and Electricity'],
dtype=object)
array(['Regular', 'Premium', 'Diesel', 'Other', 'Gasoline or E85',
'Electricity'], dtype=object)
9. Using a numeric column, bin it into 10 groups that have the same width.
0 (5.856, 20.4]
1 (5.856, 20.4]
2 (20.4, 34.8]
3 (5.856, 20.4]
4 (5.856, 20.4]
...
41139 (5.856, 20.4]
41140 (5.856, 20.4]
41141 (5.856, 20.4]
41142 (5.856, 20.4]
41143 (5.856, 20.4]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.856, 20.4] < (20.4, 34.8] < (34.8, 49.2] < (49.2, 63.6] ... (92.4, 106.8] < (106.8, 121.2] < (121.2, 135.6] < (135.6, 150.0]]
10. Using a numeric column, bin it into 10 groups that have equal sized bins.
0 (18.0, 20.0]
1 (5.999, 13.0]
2 (21.0, 24.0]
3 (5.999, 13.0]
4 (16.0, 17.0]
...
41139 (18.0, 20.0]
41140 (18.0, 20.0]
41141 (17.0, 18.0]
41142 (17.0, 18.0]
41143 (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]
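Equal-sized (quantile) binning is `pd.qcut`. With evenly spread data each bin ends up with the same number of rows:

```python
import pandas as pd

s = pd.Series(range(1, 101))
binned = pd.qcut(s, 10)        # 10 bins with (roughly) equal counts
binned.value_counts()
```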
This Stack Overflow response addresses the error below:
ValueError: Bin edges must be unique: array([ 2., 4., 4., 4., 5., 6., 6., 6., 8., 8., 16.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
If I decrease the number of bins to 5, the error is not raised:
0 (1.999, 4.0]
1 (8.0, 16.0]
2 (1.999, 4.0]
3 (6.0, 8.0]
4 (1.999, 4.0]
...
41139 (1.999, 4.0]
41140 (1.999, 4.0]
41141 (1.999, 4.0]
41142 (1.999, 4.0]
41143 (1.999, 4.0]
Name: cylinders, Length: 41144, dtype: category
Categories (5, interval[float64, right]): [(1.999, 4.0] < (4.0, 5.0] < (5.0, 6.0] < (6.0, 8.0] < (8.0, 16.0]]
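The `duplicates` kwarg fix from that answer can be sketched like this on a hypothetical series with many repeated values (so that several quantile edges coincide):

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 4, 4, 4, 5, 6, 8, 16])
# pd.qcut(s, 10) raises "ValueError: Bin edges must be unique" here
binned = pd.qcut(s, 10, duplicates='drop')   # drop the duplicate edges instead
len(binned.cat.categories)                   # fewer than 10 bins survive
```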
Both a series and a dataframe have an index. Both types support the Python indexing operator ([]
). Both have attributes .loc
and .iloc
that you can index against.
Use the .rename
method to change the index labels. We can pass in a dictionary to map the previous index label to the new label:
{0: 'Alfa Romeo',
1: 'Ferrari',
2: 'Dodge',
3: 'Dodge',
4: 'Subaru',
5: 'Subaru',
6: 'Subaru',
7: 'Toyota',
8: 'Toyota',
9: 'Toyota'}
Alfa Romeo 19
Ferrari 9
Dodge 23
Dodge 10
Subaru 17
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 41144, dtype: int64
Index(['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru', 'Subaru', 'Subaru',
'Toyota', 'Toyota', 'Toyota',
...
'Saab', 'Saturn', 'Saturn', 'Saturn', 'Saturn', 'Subaru', 'Subaru',
'Subaru', 'Subaru', 'Subaru'],
dtype='object', length=41144)
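A small self-contained sketch of the same rename-with-a-dictionary pattern:

```python
import pandas as pd

s = pd.Series([19, 9, 23], name='city08')
s2 = s.rename({0: 'Alfa Romeo', 1: 'Ferrari', 2: 'Dodge'})
list(s2.index)   # the dictionary maps old labels to new ones
```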
The .rename
method also accepts a series, a scalar, a function that takes an old label and returns a new label, or a sequence. When we pass in a series and the index values are the same, the values from the series that we passed in are used as the index.
Alfa Romeo 19
Ferrari 9
Dodge 23
Dodge 10
Subaru 17
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 41144, dtype: int64
If you pass a scalar value (a single string) into .rename
the index will stay the same but the .name
attribute of the series will update:
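A sketch (the new name is illustrative):

```python
import pandas as pd

s = pd.Series([19, 9], name='city08')
s2 = s.rename('city_mpg')      # scalar: updates .name, leaves the index alone
s2.name, list(s2.index)
```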
.reset_index
by default will return a dataframe, moving the current index into a new column:
index | city08 | |
---|---|---|
0 | Alfa Romeo | 19 |
1 | Ferrari | 9 |
2 | Dodge | 23 |
3 | Dodge | 10 |
4 | Subaru | 17 |
... | ... | ... |
41139 | Subaru | 19 |
41140 | Subaru | 20 |
41141 | Subaru | 18 |
41142 | Subaru | 18 |
41143 | Subaru | 16 |
41144 rows × 2 columns
drop=True
drops the current index and returns a Series.
0 19
1 9
2 23
3 10
4 17
..
41139 19
41140 20
41141 18
41142 18
41143 16
Name: city08, Length: 41144, dtype: int64
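Both behaviors in one small sketch:

```python
import pandas as pd

s = pd.Series([19, 9], index=['Alfa Romeo', 'Ferrari'], name='city08')
df = s.reset_index()                # DataFrame: old index moves into a column
s2 = s.reset_index(drop=True)       # Series with a fresh integer index
```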
Note that .sort_values
and .sort_index
keep the same index but just rearrange the order so they do not impact operations that align on the index.
.loc Attribute
The .loc attribute deals with index labels. You can pass the following into an index operation on .loc:
If there are duplicate labels in the index and you pass a scalar label into .loc, it will return a series. If there is only one value for that label, it will return a scalar.
Subaru 17
Subaru 21
Subaru 22
Subaru 19
Subaru 20
..
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, Length: 885, dtype: int64
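The two cases side by side on a tiny hypothetical series:

```python
import pandas as pd

s = pd.Series([17, 21, 9], index=['Subaru', 'Subaru', 'Ferrari'])
s.loc['Subaru']    # duplicated label -> a Series with both values
s.loc['Ferrari']   # unique label -> the scalar 9
```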
If you want to guarantee that a series is returned, pass in a list rather than passing in a scalar value:
Ferrari 9
Ferrari 12
Ferrari 11
Ferrari 10
Ferrari 11
..
Lamborghini 6
Lamborghini 8
Lamborghini 8
Lamborghini 8
Lamborghini 8
Name: city08, Length: 357, dtype: int64
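Wrapping the label in a list removes the ambiguity:

```python
import pandas as pd

s = pd.Series([9, 17], index=['Ferrari', 'Subaru'])
s.loc['Ferrari']     # scalar
s.loc[['Ferrari']]   # always a Series, even for a unique label
```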
Sort the index if you are slicing with duplicate index labels:
KeyError: "Cannot get left slice bound for non-unique label: 'Ferrari'"
Ferrari 10
Ferrari 13
Ferrari 13
Ferrari 9
Ferrari 10
..
Lamborghini 12
Lamborghini 9
Lamborghini 8
Lamborghini 13
Lamborghini 8
Name: city08, Length: 11210, dtype: int64
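The failure and the fix, sketched on a small series:

```python
import pandas as pd

s = pd.Series([23, 9, 17, 12], index=['Dodge', 'Ferrari', 'Subaru', 'Ferrari'])
# s.loc['Ferrari':'Subaru'] raises a KeyError on this unsorted index
s2 = s.sort_index()
s2.loc['Ferrari':'Subaru']   # works once the duplicate labels are grouped
```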
Slicing with .loc
follows the closed interval: it includes both the start index and the final index.
If you have a sorted index, you can slice with strings that are not actual labels.
Federal Coach 15
Federal Coach 13
Federal Coach 13
Federal Coach 14
Federal Coach 13
..
Isuzu 15
Isuzu 15
Isuzu 15
Isuzu 27
Isuzu 18
Name: city08, Length: 9040, dtype: int64
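For example, on a sorted index the slice bounds don't need to exist as labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['Dodge', 'Ford', 'Subaru'])  # sorted index
s.loc['E':'G']   # neither 'E' nor 'G' is a label; the slice is lexical
```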
You can also pass an Index
to .loc
:
Dodge 23
Dodge 10
Dodge 12
Dodge 11
Dodge 11
..
Dodge 18
Dodge 17
Dodge 14
Dodge 14
Dodge 11
Name: city08, Length: 2583, dtype: int64
If we duplicate 'Dodge'
in the Index
, the previous operation returns twice as many values, a combinatorial explosion:
Dodge 23
Dodge 10
Dodge 12
Dodge 11
Dodge 11
..
Dodge 18
Dodge 17
Dodge 14
Dodge 14
Dodge 11
Name: city08, Length: 5166, dtype: int64
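The doubling is easy to see on a small series: each label requested in the Index pulls out every matching row.

```python
import pandas as pd

s = pd.Series([23, 10], index=['Dodge', 'Dodge'])
len(s.loc[pd.Index(['Dodge'])])            # 2: both matches for one request
len(s.loc[pd.Index(['Dodge', 'Dodge'])])   # 4: both matches for each request
```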
You can also pass a boolean array to .loc
:
Alfa Romeo False
Ferrari False
Dodge False
Dodge False
Subaru False
...
Subaru False
Subaru False
Subaru False
Subaru False
Subaru False
Name: city08, Length: 41144, dtype: bool
Nissan 81
Toyota 81
Toyota 81
Ford 74
Nissan 84
...
Tesla 140
Tesla 115
Tesla 104
Tesla 98
Toyota 55
Name: city08, Length: 236, dtype: int64
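A minimal sketch of the boolean-array pattern:

```python
import pandas as pd

s = pd.Series([81, 40, 140], index=['Nissan', 'Ford', 'Tesla'])
mask = s > 50   # boolean array, aligned on the index
s.loc[mask]     # keeps only the rows where the mask is True
```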
You can use a function with .loc
. If I calculate the boolean array before taking into account the inflation, I get the wrong answer:
Melon 4.389
Carrots 3.069
dtype: float64
There is an implicit return statement in a lambda function. You can only put an expression in it, not a statement, and it is limited to a single line of code.
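A sketch of the callable pattern (the prices are illustrative): the lambda receives the series being indexed, so the mask is computed on the inflated values rather than the original ones.

```python
import pandas as pd

cost = pd.Series({'Melon': 3.99, 'Carrots': 2.79})
inflated = cost * 1.1
# the lambda is called with `inflated`, so the comparison uses inflated prices
inflated.loc[lambda s_: s_ > 3]
```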
.iloc Attribute
The .iloc attribute supports indexing with the following:
Because index positions are unique, we will always get the scalar value when indexing with .iloc
at a position:
If we want to return a series object, we can index it with a list of positions:
We can also use slices with .iloc
(they follow the half-open interval):
Saturn 21
Saturn 24
Saturn 21
Subaru 19
Subaru 20
Subaru 18
Subaru 18
Subaru 16
Name: city08, dtype: int64
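The three cases in one small sketch:

```python
import pandas as pd

s = pd.Series([21, 24, 19], index=['Saturn', 'Saturn', 'Subaru'])
s.iloc[0]        # positions are unique -> always a scalar
s.iloc[[0, 2]]   # list of positions -> a Series
s.iloc[-2:]      # slice: half-open interval, here the last two rows
```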
You can use a NumPy array of booleans (or a Python list), but if you use what pandas calls a boolean array (a series of booleans), this will fail:
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
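A sketch of which mask types .iloc accepts:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.iloc[np.array([True, False, True])]   # NumPy boolean array: fine
s.iloc[[True, False, True]]             # plain Python list: fine
# s.iloc[pd.Series([True, False, True])]  # ValueError: cannot use an indexable as a mask
```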
The code below randomly pulls out six values:
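Something like this (the seed is mine, for reproducibility):

```python
import pandas as pd

s = pd.Series(range(100))
s.sample(6, random_state=42)   # six values drawn at random
```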
The filter
method will filter index labels by exact match (items
), substring (like
), or regex (regex
).
ValueError: cannot reindex on an axis with duplicate labels
Ford 18
Ford 16
Ford 17
Ford 17
Ford 15
..
Ford 26
Ford 19
Ford 21
Ford 18
Ford 19
Name: city08, Length: 3371, dtype: int64
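All three filter modes on a tiny hypothetical series (note that items, unlike like and regex, raises the reindex error above when the index has duplicate labels):

```python
import pandas as pd

s = pd.Series([18, 23, 19], index=['Ford', 'Dodge', 'Subaru'])
s.filter(items=['Ford'])   # exact label match
s.filter(like='odg')       # substring match -> Dodge
s.filter(regex='^S')       # regular expression -> Subaru
```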