This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?
An example will make the question clearer. Let’s load our Olympic medal dataset:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data
|       | City    | Year | Sport     | … | Medal  | Country   | Int Olympic Committee code |
|-------|---------|------|-----------|---|--------|-----------|----------------------------|
| 0     | Athens  | 1896 | Aquatics  | … | Gold   | Hungary   | HUN |
| 1     | Athens  | 1896 | Aquatics  | … | Silver | Austria   | AUT |
| 2     | Athens  | 1896 | Aquatics  | … | Bronze | Greece    | GRE |
| 3     | Athens  | 1896 | Aquatics  | … | Gold   | Greece    | GRE |
| 4     | Athens  | 1896 | Aquatics  | … | Silver | Greece    | GRE |
| …     | …       | …    | …         | … | …      | …         | …   |
| 29211 | Beijing | 2008 | Wrestling | … | Silver | Germany   | GER |
| 29212 | Beijing | 2008 | Wrestling | … | Bronze | Lithuania | LTU |
| 29213 | Beijing | 2008 | Wrestling | … | Bronze | Armenia   | ARM |
| 29214 | Beijing | 2008 | Wrestling | … | Gold   | Cuba      | CUB |
| 29215 | Beijing | 2008 | Wrestling | … | Silver | Russia    | RUS |

29216 rows × 12 columns
and measure, for each different country, the number of medals they’ve won in each different sport:
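# count medals per (Country, Sport) pair, pivot the sports into columns,
# and fill in zero for any country/sport combination with no medals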
summary = data.groupby(['Country', 'Sport']).size().unstack().fillna(0)
summary
| Country \ Sport | Aquatics | Archery | Athletics | … | Water Motorsports | Weightlifting | Wrestling |
|-----------------|----------|---------|-----------|---|-------------------|---------------|-----------|
| Afghanistan     | 0.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |
| Algeria         | 0.0      | 0.0     | 6.0       | … | 0.0               | 0.0           | 0.0       |
| Argentina       | 3.0      | 0.0     | 5.0       | … | 0.0               | 2.0           | 0.0       |
| Armenia         | 0.0      | 0.0     | 0.0       | … | 0.0               | 4.0           | 4.0       |
| Australasia     | 11.0     | 0.0     | 1.0       | … | 0.0               | 0.0           | 0.0       |
| …               | …        | …       | …         | … | …                 | …             | …         |
| Virgin Islands* | 0.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |
| West Germany    | 62.0     | 0.0     | 67.0      | … | 0.0               | 7.0           | 9.0       |
| Yugoslavia      | 91.0     | 0.0     | 2.0       | … | 0.0               | 0.0           | 16.0      |
| Zambia          | 0.0      | 0.0     | 1.0       | … | 0.0               | 0.0           | 0.0       |
| Zimbabwe        | 7.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |

137 rows × 42 columns
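Before moving on, it’s worth a quick sanity check that the reshaping did what we wanted (a one-liner, based on the row and column counts shown above):

summary.shape  # (137, 42): one row per country, one column per sport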
Now we’ll pick two countries:
summary.loc[['Germany', 'Italy']]
| Country \ Sport | Aquatics | Archery | Athletics | … | Water Motorsports | Weightlifting | Wrestling |
|-----------------|----------|---------|-----------|---|-------------------|---------------|-----------|
| Germany         | 175.0    | 6.0     | 99.0      | … | 0.0               | 20.0          | 24.0      |
| Italy           | 113.0    | 12.0    | 71.0      | … | 0.0               | 14.0          | 20.0      |

2 rows × 42 columns
Each country has 42 columns giving the total number of medals won in each sport. Our job is to come up with a single number that summarizes how different those two lists of numbers are. Mathematicians have figured out lots of different ways of doing that, many of which are implemented in the scipy.spatial.distance module.
If we just import pdist from the module and pass in our dataframe of two countries, we’ll get a measurement:
from scipy.spatial.distance import pdist
pdist(summary.loc[['Germany', 'Italy']])
array([342.3024978])
That’s the distance score using the default metric, which is called the Euclidean distance. Think of it as the straight-line distance between the two points in space defined by the two lists of 42 numbers.
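If you want to convince yourself that this is all pdist is doing, we can reproduce the same number with plain numpy (a quick sketch using the two rows we just selected):

import numpy as np

germany = summary.loc['Germany'].to_numpy()
italy = summary.loc['Italy'].to_numpy()

# square the 42 differences, add them up, and take the square root
np.sqrt(((germany - italy) ** 2).sum())  # 342.3024978, the same value pdist returned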
Now, what happens if we pass in a dataframe with three countries?
pdist(summary.loc[['Germany', 'Italy', 'France']])
array([342.3024978 , 317.98584874, 144.82403116])
As we might expect, we have three measurements:
- Germany and Italy
- Germany and France
- Italy and France
But it’s not easy to figure out which belongs to which. Happily, scipy also has a helper function that will take this list of numbers and turn it back into a square matrix:
from scipy.spatial.distance import squareform
squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))
array([[ 0. , 342.3024978 , 317.98584874],
[342.3024978 , 0. , 144.82403116],
[317.98584874, 144.82403116, 0. ]])
In order to make sense of this, we need to re-attach the country names, which we can just do by turning it into a DataFrame:
pd.DataFrame(
squareform(pdist(summary.loc[['Germany', 'Italy', 'France']])),
columns = ['Germany', 'Italy', 'France'],
index = ['Germany', 'Italy', 'France']
)
|         | Germany    | Italy      | France     |
|---------|------------|------------|------------|
| Germany | 0.000000   | 342.302498 | 317.985849 |
| Italy   | 342.302498 | 0.000000   | 144.824031 |
| France  | 317.985849 | 144.824031 | 0.000000   |
Hopefully this agrees with our intuition: the numbers on the diagonal are all zero, because each country is identical to itself, and the values above and below the diagonal are mirror images of each other, because the distance between Germany and France is the same as the distance between France and Germany (remember that we’re talking about distance in terms of medal counts, not geographical distance!).
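Both of these properties are easy to check programmatically too (a quick sketch reusing the three-country comparison):

import numpy as np

three = squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))

np.allclose(three, three.T)     # True: the matrix is symmetric
np.allclose(np.diag(three), 0)  # True: every country is at distance zero from itself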
Finally, to get pairwise measurements for the whole input dataframe, we just pass in the complete object and get the country names from the index:
pairwise = pd.DataFrame(
squareform(pdist(summary)),
columns = summary.index,
index = summary.index
)
pairwise
| Country         | Afghanistan | Algeria    | Argentina  | … | Yugoslavia | Zambia     | Zimbabwe   |
|-----------------|-------------|------------|------------|---|------------|------------|------------|
| Afghanistan     | 0.000000    | 8.774964   | 96.643675  | … | 171.947666 | 1.732051   | 17.492856  |
| Algeria         | 8.774964    | 0.000000   | 95.199790  | … | 171.688672 | 7.348469   | 19.519221  |
| Argentina       | 96.643675   | 95.199790  | 0.000000   | … | 148.128323 | 96.348326  | 89.810912  |
| Armenia         | 5.830952    | 9.848858   | 96.477977  | … | 171.604196 | 5.744563   | 18.384776  |
| Australasia     | 18.708287   | 20.024984  | 97.744565  | … | 166.991018 | 18.627936  | 22.360680  |
| …               | …           | …          | …          | … | …          | …          | …          |
| Virgin Islands* | 1.414214    | 8.774964   | 96.457244  | … | 171.947666 | 1.732051   | 17.492856  |
| West Germany    | 153.052279  | 150.306354 | 142.537714 | … | 184.945938 | 152.577849 | 144.045132 |
| Yugoslavia      | 171.947666  | 171.688672 | 148.128323 | … | 0.000000   | 171.874955 | 169.103519 |
| Zambia          | 1.732051    | 7.348469   | 96.348326  | … | 171.874955 | 0.000000   | 17.521415  |
| Zimbabwe        | 17.492856   | 19.519221  | 89.810912  | … | 169.103519 | 17.521415  | 0.000000   |

137 rows × 137 columns
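One nice side effect of having the country names as both the index and the columns is that looking up any single comparison is now just a .loc away; for example, here’s the Germany/Italy value we calculated at the start:

pairwise.loc['Germany', 'Italy']  # 342.302498, the same number pdist gave us earlier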
A nice way to visualize these distances is with a heatmap. 137 countries are a bit too many to show on a webpage, so let’s restrict the plot to just the countries that have won more than 500 medals in total:
import seaborn as sns
import matplotlib.pyplot as plt
# make summary table for just top countries
top_countries = (
data
.groupby('Country')
.filter(lambda x : len(x) > 500)
.groupby(['Country', 'Sport'])
.size()
.unstack()
.fillna(0)
)
# make pairwise distance matrix
pairwise_top = pd.DataFrame(
squareform(pdist(top_countries)),
columns = top_countries.index,
index = top_countries.index
)
# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
pairwise_top,
cmap='OrRd',
linewidth=1
)
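# note: in a Jupyter notebook the heatmap renders automatically; if you're running
# this as a plain script, you may need to add plt.show() here to display the figure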
Now that we have a plot to look at, we can see a problem with the distance metric we’re using. The US has won so many more medals than other countries that it distorts the measurement. And if we think about it, what we’re really interested in is not the exact number of medals in each category, but the relative number. In other words, we want two countries to be considered similar if they both have about twice as many medals in boxing as in athletics, for example, regardless of the exact numbers.
Luckily for us, there is a distance measure already implemented in scipy that has that property – it’s called the cosine distance. Think of it as a measurement that only looks at the relationships between the 42 numbers for each country, not their magnitude. We can switch to cosine distance by specifying the metric keyword argument in pdist:
# make pairwise distance matrix
pairwise_top = pd.DataFrame(
squareform(pdist(top_countries, metric='cosine')),
columns = top_countries.index,
index = top_countries.index
)
# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
pairwise_top,
cmap='OrRd',
linewidth=1
)
And as you can see, we spot some much more interesting patterns. Notice, for example, that Russia and the Soviet Union have a very low distance (i.e. their medal distributions are very similar).
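If you’re wondering what’s behind each of those cells: for two rows of medal counts a and b, the cosine distance is 1 − (a · b) / (|a| |b|), so we can reproduce a single value of the matrix by hand (a quick sketch, assuming Germany and Italy both cleared the 500-medal cut and so appear in top_countries):

import numpy as np

a = top_countries.loc['Germany'].to_numpy()
b = top_countries.loc['Italy'].to_numpy()

# one minus the cosine of the angle between the two medal vectors
1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# same value as pairwise_top.loc['Germany', 'Italy']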
When looking at data like this, remember that the shade of each cell is not telling us anything about how many medals a country has won – simply how different or similar each pair of countries is. Compare the above heatmap with this one, which displays the proportion of medals in each sport per country:
plt.figure(figsize=(10,10))
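# dividing each row by its own total turns raw medal counts into per-sport proportions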
sns.heatmap(
top_countries.apply(lambda x : x / x.sum(), axis=1),
cmap='BuPu',
square=True,
cbar_kws = {'fraction' : 0.02}
)
Finally, how might we find pairs of countries that have very similar medal distributions (i.e. very low numbers in the pairwise table)? By far the easiest way is to start off by reshaping the table into long form, so that each comparison is on a separate row:
# create our pairwise distance matrix
pairwise = pd.DataFrame(
squareform(pdist(summary, metric='cosine')),
columns = summary.index,
index = summary.index
)
# move to long form
long_form = pairwise.unstack()
# rename columns and turn into a dataframe
long_form.index.rename(['Country A', 'Country B'], inplace=True)
long_form = long_form.to_frame('cosine distance').reset_index()
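At this point every ordered pair of countries is a separate row, so the table should have 137 × 137 = 18,769 rows and three columns:

long_form.shape  # (18769, 3)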
Now we can write our filter as normal, remembering to filter out the uninteresting rows that tell us a country’s distance from itself!
long_form[
(long_form['cosine distance'] < 0.05)
& (long_form['Country A'] != long_form['Country B'])
]
|       | Country A            | Country B                        | cosine distance |
|-------|----------------------|----------------------------------|-----------------|
| 272   | Algeria              | Zambia                           | 0.026671        |
| 1034  | Azerbaijan           | Mongolia                         | 0.045618        |
| 1105  | Bahamas              | Barbados                         | 0.021450        |
| 1111  | Bahamas              | British West Indies              | 0.021450        |
| 1113  | Bahamas              | Burundi                          | 0.021450        |
| …     | …                    | …                                | …               |
| 17033 | United Arab Emirates | Haiti                            | 0.010051        |
| 17037 | United Arab Emirates | Independent Olympic Participants | 0.000000        |
| 17051 | United Arab Emirates | Kuwait                           | 0.000000        |
| 18164 | Virgin Islands       | Netherlands Antilles             | 0.000000        |
| 18496 | Zambia               | Algeria                          | 0.026671        |

462 rows × 3 columns
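Notice that each qualifying pair shows up twice, once in each order (Algeria/Zambia near the top and Zambia/Algeria at the bottom). If we only want one row per pair, a simple trick is to require that Country A sorts alphabetically before Country B, which also drops the self-comparisons (a quick sketch):

long_form[
    (long_form['cosine distance'] < 0.05)
    & (long_form['Country A'] < long_form['Country B'])
]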