This is a somewhat specialized problem that forms part of a lot of data science and clustering workflows. It starts with a relatively straightforward question: if we have a bunch of measurements for two different things, how do we come up with a single number that represents the difference between the two things?
An example will make the question clearer. Let’s load our Olympic medal dataset:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data
|       | City    | Year | Sport     | … | Medal  | Country   | Int Olympic Committee code |
|-------|---------|------|-----------|---|--------|-----------|----------------------------|
| 0     | Athens  | 1896 | Aquatics  | … | Gold   | Hungary   | HUN |
| 1     | Athens  | 1896 | Aquatics  | … | Silver | Austria   | AUT |
| 2     | Athens  | 1896 | Aquatics  | … | Bronze | Greece    | GRE |
| 3     | Athens  | 1896 | Aquatics  | … | Gold   | Greece    | GRE |
| 4     | Athens  | 1896 | Aquatics  | … | Silver | Greece    | GRE |
| …     | …       | …    | …         | … | …      | …         | …   |
| 29211 | Beijing | 2008 | Wrestling | … | Silver | Germany   | GER |
| 29212 | Beijing | 2008 | Wrestling | … | Bronze | Lithuania | LTU |
| 29213 | Beijing | 2008 | Wrestling | … | Bronze | Armenia   | ARM |
| 29214 | Beijing | 2008 | Wrestling | … | Gold   | Cuba      | CUB |
| 29215 | Beijing | 2008 | Wrestling | … | Silver | Russia    | RUS |

29216 rows × 12 columns
and measure, for each different country, the number of medals they’ve won in each different sport:
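# count medals per (Country, Sport) pair, pivot the sports into columns,
# and fill in zero for any country/sport combination with no medals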
summary = data.groupby(['Country', 'Sport']).size().unstack().fillna(0)
summary
| Country \ Sport | Aquatics | Archery | Athletics | … | Water Motorsports | Weightlifting | Wrestling |
|-----------------|----------|---------|-----------|---|-------------------|---------------|-----------|
| Afghanistan     | 0.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |
| Algeria         | 0.0      | 0.0     | 6.0       | … | 0.0               | 0.0           | 0.0       |
| Argentina       | 3.0      | 0.0     | 5.0       | … | 0.0               | 2.0           | 0.0       |
| Armenia         | 0.0      | 0.0     | 0.0       | … | 0.0               | 4.0           | 4.0       |
| Australasia     | 11.0     | 0.0     | 1.0       | … | 0.0               | 0.0           | 0.0       |
| …               | …        | …       | …         | … | …                 | …             | …         |
| Virgin Islands* | 0.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |
| West Germany    | 62.0     | 0.0     | 67.0      | … | 0.0               | 7.0           | 9.0       |
| Yugoslavia      | 91.0     | 0.0     | 2.0       | … | 0.0               | 0.0           | 16.0      |
| Zambia          | 0.0      | 0.0     | 1.0       | … | 0.0               | 0.0           | 0.0       |
| Zimbabwe        | 7.0      | 0.0     | 0.0       | … | 0.0               | 0.0           | 0.0       |

137 rows × 42 columns
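Before moving on, it’s worth a quick sanity check that the reshaping did what we wanted (a one-liner, based on the row and column counts shown above):

summary.shape  # (137, 42): one row per country, one column per sport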
Now we’ll pick two countries:
summary.loc[['Germany', 'Italy']]
| Country \ Sport | Aquatics | Archery | Athletics | … | Water Motorsports | Weightlifting | Wrestling |
|-----------------|----------|---------|-----------|---|-------------------|---------------|-----------|
| Germany         | 175.0    | 6.0     | 99.0      | … | 0.0               | 20.0          | 24.0      |
| Italy           | 113.0    | 12.0    | 71.0      | … | 0.0               | 14.0          | 20.0      |

2 rows × 42 columns
Each country has 42 columns giving the total number of medals won in each sport. Our job is to come up with a single number that summarizes how different those two lists of numbers are. Mathematicians have figured out lots of different ways of doing that, many of which are implemented in the scipy.spatial.distance module.
If we just import pdist from the module and pass in our dataframe of two countries, we’ll get a measurement:
from scipy.spatial.distance import pdist
pdist(summary.loc[['Germany', 'Italy']])
array([342.3024978])
That’s the distance score using the default metric, which is called the Euclidean distance. Think of it as the straight-line distance between the two points in space defined by the two lists of 42 numbers.
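If you want to convince yourself that this is all pdist is doing, we can reproduce the same number with plain numpy (a quick sketch using the two rows we just selected):

import numpy as np

germany = summary.loc['Germany'].to_numpy()
italy = summary.loc['Italy'].to_numpy()

# square the 42 differences, add them up, and take the square root
np.sqrt(((germany - italy) ** 2).sum())  # 342.3024978, the same value pdist returned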
Now, what happens if we pass in a dataframe with three countries?
pdist(summary.loc[['Germany', 'Italy', 'France']])
array([342.3024978 , 317.98584874, 144.82403116])
As we might expect, we have three measurements:
- Germany and Italy
- Germany and France
- Italy and France
But it’s not easy to figure out which belongs to which. Happily, scipy also has a helper function that will take this list of numbers and turn it back into a square matrix:
from scipy.spatial.distance import squareform
squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))
array([[ 0. , 342.3024978 , 317.98584874],
[342.3024978 , 0. , 144.82403116],
[317.98584874, 144.82403116, 0. ]])
In order to make sense of this, we need to re-attach the country names, which we can just do by turning it into a DataFrame:
pd.DataFrame(
squareform(pdist(summary.loc[['Germany', 'Italy', 'France']])),
columns = ['Germany', 'Italy', 'France'],
index = ['Germany', 'Italy', 'France']
)
|         | Germany    | Italy      | France     |
|---------|------------|------------|------------|
| Germany | 0.000000   | 342.302498 | 317.985849 |
| Italy   | 342.302498 | 0.000000   | 144.824031 |
| France  | 317.985849 | 144.824031 | 0.000000   |
Hopefully this agrees with our intuition: the numbers on the diagonal are all zero, because each country is identical to itself, and the values above and below the diagonal are mirror images of each other, because the distance between Germany and France is the same as the distance between France and Germany (remember that we’re talking about distance in terms of medal counts, not geographical distance!).
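Both of these properties are easy to check programmatically too (a quick sketch reusing the three-country comparison):

import numpy as np

three = squareform(pdist(summary.loc[['Germany', 'Italy', 'France']]))

np.allclose(three, three.T)     # True: the matrix is symmetric
np.allclose(np.diag(three), 0)  # True: every country is at distance zero from itself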
Finally, to get pairwise measurements for the whole input dataframe, we just pass in the complete object and get the country names from the index:
pairwise = pd.DataFrame(
squareform(pdist(summary)),
columns = summary.index,
index = summary.index
)
pairwise
| Country         | Afghanistan | Algeria    | Argentina  | … | Yugoslavia | Zambia     | Zimbabwe   |
|-----------------|-------------|------------|------------|---|------------|------------|------------|
| Afghanistan     | 0.000000    | 8.774964   | 96.643675  | … | 171.947666 | 1.732051   | 17.492856  |
| Algeria         | 8.774964    | 0.000000   | 95.199790  | … | 171.688672 | 7.348469   | 19.519221  |
| Argentina       | 96.643675   | 95.199790  | 0.000000   | … | 148.128323 | 96.348326  | 89.810912  |
| Armenia         | 5.830952    | 9.848858   | 96.477977  | … | 171.604196 | 5.744563   | 18.384776  |
| Australasia     | 18.708287   | 20.024984  | 97.744565  | … | 166.991018 | 18.627936  | 22.360680  |
| …               | …           | …          | …          | … | …          | …          | …          |
| Virgin Islands* | 1.414214    | 8.774964   | 96.457244  | … | 171.947666 | 1.732051   | 17.492856  |
| West Germany    | 153.052279  | 150.306354 | 142.537714 | … | 184.945938 | 152.577849 | 144.045132 |
| Yugoslavia      | 171.947666  | 171.688672 | 148.128323 | … | 0.000000   | 171.874955 | 169.103519 |
| Zambia          | 1.732051    | 7.348469   | 96.348326  | … | 171.874955 | 0.000000   | 17.521415  |
| Zimbabwe        | 17.492856   | 19.519221  | 89.810912  | … | 169.103519 | 17.521415  | 0.000000   |

137 rows × 137 columns
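One nice side effect of having the country names as both the index and the columns is that looking up any single comparison is now just a .loc away; for example, here’s the Germany/Italy value we calculated at the start:

pairwise.loc['Germany', 'Italy']  # 342.302498, the same number pdist gave us earlier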
A nice way to visualize these distances is with a heatmap. 137 countries are a bit too many to show on a webpage, so let’s restrict the plot to just the countries that have won more than 500 medals in total:
import seaborn as sns
import matplotlib.pyplot as plt
# make summary table for just top countries
top_countries = (
data
.groupby('Country')
.filter(lambda x : len(x) > 500)
.groupby(['Country', 'Sport'])
.size()
.unstack()
.fillna(0)
)
# make pairwise distance matrix
pairwise_top = pd.DataFrame(
squareform(pdist(top_countries)),
columns = top_countries.index,
index = top_countries.index
)
# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
pairwise_top,
cmap='OrRd',
linewidth=1
)
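# note: in a Jupyter notebook the heatmap renders automatically; if you're running
# this as a plain script, you may need to add plt.show() here to display the figure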
Now that we have a plot to look at, we can see a problem with the distance metric we’re using. The US has won so many more medals than other countries that it distorts the measurement. And if we think about it, what we’re really interested in is not the exact number of medals in each category, but the relative number. In other words, we want two countries to be considered similar if they both have about twice as many medals in boxing as in athletics, for example, regardless of the exact numbers.
Luckily for us, there is a distance measure already implemented in scipy that has that property – it’s called the cosine distance. Think of it as a measurement that only looks at the relationships between the 42 numbers for each country, not their magnitude. We can switch to cosine distance by specifying the metric keyword argument in pdist:
# make pairwise distance matrix
pairwise_top = pd.DataFrame(
squareform(pdist(top_countries, metric='cosine')),
columns = top_countries.index,
index = top_countries.index
)
# plot it with seaborn
plt.figure(figsize=(10,10))
sns.heatmap(
pairwise_top,
cmap='OrRd',
linewidth=1
)
And as you can see, we spot some much more interesting patterns. Notice, for example, that Russia and the Soviet Union have a very low distance (i.e. their medal distributions are very similar).
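If you’re wondering what’s behind each of those cells: for two rows of medal counts a and b, the cosine distance is 1 − (a · b) / (|a| |b|), so we can reproduce a single value of the matrix by hand (a quick sketch, assuming Germany and Italy both cleared the 500-medal cut and so appear in top_countries):

import numpy as np

a = top_countries.loc['Germany'].to_numpy()
b = top_countries.loc['Italy'].to_numpy()

# one minus the cosine of the angle between the two medal vectors
1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# same value as pairwise_top.loc['Germany', 'Italy']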
When looking at data like this, remember that the shade of each cell is not telling us anything about how many medals a country has won – simply how different or similar each pair of countries is. Compare the above heatmap with this one, which displays the proportion of medals in each sport per country:
plt.figure(figsize=(10,10))
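# dividing each row by its own total turns raw medal counts into per-sport proportions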
sns.heatmap(
top_countries.apply(lambda x : x / x.sum(), axis=1),
cmap='BuPu',
square=True,
cbar_kws = {'fraction' : 0.02}
)
Finally, how might we find pairs of countries that have very similar medal distributions (i.e. very low numbers in the pairwise table)? By far the easiest way is to start off by reshaping the table into long form, so that each comparison is on a separate row:
# create our pairwise distance matrix
pairwise = pd.DataFrame(
squareform(pdist(summary, metric='cosine')),
columns = summary.index,
index = summary.index
)
# move to long form
long_form = pairwise.unstack()
# rename columns and turn into a dataframe
long_form.index.rename(['Country A', 'Country B'], inplace=True)
long_form = long_form.to_frame('cosine distance').reset_index()
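At this point every ordered pair of countries is a separate row, so the table should have 137 × 137 = 18,769 rows and three columns:

long_form.shape  # (18769, 3)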
Now we can write our filter as normal, remembering to filter out the uninteresting rows that tell us a country’s distance from itself!
long_form[
(long_form['cosine distance'] < 0.05)
& (long_form['Country A'] != long_form['Country B'])
]
|       | Country A            | Country B                        | cosine distance |
|-------|----------------------|----------------------------------|-----------------|
| 272   | Algeria              | Zambia                           | 0.026671        |
| 1034  | Azerbaijan           | Mongolia                         | 0.045618        |
| 1105  | Bahamas              | Barbados                         | 0.021450        |
| 1111  | Bahamas              | British West Indies              | 0.021450        |
| 1113  | Bahamas              | Burundi                          | 0.021450        |
| …     | …                    | …                                | …               |
| 17033 | United Arab Emirates | Haiti                            | 0.010051        |
| 17037 | United Arab Emirates | Independent Olympic Participants | 0.000000        |
| 17051 | United Arab Emirates | Kuwait                           | 0.000000        |
| 18164 | Virgin Islands       | Netherlands Antilles             | 0.000000        |
| 18496 | Zambia               | Algeria                          | 0.026671        |

462 rows × 3 columns
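Notice that each qualifying pair shows up twice, once in each order (Algeria/Zambia near the top and Zambia/Algeria at the bottom). If we only want one row per pair, a simple trick is to require that Country A sorts alphabetically before Country B, which also drops the self-comparisons (a quick sketch):

long_form[
    (long_form['cosine distance'] < 0.05)
    & (long_form['Country A'] < long_form['Country B'])
]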