Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1', 'col2']).size()
Usually you want this result as a DataFrame (instead of a Series) so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate row counts and other statistics for each group, continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let’s use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
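As an aside (not part of the original example): .size() counts every row in each group, NaN values included, whereas .count() reports the number of non-null values separately for each column. A minimal sketch of the difference:

# group sizes: one number per group, rows counted whether or not they contain NaN
df.groupby(['col1', 'col2']).size()

# non-null values per column, per group (returns a DataFrame with one column per original column)
df.groupby(['col1', 'col2']).count()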
Then let’s use .size().reset_index(name='counts') to get the row counts as a DataFrame with a named counts column:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
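On recent pandas versions the same DataFrame can also be produced directly. This is a side note, assuming pandas 1.1 or newer, where .size() respects as_index=False and names the resulting column 'size':

df.groupby(['col1', 'col2'], as_index=False).size().rename(columns={'size': 'counts'})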
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because the row counts are reported on a per-column basis.
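If the nested labels are the only problem, one common workaround (a sketch, not part of the approach shown next) is to flatten the MultiIndex columns by joining the two levels:

stats = df.groupby(['col1', 'col2']).agg({
    'col3': ['mean', 'count'],
    'col4': ['median', 'min', 'count']
})
# collapse ('col4', 'median') -> 'col4_median', etc.
stats.columns = ['_'.join(col) for col in stats.columns]
stats = stats.reset_index()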
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
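For comparison, a more compact alternative (assuming pandas 0.25 or newer, which introduced named aggregation) produces the same flat layout in a single agg call, without the repeated joins:

(df
 .groupby(['col1', 'col2'])
 .agg(
     counts=('col3', 'size'),        # group row count (any column works here)
     col3_mean=('col3', 'mean'),
     col4_median=('col4', 'median'),
     col4_min=('col4', 'min'),
 )
 .reset_index()
)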
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys, np.random.randn(10, 4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...: