Aggregating data with pandas
I recently came across a "challenge" where I needed to combine various rows. Each row was identified by Key1 and Key2 and had two interesting columns, Foo and Bar. For each Key1 there may be a few Key2 values, and for each Key2 there are n Foo/Bar entries. While all Foo values are distinct per Key1 and Key2, a given Bar value may appear j times.
The goal was to get a list of unique Bar items for each Key1/Key2 combination.
Index | Key1 | Key2 | Foo | Bar |
---|---|---|---|---|
0 | C1 | T1 | a1 | rc-1 |
1 | C1 | T1 | a2 | rc-1 |
2 | C1 | T1 | a3 | rc-1 |
3 | C1 | T1 | a4 | rc-1 |
4 | C2 | T2 | b1 | rc-1 |
5 | C2 | T2 | b2 | rc-2 |
6 | C3 | T3 | c1 | rc-3 |
7 | C4 | T4 | d1 | rc-4 |
8 | C4 | T4 | d2 | rc-5 |
9 | C4 | T4 | d3 | rc-4 |
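For readers who do not have the spreadsheet at hand, the sample table can be rebuilt directly in code. This is only a sketch: the inline construction and the names `df` and `counts` are assumptions for illustration, since the original data came from an Excel file. The `size` and `nunique` columns make the duplication visible, showing how many rows each Key1/Key2 pair has and how many distinct Bar values are among them.

```python
import pandas as pd

# Sample data copied from the table above (built inline instead of read from Excel).
df = pd.DataFrame({
    "Key1": ["C1"] * 4 + ["C2"] * 2 + ["C3"] + ["C4"] * 3,
    "Key2": ["T1"] * 4 + ["T2"] * 2 + ["T3"] + ["T4"] * 3,
    "Foo": ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "d1", "d2", "d3"],
    "Bar": ["rc-1", "rc-1", "rc-1", "rc-1", "rc-1", "rc-2",
            "rc-3", "rc-4", "rc-5", "rc-4"],
})

# Rows per Key1/Key2 pair versus number of distinct Bar values in that pair.
counts = df.groupby(["Key1", "Key2"])["Bar"].agg(["size", "nunique"])
print(counts)
```

Wherever `size` is larger than `nunique`, the group contains repeated Bar values that need to be collapsed.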
The following Python code did the job nicely, thanks to http://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings:
    # -*- coding: utf-8 -*-
    import pandas as pd


    def unique(liste):
        """
        Takes a string of comma-separated elements and returns
        a sorted string of the unique items, separated by commas.
        """
        a = liste.split(',')
        b = sorted(set(a))
        return ','.join(b)


    df = pd.read_excel('groupb_Beispiel.xlsx')
    print(df)

    # Concatenate all Bar values per Key1/Key2 group into one string
    grouped = df.groupby(['Key1', 'Key2'], as_index=False)['Bar'].agg(lambda col: ','.join(col))
    grouped = pd.DataFrame(grouped)
    # Reduce each concatenated string to its unique, sorted items
    grouped['Unique'] = grouped['Bar'].apply(unique)
    print(grouped)
    grouped.to_excel('result.xlsx')
Index | Key1 | Key2 | Bar | Unique |
---|---|---|---|---|
0 | C1 | T1 | rc-1,rc-1,rc-1,rc-1 | rc-1 |
1 | C2 | T2 | rc-1,rc-2 | rc-1,rc-2 |
2 | C3 | T3 | rc-3 | rc-3 |
3 | C4 | T4 | rc-4,rc-5,rc-4 | rc-4,rc-5 |
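If the intermediate concatenated Bar column is not needed, the join and the de-duplication can probably be folded into a single aggregation. The sketch below is an alternative, not the code used above: the helper name `unique_joined` and the inline sample data are assumptions for illustration. It applies `sorted(set(...))` to each group's values directly, so no string has to be split again.

```python
import pandas as pd

# Sample data from the first table (built inline instead of read from Excel).
df = pd.DataFrame({
    "Key1": ["C1"] * 4 + ["C2"] * 2 + ["C3"] + ["C4"] * 3,
    "Key2": ["T1"] * 4 + ["T2"] * 2 + ["T3"] + ["T4"] * 3,
    "Foo": ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "d1", "d2", "d3"],
    "Bar": ["rc-1", "rc-1", "rc-1", "rc-1", "rc-1", "rc-2",
            "rc-3", "rc-4", "rc-5", "rc-4"],
})


def unique_joined(col):
    """Return the distinct values of a group, sorted and joined by commas."""
    return ",".join(sorted(set(col)))


# One aggregation per Key1/Key2 group; reset_index turns the group keys
# back into ordinary columns and names the aggregated column 'Unique'.
result = (
    df.groupby(["Key1", "Key2"])["Bar"]
      .agg(unique_joined)
      .reset_index(name="Unique")
)
print(result)
```

This assumes every Bar value is a string. Missing or numeric values would need to be converted or dropped before joining.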