Aggregating data with pandas
I recently came across a "challenge" where I needed to combine several rows. Each row is identified by Key1 and Key2 and has two columns of interest, Foo and Bar. Each Key1 may have a few Key2 values, and each Key2 has n Foo/Bar entries. While every Foo is distinct within a Key1/Key2 combination, the same Bar value may appear multiple (j) times.
The goal was to get a list of unique Bar items for each Key1/Key2 combination.
| | Key1 | Key2 | Foo | Bar |
|---|---|---|---|---|
| 0 | C1 | T1 | a1 | rc-1 |
| 1 | C1 | T1 | a2 | rc-1 |
| 2 | C1 | T1 | a3 | rc-1 |
| 3 | C1 | T1 | a4 | rc-1 |
| 4 | C2 | T2 | b1 | rc-1 |
| 5 | C2 | T2 | b2 | rc-2 |
| 6 | C3 | T3 | c1 | rc-3 |
| 7 | C4 | T4 | d1 | rc-4 |
| 8 | C4 | T4 | d2 | rc-5 |
| 9 | C4 | T4 | d3 | rc-4 |
The following Python code did the job nicely, thanks to this Stack Overflow question: http://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings
    # -*- coding: utf-8 -*-
    import pandas as pd


    def unique(liste):
        """
        Take a comma-separated string and return its unique items
        as a sorted, comma-separated string.
        """
        a = liste.split(',')
        b = sorted(set(a))
        return ','.join(b)


    # Read the input table shown above
    df = pd.read_excel('groupb_Beispiel.xlsx')
    print(df)

    # Join all Bar values of each Key1/Key2 group into one comma-separated string
    grouped = df.groupby(['Key1', 'Key2'], as_index=False)['Bar'].agg(lambda col: ','.join(col))
    grouped = pd.DataFrame(grouped)

    # Reduce each joined string to its unique items
    grouped['Unique'] = grouped['Bar'].apply(unique)
    print(grouped)

    grouped.to_excel('result.xlsx')
| | Key1 | Key2 | Bar | Unique |
|---|---|---|---|---|
| 0 | C1 | T1 | rc-1,rc-1,rc-1,rc-1 | rc-1 |
| 1 | C2 | T2 | rc-1,rc-2 | rc-1,rc-2 |
| 2 | C3 | T3 | rc-3 | rc-3 |
| 3 | C4 | T4 | rc-4,rc-5,rc-4 | rc-4,rc-5 |
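If the intermediate Bar column with the joined duplicates is not needed, the deduplication can probably be done inside the aggregation itself. The following is only a minimal sketch under a few assumptions: the sample data is typed in by hand from the input table instead of being read from the Excel file, and the helper name `join_unique` and the variable `result` are made up for this illustration.

```python
import pandas as pd

# Sample data copied by hand from the input table (assumption: no Excel file needed).
df = pd.DataFrame({
    "Key1": ["C1", "C1", "C1", "C1", "C2", "C2", "C3", "C4", "C4", "C4"],
    "Key2": ["T1", "T1", "T1", "T1", "T2", "T2", "T3", "T4", "T4", "T4"],
    "Foo": ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "d1", "d2", "d3"],
    "Bar": ["rc-1", "rc-1", "rc-1", "rc-1", "rc-1",
            "rc-2", "rc-3", "rc-4", "rc-5", "rc-4"],
})


def join_unique(col):
    """Hypothetical helper: unique values of a column as a sorted, comma-separated string."""
    return ",".join(sorted(set(col)))


# Group by both keys and deduplicate, sort, and join the Bar values in one step.
result = (
    df.groupby(["Key1", "Key2"])["Bar"]
      .agg(join_unique)
      .reset_index(name="Unique")
)
print(result)
```

Note that `sorted` compares the values as plain strings, so a value such as rc-10 would sort before rc-2. If the order matters, a custom sort key would be needed.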