Olga Botvinnik - 1 year ago
Python Question

Pandas: Get duplicated indexes

Given a dataframe, I want to find the index labels that occur more than once but whose rows do not hold identical column values, and see which values differ.

Specifically, I have this dataframe:

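import pandas as pd
# Shell command to fetch the file; the leading "!" runs it from inside IPython
!wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
# No header row in the BED file; column 3 (0-based) becomes the index
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)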

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at this location), and I want to know which different types of repetitive elements occur at each individual location (each index = a genome location).

I'm guessing this will require some kind of groupby, and hopefully some ninja can help me out.

To simplify even further, if we only have the index and the repeat type,

genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich

The output I'd like to see is all duplicate indexes and their repeat types, like so:

genome_location1 MIR3
genome_location1 AluJb
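
For anyone who wants to reproduce the toy example, here is a minimal sketch of one way to get that output. It assumes the repeat type sits in a column named `type` and that the frame is called `df`; both names are chosen for illustration. It uses `Index.duplicated(keep=False)`, which flags every row whose index label appears more than once.

```python
import pandas as pd

# Toy frame from the question; the column name "type" is an assumption
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=[
        "genome_location1",
        "genome_location1",
        "genome_location2",
        "genome_location3",
    ],
)

# keep=False marks ALL occurrences of a repeated label, not just the later ones
dup_rows = df[df.index.duplicated(keep=False)]
print(dup_rows)
```

With the default `keep='first'`, the first occurrence of each label would be left unmarked, so only the second `genome_location1` row would be returned.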

EDIT: added toy example

>>> df[df.groupby(level=0).transform(len)['type'] > 1]
genome_location1   MIR3
genome_location1  AluJb
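
The question also asks for duplicated indexes whose values actually differ. A possible sketch for that part, not necessarily the approach intended above: count the distinct repeat types per index label and keep only labels with more than one. The extra `genome_location2` row is hypothetical data, added only to show a duplicated index with identical values being excluded; the column name `type` is likewise an assumption.

```python
import pandas as pd

# Toy frame plus one hypothetical extra row (second genome_location2 / Tigger1)
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "Tigger1", "AT_rich"]},
    index=[
        "genome_location1",
        "genome_location1",
        "genome_location2",
        "genome_location2",
        "genome_location3",
    ],
)

# Number of distinct repeat types per index label, broadcast back to each row
n_types = df.groupby(level=0)["type"].transform("nunique")

# Keep rows whose label carries more than one distinct type;
# to_numpy() makes the mask positional, so the duplicated index is not aligned on
conflicting = df[(n_types > 1).to_numpy()]

# One entry per location, listing its distinct repeat types
types_per_location = conflicting.groupby(level=0)["type"].unique()
print(types_per_location)
```

Here `genome_location2` is duplicated but always `Tigger1`, so it drops out, while `genome_location1` is kept with both `MIR3` and `AluJb`.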