Garrett Eickelberg - 3 months ago
Python Question

How to make multiline graph with matplotlib subplots and pandas?

I'm fairly new to coding (completely self-taught) and have started using it at my job as a research assistant in a cancer lab. I need some help setting up a few line graphs in matplotlib.

I have a dataset that includes next-generation sequencing data for about 80 patients. For each patient, we have different timepoints of analysis, different genes detected (out of 40), and the associated % mutation for each gene.

My goal is to write two scripts. The first will generate a "by patient" plot: a line graph with % mutation on the y-axis and time of measurement on the x-axis, with a different colored line for each of the patient's associated genes. The second will generate a "by gene" plot, where one plot contains a different colored line for each patient's x/y values for that specific gene.

Here is an example dataframe for 1 genenumber for the above script:

gene     yaxis  xaxis  pt#  gene#
ASXL1-3  34     1      3    1
ASXL1-3  0      98     3    1
IDH1-3   24     1      3    11
IDH1-3   0      98     3    11
RUNX1-3  38     1      3    21
RUNX1-3  0      98     3    21
U2AF1-3  33     1      3    26
U2AF1-3  0      98     3    26

I have set up a groupby script that, when I iterate over it, gives me a dataframe of every gene-timepoint for each patient.

grouped = df.groupby('pt #')
for groupObject in grouped:
    group = groupObject[1]

For patient 1, this gives the following output:

        y     x   gene  patientnumber patientgene  genenumber  dxtotransplant  \
0    40.0  1712  ASXL1              1     ASXL1-1           1            1857
1    26.0  1835  ASXL1              1     ASXL1-1           1            1857
302   7.0  1835  RUNX1              1     RUNX1-1          21            1857

I need help writing a script that will create either of the plots described above. Using the "by patient" example, my general idea is that I need to create a different subplot for every gene a patient has, where each subplot is the line graph for that one gene.

Using matplotlib this is about as far as I have gotten:


grouped = df.groupby('patient number')

for groupObject in grouped:
    group = groupObject[1]
    df = group #may need to remove this
    for element in range(len(group)):
        xs = np.array(df[df.columns[1]]) #"x" column
        ys= np.array(df[df.columns[0]]) #"y" column
        gene = np.array(df[df.columns[2]])[element] #"gene" column
        plt.scatter(xs,ys, label=gene)
        plt.plot(xs,ys, label=gene)

This produces the following output:

enter image description here

In this output, the circled line is not supposed to be connected to the other 2 points. This is patient 1, who has the following data points:

x     y   gene
1712  40  ASXL1
1835  26  ASXL1
1835  7   RUNX1

Using seaborn I have gotten close to my desired graph using this code:

grouped = df.groupby(['patientnumber'])
for groupObject in grouped:
    group = groupObject[1]
    g = sns.FacetGrid(group, col="patientgene", col_wrap=4, size=4, ylim=(0,100))
    g = g.map(plt.scatter, "x", "y", alpha=0.5)
    g = g.map(plt.plot, "x", "y", alpha=0.5)
    plt.title= "gene:%s"%element

Using this code I get the following:

If I adjust the line:

g = sns.FacetGrid(group, col="patientnumber", col_wrap=4, size=4, ylim=(0,100))

I get the following result:

enter image description here

As you can see in the second example, the plot treats every point as if it belongs to the same line (but they are actually 4 separate lines).

How can I tweak my iterations so that each patient-gene is treated as a separate line on the same graph?


I wrote a subplot function that may give you a hand. I modified the data a tad to help illustrate the plotting functionality.

gene,yaxis,xaxis,pt #,gene #
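
If you don't have a data.csv handy, here is a minimal sketch of a stand-in. It assumes the eight example rows from the question (not my modified data) and reads them from an in-memory string; the name `csv_text` is only illustrative.

```python
import io

import pandas as pd

# Stand-in for data.csv: the eight example rows from the question as CSV text.
# These are the asker's original values, not the modified data set.
csv_text = """gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,0,98,3,1
IDH1-3,24,1,3,11
IDH1-3,0,98,3,11
RUNX1-3,38,1,3,21
RUNX1-3,0,98,3,21
U2AF1-3,33,1,3,26
U2AF1-3,0,98,3,26
"""

# read_csv accepts any file-like object, so StringIO works in place of a path.
df = pd.read_csv(io.StringIO(csv_text))

# Each gene should show up with two rows (one per timepoint).
print(df.groupby('gene').size())
```

Swapping `io.StringIO(csv_text)` for a file path gives the same frame from disk.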

This is the subplotting function...with some extra bells and whistles :)

def plotByGroup(df, group, xCol, yCol, title = "", xLab = "", yLab = "", lineColors = ["red", "orange", "yellow", "green", "blue", "purple"], lineWidth = 2, lineOpacity = 0.7, plotStyle = 'ggplot', showLegend = False):
    """
    Plot multiple lines from a Pandas Data Frame for each group using DataFrame.groupby() and MatPlotLib PyPlot.
    Parameters:
        df          - Required  - Data Frame    - Pandas Data Frame
        group       - Required  - String        - Column name to group on
        xCol        - Required  - String        - Column name for X axis data
        yCol        - Required  - String        - Column name for y axis data
        title       - Optional  - String        - Plot Title
        xLab        - Optional  - String        - X axis label
        yLab        - Optional  - String        - Y axis label
        lineColors  - Optional  - List          - Colors to plot multiple lines
        lineWidth   - Optional  - Integer       - Width of lines to plot
        lineOpacity - Optional  - Float         - Alpha of lines to plot
        plotStyle   - Optional  - String        - MatPlotLib plot style
        showLegend  - Optional  - Boolean       - Show legend
    Returns:
        MatPlotLib Plot Object
    """

    # Import MatPlotLib Plotting Function & Set Style
    from matplotlib import pyplot as plt
    plt.style.use(plotStyle)
    figure = plt.figure()                   # Initialize Figure
    grouped = df.groupby(group)             # Set Group
    i = 0                                   # Set iteration to determine line color indexing
    for groupObj in grouped:
        colorIndex = i % len(lineColors)    # Define line color index
        groupDf = groupObj[1]               # Get group data frame
        lineLab = groupDf[group].values[0]  # Get group label from first position
        xValues = groupDf[xCol]             # Get x vector
        yValues = groupDf[yCol]             # Get y vector
        plt.subplot(1,1,1)                  # Select the single subplot and plot (next line)
        plt.plot(xValues, yValues, label = lineLab, color = lineColors[colorIndex], lw = lineWidth, alpha = lineOpacity)
        # Plot legend
        if showLegend:
            plt.legend()
        i += 1
    # Set title & Labels
    axis = plt.gca()                        # Reuse the axes plotted on above
    axis.set_title(title)
    axis.set_xlabel(xLab)
    axis.set_ylabel(yLab)
    # Return plot for saving, showing, etc.
    return plt
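
Stripped of the options, the idea is just one plot call per group. Here is a minimal sketch, assuming the three patient-1 rows from the question; the `sample` frame and the variable names are illustrative and not part of the function above.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Illustrative sample: patient 1 from the question (two genes, three points).
sample = pd.DataFrame({
    "x": [1712, 1835, 1835],
    "y": [40.0, 26.0, 7.0],
    "gene": ["ASXL1", "ASXL1", "RUNX1"],
})

fig, ax = plt.subplots()

# groupby yields (key, sub-frame) pairs. Each sub-frame holds rows for one
# gene only, so each plot call draws one independent line.
for geneName, geneDf in sample.groupby("gene"):
    ax.plot(geneDf["x"], geneDf["y"], marker="o", label=geneName)

ax.legend()
# Call plt.show() to display the figure.
```

Because the RUNX1 point is plotted in its own call, it is never joined to the two ASXL1 points. The marker keeps a single-point gene visible, since a line with one point has no segment to draw.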

And to use it...

import pandas

# Load the Data into Pandas
df = pandas.read_csv('data.csv')    

# Plotting - by Patient

# Create Patient Grouping
patientGroup = df.groupby('pt #')

# Iterate Over Groups
for group in patientGroup:
    patientDf = group[1]
    # Let's give them specific titles
    plotTitle = "Gene Frequency over Time by Gene (Patient %s)" % str(patientDf['pt #'].values[0])
    # Call the subplot function
    plot = plotByGroup(patientDf, 'gene', 'xaxis', 'yaxis', title = plotTitle, xLab = "Days", yLab = "Gene Frequency")
    # Add Vertical Lines at Assay Timepoints
    timepoints = set(patientDf.xaxis.values)
    for timepoint in timepoints:
        plot.axvline(x = timepoint, linewidth = 1, linestyle = "dashed", color='gray', alpha = 0.4)
    # Let's see it
    plot.show()

enter image description here

And of course, we can do the same by gene.

# Plotting - by Gene

# Create Gene Grouping
geneGroup   = df.groupby('gene')

# Generate Plots for Groups
for group in geneGroup:
    geneDf = group[1]
    plotTitle = "%s Gene Frequency over Time by Patient" % str(geneDf['gene'].values[0])
    plot = plotByGroup(geneDf, 'pt #', 'xaxis', 'yaxis', title = plotTitle, xLab = "Days", yLab = "Frequency")
    # Let's see it
    plot.show()

enter image description here
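
If you'd rather skip the explicit loop, a different technique is to reshape the data so each group becomes a column and let pandas draw one line per column. This is a sketch of that alternative, not what the function above does; it assumes four of the question's example rows, and `rows` and `wide` are illustrative names.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Illustrative subset of the question's example rows (patient 3, two genes).
rows = pd.DataFrame({
    "gene":  ["ASXL1-3", "ASXL1-3", "IDH1-3", "IDH1-3"],
    "yaxis": [34, 0, 24, 0],
    "xaxis": [1, 98, 1, 98],
})

# Reshape long -> wide: one row per timepoint, one column per gene.
wide = rows.pivot(index="xaxis", columns="gene", values="yaxis")

# DataFrame.plot draws one line per column and returns the matplotlib Axes.
ax = wide.plot(marker="o")
ax.set_xlabel("Days")
ax.set_ylabel("Gene Frequency")
# Call plt.show() to display the figure.
```

Note that `pivot` raises an error if the same gene has two rows at the same timepoint, so the groupby loop is the safer choice when duplicates are possible.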

If this isn't what you're looking for, provide a clarification and I'll take another crack at it.