erip erip - 8 days ago 7
Python Question

Can I mimic a log scale of an axis in matplotlib without transforming the associated data?

I am trying to display a Zipf plot, which is typically displayed on a log-log scale.

I'm using a library which gives rank in linear scale and frequencies in log scale. I have the following code which plots my data fairly correctly:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
         -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]

data = {
    'ranks': ranks,
    'freqs': freqs,
}

df = pd.DataFrame(data=data)

_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
# Recent seaborn releases require x and y to be passed as keywords
sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.show()


The resulting plot is:

[Image: Zipf 1]

The plot looks good, but the y-axis tick labels look odd. I'd like them to be displayed in log increments as well. My current workaround is to raise 10 to the power of each element in the freqs list; i.e.,

freqs = [10**freq for freq in freqs]
# ...


and change the yscale in ax.set to log; i.e.,

_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
ax.set_title("Zipf plot")
# ...


This gives me the expected plot (below), but it requires a transformation of the data that is (a) relatively expensive, (b) redundant, and (c) lossy.

[Image: Zipf 2]

Is there a way to mimic the log scale of the axes in a matplotlib plot without transforming the data?

Answer

First, a comment: personally, I would prefer rescaling the data, since it makes everything much easier at the expense of some extra memory and CPU time, and the loss of accuracy should not matter.
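To put a rough number on the accuracy concern, here is a minimal sketch that converts the question's frequencies to linear scale and back, then measures the round-trip error. It assumes the values are base-10 logarithms (as the 10**freq workaround implies) and uses only the standard library; the variable names are illustrative.

```python
import math

# Log-scale frequencies copied from the question (assumed to be log10 values)
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977,
         -16.27560806274414, -19.482173919677734, -19.502029418945312,
         -19.502029418945312, -19.502029418945312]

# Transform to linear scale, as in the question's workaround
linear = [10.0 ** f for f in freqs]

# Go back to log scale and measure how much was lost on the way
roundtrip = [math.log10(v) for v in linear]
max_abs_error = max(abs(a - b) for a, b in zip(freqs, roundtrip))

print(max_abs_error)
```

For values of this magnitude the linear numbers stay well inside the range of double-precision floats, so the error should be far below anything visible in a plot.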

Now to the question, which is actually how to mimic a log scale on a linear axis. This is not easy. Setting the axes to log scale changes a lot in the background, and one needs to mimic all of that.

  • The easy part is to set the major tick spacing to 1 by using matplotlib.ticker.MultipleLocator()
  • Creating the minor tick marks at positions which look logarithmic is harder. The best solution I could come up with is to set them manually using matplotlib.ticker.FixedLocator()
  • Last, we need to change the tick labels to represent the actual numbers, meaning that they should look like 10^(-x) instead of -x. I am aware of two options here:
    • Using a FuncFormatter that shows the values 10**x in scientific format.
    • Using a FuncFormatter that shows the values 10^x in MathText (LaTeX-like) format. This looks much nicer but contrasts with the rest of the plot.

I do not know any better solution for that last point, but maybe someone else does.
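The minor tick positions and the power-of-ten labels from the list above can be sketched in isolation, without matplotlib. The helper names log_minor_tick_positions and power_of_ten_label are hypothetical, and the sketch assumes the plotted values are base-10 logarithms, so that a tick for k * 10^d belongs at d + log10(k).

```python
import math


def log_minor_tick_positions(lowest_decade, highest_decade):
    """Positions, in log10 units, of 2..9 times each power of ten.

    Covers every decade d with lowest_decade <= d < highest_decade.
    """
    return [d + math.log10(k)
            for d in range(lowest_decade, highest_decade)
            for k in range(2, 10)]


def power_of_ten_label(x, pos=None):
    """Format a log10 tick value as a MathText power of ten."""
    return r"$10^{%d}$" % int(round(x))


# Minor ticks between 10^-11 and 10^-10
positions = log_minor_tick_positions(-11, -10)
# Gaps between neighbouring ticks; they shrink towards the upper decade
gaps = [b - a for a, b in zip(positions, positions[1:])]
label = power_of_ten_label(-11.0)
```

The list returned by the first helper could be handed to FixedLocator, and the second helper has the (x, pos) signature that FuncFormatter expects.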

Here is the code and how it looks.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from matplotlib.ticker import MultipleLocator, FixedLocator, FuncFormatter

###### Locators for the y-axis
# Set major tick marks at multiples of 1 (one per decade).
majorLocator = MultipleLocator(1.)
# Create custom minor ticks at logarithmic positions: within the decade
# [-n, -n + 1] the ticks sit at -n + log10(i) for i = 2..9, so they
# bunch up towards the upper end of the decade, as on a real log axis.
ra = np.array([-n + np.log10(i) for n in range(11, 21) for i in range(2, 10)])
minorLocator = FixedLocator(ra)
###### Formatter for the y-axis (choose one of the following two)
# Show labels as powers of 10 in scientific notation (looks ugly).
majorFormatter = FuncFormatter(lambda x, p: "{:.1e}".format(10**x))
# Or use MathText (looks nice, but does not match the rest of the layout).
# This second assignment overrides the first; comment it out to use the first.
majorFormatter = FuncFormatter(lambda x, p: r"$10^{" + "{x:d}".format(x=int(round(x))) + r"}$")

ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414, 
        -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]

data = {
    'ranks': ranks, 
    'freqs': freqs,
}

df = pd.DataFrame(data=data)

_, ax = plt.subplots(figsize=(6, 6))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")

# Recent seaborn releases require x and y to be passed as keywords
sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)

# Set the locators
ax.yaxis.set_major_locator(majorLocator)
ax.yaxis.set_minor_locator(minorLocator)
# Set formatter if you like to have the ticklabels consistently in power notation
ax.yaxis.set_major_formatter(majorFormatter)

ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.savefig(__file__+".png")
plt.show()
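As a possible generalisation of the code above, the decade range can be derived from the data instead of being hard-coded. This is only a sketch: the helper name mimic_log_yaxis is hypothetical, the Agg backend is selected just so the example runs without a display, and plain ax.scatter stands in for seaborn to keep the example small. It again assumes the y values are base-10 logarithms.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.ticker import FixedLocator, FuncFormatter, MultipleLocator


def mimic_log_yaxis(ax, log_values):
    """Hypothetical helper: make a linear y-axis holding log10 data look logarithmic."""
    # Whole decades that enclose the data
    lowest = math.floor(min(log_values))
    highest = math.ceil(max(log_values))
    # Minor ticks at log10(k * 10^d) = d + log10(k) for k = 2..9
    minor = [d + math.log10(k)
             for d in range(lowest, highest)
             for k in range(2, 10)]
    ax.set_ylim(lowest, highest)
    ax.yaxis.set_major_locator(MultipleLocator(1.0))
    ax.yaxis.set_minor_locator(FixedLocator(minor))
    # Label each major tick x as 10^x in MathText
    ax.yaxis.set_major_formatter(
        FuncFormatter(lambda x, pos: r"$10^{%d}$" % int(round(x))))
    return lowest, highest


ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977,
         -16.27560806274414, -19.482173919677734, -19.502029418945312,
         -19.502029418945312, -19.502029418945312]

fig, ax = plt.subplots(figsize=(6, 6))
ax.set_xscale("log")
ax.scatter(ranks, freqs)
bounds = mimic_log_yaxis(ax, freqs)
ax.grid(True, which="both")
```

Deriving the bounds with floor and ceil keeps the minor tick list in step with the data if the frequencies change, at the cost of snapping the y-limits to whole decades.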

