erip - 4 months ago 36

Python Question

I am trying to display a Zipf plot, which is typically displayed on a log-log scale.

I'm using a library which gives rank in linear scale and frequencies in log scale. I have the following code which plots my data fairly correctly:

```
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
         -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
data = {
    'ranks': ranks,
    'freqs': freqs,
}
df = pd.DataFrame(data=data)
_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.show()
```

The resulting plot is:

The plot looks good, but the y-axis labels are weird. I'd like them to be displayed in log increments as well. My current workaround is to raise 10 to the power of each element in the `freqs` list:

```
freqs = [10**freq for freq in freqs]
# ...
```

and change the `yscale` in `ax.set` to log:

```
_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
ax.set_title("Zipf plot")
# ...
```

This gives me the expected plot (below), but it requires a transform of the data which is a) relatively expensive, b) redundant, c) lossy.

Is there a way to mimic the log scale of the axes in a matplotlib plot without transforming the data?

Answer

*First a comment: Personally, I would prefer the method of rescaling the data, since it makes everything much easier at the expense of some more memory/CPU time, and accuracy should not matter.*
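As a rough illustration of that rescaling route, here is a minimal sketch that converts the question's values back to linear scale and measures how much precision the round trip loses. It assumes the frequencies are base-10 logarithms, as the question's `10**freq` workaround implies; the variable names are only illustrative.

```python
import math

# Log-scale frequencies from the question (assumed to be base-10 logs).
log_freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977,
             -16.27560806274414, -19.482173919677734, -19.502029418945312,
             -19.502029418945312, -19.502029418945312]

# Rescale to linear values so that a real log axis can be used.
linear_freqs = [10 ** f for f in log_freqs]

# Round trip: take the log again and compare with the original values.
round_trip = [math.log10(v) for v in linear_freqs]
max_error = max(abs(a - b) for a, b in zip(log_freqs, round_trip))

print("largest round-trip error:", max_error)
```

The values stay far inside the range of double-precision floats, so the error printed here should be many orders of magnitude below anything visible in a plot.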

Now to the question, which is actually **how to mimic a log scale on a linear axis**.

This is not easy. Setting the axes to log scale changes a lot in the background and one needs to mimic all of that.
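To see what has to be mimicked, one can inspect the tick machinery that matplotlib installs on a real log axis. This is only a small inspection sketch; the `Agg` backend is chosen here so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import LogFormatter, LogLocator

fig, ax = plt.subplots()
ax.set_yscale("log")

# Switching the scale swaps in log-aware locators and formatters.
major_locator = ax.yaxis.get_major_locator()
minor_locator = ax.yaxis.get_minor_locator()
major_formatter = ax.yaxis.get_major_formatter()

for name, obj in [("major locator", major_locator),
                  ("minor locator", minor_locator),
                  ("major formatter", major_formatter)]:
    print(name, "->", type(obj).__name__)

uses_log_locators = (isinstance(major_locator, LogLocator)
                     and isinstance(minor_locator, LogLocator))
uses_log_formatter = isinstance(major_formatter, LogFormatter)
```

A linear axis carrying log-valued data needs a stand-in for each of these three pieces: major tick positions, minor tick positions, and tick label text.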

- The easy part is to set the major tick spacing to 1 by using `matplotlib.ticker.MultipleLocator()`.
- Creating the minor tickmarks at positions which look logarithmic is harder. The best solution I could come up with is to set them manually using `matplotlib.ticker.FixedLocator()`.
- Last, we need to change the tick labels to represent the actual numbers, meaning that they should look like 10^(-x) instead of -x. I am aware of two options here:
  - Using a `FuncFormatter` that sets the values 10**x in scientific format.
  - Using a `FuncFormatter` that sets the values 10^x in LaTeX format. This looks much nicer but contrasts with the rest of the plot.

I do not know any better solution for that last point, but maybe someone else does.

Here is the code and how it looks.

```
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from matplotlib.ticker import MultipleLocator, FixedLocator, FuncFormatter
###### Locators for Y-axis
# set tickmarks at multiples of 1.
majorLocator = MultipleLocator(1.)
# create custom minor tick positions where a log axis would draw them:
# a tick at i * 10**(-n) sits at log10(i) - n on the linear axis
ra = np.array([np.log10(i) - n for n in range(11, 21) for i in range(2, 10)])
minorLocator = FixedLocator(ra)
###### Formatter for Y-axis (choose one of the following two; the second assignment overrides the first)
# show labels as powers of 10 in scientific notation (looks ugly)
majorFormatter = FuncFormatter(lambda x, p: "{:.1e}".format(10**x))
# or using MathText (looks nice, but does not match the rest of the layout)
majorFormatter = FuncFormatter(lambda x, p: r"$10^{" + "{x:d}".format(x=int(round(x))) + r"}$")
ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414,
-19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
data = {
'ranks': ranks,
'freqs': freqs,
}
df = pd.DataFrame(data=data)
_, ax = plt.subplots(figsize=(6, 6))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)
# Set the locators
ax.yaxis.set_major_locator(majorLocator)
ax.yaxis.set_minor_locator(minorLocator)
# Set formatter if you like to have the ticklabels consistently in power notation
ax.yaxis.set_major_formatter(majorFormatter)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.savefig(__file__+".png")
plt.show()
```
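The minor tick positions do not have to be hard-coded for one range of decades. The following is a minimal sketch of a helper that computes them for any range of exponents; `log_minor_tick_positions` is a hypothetical name introduced here, not part of matplotlib.

```python
import numpy as np


def log_minor_tick_positions(low_exp, high_exp, base=10):
    """Return the positions, in log units, of the minor ticks a log axis
    would draw between base**low_exp and base**high_exp."""
    mantissas = np.arange(2, base)          # 2, 3, ..., 9 for base 10
    decades = np.arange(low_exp, high_exp)  # lower edge of each decade
    # A tick at m * base**k sits at k + log_base(m) on the linear axis.
    offsets = np.log(mantissas) / np.log(base)
    return (decades[:, None] + offsets[None, :]).ravel()


# Decades spanned by the question's data: 10**-20 up to 10**-10.
ticks = log_minor_tick_positions(-20, -10)
print(len(ticks), ticks[:3])
```

The resulting array can be passed to `FixedLocator` in the same way as `ra` in the code above, for example `ax.yaxis.set_minor_locator(FixedLocator(ticks))`.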