kmkmkmkmkm kmkmkmkmkm - 11 days ago 6
Python Question

Explanation of the scatter method in Python (for machine learning)

I'm brand new to Python and machine learning, and as part of my course at university we're using NumPy, matplotlib, and scikit-learn. I have a question: the code below works perfectly fine, but I don't really understand what's happening. Take this one:

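%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# assumed: iris is scikit-learn's bundled iris dataset
iris = load_iris()
X = iris.data
Y = iris.target
# the first two features are sepal length and sepal width
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')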


I tried to check the documentation, but it didn't really make any sense to me.
Here I would like to know what the arguments in plt.scatter() mean: what does c=Y mean, what is cmap, and why are there two X coordinates?

As for this next code:

%matplotlib inline

#here's also how to plot in 3d:
from mpl_toolkits.mplot3d import Axes3D  # makes the '3d' projection available

#create a new figure
fig = plt.figure(figsize=(5,5))


#this creates a 1x1 grid (just one figure), and now we are plotting
#subfigure 1 (this is what 111 means)
ax = fig.add_subplot(111, projection='3d')

#plot the first three features in a 3D plot. Using : means that we take all
#elements in the corresponding array dimension
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=Y)


What I want to know here is:

fig.add_subplot(111, projection = '3d'). This third 1 doesn't really make sense to me. I understand the 1x1 grid, but I don't understand "and now we are plotting subfigure 1".

Also:

The ax.scatter() arguments make no sense here either. Why is the format different than the one above? Why are there three X's and no cmap? I don't really understand it. Why do they not use plt.scatter()?

Answer

Let's go through your questions step by step.

First, "why are there two X coordinates?": The scatter function (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) takes the x and y coordinates of the markers as two separate arguments. In this case both are stored in a single 2D array called X, with one row per sample and one column per feature: X[:, 0] selects every row of column 0 and serves as the x coordinates, and X[:, 1] does the same for column 1 and serves as the y coordinates.

"What does c=Y mean?": The documentation lists several options for colouring the markers:

c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.

So in this case, the relevant part is "N numbers to be mapped to colors using the cmap": you provide one data value per marker (here Y, i.e. iris.target), and scatter colours each marker by looking its value up in the colour map given by cmap (http://matplotlib.org/examples/color/colormaps_reference.html).
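
To make the two arguments concrete, here is a minimal sketch of what X[:, 0], X[:, 1] and c=Y with a cmap amount to. The four rows of X and the labels in Y are made-up stand-ins for the iris data, and the explicit Normalize step is my reading of the default behaviour when no norm, vmin or vmax is passed; it is not a copy of matplotlib's internals.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Made-up stand-in for iris.data[:, :2] and iris.target (illustration only)
X = np.array([[5.1, 3.5],
              [7.0, 3.2],
              [6.3, 3.3],
              [4.9, 3.0]])
Y = np.array([0, 1, 2, 0])

# ":" keeps every row, the second index picks one column
x_coords = X[:, 0]  # column 0 of every row
y_coords = X[:, 1]  # column 1 of every row

# Roughly what c=Y, cmap=plt.cm.Paired does: rescale the values to 0..1,
# then look each one up in the colour map to get one RGBA colour per marker
norm = Normalize(vmin=Y.min(), vmax=Y.max())
rgba = plt.cm.Paired(norm(Y))

for label, colour in zip(Y, rgba):
    print(label, colour)

# The same thing in one call, as in the question
fig, ax = plt.subplots()
ax.scatter(x_coords, y_coords, c=Y, cmap=plt.cm.Paired)
```

Samples with the same value in Y end up with the same colour, which is why the classes show up as separate colour groups in the plot.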

"fig.add_subplot(111, projection = '3d'). This third 1 doesn't really make sense to me.": The add_subplot method places subplots on a grid, and the three-digit number is read one digit at a time. To take a clearer example than 111, 322 means a grid with 3 rows and 2 columns, of which you are using the second position as the current subplot. Positions are numbered row by row starting from 1, so position 2 is row 0, column 1 when counting from zero. So add_subplot(111) simply means a figure with 1 row and 1 column of subplots, and you are using (what a surprise...) the first and only position.

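The digit-by-digit reading can be sketched in a few lines. decode_subplot_code is a hypothetical helper written for this explanation; it is not part of matplotlib and only mirrors the numbering convention described above.

```python
def decode_subplot_code(code):
    """Split a three-digit subplot code such as 322 into its parts.

    Hypothetical helper for illustration only; it is not part of matplotlib.
    Returns (nrows, ncols, position, row, col), where position is counted
    from 1 and row/col are counted from 0.
    """
    nrows, ncols, position = (int(digit) for digit in str(code))
    if not 1 <= position <= nrows * ncols:
        raise ValueError("position must lie between 1 and nrows * ncols")
    # Positions run left to right, top to bottom
    row, col = divmod(position - 1, ncols)
    return nrows, ncols, position, row, col


for code in (111, 322, 325):
    nrows, ncols, position, row, col = decode_subplot_code(code)
    print(f"{code}: {nrows}x{ncols} grid, position {position} -> row {row}, column {col}")
```

In matplotlib itself the same request can also be written with three separate arguments, e.g. fig.add_subplot(3, 2, 2), which some people find easier to read.
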
"Why are there three X's and no cmap?": You first imported mpl_toolkits.mplot3d and next specified projection='3d' to add_subplot, so you are now creating a 3D scatter plot, which requires the specification of x, y and z coordinates of the markers.

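For comparison with the 2D call, here is a small self-contained sketch of the 3D version. The four rows of X and the labels in Y are made-up stand-ins for the first three iris features and iris.target, and the axis labels assume the usual iris column order.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # makes the '3d' projection available

# Made-up stand-in for iris.data[:, :3] and iris.target (illustration only)
X = np.array([[5.1, 3.5, 1.4],
              [7.0, 3.2, 4.7],
              [6.3, 3.3, 6.0],
              [4.9, 3.0, 1.4]])
Y = np.array([0, 1, 2, 0])

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111, projection='3d')  # 1x1 grid, position 1, 3D axes

# Three coordinate arrays this time: x, y and z, one column of X each
points = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=Y)

ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_zlabel('Petal length')
```

No cmap is passed here, so the values in Y are coloured with matplotlib's default colour map (the one named in rcParams['image.cmap']); passing cmap=plt.cm.Paired to ax.scatter as well would reuse the colours of the 2D plot. scatter is called on ax because ax is the 3D axes object returned by add_subplot, and its scatter method is the one that takes a third coordinate.
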