 user3507584 -4 years ago 119
R Question

# Get data observations used by regression in R (plm)

I am estimating a panel model with the package `plm`.
Some of the individuals in the panel do not have data for all the explanatory variables, so they are excluded from the regression.
How could I see which particular observations have been used for the estimation?

In Stata the usual command is `e(sample)`. What is the equivalent in R?

eipi10

The `plm` function returns a list, and one of its elements, named `model`, holds the data that was actually used to fit the model. Here's an example based on the help for `plm`:

``````
library(plm)

data("Produc")
``````

Let's set the first 20 values of `Produc$pcap` to `NA` (missing data):

``````
Produc$pcap[1:20] <- NA
``````

Now we'll create a `plm` model using `Produc`:

``````
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))
``````

`zz` is a list containing the information returned by the `plm` function. You can run `str(zz)` to see what `zz` contains. The data used for the model is stored in `zz$model`. You can see from the row names, which start at 21, that the first 20 rows are missing, because those are the ones in which we set `Produc$pcap` to `NA`.

``````
head(zz$model)  # You can also do: head(zz[["model"]])
``````
``````
   log(gsp) log(pcap)  log(pc) log(emp) unemp
21 10.13634  9.358610 10.21481 6.571583   4.1
22 10.15417  9.403360 10.26915 6.614726   5.6
23 10.12323  9.467233 10.31703 6.591811  12.0
24 10.16743  9.518111 10.28821 6.631606   9.8
25 10.24388  9.559265 10.31137 6.696170   8.2
26 10.34374  9.603196 10.34623 6.797271   6.1
``````
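
To check how many observations were dropped, one option is to compare row counts and look at the panel index of the fitted model. The following is a minimal sketch that rebuilds the example above; it assumes the `nobs()` and `index()` methods that `plm` provides for fitted models, and `used_idx` is only an illustrative name.

```r
library(plm)

data("Produc")
Produc$pcap[1:20] <- NA  # same missing values as in the example above

zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))

nrow(Produc)    # rows in the original data
nrow(zz$model)  # rows kept in the model frame
nobs(zz)        # observations used in the estimation

# state/year pairs of the observations that entered the model
used_idx <- index(zz)
head(used_idx)
```

Looking at the state/year pairs can be easier to read than row names when the original data frame has no meaningful row names.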

If you want to select the rows of your data frame that were used in the model, you can use the row names of `zz$model` as the indices for subsetting:

``````
Produc[rownames(zz$model), ]
``````
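
If you would rather have a logical indicator, loosely in the spirit of Stata's `e(sample)`, the same row names can be turned into a flag column. This is a sketch under the assumption that the row names of `zz$model` are carried over from `Produc`, as they are in the output shown earlier; `in_sample` is a hypothetical column name chosen for illustration.

```r
library(plm)

data("Produc")
Produc$pcap[1:20] <- NA  # same missing values as in the example above

zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))

# TRUE for rows that entered the estimation, FALSE for rows that were dropped
# (in_sample is a hypothetical name, not something plm creates)
Produc$in_sample <- rownames(Produc) %in% rownames(zz$model)

table(Produc$in_sample)

# inspect the dropped observations
head(Produc[!Produc$in_sample, c("state", "year", "pcap")])
```

Keeping the flag in the original data frame makes it straightforward to compare used and dropped observations afterwards.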

Alternatively, `Produc[complete.cases(Produc), ]` returns only those rows of the data frame that have no missing data. Note, though, that if your data frame has columns with missing data that were not used in the model formula, this approach will, in general, exclude some rows that were nevertheless used in the model. The exception is when missing data in the unused columns always occurs in rows that also have missing data in columns used in the model.
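
One way around the caveat above is to restrict `complete.cases` to the variables that appear in the model formula. The sketch below is illustrative only: the extra `NA` in `hwy` (a column of `Produc` that the model does not use) is added here purely to show the difference, and `fm`, `model_vars`, `all_complete` and `model_complete` are hypothetical names. It also does not catch missing values that a transformation such as `log()` might introduce.

```r
library(plm)  # only needed here for the Produc data set

data("Produc")
Produc$pcap[1:20] <- NA  # missing values in a model variable
Produc$hwy[30] <- NA     # assumption for illustration: NA in an unused column

# variables that appear in the model formula
fm <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp
model_vars <- all.vars(fm)

# complete cases over all columns vs. over the model variables only
all_complete   <- Produc[complete.cases(Produc), ]
model_complete <- Produc[complete.cases(Produc[, model_vars]), ]

nrow(all_complete)    # also drops row 30
nrow(model_complete)  # keeps row 30
```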
