thiagoveloso thiagoveloso - 3 months ago 24
R Question

R - Conditional math expression in data table

I am working with a FAO database of agricultural crop production by country. Specifically, I have a data table showing rice production by year from every country in the world starting in 1961 until 2014. Here is a simplified version of my data:

d <- data.table(structure(list(Year = c(1961, 1962, 1963, 1964, 1965, 1966, 1967,
1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978,
1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
2002, 2003, 2004, 2005, 2006), Country = c("Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania",
"Albania"), totalprod = c(319000, 319000, 319000, 380000, 380000,
337000, 396000, 402000, 407000, 366000, 350000, 4e+05, 420000,
420000, 435000, 448000, 4e+05, 428000, 439000, 415000, 390000,
364000, 350000, 334000, 317000, 336000, 324000, 343000, 320000,
333000, 335000, 3e+05, 3e+05, 342000, 390000, 340000, 4e+05,
450000, 280000, 260000, 242000, 388000, 434000, 463000, 485000,
540000, 552000, 612000, 645000, 672000, 672000, 5e+05, 512094,
537000, 4603, 5683, 9135, 8173, 10225, 10524, 11254, 12807, 14276,
14924, 10760, 12000, 15168, 12000, 13500, 14000, 14400, 14800,
15520, 13000, 13900, 11900, 13000, 12600, 12000, 11000, 10600,
8830, 8450, 7000, 2283, 960, 585, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0)), .Names = c("Year", "Country", "totalprod"), class = "data.frame", row.names = c(NA,
-100L)))


I need to summarize this data table in two different ways:

a) First I need to calculate the global rice production per year, by aggregating each country's production per year. I managed to answer this by doing:

d.global.year <- d[, list(totalprodyear=sum(totalprod)), by=Year]


b) Once I have the global rice production per year, I need to determine each country's contribution per year. This would be done by dividing each country's production by the global rice production, on a yearly basis.

However, I am still trying to figure out a way to solve b).

EDITED:
Expected output:

Let's take for example the global rice consumption for 1961: 323603

In that case, Afghanistan's contribution would be 319000/323603 = 0.986 and Albania's would be 4603/323603 = 0.014

Any tips?

Answer

You can do it in a two-stage groupby process, firstly group by Year and calculate the total product for each year, and then group by both Year and Country where you can use the total product calculated in the previous stage to calculate the proportion of each country's contribution:

sumDt <- d[, totalprodyear :=sum(totalprod), by=Year]
          [, .(totalprodyear, percentprod = sum(totalprod)/totalprodyear), by = .(Year, Country)]
sumDt[, head(.SD, 3), by = Country]
#        Country Year totalprodyear percentprod
# 1: Afghanistan 1961        323603  0.98577578
# 2: Afghanistan 1962        324683  0.98249677
# 3: Afghanistan 1963        328135  0.97216085
# 4:     Albania 1961        323603  0.01422422
# 5:     Albania 1962        324683  0.01750323
# 6:     Albania 1963        328135  0.02783915