Nemo Nemo - 29 days ago 10
R Question

Top Athlete from a statistical point of view

Let's say you are a junior track and field athlete specializing in 100m.
I have the rankings of 400 junior players for each year from 2006 to 2016 (each year is a separate CSV file, i.e. one table per year).

I also have the rankings of senior players for each year from 2006 to 2016 (again, each year is a separate CSV file).

The question I want to answer: is there a correlation between being a good junior athlete and your chances of being a world star?

So how should I approach this problem? I have decent R skills, so I just need to be pointed in the right direction.

Answer

is there a correlation between being a good junior athlete and your chances of being a world star?

Is being a world star equal to appearing in the second group of CSVs?

Is being in the first group of CSVs proof of being a good junior athlete?

Will you assume that each name is unique and that names don't change over time?

You might want to build a 2x2 table similar to the one used in McNemar's test.

            Name in top athletes
                yes  |  no
              +------+-------  
  top    yes  |  150 |  250  
junior   no   |  250 |  550  

Right now, I see no reason not to compute an odds ratio from that table to answer your question.
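
As a minimal sketch of that computation, the counts from the table above can be put in a matrix and the odds ratio computed by hand. The numbers are the illustrative ones from the table, not real data, and the object names (tab, odds_ratio, ft) are made up for this example. It also assumes you can actually count the no/no cell, which requires a defined population of athletes beyond those who appear in either ranking. fisher.test from base R is used here only as one convenient way to get a confidence interval; the estimate it reports is a conditional maximum likelihood estimate, so it will be close to but not identical to the hand-computed value.

```r
# Illustrative counts copied from the table above (not real data).
tab <- matrix(c(150, 250,
                250, 550),
              nrow = 2, byrow = TRUE,
              dimnames = list(top_junior  = c("yes", "no"),
                              top_athlete = c("yes", "no")))

# Sample odds ratio: (yes/yes * no/no) / (yes/no * no/yes)
odds_ratio <- (tab["yes", "yes"] * tab["no", "no"]) /
              (tab["yes", "no"]  * tab["no", "yes"])
odds_ratio  # 1.32 for these illustrative counts

# Fisher's exact test gives a conditional MLE of the odds ratio
# together with a confidence interval and a p-value.
ft <- fisher.test(tab)
ft$estimate
ft$conf.int
```

An odds ratio above 1 would suggest that top juniors are more likely than other juniors to show up among the top athletes; whether the interval excludes 1 tells you how much weight to put on that.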

All you need to do is rbind all the junior CSVs and apply unique to the names, do the same with the top-athlete CSVs, and then merge the two as an inner join to find the overlapping names. Joins can be done using merge.
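
The steps above could look roughly like this. It is only a sketch under assumptions: the file naming (junior_YYYY.csv, senior_YYYY.csv), the column name "name", the toy athletes, and the helper read_names are all hypothetical, and the files are written to a temporary directory so the example runs on its own. With your data you would point data_dir at the folder holding your CSVs and skip the write.csv lines.

```r
# Toy data in a temp folder so the example is self-contained.
# File names and the "name" column are assumptions, not your real layout.
data_dir <- tempfile("rankings")
dir.create(data_dir)
write.csv(data.frame(name = c("Ann", "Bob", "Cy")),
          file.path(data_dir, "junior_2006.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Bob", "Dee")),
          file.path(data_dir, "junior_2007.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Bob", "Eve")),
          file.path(data_dir, "senior_2015.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Ann", "Eve")),
          file.path(data_dir, "senior_2016.csv"), row.names = FALSE)

# Hypothetical helper: read every matching CSV, stack the years with
# rbind, and keep one row per distinct name.
read_names <- function(pattern) {
  files <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  all_years <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
  unique(all_years["name"])
}

juniors <- read_names("^junior_.*\\.csv$")
seniors <- read_names("^senior_.*\\.csv$")

# merge() performs an inner join by default: only names present in
# both groups survive.
both <- merge(juniors, seniors, by = "name")
both$name  # "Ann" "Bob" for the toy data

# Three of the four cells of the 2x2 table follow directly.
n_both        <- nrow(both)
n_junior_only <- nrow(juniors) - n_both
n_senior_only <- nrow(seniors) - n_both
```

Note that this matches on the raw name string, so it relies on each name being unique and spelled identically across years, as asked above.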