Using R to examine the world population datasheet

These data are from the 2008 Population Reference Bureau World Population Data Sheet

library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv")
prbdata <- read.csv(text = x)

Let’s have a look at some descriptive information about the data:

#Frequency table of the number of countries by continent
table(prbdata$Continent)
## 
##        Africa          Asia        Europe North America       Oceania 
##            55            51            45            27            17 
## South America 
##            13
#basic summary statistics for the variable TFR or the total fertility rate
summary(prbdata$TFR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.775   2.350   2.875   3.800   7.600

We see the mean is 2.8754808. The summary lists no NA's, so no case is missing a TFR value.

More Descriptive statistics

#just want a mean
mean(prbdata$TFR, na.rm=T)
## [1] 2.875481
#what does R give if there's a missing case and I don't specify na.rm=T?
mean(prbdata$TFR)
## [1] 2.875481

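NA means that a value is missing, and R uses NA for all kinds of missing data. When a variable contains an NA, mean() returns NA unless na.rm=T is specified, because a missing case contributes no information to the estimate of the mean. Here both calls return the same value, so TFR has no missing cases in this file.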
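To see what a missing value does, here is a minimal sketch using a made-up vector. The name `tfr_toy` and its values are hypothetical and are not taken from the PRB file.

```r
# toy vector with one missing value (illustrative only, not PRB data)
tfr_toy <- c(1.5, 2.1, NA, 4.8)

# without na.rm, the NA propagates and the result is NA
mean_with_na <- mean(tfr_toy)

# with na.rm = TRUE, the NA is dropped before averaging
mean_no_na <- mean(tfr_toy, na.rm = TRUE)

# count how many cases are missing
n_missing <- sum(is.na(tfr_toy))

print(mean_with_na)
print(mean_no_na)
print(n_missing)
```

Checking `sum(is.na(x))` before computing statistics is a quick way to learn whether na.rm=T will change the answer.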

#standard deviation
sd(prbdata$TFR, na.rm=T)
## [1] 1.481488
#Quantiles
quantile(prbdata$TFR, na.rm=T)
##    0%   25%   50%   75%  100% 
## 1.200 1.775 2.350 3.800 7.600

The median is 2.35
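The median is simply the 50% quantile. A small sketch with made-up numbers (the vector `x_toy` is hypothetical) shows that the two functions agree:

```r
# toy values (illustrative only, not PRB data)
x_toy <- c(1.2, 1.8, 2.3, 3.9, 7.6)

# quartiles from quantile(), using R's default method
q_toy <- quantile(x_toy, probs = c(0.25, 0.5, 0.75))

# the median computed directly
med_toy <- median(x_toy)

print(q_toy)
print(med_toy)
```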

require(lattice)
## Loading required package: lattice
#histogram of the total fertility rate
hist(prbdata$TFR, main="Histogram of Total Fertility Rate")

#Box plot for TFR* Continent
bwplot(TFR~Continent, prbdata,main="Boxplot of Total Fertility Rate by continent")

#scatter plot of TFR * IMR, the infant mortality rate
xyplot(TFR~IMR, data=prbdata, main="Bivariate Association between TFR and IMR")

T-tests

If our outcome is continuous and approximately normal, then we can use a t-test to compare the mean of two groups.

#t-test for Africa vs Rest of the world
#Using the I() function to put the comparison Continent=="Africa" in the formula; it evaluates to T/F, i.e. 2 groups, Africa or Not Africa
t.test(TFR~I(Continent=="Africa"), prbdata, var.equal=F)
## 
##  Welch Two Sample t-test
## 
## data:  TFR by I(Continent == "Africa")
## t = -11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.769268 -1.954226
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2.250980            4.612727
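To make the Welch output less of a black box, the sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `stats::t.test()`. The two vectors are made-up values, not the PRB data, and the object names are hypothetical.

```r
# two toy groups (illustrative only, not PRB data)
grp_a <- c(2.1, 1.8, 2.5, 1.6, 2.9)
grp_b <- c(4.9, 5.3, 3.8, 6.1, 4.4)

# each group's contribution to the squared standard error
va <- var(grp_a) / length(grp_a)
vb <- var(grp_b) / length(grp_b)

# Welch t statistic: difference in means over the unpooled standard error
t_by_hand <- (mean(grp_a) - mean(grp_b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- (va + vb)^2 /
  (va^2 / (length(grp_a) - 1) + vb^2 / (length(grp_b) - 1))

# the same test from stats::t.test with unequal variances
fit <- stats::t.test(grp_a, grp_b, var.equal = FALSE)

print(c(t = t_by_hand, df = df_by_hand))
print(fit)
```

The sign of t depends on the order of the groups. In the output above, the FALSE group (not Africa) comes first, so a negative t means the mean TFR is lower outside Africa.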

Variable summaries

Here is a summary of a factor variable, which gives the number of observations in each level of the factor.

summary(prbdata$Continent)
##        Africa          Asia        Europe North America       Oceania 
##            55            51            45            27            17 
## South America 
##            13
#summary of a numeric variable, gives you the numeric summary
summary(prbdata$Population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    1.10    6.60   34.26   22.65 1357.40

Histograms are useful for showing the “shape” of the data

library(ggplot2)
#for a factor variable, the bar chart shows the count in each category
ggplot(prbdata, aes(x=Continent))+geom_bar()

#for a numeric variable, the histogram shows the count in each bin
ggplot(prbdata, aes(x=Population))+geom_histogram(bins = 20)
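By default, geom_bar() and geom_histogram() plot counts. If shares are wanted instead, one option is to compute them with base R before plotting. This is a minimal sketch using a made-up vector (`continent_toy` is hypothetical):

```r
# toy categorical variable (illustrative only, not PRB data)
continent_toy <- c("Africa", "Africa", "Asia", "Europe")

# counts per category
counts_toy <- table(continent_toy)

# proportions per category; these sum to 1
shares_toy <- prop.table(counts_toy)

print(counts_toy)
print(round(100 * shares_toy, 1))  # as percentages
```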

There are many descriptive statistics. First we will calculate measures of central tendency, and then measures of variability (scale).

mean(prbdata$Population)
## [1] 34.25937
median(prbdata$Population)
## [1] 6.6
#These two numbers are far from the same, why?

#There are two main measures of variability, the variance and the standard deviation
var(prbdata$Population)
## [1] 17831.45
sd(prbdata$Population)
## [1] 133.5345
#or
sqrt(var(prbdata$Population))
## [1] 133.5345
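One likely reason for the gap between the mean and the median is skew: a few very large values pull the mean upward but barely move the median. The sketch below shows this with a made-up vector (`pop_toy` is hypothetical, not PRB data), and also confirms that the standard deviation is the square root of the variance.

```r
# toy right-skewed values: mostly small, one very large (illustrative only)
pop_toy <- c(0.5, 1, 2, 5, 10, 1300)

mean_toy   <- mean(pop_toy)    # pulled up by the single large value
median_toy <- median(pop_toy)  # middle of the ordered values

# the standard deviation is the square root of the variance
sd_toy  <- sd(pop_toy)
var_toy <- var(pop_toy)

print(c(mean = mean_toy, median = median_toy))
print(c(sd = sd_toy, sqrt_var = sqrt(var_toy)))
```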

Group statistics

Here we show how to compute statistics by a grouping variable, using the formula interface that the mosaic package adds to functions such as mean() and sd().

library(mosaic)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: mosaicData
## Loading required package: Matrix
## 
## The 'mosaic' package masks several functions from core packages in order to add additional features.  
## The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
mean(e0Total~Continent, data=prbdata, na.rm=T)
##        Africa          Asia        Europe North America       Oceania 
##      59.60000      72.60784      78.00000      75.00000      71.05882 
## South America 
##      73.69231
sd(e0Total~Continent, data=prbdata, na.rm=T)
##        Africa          Asia        Europe North America       Oceania 
##      8.607921      5.751794      4.035556      3.961352      6.259698 
## South America 
##      3.923957
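The same kind of grouped summary can be sketched in base R with tapply(), without loading mosaic. The data frame `toy` below is made up for illustration; only the column names mirror those in prbdata.

```r
# toy data frame (illustrative only, not PRB data)
toy <- data.frame(
  Continent = c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe"),
  e0Total   = c(58, 62, 70, NA, 78, 80)
)

# mean of e0Total within each continent, dropping missing values
grp_mean <- tapply(toy$e0Total, toy$Continent, mean, na.rm = TRUE)

# standard deviation within each continent
# (Asia has one non-missing value, so its sd is NA)
grp_sd <- tapply(toy$e0Total, toy$Continent, sd, na.rm = TRUE)

print(grp_mean)
print(grp_sd)
```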

Now we plot some variables by a group variable

#Box and Whisker plots
ggplot(prbdata, aes(x=Continent,y=e0Total))+geom_boxplot()+ggtitle(label = "Variation in Life Expectancy by Continent",subtitle = "2008 PRB Datasheet")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

ggplot(prbdata, aes(x=Continent, y=TFR))+geom_boxplot()+ggtitle(label = "Variation in Total Fertility Rate by Continent",subtitle = "2008 PRB Datasheet")

Scatter plots

ggplot(prbdata, aes(x=TFR, y=e0Total))+geom_point(aes(x=TFR, y=e0Total, colour=Continent))
## Warning: Removed 2 rows containing missing values (geom_point).

#Scatter plots with group labels
xyplot(e0Total~TFR, groups=Continent,auto.key=T, data=prbdata)

#separated by a group variable
ggplot(prbdata, aes(x=TFR, y=e0Total))+geom_point()+facet_grid(.~Continent)
## Warning: Removed 2 rows containing missing values (geom_point).