Using R to examine the world population datasheet
Corey S. Sparks
2017-07-13
These data are from the 2008 Population Reference Bureau World Population Data Sheet
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv")
prbdata <- read.csv(text = x)
Let’s have a look at some descriptive information about the data:
#Frequency table of the number of countries by continent
table(prbdata$Continent)
##
## Africa Asia Europe North America Oceania
## 55 51 45 27 17
## South America
## 13
#basic summary statistics for the variable TFR or the total fertility rate
summary(prbdata$TFR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.775 2.350 2.875 3.800 7.600
We see the mean is 2.8754808. The summary does not report any NA's, so no case is missing for TFR.
More Descriptive statistics
#just want a mean
mean(prbdata$TFR, na.rm=T)
## [1] 2.875481
#what does R give if there's a missing case and I don't specify na.rm=T?
mean(prbdata$TFR)
## [1] 2.875481
NA means something is missing, and R uses NA for all kinds of missing data. If a case were missing, R would return NA for the mean unless na.rm=T is specified, because a missing case contributes no information to the estimation of the mean. Here both calls return the same value, so TFR has no missing cases.
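To see the NA behavior directly, here is a minimal sketch on a small made-up vector (tfr_toy is invented for illustration and is not part of prbdata):

```r
# Toy values, invented for illustration; one case is missing
tfr_toy <- c(1.2, 2.4, NA, 3.8)

# Without na.rm, a single NA makes the whole mean NA
mean_with_na <- mean(tfr_toy)
print(mean_with_na)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(tfr_toy, na.rm = TRUE)
print(mean_without_na)

# Count how many cases are missing
n_missing <- sum(is.na(tfr_toy))
print(n_missing)
```

Counting the missing cases with sum(is.na()) first is a quick way to check whether na.rm will change anything.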
#standard deviation
sd(prbdata$TFR, na.rm=T)
## [1] 1.481488
#Quantiles
quantile(prbdata$TFR, na.rm=T)
## 0% 25% 50% 75% 100%
## 1.200 1.775 2.350 3.800 7.600
The median is 2.35
require(lattice)
## Loading required package: lattice
#histogram of the total fertility rate
hist(prbdata$TFR, main="Histogram of Total Fertility Rate")
#Box plot for TFR* Continent
bwplot(TFR~Continent, prbdata,main="Boxplot of Total Fertility Rate by continent")
#scatter plot of TFR * IMR, the infant mortality rate
xyplot(TFR~IMR, data=prbdata, main="Bivariate Association between TFR and IMR")
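If a single number for the strength of a bivariate association is wanted alongside the scatter plot, a Pearson correlation is one option. This is only a sketch on invented values (tfr_toy and imr_toy are hypothetical, not the PRB data), and it calls stats::cor explicitly so that it behaves the same whether or not other packages mask cor:

```r
# Toy values, invented for illustration; one IMR value is missing
tfr_toy <- c(1.4, 1.9, 2.5, 3.8, 5.2, 6.1)
imr_toy <- c(3, 6, NA, 35, 60, 72)

# use = "complete.obs" drops any pair with a missing value
r_toy <- stats::cor(tfr_toy, imr_toy, use = "complete.obs")
print(r_toy)
```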
T-tests
If our outcome is continuous and approximately normal, then we can use a t-test to compare the means of two groups.
#t-test for Africa vs Rest of the world
#Using the I() function, which generates a T/F value, i.e. 2 groups, Africa or Not Africa
t.test(TFR~I(Continent=="Africa"), prbdata, var.equal=F)
##
## Welch Two Sample t-test
##
## data: TFR by I(Continent == "Africa")
## t = -11.559, df = 69.813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.769268 -1.954226
## sample estimates:
## mean in group FALSE mean in group TRUE
## 2.250980 4.612727
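To show where the t statistic and the fractional degrees of freedom in a Welch test come from, here is a sketch that computes them by hand and compares them with t.test. The two vectors are invented for illustration (they are not the PRB values), and the order "other minus africa" is chosen to mirror the FALSE-then-TRUE ordering in the output above:

```r
# Toy values, invented for illustration only (not taken from prbdata)
africa <- c(4.1, 5.0, 4.7, 3.9, 5.6)
other  <- c(2.0, 1.8, 2.6, 2.3, 1.5, 2.9)

# Variance of each group mean
v_africa <- var(africa) / length(africa)
v_other  <- var(other) / length(other)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(other) - mean(africa)) / sqrt(v_africa + v_other)

# Welch-Satterthwaite degrees of freedom (usually not a whole number)
df_manual <- (v_africa + v_other)^2 /
  (v_africa^2 / (length(africa) - 1) + v_other^2 / (length(other) - 1))

# The same test via t.test, without assuming equal variances
fit <- stats::t.test(other, africa, var.equal = FALSE)

print(c(t_manual = t_manual, df_manual = df_manual))
print(fit$statistic)
print(fit$parameter)
```

The hand-computed values should agree with the statistic and parameter stored in the fitted test object.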
Variable summaries
Here is a summary of a factor variable, which gives you the number of observations in each level of the factor
summary(prbdata$Continent)
## Africa Asia Europe North America Oceania
## 55 51 45 27 17
## South America
## 13
#summary of a numeric variable, gives you the numeric summary
summary(prbdata$Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 1.10 6.60 34.26 22.65 1357.40
Histograms are useful for showing the “shape” of the data
library(ggplot2)
#for a factor variable, the bar chart shows the count in each category
ggplot(prbdata, aes(x=Continent))+geom_bar()
#for a numeric variable, the histogram shows the count in a given bin
ggplot(prbdata, aes(x=Population))+geom_histogram(bins = 20)
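By default geom_bar() and geom_histogram() put counts on the y axis. If percentages are wanted instead, one way is to compute them before plotting. The sketch below uses base R on invented values (continent_toy and pop_toy are hypothetical, not the PRB data):

```r
# Toy values, invented for illustration
continent_toy <- factor(c("Africa", "Africa", "Asia", "Europe"))
pop_toy <- c(0.5, 1, 2, 6, 20, 45)

# Counts per category, then the same table as percentages
counts <- table(continent_toy)
shares <- 100 * prop.table(counts)
print(counts)
print(shares)

# For a numeric variable: bin without drawing, then convert counts to percentages
h <- hist(pop_toy, plot = FALSE)
bin_pct <- 100 * h$counts / sum(h$counts)
print(bin_pct)
```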
There are many descriptive statistics. First we will calculate measures of central tendency, or location
mean(prbdata$Population)
## [1] 34.25937
median(prbdata$Population)
## [1] 6.6
#These two numbers are far from the same, why?
#There are two main measures of variability, variance and the standard deviation
var(prbdata$Population)
## [1] 17831.45
sd(prbdata$Population)
## [1] 133.5345
#or
sqrt(var(prbdata$Population))
## [1] 133.5345
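One likely reason the mean and median of Population sit so far apart is right skew: a few very large countries pull the mean up while the median barely moves. Here is a sketch on invented values (pop_toy is hypothetical, not the PRB data) that shows the effect, and also that the standard deviation is the square root of the variance:

```r
# Toy values, invented for illustration; one value is far larger than the rest
pop_toy <- c(0.5, 1, 2, 6, 20, 1350)

# The mean is pulled toward the large value; the median is not
toy_mean <- mean(pop_toy)
toy_median <- median(pop_toy)
print(c(mean = toy_mean, median = toy_median))

# Dropping the largest value moves the mean much more than the median
trimmed <- pop_toy[pop_toy < max(pop_toy)]
print(c(mean = mean(trimmed), median = median(trimmed)))

# sd() is the square root of var()
sd_matches <- isTRUE(all.equal(sd(pop_toy), sqrt(var(pop_toy))))
print(sd_matches)
```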
Group statistics
Here we show how to compute statistics by a grouping variable
library(mosaic)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: mosaicData
## Loading required package: Matrix
##
## The 'mosaic' package masks several functions from core packages in order to add additional features.
## The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
mean(e0Total~Continent, data=prbdata, na.rm=T)
## Africa Asia Europe North America Oceania
## 59.60000 72.60784 78.00000 75.00000 71.05882
## South America
## 73.69231
sd(e0Total~Continent, data=prbdata, na.rm=T)
## Africa Asia Europe North America Oceania
## 8.607921 5.751794 4.035556 3.961352 6.259698
## South America
## 3.923957
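The formula interface to mean() and sd() used above comes from the mosaic package. If mosaic is not available, the same kind of group statistics can be sketched in base R with tapply() or aggregate(). The data frame below is invented for illustration (toy is hypothetical, not the PRB data):

```r
# Toy data, invented for illustration; one life expectancy value is missing
toy <- data.frame(
  Continent = c("Africa", "Africa", "Asia", "Asia", "Asia", "Europe", "Europe"),
  e0Total   = c(58, 62, 70, 74, NA, 78, 80)
)

# Group means and standard deviations; na.rm is passed through to mean() and sd()
group_means <- tapply(toy$e0Total, toy$Continent, mean, na.rm = TRUE)
group_sds   <- tapply(toy$e0Total, toy$Continent, sd, na.rm = TRUE)
print(group_means)
print(group_sds)

# aggregate() returns a data frame; its formula method drops rows with NA by default
agg_means <- aggregate(e0Total ~ Continent, data = toy, FUN = mean)
print(agg_means)
```

tapply() returns a named array, which is handy for quick lookups, while aggregate() returns a data frame that is easier to pass on to plotting functions.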
Now we plot some variables by a group variable
#Box and Whisker plots
ggplot(prbdata, aes(x=Continent,y=e0Total))+geom_boxplot()+ggtitle(label = "Variation in Life Expectancy by Continent",subtitle = "2008 PRB Datasheet")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
ggplot(prbdata, aes(x=Continent, y=TFR))+geom_boxplot()+ggtitle(label = "Variation in Total Fertility Rate by Continent",subtitle = "2008 PRB Datasheet")
Scatter plots
ggplot(prbdata, aes(x=TFR, y=e0Total))+geom_point(aes(x=TFR, y=e0Total, colour=Continent))
## Warning: Removed 2 rows containing missing values (geom_point).
#Scatter plots with group labels
xyplot(e0Total~TFR, groups=Continent,auto.key=T, data=prbdata)
#separated by a group variable
ggplot(prbdata, aes(x=TFR, y=e0Total))+geom_point()+facet_grid(.~Continent)
## Warning: Removed 2 rows containing missing values (geom_point).