Chapter 2: Exercises

2.4 Exercises

Conceptual

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

    (a) The sample size n is extremely large, and the number of predictors p is small.
    • Answer: We would expect a flexible statistical learning method to perform better than an inflexible one, because with a large n the flexible fit can approach the true distribution without overfitting (see the simulation sketch after this list).
    (b) The number of predictors p is extremely large, and the number of observations n is small.
    • Answer: A flexible statistical learning method would perform worse, as the probability of overfitting would be very high.
    (c) The relationship between the predictors and response is highly non-linear.
    • Answer: Flexible statistical learning methods are better adapted to non-linear relationships than inflexible methods; a flexible method has more options for approximating the true distribution.
    (d) The variance of the error terms, i.e. \(\sigma^2 = \mathrm{Var}(\epsilon)\), is extremely high.
    • Answer: A flexible statistical method would perform worse when the variance of the error term is very high. Overfitting would be a large worry, i.e. the model would end up following the errors in the data, so the flexible approach would likely have lower performance.
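
As a quick illustration of scenario (a), here is a minimal simulation sketch (the sine truth, the noise level, and the loess span are all assumptions made up for the example): with a large n and a single predictor, the flexible fit should achieve the lower test MSE.

set.seed(1)
n <- 2000                                   ## large sample, small p
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)        ## assumed non-linear truth
df <- data.frame(x, y)
train <- sample(n, n / 2)

fit_lm    <- lm(y ~ x, data = df[train, ])                 ## inflexible
fit_loess <- loess(y ~ x, data = df[train, ], span = 0.2)  ## flexible

test_mse <- function(fit) {
  pred <- predict(fit, newdata = df[-train, ])
  mean((df$y[-train] - pred)^2, na.rm = TRUE)
}
c(linear = test_mse(fit_lm), loess = test_mse(fit_loess))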

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

    (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
    • Answer: regression, inference, n = 500, p = 3 (profit, number of employees, industry); CEO salary is the response.
    (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
    • Answer: classification, prediction, n = 20, p = 13 (price charged, marketing budget, competition price, and the ten other variables); success or failure is the response.
    (c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
    • Answer: regression, prediction, n = 52, p = 3 (the % changes in the US, British, and German markets); the % change in USD/Euro is the response.

3. We now revisit the bias-variance decomposition.

    (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
    • Answer: Quickly drafted plot
    (b) Explain why each of the five curves has the shape displayed in part (a).
    • Answer:
      • (squared) bias decreases with flexibility, because a more flexible method can fit the data more appropriately
      • variance increases with flexibility, because a wobblier fit follows the training data more closely
      • training error decreases with flexibility, because a more flexible method can track the training data ever more closely
      • test error decreases and then increases with flexibility; it rises again because the model starts following the noise in the training set, and the test data do not share that noise (a small simulation of this U-shape follows the list)
      • irreducible error, \(\mathrm{Var}(\epsilon)\), stays constant regardless of the method, because it is an error inherent in the data
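
The U-shape in the test-error curve can be reproduced with a small simulation; this is only a sketch, with polynomial degree standing in for flexibility and a made-up sine truth.

set.seed(42)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)   ## assumed truth + irreducible noise
df <- data.frame(x, y)
train <- sample(200, 100)

mse <- sapply(1:12, function(d) {
  fit <- lm(y ~ poly(x, d), data = df[train, ])
  c(train = mean((df$y[train] - predict(fit))^2),
    test  = mean((df$y[-train] - predict(fit, newdata = df[-train, ]))^2))
})

## training error falls monotonically; test error typically traces the U
matplot(1:12, t(mse), type = "l", lty = 1, col = c("blue", "red"),
        xlab = "flexibility (polynomial degree)", ylab = "MSE")
legend("top", c("training", "test"), col = c("blue", "red"), lty = 1)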

4. You will now think of some real-life applications for statistical learning.

    (a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
    • Answer:
      • variant calls in NGS data: response is whether a variant is a true positive or a false positive; predictors: variant fraction, cell type, coverage, genomic context; goal: prediction
      • microbial OTUs: response is the richness of OTUs found by environment; predictors: environmental parameters; goal: prediction, but also inference
      • new books: response is whether a book becomes a fan favourite, based on its Goodreads rating; predictors: page length, category, author; goal: prediction
    (b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
    • Answer:
      • crime as a function of temperature (response: crime rate; predictor: temperature); goal: prediction and inference
      • price of coffee over time in Switzerland (response: price; predictor: time); goal: prediction
      • price of used cars by location (response: price; predictor: location); goal: prediction
    (c) Describe three real-life applications in which cluster analysis might be useful.
    • Answer:
      • microarray or gene expression data - samples with similar patterns
      • microbial communities - samples with similar functional pathways
      • people with similar behaviours in financial transaction data

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

  • Answer:
    • Regression
      • flexible - the fitted function is harder to interpret, but test error (MSE) may be lower because a flexible fit can model the true function more accurately
      • less flexible - the opposite: easier to interpret, but potentially higher test error when the true function is not simple
      • Why take a more or less flexible option? Choose more flexible when there are lots of data and many different groups, and choose less flexible when there are few data points.
    • Classification
      • the same method can be more or less flexible (e.g. k-nearest neighbours)
      • highly flexible - training error rate is low, but test error rate can be high: high variance in the classification
      • less flexible - lower variance but higher bias
      • Why take a more or less flexible option? (Same as regression:) choose more flexible when there are lots of data and many different groups, and choose less flexible when there are few data points. A small KNN sketch follows below.
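
A sketch of the KNN point above, on simulated two-class data (the noisy circular boundary and the grid of k values are made up for the example; assumes the class package is installed): k = 1 gives zero training error, while a moderate k usually does better on the test set.

library(class)
set.seed(7)
n <- 400
x <- matrix(rnorm(2 * n), ncol = 2)
## noisy non-linear class boundary
cls <- factor(ifelse(x[, 1]^2 + x[, 2]^2 + rnorm(n, sd = 0.5) > 1.5, "A", "B"))
train <- sample(n, n / 2)

err <- sapply(c(1, 5, 15, 51), function(k) {
  c(train = mean(knn(x[train, ], x[train, ], cls[train], k = k) != cls[train]),
    test  = mean(knn(x[train, ], x[-train, ], cls[train], k = k) != cls[-train]))
})
colnames(err) <- paste0("k=", c(1, 5, 15, 51))
err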

6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

  • Answer:
    • Parametric methods make an assumption about the functional form of the model, for example that it is linear. Non-parametric methods do not assume a functional form when estimating the fit to the data.
    • Advantages of parametric: needs less data than a non-parametric approach (see the sketch below).
    • Disadvantages of parametric: may not capture the true functional form, and thus may have errors.
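
A sketch of the "needs less data" advantage, assuming the true f really is linear (all the numbers here are made up): with only 15 training points, the parametric fit has far fewer quantities to estimate and should track the truth better than a non-parametric smoother.

set.seed(3)
x <- runif(15, 0, 10)
y <- 2 + 0.5 * x + rnorm(15)              ## truly linear f, tiny n
x_test <- runif(500, 0, 10)
truth  <- 2 + 0.5 * x_test                ## noiseless truth on a test grid

fit_lm    <- lm(y ~ x)                    ## parametric: two parameters
fit_loess <- loess(y ~ x, span = 0.75)    ## non-parametric smoother

c(parametric    = mean((truth - predict(fit_lm, data.frame(x = x_test)))^2),
  nonparametric = mean((truth - predict(fit_loess, data.frame(x = x_test)))^2,
                       na.rm = TRUE))     ## loess returns NA outside its range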

7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
  1    0   3   0  Red
  2    2   0   0  Red
  3    0   1   3  Red
  4    0   1   2  Green
  5   -1   0   1  Green
  6    1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbours.

    (a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
    • Answer:

The Euclidean distance to the test point at the origin is \(\sqrt{X_1^2 + X_2^2 + X_3^2}\).

training_df <- data.frame(observation = 1:6,
                          X1 = c(0, 2, 0, 0, -1, 1),
                          X2 = c(3, 0, 1, 1, 0, 1),
                          X3 = c(0, 0, 3, 2, 1, 1),
                          Y = c("Red", "Red", "Red", "Green", "Green", "Red"))
## distance from each observation to the test point at the origin
training_df$Euclidean_distance <- with(training_df, sqrt(X1^2 + X2^2 + X3^2))
training_df
##   observation X1 X2 X3     Y Euclidean_distance
## 1           1  0  3  0   Red           3.000000
## 2           2  2  0  0   Red           2.000000
## 3           3  0  1  3   Red           3.162278
## 4           4  0  1  2 Green           2.236068
## 5           5 -1  0  1 Green           1.414214
## 6           6  1  1  1   Red           1.732051
    (b) What is our prediction with K = 1? Why?
    • Answer:
library(ggplot2)
ggplot(training_df, aes(x = Euclidean_distance, y = 0))+
  geom_point(aes(colour = Y))+
  scale_color_manual(values = c( "green", "red"))+
  xlim(0,4)

Green, because the single nearest neighbour is observation 5 (distance 1.41), which is green.

    (c) What is our prediction with K = 3? Why?
    • Answer: Red, because two of the three nearest neighbours (observations 6 and 2) are red; only observation 5 is green.
    (d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?
    • Answer: We would expect the best value of K to be small if the Bayes decision boundary is highly non-linear. This is because a large value of K would not be flexible enough to model the nonlinear boundary. (A quick programmatic check of parts (b) and (c) follows.)
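
As a sanity check on parts (b) and (c), the knn() function from the class package (assuming it is installed) reproduces both answers from training_df:

library(class)
test_point <- c(X1 = 0, X2 = 0, X3 = 0)
knn(training_df[, c("X1", "X2", "X3")], test_point, factor(training_df$Y), k = 1)
knn(training_df[, c("X1", "X2", "X3")], test_point, factor(training_df$Y), k = 3)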

Applied

8. This exercise relates to the College data set, which can be found in the file College.csv.

It contains a number of variables for 777 different universities and colleges in the US. The variables are:

  • Private : Public/private indicator
  • Apps : Number of applications received
  • Accept : Number of applicants accepted
  • Enroll : Number of new students enrolled
  • Top10perc : New students from top 10 % of high school class
  • Top25perc : New students from top 25 % of high school class
  • F.Undergrad : Number of full-time undergraduates
  • P.Undergrad : Number of part-time undergraduates
  • Outstate : Out-of-state tuition
  • Room.Board : Room and board costs
  • Books : Estimated book costs
  • Personal : Estimated personal spending
  • PhD : Percent of faculty with Ph.D.’s
  • Terminal : Percent of faculty with terminal degree
  • S.F.Ratio : Student/faculty ratio
  • perc.alumni : Percent of alumni who donate
  • Expend : Instructional expenditure per student
  • Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

    (a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
## downloaded data from http://www-bcf.usc.edu/~gareth/ISL/data.html
college <- read.csv("./College.csv")
    (b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[,1]
#fix(college) ## brings up an editor in RStudio, will switch to just printing the output
head(college)
##                                                         X Private Apps
## Abilene Christian University Abilene Christian University     Yes 1660
## Adelphi University                     Adelphi University     Yes 2186
## Adrian College                             Adrian College     Yes 1428
## Agnes Scott College                   Agnes Scott College     Yes  417
## Alaska Pacific University       Alaska Pacific University     Yes  193
## Albertson College                       Albertson College     Yes  587
##                              Accept Enroll Top10perc Top25perc F.Undergrad
## Abilene Christian University   1232    721        23        52        2885
## Adelphi University             1924    512        16        29        2683
## Adrian College                 1097    336        22        50        1036
## Agnes Scott College             349    137        60        89         510
## Alaska Pacific University       146     55        16        44         249
## Albertson College               479    158        38        62         678
##                              P.Undergrad Outstate Room.Board Books
## Abilene Christian University         537     7440       3300   450
## Adelphi University                  1227    12280       6450   750
## Adrian College                        99    11250       3750   400
## Agnes Scott College                   63    12960       5450   450
## Alaska Pacific University            869     7560       4120   800
## Albertson College                     41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni
## Abilene Christian University     2200  70       78      18.1          12
## Adelphi University               1500  29       30      12.2          16
## Adrian College                   1165  53       66      12.9          30
## Agnes Scott College               875  92       97       7.7          37
## Alaska Pacific University        1500  76       72      11.9           2
## Albertson College                 675  67       73       9.4          11
##                              Expend Grad.Rate
## Abilene Christian University   7041        60
## Adelphi University            10527        56
## Adrian College                 8735        54
## Agnes Scott College           19016        59
## Alaska Pacific University     10922        15
## Albertson College              9727        55

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[,-1]
head(college)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55

Now you should see that the first data column is Private. Note that another column labelled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

    (c) Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
  (d) Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])

  (e) Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
boxplot(college$Outstate ~ college$Private) ## neat I did not know the formula notation could be used here

I would be more apt to use ggplot2 to create plots.

ggplot(college, aes(x = Private, y = Outstate))+
  geom_boxplot()

  (f) Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(college$Elite)
##  No Yes 
## 699  78
boxplot(college$Outstate ~ college$Elite)

  (g) Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
hist(college$Apps)

hist(college$Accept)

I feel there must be a more efficient way:

library(dplyr)
library(tidyr)
college_gathered <- college %>%
  ## remove the variables that are factors
  select(-Private, -Elite) %>% 
  gather()

ggplot(college_gathered, aes(x = value))+
  geom_histogram()+
  facet_wrap(.~key, scales = "free")+
  theme(axis.text.x = element_text(angle = 90))

  (h) Continue exploring the data, and provide a brief summary of what you discover.

How do the distributions differ between private and public schools, or between elite and non-elite schools?

college_gathered <- college %>%
  ## remove the variables that are factors
  select(-Elite) %>% 
  gather("key", "value", -Private)

ggplot(college_gathered, aes(x = value, colour = Private))+
  geom_density()+
  facet_wrap(.~key, scales = "free")+
  theme(axis.text.x = element_text(angle = 90))

Differences between private and public schools are seen in the numbers of applications, enrolments, and acceptances. This is likely just related to the total number of students, which we would expect to be higher in the public schools.

college_gathered <- college %>%
  ## remove the variables that are factors
  select(-Private) %>% 
  gather("key", "value", -Elite)

ggplot(college_gathered, aes(x = value, colour = Elite))+
  geom_density()+
  facet_wrap(.~key, scales = "free")+
  theme(axis.text.x = element_text(angle = 90))

Elite vs. non-Elite schools show a big difference: Elite schools have higher percentages of PhDs among the teaching staff, more alumni contributing after graduation, higher graduation rates, and higher expenditures.

9. This exercise involves the Auto data set studied in the lab.

Make sure that the missing values have been removed from the data.

auto <- read.table("./Auto.data", header = TRUE, na.strings = "?")
## where are the NAs?
summary(auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   1st Qu.: 75.0  
##  Median :23.00   Median :4.000   Median :146.0   Median : 93.5  
##  Mean   :23.52   Mean   :5.458   Mean   :193.5   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                  NA's   :5      
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2223   1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2800   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.56   Mean   :75.99   Mean   :1.574  
##  3rd Qu.:3609   3rd Qu.:17.10   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##              name    
##  ford pinto    :  6  
##  amc matador   :  5  
##  ford maverick :  5  
##  toyota corolla:  5  
##  amc gremlin   :  4  
##  amc hornet    :  4  
##  (Other)       :368
## remove the rows with NAs
auto <- na.omit(auto)
head(auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
    (a) Which of the predictors are quantitative, and which are qualitative?

Quantitative:

quantitative <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")

Qualitative:

qualitative <- c("origin", "name")
    (b) What is the range of each quantitative predictor? You can answer this using the range() function.
quant_data <- subset(auto, select = colnames(auto) %in% quantitative)

## take the range for each column
apply(quant_data, 2, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82
    (c) What is the mean and standard deviation of each quantitative predictor?
apply(quant_data, 2, mean)
##          mpg    cylinders displacement   horsepower       weight 
##    23.445918     5.471939   194.411990   104.469388  2977.584184 
## acceleration         year 
##    15.541327    75.979592
apply(quant_data, 2, sd)
##          mpg    cylinders displacement   horsepower       weight 
##     7.805007     1.705783   104.644004    38.491160   849.402560 
## acceleration         year 
##     2.758864     3.683737
    (d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
quant_data_trimmed <- quant_data[-c(10:85),]
apply(quant_data_trimmed, 2, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
apply(quant_data_trimmed, 2, mean)
##          mpg    cylinders displacement   horsepower       weight 
##    24.404430     5.373418   187.240506   100.721519  2935.971519 
## acceleration         year 
##    15.726899    77.145570
apply(quant_data_trimmed, 2, sd)
##          mpg    cylinders displacement   horsepower       weight 
##     7.867283     1.654179    99.678367    35.708853   811.300208 
## acceleration         year 
##     2.693721     3.106217
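
The three apply() calls can also be collapsed into a single pass over the columns; this is just a convenience, the numbers are the same:

## range, mean, and sd per column in one call
sapply(quant_data_trimmed, function(v)
  c(min = min(v), max = max(v), mean = mean(v), sd = sd(v)))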
    (e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(auto)

Histograms of quantitative

## remember gather() akin to melt()
quant_data_gathered <- quant_data %>%
  gather()
## quick histograms
ggplot(quant_data_gathered, aes(x = value))+
  geom_histogram()+
  facet_wrap(.~key, scales = "free")+
  theme(axis.text.x = element_text(angle = 90))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3 and 5 cylinders? Is there really such a thing?

Apparently there are three-cylinder cars, where the pistons are arranged in a straight line: https://en.wikipedia.org/wiki/Straight-three_engine, and similarly for five-cylinder engines: https://en.wikipedia.org/wiki/Straight-five_engine. I thought there must have been issues with the data, but now I’ve learned something about internal combustion engines.

Weight vs mpg

ggplot(auto, aes(x = weight, y = mpg, colour = year))+
  geom_point()

Year and Horsepower

ggplot(auto, aes(x = as.factor(year), y = horsepower))+
  geom_violin()+
  geom_point()

ggplot(auto, aes(x = as.factor(year), y = horsepower))+
  geom_boxplot()+
  geom_jitter()

  (f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Displacement, weight, horsepower, and year. The first three are likely correlated with each other. I think year also reflects what I assume to be an increase in engine efficiency over time.

cor(quant_data)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

Displacement, weight, and horsepower are indeed strongly correlated to each other.

10. This exercise involves the Boston housing data set.
    (a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
library(MASS)

Now the data set is contained in the object Boston.

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

Read about the data set:

?Boston

How many rows are in this data set? How many columns?

506 rows and 14 columns from the help, but also:

dim(Boston)
## [1] 506  14

What do the rows and columns represent?

Colnames described in help:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(rownames(Boston))
## [1] "1" "2" "3" "4" "5" "6"

Rows are different suburbs of Boston.

    (b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston)

Crime (crim) seems to have a relationship with rm, age, dis, black, and medv, but it is hard to tell in these pairs plots. I will look a bit more closely.

Number of rooms vs. crime

ggplot(Boston, aes(x= rm, y = crim))+
  geom_point()

Seems not to be a very clear relationship.

Proportion of houses built before 1940

ggplot(Boston, aes(x= age, y = crim))+
  geom_point()

Here we do see something: there can be higher crime in regions with older houses.

ggplot(Boston, aes(x= black, y = crim))+
  geom_point()

No clear relationship between the proportion of blacks and crime.

ggplot(Boston, aes(x= lstat, y = crim))+
  geom_point()

Not really.

Median value of owner-occupied homes (in $1000s) vs. crime

ggplot(Boston, aes(x= medv, y = crim))+
  geom_point()

Here we have the clearest relationship.

    (c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

The median value of owner-occupied homes seems to predict crime rate by town: towns with lower median values have higher crime rates, with the exception that towns with the most expensive houses also have a slightly elevated crime rate.
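
A numeric complement to the plots: sorting each predictor's correlation with crim makes the strongest linear associations easy to read off.

sort(cor(Boston)[, "crim"], decreasing = TRUE)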

    (d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25650  3.61400  3.67700 88.98000
which.max(Boston$crim)
## [1] 381
summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0
which.max(Boston$tax)
## [1] 489
summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
which.max(Boston$ptratio)
## [1] 355

Yes, suburb 381 has a very high crime rate of 88.98, which is well above the median of 0.26.

boxplot(Boston$crim)

Suburb 489 has the highest tax rate, 711, but this value is not an outlier.

boxplot(Boston$tax)

Suburb 355 has the highest pupil-teacher ratio at 22, but this value is not an outlier. Here we do see some outliers at the lower end of the range, whereby some suburbs have ratios below 14.

boxplot(Boston$ptratio)

    (e) How many of the suburbs in this data set bound the Charles river?
sum(Boston$chas == 1)
## [1] 35
    (f) What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
    (g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
which.min(Boston$medv)
## [1] 399
Boston[which.min(Boston$medv),]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio black
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9
##     lstat medv
## 399 30.59    5

Suburb 399 has the lowest median value of owner-occupied homes. It does not have the highest crime rate, but its rate is high. It is not on the Charles River, and the housing stock is very old: 100% of the owner-occupied houses were built before 1940.
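
To make the comparison with the overall ranges concrete, here is the empirical percentile of suburb 399 within each column (a quick sketch using ecdf()):

worst <- which.min(Boston$medv)
round(sapply(names(Boston), function(v)
  ecdf(Boston[[v]])(Boston[worst, v])), 2)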

    (h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
summary(Boston[Boston$rm > 8,])
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0

The median crime rate is higher in the suburbs that average more than eight rooms per dwelling than in the overall data set. I thought this might be because of the association between older houses and crime: generally, only suburbs with older houses average more than eight rooms per dwelling. A quick check of the medians follows.
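
Checking the median-crime claim directly against the full data set:

c(all_suburbs = median(Boston$crim),
  rm_over_8   = median(Boston$crim[Boston$rm > 8]))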

ggplot(Boston, aes(x = age, y = rm))+
  geom_point()