Obs. | X1 | X2 | X3 | Y |
---|---|---|---|---|
1 | 0 | 3 | 0 | Red |
2 | 2 | 0 | 0 | Red |
3 | 0 | 1 | 3 | Red |
4 | 0 | 1 | 2 | Green |
5 | -1 | 0 | 1 | Green |
6 | 1 | 1 | 1 | Red |
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbours.
The Euclidean distance of each observation from the test point at the origin is \(\sqrt{X_1^2 + X_2^2 + X_3^2}\).
training_df <- data.frame(observation = 1:6,
X1 = c(0, 2, 0, 0, -1, 1),
X2 = c(3, 0, 1, 1, 0, 1),
X3 = c(0, 0, 3, 2, 1, 1),
Y = c("Red", "Red", "Red", "Green", "Green", "Red"))
## distance of each observation from the test point (0, 0, 0)
training_df$Euclidean_distance <- with(training_df, sqrt(X1^2 + X2^2 + X3^2))
training_df
## observation X1 X2 X3 Y Euclidean_distance
## 1 1 0 3 0 Red 3.000000
## 2 2 2 0 0 Red 2.000000
## 3 3 0 1 3 Red 3.162278
## 4 4 0 1 2 Green 2.236068
## 5 5 -1 0 1 Green 1.414214
## 6 6 1 1 1 Red 1.732051
library(ggplot2)
ggplot(training_df, aes(x = Euclidean_distance, y = 0))+
geom_point(aes(colour = Y))+
scale_color_manual(values = c( "green", "red"))+
xlim(0,4)
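Green. With K = 1 the prediction is the class of the single nearest observation, which is observation 5 (distance 1.41).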
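As a cross-check, the nearest-neighbour vote can be computed directly from the table rather than read off the plot. This is a minimal sketch in base R; `knn_predict` is a hypothetical helper written for this note, not a function from any package, and K = 3 is included only for comparison.

```r
# Training data from the table above
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(X1 = 0, X2 = 0, X3 = 0)

# Euclidean distance from every training observation to the test point
features <- as.matrix(train[, c("X1", "X2", "X3")])
dists <- sqrt(rowSums(sweep(features, 2, test_point)^2))

# Hypothetical helper: majority vote among the k closest observations
knn_predict <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(1)  # "Green": the nearest neighbour is observation 5
knn_predict(3)  # "Red": observations 5, 6 and 2 vote Green, Red, Red
```

The helper ignores ties, which do not arise here for K = 1 or K = 3.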
The College data set contains a number of variables for 777 different universities and colleges in the US.
Before reading the data into R, it can be viewed in Excel or a text editor.
## downloaded data from http://www-bcf.usc.edu/~gareth/ISL/data.html
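## stringsAsFactors = TRUE keeps Private as a factor (the default changed to FALSE in R 4.0.0)
college <- read.csv("./College.csv", stringsAsFactors = TRUE)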
rownames(college) <- college[,1]
#fix(college) ## brings up an editor in RStudio, will switch to just printing the output
head(college)
## X Private Apps
## Abilene Christian University Abilene Christian University Yes 1660
## Adelphi University Adelphi University Yes 2186
## Adrian College Adrian College Yes 1428
## Agnes Scott College Agnes Scott College Yes 417
## Alaska Pacific University Alaska Pacific University Yes 193
## Albertson College Albertson College Yes 587
## Accept Enroll Top10perc Top25perc F.Undergrad
## Abilene Christian University 1232 721 23 52 2885
## Adelphi University 1924 512 16 29 2683
## Adrian College 1097 336 22 50 1036
## Agnes Scott College 349 137 60 89 510
## Alaska Pacific University 146 55 16 44 249
## Albertson College 479 158 38 62 678
## P.Undergrad Outstate Room.Board Books
## Abilene Christian University 537 7440 3300 450
## Adelphi University 1227 12280 6450 750
## Adrian College 99 11250 3750 400
## Agnes Scott College 63 12960 5450 450
## Alaska Pacific University 869 7560 4120 800
## Albertson College 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni
## Abilene Christian University 2200 70 78 18.1 12
## Adelphi University 1500 29 30 12.2 16
## Adrian College 1165 53 66 12.9 30
## Agnes Scott College 875 92 97 7.7 37
## Alaska Pacific University 1500 76 72 11.9 2
## Albertson College 675 67 73 9.4 11
## Expend Grad.Rate
## Abilene Christian University 7041 60
## Adelphi University 10527 56
## Adrian College 8735 54
## Agnes Scott College 19016 59
## Alaska Pacific University 10922 15
## Albertson College 9727 55
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college <- college[,-1]
head(college)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
Now you should see that the first data column is Private. Note that another column labelled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
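The two steps above (copying the first column into the row names, then dropping it) can also be done at read time with the `row.names` argument of `read.csv()`. The sketch below uses a three-row inline stand-in for College.csv, built from the first rows of the `head()` output, so that it runs without the file; `csv_text` and `college_small` are names made up for this example.

```r
# Three-row stand-in for College.csv (values copied from the head() output above)
csv_text <- "X,Private,Apps
Abilene Christian University,Yes,1660
Adelphi University,Yes,2186
Adrian College,Yes,1428"

# row.names = 1 uses the first column as row names and drops it from the data
college_small <- read.csv(text = csv_text, row.names = 1)

rownames(college_small)
colnames(college_small)  # "Private" "Apps"
```

With the real file this would presumably be `read.csv("./College.csv", row.names = 1)`.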
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[,1:10])
boxplot(college$Outstate ~ college$Private) ## neat I did not know the formula notation could be used here
I would be more apt to use ggplot2 to create plots like this:
ggplot(college, aes(x = Private, y = Outstate))+
geom_boxplot()
Elite <- rep("No", nrow(college ))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college$Elite)
## No Yes
## 699 78
boxplot(college$Outstate ~ college$Elite)
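The Elite variable can also be built in a single step with `ifelse()`. A small sketch on toy values (the `top10perc` vector is hypothetical, not taken from College.csv), using the same rule as above:

```r
# Toy Top10perc values (hypothetical, not taken from College.csv)
top10perc <- c(23, 16, 60, 89, 38, 51)

# Same rule as above: "Yes" when more than 50% of new students
# come from the top 10% of their high school class
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Setting `levels` explicitly keeps "No" first even if a subset happens to contain only one of the two classes.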
hist(college$Apps)
hist(college$Accept)
I feel there must be a more efficient way
library(dplyr)
library(tidyr)
college_gathered <- college %>%
## remove the variables that are factors
select(-Private, -Elite) %>%
gather()
ggplot(college_gathered, aes(x = value))+
geom_histogram(stat = "bin")+
facet_wrap(.~key, scales = "free")+
theme(axis.text.x = element_text(angle = 90))
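In current tidyr, `gather()` is superseded by `pivot_longer()`. A minimal sketch of the same reshaping step on a toy data frame (the first three Apps and Accept values from the `head()` output; `toy` and `toy_long` are names made up here):

```r
library(tidyr)

# Toy numeric columns standing in for the college data
toy <- data.frame(Apps = c(1660, 2186, 1428), Accept = c(1232, 1924, 1097))

# pivot_longer() is the newer tidyr replacement for gather()
toy_long <- pivot_longer(toy, cols = everything(),
                         names_to = "key", values_to = "value")
toy_long
```

The result has one row per original cell, with the column name in `key` and the number in `value`, which is the shape `facet_wrap()` needs.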
How do the histograms differ between private or elite schools?
college_gathered <- college %>%
## drop Elite and keep Private as the grouping variable
select(-Elite) %>%
gather("key", "value", -Private)
ggplot(college_gathered, aes(x = value, colour = Private))+
geom_density()+
facet_wrap(.~key, scales = "free")+
theme(axis.text.x = element_text(angle = 90))
Private and public schools differ in the number of applications, acceptances and enrolments. This is likely just related to the total number of students, which we would expect to be higher in the public schools.
college_gathered <- college %>%
## drop Private and keep Elite as the grouping variable
select(-Private) %>%
gather("key", "value", -Elite)
ggplot(college_gathered, aes(x = value, colour = Elite))+
geom_density()+
facet_wrap(.~key, scales = "free")+
theme(axis.text.x = element_text(angle = 90))
Elite and non-Elite schools differ markedly: Elite schools have a higher percentage of faculty with PhDs (PhD), a higher percentage of alumni who donate (perc.alumni), higher graduation rates (Grad.Rate), and higher expenditure per student (Expend).
Make sure that the missing values have been removed from the data.
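## stringsAsFactors = TRUE keeps name as a factor, as in the summary below (the default changed to FALSE in R 4.0.0)
auto <- read.table("./Auto.data", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)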
## where are the NAs?
summary(auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 1st Qu.: 75.0
## Median :23.00 Median :4.000 Median :146.0 Median : 93.5
## Mean :23.52 Mean :5.458 Mean :193.5 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
## NA's :5
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2223 1st Qu.:13.80 1st Qu.:73.00 1st Qu.:1.000
## Median :2800 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.56 Mean :75.99 Mean :1.574
## 3rd Qu.:3609 3rd Qu.:17.10 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## ford pinto : 6
## amc matador : 5
## ford maverick : 5
## toyota corolla: 5
## amc gremlin : 4
## amc hornet : 4
## (Other) :368
## remove the rows with NAs
auto <- na.omit(auto)
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
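A quicker way to locate the missing values than scanning `summary()` is to count them per column. A minimal sketch on a toy data frame (`toy_auto` and its values are hypothetical stand-ins for the Auto data):

```r
# Toy stand-in for the Auto data with one missing horsepower value (hypothetical values)
toy_auto <- data.frame(
  mpg        = c(18, 15, 25),
  horsepower = c(130, NA, 90)
)

colSums(is.na(toy_auto))          # missing values per column
cleaned <- na.omit(toy_auto)      # drops every row containing an NA
nrow(toy_auto) - nrow(cleaned)    # number of rows removed
anyNA(cleaned)                    # FALSE once the NAs are gone
```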
Quantitative:
quantitative <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
Qualitative:
qualitative <- c("origin", "name")
quant_data <- subset(auto, select = colnames(auto) %in% quantitative)
## take the range for each column
apply(quant_data, 2, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
apply(quant_data, 2, mean)
## mpg cylinders displacement horsepower weight
## 23.445918 5.471939 194.411990 104.469388 2977.584184
## acceleration year
## 15.541327 75.979592
apply(quant_data, 2, sd)
## mpg cylinders displacement horsepower weight
## 7.805007 1.705783 104.644004 38.491160 849.402560
## acceleration year
## 2.758864 3.683737
quant_data_trimmed <- quant_data[-c(10:85),]
apply(quant_data_trimmed, 2, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
apply(quant_data_trimmed, 2, mean)
## mpg cylinders displacement horsepower weight
## 24.404430 5.373418 187.240506 100.721519 2935.971519
## acceleration year
## 15.726899 77.145570
apply(quant_data_trimmed, 2, sd)
## mpg cylinders displacement horsepower weight
## 7.867283 1.654179 99.678367 35.708853 811.300208
## acceleration year
## 2.693721 3.106217
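The three `apply()` calls can be folded into one summary table, which makes the full and trimmed data easier to compare. A sketch under the assumption that every column is numeric; `summarise_cols` is a hypothetical helper and `toy_quant` holds made-up values standing in for quant_data:

```r
# Toy numeric data (hypothetical values) standing in for quant_data
toy_quant <- data.frame(mpg = c(18, 15, 36, 24),
                        weight = c(3504, 3693, 1800, 2600))

# One row per statistic, one column per variable
summarise_cols <- function(df) {
  sapply(df, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
}

summarise_cols(toy_quant)             # all rows
summarise_cols(toy_quant[-(2:3), ])   # same summary after dropping rows 2 to 3
```

Negative indices drop rows by position, exactly as `quant_data[-c(10:85),]` does above.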
pairs(auto)
Histograms of the quantitative variables
## remember gather() akin to melt()
quant_data_gathered <- quant_data %>%
gather()
## quick histograms
ggplot(quant_data_gathered, aes(x = value))+
geom_histogram()+
facet_wrap(.~key, scales = "free")+
theme(axis.text.x = element_text(angle = 90))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
3 and 5 cylinders? Is there really such a thing?
Apparently there are three-cylinder cars in which the pistons are arranged in a straight line (https://en.wikipedia.org/wiki/Straight-three_engine), and similarly five-cylinder engines (https://en.wikipedia.org/wiki/Straight-five_engine). I thought that there must have been issues with the data, but now I’ve learned something about internal combustion engines.
Weight vs mpg
ggplot(auto, aes(x = weight, y = mpg, colour = year))+
geom_point()
Year and Horsepower
ggplot(auto, aes(x = as.factor(year), y = horsepower))+
geom_violin()+
geom_point()
ggplot(auto, aes(x = as.factor(year), y = horsepower))+
geom_boxplot()+
geom_jitter()
Displacement, weight, horsepower and year look useful for predicting mpg. The first three are likely correlated with each other. I think that year also reflects what I assume to be an increase in the efficiency of the motors.
cor(quant_data)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## acceleration year
## mpg 0.4233285 0.5805410
## cylinders -0.5046834 -0.3456474
## displacement -0.5438005 -0.3698552
## horsepower -0.6891955 -0.4163615
## weight -0.4168392 -0.3091199
## acceleration 1.0000000 0.2903161
## year 0.2903161 1.0000000
Displacement, weight, and horsepower are indeed strongly correlated with each other, with pairwise correlations between 0.86 and 0.93.
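To see which variables track mpg most closely, the mpg column of the correlation matrix can be ranked by absolute value. The sketch below copies that column from the output above so that it runs without the Auto data; with the data loaded, the same vector could be taken from `cor(quant_data)[, "mpg"]` after dropping the mpg entry.

```r
# Correlations with mpg, copied from the mpg column of the matrix above
cor_with_mpg <- c(
  cylinders    = -0.7776175,
  displacement = -0.8051269,
  horsepower   = -0.7784268,
  weight       = -0.8322442,
  acceleration =  0.4233285,
  year         =  0.5805410
)

# Order by absolute correlation, strongest first
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
ranked
```

Weight comes first, followed by displacement, horsepower and cylinders; year and acceleration are weaker and positively correlated with mpg.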
library(MASS)
Now the data set is contained in the object Boston.
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Read about the data set:
?Boston
How many rows are in this data set? How many columns?
The help page lists 506 rows and 14 columns, which dim() confirms:
dim(Boston)
## [1] 506 14
What do the rows and columns represent?
Colnames described in help:
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(rownames(Boston))
## [1] "1" "2" "3" "4" "5" "6"
Each row is a different suburb of Boston.
pairs(Boston)
crim seems to have a relationship with rm, age, dis, black and medv, but it is hard to tell in these pairs plots, so I will look a bit more closely.
Number of rooms vs. crime
ggplot(Boston, aes(x= rm, y = crim))+
geom_point()
Seems not to be a very clear relationship.
Proportion of houses built before 1940
ggplot(Boston, aes(x= age, y = crim))+
geom_point()
Here we do see something, where there can be higher crime in regions with older houses.
ggplot(Boston, aes(x= black, y = crim))+
geom_point()
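No clear relationship between the black variable and crime. Note that black is not a raw proportion: it is \(1000(Bk - 0.63)^2\), where Bk is the proportion of Black residents by town.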
ggplot(Boston, aes(x= lstat, y = crim))+
geom_point()
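There is not really a clear relationship between lstat (the percentage of lower-status population) and crime either.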
Median value of owner-occupied homes in $1000s vs. crime
ggplot(Boston, aes(x= medv, y = crim))+
geom_point()
Here we have the clearest relationship.
The median value of owner-occupied homes seems to predict the crime rate by town. The towns with lower median values have higher crime rates, with the exception that towns with the most expensive houses also have a slightly elevated crime rate.
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25650 3.61400 3.67700 88.98000
which.max(Boston$crim)
## [1] 381
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
which.max(Boston$tax)
## [1] 489
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
which.max(Boston$ptratio)
## [1] 355
Yes, suburb 381 has a very high crime rate of 88.98, which is well above the median of 0.26.
boxplot(Boston$crim)
Suburb 489 has the highest tax rate, 711, but this value is not an outlier.
boxplot(Boston$tax)
Suburb 355 has the highest pupil-teacher ratio at 22, but this value is not an outlier. Here we do see some outliers at the lower end of the range, where there are suburbs with ratios below 14.
boxplot(Boston$ptratio)
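The outlier calls above can be checked numerically with the usual 1.5 × IQR rule that boxplot whiskers are based on. This sketch uses the quartiles printed by `summary()` above; `boxplot()` itself works from hinges, which can differ slightly from these quartiles, so treat the fences as approximate. The data frame `q` and its column names are made up for this example.

```r
# Quartiles and extremes copied from the summary() output above
q <- data.frame(
  row.names = c("crim", "tax", "ptratio"),
  min = c(0.00632, 187.0, 12.60),
  q1  = c(0.08204, 279.0, 17.40),
  q3  = c(3.67700, 666.0, 20.20),
  max = c(88.98000, 711.0, 22.00)
)

# Whisker limits under the 1.5 * IQR rule
iqr <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * iqr
q$upper_fence <- q$q3 + 1.5 * iqr

# Do the maximum and minimum fall beyond the fences?
q$max_is_outlier <- q$max > q$upper_fence
q$min_is_outlier <- q$min < q$lower_fence
q
```

Under this rule only the crim maximum lies above its upper fence, while the ptratio minimum lies below its lower fence of about 13.2.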
sum(Boston$chas == 1)
## [1] 35
median(Boston$ptratio)
## [1] 19.05
which.min(Boston$medv)
## [1] 399
Boston[which.min(Boston$medv),]
## crim zn indus chas nox rm age dis rad tax ptratio black
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9
## lstat medv
## 399 30.59 5
Suburb 399 has the lowest median value of owner-occupied homes. Its crime rate (38.35) is not the highest, but it is high. It is not on the Charles River, and its houses are very old: 100% of the owner-occupied units were built before 1940.
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
summary(Boston[Boston$rm > 8,])
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
The median crime rate in the suburbs averaging more than 8 rooms per dwelling (0.52) is higher than for the overall data set (0.26). I thought this might be because of the association between older houses and crime: the suburbs that average more than 8 rooms per dwelling are mostly ones with older houses.
ggplot(Boston, aes(x = age, y = rm))+
geom_point()
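The comparison can also be made side by side with `tapply()`, splitting on whether a suburb averages more than 8 rooms. A minimal sketch, assuming the MASS package is installed so that `Boston` is available as above; `med_crim` and `med_age` are names made up here.

```r
library(MASS)  # provides the Boston data frame, as loaded above

# Median crime rate, split by whether a suburb averages more than 8 rooms
med_crim <- tapply(Boston$crim, Boston$rm > 8, median)
med_crim

# Median proportion of pre-1940 houses for the same two groups
med_age <- tapply(Boston$age, Boston$rm > 8, median)
med_age
```

The `TRUE` entries correspond to the 13 suburbs summarised above, and the `FALSE` entries to all the others.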