For theoretical explanations see the course slides and chapter 4 of http://www.rdatamining.com/docs/r-and-data-mining-examples-and-case-studies

Information and Classification

Entropy

Compute entropy:

require(DescTools)
x <- as.factor(c("a","b","a","a","b","b"))
y <- as.factor(c("a","b","a","a","a","a"))
z <- as.factor(c("a","a","a","a","a","a"))
Entropy(table(x))
Entropy(table(y))
Entropy(table(z))
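
As a sanity check, the same value can be computed by hand from the definition (DescTools' Entropy uses base-2 logarithms by default):

p <- prop.table(table(x))   # class proportions of x: 0.5 and 0.5
-sum(p * log2(p))           # 1 bit, matching Entropy(table(x)) above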

EXERCISE

  1. Compute the entropy of wine types in the wine data set from rattle.data.
  2. Compute the entropy of wine types for the subset of the data with above-average alcohol level. A starting sketch follows.
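
One possible starting point for these exercises, assuming rattle.data is installed (Type is the wine class column and Alcohol the alcohol level):

require(rattle.data)
data(wine)
Entropy(table(wine$Type))                           # exercise 1
strong <- wine[wine$Alcohol > mean(wine$Alcohol), ] # above-average alcohol
Entropy(table(strong$Type))                         # exercise 2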

Exploring information value of variables for classification: info gain

Classification trees can get very complicated and, as a result, fail to explain the phenomenon. One should therefore try to include as few variables as possible, selecting them by information gain:

data(iris)
library(FSelector)
weights <- information.gain(Species~., iris)
print(weights)
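
The weights returned by information.gain feed directly into FSelector's selection helpers; a minimal sketch that keeps the two highest-scoring attributes:

subset <- cutoff.k(weights, 2)              # names of the 2 most informative attributes
f <- as.simple.formula(subset, "Species")   # build a reduced model formula
print(f)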

EXERCISE

  1. Compute the importance of attributes in the wine dataset.
  2. Compute the importance of attributes in the iris dataset.

Tree based classification

require(rpart)
data(iris)
tree <- rpart(Species ~ ., data=iris, method="class")
summary(tree)
Call:
rpart(formula = Species ~ ., data = iris, method = "class")
  n= 150 

    CP nsplit rel error xerror       xstd
1 0.50      0      1.00   1.18 0.05017303
2 0.44      1      0.50   0.73 0.06121547
3 0.01      2      0.06   0.11 0.03192700

Variable importance
 Petal.Width Petal.Length Sepal.Length  Sepal.Width 
          34           31           21           14 

Node number 1: 150 observations,    complexity param=0.5
  predicted class=setosa      expected loss=0.6666667  P(node) =1
    class counts:    50    50    50
   probabilities: 0.333 0.333 0.333 
  left son=2 (50 obs) right son=3 (100 obs)
  Primary splits:
      Petal.Length < 2.45 to the left,  improve=50.00000, (0 missing)
      Petal.Width  < 0.8  to the left,  improve=50.00000, (0 missing)
      Sepal.Length < 5.45 to the left,  improve=34.16405, (0 missing)
      Sepal.Width  < 3.35 to the right, improve=19.03851, (0 missing)
  Surrogate splits:
      Petal.Width  < 0.8  to the left,  agree=1.000, adj=1.00, (0 split)
      Sepal.Length < 5.45 to the left,  agree=0.920, adj=0.76, (0 split)
      Sepal.Width  < 3.35 to the right, agree=0.833, adj=0.50, (0 split)

Node number 2: 50 observations
  predicted class=setosa      expected loss=0  P(node) =0.3333333
    class counts:    50     0     0
   probabilities: 1.000 0.000 0.000 

Node number 3: 100 observations,    complexity param=0.44
  predicted class=versicolor  expected loss=0.5  P(node) =0.6666667
    class counts:     0    50    50
   probabilities: 0.000 0.500 0.500 
  left son=6 (54 obs) right son=7 (46 obs)
  Primary splits:
      Petal.Width  < 1.75 to the left,  improve=38.969400, (0 missing)
      Petal.Length < 4.75 to the left,  improve=37.353540, (0 missing)
      Sepal.Length < 6.15 to the left,  improve=10.686870, (0 missing)
      Sepal.Width  < 2.45 to the left,  improve= 3.555556, (0 missing)
  Surrogate splits:
      Petal.Length < 4.75 to the left,  agree=0.91, adj=0.804, (0 split)
      Sepal.Length < 6.15 to the left,  agree=0.73, adj=0.413, (0 split)
      Sepal.Width  < 2.95 to the left,  agree=0.67, adj=0.283, (0 split)

Node number 6: 54 observations
  predicted class=versicolor  expected loss=0.09259259  P(node) =0.36
    class counts:     0    49     5
   probabilities: 0.000 0.907 0.093 

Node number 7: 46 observations
  predicted class=virginica   expected loss=0.02173913  P(node) =0.3066667
    class counts:     0     1    45
   probabilities: 0.000 0.022 0.978 
plot(tree,margin=0.2)
text(tree, use.n=TRUE, all=TRUE, cex=.6)

#alternative visualization
require(rpart.plot)
prp(tree,type=4,extra="auto",nn=TRUE)
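
The CP table shown in the summary above also drives pruning; a minimal sketch using rpart's built-in cptable:

printcp(tree)   # cross-validated error (xerror) per complexity value
# keep the subtree whose complexity parameter minimizes xerror
best <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp=best)
prp(pruned, type=4, extra="auto", nn=TRUE)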

Confusion matrix for classification

require(rpart)
data(iris)
tree <- rpart(Species ~ ., data=iris, method="class")
require(caret)
pred <- predict(tree, newdata=iris,type="class")
confusionMatrix(pred, iris$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         49         5
  virginica       0          1        45

Overall Statistics
                                         
               Accuracy : 0.96           
                 95% CI : (0.915, 0.9852)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.94           
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9800           0.9000
Specificity                 1.0000            0.9500           0.9900
Pos Pred Value              1.0000            0.9074           0.9783
Neg Pred Value              1.0000            0.9896           0.9519
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3267           0.3000
Detection Prevalence        0.3333            0.3600           0.3067
Balanced Accuracy           1.0000            0.9650           0.9450
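
Note that this confusion matrix is computed on the same data the tree was trained on, so it flatters the model. A minimal sketch of the more honest holdout approach, using caret's createDataPartition (the 70/30 split is an arbitrary choice):

set.seed(42)   # make the random split reproducible
idx <- createDataPartition(iris$Species, p=0.7, list=FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]
tree2 <- rpart(Species ~ ., data=train, method="class")
confusionMatrix(predict(tree2, newdata=test, type="class"), test$Species)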

Tree based classification with Rattle

Use Rattle -> Model -> Tree, then “Evaluate” to see the confusion matrix. Examine the command log and note the similarities to the commands above.

You may also try random forest models in Rattle; a plain-R sketch follows.
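
If you want to stay in R, Rattle's random forest model comes from the randomForest package; a minimal sketch:

require(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data=iris)
print(rf)        # includes an out-of-bag confusion matrix
importance(rf)   # variable importance, comparable to the rpart ranking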

Exercise and review questions

Download and import the Mushroom data set from Kaggle and save it in a variable “mushrooms”: https://www.kaggle.com/uciml/mushroom-classification/data

library(readr)
mushrooms <- read_csv("~/Downloads/mushrooms.csv")
# read_csv imports the columns as character; convert them to factors so that
# summary() tabulates the classes and FSelector/rpart treat them as categorical
mushrooms[] <- lapply(mushrooms, as.factor)
  1. Consider the edible/poisonous mix in the data set marked as “e” and “p” respectively:
summary(mushrooms$class)
   e    p 
4208 3916 

Write the mathematical expression to compute the entropy of this mix. Then use the Entropy function to actually find its value.
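
As a reminder, for class proportions p_1, ..., p_k the entropy is H = -sum_i p_i log2(p_i); here the two proportions are 4208/8124 and 3916/8124.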

  2. Consider the information gains:
library(FSelector)
weights <- information.gain(class ~ ., data=mushrooms)
print(weights)

Which variable do you think should be at the root of the classification tree?

  3. Consider the classification tree below:
require(rpart)
tree <- rpart(class ~ ., data=mushrooms, method="class")
require(rpart.plot)
prp(tree,type=4,extra="auto",nn=TRUE)

   (a) How would you describe node 5 in the classification tree in plain English?
   (b) What percentage of the initial sample is in node 5?
   (c) What is the probability that an item in node 2 is poisonous?
  4. Consider the confusion matrix below:
require(caret)
pred <- predict(tree, newdata=mushrooms,type="class")
cm<- confusionMatrix(pred, mushrooms$class)
cm$table
          Reference
Prediction    e    p
         e 4208   48
         p    0 3868
   (a) How many of the poisonous items are classified as edible?
   (b) How many of the edible items are classified as poisonous?
   (c) What percentage of items in total are wrongly classified (i.e. the error rate of the classifier)?
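
You can check your answers against the caret object itself; the overall accuracy is stored in cm$overall:

cm$overall["Accuracy"]       # fraction of items classified correctly
1 - cm$overall["Accuracy"]   # the overall error rate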

Further exercises and case studies

Exercise

Get the wine data from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and find a tree model for wine quality being greater than 5. A possible starting sketch follows.
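
A possible starting point, assuming the red-wine file from the UCI repository (note the semicolon separator used in these files):

require(rpart); require(rpart.plot)
wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
wine$good <- factor(wine$quality > 5)                  # target: quality above 5
tree <- rpart(good ~ . - quality, data=wine, method="class")
prp(tree, type=4, extra="auto", nn=TRUE)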

Tutorial case study: Credit card customers

Get the data from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. It is recommended to (1) open the data in Excel and correct the column names so that they do NOT include spaces, then (2) export as CSV and import into RStudio or Rattle.

Follow this tutorial, http://www.askanalytics.in/2015/10/decision-tree-in-r-telecom-case-study.html, but use the above data.

Exercise: Churn analysis

You can obtain the data for this case at https://www.ibm.com/communities/analytics/watson-analytics-blog/predictive-insights-in-the-telco-customer-churn-data-set/.

  1. Find the information gain of variables.
  2. Build a decision tree of the churn output, and view the error matrix.
  3. What are the alternatives for modeling?
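
A skeleton that mirrors the steps above, assuming the data set has been downloaded locally; the file name, the Churn column, and the customerID column are assumptions based on the usual release of this data, and some additional cleaning may be needed:

library(readr)
telco <- read_csv("~/Downloads/WA_Fn-UseC_-Telco-Customer-Churn.csv")  # assumed local path
telco$customerID <- NULL      # drop the identifier before modelling
telco[] <- lapply(telco, function(col) if (is.character(col)) as.factor(col) else col)
library(FSelector)
print(information.gain(Churn ~ ., data=telco))                    # question 1
library(rpart); library(caret)
tree <- rpart(Churn ~ ., data=telco, method="class")
confusionMatrix(predict(tree, telco, type="class"), telco$Churn)  # question 2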