Need help with the following assignment; I have attached the required data and the materials from the professor's lecture. It needs to be performed in RStudio. All of the detailed instructions and necessary documents are attached, and all 10 answers need to be accurate.

Instructions and Assignment: Advanced Analytics in R
Attached Files: IT836 Advanced R Assignment.pdf (98.987 KB), nbtrain.csv (306.612 KB)

In this assignment you will train a Naïve Bayes classifier on categorical data and predict individuals' incomes. Import the nbtrain.csv file and use the first 9010 records as training data and the remaining 1000 records as testing data.

1. Read the nbtrain.csv file into the R environment.
2. Construct the Naïve Bayes classifier from the training data, according to the formula "income ~ age + sex + educ". To do this, use the "naiveBayes" function from the "e1071" package. Provide the model's a priori and conditional probabilities.
3. Score the model with the testing data and create the model's confusion matrix. Also, calculate the overall, 10-50K, 50-80K, and GT 80K misclassification rates. Explain the variation in the model's predictive power across income classes.
4. Use the first 9010 records as training data and the remaining 1000 records as testing data.
5. What is the purpose of separating the data into a training set and a testing set?
6. Construct the classifier according to the formula "sex ~ age + educ + income", and calculate the overall, female, and male misclassification rates. Explain the misclassification rates.
7. Divide the training data into two partitions, according to sex, and randomly select 3500 records from each partition. Reconstruct the model from part (a) from these 7000 records. Provide the model's a priori and conditional probabilities.
8. How well does the model classify the testing data? Explain why.
9. Repeat step (b) several times. What effect does the random selection of records have on the model's performance?
10. What conclusions can one draw from this exercise?
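Before answering the individual questions, here is a minimal sketch of the overall workflow in R. It is an illustration only: it assumes, as the formulas in the questions imply, that nbtrain.csv contains columns named age, sex, educ, and income, with income stored as a factor whose levels are the 10-50K, 50-80K, and GT 80K classes, and that the file sits in the working directory. Adjust names and paths as needed.

library(e1071)                         # provides naiveBayes()

# 1. Read the data
nbtrain <- read.csv("nbtrain.csv")

# Split: first 9010 rows for training, remaining 1000 for testing
traindata <- nbtrain[1:9010, ]
testdata  <- nbtrain[9011:10010, ]

# 2. Fit the classifier income ~ age + sex + educ
model <- naiveBayes(income ~ age + sex + educ, data = traindata)
model                                  # prints the a priori and conditional probabilities

# 3. Score the testing data and build the confusion matrix
pred <- predict(model, testdata)
cm <- table(actual = testdata$income, predicted = pred)
cm

# Overall and per-class misclassification rates
# (assumes every income class appears in the test set, so cm is square)
overall_err <- 1 - sum(diag(cm)) / sum(cm)
class_err   <- 1 - diag(cm) / rowSums(cm)    # one rate per income class
overall_err
class_err

The same pattern (change the formula, re-subset the training data, re-fit, re-score) covers questions 6 through 9.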
# Section 5.5.1 The Groceries Dataset
library(arules)        # provides apriori(), inspect(), and the Groceries data
library(arulesViz)     # provides the plot methods for itemsets and rules

data(Groceries)
Groceries
summary(Groceries)
class(Groceries)

# display the first 20 grocery labels
Groceries@itemInfo[1:20,]

# display the 10th to 20th transactions
apply(Groceries@data[,10:20], 2,
      function(r) paste(Groceries@itemInfo[r,"labels"], collapse=", "))

# Section 5.5.2 Frequent Itemset Generation
# frequent 1-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=1, maxlen=1,
                    support=0.02, target="frequent itemsets"))
summary(itemsets)
inspect(head(sort(itemsets, by="support"), 10))

# frequent 2-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=2, maxlen=2,
                    support=0.02, target="frequent itemsets"))
summary(itemsets)
inspect(head(sort(itemsets, by="support"), 10))

# frequent 3-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=3, maxlen=3,
                    support=0.02, target="frequent itemsets"))
inspect(sort(itemsets, by="support"))

# frequent 4-itemsets
itemsets <- apriori(Groceries, parameter=list(minlen=4, maxlen=4,
                    support=0.02, target="frequent itemsets"))
inspect(sort(itemsets, by="support"))

# run Apriori without setting the maxlen parameter
itemsets <- apriori(Groceries, parameter=list(minlen=1, support=0.02,
                                              target="frequent itemsets"))

# Section 5.5.3 Rule Generation and Visualization
rules <- apriori(Groceries, parameter=list(support=0.001,
                                           confidence=0.6, target="rules"))
summary(rules)
plot(rules)
plot(rules@quality)

# display rules with top lift scores
inspect(head(sort(rules, by="lift"), 10))

confidentRules <- rules[quality(rules)$confidence > 0.9]
confidentRules
plot(confidentRules, method="matrix", measure=c("lift", "confidence"),
     control=list(reorder=TRUE))

# select the 5 rules with the highest lift
highLiftRules <- head(sort(rules, by="lift"), 5)
plot(highLiftRules, method="graph", control=list(type="items"))

This code covers the code presented in Section 8.2, ARIMA Model.

### Section 8.2.5 Building and Evaluating an ARIMA Model ###
install.packages("forecast")       # install, if necessary
library(forecast)

# read in gasoline production time series
# monthly gas production expressed in millions of barrels
gas_prod_input <- as.data.frame( read.csv("c:/data/gas_prod.csv") )

# create a time series object
gas_prod <- ts(gas_prod_input[,2])

# examine the time series
plot(gas_prod, xlab = "Time (months)",
     ylab = "Gasoline production (millions of barrels)")

# check for conditions of a stationary time series
plot(diff(gas_prod))
abline(a=0, b=0)

# examine ACF and PACF of differenced series
acf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max=48, main="")
pacf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max=48, main="")

# fit a (0,1,0)x(1,0,0)12 ARIMA model
arima_1 <- arima(gas_prod,
                 order=c(0,1,0),
                 seasonal = list(order=c(1,0,0), period=12))
arima_1

# it may be necessary to calculate AICc and BIC
# http://stats.stackexchange.com/questions/76761/extract-bic-and-aicc-from-arima-object
AIC(arima_1, k = log(length(gas_prod)))   # BIC

# examine ACF and PACF of the (0,1,0)x(1,0,0)12 residuals
acf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max=48, main="")
pacf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max=48, main="")

# fit a (0,1,1)x(1,0,0)12 ARIMA model
arima_2 <- arima(gas_prod,
                 order=c(0,1,1),
                 seasonal = list(order=c(1,0,0), period=12))
arima_2

# it may be necessary to calculate AICc and BIC
# http://stats.stackexchange.com/questions/76761/extract-bic-and-aicc-from-arima-object
AIC(arima_2, k = log(length(gas_prod)))   # BIC

# examine ACF and PACF of the (0,1,1)x(1,0,0)12 residuals
acf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max=48, main="")
pacf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max=48, main="")

# Normality and Constant Variance
plot(arima_2$residuals, ylab = "Residuals")
abline(a=0, b=0)
hist(arima_2$residuals, xlab="Residuals", xlim=c(-20,20))
qqnorm(arima_2$residuals, main="")
qqline(arima_2$residuals)

# Forecasting
# predict the next 12 months
arima_2.predict <- predict(arima_2, n.ahead=12)
matrix(c(arima_2.predict$pred - 1.96*arima_2.predict$se,
         arima_2.predict$pred,
         arima_2.predict$pred + 1.96*arima_2.predict$se), 12, 3,
       dimnames=list( c(241:252), c("LB","Pred","UB")) )

plot(gas_prod, xlim=c(145,252),
     xlab = "Time (months)",
     ylab = "Gasoline production (millions of barrels)",
     ylim=c(360,440))
lines(arima_2.predict$pred)
lines(arima_2.predict$pred + 1.96*arima_2.predict$se, col=4, lty=2)
lines(arima_2.predict$pred - 1.96*arima_2.predict$se, col=4, lty=2)

Copyright © 2014 EMC Corporation. All rights reserved.
Advanced Analytics - Theory and Methods
Module 4: Analytics Theory/Methods

Lesson 4b: Logistic Regression
During this lesson the following topics are covered:
•Technical description of a logistic regression model
•Common use cases for the logistic regression model
•Interpretation and scoring with the logistic regression model
•Diagnostics for validating the logistic regression model
•Reasons to Choose (+) and Cautions (-) of the logistic regression model
The topics covered in this lesson are listed.

Logistic Regression
• Used to estimate the probability that an event will occur as a function of other variables
 The probability that a borrower will default as a function of his credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
 Assign the class label with the highest probability
• Input variables can be continuous or discrete
• Output:
 A set of coefficients that indicate the relative impact of each driver
 A linear expression for predicting the log-odds ratio of the outcome as a function of the drivers (binary classification case)
 The log-odds ratio is easily converted to the probability of the outcome
Module 4: Analytics Theory/Methods
We use logistic regression to estimate the probability that an event will occur as a function of other variables. An example is the probability that a borrower will default as a function of his credit score, income, loan size, and his current debts. We will be discussing classifiers in the next lesson. Logistic regression can also be considered a classifier. Recall the discussions on classifiers in lesson 1 of this module (Clustering).
Classifiers are methods to assign class labels (default or no_default) based on the highest probability. In logistic regression input variables can be continuous or discrete. The output is a set of coefficients that indicate the relative impact of each of the input variables. In a binary classification case (true/false) the output also provides a linear expression for predicting the log odds ratio of the outcome as a function of drivers. The log odds ratios can be converted to the probability of an outcome and many packages do this conversion in their outputs automatically. 3 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Logistic Regression Use Cases • The preferred method for many binary classification problems:  Especially if you are interested in the probability of an event, not just predicting the "yes or no“  Try this first; if it fails, then try something more complicated • Binary Classification examples:  The probability that a borrower will default  The probability that a customer will churn • Multi -class example  The probability that a politician will vote yes/vote no/not show up to vote on a given bill 4 Module 4: Analytics Theory/Methods Logistic regression is the preferred method for many binary classification problems Two examples of a binary classification problem are shown in the slide above. Other examples : • true/false • approve/deny • respond to medical treatment/no response • will purchase from a website/no purchase • likelihood Spain will win the next World Cup The third example on the slide “ The probability that a politician will vote yes/vote no/not show up to vote on a given bill” is a multiclass problem. We will only discuss binary problems (such as loan default) for simplicity in this lesson. Logistic regression is especially useful if you are interested in the probability of an event, not just predicting the class labels. In a binary class problem Logistic regression must be tried first to fit a model. And only if it does not work models such as GAMS (generalized additive methods), Support Vector Machines and Ensemble Methods are tried (these models are out of scope for this course). 4 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Logistic Regression Model -Example • Training data: default is 0/1  default=1 if loan defaulted • The model will return the probability that a loan with given characteristics will default • If you only want a "yes/no" answer, you need a threshold  The standard threshold is 0.5 5 Module 4: Analytics Theory/Methods The slide shows an example “Probability of Default” Default (output for this model) is defined as a function of credit score, income, loan amount and existing debt. The training data represents the default as either 0 or 1 where default = 1 if the loan is defaulted. Fitting and scoring the logistic regression model will return the probability that a loan with a given value for each of the input variables will default. If only Yes/No type answer is desired a threshold must be set for the value of probability to return the class label. The standard threshold is 0.5. 5 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
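Before looking at how the model is visualized, it may help to see how such a model is typically fit and scored in R. The sketch below is a minimal illustration, not part of the course materials: it assumes a hypothetical data frame named loans with a 0/1 column default and the drivers named in the example (credit_score, income, loan_amount, existing_debt), and uses base R's glm() with family = binomial, the standard way to fit a logistic regression in R.

# Hypothetical 'loans' data frame (assumption, not from the slides)
fit <- glm(default ~ credit_score + income + loan_amount + existing_debt,
           data = loans, family = binomial)
summary(fit)                  # coefficients, standard errors, p-values

# Scored probability of default for each borrower
p_default <- predict(fit, type = "response")

# If a hard "yes/no" answer is needed, apply the standard 0.5 threshold
pred_class <- ifelse(p_default > 0.5, "default", "no_default")

# exp(coefficient) gives the multiplicative change in the odds of default
# for a one-unit change in that input (interpreted later in this lesson)
exp(coef(fit))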
Logistic Regression - Visualizing the Model
Overall fraction of default: ~20%. Logistic regression returns a score that estimates the probability that a borrower will default. The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1 (blue = defaulters).
Module 4: Analytics Theory/Methods
This is an example of how one might visualize the model. Logistic regression returns a score that estimates the probability that a borrower will default. The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability for borrowers scoring higher than 0.1 and less than 0.98. The graph is overlaid: think of the blue graph (defaulters) as being transparent and "in front of" the red graph (non-defaulters). The takeaway is that the higher a borrower scores, the more likely, empirically, he is to default. The graph only considers borrowers who score > 0.1 and < 0.98 because it had large spikes near 0 and 1, which made it hard to read. We can see, however, that a fraction of low-scoring borrowers do actually default (the overlap).
Copyright © 2014 EMC Corporation. All rights reserved.

Technical Description (Binary Case)

log[ P(y=1) / (1 − P(y=1)) ] = b0 + b1*x1 + b2*x2 + … + b_(p-1)*x_(p-1)

• y=1 is the case of interest: 'TRUE'
• The LHS is called logit(P(y=1)); hence, "logistic regression"
• logit(P(y=1)) is inverted by the sigmoid function; standard packages can return the probability for you
• Categorical variables are expanded as with linear regression
• Iterative solution to obtain coefficient estimates, denoted bj ("iteratively re-weighted least squares")
Module 4: Analytics Theory/Methods
The quantity on the LHS (left-hand side) is the log odds ratio. We first compute the ratio of the probability that y equals 1 to the probability that y does not equal 1, and take the log of this ratio. In logistic regression the log odds ratio is equal to a linear additive combination of the drivers. The LHS is called logit(P(y=1)), and hence this method came to be known as logistic regression. The inverse of the logit is the sigmoid function; the output of the sigmoid is the actual probability, and standard packages give this inverse as a standard output. Categorical values are expanded exactly the way we did in linear regression. Computing the estimated coefficients, denoted bj, can be accomplished as in the least squares method, but implemented as iteratively re-weighted least squares, converging to the true probabilities with every iteration. Logistic regression has exactly the same problems that an OLS method has, and the computational complexity increases with more input variables and with categorical variables with multiple levels.
Copyright © 2014 EMC Corporation. All rights reserved.
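Since standard packages report the fit on the log-odds scale and invert it with the sigmoid, here is a tiny worked example of that conversion in R. The value 0.20 is illustrative only.

p <- 0.20                          # probability of the event
log_odds <- log(p / (1 - p))       # logit(p), about -1.386

# invert with the sigmoid: 1 / (1 + exp(-log_odds)) recovers p
1 / (1 + exp(-log_odds))           # 0.20

# base R provides these as qlogis() and plogis()
qlogis(0.20)                       # logit
plogis(qlogis(0.20))               # back to 0.20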
Interpreting the Estimated Coefficients, bi • Invert the logit expression: • exp( bj) tells us how the odds -ratio of y=1 changes for every unit change in xj • Example: bcreditScore = -0.69 • exp( bcreditScore ) = 0.5 = 1/2 • for the same income, loan, and existing debt, the odds -ratio of default is halved for every point increase in credit score • Standard packages return the significance of the coefficients in the same way as in linear regression 8 Module 4: Analytics Theory/Methods If we invert the logit expression shown in the slide, we come up with the logit as a product of the exponents of the coefficients times the drivers. The exponent of the first coefficient, b 0, represents the odds -ratio of the outcome in the "reference situation" –the situation that is represented by all the continuous variables set to zero, and the categorical variables at their reference That means the exponent of the coefficients exp( bj) tells us how the odds -ratio of y=1 changes for every unit change in xj Suppose we have bcreditScore =-0.69 implies exp( -0.69) = 0.5 = 1/2 This means for the same income, loan amount, existing debt, the odds ratio of default is cut in half for every point of increase of credit score. The negative number on the coefficient indicates that there is a negative relation between the credit score and the probability of default. Higher credit score implies lower probability of default. Significance of the credit score is returned in the same way as in linear regression. So you should look for very low “p” values. 8 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. An Interesting Fact About Logistic Regression "The probability mass equals the counts" • If 20% of our loan risk training set defaults  The sum of all the training set scores will be 20% of the number of training examples • If 40% of applicants with income < $50,000 default  The sum of all the training set scores of people in this income category will be 40% of the number of examples in this income category 9 Module 4: Analytics Theory/Methods "Logistic regression preserves summary statistics of the training data" –in other words, logistic regression is a very good way of concisely describing the probability of all the different possible combination of features in the training data. Two examples of this feature are shown in the slide. If you sum up everybody’s score after putting them through the model the total computed will be equal to the sum of all the training set scores. What this means is that it is almost like a continuous look up probability table. Assume that we have all categorical variables and you have the table of probability of every possible combination of variables, Logistic regression is a concise version of the table. This is what can be defined as a “well calibrated” model. Reference: http://www.win -vector.com/blog/2011/09/the -simpler -derivation -of -logistic - regression/ 9 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics • Hold -out data:  Does the model predict well on data it hasn't seen? 
• N-fold cross -validation: Formal estimate of generalization error • "Pseudo -R2" : 1 –(deviance/null deviance)  Deviance, null deviance both reported by most standard packages  The fraction of "variance" that is explained by the model  Used the way R 2is used 10 Module 4: Analytics Theory/Methods This is all very similar to linear regression. We use the hold -out data method, and N -fold cross validation on the fitted model. This is exactly what we did with linear regression to determine if the model predicts well. The model should explain more than just this simple guess. Pseudo R 2 is the term we use in Logistic regression which we use the same way we use R 2 in linear regression. It is basically “the fraction” of the variance . Deviance, for the purposes of this discussion, is analogous to "variance" in linear regression. The null deviance is the deviance (or "error') that you would make if you always assumed that the probability of true were simply the global probability. 1 –(deviance/null deviance) is the “fraction” that defines Pseudo R 2 which is a measure of how well the model explains the data. 10 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics (Cont.) • Sanity check the coefficients  Do the signs make sense? Are the coefficients excessively large?  Wrong sign is an indication of correlated inputs, but doesn't necessarily affect predictive power.  Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables , or using regularized regression techniques .  Infinite magnitude coefficients could indicate a variable that strongly predicts a subset of the output (and doesn't predict well on the rest). ▪ Try a Decision Tree on that variable, to see if you should segment the data before regressing. 11 Module 4: Analytics Theory/Methods The sanity checks are exactly the same as what we discussed in linear regression. Once we determine the fit is good we need to perform the sanity checks. Logistic regression is an explanatory model and the coefficients provide the required details. First check the sign of the coefficients. Do the signs make sense. For example, should the income increase with age or years of education? The coefficients should be positive. If not there might be something wrong. It is often an indicator that the variables are correlated to each other. Regression works best if all the drivers are independent. This does not in fact affect the predictive power but the explanatory capability is compromised here. We also need to check if the magnitude of the coefficients make sense? They sometimes can become excessively large and we prefer them not to be very large. This is also an indication of strongly correlated inputs. In this case consider eliminating some variables. Note that unlike linear regression, where we have regularized regression techniques, there are not any standard methods with logistic regression. If there is a requirement one should implement one’s own method. Sometimes you may get infinite magnitude coefficients which could indicate that there is a variable that strongly predicts a certain subset of the output and does not predict well on the rest. For example there is a range of age for which the output income is perfectly predicted. In such conditions plot the output vs. the input and determine the segment at which the prediction goes wrong. 
We should then segment the data before fitting the model. Decision Trees can be used on that variable, to see if you should segment the data before regressing. 11 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics: ROC Curve Area under the curve (AUC) tells you how well the model predicts. (Ideal AUC = 1) For logistic regression, ROC curve can help set classifier threshold 12 Module 4: Analytics Theory/Methods Logistic models do very well at predicting class probabilities; but if you want to use them as a classifier you have to set a threshold. For a given threshold, the classifier will give false positives and false negatives. False positive rate (fpr) is the fraction of negative instances that were misclassified. False negative rate (fnr) is the fraction of positive instances that were misclassified. True positive rate (tpr) = 1 –fnr The ROC (Receiver Operating Characteristics) curve plots (fpr, tpr) as the threshold is varied from 0 (the upper right hand corner) to 1 (the lower left hand corner). As the threshold is raised, the false positive rate decreases, but the true positive rate decreases, too. The ideal classifier (only true instances have probability near 1) would trace the upper left triangle of the unit square: as the threshold increases, fpr decreases without lowering tpr. Usually, ROC curves are only used to evaluate prediction quality –how close the AUC is to 1. But they can also be used to set thresholds; if you have upper bounds on your desired fpr and fnr, you can use the ROC curve (or more accurately, the software that you use to plot the ROC curve) to give you the range of thresholds that meet those constraints. For logistic regression, the ROC curve can help set the classifier threshold. An excellent primer on ROC is available in the following reference: http://home.comcast.net/~tom.fawcett/public_html/papers/ROC101.pdf 12 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics: Plot the Histograms of Scores good separation 13 Module 4: Analytics Theory/Methods The next diagnostic method is plotting the histogram of the scores. The graph in the top half is what we saw earlier in the lesson. The graph tells us how well the model discriminates true instances from false instances. Ideally, true score high and false instances score low. If so, most of the mass of the two histograms are separated. That is what you see in the graph at the top. The graph shown at the bottom shows substantial overlap. The model did not predict well. This means the input variables are not strong predictors of the output. 13 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
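The diagnostics discussed above (hold-out scoring, pseudo-R², the ROC curve, and the score histograms) can be pulled together in a few lines. The sketch below is a hedged illustration: it re-uses the hypothetical fit and loans objects from the earlier logistic regression sketch and assumes the ROCR package for the ROC curve (pROC is a common alternative).

library(ROCR)    # one common package for ROC curves

# scores on the (hypothetical) data used to fit the model
scores <- predict(fit, type = "response")
pred   <- prediction(scores, loans$default)

# ROC curve: true positive rate vs. false positive rate as the threshold varies
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
abline(a = 0, b = 1, lty = 2)      # reference line for a random classifier

# Area under the curve (ideal AUC = 1)
performance(pred, measure = "auc")@y.values[[1]]

# Pseudo-R^2: 1 - (deviance / null deviance), both stored on the glm object
1 - fit$deviance / fit$null.deviance

# Histograms of scores for the two classes (the "separation" diagnostic)
hist(scores[loans$default == 1], col = rgb(0, 0, 1, 0.5), main = "", xlab = "score")
hist(scores[loans$default == 0], col = rgb(1, 0, 0, 0.5), add = TRUE)

In practice these diagnostics would be computed on held-out data rather than on the training set, as the lesson recommends.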
Reasons to Choose (+) Cautions ( -) Explanatory value : Relative impact ofeach variable on the outcome inamore complicated way than linear regression Does not handle missing values well Robust with redundant variables, correlated variables Lose some explanatory value Assumes that each variable affects the log -odds of the outcome linearly and additively Variable transformations and modeling variable interactions can alleviate this A good idea to take the log of monetary amounts or any variable with a wide dynamic range Concise representation with the the coefficients Cannot handle variables that affect the outcome in a discontinuous way. Step functions Easy toscore data Doesn't work well with discrete drivers that have a lot of distinct values For example, ZIP code Returns good probability estimates ofanevent Preserves the summary statistics ofthe training data "The probabilities equal the counts" Logistic Regression -Reasons to Choose (+) and Cautions ( -) Module 4: Analytics Theory/Methods 14 Logistic regressions have the explanatory values and we can easily determine how the variables affect the outcome. The explanatory values are a little more complicated than linear regression. It works well with (robust) redundant variables and correlated variables. In this case the prediction is not impacted but we lose some explanatory value with the fitted model. Logistic regression provides the concise representation of the outcome with the coefficients and it is easy to score the data with this model. Logistic regression returns probability estimates of an event. It also returns calibrated model it preserves the summary statistics of the training data. Cautions ( -) are that the Logistic regression does not handle missing values well. It assumes that each variable affects the log odds of the outcome linearly and additively. So if we have some variables that affect the outcome non -linearly and the relationships are not actually additive the model does not fit well. Variable transformations and modeling variable interactions can address this to some extent. It is recommended to take the log of monetary amounts or any variable with a wide dynamic range. It cannot handle variables that affect the outcome in a discontinuous way. We discussed the issue of infinite magnitude coefficients earlier where the prediction is inconsistent in ranges. Also when you have discrete drivers with a large number of distinct values the model becomes complex and computationally inefficient. Module 4: Analytics Theory/Methods 14 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Check Your Knowledge 1. What is a logit and how do we compute class probabilities from the logit? 2. How is ROC curve used to diagnose the effectiveness of the logistic regression model? 3. What is Pseudo R 2 and what does it measure in a logistic regression model? 4. How do you describe a binary class problem? 5. Compare and contrast linear and logistic regression methods. Your Thoughts? 15 Module 4: Analytics Theory/Methods Record your answers here. 15 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
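The "probabilities equal the counts" property listed under Reasons to Choose is easy to verify numerically. Continuing with the hypothetical fit and loans objects used in the earlier sketches:

# With an intercept in the model, the fitted probabilities sum to the event count
sum(fitted(fit))       # total predicted "probability mass"
sum(loans$default)     # observed number of defaulters; the two match (up to rounding)

# The same balance holds within each level of any categorical predictor that
# appears in the model (for example, an income-band factor), which is the sense
# in which logistic regression "preserves the summary statistics" of the data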
Advanced Analytics – Theory and Methods During this lesson the following topics were covered: •Technical description of a logistic regression model •Common use cases for the logistic regression model •Interpretation and scoring with the logistic regression model •Diagnostics for validating the logistic regression model •Reasons to Choose (+) and Cautions ( -) of the logistic regression model Lesson 4b: Logistic Regression -Summary Module 4: Analytics Theory/Methods 16 This lesson covered these topics. Please take a moment to review them. Module 4: Analytics Theory/Methods 16 Need help with the following assignment have attached the required data and materials of the professor's lecture. It needs to be performed in RStudio. All of the detailed instructions and necessary do Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics -Theory and Methods 1 Module 4: Analytics Theory/Methods 1 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics are covered: •Overview of Decision Tree classifier •General algorithm for Decision Trees •Decision Tree use cases •Entropy, Information gain •Reasons to Choose (+) and Cautions ( -) of Decision Tree classifier •Classifier methods and conditions in which they are best suited Decision Trees Module 4: Analytics Theory/Methods 2 The topics covered in this lesson are listed. Module 4: Analytics Theory/Methods 2 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Decision Tree Classifier -What is it? • Used for classification:  Returns probability scores of class membership  Well -calibrated, like logistic regression  Assigns label based on highest scoring class  Some Decision Tree algorithms return simply the most likely class  Regression Trees: a variation for regression  Returns average value at every node  Predictions can be discontinuous at the decision boundaries • Input variables can be continuous or discrete • Output :  A tree that describes the decision flow.  Leaf nodes return either a probability score, or simply a classification.  Trees can be converted to a set of "decision rules“  "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability“ 3 Module 4: Analytics Theory/Methods Decision Trees are a flexible method very commonly deployed in data mining applications. In this lesson we will focus on Decision Trees used for classification problems. There are two types of trees; Classification Trees and Regression (or Prediction) Trees •Classification Trees –are used to segment observations into more homogenous groups (assign class labels). They usually apply to outcomes that are binary or categorical in nature. •Regression Trees –are variations of regression and what is returned in each node is the average value at each node (type of a step function with which the average value can be computed). Regression trees can be applied to outcomes that are continuous (like account spend or personal income). The input values can be continuous or discrete. Decision Tree models output a tree that describes the decision flow. The leaf nodes return class labels and in some implementations they also return the probability scores. In theory the tree can be converted into decision rules such as the example shown in the slide. 
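As a small illustration of converting a tree into "if-then" rules, the sketch below fits a classification tree with the rpart package (an assumption; the lecture does not prescribe a specific R package) on the hypothetical loans data used in the logistic regression sketches. Printing an rpart object lists each split as a condition with the node counts and class probabilities that follow from it.

library(rpart)    # a common R implementation of CART-style decision trees

# Hypothetical data: binary default outcome and a few drivers, as before
tree_fit <- rpart(factor(default) ~ income + loan_amount + credit_score,
                  data = loans, method = "class")

print(tree_fit)   # each line is a node: split condition, n, and class probabilities

# With the rpart.plot package installed, the same tree can be printed as explicit rules:
# rpart.plot::rpart.rules(tree_fit)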
Decision Trees are a popular method because they can be applied to a variety of situations. The rules of classification are very straight forward and the results can easily be presented visually. Additionally, because the end result is a series of logical “if -then” statements, there is no underlying assumption of a linear (or non -linear) relationship between the predictor variables and the dependent variable. 3 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Decision Tree – Example of Visual Structure Gender Income Age Yes No Yes No Male Female <=40 >40 >45,000 <=45,000 Internal Node –decision on variable Leaf Node –class label Branch –outcome of test Income Age Female Male 4 Module 4: Analytics Theory/Methods Decision Tree s are typically depicted in a flow -chart like manner. Branches refer to the outcome of a decision and are represented by the connecting lines here. When the decision is numerical, the “greater than” branch is usually shown on the right and “less than” on the left. Depending on the nature of the variable, you may need to include an “equal to” component on one branch. Internal Nodes are the decision or test points. Each refers to a single variable or attribute. In the example here the outcomes are binary, although there could be more than 2 branches stemming from an internal node. For example, if the variable was categorical and had 3 choices, you might need a branch for each choice. The Leaf Nodes are at the end of the last branch on the tree. These represent the outcome of all the prior decisions. The leaf nodes are the class labels, or the segment in which all observations that follow the path to the leaf would be placed. 4 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Decision Tree Classifier -Use Cases • When a series of questions (yes/no) are answered to arrive at a classification  Biological species classification  Checklist of symptoms during a doctor’s evaluation of a patient • When “if -then” conditions are preferred to linear models.  Customer segmentation to predict response rates  Financial decisions such as loan approval  Fraud detection • Short Decision Trees are the most popular "weak learner" in ensemble learning techniques 5 Module 4: Analytics Theory/Methods An example of Decision Trees in practice is the method for classifying biological species. A series of questions (yes/no) are answered to arrive at a classification. Another example is a checklist of symptoms during a doctor’s evaluation of a patient. People mentally perform these types of analysis frequently when assessing a situation. Other use cases can be customer segmentation to better predict response rates to marketing and promotions. Computers can be “taught” to evaluate a series of criteria and automatically approve or deny an application for a loan. In the case of loan approval, computers can use the logical “if -then” statements to predict whether the customer will default on the loan. For customers with a clear (strong) outcome, no human interaction is required, for observations which may not generate a clear response, a human is needed for the decision. 
Short Decision Trees (where we have limited the number of splits) are often used as components (called "weak learners" or "base learners") in ensemble techniques (a set of predictive models which will all vote and we take decisions based on the combination of the votes) such as Random forests, bagging and boosting (Beyond the scope for this class) .The very simplest of the short trees are decision stumps: Decision Trees with one internal node (the root) which is immediately connected to the terminal nodes. A decision stump makes a prediction based on the value of just a single input feature. 5 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Example: The Credit Prediction Problem savings=(500:1000), >=1000,no known savings personal=female, male div/sep personal=male mar/wid, male single housing=free, rent housing=own 700/1000p(good)=0.70 245/294p(good)=0.83 349/501p(good)=0.70 70/117p(good)=0.60 36/88 p(good) = 0.41 savings= <100, (100:500) 6 Module 4: Analytics Theory/Methods We will use the same example we used in the previous lesson with Naïve Bayesian classifier. For the people with good credit and we start at the top of the tree the probability is 70% (700 out of 1000 people have good credit). The process has decided that we are going to split how much is in the savings account into two groups. One group with savings less than $100 or between $100 to $ 500. The second group is the rest of the population which has savings of $500 to $1000 or greater than $1000 or no known savings. We compute the probability of good credit at the second node and we find in the second savings category 245 out of 294 have good credit and the probability at this node is 83%. Looking at the other node (Savings <100 or Savings 100:500) we look into housing. We split this node into Housing (free,rent) as one group and Housing (own) as the other. Computing probability of good credit at housing (own) node we see that 349 out of 501 people have good credit, a 70% probability. Traversing down the housing (free, rent) node we split now on the variable known as personal. The two groups are Personal (female, male divorced/ separated) and Personal (male,married/widowed,male_single). In the node on the right, the probability of good credit is 0.6; in the node on the left, the probability of good credit is 42% (which is less than 50%, so we have shaded this box red). We can see that for this case, we might want to work with the probabilities, rather than the class labels; this tree would only label 88 rows (out of 1000) of the training set as "bad", which is far less than the 30% "bad" rate of the training set, and of those cases labeled "bad", only 59% of them would truly be bad. Tuning the splitting parameters, or using a random forest or other ensemble technique (more on that later) might improve the performance. Decision Trees are greedy algorithms. They take decisions based on what is available at that moment and once a bad decision is taken it is propagated all the way down. An ensemble technique may randomize the splitting (or even randomize data) and come up with multiple tree structures. It then assigns labels by looking at the average of the nodes in all the trees and assigns class labels or probability values. 6 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
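Here is a sketch of how node probabilities like the ones in the credit example are obtained in practice. It assumes a hypothetical data frame credit with a class column credit_rating ("good"/"bad") and categorical inputs savings_status, housing, personal_status, and job, loosely mirroring the German credit data, and again uses rpart as an illustrative implementation.

library(rpart)

# Hypothetical 'credit' data frame (assumption, not supplied with the lecture)
credit_tree <- rpart(credit_rating ~ savings_status + housing + personal_status + job,
                     data = credit, method = "class")

# Class probabilities at the leaf each record falls into,
# rather than just the "good"/"bad" label
head(predict(credit_tree, credit, type = "prob"))

# Hard labels use the default majority rule; compare how few records
# end up labeled "bad", as discussed above
table(predict(credit_tree, credit, type = "class"))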
General Algorithm • To construct tree T from training set S  If all examples in S belong to some class in C, or S is sufficiently "pure", then make a leaf labeled C.  Otherwise:  select the “most informative” attribute A  partition S according to A’s values  recursively construct sub -trees T1, T2, ..., for the subsets of S • The details vary according to the specific algorithm –CART, ID3, C4.5 –but the general idea is the same 7 Module 4: Analytics Theory/Methods We now describe the general algorithm. Our objective is to construct a tree T from a training set S. If all examples in S belongs to some class “C “ (good_credit for example) or S is sufficiently “pure” (in our case node p(credit_good) is 70% pure) we make a leaf labeled “C”. Otherwise we will select another attribute considered as the “most informative” (savings, housing etc.) and partition S according to A’s values. Something similar to what we explained in the previous slide. We will construct sub -trees T1,T2….. or the subsets of S recursively until • You have all of the nodes as pure as required or • You cannot split further as per your specifications or • Any other stopping criteria specified. There are several algorithms that implement Decision Trees and the methods of tree construction vary with each one of them. CART,ID3 and C4.5 are some of the popular algorithms. 7 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Step 1: Pick the Most “Informative" Attribute • Entropy -based methods are one common way • H = 0 if p(c) = 0 or 1 for any class  So for binary classification, H=0 is a "pure" node • H is maximum when all classes are equally probable  For binary classification, H=1 when classes are 50/50 8 Module 4: Analytics Theory/Methods The first step is to pick the most informative attribute. There are many ways to do it. We detail Entropy based methods. Let p( c ) be the probability of a given class. H as defined by the formula shown above will have a value 0 if p (c ) is 0 or 1. So for binary classification H=0 means it is a “pure” node. H is maximum when all classes are equally probable. If the probability of classes are 50/50 then H=1 (maximum entropy). 8 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Step 1: Pick the most "informative" attribute (Continued) • First, we need to get the base entropy of the data 9 Module 4: Analytics Theory/Methods In our credit problem p(credit_good) is 0.7 and p(credit_bad) is 0.3. The base entropy H credit = -(0.7 log 2(0.7) + 0.3log 2(0.3)) = 0.88 ( very close to 1) Our unconditioned credit problem has fairly high entropy. 9 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Step 1: Pick the Most “Informative" Attribute (Continued) Conditional Entropy • The weighted sum of the class entropies for each value of the attribute • In English: attribute values (home owner vs. renter) give more information about class membership  "Home owners are more likely to have good credit than renters" • Conditional entropy should be lower than unconditioned entropy 10 Module 4: Analytics Theory/Methods Continuing with step 1 we now find the conditional entropy, which is the weighted sum of class entropies for each value of the attribute. 
Let us say we choose the attribute “Housing” we have three levels for this attribute (free, rent and own). Intuitively we can say that home owners are more likely to have better credit than renters. So the attribute value Housing will give more information about the class membership for credit_good. The conditional entropy of attribute Housing should be lower than the base entropy. At worst (in the case where the attribute is uncorrelated with the class label), the conditional entropy is the same as the unconditioned entropy. 10 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Conditional Entropy Example for free own rent P(housing) 0.108 0.713 0.179 P(bad | housing) 0.407 0.261 0.391 p(good | housing ) 0.592 0.739 0.601 11 Module 4: Analytics Theory/Methods Let's compute the conditional entropy of credit class conditioned on housing status. In the top row of the table are the probabilities of each value. In the n ext two rows are the probabilities of the class labels conditioned on the housing value. Note that each term inside parentheses is the entropy of the class labels within a single housing value. The conditional entropy is still fairly high; but it is a little less than the unconditioned entropy. 11 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Step 1: Pick the Most “Informative" Attribute (Continued) Information Gain • The information that you gain, by knowing the value of an attribute • So the "most informative" attribute is the attribute with the highest InfoGain 12 Module 4: Analytics Theory/Methods Information Gain is defined as the difference between the base entropy and the conditional entropy of the attribute. So the most informative attribute is the attribute with most information gain. Remember, this is just an example. There are other information/purity measures, but InfoGain is a fairly popular one for inducing Decision Trees. 12 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Back to the Credit Prediction Example Attribute InfoGain job 0.001 housing 0.013 personal_status 0.006 savings_status 0.028 13 Module 4: Analytics Theory/Methods If we compute the InfoGain for all of our input variables, we see that savings_status is the most informative variable. We can see that savings_status gives the most infoGain and that is why it was the first variable on which the tree was split. 13 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. savings= <100, (100:500) savings=(500:1000),>=1000,no known savings 700/1000p(good)=0.7 245/294p(good)=0.83 Step 2 & 3: Partition on the Selected Variable • Step 2: Find the partition with the highest InfoGain  In our example the selected partition has InfoGain = 0.028 • Step 3: At each resulting node, repeat Steps 1 and 2  until node is “pure enough” • Pure nodes => no information gain by splitting on other attributes 14 Module 4: Analytics Theory/Methods The selected partitioning has InfoGain almost as high as using each savings value as a separate node. And InfoGain happens to be biased to many partitions, so this partition is basically as informative. 
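The base entropy, the conditional entropy given housing, and the resulting information gain can be reproduced with a few lines of base R, using the probabilities from the table above (p(good | housing) is taken here as 1 − p(bad | housing), so the results may differ slightly from the rounded values printed in the table).

# Entropy of a vector of class probabilities: H = -sum( p * log2(p) )
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

# Base entropy of the credit class: p(good) = 0.7, p(bad) = 0.3
H_credit <- entropy(c(0.7, 0.3))          # about 0.88

# Conditional entropy of credit given housing, using the table above
p_housing <- c(free = 0.108, own = 0.713, rent = 0.179)
H_free <- entropy(c(0.407, 1 - 0.407))    # P(bad|free), P(good|free)
H_own  <- entropy(c(0.261, 1 - 0.261))
H_rent <- entropy(c(0.391, 1 - 0.391))
H_credit_given_housing <- sum(p_housing * c(H_free, H_own, H_rent))

# Information gain for housing: base entropy minus conditional entropy
H_credit - H_credit_given_housing         # roughly 0.013, matching the InfoGain table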
InfoGain can be used with continuous variables as well; in that case, finding the partition and computing the information gain are the same step. “Pure enough” usually means that no more information can be gained by splitting on other attributes 14 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics • Hold -out data • ROC/AUC • Confusion Matrix • FPR/FNR, Precision/Recall • Do the splits (or the “rules”) make sense?  What does the domain expert say? • How deep is the tree?  Too many layers are prone to over -fit • Do you get nodes with very few members?  Over -fit 15 Module 4: Analytics Theory/Methods The diagnostics are exactly the same as the one we detailed for Naïve Bayesian classifier. We use the hold -out data /AUC and confusion matrix. There are sanity checks that can be performed such as validating the “decision rules” with domain experts and determining if they make sense. Having too many layers and obtaining nodes with very few members are signs of over fitting. 15 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Reasons to Choose (+) Cautions ( -) Takes any input type (numeric, categorical) In principle, can handle categorical variables with many distinct values (ZIP code) Decision surfaces can only be axis -aligned Robust with redundant variables, correlated variables Tree structure is sensitive to small changes in the training data Naturally handles variable interaction A “deep” tree is probably over -fit Because each split reduces the training data for subsequent splits Handles variables that have non -linear effect on outcome Not good for outcomes that are dependent on many variables Related to over -fit problem, above Computationally efficient to build Doesn’t naturally handle missing values; However most implementations include a method for dealing with this Easy to score data In practice, decision rules can be fairly complex Many algorithms can return a measure of variable importance In principle, decision rules are easy to understand Decision Tree Classifier -Reasons to Choose (+) & Cautions ( -) Module 4: Analytics Theory/Methods 16 Decision Trees take both numerical and categorical variables. They can handle many distinct values such as the zip code in the data. Unlike Naïve Bayesian the Decision Tree method is robust with redundant or correlated variables. Decision Trees handles variables that are non -linear. Linear/logistic regression computes the value as b1*x1 + b2*x2 .. And so on. If two variables interact and say the value y depends on x1*x2, linear regression does not model this type of data correctly. Naïve Bayes also does not do variable interactions (by design). Decision Trees handle variable interactions naturally. Every node in the tree is in some sense an interaction. Decision Tree algorithms are computationally efficient and it is easy to score the data. The outputs are easy to understand. Many algorithms return a measure of variable importance. Basically the information gain from each variable is provided by many packages. In terms of Cautions ( -), decision surface is axis aligned and the decision regions are rectangular surfaces. However, if the decision surface is not axis aligned (say a triangular surface), the Decision Tree algorithms do not handle this type of data well. Tree structure is sensitive to small variations in the training data. 
If you have a large data set and you build a Decision Tree on one subset and another Decision Tree on a different subset the resulting trees can be very different even though they are from the same data set. If you get a deep tree you are probably over fitting as each split reduces the training data for subsequent splits. Module 4: Analytics Theory/Methods 16 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Typical Questions Recommended Method Do I want class probabilities, rather than just class labels? Logistic regression Decision Tree Do I want insight into how the variables affect the model? Logistic regression Decision Tree Isthe problem high -dimensional? Naïve Bayes Do I suspect some of the inputs are correlated? Decision Tree Logistic Regression Do I suspect some of the inputs are irrelevant? Decision Tree Naïve Bayes Are there categorical variables with a large number of levels? Naïve Bayes Decision Tree Are there m ixed variable types? Decision Tree Logistic Regression Is there non -linear data or discontinuities in the inputs that will affect the outputs? Decision Tree Which Classifier Should I Try? Module 4: Analytics Theory/Methods 18 This is only advisory. It’s a list of things to think about when picking a classifier, based on the Reasons to Choose (+) and Cautions ( -) we’ve discussed. Module 4: Analytics Theory/Methods 18 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Check Your Knowledge 1. How do you define information gain? 2. For what conditions is the value of entropy at a maximum and when is it at a minimum? 3. List three use cases of Decision Trees. 4. What are weak learners and how are they used in ensemble methods? 5. Why do we end up with an over fitted model with deep trees and in data sets when we have outcomes that are dependent on many variables? 6. What classification method would you recommend for the following cases:  High dimensional data  Data in which outputs are affected by non -linearity and discontinuity in the inputs Your Thoughts? 19 Module 4: Analytics Theory/Methods Record your answers here. 19 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics were covered: •Overview of Decision Tree classifier •General algorithm for Decision Trees •Decision Tree use cases •Entropy, Information gain •Reasons to Choose (+) and Cautions ( -) of Decision Tree classifier •Classifier methods and conditions in which they are best suited Decision Trees -Summary Module 4: Analytics Theory/Methods 20 This lesson covered these topics. Please take a moment to review them. Module 4: Analytics Theory/Methods 20
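One of the Cautions noted in this lesson, that the tree structure is sensitive to small changes in the training data, is easy to see empirically. The sketch below assumes the same hypothetical credit data frame and the rpart package, and fits one tree on each half of a random split; the chosen split variables and thresholds often differ noticeably between the two fits.

library(rpart)

set.seed(1)                                # for a reproducible split
n <- nrow(credit)
idx <- sample(n, size = floor(n / 2))

tree_a <- rpart(credit_rating ~ savings_status + housing + personal_status + job,
                data = credit[idx, ],  method = "class")
tree_b <- rpart(credit_rating ~ savings_status + housing + personal_status + job,
                data = credit[-idx, ], method = "class")

print(tree_a)    # compare the chosen split variables and thresholds
print(tree_b)    # with those of the tree grown on the other half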
Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics -Theory and Methods 1 Module 4: Analytics Theory/Methods 1 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics are covered: •General description of regression models •Technical description of a linear regression model •Common use cases for the linear regression model •Interpretation and scoring with the linear regression model •Diagnostics for validating the linear regression model •The Reasons to Choose (+) and Cautions ( -) of the linear regression model Lesson 4a: Linear Regression Module 4: Analytics Theory/Methods 2 The topics covered in this lesson are listed. Module 4: Analytics Theory/Methods 2 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Regression • Regression focuses on the relationship between an outcome and its input variables.  Provides an estimate of the outcome based on the input values.  Models how changes in the input variables affect the outcome. • The outcome can be continuous or discrete. • Possible use cases:  Estimate the lifetime value (LTV) of a customer and understand what influences LTV.  Estimate the probability that a loan will default and understand what leads to default. • Our approaches: linear regression and logistic regression 3 Module 4: Analytics Theory/Methods The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards an average (a phenomenon also known as regression toward the mean). Specifically, regression analysis helps one understand how the value of the dependent variable (also referred to as outcome) changes when any one of the independent (or input) variables changes, while the other independent variables are held fixed. Regression analysis estimates the conditional expectation of the outcome variable given the input variables — that is, the mean value of the outcome variable when the input variables are held fixed. Regression focuses on the relationship between the outcome and the inputs. It also provides a model that has some explanatory value , in addition to estimating outcomes. Although social scientists use regression primarily for its explanatory value, data scientists apply regression techniques as predictors or classifiers. The outcome can be continuous or discrete. For continuous outcomes, such as income, this lesson examines the use of linear regression . For discrete outcomes of a categorical attribute, such as success/fail, gender, or political party affiliation, the next lesson presents the use of logistic regression . 3 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Linear Regression • Used to estimate a continuous value as a linear (additive) function of other variables  Income as a function of years of education, age, and gender  House sales price as function of square footage, number of bedrooms/bathrooms, and lot size • Outcome variable is continuous. • Input variables can be continuous or discrete. 
• Model Output:
 A set of estimated coefficients that indicate the relative impact of each input variable on the outcome
 A linear expression for estimating the outcome as a function of input variables
Module 4: Analytics Theory/Methods
Linear regression is a commonly used technique for modeling a continuous outcome. It is simple and works well in many instances. It is recommended that linear regression be tried first and, if the results turn out not to be reliable, that other more complicated models be considered. Alternative modeling approaches include ridge regression, local linear regression, regression trees, and neural nets (these models are out of scope for this course). Linear regression models a continuous outcome, such as income or housing sales prices, as a linear or additive function of other input variables. The input variables can be continuous or discrete.
Copyright © 2014 EMC Corporation. All rights reserved.

Linear Regression Model
Module 4: Analytics Theory/Methods

 y = β0 + β1 x1 + β2 x2 + ⋯ + β_(p-1) x_(p-1) + ε

where
 y is the outcome variable
 xj are the input variables, for j = 1, 2, …, p−1
 β0 is the value of y when each xj equals zero
 βj is the change in y based on a unit change in xj
 ε ~ N(0, σ²), and the ε's are independent of each other

In linear regression, the outcome variable is expressed as a linear combination of the input variables. For a given set of input variables, the linear regression model provides the expected outcome value. Unless the situation being modeled is purely deterministic, there will be some random variability in the outcome. This random error, denoted by ε, is assumed to be normally distributed with a mean of zero and a constant variance (σ²).
Copyright © 2014 EMC Corporation. All rights reserved.

Example: Linear Regression with One Input Variable
• x1 - the number of employees reporting to a manager
• y - the hours per week spent in meetings by the manager
Module 4: Analytics Theory/Methods

 y = β0 + β1 x1 + ε

In this example, the human resources department decides to examine the effect that the number of employees directly reporting to a manager has on how many hours per week the manager spends in meetings. The expected time spent in meetings is represented by the equation of a line with unknown intercept and slope. Suppose the true value of the intercept is 3.27 hours and the true value of the slope is 2.2 hours per employee. Then, a manager can expect to spend an additional 2.2 hours per week in meetings for every additional employee. The distribution of the error term is represented by the rotated normal distribution plots provided at specific values of x1. For example, a typical manager with 8 employees may be expected to spend 20.87 hours per week in meetings, but any amount of time from 15 to 27 hours per week is very probable. This example illustrates a theoretical regression model. In practice, it is necessary to collect and prepare the necessary data and use a software package such as R to estimate the values of the coefficients. Coefficient estimation is covered later in this lesson. Additional variables could be included in this model. For example, a categorical attribute can be added to this linear regression model to account for the manager's functional organization, such as engineering, finance, manufacturing, or sales.
It may be somewhat tempting to included one variable, x 2, to represent the organization and denote engineering by 1, finance by 2, manufacturing by 3, and sales by 4. However, such an approach incorrectly suggests that the interval between the assigned numeric values has meaning (for example, sales is three more than engineering). The proper implementation of categorical attributes in a regression model will be addressed next. 6 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Representing Categorical Attributes • For a categorical attribute with m possible values  Add m-1binary (0/1) variables to the regression model  The remaining category is represented by setting the m-1binary variables equal to zero 7 Module 4: Analytics Theory/Methods Possible Situation Input Variables Finance manager with 8 employees (8,1,0,0) Manufacturing manager with 8 employees (8,0,1,0) Sales manager with 8 employees (8,0,0,1) Engineering manager with 8 employees (8,0,0,0) In expanding the previous example to include the manager’s functional organization, the input variables, denoted earlier by the x’s, have been replaced by more meaningful variable names. In addition to the employees variable for the number of employees reporting to a manager, three binary variables have been added to the model to identify finance, manufacturing ( mfg ), and sales managers. If a manager belongs to either of these functional organizations, the corresponding variable is set to 1. Otherwise, the variable is set to 0. Thus, for four functional organizations, engineering is represented by the case where the three binary variables are each set to 0. For this categorical variable, engineering is considered the reference level. For example, the coefficient of finance denotes the relative difference from the reference level. Choosing a different organization as the reference level changes the coefficient values, but not their relative differences. Interpreting the coefficients for categorical variables relative to the reference level is covered later in this lesson. In general, a categorical attribute with m possible distinct values can be represented in the linear regression model by adding m-1binary variables. For a categorical attribute, such as gender with only two possible values, female or male, then only one binary variable needs to be added with one value assigned a 0 and the other value assigned 1. Suppose it was decided to include the manager’s U.S. state of employment in the regression model. Then 49 binary variables would have to be added to the regression model to account for 50 states. However, that many categorical values can be quite cumbersome to interpret or analyze. Alternatively, it may make more sense to group the states into geographic regions or into other groupings such as type of location: headquarters, plant, field office, or remote. In the latter case, only three binary variables would need to be added. 7 Module 4: Analytics Theory/Methods            sales mfg finance employees y 4 3 2 1 0 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. • Choose the line that minimizes: • Provides the coefficient estimates, denoted bj Fitting a Line with Ordinary Least Squares (OLS) 8 Module 4: Analytics Theory/Methods ෍ =1 [− (0+ 11+ ⋯ + −1,−1)]2 Once a dataset has been collected, the objective is fit the “best” line to the data points. 
A very common approach to determining the best fitting line is to choose the line that minimizes the sum of the squares of the differences between the observed outcomes in the dataset and the estimated outcomes based on the equation of the fitted line. This method is known as Ordinary Least Squares (OLS). In the case of one input variable, the differences or distances between the observed outcome values and the estimated values along the fitted regression line are presented in the provided graph as the vertical line segments. Although this minimization problem can be solved by hand calculations, it becomes very difficult for more than one input variable. Mathematically, the problem involves calculating the inverse of a matrix. However, other methods such as QR decomposition are used to minimize numerical round-off errors. Depending on the implementation, the required storage to perform the OLS calculations may grow quadratically as the number of input variables grows. For a large number of observations and many variables, the storage and RAM requirements should be carefully considered. Note the provided equation of the fitted line. The caret over y, read y-hat, is used to denote the estimated outcome for a given set of inputs. This notation helps to distinguish the observed y values from the fitted ŷ values. In this example, the estimated coefficients are b0 = 3.21 and b1 = 2.19, so the fitted line is ŷ = 3.21 + 2.19·x1. 8 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Interpreting the Estimated Coefficients, bj • Coefficients for numeric input variables  Change in outcome due to a unit change in input variable *  Example: b1= 2.2  Extra 2.2 hrs/wk in meetings for each additional employee managed * • Coefficients for binary input variables  Represent the additive difference from the reference level *  Example: b2= 0.5  Finance managers meet 0.5 hr/wk more than engineering managers do * • Statistical significance of each coefficient  Are the coefficients significantly different from zero?  For small p-values (say < 0.05), the coefficient is statistically significant * when all other input values remain the same 9 Module 4: Analytics Theory/Methods For numeric variables, the estimated coefficients are interpreted in the same way as the concept of slope introduced in algebra. For a unit change in a numeric variable, the outcome will change by the amount and in the direction of the corresponding coefficient. A fitted linear regression model is provided for the example where the hours per week spent in meetings by managers are modeled as a function of the number of employees and the manager’s functional organization. In this case, the coefficient of 2.2, corresponding to the employees variable, is interpreted as follows: the expected amount of time spent in meetings will increase by 2.2 hours per week for each additional employee reporting to a manager. The interpretation of a binary variable coefficient is slightly different. Because a binary variable only assumes a value of 0 or 1, the coefficient is the additive difference or shift in the outcome from the reference level. In this example, engineering is the reference level for the functional organizations. So, a manufacturing manager would be expected to spend 1.9 hours per week less in meetings than an engineering manager when the number of employees is the same. 
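To make this concrete, the following is a minimal R sketch (not part of the course files) that simulates data loosely based on the meetings example and fits the model with lm(). lm() performs the OLS fit and automatically creates the m-1 binary dummy variables when the organization is supplied as a factor. The variable names (employees, org, hours) and the simulated coefficient values, including the intercept and the sales effect, are illustrative assumptions only.
# Minimal sketch: simulate manager data and fit the line with OLS via lm()
set.seed(42)
n <- 200
employees <- sample(1:15, n, replace = TRUE)
org <- factor(sample(c("engineering", "finance", "mfg", "sales"), n, replace = TRUE))
hours <- 4 + 2.2 * employees + 0.5 * (org == "finance") -
         1.9 * (org == "mfg") - 0.6 * (org == "sales") + rnorm(n)
fit <- lm(hours ~ employees + org)
coef(fit)                                          # intercept, employees slope, and shifts vs. engineering
summary(fit)                                       # coefficient table with standard errors and p-values
plot(fitted(fit), residuals(fit)); abline(h = 0)   # quick residual check against the fitted values
The summary() output shows the per-coefficient p-values and the residual diagnostics discussed next.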
When used to fit linear regression models, many statistical software packages will provide a p - value with each coefficient estimate. This p -value can be used to determine if the coefficient is significantly different that zero. In other words, the software performs a hypothesis test where the null hypothesis is the coefficient equals zero and the alternate hypothesis is that the coefficient does not equal zero. For small p -values (say <0.05), then the null hypothesis would be rejected and the corresponding variable should remain in the linear regression model. If a larger p -value is observed, then the null hypothesis would not be rejected and the corresponding variable should be considered for removal from the model. 9 Module 4: Analytics Theory/Methodssales mfg finance employees y 6 . 0 9 . 1 5 . 0 2 . 2 0 . 4 ˆ      Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. • Residuals  Differences between the observed and estimated outcomes  The observed values of the error term, ε, in the regression model  Expressed as: • Errors are assumed to be normally distributed with  A mean of zero  Constant variance Diagnostics – Examining Residuals 10 Module 4: Analytics Theory/Methods Residuals are the differences between the observed and the estimated outcomes. The residuals are the observed values of the error term in the linear regression model. In linear regression modeling, these error terms are assumed to be normally distributed with a mean of zero and a constant variance regardless of the input variable values. Although this normality assumption is not required to fit a line using OLS, this assumption is the basis for many of the hypothesis tests and confidence interval calculations performed by statistical software packages such as R. The next few slides will address the use of residual plots to evaluate the adherence to this assumption as well as to access the appropriateness of a linear model to a given dataset. 10 Module 4: Analytics Theory/Methodsn i for y y e i i i ..., 2 , 1     Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics – Plotting Residuals 11 Module 4: Analytics Theory/Methods Ideal Residual Plot Quadratic Trend Non -centered Non -constant Variance When plotting the residuals agains t the estimated or fitted outcome values, the ideal residual plot will show residuals symmetrically centered around zero with a constant variance and with no apparent trends. If the ideal residual plot is not observed, it is often necessary to add additional variables to the model or transform some of the existing input and outcome variables. Common transformations include the square root and logarithmic functions. Residual plots are also useful for identifying outliers that may require further investigation or special handling. 11 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics – Residual Normality Assumption 12 Module 4: Analytics Theory/Methods Ideal Histogram Ideal Q-Q Plot The provided histogram shows that the residuals are centered around zero and appear to be symmetric about zero in a bell -shaped curved as one would expect for a normally distributed random variable. Another option is to examine a Q -Q plot that compares the observed data against the quantiles (Q) of the assumed distribution. 
In this example, the observed residuals follow a theoretical normal distribution represented by the line. If any significant departures of the plotted points from the line are observed , transformations , such as logarithms, may be required to satisfy the normality assumption. 12 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Train Train Test D3 D2 D1 Training Set #1 Train Test D3 D2 D1 Train Diagnostics – Using Hold -out Data • Hold -out data  Training and testing datasets  Does the model predict well on data it hasn't seen? • N-fold cross validation  Partition the data into N groups.  Holding out each group,  Fit the model  Calculate the residuals on the group  Estimated prediction error is the average over all the residuals . 13 Module 4: Analytics Theory/Methods Test Train D3 D2 D1 Train Training Set #2 Training Set #3 Creating a hold -out dataset (this was discussed in Apriori diagnostics earlier in lesson 2 of this module) before you fit the model, and using that dataset to estimate prediction error is by far the easiest thing to do. N-fold cross validation –it tells you if your set of variables is reasonable. This method is used when you don't have enough data to create a hold -out dataset. N -fold c ross validation is performed by randomly splitting the dataset into N non -overlapping subsets or groups and then fitting a model using N -1 groups and predicting its performance using the group that was held out. This process is repeated a total of N times, by holding out each group. After completing the N model fits, you estimate the mean performance of the model (maybe also the variance/standard deviation of the performance). "Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions ", by Seni and Elder, provides a succinct description of N -fold cross -validation. 13 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics – Other Considerations • R2  The fraction of the variability in the outcome variable explained by the fitted regression model.  Attains values from 0 (poorest fit) to 1 (perfect fit) • Identify correlated input variables  Pair -wise scatterplots  Sanity check the coefficients  Are the magnitudes excessively large?  Do the signs make sense? 14 Module 4: Analytics Theory/Methods R2(goodness of fit metric) is reported by all standard packages. It is the fraction of the variability in the outcome variable that the fitted model explains. The definition of R 2is 1 – SSerr /Sstot where SSerr = Sum[(y -ypred )2] and SStot = Sum[(y -ymean )2]. For a good fit, we want an R2value near 1. Regression modeling works best if the input variables are independent of each other. A simple way to look for the correlated variables is to examine pair -wise scatterplots such as the one generated in Module 3 for the Iris dataset. If two input variables, x1and x2, are linearly related to the outcome variable y, but are also correlated to each other, it may be only necessary to include one of these variables in the model. After fitting a regression model, it is useful to examine the magnitude and signs of the coefficients. Coefficients with large magnitudes or intuitively incorrect signs are also indications on correlated input variables. 
If the correlated variables remain in the fitted model, the predictive power of the regression model may not suffer, but its explanatory power will be diminished when the magnitude and signs of the coefficients do not make sense. If correlated input variables need to remain in the model, restrictions on the magnitudes of the estimated coefficients can be accomplished with alternative regression techniques. Ridge regression , which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. A related technique is lasso regression , in which the penalty is proportional to the sum of the absolute values of the coefficients. Both of these techniques are outside of the scope of this course. 14 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Reasons to Choose (+) Cautions ( -) Concise representation (the coefficients) Does not handle missing values well Robust to redundant or correlated variables Lose some explanatory value Assumes that each variable affects the outcome linearly and additively Variable transformations and modeling variable interactions can alleviate this A good idea to take the log of monetary amounts or any variable with a wide dynamic range Explanatory value Relative impact of each variable on the outcome Does not easily handle variables that affect the outcome in a discontinuous way Step functions Easy to score data Does not work well with categorical attributes with a lot of distinct values For example, ZIP code Linear Regression -Reasons to Choose (+) and Cautions ( -) Module 4: Analytics Theory/Methods 15 The estimated coefficients provide a concise representation of the outcome variable as a function of the input variables. T he estimated coefficients provide t he explanatory value of the model and are used to easily determine how the individual input variables affect the outcome. Linear regression is robust to redundant or correlated variables. Although the predictive power may not be impacted, the model does lose some explanatory value in the case of correlated variables. With the fitted model, it is also easy to score a given set of input values. A caution is that linear regression does not handle missing values well. Another caution is that linear regression assumes that each variable affects the outcome linearly and additively. If some variables affect the outcome non -linearly and the relationships are not actually additive, the model will often not explain the data well. Variable transformations and modeling variable interactions can address this issue to some extent. Hypothesis testing and confidence intervals depend on the normality assumption of the error term. To satisfy the normality assumption, a common practice is take the log of an outcome variable with a skewed distribution for a given set of input values. Also, linear regression models are not ideal for handling variables that affect the outcome in a discontinuous way. In the case of a categorical attribute with a large number of distinct values, the model becomes complex and computationally inefficient. Module 4: Analytics Theory/Methods 15 Copyright © 2014 EMC Corporation. All rights reserved. 
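Before the knowledge-check questions, here is a minimal base-R sketch of the N-fold cross-validation procedure described in the diagnostics slides above. The data frame d and its columns are hypothetical, simulated only so that the sketch runs end to end; the estimated prediction error is the average held-out mean squared error across folds.
# Hand-rolled N-fold cross-validation sketch on simulated data
set.seed(1)
d <- data.frame(employees = sample(1:15, 200, replace = TRUE),
                org = factor(sample(c("engineering", "finance", "mfg", "sales"),
                                    200, replace = TRUE)))
d$hours <- 4 + 2.2 * d$employees + rnorm(200)
N <- 5
folds <- sample(rep(1:N, length.out = nrow(d)))    # random fold assignment
cv_mse <- sapply(1:N, function(k) {
  fit_k <- lm(hours ~ employees + org, data = d[folds != k, ])
  pred  <- predict(fit_k, newdata = d[folds == k, ])
  mean((d$hours[folds == k] - pred)^2)             # error on the held-out fold
})
mean(cv_mse)                                       # estimated prediction error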
Copyright © 2014 EMC Corporation. All Rights Reserved. Check Your Knowledge 1. How is the measure of significance used in determining the explanatory value of a driver (input variable) with linear regression models? 2. Detail the challenges with categorical values in linear regression model. 3. Describe N -Fold cross validation method used for diagnosing a fitted model. 4. List two use cases of linear regression models. 5. List and discuss two standard checks that you will perform on the coefficients derived from a linear regression model. Your Thoughts? 16 Module 4: Analytics Theory/Methods Record your answers here. 16 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics were covered: • General description of regression models • Technical description of a linear regression model • Common use cases for the linear regression model • Interpretation and scoring with the linear regression model • Diagnostics for validating the linear regression model • The Reasons to Choose (+) and Cautions ( -) of the linear regression model Lesson 4a: Linear Regression -Summary Module 4: Analytics Theory/Methods 17 This lesson covered these topics. Please take a moment to review them. Module 4: Analytics Theory/Methods 17 Need help with the following assignment have attached the required data and materials of the professor's lecture. It needs to be performed in RStudio. All of the detailed instructions and necessary do Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics -Theory and Methods 1 Module 4: Analytics Theory/Methods 1 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics are covered: •Naïve Bayesian Classifier •Theoretical foundations of the classifier •Use cases •Evaluating the effectiveness of the classifier •The Reasons to Choose (+) and Cautions ( -) with the use of the classifier Naïve Bayesian Classifiers Module 4: Analytics Theory/Methods 2 The topics covered in this lesson are listed. Module 4: Analytics Theory/Methods 2 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Classifiers • Classification: assign labels to objects. • Usually supervised: training set of pre -classified examples. • Our examples:  Naïve Bayesian  Decision Trees  (and Logistic Regression) Module 4: Analytics Theory/Methods 3 Where in the catalog should I place this product listing? Is this email spam? Is this politician Democrat/Republican/Green? The primary task performed by c lassifiers is to assign labels to objects. Labels in classifiers are pre -determined unlike in clustering where we discover the structure and assign labels. Classifier problems are supervised learning methods. We start with a training set of pre - classified examples and with the knowledge of probabilities we assign class labels. Some use case examples are shown in the slide. Based on the voting pattern on issues we could classify whether a politician has an affiliation to a party or a principle. Retailers use classifiers to assign proper catalog entry locations for their products. 
Most importantly the classification of emails as spam is another useful application of classifier methods. Logistic regression, discussed in the previous lesson, can be viewed and used as a classifier. We will discuss Naïve Bayesian Classifiers in this lesson and the use of Decision Trees in the next lesson. Module 4: Analytics Theory/Methods 3 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Naïve Bayesian Classifier • Determine the most probable class label for each object  Based on the observed object attributes  Naïvely assumed to be conditionally independent of each other  Example:  Based on the objects attributes {shape, color, weight}  A given object that is {spherical, yellow, < 60 grams}, may be classified (labeled) as a tennis ball  Class label probabilities are determined using Bayes’ Law • Input variables are discrete • Output :  Probability score –proportional to the true probability  Class label –based on the highest probability score 4 Module 4: Analytics Theory/Methods The Naïve Bayesian Classifier is a probabilistic classifier based on Bayes' Law and naïve conditional independence assumptions. In simple terms, a Naïve Bayes Classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, an object can be classified into a particular category based on its attributes such as shape, color, and weight. A reasonable classification for an object, that is spherical, yellow and less than 60 grams in weight, may be a tennis ball. Even if these features depend on each other or upon the existence of the other features, a Naïve Bayesian Classifier considers all of these properties to independently contribute to the probability that the object is a tennis ball. The input variables are generally discrete (categorical) but there are variations to the algorithms that work with continuous variables as well. For this lesson, we will consider only discrete input variables. Although weight may be considered a continuous variable, in the tennis ball example, weight was grouped into intervals in order to make weight a categorical variable. The output typically returns a probability score and class membership. The output from most implementations are log probability scores for the class (we will address this later in the lesson) and we assign the class label that corresponds to the highest log probability score. 4 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Naïve Bayesian Classifier -Use Cases • Preferred method for many text classification problems.  Try this first; if it doesn't work, try something more complicated • Use cases  Spam filtering, other text classification tasks  Fraud detection 5 Module 4: Analytics Theory/Methods Naïve Bayesian Classifiers are among the most successful known algorithms for learning to classify text documents. Spam filtering is the best known use of Naïve Bayesian Text Classification. Bayesian Spam Filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email. Many modern mail clients implement Bayesian Spam Filtering. Naïve Bayesian Classifiers are used to detect fraud. 
For example in auto insurance, based on a training data set with attributes (such as driver’s rating, vehicle age, vehicle price, is it a claim by the policy holder, police report status, claim genuine ) we can classify a new claim as genuine or not. References: Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering) http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering (http://eprints.ecs.soton.ac.uk/18483/ Online applications (http://www.convo.co.uk/x02/) 5 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Building a Training Dataset to Predict Good or Bad Credit 6 Module 4: Analytics Theory/Methods • Predict the credit behavior of a credit card applicant from applicant's attributes:  Personal status  Job type  Housing type  Savings amount • These are all categorical variables and are better suited to Naïve Bayesian Classifier than to logistic regression. Let us look into a specific use case example. We present here the same example we worked with in Lesson 2 of this module with the Apriori algorithm. The training dataset consists of attributes: personal status, job type, housing type and amount of money in their savings account. They are represented as categorical variables which are well suited for Naïve Bayesian Classifier. With this training set we want to predict the credit behavior of a new customer. This problem could be solved with logistic regression as well. If there are multiple levels for the outcome you want to predict, then Naïve Bayesian Classifier is a better solution. Next, we will go through the technical basis for Naïve Bayesian Classifiers and will revisit this credit dataset later. 6 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Technical Description -Bayes' Law • C is the class label:  C ϵ{C 1, C 2, … Cn} • A is the observed object attributes  A = (a 1, a 2, … a m) • P(C | A) is the probability of C given A is observed  Called the conditional probability 7 Module 4: Analytics Theory/Methods Bayes' Law states: P(C | A)*P(A) = P(A | C)*P(C) = P(A ^ C). That is, the conditional probability that C is true given that A is true, denoted P(C|A), times the probability of A is the same as the conditional probability that A is true given that C is true, denoted P(A|C), times the probability of C. Both of these terms are equal to P(A^C) that is the probability A and C are simultaneously true. If we divide all three terms by P(A), then we get the form shown on the slide. The reason that Bayes’ Law is important is that we may not know P(C|A) (and we want to), but we do know P(A|C) and P(C) for each possible value of C from the training data. As we will see later, it is not necessary to know P(A) for the purposes of Naïve Bayes Classifiers. An example using Bayes Law: John flies frequently and likes to upgrade his seat to first class. He has determined that, if he checks in for his flight at least two hours early, the probability that he will get the upgrade is .75; otherwise, the probability that he will get the upgrade is .35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John didn’t receive an upgrade on his most recent attempt. What is the probability that he arrived late? 
C = John arrives late A = John did not receive an upgrade P(C) = Probability John arrives late = .6 P(A) = Probability John did not receive an upgrade = 1 –( .4 x .75 + .6 x .35) = 1 -.51 = .49 P(A|C) = Probability that John did not receive an upgrade given that he arrived late = 1 -.35 = .65 P(C|A) = Probability that John arrived late given that he did not receive his upgrade = P(A|C)P(C)/P(A) = (.65 x .6)/.49 = .80 (approx) In this simple example, C can take one of two possible values {arriving early, arriving late) and there is only one attribute which can take one of two possible values {received upgrade, did not receive upgrade}. Next, we will generalize Bayes’ Law to multiple attributes and apply the naïve independence assumptions. 7 Module 4: Analytics Theory/Methods) ( ) ( ) | ( ) ( ) ( ) | ( A P C P C A P A P C A P A C P    Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. • For observed attributes A = (a 1, a 2, … a m), we want to compute and assign the classifier, Ci, with the largest P(C i|A) • Two simplifications to the calculations  Apply naïve assumption -each aj is conditionally independent of each other, then  Denominator P(a 1,a2,…a m) is a constant and can be ignored Apply the Naïve Assumption and Remove a Constant 9 Module 4: Analytics Theory/Methods The general approach is to assign the classifier label, Ci, to the object with attributes A = (a 1, a 2, … a m) that corresponds to the largest value of P(Ci|A ). The probability that a set of attribute values A (comprised of m variables a 1thru a m) should be labeled with a classification C iwill equal the probability that of the set of variables a1thru a m given Ci is true, times the probability of Ciall divided by the probability of the set of attribute values a 1thru a m . The conditional independence assumption is that the probability of observing the value of a particular attribute given Ciis independent of the other attributes. This naïve assumption simplifies the calculation of P(a 1, a 2, …, am|C i) as shown on the slide. Since P(a 1, a 2, …, a m) appears in the denominator of P(Ci|A ), for all values of i, removing the denominator will have no impact on the relative probability scores and will simplify calculations. Next, these two simplifications to the calculations will be applied to build the Naïve Bayesian Classifier. 9 Module 4: Analytics Theory/Methodsn i a a a P C P C a a a P A C P m i i m i ,...,2 ,1 ) ,..., , ( ) ( ) | ,..., , ( ) | ( 2 1 2 1      m j i j i m i i i m C a P C a P C a P C a P C a a a P 1 2 1 2 1 ) | ( ) | ( ) | ( ) | ( ) | ,..., , (  Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Building a Naïve Bayesian Classifier • Applying the two simplifications • To build a Naïve Bayesian Classifier, collect the following statistics from the training data:  P(C i) for all the class labels.  P(a j| C i) for all possible a jand C i  Assign the classifier label, C i, that maximizes the value of 10 Module 4: Analytics Theory/Methods Applying the two simplifications, P(C i|a 1, a 2, …, a m) is proportional to the product of the various P( aj|C i), for j=1,2,…m, times P(C i). From a training dataset, these probabilities can be computed and stored for future classifier assignments. We now return to the credit applicant example. 
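Before returning to the credit data, here is a quick R check of the upgrade example above, using only the probabilities stated there; it is just the Bayes' Law arithmetic written out.
# Verify the upgrade example: P(C|A) = P(A|C) * P(C) / P(A)
p_late       <- 0.60                               # P(C)
p_no_upgrade <- 1 - (0.4 * 0.75 + 0.6 * 0.35)      # P(A) = 0.49
p_no_up_late <- 1 - 0.35                           # P(A|C) = 0.65
p_no_up_late * p_late / p_no_upgrade               # P(C|A), roughly 0.80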
10 Module 4: Analytics Theory/Methodsn i C P C a P a a a C P i m j i j m i ,...,2 ,1 ) ( ) | ( ) ,..., , | ( 1 2 1          n i C P C a P i m j i j ,...,2 ,1 ) ( ) | ( 1         Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Naïve Bayesian Classifiers for the Credit Example • Class labels: {good, bad}  P(good) = 0.7  P(bad) = 0.3 • Conditional Probabilities  P(own|bad ) = 0.62  P(own|good) = 0.75  P(rent|bad ) = 0.23  P(rent|good) = 0.14  … and so on 11 Module 4: Analytics Theory/Methods To build a Naïve Bayesian Classifier we need to collect the following statistics: 1. Probability of all class labels –Probability of good credit and probability of bad credit. From the all data available in the training set we determine P(good) = 0.7 and P(bad) = 0.3 2. In the training set, there are several attributes: personal_status, job, housing, and saving_status. For each attribute and its possible values, we need to compute the conditional probabilities given bad or good credit. For example, relative to the housing attribute, we need to compute P( own|bad ), P( own|good ), P( rent|bad ), P( rent|good ), etc. 11 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Naïve Bayesian Classifier for a Particular Applicant • Given applicant attributes of A= {female single, owns home, self -employed, savings > $1000} • Since P(good|A) > (bad|A), assign the applicant the label “good” credit aj Ci P(a j| C i) female single good 0.28 female single bad 0.36 own good 0.75 own bad 0.62 self emp good 0.14 self emp bad 0.17 savings>1K good 0.06 savings>1K bad 0.02 P(good|A) ~ (0.28*0.75*0.14*0.06)*0.7 = 0.0012 P(bad|A) ~ (0.36*0.62*0.17*0.02)*0.3 = 0.0002 12 Module 4: Analytics Theory/Methods Here we have an example of an applicant who is female, single, owns a home, is self -employed and has savings over $1000 in her savings account. How will we classify this person? Will she be scored as a person with good or bad credit? Having built the classifier with the training set we find P( good|A ) which is equal to 0.0012 (see the computation on the slide) and P( bad|A ) is 0.0002. Since P( good|A ) is the maximum of the two probability scores, we assign the label “good” credit. The score is only proportional to the probability. It doesn’t equal the probability, because we haven’t included the denominator. However, both formulas have the same denominator, so we don’t need to calculate it in order to know which quantity is bigger. Notice, though, how small in magnitude these scores are. When we are looking at problems with a large number of attributes, or attributes with a very high number of levels, these values can become very small in magnitude. 12 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
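The hand computation above can be reproduced with a few lines of R. The probability vectors below simply restate the values from the table, so this is a sketch of the scoring arithmetic rather than a full classifier.
# Multiply the conditional probabilities for the applicant's attributes by the class priors
p_good <- c(fem_single = 0.28, own = 0.75, self_emp = 0.14, savings_gt_1k = 0.06)
p_bad  <- c(fem_single = 0.36, own = 0.62, self_emp = 0.17, savings_gt_1k = 0.02)
score_good <- prod(p_good) * 0.7                   # ~0.0012
score_bad  <- prod(p_bad)  * 0.3                   # ~0.0002
which.max(c(good = score_good, bad = score_bad))   # assign the label with the larger score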
Naïve Bayesian Implementation Considerations • Numerical underflow  Resulting from multiplying several probabilities near zero  Preventable by computing the logarithm of the products • Zero probabilities due to unobserved attribute/classifier pairs  Resulting from rare events  Handled by smoothing (adjusting each probability by a small amount) • Assign the classifier label, C i, that maximizes the value of 13 Module 4: Analytics Theory/Methods where i= 1,2,…,n and P’denotes the adjusted probabilities Multiplying several probability values, each possibly close to zero, invariably leads to the problem of numerical underflow. So an important implementation guideline is to compute the logarithm of the product of the probabilities, which is equivalent to the summation of the logarithm of the probabilities. Although the risk of underflow may increase as the number of attributes increase, the use of logarithms should be applied regardless of the number of attribute dimensions. Additionally, to address the possibility of probabilities equal to zero, smoothing techniques can be employed to adjust the probabilities to ensure non -zero values. Applying a smoothing technique assigns a small non -zero probability to rare events not included in the training dataset. Also, the smoothing addresses the possibility of taking the logarithm of zero. The R implementation of Naïve Bayes incorporates the smoothing directly into the probability tables. Essentially, the Laplace smoothing that R uses adds one (or a small value) to every count. For example, if we have 100 “good” customers, and 20 of them rent their housing, the “raw” P(rent | good) = 20/100 = 0.2; with Laplace smoothing add adding one to the counts, the calculation would be P(rent | good) ~ (20 + 1)/(100+3) = 0.20388, where there are 3 possible values for housing (own, rent, for free). Fortunately, the use of the logarithms and the smoothing techniques are already implemented in standard software packages for Naïve Bayes Classifiers. However, if for performance reasons, the Naïve Bayes Classifier algorithm needs to be coded directly into an application, these considerations should be implemented. 13 Module 4: Analytics Theory/Methods ) ( log ) | ( log 1 i m j i j C P C a P         Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Diagnostics • Hold -out data  How well does the model classify new instances? • Cross -validation • ROC curve/AUC 14 Module 4: Analytics Theory/Methods The diagnostics we used in regression can be used to validate the effectiveness of the model we built. The technique of using the hold -out data and performing N -fold cross validations and using the ROC/Area Under the Curve methods can be deployed with Naïve Bayesian Classifier as well. 14 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
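As a concrete illustration of these points, the following hedged sketch fits a Naïve Bayes model with the naiveBayes() function from the e1071 package, one common R implementation; its laplace argument applies the smoothing described above. The credit data frame here is simulated and purely hypothetical, so the fitted probabilities carry no real meaning.
# install.packages("e1071")   # if necessary
library(e1071)
set.seed(7)
credit <- data.frame(
  housing      = factor(sample(c("own", "rent", "for free"), 200, replace = TRUE)),
  savings      = factor(sample(c("<=1K", ">1K"), 200, replace = TRUE)),
  credit_class = factor(sample(c("good", "bad"), 200, replace = TRUE, prob = c(0.7, 0.3)))
)
model <- naiveBayes(credit_class ~ housing + savings, data = credit, laplace = 1)
model                                              # a priori and conditional probability tables
pred <- predict(model, credit)                     # class label for each record
table(predicted = pred, actual = credit$credit_class)   # a simple confusion matrix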
Prediction Actual Class good bad good 671 29 700 bad 38 262 300 709 291 1000 Diagnostics: Confusion Matrix Overall success rate (or accuracy): (TP + TN) / (TP+TN+FP+FN) = (671+262)/1000 ≈ 0.93 TPR: TP / (TP + FN) = 671 / (671+29) = 671/700 ≈ 0.96 FPR: FP / (FP + TN) = 38 / (38 + 262) = 38/300 ≈ 0.13 FNR: FN / (TP + FN) = 29 / (671 + 29) = 29/700 ≈ 0.04 Precision: TP/ (TP + FP) = 671/709 ≈ 0.95 Recall (or TPR): TP / (TP + FN) ≈ 0.96 false negatives (FN) false positives (FP) 15 Module 4: Analytics Theory/Methods true positives (TP) true negatives (TN) A confusion matrix is a specific table layout that allows visualization of the performance of a model. In the hypothetical example of confusion matrix shown: Of 1000 credit score samples, the system predicted that there were good and bad credit, and of the 700 good credits, the model predicted 29 as bad and similarly 38 of the actual bad credits were predicted as good. All correct guesses are located in the diagonal of the table, so it’s easy to visually inspect the table for errors, as they will be represented by any non -zero values outside the diagonal. We define overall success rate (or accuracy) as a metric defining –what we got right -which is the ratio between the sum of the diagonal values (i.e., TP and TN) vs. the sum of the table. In other words, the confusion table of a good model has large numbers diagonally and small (ideally zero) numbers off – diagonally. We saw a true positive rate (TPR) and a false positive rate (FPR) when we discussed ROC curves: •TPR –what percent of positive instances did we correctly identify. •FPR –what percent of negatives we marked positive. Additionally we can measure the false negative rate (FNR): •FNR –what percent of positives we marked negative The computation of TPR, FPR and FNR are shown in the slide. Precision and Recall are accuracy metrics used by the information retrieval community; they are often used to characterize classifiers as well. We will detail these metrics in lesson 8 of this module. Note: •precision –what percent of things we marked positive really are positive •recall –what percent of positive instances did we correctly identify. Recall is equivalent to TPR. 15 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Reasons to Choose (+) Cautions ( -) Handles missing values quite well Numeric variables have to be discrete (categorized) Intervals Robust to irrelevant variables Sensitive to correlated variables “Double -counting” Easy to implement Not good for estimating probabilities Stick to class label or yes/no Easy to score data Resistant to over -fitting Computationally efficient Handles very high dimensional problems Handles categorical variables with a lot of levels Naïve Bayesian Classifier -Reasons to Choose (+) and Cautions ( -) Module 4: Analytics Theory/Methods 16 The Reasons to Choose (+) and Cautions ( -) of the Naïve Bayesian Classifier are listed. Unlike Logistic regression, missing values are handled well by the Naïve Bayesian Classifier. It is also very robust to irrelevant variables (irrelevant variables are distributed among all the classes and their effects are not pronounced). The model is easy to implement and we will see how easily a basic version can be implemented in the lab without using any packages. Scoring data (predicting) is very simple and the model is resistant to over fitting. 
(Over fitting refers to fitting training data so well that we fit the idiosyncrasies such as the data that are not relevant in characterizing the data). It is computationally efficient and handles high dimensional problems efficiently. Unlike logistic regression Naïve Bayesian Classifier handles categorical variables with a lot of levels. The Cautions ( -) are that it is sensitive to correlated variables as the algorithm double counts the effect of the correlated variables. For example people with low income tend to default and people with low credit tend to default. It is also true that people with low income tend to have low credit. If we try to score “default” with both low income and low credit as variables we will see the double counting effect in our model output and in the scoring. Though the probabilities are provided as an output of the scored data, Naïve Bayesian Classifier is not very reliable for the probability estimation and should be used for class label assignments only. Naïve Bayesian Classifier in its simple form is used only with categorical variables and any continuous variables should be rendered discrete into intervals. You will learn more about this in the lab. However it is not necessary to have the continuous variables as “discrete” and several standard implementations can handle continuous variables as well. Module 4: Analytics Theory/Methods 16 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Check Your Knowledge 1. Consider the following Training Data Set: • Apply the Naïve Bayesian Classifier to this data set and compute the probability score for P(y = 1|X) for X = (1,0,0) Show your work 2. List some prominent use cases of the Naïve Bayesian Classifier. 3. What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive? 4. Why should we use log -likelihoods rather than pure probability values in the Naïve Bayesian Classifier? Training Data Set Your Thoughts? 17 Module 4: Analytics Theory/Methods Record your answers here. More Check Your Knowledge questions are on the next page. 17 Module 4: Analytics Theory/MethodsX1 X2 X3 Y 1 1 1 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Check Your Knowledge (Continued) 5. What is a confusion matrix and how it is used to evaluate the effectiveness of the model? 6. Consider the following data set with two input features temperature and season • What is the Naïve Bayesian assumption? • Is the Naïve Bayesian assumption satisfied for this problem? Your Thoughts? 18 Module 4: Analytics Theory/Methods Record your answers here. 18 Module 4: Analytics Theory/MethodsTemperature Season Electricty Usage -10 to 50 F Winter High 50 to 70 F Winter Low 70 to 85 F Summer Low 85 to 110 F Summer High Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics were covered: •Naïve Bayesian Classifier •Theoretical foundations of the classifier •Use cases •Evaluating the effectiveness of the classifier •The Reasons to Choose (+) and Cautions ( -) with the use of the classifier Naïve Bayesian Classifiers -Summary Module 4: Analytics Theory/Methods 19 This lesson covered these topics. Please take a moment to review them. Module 4: Analytics Theory/Methods 19
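Returning to the hypothetical confusion matrix shown earlier in this lesson, the listed metrics can be computed in R directly from the table of counts; this is only a sketch of the arithmetic.
# Rebuild the 2x2 table of counts and compute the rates
cm <- matrix(c(671, 38, 29, 262), nrow = 2,
             dimnames = list(actual = c("good", "bad"), predicted = c("good", "bad")))
TP <- cm["good", "good"]; TN <- cm["bad", "bad"]
FP <- cm["bad", "good"];  FN <- cm["good", "bad"]
c(accuracy  = (TP + TN) / sum(cm),                 # ~0.93
  TPR       = TP / (TP + FN),                      # ~0.96
  FPR       = FP / (FP + TN),                      # ~0.13
  FNR       = FN / (TP + FN),                      # ~0.04
  precision = TP / (TP + FP))                      # ~0.95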
Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics -Theory and Methods 1 Module 4: Analytics Theory/Methods 1 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Advanced Analytics – Theory and Methods During this lesson the following topics are covered: •Time Series Analysis and its applications in forecasting •ARMA and ARIMA Models •Implementing the Box -Jenkins Methodology using R •Reasons to Choose (+) and Cautions ( -) with Time Series Analysis Time Series Analysis Module 4: Analytics Theory/Methods 2 The topics covered in this lesson are listed. ARIMA and Box -Jenkins methodology are explained in the following slides. Module 4: Analytics Theory/Methods 2 Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Time Series Analysis • Time Series: Ordered sequence of equally spaced values over time • Time Series Analysis: Accounts for the internal structure of observations taken over time  Trend  Seasonality  Cycles  Random • Goals  To identify the internal structure of the time series  To forecast future events  Example: Based on sales history, what will next December sales be? • Method: Box -Jenkins (ARMA) 3 Module 4: Analytics Theory/Methods Businesses perform sales forecasting to look ahead in order to plan their investments, launch new products, decide when to close or withdraw products, etc. The sales forecasting process is a critical one for most businesses. Part of the sales forecasting process is to examine the past. How well did we do in the last few months or what were our sales in the same time period for the last few years? Time Series Analysis provides a scientific methodology for sales forecasting. Time Series Analysis is the analysis of sequential data across equally spaced units of time. Time Series is a basic research methodology in which data for one or more variables are collected for many observations at different time periods. The main objectives in Time Series Analysis are: •To understand the underlying structure of the time series by breaking it down to its components. •Fit a mathematical model and then proceed to forecast the future The time periods are usually regularly spaced and the observations may be either univariate or multivariate. Univariate time series are those where only one variable is measured over time, whereas multivariate time series are those, where multiple variables are measured simultaneously. The internal structure of the data may specify a trend, seasonality or cycles: 3 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. •Models historical behavior to forecast the future •Applies ARMA (Autoregressive Moving Averages)  Input : Time Series  Accounting for Trends and Seasonality components  Output : Expected future value of the time series Box -Jenkins Method: What is it? 5 Module 4: Analytics Theory/Methods Box -Jenkins methodology developed by Professors G.E.P. Box and G.M. Jenkins, enables the forecasting with time series data with both high accuracy and low computational requirements. The technique may be applied to quickly determine forecasts that are as uncomplicated in form as the simple smoothing methods, or that involve a number of economic variables. 
In either case, use of this technique enables efficient utilization of other predictive information contained in the data. It offers assurance of obtaining the highest forecasting accuracy possible in terms of the variables on which the forecast is based. The input for the model is the trend and seasonality adjusted time series and the output is the expected future value of the time series. Box Jenkins Methodology applies autoregressive moving average ARMA models to find the best fit of a time series to past values of this time series, in order to make forecasts. 5 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Use Cases Forecast: • Next month’s sales • Tomorrow’s stock price • Hourly power demand 6 Module 4: Analytics Theory/Methods The key application of Time Series Analysis is in forecasting. Economic and business planning, inventory and production control of industrial processes are some of the key applications in which time series analysis is deployed. Time Series data provide useful information about the physical, biological, social or economic systems generating the time series, such as: Economics/ Finance: share prices, profits, imports, exports, stock exchange indices Sociology: school enrollments, unemployment, crime rate Environment: Amount of pollutants, such as suspended particulate matter (SPM), in the environment Meteorology: Rainfall, temperature, wind speed Epidemiology: Number of SARS cases over time Medicine: Blood pressure measurements over time for evaluating drugs to control hypertension 6 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Modeling a Time Series • Let’s model the time series as Yt=T t+S t+R t, t=1,…,n. • Tt: Trend term  Air travel steadily increased over the last few years • St: The seasonal term  Air travel fluctuates in a regular pattern over the course of a year • Rt: Random component  To be modeled with ARMA 7 Module 4: Analytics Theory/Methods We present a simple model for the time series with the trend, seasonality and a random fluctuation. There is sometimes a low frequency cyclic term as well, but we are ignoring that for simplicity. Examples of trend and seasonality are also detailed in the slide 7 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Stationary Sequences •Box -Jenkins methodology assumes the random component is a stationary sequence  Constant mean  Constant variance  Autocorrelation does not change over time  Constant correlation of a variable with itself at different times • In practice, to obtain a stationary sequence, the data must be:  De -trended  Seasonally adjusted 8 Module 4: Analytics Theory/Methods A stationary sequence is a random sequence in which the joint probability distribution does not vary over time. In other words the mean, variance and auto correlations do not change in the sequence over time. In order to render a sequence stationary we need to remove the effects of trend and seasonality. The ARIMA model (implemented with Box Jenkins) uses the method of differencing to render the data stationary. 8 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. 
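A minimal R sketch of these first steps on a simulated monthly series (all numbers here are made up): create a ts object, plot it, and difference it to see whether the differenced series looks stationary.
# Simulate ten years of monthly data with a trend and a seasonal pattern
set.seed(123)
n <- 120
y <- ts(100 + 0.5 * (1:n) +                        # linear trend
        10 * sin(2 * pi * (1:n) / 12) +            # yearly seasonality
        rnorm(n, sd = 3),                          # random component
        start = c(2005, 1), frequency = 12)
plot(y)                                            # trend and seasonality are visible
plot(diff(y)); abline(h = 0)                       # differenced series should look more stationary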
De -trending • In this example, we see a linear trend, so we fit a linear model  Tt= m·t + b • The de -trended series is then  Y1t= Y t–Tt • In some cases, may have to fit a non -linear model  Quadratic  Exponential 9 Module 4: Analytics Theory/Methods Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation. De -trending is a pre -processing step to prepare time series for analysis by methods that assume stationarity. A simple linear trend can be removed by subtracting a least -squares -fit straight line. In the example shown we fit a linear model and obtain the difference. The graph shown next is a de – trended time series. More complicated trends might require different procedures such a fitting a non -linear model such as a quadratic or a exponential model. Use a Linear Trend Model if the first differences are more or less constant [ (y2-y1) = (y 3-y2) = ……. = (y n-yn-1) ] Use a Quadratic Trend Model if the second differences are more or less constant. [ (y3-y2) – (y2-y1) = ………= (y n-yn-1)-(yn-1-yn-2) ] Use an Exponential Trend Model if the percentage differences are more or constant. [ ( (y2-y1) /y 1) * 100% = …….((y n-yn-1)/y n-1 ) * 100% 9 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. Seasonal Adjustment • Plotting the de -trended series identifies seasons  For CO2 concentration, we can model the period as being a year, with variation at the month level • Simple ad -hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y 1t Y2t= Y 1t– St 10 Module 4: Analytics Theory/Methods Unlike the trend and cyclical components, seasonal components, theoretically, happen with similar magnitude during the same time period each year. The holiday sales spike is an example of seasonality. By removing the seasonal component, it is easier to focus on other components. The seasonal component of a series typically makes the interpretation of a series more difficult. A simple adjustment for seasonality is done with taking several years of data, calculating average value for each month and subtracting them from the actual value. 10 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. ARMA(p, q) Model • The simplest Box -Jenkins Model  Ytis de -trended and seasonally adjusted • Combination of two process models  Autoregressive : Y tis a linear combination of its last pvalues  Moving average : Y tis a constant value plus the effects of a dampened white noise process over the last q time values (lags) 11 Module 4: Analytics Theory/Methods Autoregressive (AR) models can be coupled with moving average (MA) models to form a general and useful class of time series models called Autoregressive Moving Average (ARMA) models. This is the simplest Box -Jenkins model. AR model predicts Ytas a linear combination of its last p values. An autoregressive model is simply a linear regression of the current value of the series on one or more prior values of the same series. Several options are available for analyzing autoregressive models, including standard linear least squares techniques. They also have a straightforward interpretation. The time series Ytis called an autoregressive process of order p and is denoted as AR(p) process. 
A moving average (MA) model adds to Ytthe effects of a dampened white noise process over the last q steps. The simple moving average is one of the most basic of the forecasting methods. Moving backwards in time, minus 1, minus, 2, minus 3 and so forth until we have n data points, divide the sum of those points by the number of data points, n, and that gives you the forecast for the next period. So it’s called a single moving average or simple moving average. The forecast is simply a constant value that projects the next time period. “n” is also the order of the moving averages. moving average: like a random walk, or brownian motion 11 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. ARIMA(p, d, q) Model • ARIMA adds a differencing term, d, to the ARMA model  Autoregressive Integrated Moving Average  Includes the de -trending as part of the model  linear trend can be removed by d=1  quadratic trend by d=2  and so on for higher order trends • The general non -seasonal model is known as ARIMA ( p, d, q ):  pis the number of autoregressive terms  dis the number of differences  qis the number of moving average terms 12 Module 4: Analytics Theory/Methods ARMA models can be used when the series is weakly stationary ; in other words, the series has a constant variance around a constant mean.. This class of models can be extended to non – stationary series by allowing the differencing of the data series. These are called Autoregressive Integrated Moving Average(ARIMA) models. There are a large variety of ARIMA models. ARIMA –difference the Ytd times to “induce stationarity”. d is usually 1 or 2. “I” stands for integrated –the outputs of the model are summed up (or “integrated”) to recover Yt The general ARIMA (p, d, q) model gives a tremendous variety of patterns in the ACF and PACF, so it is not practical to state rules for identifying general ARIMA models. In practice, it is seldom necessary to deal with values of p, d, or q other than 0, 1, or 2. It is remarkable that such a small range of values for p, d, or q can cover such a large range of practical forecasting situations. 12 Module 4: Analytics Theory/Methods Copyright © 2014 EMC Corporation. All rights reserved. Copyright © 2014 EMC Corporation. All Rights Reserved. ACF & PACF • Auto Correlation Function (ACF)  Correlation of the values of the time series with itself  Autocorrelation “carries over”  Helps to determine the order, q, of a MA model  Where does ACF go to zero? • Partial Auto Correlation Function (PACF)  An autocorrelation calculated after removing the linear dependence of the previous terms  Helps to determine the order, p, of an AR model  Where does PACF go to zero? 13 Module 4: Analytics Theory/Methods A common assumption in many time series techniques is that the time series is stationary. A stationary process has the property that the mean, variance and autocorrelation structure do not change over time. An ACF plot provides an indication of the stationarity of the data. If the time series is not stationary, we can often transform it to stationarity with the simple technique of differencing. It should be noted that the autocorrelation carries over; if Ytis correlated with Y t-1, it is also correlated with Y t-2(though to a lesser degree). PACF -The partial autocorrelation at lag kis the autocorrelation between Ytand Yt-kthat is not accounted for by lags 1 through k-1. 
ACF & PACF

• Auto-Correlation Function (ACF)
  - Correlation of the values of the time series with itself
  - Autocorrelation "carries over"
  - Helps determine the order, q, of an MA model: where does the ACF go to zero?
• Partial Auto-Correlation Function (PACF)
  - An autocorrelation calculated after removing the linear dependence of the previous terms
  - Helps determine the order, p, of an AR model: where does the PACF go to zero?

A common assumption in many time series techniques is that the series is stationary: a stationary process has the property that its mean, variance, and autocorrelation structure do not change over time. An ACF plot provides an indication of the stationarity of the data; if the series is not stationary, we can often transform it to stationarity with the simple technique of differencing. Note that autocorrelation carries over: if Yt is correlated with Yt-1, it is also correlated with Yt-2, though to a lesser degree. The partial autocorrelation at lag k is the autocorrelation between Yt and Yt-k that is not accounted for by lags 1 through k-1. One looks for the point on the plot where the partial autocorrelations for all higher lags are essentially zero. We will look at ACF and PACF graphs in the next lab.

Model Selection

• Based on the data, the data scientist selects p, d, and q
  - An "art form" that requires domain knowledge, modeling experience, and a few iterations
  - Use a simple model when possible: an AR model (q = 0) or an MA model (p = 0)
• Multiple models need to be built and compared, using the ACF and PACF

Identification of the most appropriate model is the most important part of the process, and it is as much art as science. The first step is to determine whether the time series is stationary; this can be done with a correlogram, that is, plots of the ACF and PACF. If the series is not stationary, it needs to be first-differenced (and it may need to be differenced again to induce stationarity). The next stage is to determine p and q in the ARIMA(p, d, q) model, where d refers to how many times the data must be differenced to produce a stationary series. In the diagnostic stage we assess the model's adequacy by checking whether the model assumptions are satisfied; if the model is inadequate, this stage provides information for re-identifying the model. We also check the normality, constant-variance, and independence assumptions for the residuals.

Time Series Analysis - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Minimal data collection: only the series itself has to be collected, and no drivers need to be input
• Designed to handle the inherent autocorrelation of lagged time series
• Accounts for trends and seasonality

Cautions (-):
• No meaningful drivers: prediction is based only on past performance
• No explanatory value: cannot run "what-if" scenarios or stress tests
• Selecting appropriate parameters is an "art form"
• Only suitable for short-term predictions

Time Series Analysis is not a common "tool" in a Data Scientist's tool kit. Although the models require minimal data collection and handle the inherent autocorrelation of lagged time series, they do not produce meaningful drivers for the prediction, and selecting (p, d, q) appropriately is not straightforward: a thorough understanding of the domain and a detailed analysis of trend and seasonality may be required. Further, this method is suitable only for short-term predictions.

Time Series Analysis with R

• The function ts() is used to create time series objects:
  mydata <- ts(mydata, start = c(1999, 1), frequency = 12)
• Visualize the data: plot(mydata)
• De-trend using differencing: diff(mydata)
• Examine the ACF and PACF:
  - acf(mydata): computes and plots estimates of the autocorrelations
  - pacf(mydata): computes and plots estimates of the partial autocorrelations

These are the important R functions and commands we will be using.
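Putting the functions listed above together, the sketch below walks through the basic diagnostic workflow: plot the series, difference it, and inspect the ACF and PACF of the differenced series. It uses R's built-in monthly Mauna Loa CO2 series (the co2 dataset) as a convenient stand-in, since it is not part of the module's lab data; if your own data is a plain vector, wrap it with ts() first as shown on the slide.

# minimal sketch of the plot / diff / acf / pacf workflow
data(co2)                                   # built-in monthly CO2 concentration series
class(co2)                                  # already a "ts" object with frequency 12

plot(co2)                                   # visualize level, trend, and seasonality

d1 <- diff(co2)                             # first difference removes the linear trend
plot(d1); abline(a = 0, b = 0)              # mean of the differenced series is near 0

acf(d1, lag.max = 48, main = "")            # where does the ACF die out?  (suggests q)
pacf(d1, lag.max = 48, main = "")           # where does the PACF die out? (suggests p)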
Other Useful R Functions in Time Series Analysis

• ar(): fits an autoregressive time series model to the data
• arima(): fits an ARIMA model
• predict(): makes predictions; predict() is a generic function for predictions from the results of various model-fitting functions, and it invokes the particular method appropriate to the class of its first argument
• arima.sim(): simulates a time series from an ARIMA model
• decompose(): decomposes a time series into seasonal, trend, and irregular components using moving averages; handles an additive or multiplicative seasonal component
• stl(): decomposes a time series into seasonal, trend, and irregular components using loess

These additional R time series functions will also be used in the lab. (A short sketch using decompose() and stl() appears at the end of this module.)

Check Your Knowledge

1. What is a time series and what are the key components of a time series?
2. How do we "de-trend" time series data?
3. What makes data stationary?
4. How is seasonality removed from the data?
5. What are the modeling parameters in ARIMA?
6. How do you use the ACF and PACF to determine the "stationarity" of time series data?

Record your answers here.

Time Series Analysis - Summary

During this lesson the following topics were covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA models
• Implementing the Box-Jenkins methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

Please take a moment to review these topics.
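As a follow-up to the decompose() and stl() functions listed under "Other Useful R Functions", here is a minimal sketch that splits the built-in co2 series into trend, seasonal, and remainder components. It is an illustration only, using the same built-in dataset as the earlier sketch rather than the module's lab data, and s.window = "periodic" is one reasonable choice, not a required setting.

# minimal sketch: classical and loess-based decomposition of the co2 series
data(co2)

# moving-average based decomposition (additive seasonal component)
dec <- decompose(co2)
plot(dec)                        # observed, trend, seasonal, random panels

# loess-based decomposition; s.window = "periodic" forces a fixed seasonal pattern
fit_stl <- stl(co2, s.window = "periodic")
plot(fit_stl)
head(fit_stl$time.series)        # columns: seasonal, trend, remainder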



