




Chapter 3: Linear Regression
Data Mining and Statistical Computing, 8/27/2022

Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

Recall the Advertising data
1. Is there a relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?

3.1 Simple Linear Regression
Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.

$Y \approx \beta_0 + \beta_1 X$

You might read "$\approx$" as "is approximately modeled as".

3.1.1 Estimating the Coefficients
We define the residual sum of squares (RSS) as

$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$

where $e_i = y_i - \hat{y}_i$ is the $i$th residual.

3.1.2 Assessing the Accuracy of the Coefficient Estimates
The model $Y = \beta_0 + \beta_1 X + \epsilon$ defines the population regression line, which is the best linear approximation to the true relationship between X and Y.

The least squares regression coefficient estimates characterize the least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.

The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In other words, in real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.

Unbiasedness
The property of unbiasedness holds for the least squares coefficient estimates: if we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates won't be exactly equal to $\beta_0$ and $\beta_1$. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!

Variance
We can wonder how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values $\beta_0$ and $\beta_1$.

For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form

$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$

Similarly, a confidence interval for $\beta_0$ approximately takes the form

$\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$

3.1.3 Assessing the Accuracy of the Model
Residual Standard Error
$R^2$ Statistic

3.2 Multiple Linear Regression
Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model.

3.2.1 Estimating the Regression Coefficients

3.2.2 Some Important Questions
1. Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is There a Relationship Between the Response and Predictors?

Two: Deciding on Important Variables
- Forward selection.
- Backward selection.
- Mixed selection.
We can then select the best model out of all of the models that we have considered. How do we determine which model is best? Various statistics can be used to judge the quality of a model. These include Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted $R^2$.

Three: Model Fit
Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

Four: Predictions
There are three sorts of uncertainty associated with this prediction:
1. The inaccuracy in the coefficient estimates is related to the reducible error.
2. There is an additional source of potentially reducible error, which we call model bias.
3. Even if the true coefficients were known, the response could not be predicted perfectly because of the random error $\epsilon$ in the model. We referred to this as the irreducible error. How much will Y vary from $\hat{Y}$? We use prediction intervals to answer this question.

3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
Predictors with Only Two Levels
Qualitative Predictors with More than Two Levels
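To make the coding of a qualitative predictor concrete, the sketch below shows how R turns a factor with three levels into two 0/1 dummy variables plus a baseline level. The data are simulated, and the names `balance` and `region` and the group means are placeholders chosen for illustration; they are not taken from the slides.

```r
# Simulated data; variable names and group means are assumptions for illustration.
set.seed(1)
n <- 90
region <- factor(rep(c("East", "South", "West"), each = n / 3))
group_mean <- c(East = 500, South = 520, West = 480)
balance <- unname(group_mean[as.character(region)]) + rnorm(n, sd = 10)

# model.matrix() shows the coding lm() will use: an intercept column plus
# one 0/1 dummy column for each non-baseline level of the factor.
X <- model.matrix(~ region)
print(colnames(X))

# Fit the regression on the qualitative predictor
fit <- lm(balance ~ region)
print(coef(fit))
```

With this coding the intercept estimates the mean response of the baseline level (East here), and each dummy coefficient estimates the difference between its level and the baseline.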

3.3.2 Extensions of the Linear Model
Removing the Additive Assumption
Non-linear Relationships

3.3.3 Potential Problems
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

1. Non-linearity of the Data
If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, and $X^2$, in the regression model.
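As a rough illustration of the transformation idea, the following sketch fits `medv` on `lstat` from the Boston data (the same variables used in the lab code of these slides) with and without transformed terms. The particular transformations compared here are an assumption made for the example, not a recommendation from the slides.

```r
library(MASS)  # provides the Boston data set

fit_lin  <- lm(medv ~ lstat, data = Boston)               # straight line
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)  # adds a squared term
fit_log  <- lm(medv ~ log(lstat), data = Boston)          # log-transformed predictor

# Compare the proportion of variance explained by each fit
r2 <- c(linear    = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared,
        log       = summary(fit_log)$r.squared)
print(round(r2, 3))

# F-test comparing the nested linear and quadratic models
print(anova(fit_lin, fit_quad))
```

The `I()` wrapper is needed because `^` has a special meaning inside an R formula; without it the squared term would not be added to the model.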

2. Correlation of Error Terms
If the errors are uncorrelated, then the fact that $\epsilon_i$ is positive provides little or no information about the sign of $\epsilon_{i+1}$. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors.

3. Non-constant Variance of Error Terms
One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

4. Outliers
An outlier is a point for which $y_i$ is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.
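One common way to screen for outliers is to look at studentized residuals. The sketch below plants a single artificial outlier in simulated data and flags it. The simulated model and the cutoff of 3 are assumptions made for illustration; the cutoff is a rule of thumb and is not stated in the slides.

```r
set.seed(42)
x <- seq(1, 10, length.out = 50)
y <- 2 + 3 * x + rnorm(50)   # true line plus unit-variance noise
y[25] <- y[25] + 15          # plant one artificial outlier

fit <- lm(y ~ x)
rs <- rstudent(fit)          # studentized residuals

# Flag observations whose studentized residual exceeds 3 in absolute value
# (rule-of-thumb cutoff, assumed here for illustration)
flagged <- which(abs(rs) > 3)
print(flagged)
```

A flagged point is only a candidate: whether to remove it depends on whether it reflects a recording error or a genuine deficiency of the model.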

5. High Leverage Points
We just saw that outliers are observations for which the response $y_i$ is unusual given the predictor $x_i$. In contrast, observations with high leverage have an unusual value for $x_i$.

6. Collinearity

Code: Simple Linear Regression

    library(MASS)   # provides the Boston data set
    library(ISLR)
    fix(Boston)     # opens the data in a spreadsheet-like editor
    names(Boston)
    attach(Boston)  # makes lstat and medv visible to the plot() calls below
    lm.fit = lm(medv ~ lstat, data = Boston)
    lm.fit
    confint(lm.fit)
    predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
    predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")

    plot(lstat, medv)
    abline(lm.fit)
    abline(lm.fit, lwd = 3)
    abline(lm.fit, lwd = 3, col = "red")
    plot(lstat, medv, col = "red")
    plot(lstat, medv, pch = 20)
    plot(lstat, medv, pch = "+")
    plot(1:20, 1:20, pch = 1:20)

    par(mfrow = c(2, 2))
    plot(lm.fit)
    plot(predict(
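The leverage idea from point 5 can also be checked numerically. The sketch below computes the leverage statistic for the same `medv ~ lstat` fit using `hatvalues()` and compares it with the closed-form expression that holds for simple linear regression. The object names `fit`, `h`, and `h_manual` are chosen here for illustration.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat, data = Boston)
h <- hatvalues(fit)   # leverage statistic for every observation

# Closed form for simple linear regression:
# h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2)
x <- Boston$lstat
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
print(all.equal(unname(h), h_manual))

# The average leverage always equals (p + 1) / n; here p = 1
print(c(mean_leverage = mean(h), expected = 2 / nrow(Boston)))

# Index of the observation with the largest leverage
print(which.max(h))
```

An observation whose leverage greatly exceeds the average $(p + 1)/n$ has an unusual predictor value and deserves a closer look.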