1 00:00:00,000 --> 00:00:00,480 2 00:00:00,480 --> 00:00:03,189 In this video I want to give you an example of what it 3 00:00:03,189 --> 00:00:06,169 means to fit data to a line. 4 00:00:06,169 --> 00:00:08,730 Instead of doing my traditional video using my 5 00:00:08,730 --> 00:00:11,839 little pen tablet, I'm going to do it straight on Excel so 6 00:00:11,839 --> 00:00:14,250 you could see how to do this for yourself, so if you have 7 00:00:14,250 --> 00:00:18,089 Excel or some other type of a spreadsheet program. 8 00:00:18,089 --> 00:00:19,379 We're not going to go into the math of it. 9 00:00:19,379 --> 00:00:21,539 I really just want you to get the conceptual understanding 10 00:00:21,539 --> 00:00:26,839 of what it means to fit data with line, or do a linear 11 00:00:26,839 --> 00:00:27,879 regression. 12 00:00:27,879 --> 00:00:29,549 So here, let's just read the problem. 13 00:00:29,550 --> 00:00:33,549 The following table shows the median California income-- 14 00:00:33,549 --> 00:00:37,189 remember median is the middle, the middle California income 15 00:00:37,189 --> 00:00:40,849 --from 1995 to 2002 as reported by 16 00:00:40,850 --> 00:00:42,679 the U.S. Census Bureau. 17 00:00:42,679 --> 00:00:45,320 Draw a scatter plot and find the equation. 18 00:00:45,320 --> 00:00:47,539 What would you expect the median annual income of a 19 00:00:47,539 --> 00:00:50,369 California family to be in the year 2010? 20 00:00:50,369 --> 00:00:52,969 What are the meanings of the slope and the y-intercept of 21 00:00:52,969 --> 00:00:53,579 this problem? 22 00:00:53,579 --> 00:00:56,049 So the first thing you'd want to do-- I just copied and 23 00:00:56,049 --> 00:00:59,059 pasted this image --we have to get the data in a form that 24 00:00:59,060 --> 00:01:00,920 the spreadsheet can understand it. 25 00:01:00,920 --> 00:01:03,829 So let's make some tables here. 26 00:01:03,829 --> 00:01:08,799 Let's say years since 1995. 27 00:01:08,799 --> 00:01:10,709 Let's make that one column. 28 00:01:10,709 --> 00:01:13,669 Let me make this a little bit wider. 29 00:01:13,670 --> 00:01:16,670 Then let me put median income. 30 00:01:16,670 --> 00:01:20,100 This is the median income in California for a family. 31 00:01:20,099 --> 00:01:23,559 So we start off with 1 year, or 0 years since 32 00:01:23,560 --> 00:01:28,100 1995, 0, 1, 2, 3, 4. 33 00:01:28,099 --> 00:01:32,709 Actually if you want, it'll figure out the trend if you 34 00:01:32,709 --> 00:01:33,589 just keep going down. 35 00:01:33,590 --> 00:01:36,650 It'll figure out you're just incrementing by 1. 36 00:01:36,650 --> 00:01:38,340 Then the income, I'll just copy in these 37 00:01:38,340 --> 00:01:39,299 numbers right there. 38 00:01:39,299 --> 00:01:54,709 So that's $53,807, $55,217, $55,209, $55,415 $63,100, 39 00:01:54,709 --> 00:02:08,680 $63,206, $63,761, and then we have $65,766. 40 00:02:08,680 --> 00:02:10,520 So I don't need these over here. 41 00:02:10,520 --> 00:02:13,380 So I'm going to get rid of them. 42 00:02:13,379 --> 00:02:14,810 I can clear them. 43 00:02:14,810 --> 00:02:16,180 So let me make sure I have enough entries. 44 00:02:16,180 --> 00:02:24,040 This is 1, 2, 3, 4, 5, 6, 7, 8, and I have 1, 2, 3, 4, 5, 45 00:02:24,039 --> 00:02:25,569 6, 7, 8 entries. 46 00:02:25,569 --> 00:02:29,930 I want to make sure I got my data right. $53,807, $55,217, 47 00:02:29,930 --> 00:02:34,510 $55,209, 415, 100, 206, 761, 766. 48 00:02:34,509 --> 00:02:36,039 OK, there we go. 49 00:02:36,039 --> 00:02:38,879 Now you're going to find that in Excel this is incredibly 50 00:02:38,879 --> 00:02:41,079 easy if you know what to click on. 51 00:02:41,080 --> 00:02:44,430 One, plot this data, create a scatter plot, and then even 52 00:02:44,430 --> 00:02:46,219 better, create a regression of that data. 53 00:02:46,219 --> 00:02:48,969 So all you have to do is you select the data. 54 00:02:48,969 --> 00:02:51,479 Then you go to insert, and I'm going to 55 00:02:51,479 --> 00:02:53,169 insert a scatter plot. 56 00:02:53,169 --> 00:02:54,389 Then you can pick the different 57 00:02:54,389 --> 00:02:56,159 types of scatter plots. 58 00:02:56,159 --> 00:02:58,819 I just want to plot the data. 59 00:02:58,819 --> 00:02:59,669 There you go. 60 00:02:59,669 --> 00:03:02,469 It plotted the data for me. 61 00:03:02,469 --> 00:03:03,050 There you go. 62 00:03:03,050 --> 00:03:07,320 If you go by this is the actual income, and this is by 63 00:03:07,319 --> 00:03:08,870 year since 1995. 64 00:03:08,870 --> 00:03:10,569 So this is 1995. 65 00:03:10,569 --> 00:03:13,709 It was $53,807. 66 00:03:13,710 --> 00:03:17,370 In 1996 it's $55,217. 67 00:03:17,370 --> 00:03:19,250 So it plotted all the data. 68 00:03:19,250 --> 00:03:22,110 Now what I want to do is fit a line. 69 00:03:22,110 --> 00:03:24,000 So this isn't exactly a line. 70 00:03:24,000 --> 00:03:27,580 But let's see, if we assume that a line can model this 71 00:03:27,580 --> 00:03:31,500 data well, I'm going to get Excel to fit a line for me. 72 00:03:31,500 --> 00:03:34,580 So what I can do is I have all of these options up here for 73 00:03:34,580 --> 00:03:37,719 different ways to fit a line, all of 74 00:03:37,719 --> 00:03:38,710 these different options. 75 00:03:38,710 --> 00:03:39,780 I'm going to pick this one here. 76 00:03:39,780 --> 00:03:40,939 You might not be able to see it. 77 00:03:40,939 --> 00:03:42,460 It looks like it has a line between dots. 78 00:03:42,460 --> 00:03:45,450 It also has fx which tells me going to tell me the equation 79 00:03:45,449 --> 00:03:46,469 of the line. 80 00:03:46,469 --> 00:03:49,520 So if I click on that, there you go. 81 00:03:49,520 --> 00:03:52,570 It not only fit, it replotted that same data 82 00:03:52,569 --> 00:03:54,530 on a different graph. 83 00:03:54,530 --> 00:03:55,830 Let me make it a little bit bigger. 84 00:03:55,830 --> 00:03:58,430 85 00:03:58,430 --> 00:03:59,930 No, I don't want to that. 86 00:03:59,930 --> 00:04:01,060 Let me make it a little bit bigger. 87 00:04:01,060 --> 00:04:04,430 We can cover up the data now, just because I think we know 88 00:04:04,430 --> 00:04:06,409 what's going on. 89 00:04:06,409 --> 00:04:09,460 So let me cover it up right like that. 90 00:04:09,460 --> 00:04:12,620 So not only did it plot the various data points, it 91 00:04:12,620 --> 00:04:16,028 actually fit a line to that data and it gave me the 92 00:04:16,028 --> 00:04:17,355 equation of that line. 93 00:04:17,355 --> 00:04:21,278 94 00:04:21,278 --> 00:04:22,975 Let me see if I can make this a little bit bigger. 95 00:04:22,975 --> 00:04:27,330 96 00:04:27,329 --> 00:04:30,909 I'll move it out of the way so you can read it at least. 97 00:04:30,910 --> 00:04:33,400 So it tells me right here, that the equation for this 98 00:04:33,399 --> 00:04:40,399 line is y is equal to 1,882.3x plus 52,847. 99 00:04:40,399 --> 00:04:43,250 So if you remember what we know about slope and 100 00:04:43,250 --> 00:04:50,660 y-intercept, the y-intercept is 52,847, which is, if you 101 00:04:50,660 --> 00:04:55,000 use this line as your measure, where this line intersects at 102 00:04:55,000 --> 00:04:57,470 year 0, or in 1995. 103 00:04:57,470 --> 00:05:01,830 So if you use this line as a model, in 1995 the line would 104 00:05:01,829 --> 00:05:05,579 say that you're going to make $52,847. 105 00:05:05,579 --> 00:05:07,279 The actual data was a little bit off of that. 106 00:05:07,279 --> 00:05:09,969 It was a little bit higher, $53,807. 107 00:05:09,970 --> 00:05:11,250 So it was a little bit higher. 108 00:05:11,250 --> 00:05:13,850 But we're trying to get a line that gets as close as possible 109 00:05:13,850 --> 00:05:14,980 to all of this data. 110 00:05:14,980 --> 00:05:17,530 It's actually trying to minimize the distance, the 111 00:05:17,529 --> 00:05:20,469 square of the distance, between each of these points 112 00:05:20,470 --> 00:05:20,970 in the line. 113 00:05:20,970 --> 00:05:22,440 We won't go into the math there. 114 00:05:22,439 --> 00:05:23,829 But it gave us this nice equation. 115 00:05:23,829 --> 00:05:26,810 Now we can use this nice equation to predict things. 116 00:05:26,810 --> 00:05:30,290 If we say that this is a good a model for the data-- let me 117 00:05:30,290 --> 00:05:31,400 bring this down a little bit --let's try 118 00:05:31,399 --> 00:05:32,399 to answer our question. 119 00:05:32,399 --> 00:05:35,209 So we drew a scatter plot-- really Excel did it for us. 120 00:05:35,209 --> 00:05:38,000 We found the equation right there. 121 00:05:38,000 --> 00:05:40,060 They say, what would you expect the median annual 122 00:05:40,060 --> 00:05:44,490 income of a California family to be in the year 2010? 123 00:05:44,490 --> 00:05:50,009 So here, we can just use the equation they gave us. 124 00:05:50,009 --> 00:05:51,769 This right here, was 2002. 125 00:05:51,769 --> 00:05:52,849 So I could write down the year. 126 00:05:52,850 --> 00:05:57,240 This was the year of 2002. 127 00:05:57,240 --> 00:06:02,490 So the year 2010 is 8 more years. 128 00:06:02,490 --> 00:06:04,350 Let me make a little column here. 129 00:06:04,350 --> 00:06:08,390 130 00:06:08,389 --> 00:06:13,539 So this is the year, 1995, 1996. 131 00:06:13,540 --> 00:06:15,710 Then Excel will be able to figure out if I select those, 132 00:06:15,709 --> 00:06:18,949 and I go to this little bottom right square and I scroll 133 00:06:18,949 --> 00:06:23,920 down, Excel will actually figure out that I want to 134 00:06:23,920 --> 00:06:26,980 increment by 1 year every time. 135 00:06:26,980 --> 00:06:31,670 If I say years since 1995, once again I can just continue 136 00:06:31,670 --> 00:06:33,550 this trend right here. 137 00:06:33,550 --> 00:06:36,600 So 2010 would be 15 years. 138 00:06:36,600 --> 00:06:38,660 So we can just apply this equation. 139 00:06:38,660 --> 00:06:41,760 We could say it's going to be equal to, according to this 140 00:06:41,759 --> 00:06:44,569 line-- I'm just going to type it in, hopefully you can read 141 00:06:44,569 --> 00:06:51,000 what I'm saying --1,882.3 times x. 142 00:06:51,000 --> 00:06:55,500 x here is the year since 1995. 143 00:06:55,500 --> 00:06:58,430 I could just select this cell, or I could type 144 00:06:58,430 --> 00:06:59,319 in the number 15. 145 00:06:59,319 --> 00:07:03,589 That means times this cell, times 15. 146 00:07:03,589 --> 00:07:11,389 Then plus 52,847, plus that right there. 147 00:07:11,389 --> 00:07:17,599 Click enter and it predicts $81,081.50. 148 00:07:17,600 --> 00:07:21,610 So if you just continue this line for another 8 or so 149 00:07:21,610 --> 00:07:24,550 years, it predicts that the median income in California 150 00:07:24,550 --> 00:07:27,730 for a family will be $81,000. 151 00:07:27,730 --> 00:07:30,270 Anyway, hopefully you found that interesting. 152 00:07:30,269 --> 00:07:33,339 Spreadsheets are very useful tools for manipulating data. 153 00:07:33,339 --> 00:07:36,519 It'll give you a sense of why linear models are interesting, 154 00:07:36,519 --> 00:07:39,129 why lines are interesting, and how you can actually use these 155 00:07:39,129 --> 00:07:42,600 tools to interpret data and maybe even extrapolate some 156 00:07:42,600 --> 00:07:43,370 type of a prediction. 157 00:07:43,370 --> 00:07:45,449 This right here, is an extrapolation using this 158 00:07:45,449 --> 00:07:47,259 linear regression. 159 00:07:47,259 --> 00:07:47,332