1 00:00:00,320 --> 00:00:02,500 In this lesson, we are going to talk about 2 00:00:02,500 --> 00:00:05,420 how to troubleshoot a failed Spark job. 3 00:00:05,420 --> 00:00:07,344 And we're going to start with some basic 4 00:00:07,344 --> 00:00:08,677 troubleshooting logic. 5 00:00:08,677 --> 00:00:10,870 We're going to learn about thinking it through. 6 00:00:10,870 --> 00:00:13,220 Then we're going to jump back into Ambari UI, 7 00:00:13,220 --> 00:00:14,730 look at configuration settings, 8 00:00:14,730 --> 00:00:17,160 look at cluster health, stack and version. 9 00:00:17,160 --> 00:00:20,140 Follow that up by talking about log files, 10 00:00:20,140 --> 00:00:22,250 configuration settings, and then finally, 11 00:00:22,250 --> 00:00:24,320 how to reproduce an error. 12 00:00:24,320 --> 00:00:28,170 So let's get started by gathering data about the issue. 13 00:00:28,170 --> 00:00:30,020 At this point, you need to employ 14 00:00:30,020 --> 00:00:32,130 basic troubleshooting skills. 15 00:00:32,130 --> 00:00:33,570 Now, I know what you're thinking, 16 00:00:33,570 --> 00:00:36,750 "HDInsight is not a basic service". 17 00:00:36,750 --> 00:00:38,770 However, we still need to think 18 00:00:38,770 --> 00:00:41,320 about those basic troubleshooting skills. 19 00:00:41,320 --> 00:00:43,830 You know, taking a look at: is it plugged in? 20 00:00:43,830 --> 00:00:44,950 Is it turned on? 21 00:00:44,950 --> 00:00:46,630 Now I know this is the cloud and yes, 22 00:00:46,630 --> 00:00:48,010 it's plugged in and turned on. 23 00:00:48,010 --> 00:00:52,110 However, think about those basic troubleshooting pieces 24 00:00:52,110 --> 00:00:54,390 and see if there's something super simple 25 00:00:54,390 --> 00:00:58,200 before you jump into much more complicated fixes, 26 00:00:58,200 --> 00:01:00,810 because oftentimes, it's something very simple. 27 00:01:00,810 --> 00:01:02,690 "Oh, I forgot the comma.
28 00:01:02,690 --> 00:01:05,370 "That's 8 hours I'll never get back out of my life." 29 00:01:05,370 --> 00:01:08,060 We've all been there, if you spend any time in coding. 30 00:01:08,060 --> 00:01:11,000 Just think it through and start with the basics. 31 00:01:11,000 --> 00:01:13,690 And then we want to ask ourselves some basic questions. 32 00:01:13,690 --> 00:01:15,340 When did the problem start? 33 00:01:15,340 --> 00:01:17,370 Is this a repetitive issue? 34 00:01:17,370 --> 00:01:21,280 Not only is it repetitive, can I force the issue to repeat? 35 00:01:21,280 --> 00:01:23,920 And what was I expecting to happen? 36 00:01:23,920 --> 00:01:25,900 By understanding some of these basic questions, 37 00:01:25,900 --> 00:01:27,220 we'll have a much better idea 38 00:01:27,220 --> 00:01:30,860 as we jump into some more complicated fixes. 39 00:01:30,860 --> 00:01:32,800 So with this, let's actually stop now 40 00:01:32,800 --> 00:01:35,930 and jump into Ambari and talk about configuration settings, 41 00:01:35,930 --> 00:01:38,890 cluster health, and stack and version. 42 00:01:38,890 --> 00:01:41,920 So, here we find ourselves in Ambari, 43 00:01:41,920 --> 00:01:45,310 and we always want to start by looking at our dashboard 44 00:01:45,310 --> 00:01:48,310 and seeing if everything looks like it should. 45 00:01:48,310 --> 00:01:49,160 Is it green? 46 00:01:49,160 --> 00:01:50,950 Do we have red, angry faces? 47 00:01:50,950 --> 00:01:54,780 Is there something wrong with disk usage, or CPU, 48 00:01:54,780 --> 00:01:56,710 or nodes, or something else? 49 00:01:56,710 --> 00:01:59,320 If we've gone through these basic settings, 50 00:01:59,320 --> 00:02:00,660 the next thing that we want to do 51 00:02:00,660 --> 00:02:04,160 is jump into the service that we're trying to configure. 52 00:02:04,160 --> 00:02:06,510 And we want to make sure that the components are set up 53 00:02:06,510 --> 00:02:08,200 like they're supposed to be. 
54 00:02:08,200 --> 00:02:11,240 And we want to take a look at our config files as well. 55 00:02:11,240 --> 00:02:14,340 Now it may be that we don't need to do any configuration. 56 00:02:14,340 --> 00:02:17,140 However, for your individual service, 57 00:02:17,140 --> 00:02:18,600 you might want to jump in and look. 58 00:02:18,600 --> 00:02:20,420 And I'm not really going to spend any time here, 59 00:02:20,420 --> 00:02:21,900 because as you can see, 60 00:02:21,900 --> 00:02:24,660 there are hundreds of different metrics 61 00:02:24,660 --> 00:02:26,730 that we can edit and play with. 62 00:02:26,730 --> 00:02:30,200 So that's not really going to be helpful for the DP-203, 63 00:02:30,200 --> 00:02:31,280 other than understanding 64 00:02:31,280 --> 00:02:33,500 that you do need to look at your configs, 65 00:02:33,500 --> 00:02:35,790 if appropriate for your solution 66 00:02:35,790 --> 00:02:38,500 and your job that you're trying to run. 67 00:02:38,500 --> 00:02:39,908 The next thing that we want to look at 68 00:02:39,908 --> 00:02:42,400 is under Cluster Admin. 69 00:02:42,400 --> 00:02:45,480 We want to take a look at our stack and versions. 70 00:02:45,480 --> 00:02:48,020 So this is going to give us our service stack, 71 00:02:48,020 --> 00:02:50,740 and you can see all of the services that we have, 72 00:02:50,740 --> 00:02:52,510 and you can see if they're installed. 73 00:02:52,510 --> 00:02:54,850 And then you can also see, under Version, 74 00:02:54,850 --> 00:02:57,550 a really nice chart that shows you all of the versions 75 00:02:57,550 --> 00:02:59,360 and how current they are. 76 00:02:59,360 --> 00:03:03,000 So this will help you as you're looking at 77 00:03:03,000 --> 00:03:05,940 a starting place for if there's any obvious errors. 78 00:03:05,940 --> 00:03:08,120 That's kind of what we're looking at. 
79 00:03:08,120 --> 00:03:10,830 If you've done that, then we want to jump in 80 00:03:10,830 --> 00:03:12,570 and we want to talk about log files. 81 00:03:12,570 --> 00:03:15,030 So we want to check our 2 different types of log files, 82 00:03:15,030 --> 00:03:17,700 and we want to check our Hadoop step logs. 83 00:03:17,700 --> 00:03:19,930 So make sure that you've looked at those. 84 00:03:19,930 --> 00:03:20,890 If you've done that 85 00:03:20,890 --> 00:03:23,780 and we still haven't encountered anything that's obvious, 86 00:03:23,780 --> 00:03:27,330 now we need to start going into configuration settings. 87 00:03:27,330 --> 00:03:30,000 If you have not optimized your configuration settings, 88 00:03:30,000 --> 00:03:32,740 some places to start are your cluster settings, 89 00:03:32,740 --> 00:03:36,660 hardware configuration, and nodes. 90 00:03:36,660 --> 00:03:40,060 Next, we want to see if we can reproduce the error. 91 00:03:40,060 --> 00:03:42,960 If everything fails, Microsoft recommends 92 00:03:42,960 --> 00:03:45,130 that we try again on a brand new cluster 93 00:03:45,130 --> 00:03:47,480 to see if we're getting the same type of error. 94 00:03:49,170 --> 00:03:50,690 All right, so 95 00:03:50,690 --> 00:03:52,860 we ran through those points rather quick. 96 00:03:52,860 --> 00:03:55,130 A few key points to remember. One, 97 00:03:55,130 --> 00:03:58,020 remember that this is a process that you need to complete, 98 00:03:58,020 --> 00:03:59,830 even if it's boring. 99 00:03:59,830 --> 00:04:03,160 Next, you want to make small changes. 100 00:04:03,160 --> 00:04:05,240 Step by step, you want to make a change 101 00:04:05,240 --> 00:04:07,870 and see what's actually impacting in your environment. 102 00:04:07,870 --> 00:04:09,370 If you make too many changes, 103 00:04:09,370 --> 00:04:12,060 you might skip some steps and fall down the stairs. 104 00:04:12,060 --> 00:04:13,540 So don't do that.
105 00:04:13,540 --> 00:04:15,440 Small changes, and then look and see 106 00:04:15,440 --> 00:04:17,100 what's actually impacting or changing 107 00:04:17,100 --> 00:04:18,383 within your environment. 108 00:04:20,020 --> 00:04:24,270 Also, exploring should be done in dev, not prod. 109 00:04:24,270 --> 00:04:26,050 Should be a fairly simple concept, 110 00:04:26,050 --> 00:04:28,010 but make sure that you keep out of prod 111 00:04:28,010 --> 00:04:31,040 when you're exploring with your services. 112 00:04:31,040 --> 00:04:35,500 Finally, for the DP-203, what is it that I need to know? 113 00:04:35,500 --> 00:04:38,560 Well, I would recommend that you know the basic steps. 114 00:04:38,560 --> 00:04:40,840 So you need to look at Ambari UI, 115 00:04:40,840 --> 00:04:42,890 you need to examine your log files, 116 00:04:42,890 --> 00:04:44,760 examine your configuration settings, 117 00:04:44,760 --> 00:04:46,590 try and reproduce the error. 118 00:04:46,590 --> 00:04:48,950 If you remember those basic concept pieces, 119 00:04:48,950 --> 00:04:52,140 that should be what you need to know for the DP-203. 120 00:04:52,140 --> 00:04:54,540 I don't think it's going to go much deeper than that, 121 00:04:54,540 --> 00:04:56,730 but just understand those basic points. 122 00:04:56,730 --> 00:04:58,970 So with that, we're done with this lesson. 123 00:04:58,970 --> 00:05:00,123 I'll see you in the next.
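
The recap in this lesson (look at the Ambari UI, examine the log files, examine the configuration settings, try to reproduce the error) can be sketched as an ordered checklist that stops at the first step that turns up something. This is only an illustration under stated assumptions: `Finding`, `run_checklist`, `scan_log_lines`, and the sample log lines are all hypothetical names and data, not part of HDInsight, Ambari, or Spark, and the keyword scan is a stand-in for actually reading your logs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Finding:
    step: str    # which troubleshooting step produced the finding
    detail: str  # short description of what looked wrong


# A check returns a description of the problem, or None if nothing looked wrong.
Check = Callable[[], Optional[str]]


def run_checklist(steps: List[Tuple[str, Check]]) -> Optional[Finding]:
    """Run the checks in order and stop at the first one that reports a problem."""
    for name, check in steps:
        detail = check()
        if detail is not None:
            return Finding(step=name, detail=detail)
    return None


def scan_log_lines(lines: List[str]) -> Optional[str]:
    """Return the first line containing ERROR or Exception (a hypothetical heuristic)."""
    for line in lines:
        if "ERROR" in line or "Exception" in line:
            return line.strip()
    return None


# Hypothetical sample data standing in for real log output.
sample_log = [
    "INFO  Starting job",
    "ERROR Task failed: missing comma in query",
    "INFO  Shutting down",
]

# The order follows the lesson's recap; the lambdas returning None are placeholders
# for checks that found nothing (for example, a dashboard that is all green).
steps = [
    ("Ambari UI", lambda: None),
    ("Log files", lambda: scan_log_lines(sample_log)),
    ("Configuration settings", lambda: None),
    ("Reproduce the error", lambda: None),
]

finding = run_checklist(steps)
print(finding)
```

Stopping at the first finding mirrors the advice to start with the basics and make small changes: you look into one thing, see what it changes, and only then move on to the next step.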