1 00:00:00,530 --> 00:00:01,880 Congratulations. 2 00:00:01,880 --> 00:00:04,290 You made it through section 5. 3 00:00:04,290 --> 00:00:05,180 In this lesson, 4 00:00:05,180 --> 00:00:07,290 we're going to be looking through the rear-view mirror 5 00:00:07,290 --> 00:00:08,577 at batch processing 6 00:00:08,577 --> 00:00:11,220 and just re-covering a few of the concepts 7 00:00:11,220 --> 00:00:14,380 that you need to know for the DP-203. 8 00:00:14,380 --> 00:00:16,880 So specifically, hey, it's all a review. 9 00:00:16,880 --> 00:00:18,770 We're just going to be looking at things 10 00:00:18,770 --> 00:00:21,050 that we have already talked about in section 5 11 00:00:21,050 --> 00:00:23,150 and highlighting things that you need to know 12 00:00:23,150 --> 00:00:24,785 for the DP-203. 13 00:00:24,785 --> 00:00:27,820 If you come across something that is confusing 14 00:00:27,820 --> 00:00:29,086 or you don't quite remember, 15 00:00:29,086 --> 00:00:32,633 you need to go back through and rewatch those lessons. 16 00:00:32,633 --> 00:00:36,268 Second, it is a focus on the DP-203. 17 00:00:36,268 --> 00:00:39,600 Everything in this course is designed to get you prepped 18 00:00:39,600 --> 00:00:41,720 and ready to pass that exam. 19 00:00:41,720 --> 00:00:45,047 So there are going to be topics that are glossed over 20 00:00:45,047 --> 00:00:46,912 or other things that are missed 21 00:00:46,912 --> 00:00:49,080 because they're not a main focus for the DP-203, 22 00:00:49,080 --> 00:00:52,850 but things you might want to know for data engineering. 23 00:00:52,850 --> 00:00:55,670 Finally, if you don't know something, review. 24 00:00:55,670 --> 00:00:58,030 Don't forget that. It's extremely important. 25 00:00:58,030 --> 00:01:00,993 With that, let's jump in and start the review. 26 00:01:02,410 --> 00:01:05,690 First up, we talked about batch processing.
27 00:01:05,690 --> 00:01:08,550 We talked about it being used in banking and retail 28 00:01:08,550 --> 00:01:10,410 and hospitals and marketing. 29 00:01:10,410 --> 00:01:11,790 We went through examples 30 00:01:11,790 --> 00:01:15,550 of why batch processing is used there, 31 00:01:15,550 --> 00:01:17,590 and typically, just like with a bank, 32 00:01:17,590 --> 00:01:20,278 it comes down to being able to run everything 33 00:01:20,278 --> 00:01:24,500 in one large lump or a batch. 34 00:01:24,500 --> 00:01:27,670 Also, we talked about some of the challenges. 35 00:01:27,670 --> 00:01:29,800 One of them is date formatting. 36 00:01:29,800 --> 00:01:32,360 We talked about encoding being a challenge as well. 37 00:01:32,360 --> 00:01:35,860 And then we talked about what happens with missed runs, 38 00:01:35,860 --> 00:01:38,790 so batch processing being accomplished all at once. 39 00:01:38,790 --> 00:01:40,470 What happens if you miss a run? 40 00:01:40,470 --> 00:01:44,432 How are you going to go back and fix or catch up the data? 41 00:01:44,432 --> 00:01:46,550 Or do you catch up the data? 42 00:01:46,550 --> 00:01:49,453 Things to consider when you look at batch. 43 00:01:50,533 --> 00:01:52,490 We also talked about full data loading 44 00:01:52,490 --> 00:01:54,146 versus incremental data loading. 45 00:01:54,146 --> 00:01:56,250 If you remember, full data loading 46 00:01:56,250 --> 00:01:58,770 is where we just literally dump the entire data set 47 00:01:58,770 --> 00:02:02,440 and we start over afresh with everything replaced. 48 00:02:02,440 --> 00:02:04,876 There's no other requirements, that's it. 49 00:02:04,876 --> 00:02:07,948 Incremental data loading is where we're not dumping anything 50 00:02:07,948 --> 00:02:09,680 from the current database. 51 00:02:09,680 --> 00:02:11,780 We're just looking at the differences 52 00:02:11,780 --> 00:02:14,450 and we're loading those differences in. 
53 00:02:14,450 --> 00:02:15,890 We talked about that process, 54 00:02:15,890 --> 00:02:18,710 creating lookup activities, 55 00:02:18,710 --> 00:02:22,320 creating a copy activity in Data Factory, 56 00:02:22,320 --> 00:02:24,591 and then creating a stored procedure activity 57 00:02:24,591 --> 00:02:26,740 to update your watermark. 58 00:02:26,740 --> 00:02:31,053 And your watermark was measuring what was last updated. 59 00:02:33,480 --> 00:02:35,310 We talked about mapping data flows. 60 00:02:35,310 --> 00:02:39,470 If you remember, a data flow is the visual no-code solution 61 00:02:39,470 --> 00:02:43,110 to developing and implementing transformational logic 62 00:02:43,110 --> 00:02:44,816 in Data Factory. 63 00:02:44,816 --> 00:02:48,410 So keys here, no-code solution 64 00:02:48,410 --> 00:02:51,050 for adding transformational logic. 65 00:02:51,050 --> 00:02:53,861 So if you want to do some transforming, it's somewhat light, 66 00:02:53,861 --> 00:02:56,260 Data Factory and mapping data flows 67 00:02:56,260 --> 00:02:59,010 might be a fantastic solution. 68 00:02:59,010 --> 00:03:00,810 Those data flows are then created 69 00:03:01,877 --> 00:03:04,373 and added into your Data Factory pipelines. 70 00:03:05,500 --> 00:03:07,660 And if you remember the basic steps, 71 00:03:07,660 --> 00:03:10,020 create a data flow in Data Factory. 72 00:03:10,020 --> 00:03:13,151 You pick your source, or where the data is coming from. 73 00:03:13,151 --> 00:03:16,220 You choose the modifier, which is the things 74 00:03:16,220 --> 00:03:17,150 that you want done, 75 00:03:17,150 --> 00:03:20,230 the transformations you want done on that data. 76 00:03:20,230 --> 00:03:21,840 And then you always finish up 77 00:03:21,840 --> 00:03:23,970 with choosing your destination. 78 00:03:23,970 --> 00:03:26,388 So those are the basic steps for creating those data flows 79 00:03:26,388 --> 00:03:28,733 in Data Factory.
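As a rough sketch of the watermark process reviewed above (two lookups, a copy of only the changed rows, then a watermark update), here is a small Python illustration. It uses an in-memory SQLite database to stand in for the source and destination. The table names, column names, and the incremental_load function are hypothetical and chosen for illustration only; this is not Data Factory's API, just the same idea in miniature.

```python
import sqlite3


def incremental_load(conn):
    """Copy only rows changed since the stored watermark; return how many were copied."""
    cur = conn.cursor()
    # Lookup 1: the watermark saved by the previous run.
    old_mark = cur.execute("SELECT last_modified FROM watermark").fetchone()[0]
    # Lookup 2: the newest change currently in the source.
    new_mark = cur.execute("SELECT MAX(modified_at) FROM source").fetchone()[0]
    if new_mark is None or new_mark <= old_mark:
        return 0  # nothing has changed since the last run
    # Copy step: pick up only the rows between the old and new watermark.
    rows = cur.execute(
        "SELECT id, value, modified_at FROM source "
        "WHERE modified_at > ? AND modified_at <= ?",
        (old_mark, new_mark),
    ).fetchall()
    cur.executemany(
        "INSERT OR REPLACE INTO target (id, value, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    # Stored-procedure step: move the watermark forward.
    cur.execute("UPDATE watermark SET last_modified = ?", (new_mark,))
    conn.commit()
    return len(rows)


# Hypothetical sample data: one row older than the watermark, one newer.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER PRIMARY KEY, value TEXT, modified_at TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, modified_at TEXT);
    CREATE TABLE watermark (last_modified TEXT);
    INSERT INTO watermark VALUES ('2024-01-01T00:00:00');
    INSERT INTO source VALUES (1, 'old', '2023-12-31T09:00:00');
    INSERT INTO source VALUES (2, 'new', '2024-01-02T09:00:00');
    """
)

first_run = incremental_load(conn)   # copies only the row newer than the watermark
second_run = incremental_load(conn)  # nothing new, so nothing is copied
```

The second call is there to show why the watermark matters: once it has moved forward, rerunning the load does not copy the same rows again.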
80 00:03:30,440 --> 00:03:34,140 And, of course, there are a ton of scripts that you can use. 81 00:03:34,140 --> 00:03:36,920 And there's the link again. It's also found in the lesson, 82 00:03:36,920 --> 00:03:38,720 but make sure you take a look at that. 83 00:03:38,720 --> 00:03:42,339 Because as you actually jump into data engineering, 84 00:03:42,339 --> 00:03:44,653 those scripts can be very helpful. 85 00:03:47,290 --> 00:03:48,123 Upserting. 86 00:03:48,123 --> 00:03:50,216 We talked about upserting being the operation 87 00:03:50,216 --> 00:03:53,018 that allows you to insert rows into a database table 88 00:03:53,018 --> 00:03:56,983 if they don't exist, or update them if they do. 89 00:03:58,380 --> 00:04:02,220 The key to this is your Alter Row transformation. 90 00:04:02,220 --> 00:04:04,910 So you'll grab that, and you'll use that 91 00:04:04,910 --> 00:04:08,453 in order to create upserting in Data Factory. 92 00:04:10,290 --> 00:04:12,620 We also talked about the key considerations for that 93 00:04:12,620 --> 00:04:16,390 being activity, fault tolerance, and retry. 94 00:04:16,390 --> 00:04:21,130 And if you remember, the activity was setting your success, 95 00:04:21,130 --> 00:04:23,160 failure, completion, and skipped. 96 00:04:23,160 --> 00:04:25,638 So what happens after your activity completes or fails? 97 00:04:25,638 --> 00:04:28,550 Making sure you have a strategy for that. 98 00:04:28,550 --> 00:04:32,420 Fault tolerance was deciding what errors you can ignore. 99 00:04:32,420 --> 00:04:34,660 And then the retry was just simply setting 100 00:04:34,660 --> 00:04:38,520 how many times you're going to retry an activity 101 00:04:38,520 --> 00:04:40,803 before you decide the pipeline has failed. 102 00:04:43,420 --> 00:04:46,160 We also talked about execution triggers. If you remember, 103 00:04:46,160 --> 00:04:49,098 we can manually execute a pipeline. 104 00:04:49,098 --> 00:04:50,580 That's just the on-demand.
105 00:04:50,580 --> 00:04:52,950 We're going to click the button and start it. 106 00:04:52,950 --> 00:04:53,783 And if you remember, 107 00:04:53,783 --> 00:04:56,090 there are 4 programmatic methods for configuring. There's .NET, 108 00:04:56,090 --> 00:04:59,903 PowerShell, REST, Python, and, of course, the portal. 109 00:05:00,850 --> 00:05:03,800 In addition to that, you can also set up triggers, 110 00:05:03,800 --> 00:05:06,820 which are going to execute via a schedule-- 111 00:05:06,820 --> 00:05:10,700 that was like our simple wall clock, or our tumbling window, 112 00:05:10,700 --> 00:05:13,580 which is more complicated, but allows us to look back 113 00:05:13,580 --> 00:05:15,682 or forward in time. 114 00:05:15,682 --> 00:05:18,420 Remember, the tumbling windows are fixed 115 00:05:18,420 --> 00:05:20,053 and non-overlapping, 116 00:05:20,053 --> 00:05:22,228 and those are contiguous time intervals. 117 00:05:22,228 --> 00:05:25,180 And contiguous just means that they're touching each other 118 00:05:25,180 --> 00:05:27,130 as we move forward in time. 119 00:05:27,130 --> 00:05:29,610 And, of course, we have our event-based 120 00:05:29,610 --> 00:05:32,490 that we can also set up for execution triggers. 121 00:05:32,490 --> 00:05:34,170 So don't forget about those. 122 00:05:35,290 --> 00:05:37,890 And don't forget about Databricks. 123 00:05:37,890 --> 00:05:40,110 Just like our elephant over here on the right, 124 00:05:40,110 --> 00:05:42,883 you need to remember that the DP-203 125 00:05:42,883 --> 00:05:46,657 is going to look at Databricks primarily for transformation, 126 00:05:46,657 --> 00:05:48,890 and you need to remember that demo 127 00:05:48,890 --> 00:05:50,523 where we went through, created, 128 00:05:50,523 --> 00:05:52,443 and then launched a notebook.
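To make the tumbling window idea from the trigger review a bit more concrete, here is a minimal Python sketch of what "fixed, non-overlapping, and contiguous" means. The tumbling_windows function and the sample times are hypothetical and only illustrate the concept; this is not how a Data Factory trigger is actually configured.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Return (window_start, window_end) pairs of a fixed size covering start..end."""
    windows = []
    window_start = start
    # Each window begins exactly where the previous one ended,
    # so the windows never overlap and never leave a gap.
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


# Three one-hour windows between midnight and 03:00 (sample values).
windows = tumbling_windows(
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 3, 0),
    timedelta(hours=1),
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Because every window has its own start and end time, a run can be tied to one specific slice of time, which is what makes it possible to go back and process a window that was missed.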
129 00:05:54,410 --> 00:05:55,900 Finally, don't forget 130 00:05:55,900 --> 00:05:59,220 that Databricks can be integrated with other services 131 00:05:59,220 --> 00:06:01,863 like Data Factory or Synapse. 132 00:06:04,820 --> 00:06:06,760 Spark in Data Factory. 133 00:06:06,760 --> 00:06:09,165 We talked a little bit about this, and if you remember, 134 00:06:09,165 --> 00:06:13,435 we can execute Spark activities using an HDInsight cluster. 135 00:06:13,435 --> 00:06:15,668 We talked through a whole bunch of code over here 136 00:06:15,668 --> 00:06:16,622 on the right. 137 00:06:16,622 --> 00:06:19,080 So make sure that you remember that. 138 00:06:19,080 --> 00:06:21,470 I'm not going to go back through all of that code, 139 00:06:21,470 --> 00:06:22,660 but really what you need to know 140 00:06:22,660 --> 00:06:25,050 is the basics of what's happening 141 00:06:25,050 --> 00:06:27,623 and how we create those activities. 142 00:06:31,720 --> 00:06:35,390 In summary, this is not all. 143 00:06:35,390 --> 00:06:38,200 This was a huge section, and what I've covered here 144 00:06:38,200 --> 00:06:39,930 wasn't even all of the slides 145 00:06:39,930 --> 00:06:41,940 and it wasn't even all of the lessons. 146 00:06:41,940 --> 00:06:44,500 It was the lessons that I felt were most important, 147 00:06:44,500 --> 00:06:45,630 but that's not all. 148 00:06:45,630 --> 00:06:48,990 So don't just gloss over the other topics. 149 00:06:48,990 --> 00:06:50,639 Also, don't forget the big picture. 150 00:06:50,639 --> 00:06:52,680 As you go through this course, 151 00:06:52,680 --> 00:06:55,490 you need to be constantly thinking about how this fits 152 00:06:55,490 --> 00:06:57,740 into data engineering pipelines, 153 00:06:57,740 --> 00:06:59,575 how you can use these individual activities 154 00:06:59,575 --> 00:07:02,425 and use them for your larger projects.
155 00:07:02,425 --> 00:07:04,167 That's going to help you in your career 156 00:07:04,167 --> 00:07:07,290 and it's going to help you on this exam. 157 00:07:07,290 --> 00:07:08,781 Also, don't forget to focus 158 00:07:08,781 --> 00:07:11,700 on those Microsoft exam requirements. 159 00:07:11,700 --> 00:07:13,750 If you remember, I've shown this in several lessons, 160 00:07:13,750 --> 00:07:15,925 so we're not going to jump in and do that again, 161 00:07:15,925 --> 00:07:16,758 but make sure 162 00:07:16,758 --> 00:07:19,220 that you're looking at those exam requirements. 163 00:07:19,220 --> 00:07:21,880 And if you need a refresher on that, back in section 1, 164 00:07:21,880 --> 00:07:23,370 we talked about what those are 165 00:07:23,370 --> 00:07:24,970 and I showed you how to find them. 166 00:07:25,970 --> 00:07:27,640 Don't forget about the labs. 167 00:07:27,640 --> 00:07:30,000 You also need those for your exam and career. 168 00:07:30,000 --> 00:07:31,580 And I'm going to keep mentioning that 169 00:07:31,580 --> 00:07:35,580 because the labs are very important for your success. 170 00:07:35,580 --> 00:07:38,410 And finally, if you would do me a huge favor 171 00:07:38,410 --> 00:07:40,540 and you would just smash that thumbs up button 172 00:07:40,540 --> 00:07:41,963 on those lessons as you go through, 173 00:07:41,963 --> 00:07:44,340 I would greatly appreciate it. 174 00:07:44,340 --> 00:07:46,081 I hope you're finding this content helpful, 175 00:07:46,081 --> 00:07:49,690 and congratulations again on finishing section 5. 176 00:07:49,690 --> 00:07:51,150 I'll see you in section 6 177 00:07:51,150 --> 00:07:53,350 and we'll talk a little bit about streaming.