1 00:00:00,690 --> 00:00:01,840 Hey, what's up, Gurus. 2 00:00:01,840 --> 00:00:02,820 In this lesson, 3 00:00:02,820 --> 00:00:04,810 we're going to be talking about how to deal 4 00:00:04,810 --> 00:00:08,980 with duplicate and NULL data within Data Flow. 5 00:00:08,980 --> 00:00:09,860 Pretty cool lesson, 6 00:00:09,860 --> 00:00:11,303 so, let's get started. 7 00:00:14,320 --> 00:00:15,400 So, in this lesson, 8 00:00:15,400 --> 00:00:16,770 we will start off by talking 9 00:00:16,770 --> 00:00:19,750 about how to handle duplicate and NULL data, 10 00:00:19,750 --> 00:00:20,750 which, by the way, 11 00:00:20,750 --> 00:00:23,260 is not just a job for Data Flow, 12 00:00:23,260 --> 00:00:24,850 but we're going to focus on Data Flow 13 00:00:24,850 --> 00:00:27,840 because it's on the DP-203 exam. 14 00:00:27,840 --> 00:00:29,370 We're also going to take a look 15 00:00:29,370 --> 00:00:31,170 at some of the basic steps 16 00:00:31,170 --> 00:00:32,770 that you need to complete 17 00:00:32,770 --> 00:00:35,280 in order to deal with duplicate or NULL data 18 00:00:35,280 --> 00:00:36,590 in Data Flow. 19 00:00:36,590 --> 00:00:37,450 And then finally, 20 00:00:37,450 --> 00:00:39,500 of course, we'll jump in and I will show you 21 00:00:39,500 --> 00:00:41,513 how it all works in the portal. 22 00:00:42,920 --> 00:00:44,080 So, let's start by talking 23 00:00:44,080 --> 00:00:48,283 about why this is not just a job for Data Flow. 24 00:00:49,390 --> 00:00:50,560 There's an old saying: 25 00:00:50,560 --> 00:00:52,210 if all you have is a hammer, 26 00:00:52,210 --> 00:00:54,200 everything looks like a nail. 27 00:00:54,200 --> 00:00:56,260 That's very much the case here. 28 00:00:56,260 --> 00:00:58,890 If all you understand is Data Factory, 29 00:00:58,890 --> 00:01:00,230 that's the only tool you've got, 30 00:01:00,230 --> 00:01:01,530 and hey, that's going to be the solution 31 00:01:01,530 --> 00:01:02,940 for everything.
32 00:01:02,940 --> 00:01:04,720 If you've broadened your horizons, 33 00:01:04,720 --> 00:01:05,553 you might say, 34 00:01:05,553 --> 00:01:08,500 oh, Databricks might be a solution in some cases, 35 00:01:08,500 --> 00:01:12,230 or Power BI might be a solution in some cases. 36 00:01:12,230 --> 00:01:13,540 There are a lot of different ways 37 00:01:13,540 --> 00:01:15,810 to look at and solve problems 38 00:01:15,810 --> 00:01:17,810 in Azure data engineering, 39 00:01:17,810 --> 00:01:19,430 so, make sure that you're broadening 40 00:01:19,430 --> 00:01:21,000 your technology base, 41 00:01:21,000 --> 00:01:23,100 so that you can look at the best solution, 42 00:01:23,100 --> 00:01:25,350 not just the only one you know. 43 00:01:25,350 --> 00:01:27,610 Next, complexity. 44 00:01:27,610 --> 00:01:29,220 So, we've talked about Data Factory 45 00:01:29,220 --> 00:01:30,370 and Data Factory's Data Flow, 46 00:01:30,370 --> 00:01:32,820 and being able to do transformations there, 47 00:01:32,820 --> 00:01:34,410 but we've also talked about that being 48 00:01:34,410 --> 00:01:37,710 for generally simpler use cases. 49 00:01:37,710 --> 00:01:39,210 If you have very complex needs, 50 00:01:39,210 --> 00:01:40,620 a lot of times you're going to look at something 51 00:01:40,620 --> 00:01:42,300 like Databricks. 52 00:01:42,300 --> 00:01:44,920 So, understanding your level of complexity, 53 00:01:44,920 --> 00:01:47,760 and how much data you're pushing through your pipelines, 54 00:01:47,760 --> 00:01:49,670 will help you to understand the solution 55 00:01:49,670 --> 00:01:52,133 that you should be using for the job as well. 56 00:01:53,010 --> 00:01:55,600 Finally, accompanying solutions.
57 00:01:55,600 --> 00:01:58,510 So, if you think about your pipeline, 58 00:01:58,510 --> 00:02:01,930 if pretty much all you're using is Data Factory, 59 00:02:01,930 --> 00:02:04,060 hey, you might want to do everything in Data Factory, 60 00:02:04,060 --> 00:02:05,570 including Data Flow, 61 00:02:05,570 --> 00:02:07,030 but if you're already using Databricks, 62 00:02:07,030 --> 00:02:08,230 for example, 63 00:02:08,230 --> 00:02:09,560 well, you might want to take a look 64 00:02:09,560 --> 00:02:11,800 at doing your transformations in Databricks, 65 00:02:11,800 --> 00:02:13,540 and then just using Data Factory 66 00:02:13,540 --> 00:02:15,760 to move things along the pipeline. 67 00:02:15,760 --> 00:02:17,680 So, understanding the other solutions 68 00:02:17,680 --> 00:02:20,110 and other steps in your pipelines, 69 00:02:20,110 --> 00:02:23,880 or other solutions in your data engineering jobs, 70 00:02:23,880 --> 00:02:26,430 will help you to understand the best solution 71 00:02:26,430 --> 00:02:28,030 for the task at hand. 72 00:02:28,030 --> 00:02:30,480 So, just consider all of those things, 73 00:02:30,480 --> 00:02:32,803 not just Data Flow. 74 00:02:35,850 --> 00:02:36,810 Having said that, 75 00:02:36,810 --> 00:02:38,650 these are the basic steps 76 00:02:38,650 --> 00:02:41,230 to deal with duplicate and missing data. 77 00:02:41,230 --> 00:02:42,360 Step 1, 78 00:02:42,360 --> 00:02:45,870 we're going to create a data flow in Data Factory. 79 00:02:45,870 --> 00:02:46,840 Step 2, 80 00:02:46,840 --> 00:02:49,010 we're going to select our source. 81 00:02:49,010 --> 00:02:51,410 Then, Step 3, we're going to choose our modifier, 82 00:02:51,410 --> 00:02:52,243 and I'm going to show you 83 00:02:52,243 --> 00:02:54,290 what that means here in just a second, 84 00:02:54,290 --> 00:02:56,050 but you're going to choose the modifier 85 00:02:56,050 --> 00:02:59,020 to accomplish the task that you're looking to do.
86 00:02:59,020 --> 00:02:59,853 And then Step 4, 87 00:02:59,853 --> 00:03:01,430 you're going to choose your destination, 88 00:03:01,430 --> 00:03:03,500 which, if you remember from Data Factory, 89 00:03:03,500 --> 00:03:05,120 is the sink. 90 00:03:05,120 --> 00:03:06,970 So, you're going to choose that as well. 91 00:03:07,950 --> 00:03:10,220 So, this is how we're going to choose our modifier 92 00:03:10,220 --> 00:03:11,920 as I go through the demo here. 93 00:03:11,920 --> 00:03:16,070 I'm going to include this link in your lesson as well, 94 00:03:16,070 --> 00:03:17,930 but make sure that you're looking at that, 95 00:03:17,930 --> 00:03:21,023 because it's going to have a ton of helpful information. 96 00:03:22,360 --> 00:03:23,193 With that, 97 00:03:23,193 --> 00:03:25,380 let's jump over and see how this works 98 00:03:25,380 --> 00:03:27,150 in the portal. 99 00:03:27,150 --> 00:03:31,340 So, let's take a look at removing duplicate rows. 100 00:03:31,340 --> 00:03:33,010 And I want to show you 101 00:03:33,010 --> 00:03:35,940 that Data Flow script link from Microsoft, 102 00:03:35,940 --> 00:03:38,680 so let's jump over and take a look at that. 103 00:03:38,680 --> 00:03:41,470 This is a very cool section 104 00:03:41,470 --> 00:03:43,450 that has a ton of little sample scripts 105 00:03:43,450 --> 00:03:44,420 that you can use 106 00:03:44,420 --> 00:03:47,310 to solve all sorts of different problems. 107 00:03:47,310 --> 00:03:48,560 So, in this case, 108 00:03:48,560 --> 00:03:51,490 let's go ahead and grab this code snippet here, 109 00:03:51,490 --> 00:03:53,080 which is going to help us 110 00:03:53,080 --> 00:03:56,640 with deduping our data. 111 00:03:56,640 --> 00:03:59,510 And so, we're going to copy this snippet, 112 00:03:59,510 --> 00:04:02,480 jump back over here to our data flow, 113 00:04:02,480 --> 00:04:05,350 and we're going to look at accessing the code now 114 00:04:05,350 --> 00:04:07,920 instead of just pulling modules in.
115 00:04:07,920 --> 00:04:09,590 So, over here on the right, 116 00:04:09,590 --> 00:04:12,050 you'll see that we can use Properties. 117 00:04:12,050 --> 00:04:14,210 We can look at our code as well 118 00:04:14,210 --> 00:04:15,913 by clicking on that link. 119 00:04:17,270 --> 00:04:19,520 And we can also take a look at this, 120 00:04:19,520 --> 00:04:21,930 which is our script. 121 00:04:21,930 --> 00:04:23,310 Now, since we pulled the script 122 00:04:23,310 --> 00:04:25,260 from that Microsoft link, 123 00:04:25,260 --> 00:04:27,570 we're going to go ahead and use this. 124 00:04:27,570 --> 00:04:30,170 And I'm just going to paste in the script snippet 125 00:04:30,170 --> 00:04:31,770 that we just grabbed. 126 00:04:31,770 --> 00:04:33,470 So, I'll click on OK, 127 00:04:33,470 --> 00:04:35,750 and you can see that it has now created 128 00:04:35,750 --> 00:04:38,060 a DistinctRows module for you. 129 00:04:38,060 --> 00:04:39,460 And so, what this is going to do 130 00:04:39,460 --> 00:04:41,850 is it's going to go through all of your data. 131 00:04:41,850 --> 00:04:44,200 It's going to keep the first instance of each row, 132 00:04:44,200 --> 00:04:45,360 treat that as unique, 133 00:04:45,360 --> 00:04:47,490 and then anything else that's a duplicate, 134 00:04:47,490 --> 00:04:51,610 it's not going to pass through to our output. 135 00:04:51,610 --> 00:04:53,520 So, let's go ahead and grab this. 136 00:04:53,520 --> 00:04:54,830 For our incoming stream, 137 00:04:54,830 --> 00:04:56,780 we'll choose source1, 138 00:04:56,780 --> 00:05:00,670 and we'll put that in line with our data flow. 139 00:05:00,670 --> 00:05:01,670 So, next, 140 00:05:01,670 --> 00:05:03,690 we want to talk about how we get rid 141 00:05:03,690 --> 00:05:05,520 of NULL values. 142 00:05:05,520 --> 00:05:08,770 So, we'll jump back over to our script here, 143 00:05:08,770 --> 00:05:13,460 and we'll grab our check for NULL script.
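[For reference, the dedupe snippet from Microsoft's Data Flow script page looks roughly like the following. It hashes every column of each row, groups rows by that hash, and keeps only the first row in each group. The source1 and DistinctRows names match the demo; the exact expression may have changed since this was recorded, so copy the current version from the linked docs page rather than this sketch.]

```
source1 aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
```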
144 00:05:13,460 --> 00:05:16,160 So, this is going to separate out 145 00:05:16,160 --> 00:05:18,320 all of our NULL values for us. 146 00:05:18,320 --> 00:05:20,443 So, we'll copy that script snippet, 147 00:05:21,320 --> 00:05:23,100 come back over here, 148 00:05:23,100 --> 00:05:25,033 go to our script section again, 149 00:05:26,050 --> 00:05:29,330 and we'll just paste that in as well. 150 00:05:29,330 --> 00:05:31,340 So, you can see that it has now created 151 00:05:31,340 --> 00:05:33,750 a hasNULLs module for us. 152 00:05:33,750 --> 00:05:35,930 So, let's go ahead and put this in line 153 00:05:35,930 --> 00:05:37,950 with our data flow as well. 154 00:05:37,950 --> 00:05:40,450 So, we'll come down to our incoming stream. 155 00:05:40,450 --> 00:05:42,020 We're not going to choose source1. 156 00:05:42,020 --> 00:05:44,290 We're actually going to choose DistinctRows, 157 00:05:44,290 --> 00:05:46,350 because we want this to happen after. 158 00:05:46,350 --> 00:05:47,500 And so, once we do that, 159 00:05:47,500 --> 00:05:50,043 you can see that it now follows DistinctRows. 160 00:05:50,043 --> 00:05:51,410 After DistinctRows has separated out the duplicates, 161 00:05:51,410 --> 00:05:53,320 it's going to take a look for NULLs, 162 00:05:53,320 --> 00:05:54,810 and it's going to separate those out 163 00:05:54,810 --> 00:05:56,420 into 2 different streams, 164 00:05:56,420 --> 00:05:58,610 so we can decide what we want to do with the data 165 00:05:58,610 --> 00:06:00,150 that has NULL values in it, 166 00:06:00,150 --> 00:06:02,800 and the data that doesn't have any NULL values in it. 167 00:06:03,690 --> 00:06:04,900 And so on and so forth. 168 00:06:04,900 --> 00:06:06,630 So, we could go the rest of the way through, 169 00:06:06,630 --> 00:06:08,030 building our transformations out 170 00:06:08,030 --> 00:06:09,550 if we so desired.
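[The NULL-check snippet is a split transformation along these lines, again paraphrased from the Microsoft docs page: it sends any row containing a NULL in any column to one output stream and every other row to a second stream. Verify the exact expression against the docs before pasting; only the DistinctRows and hasNULLs names come from the demo.]

```
DistinctRows split(contains(array(toString(columns())), isNull(#item)),
    disjoint: false) ~> hasNULLs@(hasNULLs, noNULLs)
```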
171 00:06:09,550 --> 00:06:11,370 Once we're all the way done, 172 00:06:11,370 --> 00:06:12,203 and again, 173 00:06:12,203 --> 00:06:13,390 once we're done with everything, 174 00:06:13,390 --> 00:06:15,370 we would just grab a sink 175 00:06:15,370 --> 00:06:17,490 and we would drag that over 176 00:06:17,490 --> 00:06:19,550 to finish out our output. 177 00:06:19,550 --> 00:06:20,530 And then we could go through 178 00:06:20,530 --> 00:06:22,740 and put that into our data pipeline, 179 00:06:22,740 --> 00:06:25,830 and we would have a pretty cool flow 180 00:06:25,830 --> 00:06:27,180 that is going to go through, 181 00:06:27,180 --> 00:06:29,040 separate out all of our distinct rows, 182 00:06:29,040 --> 00:06:31,990 and then separate out NULL and non-NULL values. 183 00:06:31,990 --> 00:06:32,823 So, with that, 184 00:06:32,823 --> 00:06:34,220 let's jump back over 185 00:06:34,220 --> 00:06:36,970 and take a look at our review. 186 00:06:36,970 --> 00:06:38,340 Keep in mind, 187 00:06:38,340 --> 00:06:39,630 duplicate data 188 00:06:39,630 --> 00:06:42,040 and NULL data are not just jobs for Data Flow, 189 00:06:42,040 --> 00:06:44,070 so be thinking about alternate solutions 190 00:06:44,070 --> 00:06:45,423 as you move through. 191 00:06:46,530 --> 00:06:50,100 Also, understand the basic Data Flow steps. 192 00:06:50,100 --> 00:06:51,450 So, Step 1, 193 00:06:51,450 --> 00:06:53,330 create a data flow. 194 00:06:53,330 --> 00:06:54,400 Step 2, 195 00:06:54,400 --> 00:06:57,380 you're going to select your source. 196 00:06:57,380 --> 00:06:58,213 Step 3, 197 00:06:58,213 --> 00:06:59,690 you're going to choose your modifier, 198 00:06:59,690 --> 00:07:02,340 and you're going to grab those script snippets 199 00:07:02,340 --> 00:07:03,220 from that link, 200 00:07:03,220 --> 00:07:06,340 and then you're going to paste those in where applicable.
201 00:07:06,340 --> 00:07:07,310 And then Step 4, 202 00:07:07,310 --> 00:07:08,500 you're going to create your sink 203 00:07:08,500 --> 00:07:12,040 and then attach that to your Data Factory pipeline. 204 00:07:12,040 --> 00:07:13,190 If you remember that, 205 00:07:13,190 --> 00:07:14,590 then you are good to go, 206 00:07:14,590 --> 00:07:15,910 and you know what you need to know 207 00:07:15,910 --> 00:07:19,070 for the DP-203 to handle duplicate 208 00:07:19,070 --> 00:07:21,513 and NULL data in Data Flow.
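[Putting the four steps together, the script view of the finished data flow would read roughly like the sketch below: source, dedupe, NULL split, sink. The source and sink property lists here are illustrative placeholders for whatever your actual datasets use; only the DistinctRows and hasNULLs pieces reflect the demo, and the expressions should be taken from the current Microsoft docs page rather than from this sketch.]

```
source(allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
DistinctRows split(contains(array(toString(columns())), isNull(#item)),
    disjoint: false) ~> hasNULLs@(hasNULLs, noNULLs)
hasNULLs@noNULLs sink(skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> sink1
```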