1 00:00:00,450 --> 00:00:01,990 So in this lesson, 2 00:00:01,990 --> 00:00:05,330 we are going to talk about how to handle late-arriving data 3 00:00:05,330 --> 00:00:08,340 in Azure Stream Analytics. 4 00:00:08,340 --> 00:00:10,740 And specifically what we're going to be talking about 5 00:00:10,740 --> 00:00:13,700 is how we define late data. 6 00:00:13,700 --> 00:00:16,850 We're going to take a look at the consequences of tolerance, 7 00:00:16,850 --> 00:00:19,480 and then we're also going to talk about how we adjust 8 00:00:19,480 --> 00:00:20,563 to late data. 9 00:00:21,620 --> 00:00:23,700 Before we do that, there's a few definitions 10 00:00:23,700 --> 00:00:25,380 that you need to understand. 11 00:00:25,380 --> 00:00:28,360 So we're going to start off by talking about watermarks. 12 00:00:28,360 --> 00:00:31,130 There's 2 different ways to define a watermark. 13 00:00:31,130 --> 00:00:33,740 The first way is to define watermarks 14 00:00:33,740 --> 00:00:35,670 when an event is arriving. 15 00:00:35,670 --> 00:00:38,170 So let's jump back in to that movie theater 16 00:00:38,170 --> 00:00:41,060 and I'm getting ready to buy a movie ticket. 17 00:00:41,060 --> 00:00:45,120 When I purchase that movie ticket, I create an event. 18 00:00:45,120 --> 00:00:49,600 That event will then be moving to Azure Stream Analytics 19 00:00:49,600 --> 00:00:53,580 or arriving into Azure Stream Analytics. 20 00:00:53,580 --> 00:00:55,930 So this is the incoming scenario. 21 00:00:55,930 --> 00:00:58,010 So my watermark would be defined 22 00:00:58,010 --> 00:01:03,010 by the largest event time minus my out-of-order tolerance. 23 00:01:03,850 --> 00:01:06,470 And I set my out-of-order tolerance. 24 00:01:06,470 --> 00:01:08,470 So we're going to take a look at the largest event time 25 00:01:08,470 --> 00:01:11,000 and we're going to subtract the out-of-order tolerance 26 00:01:11,000 --> 00:01:14,260 to get our incoming watermark time. 
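The incoming-event watermark described here (largest event time minus the out-of-order tolerance) can be sketched in a few lines of Python. This is only an illustration of the formula, not how Azure Stream Analytics implements it; the function name, the sample timestamps, and the 2-minute tolerance are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

def incoming_watermark(event_times, out_of_order_tolerance):
    """Watermark while events are arriving:
    largest event time seen so far minus the out-of-order tolerance."""
    return max(event_times) - out_of_order_tolerance

# Hypothetical event times for three ticket purchases (not from the lesson).
seen_event_times = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 5),
    datetime(2024, 1, 1, 12, 3),
]

# Assumed out-of-order tolerance of 2 minutes.
watermark = incoming_watermark(seen_event_times, timedelta(minutes=2))
print(watermark)  # 2024-01-01 12:03:00
```

With these sample values the largest event time is 12:05, so subtracting the 2-minute tolerance gives a watermark of 12:03.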
27 00:01:14,260 --> 00:01:15,730 The other way to define watermarks 28 00:01:15,730 --> 00:01:18,840 is if there's no incoming event, and if that's the case, 29 00:01:18,840 --> 00:01:22,010 we take a look at our current estimated arrival time 30 00:01:22,010 --> 00:01:25,110 and we subtract out our late arrival tolerance. 31 00:01:25,110 --> 00:01:27,260 And so my estimated arrival time 32 00:01:27,260 --> 00:01:32,260 is just going to be whenever the last event occurs to now. 33 00:01:32,320 --> 00:01:34,800 It's going to be that amount of time. 34 00:01:34,800 --> 00:01:36,810 That's how we define watermarks 35 00:01:36,810 --> 00:01:38,773 when there's no incoming event. 36 00:01:39,870 --> 00:01:43,670 And so from there, we can get to our late data. 37 00:01:43,670 --> 00:01:48,300 So, late is defined as taking a look at my arrival time, 38 00:01:48,300 --> 00:01:50,800 subtracting out my event time, 39 00:01:50,800 --> 00:01:54,540 and then comparing that to my tolerance window. 40 00:01:54,540 --> 00:01:56,580 So an example of that would be 41 00:01:56,580 --> 00:02:01,040 if I, at 12 o'clock, buy a movie ticket, 42 00:02:01,040 --> 00:02:04,250 that is my event time. 12 o'clock. 43 00:02:04,250 --> 00:02:07,320 My arrival time is 12:04. 44 00:02:07,320 --> 00:02:10,040 So if it takes 4 minutes from when I bought the ticket 45 00:02:10,040 --> 00:02:12,500 to get into Azure Stream Analytics, 46 00:02:12,500 --> 00:02:16,500 that's going to be this: 12:04 minus 12:00. 47 00:02:16,500 --> 00:02:19,170 So, that's going to be 4 minutes. 48 00:02:19,170 --> 00:02:21,810 And if I had a 5-minute tolerance window set, 49 00:02:21,810 --> 00:02:23,910 I'm going to compare the 2 of those. 50 00:02:23,910 --> 00:02:25,690 5 is greater than 4. 51 00:02:25,690 --> 00:02:28,163 So the event is not late. 52 00:02:29,040 --> 00:02:31,600 If on the other hand, I had purchased the ticket 53 00:02:31,600 --> 00:02:35,550 at 12 o'clock and it didn't get there until 12:06? 
54 00:02:35,550 --> 00:02:38,650 6 is greater than 5, and it would be late. 55 00:02:38,650 --> 00:02:40,840 So that's how we define late. 56 00:02:40,840 --> 00:02:43,510 These 3 definitions you do need to understand, 57 00:02:43,510 --> 00:02:45,810 so make sure that you know those formulas 58 00:02:45,810 --> 00:02:49,100 and have a decent handling of how you define watermarks 59 00:02:49,100 --> 00:02:50,843 and how you define late data. 60 00:02:54,100 --> 00:02:56,250 So now let's take a look at that tolerance window 61 00:02:56,250 --> 00:02:58,790 that I've been talking about and see what the consequences 62 00:02:58,790 --> 00:02:59,993 of tolerance are. 63 00:03:00,850 --> 00:03:03,950 So the first consequence is delayed outputs. 64 00:03:03,950 --> 00:03:07,180 If I have a huge tolerance window, it's 65 00:03:07,180 --> 00:03:10,950 going to take forever for that tolerance window to close, 66 00:03:10,950 --> 00:03:12,270 which means I don't get any data. 67 00:03:12,270 --> 00:03:15,970 I don't get any results that I can look at or compare, 68 00:03:15,970 --> 00:03:17,690 which means I'm going to get bad reports. 69 00:03:17,690 --> 00:03:20,233 It also could break processes down the stream. 70 00:03:21,380 --> 00:03:23,890 The second is if I have a tolerance window 71 00:03:23,890 --> 00:03:28,040 that's set too short, I can miss critical events 72 00:03:28,040 --> 00:03:30,480 because I don't have the proper tolerance set. 73 00:03:30,480 --> 00:03:33,260 So if an event is one minute late, for instance, 74 00:03:33,260 --> 00:03:34,600 I might not even look at it, 75 00:03:34,600 --> 00:03:37,530 and I might miss some very critical data, which again, 76 00:03:37,530 --> 00:03:41,563 can lead to broken processes or other missed data. 77 00:03:43,210 --> 00:03:45,180 So looking at this in action. 78 00:03:45,180 --> 00:03:48,620 So let's say that I have a time window down here 79 00:03:48,620 --> 00:03:49,500 at the bottom. 
80 00:03:49,500 --> 00:03:53,860 This is my event time going from 12:00 to 12:30. 81 00:03:53,860 --> 00:03:58,323 And then my processing time going from 12:02 to 12:20, okay? 82 00:03:59,700 --> 00:04:02,320 So let's say that I have my first event come in, 83 00:04:02,320 --> 00:04:04,140 and it's just event number 4. 84 00:04:04,140 --> 00:04:07,130 The numbers in these little bubbles mean absolutely nothing. 85 00:04:07,130 --> 00:04:10,723 It's just, I have something come in that's 4. All right? 86 00:04:11,770 --> 00:04:14,210 So it is processed by 12:02. 87 00:04:14,210 --> 00:04:17,620 It arrives at 12:00 and it's processed by 12:02. 88 00:04:17,620 --> 00:04:21,200 The next event comes in at 12:05, 89 00:04:21,200 --> 00:04:24,830 and you can see there that it's processed at 12:10. 90 00:04:24,830 --> 00:04:26,860 Still within the tolerance window, 91 00:04:26,860 --> 00:04:30,240 which is this large blue box, okay? 92 00:04:30,240 --> 00:04:31,430 So I have my tolerance window, 93 00:04:31,430 --> 00:04:33,550 which would be this large blue box, 94 00:04:33,550 --> 00:04:35,150 and I have my events. 95 00:04:35,150 --> 00:04:38,703 So, so far my event total would be 10: 6 + 4. 96 00:04:40,070 --> 00:04:42,970 Now let's say that I have an event come in at 12:08, 97 00:04:42,970 --> 00:04:46,730 but it's not processed until, like, 12:25. 98 00:04:46,730 --> 00:04:48,000 Well, that's way late 99 00:04:48,000 --> 00:04:51,090 and it's not going to be in the event window. 100 00:04:51,090 --> 00:04:52,920 So when you look at the sum total 101 00:04:52,920 --> 00:04:56,690 of the data in that window, I'm going to get 10. 102 00:04:56,690 --> 00:04:59,343 I missed my 8 up there at the top. 
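The first window in this walkthrough can be reproduced with a small Python sketch. The event values and the event and processing times come from the narration; the 10-minute tolerance is an assumption, since the lesson only shows the tolerance window as a box on the slide and never states its size. The helper names are hypothetical, and the sketch simply leaves late events out of the total.

```python
from datetime import datetime, timedelta

def at(hhmm):
    """Parse an HH:MM clock time (helper for this example only)."""
    return datetime.strptime(hhmm, "%H:%M")

# (value, event time, processing time) as narrated for the first window.
events = [
    (4, at("12:00"), at("12:02")),
    (6, at("12:05"), at("12:10")),
    (8, at("12:08"), at("12:25")),
]

# Assumed tolerance; the lesson does not give the size of the blue box.
TOLERANCE = timedelta(minutes=10)

def window_total(events, tolerance):
    """Sum only the events whose processing delay fits inside the tolerance."""
    return sum(
        value
        for value, event_time, processing_time in events
        if processing_time - event_time <= tolerance
    )

total = window_total(events, TOLERANCE)
print(total)  # 10
```

The delays are 2, 5, and 17 minutes, so the 8 falls outside the assumed tolerance and the window total is 10, matching the narration.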
103 00:05:00,870 --> 00:05:02,800 Moving on, in my next window, 104 00:05:02,800 --> 00:05:04,370 let's just say that I had events 105 00:05:04,370 --> 00:05:06,210 8, 4, 3, and 9, 106 00:05:06,210 --> 00:05:11,210 and all of those are processed in a relatively fast fashion. 107 00:05:12,300 --> 00:05:14,230 You can see here that event 8 108 00:05:14,230 --> 00:05:17,370 is actually processed slower than event 4. 109 00:05:17,370 --> 00:05:20,670 Event 3 is processed much slower than events 110 00:05:20,670 --> 00:05:23,450 8 or 4, but it still just squeaks in 111 00:05:23,450 --> 00:05:24,670 with that window. 112 00:05:24,670 --> 00:05:27,980 And then event 9 is processed pretty close to on time. 113 00:05:27,980 --> 00:05:32,340 12:18 for the event and then 12:18 for the processing time, 114 00:05:32,340 --> 00:05:35,750 so it's processed nearly instantaneously. 115 00:05:35,750 --> 00:05:38,830 All 4 of those events happen within that tolerance window 116 00:05:38,830 --> 00:05:39,950 that we've set. 117 00:05:39,950 --> 00:05:42,143 And so I get a total of 24. 118 00:05:43,300 --> 00:05:44,920 And so then in our third scenario, 119 00:05:44,920 --> 00:05:47,270 let's say that we have events 5 and 7, 120 00:05:47,270 --> 00:05:50,040 and that's going to lead us to 12. 121 00:05:50,040 --> 00:05:51,070 That looks pretty good. 122 00:05:51,070 --> 00:05:53,260 But let's say that we were really concerned 123 00:05:53,260 --> 00:05:57,150 about missing data, and so we increase our tolerance window. 124 00:05:57,150 --> 00:05:59,560 If we do that, we are going to have 125 00:05:59,560 --> 00:06:02,380 a massive tolerance window, and my 7 and 5 126 00:06:02,380 --> 00:06:05,340 are well within the range of the tolerance window. 127 00:06:05,340 --> 00:06:08,770 And in theory, this would also increase to the top as well, 128 00:06:08,770 --> 00:06:10,080 but I can't really show you that 129 00:06:10,080 --> 00:06:12,390 and still have everything fit on the screen. 
130 00:06:12,390 --> 00:06:15,010 But you would have then 7 and 5, 131 00:06:15,010 --> 00:06:17,253 and then a massive window. 132 00:06:18,300 --> 00:06:21,660 Now, that massive window may look fantastic, 133 00:06:21,660 --> 00:06:22,560 but the problem is, 134 00:06:22,560 --> 00:06:25,740 let's jump back into our movie theater scenario. 135 00:06:25,740 --> 00:06:29,490 Let's say that we have a theater that only seats 10 people. 136 00:06:29,490 --> 00:06:31,370 I come in and I buy 5 tickets. 137 00:06:31,370 --> 00:06:33,250 This first event here. 138 00:06:33,250 --> 00:06:37,240 Right after me, someone comes in and buys 7 tickets. 139 00:06:37,240 --> 00:06:38,950 So now we've sold 12 tickets 140 00:06:38,950 --> 00:06:41,450 for a theater that only seats 10 people. 141 00:06:41,450 --> 00:06:43,930 The problem is our tolerance window, 142 00:06:43,930 --> 00:06:46,952 this giant, light-blue box is so big 143 00:06:46,952 --> 00:06:49,940 that I don't get data coming out. 144 00:06:49,940 --> 00:06:51,440 And so I can sell 5 seats 145 00:06:51,440 --> 00:06:54,300 and then 7 seats before the system updates, 146 00:06:54,300 --> 00:06:57,030 because it's still waiting on this tolerance window. 147 00:06:57,030 --> 00:06:59,800 So it's really important when we look at tolerance 148 00:06:59,800 --> 00:07:01,440 that we find that sweet spot. 149 00:07:01,440 --> 00:07:03,930 That we don't do what's over here in this dark blue, 150 00:07:03,930 --> 00:07:05,940 and make our tolerance window so short 151 00:07:05,940 --> 00:07:07,820 that we miss critical events. 152 00:07:07,820 --> 00:07:09,260 And that we also don't do what's over here 153 00:07:09,260 --> 00:07:11,570 in this light blue and make a tolerance window 154 00:07:11,570 --> 00:07:14,770 that's so big that we also break processes 155 00:07:14,770 --> 00:07:17,340 or we sell more tickets than we have 156 00:07:17,340 --> 00:07:20,160 because our tolerance window isn't the right size. 
157 00:07:20,160 --> 00:07:22,323 So it's important to find that sweet spot. 158 00:07:24,530 --> 00:07:26,510 So now, we're going to take a look 159 00:07:26,510 --> 00:07:28,850 at how we set tolerance in action. 160 00:07:28,850 --> 00:07:31,270 And this may be the shortest demo 161 00:07:31,270 --> 00:07:36,060 in the DP-203, and maybe in Azure, but let's take a look. 162 00:07:36,060 --> 00:07:38,720 All right, so we are in Stream Analytics 163 00:07:38,720 --> 00:07:40,150 and I have simply scrolled down 164 00:07:40,150 --> 00:07:43,020 and clicked on Event Ordering here. 165 00:07:43,020 --> 00:07:45,260 And you can see that I have 2 different windows. 166 00:07:45,260 --> 00:07:48,580 I can choose my late-arriving data events, 167 00:07:48,580 --> 00:07:49,530 and I just come in here 168 00:07:49,530 --> 00:07:51,890 and I choose whatever I want for that. 169 00:07:51,890 --> 00:07:54,400 And then I can choose my out-of-order events. 170 00:07:54,400 --> 00:07:55,930 So I can look at the tolerance for that. 171 00:07:55,930 --> 00:07:58,640 And I can just simply set how late 172 00:07:58,640 --> 00:08:01,070 I'm willing to look at late-arriving data 173 00:08:01,070 --> 00:08:03,850 or out-of-order data for my tolerance. 174 00:08:03,850 --> 00:08:05,320 And then finally, I can choose 175 00:08:05,320 --> 00:08:08,260 whether I want to adjust or drop events. 176 00:08:08,260 --> 00:08:10,800 Basically if I drop an event, it's just going to delete it. 177 00:08:10,800 --> 00:08:13,380 If I adjust it, it's going to keep that event 178 00:08:13,380 --> 00:08:15,430 and just change the timestamp. 179 00:08:15,430 --> 00:08:18,010 So you can look at how you want to handle that. 180 00:08:18,010 --> 00:08:20,610 Click on Save, and that's it. 181 00:08:20,610 --> 00:08:22,300 That Stream Analytics job is done 182 00:08:22,300 --> 00:08:23,970 and you are ready to go. 183 00:08:23,970 --> 00:08:25,560 So it's extremely simple to set. 
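The difference between the drop and adjust options can be sketched as follows. This is a simplified model, not the service's actual code: the function name and the policy strings are hypothetical, and the adjusted timestamp used here (arrival time minus the tolerance) is an assumption for illustration, since the demo only says that adjusting keeps the event and changes its timestamp.

```python
from datetime import datetime, timedelta
from typing import Optional

def apply_late_policy(
    event_time: datetime,
    arrival_time: datetime,
    late_tolerance: timedelta,
    policy: str,
) -> Optional[datetime]:
    """Return the timestamp to use for the event, or None if it is dropped."""
    if arrival_time - event_time <= late_tolerance:
        return event_time  # on time: keep the original timestamp
    if policy == "drop":
        return None  # late event is deleted
    if policy == "adjust":
        # Assumption: move the timestamp to the edge of the tolerance window.
        return arrival_time - late_tolerance
    raise ValueError(f"unknown policy: {policy}")

ticket_bought = datetime(2024, 1, 1, 12, 0)
arrived = datetime(2024, 1, 1, 12, 6)
five_minutes = timedelta(minutes=5)

dropped = apply_late_policy(ticket_bought, arrived, five_minutes, "drop")
adjusted = apply_late_policy(ticket_bought, arrived, five_minutes, "adjust")
print(dropped)   # None
print(adjusted)  # 2024-01-01 12:01:00
```

Dropping loses the ticket sale entirely, while adjusting keeps it but reports it at a later time than it really happened, which is the trade-off to weigh when choosing the setting.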
184 00:08:25,560 --> 00:08:28,260 It's really more important to think about that tolerance 185 00:08:28,260 --> 00:08:30,150 and understand what you're actually setting 186 00:08:30,150 --> 00:08:31,760 when you do that. 187 00:08:31,760 --> 00:08:35,500 All right, so in review, we have 3 formulas to look at. 188 00:08:35,500 --> 00:08:38,050 We have our 2 watermark formulas. 189 00:08:38,050 --> 00:08:40,400 The first being our incoming event 190 00:08:40,400 --> 00:08:43,410 which is the largest event time minus out-of-order tolerance. 191 00:08:43,410 --> 00:08:45,960 And then our second is if there's no incoming event, 192 00:08:45,960 --> 00:08:48,300 which is just our current estimated arrival time 193 00:08:48,300 --> 00:08:50,620 minus our late arrival tolerance. 194 00:08:50,620 --> 00:08:53,740 And then keep in mind as well, we have our late definition, 195 00:08:53,740 --> 00:08:57,110 which is just our arrival time minus our event time, 196 00:08:57,110 --> 00:09:00,113 and then comparing that to our tolerance window. 197 00:09:01,460 --> 00:09:04,040 We also have the consequences of tolerance. 198 00:09:04,040 --> 00:09:07,830 So make sure that you use metrics to help you set tolerance. 199 00:09:07,830 --> 00:09:08,920 And what I mean by that 200 00:09:08,920 --> 00:09:11,660 is as you start running your Stream Analytics jobs, 201 00:09:11,660 --> 00:09:14,230 you'll be able to see how many events 202 00:09:14,230 --> 00:09:16,980 are late or out-of-order. 203 00:09:16,980 --> 00:09:18,580 And based upon that data, 204 00:09:18,580 --> 00:09:20,170 you can use those metrics 205 00:09:20,170 --> 00:09:22,713 to help you set that tolerance window. 206 00:09:25,410 --> 00:09:29,470 Finally, watermarks and tolerance are extremely important. 
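The no-incoming-event watermark and the late definition from the review can be written out as two small functions. This is a sketch of the formulas as stated in the lesson, with hypothetical function names. The 12:00 ticket with a 5-minute tolerance is the lesson's own example; the 12:30 estimated arrival time is a made-up value for illustration.

```python
from datetime import datetime, timedelta

def no_event_watermark(estimated_arrival_time, late_arrival_tolerance):
    """Watermark when nothing is arriving:
    current estimated arrival time minus the late arrival tolerance."""
    return estimated_arrival_time - late_arrival_tolerance

def is_late(event_time, arrival_time, tolerance):
    """An event is late when (arrival time - event time) exceeds the tolerance."""
    return arrival_time - event_time > tolerance

tolerance = timedelta(minutes=5)
bought_at = datetime(2024, 1, 1, 12, 0)

# Lesson example: arrives at 12:04, a 4-minute delay, so not late.
print(is_late(bought_at, datetime(2024, 1, 1, 12, 4), tolerance))  # False
# Lesson example: arrives at 12:06, a 6-minute delay, so late.
print(is_late(bought_at, datetime(2024, 1, 1, 12, 6), tolerance))  # True

# Hypothetical estimated arrival time of 12:30 with the same tolerance.
print(no_event_watermark(datetime(2024, 1, 1, 12, 30), tolerance))  # 2024-01-01 12:25:00
```

The strict comparison follows the wording of the examples, where only a delay greater than the tolerance counts as late.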
207 00:09:29,470 --> 00:09:31,470 Make sure that you don't gloss over that 208 00:09:31,470 --> 00:09:33,530 and that when you make Stream Analytics jobs, 209 00:09:33,530 --> 00:09:36,340 you have a strategy for watermarks and tolerance, 210 00:09:36,340 --> 00:09:38,440 because they can absolutely ruin your day 211 00:09:38,440 --> 00:09:41,060 if you don't have them set appropriately. 212 00:09:41,060 --> 00:09:41,893 All right, that's it. 213 00:09:41,893 --> 00:09:44,390 We are done with late-arriving data. 214 00:09:44,390 --> 00:09:46,783 With that, I'll see you in the next video.