1 00:00:01,100 --> 00:00:02,040 Hey, welcome back. 2 00:00:02,040 --> 00:00:05,550 So in this lesson, we are going to start building our puzzle 3 00:00:05,550 --> 00:00:07,450 and we're going to be looking, specifically, at 4 00:00:07,450 --> 00:00:12,450 how we identify services for stream processing in Azure. 5 00:00:12,790 --> 00:00:13,810 So in this lesson, 6 00:00:13,810 --> 00:00:17,320 we're going to be talking about what stream processing is. 7 00:00:17,320 --> 00:00:19,750 We're going to be talking about how stream processing 8 00:00:19,750 --> 00:00:22,030 is typically used in Azure. 9 00:00:22,030 --> 00:00:23,470 And then, we're going to take a look at 10 00:00:23,470 --> 00:00:26,210 some common stream processing solutions. 11 00:00:26,210 --> 00:00:30,600 Now, much like the last section, we are only going to focus 12 00:00:30,600 --> 00:00:33,950 on solutions that you would find on the DP-203, 13 00:00:33,950 --> 00:00:38,450 and we're going to attempt to focus the majority of our time 14 00:00:38,450 --> 00:00:41,300 on the services that you're most likely to heavily see 15 00:00:41,300 --> 00:00:42,970 on the DP-203. 16 00:00:42,970 --> 00:00:45,693 So just keep that in mind as we move forward. 17 00:00:46,800 --> 00:00:50,100 So from the last section, we talked a lot about batch 18 00:00:50,100 --> 00:00:52,140 and you remember the discussion about the pizza 19 00:00:52,140 --> 00:00:54,990 and the waterfall, or the stream here 20 00:00:54,990 --> 00:00:56,580 and what that looks like. 21 00:00:56,580 --> 00:00:58,030 We talked about batch 22 00:00:58,030 --> 00:01:01,230 being for data that's not required immediately. 23 00:01:01,230 --> 00:01:03,820 It's used for large transformations 24 00:01:03,820 --> 00:01:06,250 or can handle large transformations easier. 25 00:01:06,250 --> 00:01:09,530 And then it was that lower cost, larger data. 26 00:01:09,530 --> 00:01:11,830 Well with stream, it's kind of the opposite. 
27 00:01:11,830 --> 00:01:14,020 When we talk about stream, we're talking about, 28 00:01:14,020 --> 00:01:18,050 we need data now, we need those insights immediately, 29 00:01:18,050 --> 00:01:22,970 and that is worth spending more time processing, 30 00:01:22,970 --> 00:01:26,690 spending more money to get those processes done faster, 31 00:01:26,690 --> 00:01:28,360 and dealing with some of the complexities 32 00:01:28,360 --> 00:01:30,920 that you see with stream. 33 00:01:30,920 --> 00:01:33,350 So, picture that stream over here on the right, 34 00:01:33,350 --> 00:01:37,010 and remember, streaming is unbounded data, 35 00:01:37,010 --> 00:01:39,000 which means there's just a lot of unknowns. 36 00:01:39,000 --> 00:01:42,440 We don't necessarily know when that data's going to end. 37 00:01:42,440 --> 00:01:44,670 Most likely, we're not exactly sure 38 00:01:44,670 --> 00:01:46,520 about the size of the data coming in. 39 00:01:46,520 --> 00:01:49,570 So we have to be able to deal with some fluctuations there. 40 00:01:49,570 --> 00:01:53,500 And we have more complexity because as that data moves, 41 00:01:53,500 --> 00:01:55,960 if we have issues with transformations 42 00:01:55,960 --> 00:01:59,830 or issues with our pipelines, that data is still flowing. 43 00:01:59,830 --> 00:02:02,010 And so we have to figure out strategies 44 00:02:02,010 --> 00:02:03,870 to be able to handle that missing data 45 00:02:03,870 --> 00:02:06,440 and catch up with where the stream is now. 46 00:02:06,440 --> 00:02:10,120 So all of this comes down to streaming is more complex 47 00:02:10,120 --> 00:02:11,403 and more expensive. 48 00:02:15,060 --> 00:02:18,490 So where do we see streaming most likely used? 49 00:02:18,490 --> 00:02:20,650 Well, first is in fraud detection. 
50 00:02:20,650 --> 00:02:23,010 If I am logging into my bank, 51 00:02:23,010 --> 00:02:24,530 it is very important 52 00:02:24,530 --> 00:02:27,770 that that bank be able to look in real-time 53 00:02:27,770 --> 00:02:31,940 to figure out if I am a high risk for fraud. 54 00:02:31,940 --> 00:02:34,480 So if that's the case, they need to know immediately 55 00:02:34,480 --> 00:02:37,130 to be able to take steps to fix it. 56 00:02:37,130 --> 00:02:40,100 You also see streaming used in stock trading. 57 00:02:40,100 --> 00:02:44,560 Stock trading is something that is extremely fast 58 00:02:44,560 --> 00:02:46,500 and you need to be able to make decisions 59 00:02:46,500 --> 00:02:48,230 and see data immediately. 60 00:02:48,230 --> 00:02:49,470 So stock trading is something 61 00:02:49,470 --> 00:02:51,700 that you would see streaming data in. 62 00:02:51,700 --> 00:02:55,510 Customer activity, so think Amazon here or Netflix. 63 00:02:55,510 --> 00:02:56,730 If I'm watching shows, 64 00:02:56,730 --> 00:02:58,850 we want to be able to recommend things to me. 65 00:02:58,850 --> 00:03:01,500 Or if I'm browsing in Amazon, 66 00:03:01,500 --> 00:03:03,250 we want to be able to show me products 67 00:03:03,250 --> 00:03:05,150 that I might see immediately 68 00:03:05,150 --> 00:03:07,040 and that's also going to be true of Walmart, 69 00:03:07,040 --> 00:03:09,430 and so think customer activity. 70 00:03:09,430 --> 00:03:12,170 Ride-sharing. Ride-sharing is another place 71 00:03:12,170 --> 00:03:13,900 that streaming data is used. 72 00:03:13,900 --> 00:03:18,210 If I log on to my phone for Uber and I request an Uber, 73 00:03:18,210 --> 00:03:21,830 I need to be able to see all of the cars that are around me 74 00:03:21,830 --> 00:03:24,940 and they need to see that I am a potential customer. 75 00:03:24,940 --> 00:03:28,030 So there's a lot of streaming data that's used there. 
76 00:03:28,030 --> 00:03:31,810 And then also with marketing, as we look at marketing, 77 00:03:31,810 --> 00:03:35,570 being able to identify potential customers immediately 78 00:03:35,570 --> 00:03:38,040 and/or change the experience of someone 79 00:03:38,040 --> 00:03:40,770 as they come on to a website, for instance, 80 00:03:40,770 --> 00:03:43,950 is something that you would use streaming data for as well. 81 00:03:43,950 --> 00:03:46,390 So anything where you need real-time insights 82 00:03:46,390 --> 00:03:49,260 or you need to be able to change things on the fly 83 00:03:49,260 --> 00:03:51,270 for customer experience, 84 00:03:51,270 --> 00:03:53,690 that's where you're going to see streaming. 85 00:03:53,690 --> 00:03:56,270 So now, let's talk about some challenges. 86 00:03:56,270 --> 00:03:58,160 I mentioned the time issue. 87 00:03:58,160 --> 00:03:59,950 So the first is time windows. 88 00:03:59,950 --> 00:04:04,420 We're going to be looking at how we evaluate 89 00:04:04,420 --> 00:04:07,940 and get insights on the data as it passes. 90 00:04:07,940 --> 00:04:10,440 We can't look at an indefinite period of time. 91 00:04:10,440 --> 00:04:13,410 We need to build windows, and we'll talk in this section 92 00:04:13,410 --> 00:04:16,770 about different types of windows and what you can do, 93 00:04:16,770 --> 00:04:18,840 but we have to be able to frame our data 94 00:04:18,840 --> 00:04:20,580 in some sort of window 95 00:04:20,580 --> 00:04:23,980 so that we can get things like summations, or averages, 96 00:04:23,980 --> 00:04:27,610 or other insights that I might want to see on a report. 97 00:04:27,610 --> 00:04:28,840 Next, missing data. 98 00:04:28,840 --> 00:04:31,890 So when we lose that data, what do we do to catch up? 99 00:04:31,890 --> 00:04:34,060 So we'll be talking about that in this section as well 100 00:04:34,060 --> 00:04:35,770 and some different opportunities. 101 00:04:35,770 --> 00:04:38,000 And then transformation challenges. 
102 00:04:38,000 --> 00:04:41,470 So, think as we move through with our data, 103 00:04:41,470 --> 00:04:43,930 we have to do transformation on that data. 104 00:04:43,930 --> 00:04:45,050 We have to clean it up. 105 00:04:45,050 --> 00:04:48,530 We have to clean up the sources or convert things. 106 00:04:48,530 --> 00:04:52,070 Well, that cleanup has to happen in near real-time 107 00:04:52,070 --> 00:04:54,550 because we are getting data behind it. 108 00:04:54,550 --> 00:04:56,640 And so we have some transformation challenges 109 00:04:56,640 --> 00:04:58,580 that you don't see in batch 110 00:04:58,580 --> 00:05:00,560 that we have to consider when we look at streaming. 111 00:05:00,560 --> 00:05:02,263 So we'll talk about that as well. 112 00:05:04,530 --> 00:05:07,810 So let's talk about now where streaming lives 113 00:05:07,810 --> 00:05:09,680 when we talk about Azure. 114 00:05:09,680 --> 00:05:12,300 So on the left here, you're typically going to start off 115 00:05:12,300 --> 00:05:15,000 with a Blob source or a Data Lake, 116 00:05:15,000 --> 00:05:17,240 something that the data is coming from, 117 00:05:17,240 --> 00:05:19,360 it might also be like an Event Hub, 118 00:05:19,360 --> 00:05:21,463 but some form of input. 119 00:05:22,490 --> 00:05:25,790 That then takes us into our services. 120 00:05:25,790 --> 00:05:27,880 And the first is Databricks. 121 00:05:27,880 --> 00:05:30,250 Databricks is definitely a streaming solution 122 00:05:30,250 --> 00:05:33,040 and is a frequently used streaming solution, 123 00:05:33,040 --> 00:05:36,080 when we start talking about large data sources 124 00:05:36,080 --> 00:05:38,160 or when we're talking about machine learning-- 125 00:05:38,160 --> 00:05:39,530 that's another place you'll see 126 00:05:39,530 --> 00:05:41,223 Databricks used for streaming. 127 00:05:42,200 --> 00:05:45,790 HDInsight. A good way to think about HDInsight 128 00:05:45,790 --> 00:05:49,080 is a control panel with a thousand buttons. 
129 00:05:49,080 --> 00:05:51,450 There's a ton of things that you can do in there 130 00:05:51,450 --> 00:05:53,700 but it's very complex to configure. 131 00:05:53,700 --> 00:05:58,700 And so it's not as widely used in Azure. 132 00:05:59,180 --> 00:06:00,550 That doesn't mean that it's not good, 133 00:06:00,550 --> 00:06:02,010 it just means it's more complex, 134 00:06:02,010 --> 00:06:03,300 and so it's not as used 135 00:06:03,300 --> 00:06:05,530 because it takes a lot of configuration. 136 00:06:05,530 --> 00:06:08,410 And then Azure Stream Analytics, or ASA, 137 00:06:08,410 --> 00:06:10,010 as you'll see it quite frequently, 138 00:06:10,010 --> 00:06:11,730 because it's very long to say 139 00:06:11,730 --> 00:06:13,860 Azure Stream Analytics all the time. 140 00:06:13,860 --> 00:06:15,990 So ASA is the most common service 141 00:06:15,990 --> 00:06:19,750 that you'll see for streaming, especially on the DP-203. 142 00:06:19,750 --> 00:06:21,780 So we're going to spend the bulk of our time 143 00:06:21,780 --> 00:06:24,350 in Azure Stream Analytics. 144 00:06:24,350 --> 00:06:25,930 Now, the concepts that we talk about 145 00:06:25,930 --> 00:06:29,130 will carry through into other services as well, 146 00:06:29,130 --> 00:06:31,930 but this is where we're going to spend the bulk of our time. 
147 00:06:31,930 --> 00:06:33,410 So we take that streaming data 148 00:06:33,410 --> 00:06:35,060 and we pass it through Databricks 149 00:06:35,060 --> 00:06:37,360 or Stream Analytics or HDInsight, 150 00:06:37,360 --> 00:06:39,100 and then we're going to push it out, 151 00:06:39,100 --> 00:06:40,890 and it's going to be pushed out into 152 00:06:40,890 --> 00:06:43,430 either a reporting tool like Power BI 153 00:06:43,430 --> 00:06:45,540 or pushed back into a Data Lake 154 00:06:45,540 --> 00:06:47,150 to have something else happen, 155 00:06:47,150 --> 00:06:49,220 or it might be pushed through an Event Hub 156 00:06:49,220 --> 00:06:52,700 to have another action taken on it or even machine learning. 157 00:06:52,700 --> 00:06:55,480 But that's what happens in this step. 158 00:06:55,480 --> 00:06:57,990 So, Stream Analytics, Databricks, 159 00:06:57,990 --> 00:07:00,020 the streaming service, really, 160 00:07:00,020 --> 00:07:03,280 serves the role of Data Factory. 161 00:07:03,280 --> 00:07:04,680 So it kind of takes that data 162 00:07:04,680 --> 00:07:05,990 and it moves it through the system. 163 00:07:05,990 --> 00:07:09,690 And then it also combines that with transformation as well. 164 00:07:09,690 --> 00:07:12,320 So you have both of those things kind of bundled together 165 00:07:12,320 --> 00:07:15,230 in a pipeline that's going to move your data 166 00:07:15,230 --> 00:07:17,390 and clean it and do some other processing 167 00:07:17,390 --> 00:07:19,743 so that it's ready to go on to the next step. 168 00:07:22,300 --> 00:07:24,340 All right, so in review, 169 00:07:24,340 --> 00:07:26,920 we have talked about stream processing. 170 00:07:26,920 --> 00:07:28,800 You need to make sure that you keep in mind 171 00:07:28,800 --> 00:07:31,583 the characteristics of batch versus stream. 172 00:07:32,520 --> 00:07:34,560 Examples and challenges. 173 00:07:34,560 --> 00:07:38,150 So, when would we stream versus when would we batch? 
174 00:07:38,150 --> 00:07:41,290 If you understand those challenges and 175 00:07:41,290 --> 00:07:44,070 you understand the common examples that we're going to see, 176 00:07:44,070 --> 00:07:46,920 you'll do much better when thinking about new projects 177 00:07:46,920 --> 00:07:49,990 and trying to decide which you should use, or defend 178 00:07:49,990 --> 00:07:52,283 which you should use to a stakeholder. 179 00:07:53,270 --> 00:07:56,030 And then finally, Azure technology choices. 180 00:07:56,030 --> 00:07:59,170 So, which services are ideal for stream processing 181 00:07:59,170 --> 00:08:02,190 and which would you see on the DP-203? 182 00:08:02,190 --> 00:08:04,970 Keeping that in mind will also help you maybe knock off 183 00:08:04,970 --> 00:08:07,390 a couple of answers that wouldn't make any sense 184 00:08:07,390 --> 00:08:10,300 as you look through questions on the exam. 185 00:08:10,300 --> 00:08:12,190 All right, that's it for this lesson, 186 00:08:12,190 --> 00:08:13,853 I'll see you in the next.
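
Supplementary sketch for the time-windows challenge mentioned in the lesson: framing an unbounded stream in windows so that sums or averages can be reported. This is only an illustration in plain Python, not Azure Stream Analytics syntax. It assumes one common window type, a fixed-size, non-overlapping ("tumbling") window, which the lesson does not name explicitly. The function name `tumbling_window_averages` and the sample events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average event values per fixed, non-overlapping time window.

    events: iterable of (timestamp_in_seconds, value) pairs (hypothetical format).
    Returns a dict mapping each window's start time to the average value in it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in events:
        # Every event falls into exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Made-up sample readings: (seconds since start, measured value).
events = [(1, 10.0), (4, 20.0), (11, 30.0), (12, 50.0), (25, 5.0)]
result = tumbling_window_averages(events, window_seconds=10)
print(result)
```

With 10-second windows, the first two readings share one window, the next two share the following window, and the last reading sits alone in a third. A real streaming service computes this incrementally as data arrives rather than over a finished list, which is part of why streaming is described in the lesson as more complex than batch.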