1 00:00:00,160 --> 00:00:01,760 Hey, what's up, Gurus? 2 00:00:01,760 --> 00:00:02,820 In this lesson, 3 00:00:02,820 --> 00:00:06,730 we are going to be talking about Azure Databricks. 4 00:00:06,730 --> 00:00:08,560 Specifically, we're going to take a look 5 00:00:08,560 --> 00:00:09,393 at the Azure Databricks notebook, 6 00:00:09,393 --> 00:00:10,226 and I'm going to talk to you 7 00:00:11,730 --> 00:00:15,000 about how you would implement that in Azure. 8 00:00:15,000 --> 00:00:16,840 Now, a couple things to keep in mind. 9 00:00:16,840 --> 00:00:19,120 We are not going to really focus on the code 10 00:00:19,120 --> 00:00:20,560 because you don't need to understand 11 00:00:20,560 --> 00:00:24,480 the code for the DP-203 as it relates to Databricks. 12 00:00:24,480 --> 00:00:27,080 Really, you need to more understand the concepts 13 00:00:27,080 --> 00:00:29,750 and understand what Databricks can be used for, 14 00:00:29,750 --> 00:00:33,330 so don't focus in too much on the code. 15 00:00:33,330 --> 00:00:36,690 Next up, this is more of a demo than the other lessons 16 00:00:36,690 --> 00:00:38,600 that we've been looking at, because, again, 17 00:00:38,600 --> 00:00:40,490 we really just need to understand the basics, 18 00:00:40,490 --> 00:00:42,970 and the quickest way for me to explain the basics to you 19 00:00:42,970 --> 00:00:46,070 is just to hop into the portal and show you how it works. 20 00:00:46,070 --> 00:00:49,713 So with that in mind, let's hop in, take a look. 21 00:00:50,680 --> 00:00:52,510 Alright, so I have opened up 22 00:00:52,510 --> 00:00:55,040 an Azure Databricks workspace. 23 00:00:55,040 --> 00:00:56,730 And the first thing that we want to do 24 00:00:56,730 --> 00:00:59,940 is hop over here into this left panel, 25 00:00:59,940 --> 00:01:02,170 and we're going to make that a little bit bigger, 26 00:01:02,170 --> 00:01:05,950 and let's jump in and take a look at the workspace. 
27 00:01:05,950 --> 00:01:08,270 So we can click on Workspace. 28 00:01:08,270 --> 00:01:10,623 Let's go ahead and click on Users, 29 00:01:12,310 --> 00:01:15,750 and then let's go to Create a Notebook. 30 00:01:15,750 --> 00:01:17,220 And I want to show you what happens. 31 00:01:17,220 --> 00:01:20,120 So let's just call this thing test. 32 00:01:20,120 --> 00:01:23,200 Right away, we can choose a default language. 33 00:01:23,200 --> 00:01:25,220 Now keep in mind, this is the default language. 34 00:01:25,220 --> 00:01:28,240 We can actually switch languages as we move in 35 00:01:28,240 --> 00:01:30,680 from cell to cell in the notebook, 36 00:01:30,680 --> 00:01:32,950 which I'll show you here in a minute. 37 00:01:32,950 --> 00:01:35,150 And then we need to select a cluster. 38 00:01:35,150 --> 00:01:37,060 Hey, we don't have a cluster running. 39 00:01:37,060 --> 00:01:40,010 That's going to be important in order to make a notebook. 40 00:01:40,010 --> 00:01:41,950 So we could go ahead and create it. 41 00:01:41,950 --> 00:01:46,950 But if I did that, it's still going to need a cluster. 42 00:01:47,340 --> 00:01:50,000 So we're going to want to set up our cluster. 43 00:01:50,000 --> 00:01:53,630 And so let me go ahead and jump over and create that now. 44 00:01:53,630 --> 00:01:57,880 So we'll go back to this left panel, scroll down to Compute, 45 00:01:57,880 --> 00:02:01,433 and let's go ahead and create an all-purpose cluster. 46 00:02:02,630 --> 00:02:04,683 We'll just call this, test. 47 00:02:06,120 --> 00:02:08,603 Standard is just fine. 48 00:02:09,580 --> 00:02:11,920 And we can choose our runtime version, 49 00:02:11,920 --> 00:02:15,960 but we're going to leave it as just our standard 9.1. 
50 00:02:16,800 --> 00:02:19,970 We don't want to enable autoscaling for what I'm doing, 51 00:02:19,970 --> 00:02:21,820 but in a production environment, 52 00:02:21,820 --> 00:02:23,970 you would want to enable autoscaling 53 00:02:23,970 --> 00:02:27,970 so that you don't run into issues as you run your notebooks. 54 00:02:27,970 --> 00:02:29,250 And then we can terminate. 55 00:02:29,250 --> 00:02:30,300 And I'm just going to go ahead 56 00:02:30,300 --> 00:02:33,060 and set this to 30 minutes of inactivity, 57 00:02:33,060 --> 00:02:35,900 but you can set that if you are concerned 58 00:02:35,900 --> 00:02:39,450 about the cost of the cluster, which you should be, 59 00:02:39,450 --> 00:02:41,750 unless you are, again, in a production environment, 60 00:02:41,750 --> 00:02:44,450 in which case you might leave it running all the time. 61 00:02:45,430 --> 00:02:48,063 We're going to choose our worker type, 62 00:02:49,300 --> 00:02:53,060 and we can leave that at the DS3_V2. 63 00:02:53,060 --> 00:02:55,300 And I'm going to go ahead and ramp down the workers, 64 00:02:55,300 --> 00:02:57,500 because, again, this is just more of a demo. 65 00:02:59,530 --> 00:03:00,680 And there we go. 66 00:03:00,680 --> 00:03:04,493 So let's go ahead and create that cluster. 67 00:03:06,350 --> 00:03:08,440 So we'll let that spin up, 68 00:03:08,440 --> 00:03:10,690 and I'm going to go ahead and pause the video here, 69 00:03:10,690 --> 00:03:11,730 and I will resume it 70 00:03:11,730 --> 00:03:14,493 as soon as we have our cluster up and running. 71 00:03:20,890 --> 00:03:22,220 Alright, so we are back 72 00:03:22,220 --> 00:03:25,450 and we have our cluster all up and running now. 73 00:03:25,450 --> 00:03:27,100 So let's go ahead and hop back 74 00:03:27,100 --> 00:03:31,520 into our workspace and our notebook. 75 00:03:31,520 --> 00:03:34,283 So I'm going to go ahead and open up my notebook. 
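As a rough sketch of the cluster settings chosen in the demo, the same choices can be written down as a plain Python dictionary. The key names (`cluster_name`, `spark_version`, `node_type_id`, `num_workers`, `autoscale`, `autotermination_minutes`) are modeled on the Databricks cluster JSON, but treat them, and the runtime string in particular, as assumptions to check against the current documentation. The helper `build_cluster_spec` is a hypothetical name used only for illustration, and nothing here calls Azure.

```python
import json


def build_cluster_spec(name, workers, idle_minutes, autoscale_range=None):
    """Describe a cluster like the one configured in the demo (illustrative only)."""
    spec = {
        "cluster_name": name,
        # Assumed version key for the 9.1 runtime chosen in the demo.
        "spark_version": "9.1.x-scala2.12",
        # Worker type left at the default in the demo (DS3_v2).
        "node_type_id": "Standard_DS3_v2",
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }
    if autoscale_range is None:
        # Fixed-size cluster, as used for the demo.
        spec["num_workers"] = workers
    else:
        # Autoscaling cluster, as suggested for production.
        low, high = autoscale_range
        spec["autoscale"] = {"min_workers": low, "max_workers": high}
    return spec


# The demo cluster: one worker, no autoscaling, 30 minutes of inactivity.
demo_spec = build_cluster_spec("test", workers=1, idle_minutes=30)
print(json.dumps(demo_spec, indent=2))
```

Passing something like `autoscale_range=(2, 8)` would swap the fixed worker count for a minimum and maximum, which is the production-style option mentioned in the lesson.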
76 00:03:36,250 --> 00:03:39,750 And if you will notice, up here in the left-hand corner, 77 00:03:39,750 --> 00:03:41,300 it says we are detached. 78 00:03:41,300 --> 00:03:43,420 That is referring to our cluster. 79 00:03:43,420 --> 00:03:44,800 So let's go ahead and click on that 80 00:03:44,800 --> 00:03:47,550 and choose the cluster that we spun up. 81 00:03:47,550 --> 00:03:50,530 So now we are connected and ready to go. 82 00:03:50,530 --> 00:03:53,020 So this is your notebook. 83 00:03:53,020 --> 00:03:56,170 And inside of the notebook, we have cells. 84 00:03:56,170 --> 00:03:59,470 Each one of these cells is essentially 1 line of code, 85 00:03:59,470 --> 00:04:03,020 and we can run these cells individually. 86 00:04:03,020 --> 00:04:06,600 So let me just go ahead and grab a few lines of code. 87 00:04:06,600 --> 00:04:07,970 And we're just going to use 88 00:04:07,970 --> 00:04:11,550 a standard Quickstart notebook section from Databricks, 89 00:04:11,550 --> 00:04:13,460 and we'll run a couple of those. 90 00:04:13,460 --> 00:04:16,290 Now at the very beginning, I want to show you as well 91 00:04:16,290 --> 00:04:18,200 that we can actually change the language 92 00:04:18,200 --> 00:04:20,030 by just using that percent sign 93 00:04:20,030 --> 00:04:21,340 and then typing in the language 94 00:04:21,340 --> 00:04:24,890 that we want to use for this cell. 95 00:04:24,890 --> 00:04:28,113 So let's go ahead and run this cell. 96 00:04:29,290 --> 00:04:33,370 And you can see here that even though this is SQL commands 97 00:04:33,370 --> 00:04:36,820 and we're running it in a Python default notebook, 98 00:04:36,820 --> 00:04:40,670 we can run the SQL command by using this %sql. 99 00:04:40,670 --> 00:04:43,100 So we have now completed this cell 100 00:04:43,100 --> 00:04:45,590 from a Databricks sample notebook, 101 00:04:45,590 --> 00:04:47,990 so you can see some code quickly. 
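To make the percent-sign behavior concrete, here is a toy model of how a notebook could decide which language a cell runs in. A cell can hold several lines, and if its first line is a language magic such as `%sql`, that language wins; otherwise the notebook default applies. This is only a sketch of the idea, not how Databricks implements it. The function `cell_language`, the set `LANGUAGE_MAGICS`, and the table name `demo_table` are hypothetical.

```python
# Language magics assumed for this sketch; Databricks also has other magics
# (for example %md) that are not modeled here.
LANGUAGE_MAGICS = {"python", "sql", "scala", "r"}


def cell_language(cell_text, notebook_default):
    """Return the language a cell would run in under this toy model."""
    first_line = cell_text.lstrip().split("\n", 1)[0].strip()
    if first_line.startswith("%"):
        parts = first_line[1:].split()
        if parts and parts[0].lower() in LANGUAGE_MAGICS:
            return parts[0].lower()
    # No recognized magic on the first line: use the notebook default.
    return notebook_default


sql_cell = "%sql\nSELECT * FROM demo_table"
plain_cell = "SELECT * FROM demo_table"

print(cell_language(sql_cell, "python"))
print(cell_language(plain_cell, "python"))
```

The second cell is the failure case from the demo: the text is SQL, but with no magic on the first line it is treated as Python, the notebook default.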
102 00:04:47,990 --> 00:04:51,150 So we just simply pulled that out, pasted it in here, 103 00:04:51,150 --> 00:04:54,130 changed the language of this cell to SQL, 104 00:04:54,130 --> 00:04:55,360 and we are off and running. 105 00:04:55,360 --> 00:04:57,180 Now you'll note that since this is SQL, 106 00:04:57,180 --> 00:05:00,403 if I was to paste this in here and try and run this, 107 00:05:01,360 --> 00:05:02,590 it's not going to work 108 00:05:02,590 --> 00:05:06,100 because this cell is going to default to Python, 109 00:05:06,100 --> 00:05:08,410 which we can see in this upper right-hand corner. 110 00:05:08,410 --> 00:05:09,720 I can also click on that, 111 00:05:09,720 --> 00:05:12,890 and I could have chosen a different language here as well. 112 00:05:12,890 --> 00:05:15,703 So if I change the language over to SQL, it would run. 113 00:05:16,650 --> 00:05:19,300 So one last thing that I want to show you here 114 00:05:19,300 --> 00:05:20,700 in the notebook. 115 00:05:20,700 --> 00:05:25,283 Let's go ahead and grab, and we'll kill this cell. 116 00:05:26,790 --> 00:05:30,830 Let's go ahead and just run a simple SELECT statement. 117 00:05:30,830 --> 00:05:33,293 And again, we'll switch it to SQL here. 118 00:05:36,510 --> 00:05:37,540 And if we do that, 119 00:05:37,540 --> 00:05:40,623 we can actually begin to visualize our data. 120 00:05:41,510 --> 00:05:42,343 So you can see here 121 00:05:42,343 --> 00:05:44,840 that it just gives us a standard spreadsheet. 122 00:05:44,840 --> 00:05:46,210 I can actually come down here 123 00:05:46,210 --> 00:05:48,020 and I can begin to visualize this 124 00:05:48,020 --> 00:05:50,700 in different ways if I want to. 125 00:05:50,700 --> 00:05:54,530 And so we can start to visualize in different ways. 126 00:05:54,530 --> 00:05:58,100 Let's go back here to our spreadsheet. 
127 00:05:58,100 --> 00:06:01,820 And if I did just a little bit more work here, 128 00:06:01,820 --> 00:06:04,260 I can actually change around 129 00:06:04,260 --> 00:06:08,990 and give myself a little bit better representation 130 00:06:08,990 --> 00:06:10,670 if I want to see things in a bar chart 131 00:06:10,670 --> 00:06:12,570 or a pie chart or something like that. 132 00:06:13,740 --> 00:06:17,060 So this is how we interact with Databricks, 133 00:06:17,060 --> 00:06:19,730 and keep in mind that me changing the languages 134 00:06:19,730 --> 00:06:22,540 is for demonstration purposes only. 135 00:06:22,540 --> 00:06:26,660 If you actually were using every single cell as SQL, 136 00:06:26,660 --> 00:06:27,950 it would make a whole lot more sense 137 00:06:27,950 --> 00:06:31,000 to actually set your notebook to be SQL, 138 00:06:31,000 --> 00:06:32,330 but I wanted to do it this way 139 00:06:32,330 --> 00:06:33,620 to make sure that you could see 140 00:06:33,620 --> 00:06:35,760 how you can change languages. 141 00:06:35,760 --> 00:06:38,780 And so this is a notebook that I have created, 142 00:06:38,780 --> 00:06:40,180 and then we could take this notebook 143 00:06:40,180 --> 00:06:43,180 and I could actually jump into Data Factory, 144 00:06:43,180 --> 00:06:46,040 and I could create an activity in Data Factory 145 00:06:46,040 --> 00:06:47,380 that would tie this notebook, 146 00:06:47,380 --> 00:06:50,160 and it would actually run this notebook 147 00:06:50,160 --> 00:06:52,770 in my Data Factory pipeline. 148 00:06:52,770 --> 00:06:55,610 So we can start to create some very complex scenarios, 149 00:06:55,610 --> 00:06:56,650 and keep in mind, again, 150 00:06:56,650 --> 00:07:00,780 Databricks is really going to be for big data. 151 00:07:00,780 --> 00:07:03,610 That's where you're going to see Databricks used most often. 
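For the Data Factory step mentioned above, a notebook activity in a pipeline is described as JSON. The sketch below builds that shape as a Python dictionary. The activity type `DatabricksNotebook` and the `notebookPath` and `linkedServiceName` fields follow the Data Factory pipeline JSON as best understood here, so verify them against the official reference. The pipeline name, activity name, notebook path, and linked service name are all made-up placeholders.

```python
import json


def notebook_activity(activity_name, notebook_path, linked_service):
    """Build a dict shaped like a Data Factory Databricks notebook activity."""
    return {
        "name": activity_name,
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            # Hypothetical linked service that points at the Databricks workspace.
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            # Workspace path of the notebook to run (placeholder value below).
            "notebookPath": notebook_path,
        },
    }


pipeline = {
    "name": "demo_pipeline",
    "properties": {
        "activities": [
            notebook_activity(
                "RunTestNotebook",
                "/Users/someone@example.com/test",
                "AzureDatabricksLinkedService",
            )
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Nothing is deployed by this snippet; it only shows how the notebook created in the demo would be referenced from a pipeline definition.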
152 00:07:03,610 --> 00:07:05,180 And primarily you're going to use it, 153 00:07:05,180 --> 00:07:09,400 at least for the DP-203, as a transformation tool set. 154 00:07:09,400 --> 00:07:12,320 It's also very big in machine learning and some other areas, 155 00:07:12,320 --> 00:07:14,350 but again, for the DP-203, 156 00:07:14,350 --> 00:07:17,030 think transformation, think big data, 157 00:07:17,030 --> 00:07:20,140 and that's where you're going to see Databricks. 158 00:07:20,140 --> 00:07:21,100 So with that, let's go ahead 159 00:07:21,100 --> 00:07:23,460 and jump in and talk about the review. 160 00:07:23,460 --> 00:07:26,900 So over here on the right, building a data pipeline 161 00:07:26,900 --> 00:07:28,360 is kind of like an onion. 162 00:07:28,360 --> 00:07:30,940 As we start to shave off more and more pieces, 163 00:07:30,940 --> 00:07:33,660 we get layers, layers upon layers, 164 00:07:33,660 --> 00:07:34,910 and so you're going to be interacting 165 00:07:34,910 --> 00:07:36,920 with a lot of different services. 166 00:07:36,920 --> 00:07:40,010 So it's important to think about the overall architecture 167 00:07:40,010 --> 00:07:44,490 before you jump into the weeds of Databricks or Data Factory 168 00:07:44,490 --> 00:07:47,610 or Synapse, or whatever the services are. 169 00:07:47,610 --> 00:07:49,630 Think about what you're trying to accomplish 170 00:07:49,630 --> 00:07:51,060 and the best way to do that, 171 00:07:51,060 --> 00:07:54,800 and then start to build out your building blocks. 172 00:07:54,800 --> 00:07:57,780 These pipelines are going to be absolutely critical 173 00:07:57,780 --> 00:07:59,140 for data engineering. 174 00:07:59,140 --> 00:08:01,470 Also keep in mind, you'll be using pipelines 175 00:08:01,470 --> 00:08:03,240 in Synapse and Data Factory, 176 00:08:03,240 --> 00:08:06,010 and those are very important for the 203. 177 00:08:06,010 --> 00:08:09,150 Don't forget to set your worker options. 
178 00:08:09,150 --> 00:08:11,960 It can be an extremely expensive mistake. 179 00:08:11,960 --> 00:08:14,693 So back here in my cluster, 180 00:08:16,080 --> 00:08:18,300 when I went to Compute, 181 00:08:18,300 --> 00:08:22,480 we went down and we set my workers as 1. 182 00:08:22,480 --> 00:08:24,930 If I was to set that as 10, 183 00:08:24,930 --> 00:08:27,230 I would be getting charged 10 times the amount 184 00:08:27,230 --> 00:08:29,870 of what I'm getting charged now for one. 185 00:08:29,870 --> 00:08:31,050 If you're not careful, 186 00:08:31,050 --> 00:08:33,770 you can make an extremely expensive mistake 187 00:08:33,770 --> 00:08:36,000 by not setting your worker options 188 00:08:36,000 --> 00:08:38,790 and by not setting your terminate 189 00:08:38,790 --> 00:08:41,600 so that your clusters are actually spinning down 190 00:08:41,600 --> 00:08:43,333 after they haven't been used. 191 00:08:44,310 --> 00:08:47,240 Finally, we will take a look at authentication 192 00:08:47,240 --> 00:08:49,430 a little bit later on in the course. 193 00:08:49,430 --> 00:08:52,170 I believe that comes up in section 10. 194 00:08:52,170 --> 00:08:53,560 So don't worry too much 195 00:08:53,560 --> 00:08:55,290 about the security and authentication. 196 00:08:55,290 --> 00:08:57,680 Those are things that we'll be covering later. 197 00:08:57,680 --> 00:09:00,580 The big takeaway for Databricks 198 00:09:00,580 --> 00:09:03,940 is just understanding the impact of clusters 199 00:09:03,940 --> 00:09:05,840 and then understanding the basic principles 200 00:09:05,840 --> 00:09:09,630 of how you would implement a notebook in Azure Databricks. 201 00:09:09,630 --> 00:09:11,710 All right, that's it for this lesson. 202 00:09:11,710 --> 00:09:13,023 I'll see you in the next.
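The cost point from the review can be put into rough numbers. This is back-of-the-envelope arithmetic, not Azure pricing: the hourly rate is an invented placeholder, the driver node is left out, and `billed_hours` and `worker_cost` are hypothetical helpers. It shows the two levers from the lesson, worker count and the inactivity timeout.

```python
def billed_hours(active_hours, idle_hours, autoterminate_hours=None):
    """Hours a cluster stays up: active time plus idle time until termination."""
    if autoterminate_hours is None:
        # No auto-termination: the cluster keeps running through all idle time.
        return active_hours + idle_hours
    return active_hours + min(idle_hours, autoterminate_hours)


def worker_cost(num_workers, hours, rate_per_worker_hour):
    """Worker-node cost only; the driver node is not included in this sketch."""
    return num_workers * hours * rate_per_worker_hour


RATE = 0.5  # hypothetical price per worker per hour, not a real Azure rate

# Two hours of real work followed by six idle hours.
hours_without_timeout = billed_hours(2, 6)
hours_with_timeout = billed_hours(2, 6, autoterminate_hours=0.5)  # 30 minutes

for workers in (1, 10):
    always_on = worker_cost(workers, hours_without_timeout, RATE)
    with_timeout = worker_cost(workers, hours_with_timeout, RATE)
    print(workers, "worker(s):", always_on, "without timeout,", with_timeout, "with timeout")
```

Under these assumptions, ten workers cost ten times as much as one for the same hours, and the 30-minute timeout cuts the billed time from eight hours to two and a half.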