Hey Cloud Gurus, welcome back. In this lesson, we're going to talk about creating data pipelines.

To start off with, we'll do a little bit of an Azure Data Factory recap. You've already been introduced to the components of pipelines, so that will be a quick stop before we jump straight into the demo. Then we'll take a hands-on look at these in order to tie all the concepts together. The goal is to run through the process of constructing a pipeline so that you're familiar with it, have an idea of what you can accomplish with it, and can see it end to end. We'll then wrap everything up with a review.

As you'll remember from when Brian went over Azure Data Factory, pipelines are logical groupings of activities. These activities can consist of many different action types, including data movement, transformation, and control. We use datasets to interact with the data itself, and we populate those datasets by using linked services that connect to the various source types. In this video, we're looking specifically at the pipelines, where we tie all of this together and make it a functional workflow.

So without further ado, let's jump into the Azure portal and take a look at constructing data pipelines.

Here we are in the Azure portal. I've simply gone to my Azure Data Factory named awesome-company-adf and opened up Azure Data Factory Studio. We're here on the Author tab, where we can have a look at our pipelines, datasets, data flows, and so on. You can see I have no pipelines yet, but there are multiple datasets. The scenario we're going to work from for this demo is that we have nightly loads and transformations from our production environment to our dev environment for some of our sales data. To that end, there are a number of datasets already created, some pointing to production tables and some to development tables. You can see these link back to Azure SQL Database linked services. But let's collapse this for now, and let's create a pipeline.
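A quick aside: everything we click through in Azure Data Factory Studio during this demo can also be scripted. As a point of reference, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that connects to the factory and lists its existing datasets. The subscription ID and resource group name are placeholders; only the factory name comes from the demo.

```python
# Minimal sketch: connect to a Data Factory and list its datasets.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "awesome-company-rg"                      # hypothetical name
FACTORY_NAME = "awesome-company-adf"                       # from the demo

# DefaultAzureCredential resolves az login, environment variables, or managed identity.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# These are the datasets we just saw on the Author tab.
for dataset in adf_client.datasets.list_by_factory(RESOURCE_GROUP, FACTORY_NAME):
    print(dataset.name)
```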
We can name it something like NightlyLoad, and within here, let's do a couple of copies. Let's say that we want to copy data from production to development every night. Maybe we want to copy customer data.

I'll go ahead and collapse this to give us more room, and under Source, we can specify our dataset. Like I mentioned, I have several already, but for this one, let's pull from Customer_Prod. For our sink, it's going to go into the dataset Customer_Dev. Since that table does not yet exist there, I'm going to say auto create table. And so this one activity is taking care of the task of loading the customer table from prod and placing it in dev.

But the beauty of pipelines is that we don't have to manage every little thing individually. We can roll up groups of tasks or activities into one pipeline. So let's copy another table, move that over, and make it nice and neat. Let's call this one SalesOrderDetail Copy, performing a very similar task, but this time choosing a different dataset, the one for SalesOrderDetail, pulling from prod and going to dev, and, again, creating the table.

So now every night we're copying both the Customer and SalesOrderDetail tables, but what if we have more complex scenarios than that as part of our nightly load? That's completely okay. Let's grab a data flow over here and toss it in. Notice I don't have these chained together, which means they'll operate independently of one another, but I could chain activities together as well.

Let's call this data flow Product Transform. For our product table, we're going to do a little bit of a transformation instead of just straight copying the data. Under Settings, let's do a new data flow, and I'll dismiss this message. Let's add the source and select the dataset Product_Prod. From there, we can chain new transformations, such as an aggregate. Let's say I want to aggregate some values for this table.
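Before we configure the aggregate, here is a hedged sketch of how those two copy activities could be defined in code with the azure-mgmt-datafactory Python SDK instead of the Studio canvas. The SalesOrderDetail dataset names and the resource group are assumptions for illustration, and model signatures can differ slightly between SDK versions.

```python
# Sketch: the NightlyLoad pipeline with two prod-to-dev copy activities.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlSink, AzureSqlSource, CopyActivity, DatasetReference, PipelineResource,
)

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "awesome-company-rg"                      # hypothetical name
FACTORY_NAME = "awesome-company-adf"

def copy_table(name: str, source_ds: str, sink_ds: str) -> CopyActivity:
    """One whole-table copy; table_option mirrors the auto create table setting."""
    return CopyActivity(
        name=name,
        inputs=[DatasetReference(type="DatasetReference", reference_name=source_ds)],
        outputs=[DatasetReference(type="DatasetReference", reference_name=sink_ds)],
        source=AzureSqlSource(),
        sink=AzureSqlSink(table_option="autoCreate"),
    )

pipeline = PipelineResource(activities=[
    copy_table("Customer Copy", "Customer_Prod", "Customer_Dev"),
    # Dataset names below are assumed; the demo only names the tables.
    copy_table("SalesOrderDetail Copy", "SalesOrderDetail_Prod", "SalesOrderDetail_Dev"),
])

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "NightlyLoad", pipeline)
```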
Let's choose Aggregate, and for the column, StandardCost. For my expression, I can use the expression builder. Let's grab a function, such as average, and then throw in the column StandardCost. Save and finish. So that's going to average our StandardCost column, and we can pick a group-by column; let's say Color.

Of course, that's going to need somewhere to go, so let's click the plus again and go all the way down to Sink. For this, I'll choose the dataset Product_Avg. Let's jump over to the settings for that, and I'm going to tell it to recreate the table every time, because it doesn't currently exist. I can also give the sink a better name than sink1; let's say Product_Avg. And our source was Product.

All right, let's pop out of here for a second, over to our pipeline tab, and take a look at what we have. We have 2 different copy activities for 2 different tables, each coming from our production environment and copying the entire table down to dev. And then we also have a data flow that is not only copying the table, but performing a transformation by aggregating the standard cost, grouping by color, and placing the result in another dev table.

Of course, before we can use the pipeline, we need to publish it. So let's click Publish All, and publish. That'll take just a moment, and it will save all of our changes: our new pipeline, our new data flow, and all of the added components.

As mentioned, one of the beautiful things about pipelines is that these activities can be run together, either on demand or on a schedule. So we could add a trigger for a later time, but let's go ahead and run them together now and click OK. We can monitor that over on the Monitor tab; it will update automatically, or you can hit Refresh to check on the progress. I'll pause the video here, and I'll be right back with you when it's completed.
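While we wait, here is a sketch of the same run-and-monitor step done from Python: create_run is the on-demand equivalent of Trigger now, and polling the run status mirrors refreshing the Monitor tab. The resource group is again a hypothetical name.

```python
# Sketch: trigger the published NightlyLoad pipeline and poll until it finishes.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "awesome-company-rg"                      # hypothetical name
FACTORY_NAME = "awesome-company-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Equivalent of Add trigger > Trigger now in the Studio.
run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "NightlyLoad")

# Poll until the run reaches a terminal state, like refreshing the Monitor tab.
while True:
    status = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status
    print(f"Run {run.run_id}: {status}")
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
```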
And we're back. As you can see, it has successfully completed; it took about 4 minutes and 34 seconds. So we have successfully created a data pipeline and executed it.

Let's pop back over to our Author tab. Remember, you can chain any number of activities together to perform whatever loads and transformations you want, but I wanted you to have a full picture of creating a pipeline, putting some activities in it, and using your datasets and linked services as we've talked about.

By way of review, data pipelines are logical groupings of activities that together accomplish a task. No matter what that task is, whether loading your development environment or setting up an ETL/ELT solution for downstream data analysis, you can configure and chain together activities in a multitude of ways to create the types of pipelines that you need. This allows you to manage activities as a whole, rather than individually. As you saw, we had 3 different tasks being executed as a unit: copying 2 different tables and then transforming another. This is great for deploying and scheduling them as a unit that makes sense with your business needs, rather than having to tediously run every activity on its own.

You can use pipelines to construct end-to-end, data-driven workflows for your data movement and data processing solutions. And keep in mind that this doesn't apply only to Azure Data Factory; you can create pipelines in a very similar fashion in Azure Synapse Analytics as well, if that's the tool you're working in for an analytics project.

That's it for this lesson. Thank you for joining me. I hope this helps you get familiar with what it looks like to create a data pipeline. Be sure to fire up the Cloud Playground, go in, and experiment on your own. Nothing beats hands-on practice.
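If you want a concrete exercise, try scripting the "nightly" part itself. The sketch below attaches a daily schedule trigger to the NightlyLoad pipeline; the trigger name, start time, and resource group are made up for illustration, and method names such as begin_start vary between azure-mgmt-datafactory versions, so treat this as a starting point rather than the definitive recipe.

```python
# Sketch: schedule NightlyLoad to run once a day.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "awesome-company-rg"                      # hypothetical name
FACTORY_NAME = "awesome-company-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

nightly = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",   # once per day...
        interval=1,
        start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),  # ...from 02:00 UTC
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="NightlyLoad"),
    )],
)

client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "NightlyTrigger", TriggerResource(properties=nightly))
# Triggers are created stopped; start it so the schedule takes effect.
client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, "NightlyTrigger").result()
```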
Create pipelines, datasets, and linked services to your heart's content until it's second nature, and before too long, you're going to be a data transformation master.

All right, that's it, Gurus. When you're ready, I'll see you in the next one.