1 00:00:00,240 --> 00:00:01,080 In this lesson, 2 00:00:01,080 --> 00:00:04,404 we are going to talk about shuffling in a pipeline. 3 00:00:04,404 --> 00:00:05,840 And we're going to start off 4 00:00:05,840 --> 00:00:07,700 by talking about distributed tables, 5 00:00:07,700 --> 00:00:09,740 and we're going to take a look at hash distribution 6 00:00:09,740 --> 00:00:11,450 and round-robin distribution. 7 00:00:11,450 --> 00:00:12,290 And then we're going to follow up 8 00:00:12,290 --> 00:00:14,440 by taking a look at how we set partition types 9 00:00:14,440 --> 00:00:15,653 in the portal. 10 00:00:16,750 --> 00:00:19,870 So let's start by talking about table distribution. 11 00:00:19,870 --> 00:00:22,460 There are several different types of table distribution. 12 00:00:22,460 --> 00:00:25,630 There's round-robin, hash, dynamic range, 13 00:00:25,630 --> 00:00:27,580 fixed range, and key. 14 00:00:27,580 --> 00:00:30,640 Now, this is specific to Data Factory, 15 00:00:30,640 --> 00:00:32,630 which is the service that we're looking at here. 16 00:00:32,630 --> 00:00:35,350 This is actually an image from Data Factory, 17 00:00:35,350 --> 00:00:37,470 and we'll see this later in the portal. 18 00:00:37,470 --> 00:00:40,300 But we're going to be talking through these partition types, 19 00:00:40,300 --> 00:00:41,280 and it's also important 20 00:00:41,280 --> 00:00:43,270 to talk a little bit about shuffling. 21 00:00:43,270 --> 00:00:46,610 So shuffling is just whenever we need to move data 22 00:00:46,610 --> 00:00:50,210 from point A to point B, or we need to change attributes. 23 00:00:50,210 --> 00:00:52,400 We shuffle that data around, 24 00:00:52,400 --> 00:00:54,410 and the way that we shuffle that data around, 25 00:00:54,410 --> 00:00:56,393 that's our table distribution. 26 00:00:57,820 --> 00:01:01,160 So let's start off by talking about hash distribution. 27 00:01:01,160 --> 00:01:05,080 A hash function is used to assign rows to partitions. 
28 00:01:05,080 --> 00:01:06,800 Think about it kind of like a map. 29 00:01:06,800 --> 00:01:10,160 So, it's the key or the information that you need 30 00:01:10,160 --> 00:01:13,060 to figure out how to get from point A to point B. 31 00:01:13,060 --> 00:01:14,540 When we're talking about data, 32 00:01:14,540 --> 00:01:16,150 the hash function 33 00:01:16,150 --> 00:01:18,120 is the information or the key 34 00:01:18,120 --> 00:01:20,160 to help us figure out how to move data 35 00:01:20,160 --> 00:01:21,720 from our starting point, 36 00:01:21,720 --> 00:01:23,100 and then where we're going to put it 37 00:01:23,100 --> 00:01:25,663 as we move data into the new table. 38 00:01:26,870 --> 00:01:29,440 Essentially, it's a fancy math algorithm. 39 00:01:29,440 --> 00:01:32,300 So let's take a look at how this works. 40 00:01:32,300 --> 00:01:34,747 So we have 3 input sources, 41 00:01:34,747 --> 00:01:39,747 and from those input sources, we are going to move our data. 42 00:01:40,210 --> 00:01:41,500 And what we're going to do 43 00:01:41,500 --> 00:01:43,190 is we're going to take those blue boxes there, 44 00:01:43,190 --> 00:01:44,023 that's our input, 45 00:01:44,023 --> 00:01:46,210 and we're going to move it into those 9 boxes 46 00:01:46,210 --> 00:01:48,390 over there on the right. 47 00:01:48,390 --> 00:01:51,810 Now, the purple box represents our hash function. 48 00:01:51,810 --> 00:01:55,050 So as the data moves from the left to the right, 49 00:01:55,050 --> 00:01:58,100 it's going to encounter that purple box. 50 00:01:58,100 --> 00:01:59,770 And that is our math algorithm 51 00:01:59,770 --> 00:02:02,810 that's going to tell us which box over there on the right 52 00:02:02,810 --> 00:02:04,450 we need to move into. 53 00:02:04,450 --> 00:02:07,230 Now, hash functions are really important 54 00:02:07,230 --> 00:02:11,000 because they're going to give you fast querying. 
55 00:02:11,000 --> 00:02:13,087 It takes a little longer to load data 56 00:02:13,087 --> 00:02:16,520 into a hash distribution, but it's going to be fast to query 57 00:02:16,520 --> 00:02:19,830 because we know where the data is. 58 00:02:19,830 --> 00:02:21,030 Because of that algorithm, 59 00:02:21,030 --> 00:02:23,230 we know where that data is going to be located. 60 00:02:23,230 --> 00:02:24,810 However, it's going to take a little bit longer 61 00:02:24,810 --> 00:02:27,920 because the input has to go through that hash 62 00:02:27,920 --> 00:02:30,663 before it can come out and go to the right spot. 63 00:02:32,020 --> 00:02:32,853 The other side of that 64 00:02:32,853 --> 00:02:35,750 is looking at a round-robin distribution. 65 00:02:35,750 --> 00:02:37,630 In a round-robin distribution, 66 00:02:37,630 --> 00:02:38,980 we're going to have first random, 67 00:02:38,980 --> 00:02:42,040 and then sequential movement of our data. 68 00:02:42,040 --> 00:02:43,880 So you can see over there on the right, 69 00:02:43,880 --> 00:02:46,990 we basically put an X over our hash function. 70 00:02:46,990 --> 00:02:48,480 There is no more hash function, 71 00:02:48,480 --> 00:02:50,870 we're going to randomly assign the data 72 00:02:50,870 --> 00:02:52,700 into those 9 buckets. 73 00:02:52,700 --> 00:02:54,220 Now, when I say random, 74 00:02:54,220 --> 00:02:56,640 we are going to have an even distribution, 75 00:02:56,640 --> 00:02:59,480 so it's not like 1 bucket's going to get all of the data. 76 00:02:59,480 --> 00:03:01,070 It's going to be evenly distributed 77 00:03:01,070 --> 00:03:03,230 across all 9 of those boxes. 78 00:03:03,230 --> 00:03:06,770 However, it's going to be randomly distributed at first. 79 00:03:06,770 --> 00:03:10,310 Now, that is good because we're going to get faster loads, 80 00:03:10,310 --> 00:03:13,220 because we don't have to move through that hash function. 
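To make the difference between those two concrete, here is a minimal Python sketch of hash and round-robin assignment into 9 partitions. This is not what Data Factory runs internally: CRC32 is only a stand-in for whatever hash the service uses, and the names `hash_partition`, `round_robin_partitions`, and the `customer-N` sample keys are made up for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 9  # the 9 boxes on the right of the diagram


def hash_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def round_robin_partitions(rows, num_partitions=NUM_PARTITIONS, seed=None):
    """Pick a random starting partition, then hand out rows in sequence."""
    start = random.Random(seed).randrange(num_partitions)
    return [((start + i) % num_partitions, row) for i, row in enumerate(rows)]


rows = [f"customer-{i}" for i in range(90)]  # hypothetical sample keys

# Hash: the same key always lands in the same partition, so a lookup
# only has to check one place.
hashed = [(hash_partition(row), row) for row in rows]

# Round-robin: no hash to compute and an even spread, but finding a given
# key later means searching every partition.
spread = Counter(p for p, _ in round_robin_partitions(rows, seed=0))
print(sorted(spread.values()))
```

The hash version does a little work per row on the way in, which is the slower load. The round-robin version skips that work, and the row's contents never influence where it ends up.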
81 00:03:13,220 --> 00:03:15,500 However, it's going to take longer to query 82 00:03:15,500 --> 00:03:18,670 because we don't really know where that data is exactly. 83 00:03:18,670 --> 00:03:21,220 And so it's going to take longer to search through 84 00:03:21,220 --> 00:03:24,393 to get to the data that we need when we're running queries. 85 00:03:25,270 --> 00:03:27,290 So in addition to hash and round-robin, 86 00:03:27,290 --> 00:03:30,260 there are a few more distribution types. 87 00:03:30,260 --> 00:03:33,560 There are fixed, dynamic, and key. 88 00:03:33,560 --> 00:03:35,020 So fixed range 89 00:03:35,020 --> 00:03:39,860 is just simply: we are going to set the number of partitions 90 00:03:39,860 --> 00:03:41,150 that we want, 91 00:03:41,150 --> 00:03:44,740 and we are going to use a custom expression, 92 00:03:44,740 --> 00:03:45,970 and that custom expression 93 00:03:45,970 --> 00:03:49,200 is going to be determining how that data gets loaded 94 00:03:49,200 --> 00:03:51,240 into those partitions. 95 00:03:51,240 --> 00:03:53,530 Now, dynamic is going to be very similar to that, 96 00:03:53,530 --> 00:03:55,870 except you're going to have the system determine 97 00:03:55,870 --> 00:03:57,420 the ranges for your data, 98 00:03:57,420 --> 00:03:59,640 rather than it being based upon conditions 99 00:03:59,640 --> 00:04:01,200 that you're setting. 100 00:04:01,200 --> 00:04:03,700 The last one down there is key. 101 00:04:03,700 --> 00:04:05,220 Key distribution is just simply: 102 00:04:05,220 --> 00:04:06,960 you're going to pick a key column, 103 00:04:06,960 --> 00:04:08,470 and it's going to do distribution 104 00:04:08,470 --> 00:04:11,060 based upon that key column. 105 00:04:11,060 --> 00:04:13,470 So with that, let's go ahead and jump into the portal, 106 00:04:13,470 --> 00:04:16,660 and I'll show you how we set distribution. 
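Before the portal walkthrough, the three remaining types can be sketched the same way. This is a rough Python illustration under stated assumptions, not Data Factory's implementation: the function names are hypothetical, the "custom expression" is reduced to a list of range boundaries you supply, the dynamic version simply derives boundaries from the sorted data, and the key version assumes one partition per distinct key value.

```python
from bisect import bisect_right
from collections import defaultdict


def fixed_range_partition(value, boundaries):
    """Fixed range: you supply the boundaries, e.g. [10, 20] -> 3 partitions."""
    return bisect_right(boundaries, value)


def dynamic_boundaries(values, num_partitions):
    """Dynamic range: boundaries come from the data, not from you."""
    ordered = sorted(values)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def key_partitions(rows, key_column):
    """Key: group rows by the value in the chosen key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_column]].append(row)
    return dict(partitions)


# Hypothetical sample rows for the key example.
orders = [
    {"region": "east", "id": 1},
    {"region": "west", "id": 2},
    {"region": "east", "id": 3},
]
by_region = key_partitions(orders, "region")
auto_bounds = dynamic_boundaries(range(100), 4)
```

The only real difference between the fixed and dynamic sketches is who decides the boundaries: in the first you pass them in, in the second they are computed from the values themselves.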
107 00:04:16,660 --> 00:04:19,130 So here, we find ourselves in Data Factory, 108 00:04:19,130 --> 00:04:21,760 in Data Factory flows, specifically. 109 00:04:21,760 --> 00:04:24,110 And I have set up just a very simple flow; 110 00:04:24,110 --> 00:04:25,970 we have our input source, 111 00:04:25,970 --> 00:04:29,060 and then we have our output or our sink. 112 00:04:29,060 --> 00:04:33,210 If we go to our sink and look down here at the bottom, 113 00:04:33,210 --> 00:04:35,380 there is an Optimize tab, 114 00:04:35,380 --> 00:04:38,480 and you'll see that I can choose the partition type 115 00:04:38,480 --> 00:04:39,590 that I want. 116 00:04:39,590 --> 00:04:43,730 So I just simply choose the current partitioning 117 00:04:43,730 --> 00:04:46,880 or I set the partitioning as round-robin, 118 00:04:46,880 --> 00:04:49,590 or key, or fixed, right? 119 00:04:49,590 --> 00:04:50,830 That's literally it. 120 00:04:50,830 --> 00:04:52,320 When I run this data flow, 121 00:04:52,320 --> 00:04:55,130 it's going to change my partition type 122 00:04:55,130 --> 00:04:56,918 or it's going to use the partition type 123 00:04:56,918 --> 00:05:00,333 that is currently there from the source data. 124 00:05:00,333 --> 00:05:02,250 All right, so with that, 125 00:05:02,250 --> 00:05:04,450 let's go ahead and kind of wrap this lesson up 126 00:05:04,450 --> 00:05:06,090 with a few key points. 127 00:05:06,090 --> 00:05:08,720 You do need to know the basic distribution types, 128 00:05:08,720 --> 00:05:11,560 especially round-robin and hash. 129 00:05:11,560 --> 00:05:13,610 Those are very common across Azure, 130 00:05:13,610 --> 00:05:15,600 so make sure that you know those. 131 00:05:15,600 --> 00:05:17,530 And you need to understand that shuffling 132 00:05:17,530 --> 00:05:20,740 has a very large effect on performance. 
133 00:05:20,740 --> 00:05:22,470 When you're moving data around, 134 00:05:22,470 --> 00:05:24,590 that's going to affect the cost, 135 00:05:24,590 --> 00:05:28,120 and that's going to affect your performance. 136 00:05:28,120 --> 00:05:30,450 So don't neglect optimization. 137 00:05:30,450 --> 00:05:31,743 It can make a huge difference 138 00:05:31,743 --> 00:05:36,210 when you set your distribution appropriately. 139 00:05:36,210 --> 00:05:37,820 All right, that's it for this lesson. 140 00:05:37,820 --> 00:05:39,070 I'll see you in the next.