1 00:00:00,610 --> 00:00:01,530 Hey, Cloud Gurus. 2 00:00:01,530 --> 00:00:04,063 Welcome to this lesson on distributing data. 3 00:00:06,110 --> 00:00:07,610 In this lesson, we're going to start off 4 00:00:07,610 --> 00:00:09,830 by introducing distribution. 5 00:00:09,830 --> 00:00:11,680 We're then going to talk about the different types 6 00:00:11,680 --> 00:00:16,680 such as round robin, hash-distributed, and replicated, 7 00:00:16,840 --> 00:00:19,123 and wrap everything up in a review. 8 00:00:22,570 --> 00:00:23,403 To start off, 9 00:00:23,403 --> 00:00:25,873 let's introduce what we mean by distributing data. 10 00:00:27,040 --> 00:00:29,000 Azure Synapse Analytics spreads your data 11 00:00:29,000 --> 00:00:32,983 across 60 underlying databases or distributions. 12 00:00:33,830 --> 00:00:35,400 This is done for you automatically 13 00:00:35,400 --> 00:00:39,150 as part of being a massively parallel processing system. 14 00:00:39,150 --> 00:00:40,500 And it's something that you don't manage 15 00:00:40,500 --> 00:00:41,973 or even have access to. 16 00:00:43,020 --> 00:00:45,000 This is not the same as partitioning, 17 00:00:45,000 --> 00:00:46,970 which is more about grouping similar data 18 00:00:46,970 --> 00:00:48,800 to be accessed together. 19 00:00:48,800 --> 00:00:50,210 Distribution can be thought of more 20 00:00:50,210 --> 00:00:52,940 as a hardware storage concept. 21 00:00:52,940 --> 00:00:54,800 The goal behind it is to distribute 22 00:00:54,800 --> 00:00:58,370 in such a way that data movement is minimized. 23 00:00:58,370 --> 00:01:00,040 And what we mean when we say data movement, 24 00:01:00,040 --> 00:01:03,320 is shuffling data around and fetching it across the network 25 00:01:03,320 --> 00:01:05,950 from different distributions in order to satisfy 26 00:01:05,950 --> 00:01:09,000 the results of a query, such as a join. 27 00:01:09,000 --> 00:01:10,780 If we don't get this right, then it can lead 28 00:01:10,780 --> 00:01:14,163 to negative effects such as network inefficiencies, 29 00:01:15,060 --> 00:01:19,440 increased I/O, and ultimately, slow queries. 30 00:01:19,440 --> 00:01:20,790 And so this is definitely something 31 00:01:20,790 --> 00:01:23,103 we want to take advantage of and get right. 32 00:01:25,650 --> 00:01:27,710 There are a few different types of distribution 33 00:01:27,710 --> 00:01:29,150 available to you. 34 00:01:29,150 --> 00:01:31,940 As you may have guessed from the introduction slide, 35 00:01:31,940 --> 00:01:34,260 we have round-robin-distributed tables, 36 00:01:34,260 --> 00:01:36,510 and that is where the data is distributed evenly 37 00:01:36,510 --> 00:01:37,923 in a random fashion. 38 00:01:39,070 --> 00:01:40,840 Hash-distributed tables, 39 00:01:40,840 --> 00:01:43,100 where data is distributed deterministically 40 00:01:43,100 --> 00:01:44,773 by using a hash function. 41 00:01:45,950 --> 00:01:49,720 And replicated, where we place a full copy of the table 42 00:01:49,720 --> 00:01:51,503 on every single compute node. 43 00:01:53,870 --> 00:01:56,640 Let's start off with round-robin-distributed tables. 44 00:01:56,640 --> 00:01:58,730 This is the easiest way to get started, 45 00:01:58,730 --> 00:02:01,513 and it's the default option if you don't specify one. 46 00:02:02,660 --> 00:02:03,860 As we mentioned before, 47 00:02:03,860 --> 00:02:07,080 this is an even distribution across the databases. 48 00:02:07,080 --> 00:02:09,110 It assigns rows randomly at first 49 00:02:09,110 --> 00:02:10,933 and then sequentially thereafter. 50 00:02:11,780 --> 00:02:13,500 That's a very efficient way to get them there, 51 00:02:13,500 --> 00:02:14,870 but it also means that the data 52 00:02:14,870 --> 00:02:17,890 on any particular distribution may have no relation 53 00:02:17,890 --> 00:02:20,053 to the other data that is there with it. 54 00:02:22,020 --> 00:02:24,420 This gives us fast performance for the loads. 55 00:02:24,420 --> 00:02:27,360 As we said, there's very little work in row assignment. 56 00:02:27,360 --> 00:02:30,030 And so that makes it great for things like staging tables, 57 00:02:30,030 --> 00:02:32,000 where you just want to load the data in there 58 00:02:32,000 --> 00:02:33,163 as quickly as possible. 59 00:02:34,480 --> 00:02:37,550 You can, however, have slower read performance. 60 00:02:37,550 --> 00:02:39,360 That's because there is a great potential 61 00:02:39,360 --> 00:02:41,640 for data movement here because the data 62 00:02:41,640 --> 00:02:44,040 is not placed in logical groupings. 63 00:02:44,040 --> 00:02:46,320 So when it goes to satisfy a query, 64 00:02:46,320 --> 00:02:48,010 there's a good chance it's going to have to pull 65 00:02:48,010 --> 00:02:49,940 from multiple distributions 66 00:02:49,940 --> 00:02:52,043 in order to obtain the query results. 67 00:02:53,890 --> 00:02:55,550 This type of distribution is best 68 00:02:55,550 --> 00:02:58,600 for when there is no clear distribution key. 69 00:02:58,600 --> 00:03:00,500 If you don't have something specific 70 00:03:00,500 --> 00:03:02,750 that would work really well for hash-distributed, 71 00:03:02,750 --> 00:03:04,603 then round robin is a great option. 72 00:03:05,930 --> 00:03:08,620 If you don't have frequent joins with other tables 73 00:03:08,620 --> 00:03:09,770 and therefore aren't pulling 74 00:03:09,770 --> 00:03:11,530 from a bunch of different areas, 75 00:03:11,530 --> 00:03:12,830 then it will be just fine. 76 00:03:14,380 --> 00:03:16,630 If you want a uniform distribution, 77 00:03:16,630 --> 00:03:18,737 round robin accomplishes that for you. 78 00:03:20,400 --> 00:03:22,810 As we said, staging tables are great. 79 00:03:22,810 --> 00:03:25,160 It's going to quickly load that information in there 80 00:03:25,160 --> 00:03:27,610 as it randomly throws it wherever it needs to go. 81 00:03:29,570 --> 00:03:32,700 And if you just need a simple starting point, 82 00:03:32,700 --> 00:03:34,560 let's say that you're just beginning to experiment 83 00:03:34,560 --> 00:03:37,270 with distribution, give round robin a try. 84 00:03:37,270 --> 00:03:39,680 If it suits your needs, then you're off to the races. 85 00:03:39,680 --> 00:03:41,270 But if you determine later 86 00:03:41,270 --> 00:03:43,250 that a different strategy is better, 87 00:03:43,250 --> 00:03:44,980 you can always start again. 88 00:03:44,980 --> 00:03:46,600 But it's great for getting started quickly 89 00:03:46,600 --> 00:03:47,693 with distribution. 90 00:03:50,980 --> 00:03:53,360 Our next type is hash-distributed. 91 00:03:53,360 --> 00:03:54,810 And this is like creating signs 92 00:03:54,810 --> 00:03:56,560 that tell the rows which way to go. 93 00:03:57,740 --> 00:03:59,930 It is deterministic, 94 00:03:59,930 --> 00:04:01,740 and that means data is distributed 95 00:04:01,740 --> 00:04:04,430 by the hash value of a column. 96 00:04:04,430 --> 00:04:06,540 This is a column that you choose, 97 00:04:06,540 --> 00:04:09,680 and so rows with the same data value and data type 98 00:04:09,680 --> 00:04:12,460 for the hash key will always be assigned 99 00:04:12,460 --> 00:04:14,290 to the same distribution. 100 00:04:14,290 --> 00:04:16,683 This is because their hash value is the same. 101 00:04:18,760 --> 00:04:21,640 The distribution column cannot be changed later, 102 00:04:21,640 --> 00:04:22,720 so once you've chosen it, 103 00:04:22,720 --> 00:04:26,053 you would have to recreate the table in order to change it. 104 00:04:27,010 --> 00:04:28,710 For that reason and others, 105 00:04:28,710 --> 00:04:31,870 it's important to select a column wisely. 106 00:04:31,870 --> 00:04:34,440 Remember, you're trying to minimize data movement 107 00:04:34,440 --> 00:04:36,593 while also distributing evenly. 108 00:04:37,620 --> 00:04:40,823 And so a few tips for how to choose a great column. 109 00:04:42,200 --> 00:04:44,850 Pick one that has many unique values. 110 00:04:44,850 --> 00:04:46,430 This is going to help spread those rows 111 00:04:46,430 --> 00:04:47,853 to many distributions. 112 00:04:49,240 --> 00:04:52,270 Pick one that has few or no nulls. 113 00:04:52,270 --> 00:04:55,630 Since null values all evaluate to the same hash value, 114 00:04:55,630 --> 00:04:58,700 they're going to get dumped into the same bucket. 115 00:04:58,700 --> 00:05:02,173 And so you could end up with very unevenly distributed rows. 116 00:05:03,890 --> 00:05:05,910 Very similar for date columns. 117 00:05:05,910 --> 00:05:09,160 You don't want to create hotspots if you have a lot of data 118 00:05:09,160 --> 00:05:11,820 that's on one day and not on another. 119 00:05:11,820 --> 00:05:14,313 And so it's probably best to avoid date columns. 120 00:05:16,470 --> 00:05:20,090 Hash-distributed is best for large tables. 121 00:05:20,090 --> 00:05:23,310 And when we say large, we mean that the table size on disk 122 00:05:23,310 --> 00:05:25,920 is larger than 2 gigabytes. 123 00:05:25,920 --> 00:05:28,830 Examples of these are large facts tables, 124 00:05:28,830 --> 00:05:30,880 historical transaction tables, 125 00:05:30,880 --> 00:05:32,733 and even large dimension tables. 126 00:05:34,650 --> 00:05:36,060 It's also great for tables 127 00:05:36,060 --> 00:05:39,133 with frequent inserts, updates, and deletes. 128 00:05:42,130 --> 00:05:43,440 Here, we have a quick example 129 00:05:43,440 --> 00:05:46,150 of creating a table with distribution. 130 00:05:46,150 --> 00:05:48,610 You can see we have a number of columns. 131 00:05:48,610 --> 00:05:52,360 And inside the WITH statement, we set our distribution, 132 00:05:52,360 --> 00:05:55,450 in this case, equal to Hash. 133 00:05:55,450 --> 00:05:57,310 And we're using the product Key 134 00:05:57,310 --> 00:05:59,773 as the column for the hash function. 135 00:06:03,816 --> 00:06:07,680 Our final type of distribution is replicated. 136 00:06:07,680 --> 00:06:09,950 And this is quite different than the other two. 137 00:06:09,950 --> 00:06:11,960 In this scenario, we have a full table 138 00:06:11,960 --> 00:06:14,510 that we copy to every node. 139 00:06:14,510 --> 00:06:16,630 This eliminates the need to transfer the data 140 00:06:16,630 --> 00:06:18,123 back and forth for queries. 141 00:06:19,170 --> 00:06:21,040 It does have a storage impact 142 00:06:21,040 --> 00:06:22,910 because by copying into all of the nodes, 143 00:06:22,910 --> 00:06:25,000 of course, you're taking up extra storage, 144 00:06:25,000 --> 00:06:27,003 and it incurs overhead for writes. 145 00:06:28,730 --> 00:06:30,760 Instead of being an entire solution, 146 00:06:30,760 --> 00:06:33,250 it's really just 1 piece of the puzzle. 147 00:06:33,250 --> 00:06:36,740 It's normally used in conjunction with the other methods. 148 00:06:36,740 --> 00:06:38,810 And so what it's best for 149 00:06:38,810 --> 00:06:41,730 is small lookup or dimension tables 150 00:06:41,730 --> 00:06:44,780 that are frequently joined with larger tables. 151 00:06:44,780 --> 00:06:46,750 I hate to say that's all it's good for. 152 00:06:46,750 --> 00:06:49,810 It feels wrong to only have 1 check mark on a slide, 153 00:06:49,810 --> 00:06:52,290 but it's a very specialized case. 154 00:06:52,290 --> 00:06:54,510 Normally, you'll be using one of the other two, 155 00:06:54,510 --> 00:06:58,490 either round robin or hash-distributed. 156 00:06:58,490 --> 00:07:00,180 But replicated can also save you 157 00:07:00,180 --> 00:07:02,670 from having to copy information back and forth 158 00:07:02,670 --> 00:07:04,970 for the small tables that are frequently used. 159 00:07:07,460 --> 00:07:09,130 By way of review, 160 00:07:09,130 --> 00:07:12,300 Azure Synapse Analytics automatically creates and manages 161 00:07:12,300 --> 00:07:14,263 60 distributions for your data. 162 00:07:15,890 --> 00:07:18,150 You will normally choose between round robin, 163 00:07:18,150 --> 00:07:19,680 which distributes randomly, 164 00:07:19,680 --> 00:07:22,423 or hashed, which distributes deterministically. 165 00:07:23,390 --> 00:07:26,140 And in some scenarios, replicated tables can be useful 166 00:07:26,140 --> 00:07:28,090 for small amounts of information 167 00:07:28,090 --> 00:07:30,593 that are frequently joined with larger tables. 168 00:07:31,830 --> 00:07:33,160 That's it for this lesson. 169 00:07:33,160 --> 00:07:34,600 I hope this helps you wrap your mind 170 00:07:34,600 --> 00:07:37,200 around distribution in Azure Synapse. 171 00:07:37,200 --> 00:07:38,320 If you have any questions, 172 00:07:38,320 --> 00:07:40,140 please feel free to reach out to me. 173 00:07:40,140 --> 00:07:41,680 And when you're ready for the next lesson, 174 00:07:41,680 --> 00:07:42,630 I'll see you there.