1 00:00:00,590 --> 00:00:01,500 Hey, Cloud Gurus. 2 00:00:01,500 --> 00:00:04,420 Welcome back to the wonderful world of partitioning. 3 00:00:04,420 --> 00:00:05,253 In the last lesson, 4 00:00:05,253 --> 00:00:07,460 we took a look at some of the best practices 5 00:00:07,460 --> 00:00:09,090 of partitioning in general, 6 00:00:09,090 --> 00:00:11,763 as well as some specific advice for Azure Data Lake. 7 00:00:12,830 --> 00:00:13,820 In this lesson, 8 00:00:13,820 --> 00:00:17,230 we're going to take a look at partitioning in Azure Synapse. 9 00:00:17,230 --> 00:00:18,170 As I said before, 10 00:00:18,170 --> 00:00:20,213 many of the things we discussed in the last lesson 11 00:00:20,213 --> 00:00:22,240 apply here as well. 12 00:00:22,240 --> 00:00:24,580 So, instead of reiterating what's already been said, 13 00:00:24,580 --> 00:00:27,960 we're going to focus in on what is key for Azure Synapse. 14 00:00:27,960 --> 00:00:30,450 And so, we'll be listing some best practices 15 00:00:30,450 --> 00:00:32,323 and then, wrapping up with a review. 16 00:00:33,850 --> 00:00:35,440 Just like with Azure data lakes, 17 00:00:35,440 --> 00:00:38,440 it's important upfront to understand how partitioning works 18 00:00:38,440 --> 00:00:39,633 in Azure Synapse. 19 00:00:40,500 --> 00:00:43,890 This is a massively parallel processing system. 20 00:00:43,890 --> 00:00:45,930 And what that means is that clients are connecting 21 00:00:45,930 --> 00:00:48,410 to a control node, and then that control node 22 00:00:48,410 --> 00:00:50,850 optimizes the query and passes it on 23 00:00:50,850 --> 00:00:53,290 to a number of compute nodes underneath. 24 00:00:53,290 --> 00:00:55,423 And those execute the work in parallel. 25 00:00:57,780 --> 00:00:59,443 Because of that, 26 00:00:59,443 --> 00:01:01,090 there are some extra factors in determining the number 27 00:01:01,090 --> 00:01:03,810 of partitions in an Azure Synapse system, 28 00:01:03,810 --> 00:01:06,360 because it is distributed by default. 29 00:01:06,360 --> 00:01:08,420 The data is automatically distributed 30 00:01:08,420 --> 00:01:12,330 across 60 underlying databases or distributions. 31 00:01:12,330 --> 00:01:14,020 And so, any partitioning done 32 00:01:14,020 --> 00:01:17,170 is in addition to these underlying distributions. 33 00:01:17,170 --> 00:01:18,630 So let's take an example. 34 00:01:18,630 --> 00:01:21,030 If you decided to make 100 partitions, 35 00:01:21,030 --> 00:01:25,050 you'd actually need to multiply that by 60 distributions, 36 00:01:25,050 --> 00:01:28,670 giving you, in total, 6,000 partitions. 37 00:01:28,670 --> 00:01:31,250 So keep that in mind so that you don't accidentally 38 00:01:31,250 --> 00:01:33,100 spread your data too thin. 39 00:01:33,100 --> 00:01:35,290 If your table only has 6,000 rows, 40 00:01:35,290 --> 00:01:37,860 then that's only 1 row per partition. 41 00:01:37,860 --> 00:01:39,743 And that's obviously not ideal. 42 00:01:40,970 --> 00:01:44,123 With that in mind, let's move on to best practices. 43 00:01:45,810 --> 00:01:47,060 Be focused. 44 00:01:47,060 --> 00:01:50,483 Scope your queries to as few partitions as possible. 45 00:01:51,698 --> 00:01:52,590 Your biggest enemy in these systems 46 00:01:52,590 --> 00:01:54,270 is going to be data movement, 47 00:01:54,270 --> 00:01:55,990 and that's shuffling around the data 48 00:01:55,990 --> 00:01:58,020 to satisfy your queries. 49 00:01:58,020 --> 00:02:00,670 If your query can look in just a couple of partitions, 50 00:02:00,670 --> 00:02:02,690 it's going to be much faster. 51 00:02:02,690 --> 00:02:04,387 And for the same reasons, 52 00:02:04,387 --> 00:02:07,100 you need to avoid cross-partition joins or transactions. 53 00:02:07,100 --> 00:02:08,290 If you're constantly having to 54 00:02:08,290 --> 00:02:10,110 join together other partitions, 55 00:02:10,110 --> 00:02:12,710 it's definitely going to slow down your performance. 56 00:02:13,720 --> 00:02:16,720 Also, utilize partition operations. 57 00:02:16,720 --> 00:02:19,210 It can be tempting to just use an insert and update 58 00:02:19,210 --> 00:02:22,820 or delete, but that has a heavy transactional cost. 59 00:02:22,820 --> 00:02:26,750 Instead, use partition switching, merging, and deletion, 60 00:02:26,750 --> 00:02:30,590 really taking an advantage of carefully crafted partitions. 61 00:02:30,590 --> 00:02:33,000 And that will help these operations go much faster, 62 00:02:33,000 --> 00:02:35,753 especially the larger amount of data you have. 63 00:02:37,160 --> 00:02:39,450 And don't over-partition. 64 00:02:39,450 --> 00:02:41,680 Just like our 6,000-row table, 65 00:02:41,680 --> 00:02:44,260 think of it like this empty auditorium. 66 00:02:44,260 --> 00:02:45,770 You could have lots of partitions 67 00:02:45,770 --> 00:02:48,300 and very little or no information in it. 68 00:02:48,300 --> 00:02:49,780 Generally, you don't want to shoot for 69 00:02:49,780 --> 00:02:51,290 thousands of partitions, 70 00:02:51,290 --> 00:02:54,870 but you're aiming more in the ballpark of tens or hundreds. 71 00:02:54,870 --> 00:02:57,760 Each scenario will be different, of course, but overall, 72 00:02:57,760 --> 00:03:01,090 just because partitioning usually helps us performance-wise, 73 00:03:01,090 --> 00:03:02,990 it can also hurt us if we do it wrong. 74 00:03:05,020 --> 00:03:07,910 By way of review, when creating partitions, 75 00:03:07,910 --> 00:03:09,180 remember that Azure Synapse 76 00:03:09,180 --> 00:03:11,810 already has 60 distributions included. 77 00:03:11,810 --> 00:03:14,450 And so, any number of partitions you create 78 00:03:14,450 --> 00:03:16,333 is going to be multiplied by 60. 79 00:03:17,920 --> 00:03:20,410 Take advantage of partition switching, merging, 80 00:03:20,410 --> 00:03:22,780 and deletion for much better performance 81 00:03:22,780 --> 00:03:24,893 over inserts, updates, and deletes. 82 00:03:26,270 --> 00:03:28,150 And avoid performance pitfalls 83 00:03:28,150 --> 00:03:30,720 that result from crossing partitions. 84 00:03:30,720 --> 00:03:32,910 You really want to craft your partitions, 85 00:03:32,910 --> 00:03:35,480 and your queries against those partitions, 86 00:03:35,480 --> 00:03:38,990 in a way that's going to scope them to as few as possible, 87 00:03:38,990 --> 00:03:40,660 really striking that balance between 88 00:03:40,660 --> 00:03:42,540 having all the information you need, 89 00:03:42,540 --> 00:03:44,840 but not having to split out to other partitions 90 00:03:44,840 --> 00:03:46,620 to join against them, 91 00:03:46,620 --> 00:03:48,920 or do write and delete operations across them. 92 00:03:50,350 --> 00:03:51,930 That's it for this lesson. 93 00:03:51,930 --> 00:03:54,280 And of course, keep in mind all the other lessons we learned 94 00:03:54,280 --> 00:03:57,050 in part one about scalability, performance, 95 00:03:57,050 --> 00:03:58,350 and availability. 96 00:03:58,350 --> 00:04:01,040 But I hope this lesson helps you understand specifically 97 00:04:01,040 --> 00:04:04,530 how to best approach partitioning in Azure Synapse. 98 00:04:04,530 --> 00:04:05,410 That's it for now. 99 00:04:05,410 --> 00:04:08,260 And I will see you in the next lesson, when you're ready.