1
00:00:00,590 --> 00:00:01,500
Hey, Cloud Gurus.

2
00:00:01,500 --> 00:00:04,420
Welcome back to the wonderful world of partitioning.

3
00:00:04,420 --> 00:00:05,253
In the last lesson,

4
00:00:05,253 --> 00:00:07,460
we took a look at some of the best practices

5
00:00:07,460 --> 00:00:09,090
of partitioning in general,

6
00:00:09,090 --> 00:00:11,763
as well as some specific advice for Azure Data Lake.

7
00:00:12,830 --> 00:00:13,820
In this lesson,

8
00:00:13,820 --> 00:00:17,230
we're going to take a look at partitioning in Azure Synapse.

9
00:00:17,230 --> 00:00:18,170
As I said before,

10
00:00:18,170 --> 00:00:20,213
many of the things we discussed in the last lesson

11
00:00:20,213 --> 00:00:22,240
apply here as well.

12
00:00:22,240 --> 00:00:24,580
So, instead of reiterating what's already been said,

13
00:00:24,580 --> 00:00:27,960
we're going to focus in on what is key for Azure Synapse.

14
00:00:27,960 --> 00:00:30,450
And so, we'll be listing some best practices

15
00:00:30,450 --> 00:00:32,323
and then, wrapping up with a review.

16
00:00:33,850 --> 00:00:35,440
Just like with Azure data lakes,

17
00:00:35,440 --> 00:00:38,440
it's important upfront to understand how partitioning works

18
00:00:38,440 --> 00:00:39,633
in Azure Synapse.

19
00:00:40,500 --> 00:00:43,890
This is a massively parallel processing system.

20
00:00:43,890 --> 00:00:45,930
And what that means is that clients are connecting

21
00:00:45,930 --> 00:00:48,410
to a control node, and then that control node

22
00:00:48,410 --> 00:00:50,850
optimizes the query and passes it on

23
00:00:50,850 --> 00:00:53,290
to a number of compute nodes underneath.

24
00:00:53,290 --> 00:00:55,423
And those execute the work in parallel.

25
00:00:57,780 --> 00:00:59,443
Because of that,

26
00:00:59,443 --> 00:01:01,090
there are some extra factors in determining the number

27
00:01:01,090 --> 00:01:03,810
of partitions in an Azure Synapse system,

28
00:01:03,810 --> 00:01:06,360
because it is distributed by default.

29
00:01:06,360 --> 00:01:08,420
The data is automatically distributed

30
00:01:08,420 --> 00:01:12,330
across 60 underlying databases or distributions.

31
00:01:12,330 --> 00:01:14,020
And so, any partitioning done

32
00:01:14,020 --> 00:01:17,170
is in addition to these underlying distributions.

33
00:01:17,170 --> 00:01:18,630
So let's take an example.

34
00:01:18,630 --> 00:01:21,030
If you decided to make 100 partitions,

35
00:01:21,030 --> 00:01:25,050
you'd actually need to multiply that by 60 distributions,

36
00:01:25,050 --> 00:01:28,670
giving you, in total, 6,000 partitions.

37
00:01:28,670 --> 00:01:31,250
So keep that in mind so that you don't accidentally

38
00:01:31,250 --> 00:01:33,100
spread your data too thin.

39
00:01:33,100 --> 00:01:35,290
If your table only has 6,000 rows,

40
00:01:35,290 --> 00:01:37,860
then that's only 1 row per partition.

41
00:01:37,860 --> 00:01:39,743
And that's obviously not ideal.

42
00:01:40,970 --> 00:01:44,123
With that in mind, let's move on to best practices.

43
00:01:45,810 --> 00:01:47,060
Be focused.

44
00:01:47,060 --> 00:01:50,483
Scope your queries to as few partitions as possible.

45
00:01:51,698 --> 00:01:52,590
Your biggest enemy in these systems

46
00:01:52,590 --> 00:01:54,270
is going to be data movement,

47
00:01:54,270 --> 00:01:55,990
and that's shuffling around the data

48
00:01:55,990 --> 00:01:58,020
to satisfy your queries.

49
00:01:58,020 --> 00:02:00,670
If your query can look in just a couple of partitions,

50
00:02:00,670 --> 00:02:02,690
it's going to be much faster.

51
00:02:02,690 --> 00:02:04,387
And for the same reasons,

52
00:02:04,387 --> 00:02:07,100
you need to avoid cross-partition joins or transactions.

53
00:02:07,100 --> 00:02:08,290
If you're constantly having to

54
00:02:08,290 --> 00:02:10,110
join together other partitions,

55
00:02:10,110 --> 00:02:12,710
it's definitely going to slow down your performance.

56
00:02:13,720 --> 00:02:16,720
Also, utilize partition operations.

57
00:02:16,720 --> 00:02:19,210
It can be tempting to just use an insert and update

58
00:02:19,210 --> 00:02:22,820
or delete, but that has a heavy transactional cost.

59
00:02:22,820 --> 00:02:26,750
Instead, use partition switching, merging, and deletion,

60
00:02:26,750 --> 00:02:30,590
really taking an advantage of carefully crafted partitions.

61
00:02:30,590 --> 00:02:33,000
And that will help these operations go much faster,

62
00:02:33,000 --> 00:02:35,753
especially the larger amount of data you have.

63
00:02:37,160 --> 00:02:39,450
And don't over-partition.

64
00:02:39,450 --> 00:02:41,680
Just like our 6,000-row table,

65
00:02:41,680 --> 00:02:44,260
think of it like this empty auditorium.

66
00:02:44,260 --> 00:02:45,770
You could have lots of partitions

67
00:02:45,770 --> 00:02:48,300
and very little or no information in it.

68
00:02:48,300 --> 00:02:49,780
Generally, you don't want to shoot for

69
00:02:49,780 --> 00:02:51,290
thousands of partitions,

70
00:02:51,290 --> 00:02:54,870
but you're aiming more in the ballpark of tens or hundreds.

71
00:02:54,870 --> 00:02:57,760
Each scenario will be different, of course, but overall,

72
00:02:57,760 --> 00:03:01,090
just because partitioning usually helps us performance-wise,

73
00:03:01,090 --> 00:03:02,990
it can also hurt us if we do it wrong.

74
00:03:05,020 --> 00:03:07,910
By way of review, when creating partitions,

75
00:03:07,910 --> 00:03:09,180
remember that Azure Synapse

76
00:03:09,180 --> 00:03:11,810
already has 60 distributions included.

77
00:03:11,810 --> 00:03:14,450
And so, any number of partitions you create

78
00:03:14,450 --> 00:03:16,333
is going to be multiplied by 60.

79
00:03:17,920 --> 00:03:20,410
Take advantage of partition switching, merging,

80
00:03:20,410 --> 00:03:22,780
and deletion for much better performance

81
00:03:22,780 --> 00:03:24,893
over inserts, updates, and deletes.

82
00:03:26,270 --> 00:03:28,150
And avoid performance pitfalls

83
00:03:28,150 --> 00:03:30,720
that result from crossing partitions.

84
00:03:30,720 --> 00:03:32,910
You really want to craft your partitions,

85
00:03:32,910 --> 00:03:35,480
and your queries against those partitions,

86
00:03:35,480 --> 00:03:38,990
in a way that's going to scope them to as few as possible,

87
00:03:38,990 --> 00:03:40,660
really striking that balance between

88
00:03:40,660 --> 00:03:42,540
having all the information you need,

89
00:03:42,540 --> 00:03:44,840
but not having to split out to other partitions

90
00:03:44,840 --> 00:03:46,620
to join against them,

91
00:03:46,620 --> 00:03:48,920
or do write and delete operations across them.

92
00:03:50,350 --> 00:03:51,930
That's it for this lesson.

93
00:03:51,930 --> 00:03:54,280
And of course, keep in mind all the other lessons we learned

94
00:03:54,280 --> 00:03:57,050
in part one about scalability, performance,

95
00:03:57,050 --> 00:03:58,350
and availability.

96
00:03:58,350 --> 00:04:01,040
But I hope this lesson helps you understand specifically

97
00:04:01,040 --> 00:04:04,530
how to best approach partitioning in Azure Synapse.

98
00:04:04,530 --> 00:04:05,410
That's it for now.

99
00:04:05,410 --> 00:04:08,260
And I will see you in the next lesson, when you're ready.