1
00:00:00,610 --> 00:00:01,530
Hey, Cloud Gurus.

2
00:00:01,530 --> 00:00:04,063
Welcome to this lesson on distributing data.

3
00:00:06,110 --> 00:00:07,610
In this lesson, we're going to start off

4
00:00:07,610 --> 00:00:09,830
by introducing distribution.

5
00:00:09,830 --> 00:00:11,680
We're then going to talk about the different types

6
00:00:11,680 --> 00:00:16,680
such as round robin, hash-distributed, and replicated,

7
00:00:16,840 --> 00:00:19,123
and wrap everything up in a review.

8
00:00:22,570 --> 00:00:23,403
To start off,

9
00:00:23,403 --> 00:00:25,873
let's introduce what we mean by distributing data.

10
00:00:27,040 --> 00:00:29,000
Azure Synapse Analytics spreads your data

11
00:00:29,000 --> 00:00:32,983
across 60 underlying databases or distributions.

12
00:00:33,830 --> 00:00:35,400
This is done for you automatically

13
00:00:35,400 --> 00:00:39,150
as part of being a massively parallel processing system.

14
00:00:39,150 --> 00:00:40,500
And it's something that you don't manage

15
00:00:40,500 --> 00:00:41,973
or even have access to.

16
00:00:43,020 --> 00:00:45,000
This is not the same as partitioning,

17
00:00:45,000 --> 00:00:46,970
which is more about grouping similar data

18
00:00:46,970 --> 00:00:48,800
to be accessed together.

19
00:00:48,800 --> 00:00:50,210
Distribution can be thought of more

20
00:00:50,210 --> 00:00:52,940
as a hardware storage concept.

21
00:00:52,940 --> 00:00:54,800
The goal behind it is to distribute

22
00:00:54,800 --> 00:00:58,370
in such a way that data movement is minimized.

23
00:00:58,370 --> 00:01:00,040
And what we mean when we say data movement,

24
00:01:00,040 --> 00:01:03,320
is shuffling data around and fetching it across the network

25
00:01:03,320 --> 00:01:05,950
from different distributions in order to satisfy

26
00:01:05,950 --> 00:01:09,000
the results of a query, such as a join.

27
00:01:09,000 --> 00:01:10,780
If we don't get this right, then it can lead

28
00:01:10,780 --> 00:01:14,163
to negative effects such as network inefficiencies,

29
00:01:15,060 --> 00:01:19,440
increased I/O, and ultimately, slow queries.

30
00:01:19,440 --> 00:01:20,790
And so this is definitely something

31
00:01:20,790 --> 00:01:23,103
we want to take advantage of and get right.

32
00:01:25,650 --> 00:01:27,710
There are a few different types of distribution

33
00:01:27,710 --> 00:01:29,150
available to you.

34
00:01:29,150 --> 00:01:31,940
As you may have guessed from the introduction slide,

35
00:01:31,940 --> 00:01:34,260
we have round-robin-distributed tables,

36
00:01:34,260 --> 00:01:36,510
and that is where the data is distributed evenly

37
00:01:36,510 --> 00:01:37,923
in a random fashion.

38
00:01:39,070 --> 00:01:40,840
Hash-distributed tables,

39
00:01:40,840 --> 00:01:43,100
where data is distributed deterministically

40
00:01:43,100 --> 00:01:44,773
by using a hash function.

41
00:01:45,950 --> 00:01:49,720
And replicated, where we place a full copy of the table

42
00:01:49,720 --> 00:01:51,503
on every single compute node.

43
00:01:53,870 --> 00:01:56,640
Let's start off with round-robin-distributed tables.

44
00:01:56,640 --> 00:01:58,730
This is the easiest way to get started,

45
00:01:58,730 --> 00:02:01,513
and it's the default option if you don't specify one.

46
00:02:02,660 --> 00:02:03,860
As we mentioned before,

47
00:02:03,860 --> 00:02:07,080
this is an even distribution across the databases.

48
00:02:07,080 --> 00:02:09,110
It assigns rows randomly at first

49
00:02:09,110 --> 00:02:10,933
and then sequentially thereafter.

50
00:02:11,780 --> 00:02:13,500
That's a very efficient way to get them there,

51
00:02:13,500 --> 00:02:14,870
but it also means that the data

52
00:02:14,870 --> 00:02:17,890
on any particular distribution may have no relation

53
00:02:17,890 --> 00:02:20,053
to the other data that is there with it.

54
00:02:22,020 --> 00:02:24,420
This gives us fast performance for the loads.

55
00:02:24,420 --> 00:02:27,360
As we said, there's very little work in row assignment.

56
00:02:27,360 --> 00:02:30,030
And so that makes it great for things like staging tables,

57
00:02:30,030 --> 00:02:32,000
where you just want to load the data in there

58
00:02:32,000 --> 00:02:33,163
as quickly as possible.

59
00:02:34,480 --> 00:02:37,550
You can, however, have slower read performance.

60
00:02:37,550 --> 00:02:39,360
That's because there is a great potential

61
00:02:39,360 --> 00:02:41,640
for data movement here because the data

62
00:02:41,640 --> 00:02:44,040
is not placed in logical groupings.

63
00:02:44,040 --> 00:02:46,320
So when it goes to satisfy a query,

64
00:02:46,320 --> 00:02:48,010
there's a good chance it's going to have to pull

65
00:02:48,010 --> 00:02:49,940
from multiple distributions

66
00:02:49,940 --> 00:02:52,043
in order to obtain the query results.

67
00:02:53,890 --> 00:02:55,550
This type of distribution is best

68
00:02:55,550 --> 00:02:58,600
for when there is no clear distribution key.

69
00:02:58,600 --> 00:03:00,500
If you don't have something specific

70
00:03:00,500 --> 00:03:02,750
that would work really well for hash-distributed,

71
00:03:02,750 --> 00:03:04,603
then round robin is a great option.

72
00:03:05,930 --> 00:03:08,620
If you don't have frequent joins with other tables

73
00:03:08,620 --> 00:03:09,770
and therefore aren't pulling

74
00:03:09,770 --> 00:03:11,530
from a bunch of different areas,

75
00:03:11,530 --> 00:03:12,830
then it will be just fine.

76
00:03:14,380 --> 00:03:16,630
If you want a uniform distribution,

77
00:03:16,630 --> 00:03:18,737
round robin accomplishes that for you.

78
00:03:20,400 --> 00:03:22,810
As we said, staging tables are great.

79
00:03:22,810 --> 00:03:25,160
It's going to quickly load that information in there

80
00:03:25,160 --> 00:03:27,610
as it randomly throws it wherever it needs to go.

81
00:03:29,570 --> 00:03:32,700
And if you just need a simple starting point,

82
00:03:32,700 --> 00:03:34,560
let's say that you're just beginning to experiment

83
00:03:34,560 --> 00:03:37,270
with distribution, give round robin a try.

84
00:03:37,270 --> 00:03:39,680
If it suits your needs, then you're off to the races.

85
00:03:39,680 --> 00:03:41,270
But if you determine later

86
00:03:41,270 --> 00:03:43,250
that a different strategy is better,

87
00:03:43,250 --> 00:03:44,980
you can always start again.

88
00:03:44,980 --> 00:03:46,600
But it's great for getting started quickly

89
00:03:46,600 --> 00:03:47,693
with distribution.

90
00:03:50,980 --> 00:03:53,360
Our next type is hash-distributed.

91
00:03:53,360 --> 00:03:54,810
And this is like creating signs

92
00:03:54,810 --> 00:03:56,560
that tell the rows which way to go.

93
00:03:57,740 --> 00:03:59,930
It is deterministic,

94
00:03:59,930 --> 00:04:01,740
and that means data is distributed

95
00:04:01,740 --> 00:04:04,430
by the hash value of a column.

96
00:04:04,430 --> 00:04:06,540
This is a column that you choose,

97
00:04:06,540 --> 00:04:09,680
and so rows with the same data value and data type

98
00:04:09,680 --> 00:04:12,460
for the hash key will always be assigned

99
00:04:12,460 --> 00:04:14,290
to the same distribution.

100
00:04:14,290 --> 00:04:16,683
This is because their hash value is the same.

101
00:04:18,760 --> 00:04:21,640
The distribution column cannot be changed later,

102
00:04:21,640 --> 00:04:22,720
so once you've chosen it,

103
00:04:22,720 --> 00:04:26,053
you would have to recreate the table in order to change it.

104
00:04:27,010 --> 00:04:28,710
For that reason and others,

105
00:04:28,710 --> 00:04:31,870
it's important to select a column wisely.

106
00:04:31,870 --> 00:04:34,440
Remember, you're trying to minimize data movement

107
00:04:34,440 --> 00:04:36,593
while also distributing evenly.

108
00:04:37,620 --> 00:04:40,823
And so a few tips for how to choose a great column.

109
00:04:42,200 --> 00:04:44,850
Pick one that has many unique values.

110
00:04:44,850 --> 00:04:46,430
This is going to help spread those rows

111
00:04:46,430 --> 00:04:47,853
to many distributions.

112
00:04:49,240 --> 00:04:52,270
Pick one that has few or no nulls.

113
00:04:52,270 --> 00:04:55,630
Since null values all evaluate to the same hash value,

114
00:04:55,630 --> 00:04:58,700
they're going to get dumped into the same bucket.

115
00:04:58,700 --> 00:05:02,173
And so you could end up with very unevenly distributed rows.

116
00:05:03,890 --> 00:05:05,910
Very similar for date columns.

117
00:05:05,910 --> 00:05:09,160
You don't want to create hotspots if you have a lot of data

118
00:05:09,160 --> 00:05:11,820
that's on one day and not on another.

119
00:05:11,820 --> 00:05:14,313
And so it's probably best to avoid date columns.

120
00:05:16,470 --> 00:05:20,090
Hash-distributed is best for large tables.

121
00:05:20,090 --> 00:05:23,310
And when we say large, we mean that the table size on disk

122
00:05:23,310 --> 00:05:25,920
is larger than 2 gigabytes.

123
00:05:25,920 --> 00:05:28,830
Examples of these are large facts tables,

124
00:05:28,830 --> 00:05:30,880
historical transaction tables,

125
00:05:30,880 --> 00:05:32,733
and even large dimension tables.

126
00:05:34,650 --> 00:05:36,060
It's also great for tables

127
00:05:36,060 --> 00:05:39,133
with frequent inserts, updates, and deletes.

128
00:05:42,130 --> 00:05:43,440
Here, we have a quick example

129
00:05:43,440 --> 00:05:46,150
of creating a table with distribution.

130
00:05:46,150 --> 00:05:48,610
You can see we have a number of columns.

131
00:05:48,610 --> 00:05:52,360
And inside the WITH statement, we set our distribution,

132
00:05:52,360 --> 00:05:55,450
in this case, equal to Hash.

133
00:05:55,450 --> 00:05:57,310
And we're using the product Key

134
00:05:57,310 --> 00:05:59,773
as the column for the hash function.

135
00:06:03,816 --> 00:06:07,680
Our final type of distribution is replicated.

136
00:06:07,680 --> 00:06:09,950
And this is quite different than the other two.

137
00:06:09,950 --> 00:06:11,960
In this scenario, we have a full table

138
00:06:11,960 --> 00:06:14,510
that we copy to every node.

139
00:06:14,510 --> 00:06:16,630
This eliminates the need to transfer the data

140
00:06:16,630 --> 00:06:18,123
back and forth for queries.

141
00:06:19,170 --> 00:06:21,040
It does have a storage impact

142
00:06:21,040 --> 00:06:22,910
because by copying into all of the nodes,

143
00:06:22,910 --> 00:06:25,000
of course, you're taking up extra storage,

144
00:06:25,000 --> 00:06:27,003
and it incurs overhead for writes.

145
00:06:28,730 --> 00:06:30,760
Instead of being an entire solution,

146
00:06:30,760 --> 00:06:33,250
it's really just 1 piece of the puzzle.

147
00:06:33,250 --> 00:06:36,740
It's normally used in conjunction with the other methods.

148
00:06:36,740 --> 00:06:38,810
And so what it's best for

149
00:06:38,810 --> 00:06:41,730
is small lookup or dimension tables

150
00:06:41,730 --> 00:06:44,780
that are frequently joined with larger tables.

151
00:06:44,780 --> 00:06:46,750
I hate to say that's all it's good for.

152
00:06:46,750 --> 00:06:49,810
It feels wrong to only have 1 check mark on a slide,

153
00:06:49,810 --> 00:06:52,290
but it's a very specialized case.

154
00:06:52,290 --> 00:06:54,510
Normally, you'll be using one of the other two,

155
00:06:54,510 --> 00:06:58,490
either round robin or hash-distributed.

156
00:06:58,490 --> 00:07:00,180
But replicated can also save you

157
00:07:00,180 --> 00:07:02,670
from having to copy information back and forth

158
00:07:02,670 --> 00:07:04,970
for the small tables that are frequently used.

159
00:07:07,460 --> 00:07:09,130
By way of review,

160
00:07:09,130 --> 00:07:12,300
Azure Synapse Analytics automatically creates and manages

161
00:07:12,300 --> 00:07:14,263
60 distributions for your data.

162
00:07:15,890 --> 00:07:18,150
You will normally choose between round robin,

163
00:07:18,150 --> 00:07:19,680
which distributes randomly,

164
00:07:19,680 --> 00:07:22,423
or hashed, which distributes deterministically.

165
00:07:23,390 --> 00:07:26,140
And in some scenarios, replicated tables can be useful

166
00:07:26,140 --> 00:07:28,090
for small amounts of information

167
00:07:28,090 --> 00:07:30,593
that are frequently joined with larger tables.

168
00:07:31,830 --> 00:07:33,160
That's it for this lesson.

169
00:07:33,160 --> 00:07:34,600
I hope this helps you wrap your mind

170
00:07:34,600 --> 00:07:37,200
around distribution in Azure Synapse.

171
00:07:37,200 --> 00:07:38,320
If you have any questions,

172
00:07:38,320 --> 00:07:40,140
please feel free to reach out to me.

173
00:07:40,140 --> 00:07:41,680
And when you're ready for the next lesson,

174
00:07:41,680 --> 00:07:42,630
I'll see you there.