1
00:00:00,240 --> 00:00:01,080
In this lesson,

2
00:00:01,080 --> 00:00:04,404
we are going to talk about shuffling in a pipeline.

3
00:00:04,404 --> 00:00:05,840
And we're going to start off

4
00:00:05,840 --> 00:00:07,700
by talking about distributed tables,

5
00:00:07,700 --> 00:00:09,740
and we're going to take a look at hash distribution

6
00:00:09,740 --> 00:00:11,450
and round-robin distribution.

7
00:00:11,450 --> 00:00:12,290
And then we're going to follow up

8
00:00:12,290 --> 00:00:14,440
by taking a look at how we set partition types

9
00:00:14,440 --> 00:00:15,653
in the portal.

10
00:00:16,750 --> 00:00:19,870
So let's start by talking about table distribution.

11
00:00:19,870 --> 00:00:22,460
There are several different types of table distribution.

12
00:00:22,460 --> 00:00:25,630
There are round-robin, hash, dynamic range,

13
00:00:25,630 --> 00:00:27,580
fixed range, and key.

14
00:00:27,580 --> 00:00:30,640
Now, this is specific to Data Factory,

15
00:00:30,640 --> 00:00:32,630
which is the service that we're looking at here.

16
00:00:32,630 --> 00:00:35,350
This is actually an image from Data Factory,

17
00:00:35,350 --> 00:00:37,470
and we'll see this later in the portal.

18
00:00:37,470 --> 00:00:40,300
But we're going to be talking through these partition types,

19
00:00:40,300 --> 00:00:41,280
and it's also important

20
00:00:41,280 --> 00:00:43,270
to talk a little bit about shuffling.

21
00:00:43,270 --> 00:00:46,610
So shuffling is just whenever we need to move data

22
00:00:46,610 --> 00:00:50,210
from point A to point B, or we need to change attributes.

23
00:00:50,210 --> 00:00:52,400
We shuffle that data around,

24
00:00:52,400 --> 00:00:54,410
and the way that we shuffle that data around,

25
00:00:54,410 --> 00:00:56,393
that's our table distribution.

26
00:00:57,820 --> 00:01:01,160
So let's start off by talking about hash distribution.

27
00:01:01,160 --> 00:01:05,080
A hash function is used to assign rows to partitions.

28
00:01:05,080 --> 00:01:06,800
Think about it kind of like a map.

29
00:01:06,800 --> 00:01:10,160
So, it's the key or the information that you need

30
00:01:10,160 --> 00:01:13,060
to figure out how to get from point A to point B.

31
00:01:13,060 --> 00:01:14,540
When we're talking about data,

32
00:01:14,540 --> 00:01:16,150
the hash function

33
00:01:16,150 --> 00:01:18,120
is the information or the key

34
00:01:18,120 --> 00:01:20,160
to help us figure out how to move data

35
00:01:20,160 --> 00:01:21,720
from our starting point,

36
00:01:21,720 --> 00:01:23,100
and then where we're going to put it

37
00:01:23,100 --> 00:01:25,663
as we move data into the new table.

38
00:01:26,870 --> 00:01:29,440
Essentially, it's a fancy math algorithm.

39
00:01:29,440 --> 00:01:32,300
So let's take a look at how this works.

40
00:01:32,300 --> 00:01:34,747
So we have 3 input sources,

41
00:01:34,747 --> 00:01:39,747
and from those input sources, we are going to move our data.

42
00:01:40,210 --> 00:01:41,500
And what we're going to do

43
00:01:41,500 --> 00:01:43,190
is we're going to take those blue boxes there,

44
00:01:43,190 --> 00:01:44,023
that's our input,

45
00:01:44,023 --> 00:01:46,210
and we're going to move it into those 9 boxes

46
00:01:46,210 --> 00:01:48,390
over there on the right.

47
00:01:48,390 --> 00:01:51,810
Now, the purple box represents our hash function.

48
00:01:51,810 --> 00:01:55,050
So as the data moves from the left to the right,

49
00:01:55,050 --> 00:01:58,100
it's going to encounter that purple box.

50
00:01:58,100 --> 00:01:59,770
And that is our math algorithm

51
00:01:59,770 --> 00:02:02,810
that's going to tell us which box over there on the right

52
00:02:02,810 --> 00:02:04,450
we need to move into.

53
00:02:04,450 --> 00:02:07,230
Now, hash functions are really important

54
00:02:07,230 --> 00:02:11,000
because they're going to give you fast querying.

55
00:02:11,000 --> 00:02:13,087
It takes a little longer to load data

56
00:02:13,087 --> 00:02:16,520
into a hash distribution, but it's going to be fast to query

57
00:02:16,520 --> 00:02:19,830
because we know where the data is.

58
00:02:19,830 --> 00:02:21,030
Because of that algorithm,

59
00:02:21,030 --> 00:02:23,230
we know where that data is going to be located.

60
00:02:23,230 --> 00:02:24,810
However, it's going to take a little bit longer to load

61
00:02:24,810 --> 00:02:27,920
because the input has to go through that hash

62
00:02:27,920 --> 00:02:30,663
before it can come out and go to the right spot.
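
To make the hash idea concrete, here is a minimal Python sketch, not Data Factory's actual implementation. The function names, the `customer_id` column, and the use of MD5 are all assumptions chosen for illustration; the 9 partitions match the 9 boxes in the diagram. The point is only that the same key always lands in the same partition, which is why lookups are fast.

```python
import hashlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and wrap it into the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions


def distribute_by_hash(rows, key_column, num_partitions=9):
    """Place every row in the partition chosen by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = hash_partition(str(row[key_column]), num_partitions)
        partitions[idx].append(row)
    return partitions


# Hypothetical input: 20 rows spread over 5 customer ids.
rows = [{"customer_id": i % 5, "amount": i} for i in range(20)]
partitions = distribute_by_hash(rows, "customer_id")
```

Every row passes through the hash on the way in, which is the extra load cost, but a query for one `customer_id` only needs to look in a single partition.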

63
00:02:32,020 --> 00:02:32,853
The other side of that

64
00:02:32,853 --> 00:02:35,750
is looking at a round-robin distribution.

65
00:02:35,750 --> 00:02:37,630
In a round-robin distribution,

66
00:02:37,630 --> 00:02:38,980
we're going to have first random,

67
00:02:38,980 --> 00:02:42,040
and then sequential movement of our data.

68
00:02:42,040 --> 00:02:43,880
So you can see over there on the right,

69
00:02:43,880 --> 00:02:46,990
we've basically put an X over our hash function.

70
00:02:46,990 --> 00:02:48,480
There is no more hash function,

71
00:02:48,480 --> 00:02:50,870
we're going to randomly assign the data

72
00:02:50,870 --> 00:02:52,700
into those 9 buckets.

73
00:02:52,700 --> 00:02:54,220
Now, when I say random,

74
00:02:54,220 --> 00:02:56,640
we are going to have an even distribution,

75
00:02:56,640 --> 00:02:59,480
so it's not like 1 bucket's going to get all of the data.

76
00:02:59,480 --> 00:03:01,070
It's going to be evenly distributed

77
00:03:01,070 --> 00:03:03,230
across all 9 of those boxes.

78
00:03:03,230 --> 00:03:06,770
However, it's going to be randomly distributed at first.

79
00:03:06,770 --> 00:03:10,310
Now, that is good because we're going to get faster loads,

80
00:03:10,310 --> 00:03:13,220
because we don't have to move through that hash function.

81
00:03:13,220 --> 00:03:15,500
However, it's going to take longer to query

82
00:03:15,500 --> 00:03:18,670
because we don't really know where that data is exactly.

83
00:03:18,670 --> 00:03:21,220
And so it's going to take longer to search through

84
00:03:21,220 --> 00:03:24,393
to get to the data that we need when we're running queries.
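
As a rough illustration of round-robin, here is a small Python sketch. It assumes one possible reading of "first random, then sequential": pick a random starting bucket, then hand out rows one bucket at a time. The function name and the `seed` parameter are hypothetical and not part of any Azure API.

```python
import random


def distribute_round_robin(rows, num_partitions=9, seed=None):
    """Spread rows evenly: random starting bucket, then strictly sequential."""
    partitions = [[] for _ in range(num_partitions)]
    # Assumption: only the starting position is random.
    start = random.Random(seed).randrange(num_partitions)
    for offset, row in enumerate(rows):
        partitions[(start + offset) % num_partitions].append(row)
    return partitions


# Hypothetical input: 20 rows, no key column needed.
rows = [{"amount": i} for i in range(20)]
partitions = distribute_round_robin(rows, seed=0)
sizes = [len(p) for p in partitions]
```

No hash is computed, so loading is cheap, and the bucket sizes never differ by more than one row. The trade-off is that nothing about a row tells you which bucket it went to, so a query may have to scan all of them.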

85
00:03:25,270 --> 00:03:27,290
So in addition to hash and round-robin,

86
00:03:27,290 --> 00:03:30,260
there are a few more distribution types.

87
00:03:30,260 --> 00:03:33,560
There are fixed range, dynamic range, and key.

88
00:03:33,560 --> 00:03:35,020
So fixed range

89
00:03:35,020 --> 00:03:39,860
is simply this: we are going to set the number of partitions

90
00:03:39,860 --> 00:03:41,150
that we want,

91
00:03:41,150 --> 00:03:44,740
and we are going to use a custom expression,

92
00:03:44,740 --> 00:03:45,970
and that custom expression

93
00:03:45,970 --> 00:03:49,200
is going to be determining how that data gets loaded

94
00:03:49,200 --> 00:03:51,240
into those partitions.

95
00:03:51,240 --> 00:03:53,530
Now, dynamic is going to be very similar to that,

96
00:03:53,530 --> 00:03:55,870
except you're going to have the system determine

97
00:03:55,870 --> 00:03:57,420
the ranges for your data,

98
00:03:57,420 --> 00:03:59,640
rather than it being based upon conditions

99
00:03:59,640 --> 00:04:01,200
that you're setting.

100
00:04:01,200 --> 00:04:03,700
The last one down there is key.

101
00:04:03,700 --> 00:04:05,220
Key distribution is simply this:

102
00:04:05,220 --> 00:04:06,960
you're going to pick a key column,

103
00:04:06,960 --> 00:04:08,470
and it's going to do distribution

104
00:04:08,470 --> 00:04:11,060
based upon that key column.
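
Fixed range and key distribution can be sketched in a few lines of Python as well. This is only an approximation under stated assumptions: the boundary list stands in for the custom expression, key distribution is modeled as one partition per distinct key value, and the function names and the `region` and `amount` columns are made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def distribute_fixed_range(rows, column, boundaries):
    """Use caller-supplied boundaries to decide which range each row falls in."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right returns the index of the range that contains the value.
        partitions[bisect_right(boundaries, row[column])].append(row)
    return partitions


def distribute_by_key(rows, column):
    """Group rows so that each distinct key value gets its own partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


# Hypothetical sample data.
rows = [
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 150},
    {"region": "east", "amount": 60},
    {"region": "north", "amount": 99},
]
by_range = distribute_fixed_range(rows, "amount", boundaries=[50, 100])
by_key = distribute_by_key(rows, "region")
```

A dynamic range version would look the same, except that the boundaries would be worked out from the data by the system instead of being passed in.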

105
00:04:11,060 --> 00:04:13,470
So with that, let's go ahead and jump into the portal,

106
00:04:13,470 --> 00:04:16,660
and I'll show you how we set distribution.

107
00:04:16,660 --> 00:04:19,130
So here, we find ourselves in Data Factory,

108
00:04:19,130 --> 00:04:21,760
in Data Factory data flows, specifically.

109
00:04:21,760 --> 00:04:24,110
And I have set up just a very simple flow;

110
00:04:24,110 --> 00:04:25,970
we have our input source,

111
00:04:25,970 --> 00:04:29,060
and then we have our output or our sink.

112
00:04:29,060 --> 00:04:33,210
If we go to our sink and look down here at the bottom,

113
00:04:33,210 --> 00:04:35,380
there is an Optimize tab,

114
00:04:35,380 --> 00:04:38,480
and you'll see that I can choose the partition type

115
00:04:38,480 --> 00:04:39,590
that I want.

116
00:04:39,590 --> 00:04:43,730
So I just simply choose the current partitioning

117
00:04:43,730 --> 00:04:46,880
or I set the partitioning as round-robin,

118
00:04:46,880 --> 00:04:49,590
or key, or fixed, right?

119
00:04:49,590 --> 00:04:50,830
That's literally it.

120
00:04:50,830 --> 00:04:52,320
When I run this data flow,

121
00:04:52,320 --> 00:04:55,130
it's going to change my partition type

122
00:04:55,130 --> 00:04:56,918
or it's going to use the partition type

123
00:04:56,918 --> 00:05:00,333
that is currently there from the source data.

124
00:05:00,333 --> 00:05:02,250
All right, so with that,

125
00:05:02,250 --> 00:05:04,450
let's go ahead and kind of wrap this lesson up

126
00:05:04,450 --> 00:05:06,090
with a few key points.

127
00:05:06,090 --> 00:05:08,720
You do need to know the basic distribution types,

128
00:05:08,720 --> 00:05:11,560
especially round-robin and hash.

129
00:05:11,560 --> 00:05:13,610
Those are very common across Azure,

130
00:05:13,610 --> 00:05:15,600
so make sure that you know those.

131
00:05:15,600 --> 00:05:17,530
And you need to understand that shuffling

132
00:05:17,530 --> 00:05:20,740
has a very large effect on performance.

133
00:05:20,740 --> 00:05:22,470
When you're moving data around,

134
00:05:22,470 --> 00:05:24,590
that's going to affect the cost,

135
00:05:24,590 --> 00:05:28,120
and that's going to affect your performance.

136
00:05:28,120 --> 00:05:30,450
So don't neglect optimization.

137
00:05:30,450 --> 00:05:31,743
It can make a huge difference

138
00:05:31,743 --> 00:05:36,210
when you set your distribution appropriately.

139
00:05:36,210 --> 00:05:37,820
All right, that's it for this lesson.

140
00:05:37,820 --> 00:05:39,070
I'll see you in the next one.