1
00:00:02,340 --> 00:00:04,620
So now, let's talk about Amazon Athena.

2
00:00:04,620 --> 00:00:07,110
So Athena is a serverless query service

3
00:00:07,110 --> 00:00:10,980
to help you analyze the data stored in Amazon S3 buckets.

4
00:00:10,980 --> 00:00:12,720
And to analyze this data,

5
00:00:12,720 --> 00:00:15,180
you're going to use the standard SQL language

6
00:00:15,180 --> 00:00:16,590
to query the files.

7
00:00:16,590 --> 00:00:17,430
Behind the scenes,

8
00:00:17,430 --> 00:00:19,320
Athena is built on the Presto engine,

9
00:00:19,320 --> 00:00:21,030
which uses the SQL language.

10
00:00:21,030 --> 00:00:23,700
So the idea is that users are going to load data

11
00:00:23,700 --> 00:00:24,600
into your S3 bucket,

12
00:00:24,600 --> 00:00:27,330
or you are going to load data into your S3 bucket,

13
00:00:27,330 --> 00:00:29,070
and then you would use the Athena service

14
00:00:29,070 --> 00:00:32,640
to query and analyze this data in Amazon S3

15
00:00:32,640 --> 00:00:33,660
without moving it.

16
00:00:33,660 --> 00:00:34,920
So Athena is serverless,

17
00:00:34,920 --> 00:00:38,790
and it directly analyzes your data living in your S3 bucket.

18
00:00:38,790 --> 00:00:40,350
So it supports different formats,

19
00:00:40,350 --> 00:00:43,170
such as CSV, JSON, ORC,

20
00:00:43,170 --> 00:00:45,660
Avro, and Parquet, and possibly others.

21
00:00:45,660 --> 00:00:47,520
And the pricing is very simple.

22
00:00:47,520 --> 00:00:49,650
You're just going to pay a fixed amount

23
00:00:49,650 --> 00:00:52,050
per terabyte of data scanned.

24
00:00:52,050 --> 00:00:54,060
You don't need to provision any database,

25
00:00:54,060 --> 00:00:56,580
because the whole service is serverless.

26
00:00:56,580 --> 00:00:59,640
So Athena is commonly used with another tool,

27
00:00:59,640 --> 00:01:03,630
called Amazon QuickSight, to create reports and dashboards.
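The pay-per-scan pricing above can be sketched in a few lines of arithmetic. This is a minimal sketch, assuming a rate of $5.00 per terabyte scanned (a commonly cited US-region figure; check current AWS pricing before relying on it):

```python
# Rough cost estimate for Athena's pay-per-scan pricing.
# PRICE_PER_TB_USD is an assumption based on commonly cited
# US-region pricing; check the current AWS pricing page.

PRICE_PER_TB_USD = 5.00
TB = 10**12  # decimal terabyte, as used for billing

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single query, given bytes scanned."""
    return round(bytes_scanned / TB * PRICE_PER_TB_USD, 6)

# A query scanning 200 GB of CSV:
print(athena_query_cost(200 * 10**9))  # 1.0 (one dollar)

# The same query if a columnar format lets you scan only 20 GB:
print(athena_query_cost(20 * 10**9))   # 0.1 (ten cents)
```

The second call previews the cost-saving argument for columnar formats made later in the lecture: the less you scan, the less you pay.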
28
00:01:03,630 --> 00:01:06,150
So we'll see QuickSight later in depth,

29
00:01:06,150 --> 00:01:08,670
but Amazon QuickSight connects to Athena,

30
00:01:08,670 --> 00:01:11,100
which connects to your S3 buckets.

31
00:01:11,100 --> 00:01:13,440
Now, the use cases for Amazon Athena

32
00:01:13,440 --> 00:01:14,940
are to do ad hoc queries,

33
00:01:14,940 --> 00:01:17,880
business intelligence, analytics, reporting,

34
00:01:17,880 --> 00:01:19,830
and to analyze and query

35
00:01:19,830 --> 00:01:24,540
any kind of logs that originate from your AWS services.

36
00:01:24,540 --> 00:01:26,370
So it could be your VPC Flow Logs,

37
00:01:26,370 --> 00:01:30,120
your load balancer logs, your CloudTrail trails, and so on.

38
00:01:30,120 --> 00:01:31,860
So as an exam tip,

39
00:01:31,860 --> 00:01:34,830
anytime you need to analyze data in Amazon S3

40
00:01:34,830 --> 00:01:37,230
using a serverless SQL engine,

41
00:01:37,230 --> 00:01:38,823
you can think about Athena.

42
00:01:39,690 --> 00:01:41,940
Now, I've talked about performance improvements,

43
00:01:41,940 --> 00:01:44,180
and you can actually improve Athena's performance,

44
00:01:44,180 --> 00:01:45,870
and the exam

45
00:01:45,870 --> 00:01:47,820
will test you on this as well.

46
00:01:47,820 --> 00:01:50,640
So first of all, because you pay for the amount

47
00:01:50,640 --> 00:01:53,520
of data scanned per terabyte,

48
00:01:53,520 --> 00:01:56,040
you want to use a data format

49
00:01:56,040 --> 00:01:58,290
where you're going to scan less data.

50
00:01:58,290 --> 00:02:01,290
And for this, you can use a columnar data format

51
00:02:01,290 --> 00:02:02,123
for cost savings,

52
00:02:02,123 --> 00:02:04,770
because you only scan the columns you need.
53
00:02:04,770 --> 00:02:09,090
So therefore, the recommended formats for Amazon Athena

54
00:02:09,090 --> 00:02:12,243
are going to be Apache Parquet and ORC,

55
00:02:13,080 --> 00:02:15,780
and it's going to give you a huge performance improvement.

56
00:02:15,780 --> 00:02:20,370
And to get your files into the Apache Parquet or ORC format,

57
00:02:20,370 --> 00:02:21,630
you can use a service

58
00:02:21,630 --> 00:02:23,310
that we'll see as well in this section,

59
00:02:23,310 --> 00:02:24,660
for example, Glue.

60
00:02:24,660 --> 00:02:25,980
Glue can be very helpful

61
00:02:25,980 --> 00:02:29,130
to convert your data as an ETL job,

62
00:02:29,130 --> 00:02:32,640
between, for example, CSV and Parquet.

63
00:02:32,640 --> 00:02:35,850
Now, also because we want to scan less data,

64
00:02:35,850 --> 00:02:38,160
we need to compress data for smaller retrievals.

65
00:02:38,160 --> 00:02:41,190
So there are different compression mechanisms you can use

66
00:02:41,190 --> 00:02:43,680
that are listed right here.

67
00:02:43,680 --> 00:02:46,890
Next, if you know you're going to query all the time

68
00:02:46,890 --> 00:02:48,990
on some specific columns,

69
00:02:48,990 --> 00:02:50,940
you can partition your datasets,

70
00:02:50,940 --> 00:02:54,180
and partitioning datasets means that in your S3 bucket,

71
00:02:54,180 --> 00:02:57,150
you're going to have the full path with slashes,

72
00:02:57,150 --> 00:03:00,240
and each slash will be a different column name

73
00:03:00,240 --> 00:03:01,920
with a specific value.

74
00:03:01,920 --> 00:03:03,060
And so you're organizing,

75
00:03:03,060 --> 00:03:05,370
you're partitioning your data in Amazon S3,

76
00:03:05,370 --> 00:03:06,510
so that when you query it,

77
00:03:06,510 --> 00:03:08,160
you know exactly in which folder,

78
00:03:08,160 --> 00:03:09,780
at which path in Amazon S3,

79
00:03:09,780 --> 00:03:11,400
you need to scan for data.
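The "only scan the columns you need" point can be made concrete with a toy sketch. This is not real Parquet or ORC, just the row-vs-column layout idea in plain Python: with a row-oriented file (like CSV) a query touching one column still scans every byte, while a column-oriented layout stores each column contiguously, so only that column's bytes are read.

```python
import csv
import io
import json

# Toy illustration of why columnar formats cut scanned bytes.
# This sketches the layout idea only; it is NOT the Parquet format.
rows = [{"year": 1991, "month": 1, "carrier": "AA", "delay": 7}
        for _ in range(1000)]

# Row-oriented layout (CSV): all columns interleaved, record by record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["year", "month", "carrier", "delay"])
writer.writeheader()
writer.writerows(rows)
row_oriented = buf.getvalue()

# Column-oriented layout: each column stored as one contiguous chunk,
# so a query on "delay" only needs to read that chunk.
columns = {name: [r[name] for r in rows] for name in rows[0]}
delay_chunk = json.dumps(columns["delay"])

print(len(row_oriented), len(delay_chunk))
```

The second number is a fraction of the first, which is exactly why Athena bills less for the same query over a columnar format (real formats add compression and per-column statistics on top of this).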
80
00:03:11,400 --> 00:03:13,740
So here's an example of data partitions.

81
00:03:13,740 --> 00:03:16,770
So we have flight data in Parquet format,

82
00:03:16,770 --> 00:03:20,070
and then we do /year=1991.

83
00:03:20,070 --> 00:03:21,750
So we partition by year,

84
00:03:21,750 --> 00:03:24,000
and we'll have one folder for each year.

85
00:03:24,000 --> 00:03:26,370
Then within each year we'll have months,

86
00:03:26,370 --> 00:03:27,480
so month=1,

87
00:03:27,480 --> 00:03:29,370
and within each month we'll have days,

88
00:03:29,370 --> 00:03:30,210
so day=1.

89
00:03:30,210 --> 00:03:33,090
And so when I do a query on Athena,

90
00:03:33,090 --> 00:03:35,250
and I filter for a specific year,

91
00:03:35,250 --> 00:03:37,740
a specific month, and a specific day,

92
00:03:37,740 --> 00:03:39,510
then Athena will know exactly

93
00:03:39,510 --> 00:03:41,340
which folder in Amazon S3

94
00:03:41,340 --> 00:03:42,570
to get the data from,

95
00:03:42,570 --> 00:03:45,630
and therefore it will only retrieve a subset of the data.

96
00:03:45,630 --> 00:03:48,660
Therefore, we'll have really, really good performance.

97
00:03:48,660 --> 00:03:51,570
Finally, the last performance improvement you can make

98
00:03:51,570 --> 00:03:53,030
is to use bigger files,

99
00:03:53,030 --> 00:03:54,630
to minimize your overhead.

100
00:03:54,630 --> 00:03:58,230
So if you have many, many, many small files in Amazon S3,

101
00:03:58,230 --> 00:04:00,180
Athena is not going to be as performant

102
00:04:00,180 --> 00:04:01,890
as if you had larger files,

103
00:04:01,890 --> 00:04:04,620
for example, 128 megabytes and over,

104
00:04:04,620 --> 00:04:06,600
because larger files are easier to scan

105
00:04:06,600 --> 00:04:08,130
and easier to retrieve.

106
00:04:08,130 --> 00:04:12,510
Another feature of Amazon Athena is the Federated Query.
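The partition-pruning behavior described above can be sketched as a small filter over S3 keys. The bucket and file names below are made up for illustration; the `column=value` path segments follow the Hive-style convention the lecture describes:

```python
# Toy sketch of Hive-style partition pruning on S3 keys.
# Bucket/prefix/file names are hypothetical examples.
keys = [
    "s3://my-flight-data/parquet/year=1991/month=1/day=1/part-0.parquet",
    "s3://my-flight-data/parquet/year=1991/month=1/day=2/part-0.parquet",
    "s3://my-flight-data/parquet/year=1992/month=1/day=1/part-0.parquet",
]

def prune(keys, **filters):
    """Keep only keys whose path segments match every column=value filter,
    mimicking how Athena skips partitions ruled out by a WHERE clause."""
    wanted = {f"{col}={val}" for col, val in filters.items()}
    return [k for k in keys if wanted <= set(k.split("/"))]

# WHERE year = 1991 AND month = 1 AND day = 1 -> only one folder scanned
print(prune(keys, year=1991, month=1, day=1))
```

Only the first key survives the filter, which is the "subset of the data" effect: Athena never has to read the 1992 folder, or day 2, at all.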
107
00:04:12,510 --> 00:04:15,540
So you know that Athena can query data in S3,

108
00:04:15,540 --> 00:04:18,030
but actually you can query data anywhere,

109
00:04:18,030 --> 00:04:21,390
for example, in relational or non-relational databases,

110
00:04:21,390 --> 00:04:24,300
you can query objects and custom data sources,

111
00:04:24,300 --> 00:04:27,570
whether on AWS or on-premises.

112
00:04:27,570 --> 00:04:31,290
How? Well, you use what's called a Data Source Connector.

113
00:04:31,290 --> 00:04:32,640
It's a Lambda function,

114
00:04:32,640 --> 00:04:34,590
and that Lambda function is going to run

115
00:04:34,590 --> 00:04:38,100
the Federated Queries in other services.

116
00:04:38,100 --> 00:04:40,350
So that could be, for example, CloudWatch Logs,

117
00:04:40,350 --> 00:04:41,730
DynamoDB, RDS, and so on.

118
00:04:41,730 --> 00:04:43,560
So it's very powerful.

119
00:04:43,560 --> 00:04:45,390
For example, we have Athena here,

120
00:04:45,390 --> 00:04:46,800
and we have a Lambda function,

121
00:04:46,800 --> 00:04:48,840
and you'll have one Lambda function

122
00:04:48,840 --> 00:04:50,940
per Data Source Connector.

123
00:04:50,940 --> 00:04:52,950
And then through Amazon Athena,

124
00:04:52,950 --> 00:04:54,900
you can run the query across ElastiCache,

125
00:04:54,900 --> 00:04:58,230
DocumentDB, DynamoDB, Redshift, Aurora,

126
00:04:58,230 --> 00:05:02,190
SQL Server, MySQL, HBase on the EMR service,

127
00:05:02,190 --> 00:05:04,560
or any on-premises database.

128
00:05:04,560 --> 00:05:05,820
They're all queryable from Athena,

129
00:05:05,820 --> 00:05:08,010
as well as, of course, Amazon S3,

130
00:05:08,010 --> 00:05:09,330
and you can do your joins,

131
00:05:09,330 --> 00:05:11,730
and you can do your computations, and so on.

132
00:05:11,730 --> 00:05:14,250
That's why it's called a Federated Query.
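The "one Lambda function per Data Source Connector" idea can be sketched as a simple routing table. This is a conceptual toy, not Athena's actual internals, and the Lambda function names are hypothetical:

```python
# Toy sketch of federated-query routing: one Lambda connector per
# data source, while S3 is read natively. All names are hypothetical;
# this illustrates the concept, not Athena's real implementation.
CONNECTORS = {
    "dynamodb":   "athena-dynamodb-connector",
    "redshift":   "athena-redshift-connector",
    "cloudwatch": "athena-cloudwatch-connector",
}

def route_query(catalog: str) -> str:
    """Return the (hypothetical) Lambda connector for a data catalog.
    Amazon S3 needs no connector: Athena reads it directly."""
    if catalog == "s3":
        return "native"
    if catalog not in CONNECTORS:
        raise ValueError(f"no data source connector registered for {catalog!r}")
    return CONNECTORS[catalog]

print(route_query("dynamodb"))  # athena-dynamodb-connector
print(route_query("s3"))        # native
```

A federated query joining S3 data with DynamoDB data would fan out through both paths, which is what makes joins across sources possible.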
133
00:05:14,250 --> 00:05:16,140
And then the results of this query

134
00:05:16,140 --> 00:05:18,780
can be stored in your Amazon S3 buckets

135
00:05:18,780 --> 00:05:20,550
for later analysis.

136
00:05:20,550 --> 00:05:21,810
So that's it for Amazon Athena.

137
00:05:21,810 --> 00:05:24,210
As you can see, it's a very powerful service.

138
00:05:24,210 --> 00:05:25,350
I hope you liked it,

139
00:05:25,350 --> 00:05:27,300
and I will see you in the next lecture.