1
00:00:02,340 --> 00:00:04,620
So now, let's talk about Amazon Athena.

2
00:00:04,620 --> 00:00:07,110
So Athena is a serverless query service

3
00:00:07,110 --> 00:00:10,980
to help you analyze the data stored in Amazon S3 buckets.

4
00:00:10,980 --> 00:00:12,720
And to analyze this data,

5
00:00:12,720 --> 00:00:15,180
you're going to use the standard SQL language

6
00:00:15,180 --> 00:00:16,590
to query the files.

7
00:00:16,590 --> 00:00:17,430
Behind the scenes,

8
00:00:17,430 --> 00:00:19,320
Athena is built on the Presto engine,

9
00:00:19,320 --> 00:00:21,030
which uses the SQL language.

10
00:00:21,030 --> 00:00:23,700
So the idea is that users are going to load data

11
00:00:23,700 --> 00:00:24,600
into your S3 bucket,

12
00:00:24,600 --> 00:00:27,330
or you are going to load data into your S3 bucket,

13
00:00:27,330 --> 00:00:29,070
and then you would use the Athena service

14
00:00:29,070 --> 00:00:32,640
to query and analyze this data in Amazon S3

15
00:00:32,640 --> 00:00:33,660
without moving it.

16
00:00:33,660 --> 00:00:34,920
So Athena is serverless,

17
00:00:34,920 --> 00:00:38,790
and it directly analyzes your data living in your S3 bucket.

18
00:00:38,790 --> 00:00:40,350
So it supports different formats,

19
00:00:40,350 --> 00:00:43,170
such as CSV, JSON, ORC,

20
00:00:43,170 --> 00:00:45,660
Avro, and Parquet, and possibly others.

21
00:00:45,660 --> 00:00:47,520
And the pricing is very simple.

22
00:00:47,520 --> 00:00:49,650
You're just going to pay a fixed amount

23
00:00:49,650 --> 00:00:52,050
per terabyte of data scanned.

24
00:00:52,050 --> 00:00:54,060
You don't need to provision any database,

25
00:00:54,060 --> 00:00:56,580
because the whole service is serverless.

26
00:00:56,580 --> 00:00:59,640
So Athena is commonly used with another tool,

27
00:00:59,640 --> 00:01:03,630
called Amazon QuickSight, to create reports and dashboards.
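The pay-per-scan pricing above can be sketched in a few lines of arithmetic. This is a minimal sketch, assuming a rate of $5.00 per terabyte scanned (a commonly cited US-region figure; check current AWS pricing before relying on it):

```python
# Rough cost estimate for Athena's pay-per-scan pricing.
# PRICE_PER_TB_USD is an assumption based on commonly cited
# US-region pricing; check the current AWS pricing page.

PRICE_PER_TB_USD = 5.00
TB = 10**12  # decimal terabyte, as used for billing

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single query, given bytes scanned."""
    return round(bytes_scanned / TB * PRICE_PER_TB_USD, 6)

# A query scanning 200 GB of CSV:
print(athena_query_cost(200 * 10**9))  # 1.0 (one dollar)

# The same query if a columnar format lets you scan only 20 GB:
print(athena_query_cost(20 * 10**9))   # 0.1 (ten cents)
```

The second call previews the cost-saving argument for columnar formats made later in the lecture: the less you scan, the less you pay.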
28
00:01:03,630 --> 00:01:06,150
So we'll see QuickSight later in depth,

29
00:01:06,150 --> 00:01:08,670
but Amazon QuickSight connects to Athena,

30
00:01:08,670 --> 00:01:11,100
which connects to your S3 buckets.

31
00:01:11,100 --> 00:01:13,440
Now, the use cases for Amazon Athena

32
00:01:13,440 --> 00:01:14,940
are to do ad hoc queries,

33
00:01:14,940 --> 00:01:17,880
business intelligence, analytics, reporting,

34
00:01:17,880 --> 00:01:19,830
and to analyze and query

35
00:01:19,830 --> 00:01:24,540
any kind of logs that originate from your AWS services.

36
00:01:24,540 --> 00:01:26,370
So it could be your VPC Flow Logs,

37
00:01:26,370 --> 00:01:30,120
your load balancer logs, your CloudTrail trails, and so on.

38
00:01:30,120 --> 00:01:31,860
So as an exam tip,

39
00:01:31,860 --> 00:01:34,830
anytime you need to analyze data in Amazon S3

40
00:01:34,830 --> 00:01:37,230
using a serverless SQL engine,

41
00:01:37,230 --> 00:01:38,823
you can think about Athena.

42
00:01:39,690 --> 00:01:41,940
Now, I've talked about performance improvements,

43
00:01:41,940 --> 00:01:44,180
and you can actually improve Athena's performance,

44
00:01:44,180 --> 00:01:45,870
and the exam

45
00:01:45,870 --> 00:01:47,820
will test you on this as well.

46
00:01:47,820 --> 00:01:50,640
So first of all, because you pay for the amount

47
00:01:50,640 --> 00:01:53,520
of data scanned per terabyte,

48
00:01:53,520 --> 00:01:56,040
you want to use a data format

49
00:01:56,040 --> 00:01:58,290
where you're going to scan less data.

50
00:01:58,290 --> 00:02:01,290
And for this, you can use a columnar data format

51
00:02:01,290 --> 00:02:02,123
for cost savings,

52
00:02:02,123 --> 00:02:04,770
because you only scan the columns you need.
53
00:02:04,770 --> 00:02:09,090
So therefore, the recommended formats for Amazon Athena

54
00:02:09,090 --> 00:02:12,243
are going to be Apache Parquet and ORC,

55
00:02:13,080 --> 00:02:15,780
and it's going to give you a huge performance improvement.

56
00:02:15,780 --> 00:02:20,370
And to get your files into the Apache Parquet or ORC format,

57
00:02:20,370 --> 00:02:21,630
you can use a service

58
00:02:21,630 --> 00:02:23,310
that we'll see as well in this section,

59
00:02:23,310 --> 00:02:24,660
for example, Glue.

60
00:02:24,660 --> 00:02:25,980
Glue can be very helpful

61
00:02:25,980 --> 00:02:29,130
to convert your data as an ETL job,

62
00:02:29,130 --> 00:02:32,640
between, for example, CSV and Parquet.

63
00:02:32,640 --> 00:02:35,850
Now, also because we want to scan less data,

64
00:02:35,850 --> 00:02:38,160
we need to compress data for smaller retrievals.

65
00:02:38,160 --> 00:02:41,190
So there are different compression mechanisms you can use

66
00:02:41,190 --> 00:02:43,680
that are listed right here.

67
00:02:43,680 --> 00:02:46,890
Next, if you know you're going to query all the time

68
00:02:46,890 --> 00:02:48,990
on some specific columns,

69
00:02:48,990 --> 00:02:50,940
you can partition your datasets,

70
00:02:50,940 --> 00:02:54,180
and partitioning datasets means that in your S3 bucket,

71
00:02:54,180 --> 00:02:57,150
you're going to have the full path with slashes,

72
00:02:57,150 --> 00:03:00,240
and each slash will be a different column name

73
00:03:00,240 --> 00:03:01,920
with a specific value.

74
00:03:01,920 --> 00:03:03,060
And so you're organizing,

75
00:03:03,060 --> 00:03:05,370
you're partitioning your data in Amazon S3,

76
00:03:05,370 --> 00:03:06,510
so that when you query it,

77
00:03:06,510 --> 00:03:08,160
you know exactly in which folder,

78
00:03:08,160 --> 00:03:09,780
at which path in Amazon S3,

79
00:03:09,780 --> 00:03:11,400
you need to scan for data.
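The "only scan the columns you need" point can be made concrete with a toy sketch. This is not real Parquet or ORC, just the row-vs-column layout idea in plain Python: with a row-oriented file (like CSV) a query touching one column still scans every byte, while a column-oriented layout stores each column contiguously, so only that column's bytes are read.

```python
import csv
import io
import json

# Toy illustration of why columnar formats cut scanned bytes.
# This sketches the layout idea only; it is NOT the Parquet format.
rows = [{"year": 1991, "month": 1, "carrier": "AA", "delay": 7}
        for _ in range(1000)]

# Row-oriented layout (CSV): all columns interleaved, record by record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["year", "month", "carrier", "delay"])
writer.writeheader()
writer.writerows(rows)
row_oriented = buf.getvalue()

# Column-oriented layout: each column stored as one contiguous chunk,
# so a query on "delay" only needs to read that chunk.
columns = {name: [r[name] for r in rows] for name in rows[0]}
delay_chunk = json.dumps(columns["delay"])

print(len(row_oriented), len(delay_chunk))
```

The second number is a fraction of the first, which is exactly why Athena bills less for the same query over a columnar format (real formats add compression and per-column statistics on top of this).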
80
00:03:11,400 --> 00:03:13,740
So here's an example of data partitions.

81
00:03:13,740 --> 00:03:16,770
So we have flight data in Parquet format,

82
00:03:16,770 --> 00:03:20,070
and then we do /year=1991.

83
00:03:20,070 --> 00:03:21,750
So we partition by year,

84
00:03:21,750 --> 00:03:24,000
and we'll have one folder for each year.

85
00:03:24,000 --> 00:03:26,370
Then within each year we'll have months,

86
00:03:26,370 --> 00:03:27,480
so month=1,

87
00:03:27,480 --> 00:03:29,370
and within each month we'll have days,

88
00:03:29,370 --> 00:03:30,210
so day=1.

89
00:03:30,210 --> 00:03:33,090
And so when I do a query on Athena,

90
00:03:33,090 --> 00:03:35,250
and I filter for a specific year,

91
00:03:35,250 --> 00:03:37,740
a specific month, and a specific day,

92
00:03:37,740 --> 00:03:39,510
then Athena will know exactly

93
00:03:39,510 --> 00:03:41,340
which folder in Amazon S3

94
00:03:41,340 --> 00:03:42,570
to get the data from,

95
00:03:42,570 --> 00:03:45,630
and therefore it will only retrieve a subset of the data.

96
00:03:45,630 --> 00:03:48,660
Therefore, we'll have really, really good performance.

97
00:03:48,660 --> 00:03:51,570
Finally, the last performance improvement you can make

98
00:03:51,570 --> 00:03:53,030
is to use bigger files,

99
00:03:53,030 --> 00:03:54,630
to minimize your overhead.

100
00:03:54,630 --> 00:03:58,230
So if you have many, many, many small files in Amazon S3,

101
00:03:58,230 --> 00:04:00,180
Athena is not going to be as performant

102
00:04:00,180 --> 00:04:01,890
as if you had larger files,

103
00:04:01,890 --> 00:04:04,620
for example, 128 megabytes and over,

104
00:04:04,620 --> 00:04:06,600
because larger files are easier to scan

105
00:04:06,600 --> 00:04:08,130
and easier to retrieve.

106
00:04:08,130 --> 00:04:12,510
Another feature of Amazon Athena is the Federated Query.
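The partition-pruning behavior described above can be sketched as a small filter over S3 keys. The bucket and file names below are made up for illustration; the `column=value` path segments follow the Hive-style convention the lecture describes:

```python
# Toy sketch of Hive-style partition pruning on S3 keys.
# Bucket/prefix/file names are hypothetical examples.
keys = [
    "s3://my-flight-data/parquet/year=1991/month=1/day=1/part-0.parquet",
    "s3://my-flight-data/parquet/year=1991/month=1/day=2/part-0.parquet",
    "s3://my-flight-data/parquet/year=1992/month=1/day=1/part-0.parquet",
]

def prune(keys, **filters):
    """Keep only keys whose path segments match every column=value filter,
    mimicking how Athena skips partitions ruled out by a WHERE clause."""
    wanted = {f"{col}={val}" for col, val in filters.items()}
    return [k for k in keys if wanted <= set(k.split("/"))]

# WHERE year = 1991 AND month = 1 AND day = 1 -> only one folder scanned
print(prune(keys, year=1991, month=1, day=1))
```

Only the first key survives the filter, which is the "subset of the data" effect: Athena never has to read the 1992 folder, or day 2, at all.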
107
00:04:12,510 --> 00:04:15,540
So you know that Athena can query data in S3,

108
00:04:15,540 --> 00:04:18,030
but actually you can query data anywhere,

109
00:04:18,030 --> 00:04:21,390
for example, in relational or non-relational databases,

110
00:04:21,390 --> 00:04:24,300
you can query objects and custom data sources,

111
00:04:24,300 --> 00:04:27,570
whether on AWS or on-premises.

112
00:04:27,570 --> 00:04:31,290
How? Well, you use what's called a Data Source Connector.

113
00:04:31,290 --> 00:04:32,640
It's a Lambda function,

114
00:04:32,640 --> 00:04:34,590
and that Lambda function is going to run

115
00:04:34,590 --> 00:04:38,100
the Federated Queries in other services.

116
00:04:38,100 --> 00:04:40,350
So that could be, for example, CloudWatch Logs,

117
00:04:40,350 --> 00:04:41,730
DynamoDB, RDS, and so on.

118
00:04:41,730 --> 00:04:43,560
So it's very powerful.

119
00:04:43,560 --> 00:04:45,390
For example, we have Athena here,

120
00:04:45,390 --> 00:04:46,800
and we have a Lambda function,

121
00:04:46,800 --> 00:04:48,840
and you'll have one Lambda function

122
00:04:48,840 --> 00:04:50,940
per Data Source Connector.

123
00:04:50,940 --> 00:04:52,950
And then through Amazon Athena,

124
00:04:52,950 --> 00:04:54,900
you can run the query across ElastiCache,

125
00:04:54,900 --> 00:04:58,230
DocumentDB, DynamoDB, Redshift, Aurora,

126
00:04:58,230 --> 00:05:02,190
SQL Server, MySQL, HBase on the EMR service,

127
00:05:02,190 --> 00:05:04,560
or any on-premises database.

128
00:05:04,560 --> 00:05:05,820
They're all queryable from Athena,

129
00:05:05,820 --> 00:05:08,010
as well as, of course, Amazon S3,

130
00:05:08,010 --> 00:05:09,330
and you can do your joins,

131
00:05:09,330 --> 00:05:11,730
and you can do your computations, and so on.

132
00:05:11,730 --> 00:05:14,250
That's why it's called a Federated Query.
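The "one Lambda function per Data Source Connector" idea can be sketched as a simple routing table. This is a conceptual toy, not Athena's actual internals, and the Lambda function names are hypothetical:

```python
# Toy sketch of federated-query routing: one Lambda connector per
# data source, while S3 is read natively. All names are hypothetical;
# this illustrates the concept, not Athena's real implementation.
CONNECTORS = {
    "dynamodb":   "athena-dynamodb-connector",
    "redshift":   "athena-redshift-connector",
    "cloudwatch": "athena-cloudwatch-connector",
}

def route_query(catalog: str) -> str:
    """Return the (hypothetical) Lambda connector for a data catalog.
    Amazon S3 needs no connector: Athena reads it directly."""
    if catalog == "s3":
        return "native"
    if catalog not in CONNECTORS:
        raise ValueError(f"no data source connector registered for {catalog!r}")
    return CONNECTORS[catalog]

print(route_query("dynamodb"))  # athena-dynamodb-connector
print(route_query("s3"))        # native
```

A federated query joining S3 data with DynamoDB data would fan out through both paths, which is what makes joins across sources possible.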
133
00:05:14,250 --> 00:05:16,140
And then the results of this query

134
00:05:16,140 --> 00:05:18,780
can be stored in your Amazon S3 buckets

135
00:05:18,780 --> 00:05:20,550
for later analysis.

136
00:05:20,550 --> 00:05:21,810
So that's it for Amazon Athena.

137
00:05:21,810 --> 00:05:24,210
As you can see, it's a very powerful service.

138
00:05:24,210 --> 00:05:25,350
I hope you liked it,

139
00:05:25,350 --> 00:05:27,300
and I will see you in the next lecture.