1 00:00:00,240 --> 00:00:02,770 Okay, now let's talk about AWS Glue. 2 00:00:02,770 --> 00:00:07,280 So, Glue is a managed extract, transform, and load service, 3 00:00:07,280 --> 00:00:09,770 or ETL, and from an exam's perspective, 4 00:00:09,770 --> 00:00:10,730 that's all you need to know. 5 00:00:10,730 --> 00:00:12,830 But let's do a little bit of a deeper dive 6 00:00:12,830 --> 00:00:14,340 to understand how that works. 7 00:00:14,340 --> 00:00:15,173 So, what is ETL? 8 00:00:15,173 --> 00:00:17,673 Well, ETL is very helpful when you have some datasets 9 00:00:17,673 --> 00:00:20,330 but they're not exactly in the right form 10 00:00:20,330 --> 00:00:22,290 or the right format that you need 11 00:00:22,290 --> 00:00:23,850 to do your analytics on them. 12 00:00:23,850 --> 00:00:26,430 And so the idea is that you would use an ETL service 13 00:00:26,430 --> 00:00:28,690 to prepare and transform that data. 14 00:00:28,690 --> 00:00:32,270 So, Glue does that, but while traditionally you would use servers 15 00:00:32,270 --> 00:00:35,550 to do it, Glue is a fully serverless service, 16 00:00:35,550 --> 00:00:38,950 so you just worry about the actual data transformation, 17 00:00:38,950 --> 00:00:40,610 and Glue does the rest. 18 00:00:40,610 --> 00:00:44,170 So, in a diagram for example, Glue ETL sits in the middle, 19 00:00:44,170 --> 00:00:48,340 and say we wanted to extract data from both an S3 bucket 20 00:00:48,340 --> 00:00:50,610 and an Amazon RDS database. 21 00:00:50,610 --> 00:00:53,540 So, for this, we'd use Glue to extract the data 22 00:00:53,540 --> 00:00:56,460 from both these sources, and then, 23 00:00:56,460 --> 00:00:59,640 once the data is extracted, it is in the Glue service, 24 00:00:59,640 --> 00:01:02,930 and we would write a script to do the transform part. 25 00:01:02,930 --> 00:01:05,920 So here, Glue would help us transform the data, 26 00:01:05,920 --> 00:01:07,530 and then, once it's transformed, 27 00:01:07,530 --> 00:01:11,350 we need to actually analyze it, so we can load that data 28 00:01:11,350 --> 00:01:15,110 into, for example, an Amazon Redshift database, 29 00:01:15,110 --> 00:01:17,680 where we can do our analytics the right way. 30 00:01:17,680 --> 00:01:19,700 And so, Glue sits here, okay? 31 00:01:19,700 --> 00:01:23,300 It's a very powerful tool, because you can do any kind 32 00:01:23,300 --> 00:01:24,840 of extraction and transformation, 33 00:01:24,840 --> 00:01:27,670 and then you can load the data into many different places. 34 00:01:27,670 --> 00:01:30,730 So, that's Glue ETL, and then there's another service 35 00:01:30,730 --> 00:01:34,010 called the Glue Data Catalog, which I think is not on the exam, 36 00:01:34,010 --> 00:01:35,940 but I will still mention it to you 'cause it's important 37 00:01:35,940 --> 00:01:37,910 to know it as part of the Glue family. 38 00:01:37,910 --> 00:01:40,600 So, the Glue Data Catalog, as the name indicates, 39 00:01:40,600 --> 00:01:45,150 is a catalog of your datasets in your AWS infrastructure, 40 00:01:45,150 --> 00:01:48,290 and so this Glue Data Catalog will have a reference 41 00:01:48,290 --> 00:01:50,600 to everything: the column names, the field names, 42 00:01:50,600 --> 00:01:52,060 the field types, et cetera, et cetera. 43 00:01:52,060 --> 00:01:55,940 And this can be used by services such as Athena, 44 00:01:55,940 --> 00:01:58,880 Redshift, and EMR to discover the datasets 45 00:01:58,880 --> 00:02:01,880 and build the proper schemas for them, okay?
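
[Editor's note: to make the extract-transform-load flow described above concrete, here is a minimal sketch of what a Glue ETL job script could look like in PySpark, using the awsglue library that Glue provides to job scripts. The database, table, and connection names (sales_db, raw_orders, redshift-connection, orders_clean, analytics) are hypothetical placeholders, not anything defined in the lecture.]

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Standard Glue job setup: Glue passes --JOB_NAME and --TempDir at runtime
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a source table registered in the Glue Data Catalog
# (for example, S3 data crawled into a hypothetical "raw_orders" table;
# an RDS source registered in the catalog would be read the same way)
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",      # hypothetical catalog database
    table_name="raw_orders",  # hypothetical catalog table
)

# Transform: rename columns and cast types so the data is analytics-ready
cleaned = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_total", "string", "total_usd", "double"),
    ],
)

# Load: write the transformed data into Redshift through a Glue connection
# named "redshift-connection" (also hypothetical)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "orders_clean", "database": "analytics"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()

[In this sketch the script only describes the extract, transform, and load steps; Glue runs the underlying Spark infrastructure for you, which is the serverless point made in the lecture.]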
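
[Editor's note: similarly, as a hedged illustration of the Glue Data Catalog, here is a small sketch that lists the tables and column schemas the catalog holds, using the boto3 Glue client; the sales_db database name is again just a placeholder.]

import boto3

# Client for the Glue Data Catalog
glue = boto3.client("glue")

# List the tables registered in a (hypothetical) catalog database
response = glue.get_tables(DatabaseName="sales_db")

for table in response["TableList"]:
    print(table["Name"])
    # Each table entry stores the column names and types that services
    # such as Athena, Redshift Spectrum, and EMR use to build their schemas
    for column in table["StorageDescriptor"]["Columns"]:
        print(" ", column["Name"], column["Type"])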
46 00:02:01,880 --> 00:02:03,550 So, that's it. I hope you liked this lecture, 47 00:02:03,550 --> 00:02:06,080 and I will see you in the next lecture.