Hey, Cloud Gurus, in this lesson, we'll be talking about encoding and decoding data.

To start off with, we'll talk about what the problem is: why do we even need to encode and decode data? We'll look at some solutions, new and old. We'll demo how to go about this in Azure Data Factory, and then wrap everything up with a review.

To briefly talk about the problem: the format of source data depends on the encoding options of the source system, and this may vary from that of the sink. This can obviously create issues when performing ETL or ELT operations. Systems that are going to use the data from the source may depend on it being in a different format. Normally things are in UTF-8 or UTF-16, but that's not always the case. So, if your source system is using an encoding format other than the one you'll eventually need it to be in, you're going to have to go through some type of encoding/decoding process.

In the past, one of the ways you might have gone about this is manually exporting the file: you download it to your local computer, convert it using a text editor, and then upload it back to the cloud. This is easy, but obviously not optimal; there's a lot of overhead, both on network resources and on your time.

You could instead use a virtual machine: download the file, run some scripts to convert it, and then upload it again. This keeps things in the cloud and is a bit more automated, but it still has the overhead of downloading and uploading the file.

An improvement to that would be to leave the file in Azure Storage and use a cloud VM to convert it in a similar way using scripts. This avoids the problem of downloading and uploading the file, leaving it there in Azure Storage, but it's still not an ideal situation.

Luckily, there are new ways to go about this. We can use the copy activity, setting encoding options on the dataset properties for your source and sink.
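To make that concrete, here's a minimal sketch of what a source dataset might look like in the Studio's JSON view. The linked service, container, and file names here are placeholders for this example; the key piece is the encodingName property.

```json
{
  "name": "json1",
  "properties": {
    "type": "Json",
    "linkedServiceName": {
      "referenceName": "AzureDataLakeStorage1",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "demo",
        "fileName": "example.json"
      },
      "encodingName": "ISO-8859-7"
    }
  }
}
```

The sink dataset is the same shape, but points at the output file and either omits encodingName (UTF-8 is the default) or sets it explicitly to "UTF-8". The copy activity then performs the conversion as part of the copy.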
For complex scenarios, where simply setting those properties is not sufficient, we can use Azure Functions: you can insert an Azure Function activity into the pipeline and use custom code to go about the conversion. The great thing is that both of these are available in Azure Data Factory, and so they can be included as part of your pipelines.

Let's jump into the demo and take a look at a simple example of encoding and decoding data.

Before we jump over to the Azure portal, I just wanted to show you the encoding format of a file we'll be working with. This is just an example JSON file, and if you'll point your attention down here to the bottom, we have our encoding, in this case, ISO 8859-7. I want you to note that as we go forward.

Over in Azure Data Factory Studio, I have a demo pipeline open. This is a very simple pipeline with just one copy activity. As you can see, its source is a json1 dataset and its sink is a json2 dataset.

If we take a look at json1, you can see it's pointing to my Azure Data Lake Storage linked service, and it's referencing that example.json file; I've simply uploaded that same file to my Azure Data Lake container. I have set the encoding to that same ISO 8859-7, so that Azure Data Factory knows what to expect. For my target, json2, I have another linked service to Azure Data Lake, and it's pointing at a second file, decode.json. That one is set to use the default UTF-8 format.

And so, when we come over to our pipeline and trigger it, it will simply copy that file from the source to the sink, and change the format in doing so. I'll go ahead and kick that off, and then I'll rejoin you over at my text editor to look at what its encoding format is.

Back over in my text editor, I have the decode.json file. Again, I simply downloaded it from the container it was copied over to. You can see we are on a different file, just so you know I'm not playing tricks. And if you take note down at the bottom, the format is now Unicode UTF-8. So we've copied a file from a source system that was using one format, and during the copy converted it to a more usable format.
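Before we review, here's a minimal sketch of what the custom-code route might look like: a Python HTTP-triggered function that an Azure Function activity could call, passing the container and file names in the request body. The container, file names, encodings, and the STORAGE_CONNECTION app setting are all assumptions for this example; in practice you'd match them to your own source and sink.

```python
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    # The pipeline's Azure Function activity posts the blob details, e.g.
    # {"container": "demo", "sourceFile": "example.json", "targetFile": "decode.json"}.
    body = req.get_json()

    # Assumed app setting holding the storage account connection string.
    service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION"])

    # Download the source file and decode it from its original encoding.
    source = service.get_blob_client(body["container"], body["sourceFile"])
    text = source.download_blob().readall().decode("iso-8859-7")

    # Re-encode as UTF-8 and write the result back to the same container.
    target = service.get_blob_client(body["container"], body["targetFile"])
    target.upload_blob(text.encode("utf-8"), overwrite=True)

    # The Azure Function activity expects a JSON object in the response body.
    return func.HttpResponse('{"status": "converted"}', mimetype="application/json")
```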
By way of review, the source system determines the encoding options for your incoming data. Again, normally you'll be working with UTF-8 or UTF-16, but that's not always the case. Your data projects will involve a variety of formats, and so sometimes you'll need to be able to convert them.

Azure Data Factory allows you to do many types of conversions simply by setting dataset properties to the desired values and using the copy activity. And that's what we just demoed: setting the original format on the source, and then our desired format on the sink.

Of course, it won't always be that easy. And so, for complex scenarios, you can use the Azure Function activity in your pipeline and use custom code to perform the conversion before passing the data along down the pipeline.

That's it, Gurus, thank you for joining me. When you're ready, I'll see you in the next video.