1 00:00:01,190 --> 00:00:03,170 Hey, Cloud Gurus, welcome back. 2 00:00:03,170 --> 00:00:06,163 In this lesson, we're talking about cleansing data. 3 00:00:07,810 --> 00:00:08,643 To do that, 4 00:00:08,643 --> 00:00:12,090 we'll start with looking at a data cleansing overview. 5 00:00:12,090 --> 00:00:14,010 We'll talk about some of the available tools 6 00:00:14,010 --> 00:00:14,963 for cleansing data, 7 00:00:15,800 --> 00:00:19,120 jump in to Azure Data Factory and do a demo, 8 00:00:19,120 --> 00:00:21,623 and then, wrap everything up with a review. 9 00:00:24,270 --> 00:00:26,720 To overview data cleansing a little bit, 10 00:00:26,720 --> 00:00:30,397 what we mean is the process of analyzing data 11 00:00:30,397 --> 00:00:32,770 in order to correct, remove, de-duplicate, 12 00:00:32,770 --> 00:00:35,310 reformat, or complete it. 13 00:00:35,310 --> 00:00:37,260 This is especially important when pulling data 14 00:00:37,260 --> 00:00:38,663 from many different sources. 15 00:00:39,682 --> 00:00:41,790 It's probably not all stored in the same way. 16 00:00:41,790 --> 00:00:43,960 Some of it could have been computer generated. 17 00:00:43,960 --> 00:00:46,380 Some of it could have been entered by human hands. 18 00:00:46,380 --> 00:00:48,620 And so, there is work that needs to be done 19 00:00:48,620 --> 00:00:51,013 to bring it into a clean, consistent format. 20 00:00:51,850 --> 00:00:54,400 And really, this just goes to ensure data quality 21 00:00:54,400 --> 00:00:57,630 and completeness, allowing it to be in a more usable state 22 00:00:57,630 --> 00:00:59,513 for consumption downstream. 23 00:01:02,290 --> 00:01:04,970 There are many tools available for this type of process, 24 00:01:04,970 --> 00:01:07,240 and we won't go over all of them here, 25 00:01:07,240 --> 00:01:10,180 but some of the ones you may encounter are 26 00:01:10,180 --> 00:01:11,900 data quality services. 27 00:01:11,900 --> 00:01:13,660 And this is a component of SQL Server 28 00:01:13,660 --> 00:01:17,310 that allows for computer-assisted data cleansing. 29 00:01:17,310 --> 00:01:19,710 It uses a knowledge base to check against, 30 00:01:19,710 --> 00:01:21,830 and then, a data steward can follow behind 31 00:01:21,830 --> 00:01:25,223 and interactively approve or modify suggestions. 32 00:01:27,130 --> 00:01:29,500 There is a clean missing data module 33 00:01:29,500 --> 00:01:31,400 in Azure Machine Learning. 34 00:01:31,400 --> 00:01:33,230 And this allows you to replace, remove, 35 00:01:33,230 --> 00:01:35,810 and even infer values. 36 00:01:35,810 --> 00:01:38,840 You can also use our old friend, mapping data flows 37 00:01:38,840 --> 00:01:40,800 in Azure Data Factory. 38 00:01:40,800 --> 00:01:43,360 These activities allow you to include data cleansing 39 00:01:43,360 --> 00:01:44,723 as part of your pipeline. 40 00:01:46,050 --> 00:01:48,390 In fact, let's jump over to Azure Data Factory 41 00:01:48,390 --> 00:01:49,590 and take a look at this. 42 00:01:51,810 --> 00:01:54,070 Here we are in Azure Data Factory Studio. 43 00:01:54,070 --> 00:01:57,990 If we come back over to our friend, the author tab, 44 00:01:57,990 --> 00:02:00,683 let's create a new data flow. 45 00:02:01,650 --> 00:02:04,590 Within here, we can add our source. 46 00:02:04,590 --> 00:02:07,763 For this example, I'm going to use a customer table, 47 00:02:08,820 --> 00:02:09,663 Customer_Prod. 48 00:02:11,550 --> 00:02:13,363 And let's just name that Customers. 49 00:02:15,610 --> 00:02:18,130 Now, one of the ways that you can go about data cleansing 50 00:02:18,130 --> 00:02:22,550 using a data flow is with a derived column. 51 00:02:22,550 --> 00:02:24,683 So we can add that to our flow here. 52 00:02:26,490 --> 00:02:30,033 And let's name it CleanData. 53 00:02:31,690 --> 00:02:33,313 Within our derived column, 54 00:02:34,326 --> 00:02:37,453 we can go column by column and do transformations, 55 00:02:38,683 --> 00:02:41,450 or we can add column patterns to get everything of, 56 00:02:41,450 --> 00:02:45,130 let's say, a specific type, like a string. 57 00:02:45,130 --> 00:02:47,373 In this demo, we'll do a simple example. 58 00:02:48,548 --> 00:02:50,170 So let's just grab a single column. 59 00:02:50,170 --> 00:02:54,823 For this one, let's go down to the phone column. 60 00:02:56,370 --> 00:02:57,970 And let's say that we want to make sure 61 00:02:57,970 --> 00:03:02,420 that our phone numbers don't have any spaces in them. 62 00:03:02,420 --> 00:03:04,750 As mentioned before, when you have humans entering data, 63 00:03:04,750 --> 00:03:06,390 there can be lots of mistakes. 64 00:03:06,390 --> 00:03:09,750 Let's say you have hundreds of call center representatives. 65 00:03:09,750 --> 00:03:11,360 There's a good chance that not all of them 66 00:03:11,360 --> 00:03:14,010 are entering the data exactly the same. 67 00:03:14,010 --> 00:03:16,120 So let's jump over to our expression 68 00:03:16,120 --> 00:03:17,763 and use expression builder. 69 00:03:19,050 --> 00:03:23,690 Azure Data Factory is very helpful in giving you categories 70 00:03:23,690 --> 00:03:26,850 to help you find exactly what you're looking for better. 71 00:03:26,850 --> 00:03:29,100 I'm going to go ahead and search for trim, 72 00:03:29,100 --> 00:03:31,460 because I know that's what I'm looking for, 73 00:03:31,460 --> 00:03:33,310 and grab this one down at the bottom. 74 00:03:34,302 --> 00:03:36,360 And you can see, it even gives you some text 75 00:03:36,360 --> 00:03:38,453 to help explain how to use the function. 76 00:03:39,740 --> 00:03:40,683 Let's click that. 77 00:03:41,750 --> 00:03:43,993 We want to trim phone. 78 00:03:45,840 --> 00:03:49,563 And what we want to trim is a space. 79 00:03:50,770 --> 00:03:52,810 And so, we'll save that. 80 00:03:52,810 --> 00:03:54,163 Save and finish. 81 00:03:55,700 --> 00:03:57,920 And we're telling it to grab the phone column 82 00:03:57,920 --> 00:04:01,383 and cleanse that data by removing any spaces. 83 00:04:02,360 --> 00:04:04,183 Now, this is a very simple example, 84 00:04:05,235 --> 00:04:06,590 but you can see how you can take it and extrapolate it 85 00:04:06,590 --> 00:04:09,860 to more complex scenarios, transforming multiple fields, 86 00:04:09,860 --> 00:04:14,400 such as addresses and names, or even column patterns, 87 00:04:14,400 --> 00:04:16,700 grabbing all of the ones that have strings 88 00:04:16,700 --> 00:04:19,040 and removing extra spaces from each of them. 89 00:04:19,040 --> 00:04:21,500 Or dashes, exclamation marks, 90 00:04:21,500 --> 00:04:24,590 whatever it is that you need to remove from your data 91 00:04:24,590 --> 00:04:27,130 in order to mold it into a more usable state 92 00:04:27,130 --> 00:04:28,483 for later consumption. 93 00:04:31,130 --> 00:04:32,447 By way of review, 94 00:04:32,447 --> 00:04:35,510 cleansing data is the process of analyzing data 95 00:04:35,510 --> 00:04:38,310 in order to correct, remove, de-duplicate, 96 00:04:38,310 --> 00:04:40,323 reformat, or complete it. 97 00:04:41,570 --> 00:04:44,209 There are multiple tools available for this, 98 00:04:44,209 --> 00:04:45,790 including Data Quality Services, 99 00:04:45,790 --> 00:04:47,630 Azure Machine Learning modules, 100 00:04:47,630 --> 00:04:51,373 and, as we saw, Azure Data Factory activities. 101 00:04:52,220 --> 00:04:53,590 And the whole purpose behind this 102 00:04:53,590 --> 00:04:56,740 is to prepare it for downstream use. 103 00:04:56,740 --> 00:04:58,910 Your data comes in in a very raw format, 104 00:04:58,910 --> 00:05:01,670 as we've discussed earlier with our different zones. 105 00:05:01,670 --> 00:05:03,710 And this is the part of the process that you're using 106 00:05:03,710 --> 00:05:07,540 to curate and cleanse that data to make it more usable 107 00:05:07,540 --> 00:05:09,530 for reports, dashboards, 108 00:05:09,530 --> 00:05:11,480 and many other functions down the road. 109 00:05:12,490 --> 00:05:13,560 That's it for this video. 110 00:05:13,560 --> 00:05:14,780 I hope you've enjoyed it. 111 00:05:14,780 --> 00:05:17,280 When you were ready, I'll see you in the next one.