1 00:00:00,550 --> 00:00:02,310 In this lesson, we are going to talk about 2 00:00:02,310 --> 00:00:05,500 "Purging Data Based upon Business Requirements". 3 00:00:05,500 --> 00:00:07,712 Well, what does that mean? We're going to start off 4 00:00:07,712 --> 00:00:10,110 by talking about what data purging is. 5 00:00:10,110 --> 00:00:11,510 Then we're going to talk about some reasons 6 00:00:11,510 --> 00:00:13,430 that you want to purge data. 7 00:00:13,430 --> 00:00:16,710 Followed by how we choose the data that we want to purge. 8 00:00:16,710 --> 00:00:19,350 And then we'll finish up with some best practices. 9 00:00:19,350 --> 00:00:21,293 So with that, let's get started. 10 00:00:22,260 --> 00:00:24,690 First up, what is data purging? 11 00:00:24,690 --> 00:00:25,910 Well, at the core, 12 00:00:25,910 --> 00:00:29,280 it's the process of deleting data from the database. 13 00:00:29,280 --> 00:00:32,500 However, there's a bit more to it than just deleting stuff, 14 00:00:32,500 --> 00:00:34,140 and so I want to start with a discussion 15 00:00:34,140 --> 00:00:36,580 of data lifecycle management. 16 00:00:36,580 --> 00:00:40,405 Data lifecycle management is a core concept in 17 00:00:40,405 --> 00:00:44,050 data engineering, and it starts with the creation of data. 18 00:00:44,050 --> 00:00:47,410 So think about this as a point of sale. 19 00:00:47,410 --> 00:00:50,453 As soon as that sale is created, we have data. 20 00:00:51,610 --> 00:00:54,100 That data then moves through processing. 21 00:00:54,100 --> 00:00:56,010 This is where it's ingested into the system 22 00:00:56,010 --> 00:00:58,310 and transformed, and eventually winds up 23 00:00:58,310 --> 00:01:01,840 in a database where we can then use that data 24 00:01:01,840 --> 00:01:05,053 to do reporting or a whole host of different things. 25 00:01:06,210 --> 00:01:09,800 After a while, though, that data is no longer useful.
26 00:01:09,800 --> 00:01:11,630 And once that data is no longer useful, 27 00:01:11,630 --> 00:01:13,920 we will often archive that data, 28 00:01:13,920 --> 00:01:16,210 or store it for a variety of reasons. 29 00:01:16,210 --> 00:01:18,630 And we'll talk about that here in a few minutes as well. 30 00:01:18,630 --> 00:01:21,710 And then once we're done with that archiving phase, 31 00:01:21,710 --> 00:01:24,300 then we destroy or purge the data. 32 00:01:24,300 --> 00:01:26,560 This is data lifecycle management, 33 00:01:26,560 --> 00:01:29,320 and it's an important concept not just for this lesson, 34 00:01:29,320 --> 00:01:32,510 but for the entire course, to think about your data 35 00:01:32,510 --> 00:01:35,223 as it travels through these 5 stages. 36 00:01:36,240 --> 00:01:39,250 So another point, data archiving 37 00:01:39,250 --> 00:01:42,220 or retention is not data deletion. 38 00:01:42,220 --> 00:01:44,460 This data archiving is going to be set 39 00:01:44,460 --> 00:01:46,530 based upon your business needs. 40 00:01:46,530 --> 00:01:49,560 It could be that we want 5 years of transaction data. 41 00:01:49,560 --> 00:01:52,510 Or it could be because of compliance requirements 42 00:01:52,510 --> 00:01:55,800 or regulatory requirements, we need 7 years of data. 43 00:01:55,800 --> 00:01:58,500 And so this data archive or data retention 44 00:01:58,500 --> 00:02:01,110 is going to be based upon your business needs. 45 00:02:01,110 --> 00:02:03,210 So why, then, would we purge the data? 46 00:02:03,210 --> 00:02:07,180 Well, again, it could be regulatory requirements. 47 00:02:07,180 --> 00:02:11,410 Due to right to erasure laws, or GDPR, or other reasons, 48 00:02:11,410 --> 00:02:15,140 we might have a need, because of regulatory requirements, 49 00:02:15,140 --> 00:02:17,890 to delete the data at a given time period; 50 00:02:17,890 --> 00:02:20,860 3 years, 5 years, whatever that is. 
51 00:02:20,860 --> 00:02:23,940 We also want to reduce our footprint, because it saves money 52 00:02:23,940 --> 00:02:25,630 and it reduces the risk. 53 00:02:25,630 --> 00:02:27,650 The less data that we have, 54 00:02:27,650 --> 00:02:29,630 the better off we're going to be 55 00:02:29,630 --> 00:02:31,350 if that data is not being used, 56 00:02:31,350 --> 00:02:33,920 because again, it's going to save us money in the long term, 57 00:02:33,920 --> 00:02:35,540 not storing and managing it, 58 00:02:35,540 --> 00:02:38,820 and it also reduces the risk to our organization. 59 00:02:38,820 --> 00:02:41,020 So you want to think through and make sure that the data 60 00:02:41,020 --> 00:02:43,783 that you're storing actually has a purpose. 61 00:02:46,340 --> 00:02:49,330 So then how do we choose the data that we want to purge? 62 00:02:49,330 --> 00:02:51,340 Well, first we'd look at outdated data. 63 00:02:51,340 --> 00:02:54,000 So based upon our data retention policy, 64 00:02:54,000 --> 00:02:56,080 we might have data that is past the due date, 65 00:02:56,080 --> 00:02:59,590 and we need to delete it because of that. Or errors. 66 00:02:59,590 --> 00:03:02,170 If we have inaccurate data, we need to clean that up, 67 00:03:02,170 --> 00:03:04,780 either by transforming and conforming the data 68 00:03:04,780 --> 00:03:07,880 to the norm of the rest of the dataset, 69 00:03:07,880 --> 00:03:09,913 or by deleting that extra data. 70 00:03:10,780 --> 00:03:13,170 Duplicate data. So if we have duplicate data, 71 00:03:13,170 --> 00:03:16,100 hey, we don't need 2 sets of data in most cases, 72 00:03:16,100 --> 00:03:17,760 and we don't want to make a data dump. 73 00:03:17,760 --> 00:03:20,100 So again, make sure that your data has a purpose 74 00:03:20,100 --> 00:03:22,050 if it's duplicated. 75 00:03:22,050 --> 00:03:24,620 Regulatory requirements, we've already talked about that.
76 00:03:24,620 --> 00:03:26,160 It is important that, when you look 77 00:03:26,160 --> 00:03:29,200 at regulatory requirements, you make a footprint map. 78 00:03:29,200 --> 00:03:32,880 We should know where the regulatory requirements are 79 00:03:32,880 --> 00:03:36,223 and what data those regulatory requirements touch on. 80 00:03:38,150 --> 00:03:40,030 So what are some best practices? 81 00:03:40,030 --> 00:03:43,170 Well, first you need a data retention policy. 82 00:03:43,170 --> 00:03:44,770 I mentioned that in the last slide. 83 00:03:44,770 --> 00:03:47,010 And the data retention policy is going to include 84 00:03:47,010 --> 00:03:49,923 your lifetime and your regulatory requirements. 85 00:03:50,930 --> 00:03:55,660 Data needs to move from active, to archive, to purge. 86 00:03:55,660 --> 00:03:57,640 That should be the process as you move data 87 00:03:57,640 --> 00:04:00,853 through that system. Think data lifecycle management there. 88 00:04:01,920 --> 00:04:04,240 Watch preconfigured backups from services. 89 00:04:04,240 --> 00:04:07,430 So when we look at Azure services, quite a few of them 90 00:04:07,430 --> 00:04:10,070 have preconfigured backups where it's automatically 91 00:04:10,070 --> 00:04:11,820 going to store your data. 92 00:04:11,820 --> 00:04:14,490 That may be fine, but you also need to weigh that 93 00:04:14,490 --> 00:04:17,560 against any sort of regulatory requirements that you have 94 00:04:17,560 --> 00:04:19,183 about purging your data. 95 00:04:21,190 --> 00:04:24,060 Plan your purging strategy for off hours. 96 00:04:24,060 --> 00:04:26,890 Purging or archiving, either one, 97 00:04:26,890 --> 00:04:29,490 generally isn't something that's time-critical. 98 00:04:29,490 --> 00:04:31,900 And so if you start in the morning, or the afternoon, 99 00:04:31,900 --> 00:04:34,210 or the evening, it generally doesn't make any difference 100 00:04:34,210 --> 00:04:37,490 for a purging strategy or your data retention policy.
101 00:04:37,490 --> 00:04:40,820 So you want to plan your purging strategy in off hours 102 00:04:40,820 --> 00:04:42,320 where you're going to have less traffic 103 00:04:42,320 --> 00:04:43,840 and less weight on the system 104 00:04:43,840 --> 00:04:46,113 for people that are actively running queries. 105 00:04:47,870 --> 00:04:51,290 5: assess and run cost management analysis. 106 00:04:51,290 --> 00:04:53,610 So as you look at your data footprint, 107 00:04:53,610 --> 00:04:56,230 you should be running a cost management analysis, 108 00:04:56,230 --> 00:04:59,300 including that storage of the data in archive, 109 00:04:59,300 --> 00:05:01,440 and also the movement as we move it through 110 00:05:01,440 --> 00:05:04,140 those various stages. Especially if we have more than 111 00:05:04,140 --> 00:05:07,010 1 archive or we're moving data around a lot. 112 00:05:07,010 --> 00:05:08,170 So it's important that you have 113 00:05:08,170 --> 00:05:10,163 that cost management analysis done. 114 00:05:11,500 --> 00:05:14,590 And last, for multi-cloud and hybrid environments, 115 00:05:14,590 --> 00:05:16,280 you need to map your storage. 116 00:05:16,280 --> 00:05:18,560 If we have environments that are not in Azure, 117 00:05:18,560 --> 00:05:20,850 you need to know where you're going to archive, 118 00:05:20,850 --> 00:05:24,050 and where that data lives, and how you're going to move data 119 00:05:24,050 --> 00:05:25,740 between all of those environments 120 00:05:25,740 --> 00:05:27,520 so that, again, you know where everything is 121 00:05:27,520 --> 00:05:29,953 and you can reduce your footprint and your risk. 122 00:05:31,640 --> 00:05:34,140 So wrapping up, just a couple of points to remember. 123 00:05:34,140 --> 00:05:37,720 The key here, have a plan. And one that's written down, 124 00:05:37,720 --> 00:05:39,040 not just one that's in your head. 
125 00:05:39,040 --> 00:05:42,500 I know that seems really easy, but there are a lot 126 00:05:42,500 --> 00:05:44,870 of business groups that don't actually 127 00:05:44,870 --> 00:05:47,930 write these things down, so make sure that you do that. 128 00:05:47,930 --> 00:05:50,990 And then finally, look at your standard application storage 129 00:05:50,990 --> 00:05:52,160 and your backup. 130 00:05:52,160 --> 00:05:54,820 Standard backups might cause regulatory issues, 131 00:05:54,820 --> 00:05:55,760 we talked about that. 132 00:05:55,760 --> 00:05:57,560 So make sure that you understand the services 133 00:05:57,560 --> 00:05:59,520 that you're using, how they back up, 134 00:05:59,520 --> 00:06:00,700 and where they back up to, 135 00:06:00,700 --> 00:06:03,720 so that you can avoid any regulatory issues. 136 00:06:03,720 --> 00:06:05,300 All right, that's it for this lesson. 137 00:06:05,300 --> 00:06:06,550 I'll see you in the next.
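
To make the "active, to archive, to purge" flow from this lesson a bit more concrete, here is a minimal Python sketch of how a written retention policy might be turned into a purge plan. Everything in it is an assumption for illustration: the names `RetentionPolicy`, `classify`, and `plan` are hypothetical, and the thresholds (1 year active, 7 years total retention) are placeholders, not values the lesson prescribes. Real thresholds would come from your own business and regulatory requirements.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Stage(Enum):
    """Lifecycle stages a stored record can be in."""
    ACTIVE = "active"
    ARCHIVE = "archive"
    PURGE = "purge"


@dataclass(frozen=True)
class RetentionPolicy:
    # Hypothetical thresholds; take real values from your written policy.
    active_days: int   # how long a record stays in the active store
    retain_days: int   # total retention (active + archive) before purge


def classify(created: date, today: date, policy: RetentionPolicy) -> Stage:
    """Decide which stage a record belongs to, based only on its age."""
    age_days = (today - created).days
    if age_days < policy.active_days:
        return Stage.ACTIVE
    if age_days < policy.retain_days:
        return Stage.ARCHIVE
    return Stage.PURGE


def plan(records, today, policy):
    """Group record ids by the stage they should be in today."""
    result = {stage: [] for stage in Stage}
    for record_id, created in records:
        result[classify(created, today, policy)].append(record_id)
    return result


# Illustrative data only: (record id, creation date) pairs.
policy = RetentionPolicy(active_days=365, retain_days=7 * 365)
today = date(2024, 1, 1)
records = [
    ("sale-1", date(2023, 6, 1)),  # under 1 year old
    ("sale-2", date(2020, 1, 1)),  # about 4 years old
    ("sale-3", date(2015, 1, 1)),  # about 9 years old
]
plan_result = plan(records, today, policy)
```

In this sketch, `sale-1` stays active, `sale-2` is due for archiving, and `sale-3` is a purge candidate. The plan only lists candidates; the actual delete would be a separate job, which, following the lesson, you would schedule for off hours and check against any service backups that might still hold the same data.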