1 00:00:00,000 --> 00:00:01,490 Hey, what's up, Gurus? 2 00:00:01,490 --> 00:00:02,560 Brian back with you 3 00:00:02,560 --> 00:00:06,990 to talk about Performing Data Exploratory Analysis. 4 00:00:06,990 --> 00:00:09,920 So in this lesson, we are going to talk a little bit 5 00:00:09,920 --> 00:00:13,110 about what data exploratory analysis is. 6 00:00:13,110 --> 00:00:15,520 I'm going to give you just a brief code example, 7 00:00:15,520 --> 00:00:17,440 and then we're going to wrap up this lesson. 8 00:00:17,440 --> 00:00:19,140 It's going to be a pretty short lesson 9 00:00:19,140 --> 00:00:22,960 because the focus for this is not going to be on the code. 10 00:00:22,960 --> 00:00:26,720 It's just going to be on understanding the overall concept. 11 00:00:26,720 --> 00:00:28,993 So with that, let's dive in and get started. 12 00:00:30,690 --> 00:00:33,518 So at a high level, data exploratory analysis, 13 00:00:33,518 --> 00:00:38,470 it's just talking about initial investigations on data. 14 00:00:38,470 --> 00:00:41,510 So we talk about taking a look at data, 15 00:00:41,510 --> 00:00:43,780 without diving too far in. 16 00:00:43,780 --> 00:00:47,550 So what we want to be able to do is test a hypothesis 17 00:00:47,550 --> 00:00:51,820 or write some code and see if we're in the right ballpark. 18 00:00:51,820 --> 00:00:56,000 And so, exploratory data analysis allows us to do that. 19 00:00:56,000 --> 00:00:57,140 And it helps us 20 00:00:57,140 --> 00:01:00,000 through something called Kusto query language, 21 00:01:00,000 --> 00:01:02,250 to give a summary of statistics 22 00:01:02,250 --> 00:01:05,080 and graphical representations of our data. 
23 00:01:05,080 --> 00:01:07,900 It's actually not that far off from Databricks, 24 00:01:07,900 --> 00:01:10,180 which we'll cover later in the course, 25 00:01:10,180 --> 00:01:13,420 but as you build cells, you can actually, in Databricks, 26 00:01:13,420 --> 00:01:15,630 take a look at an individual snippet of code 27 00:01:15,630 --> 00:01:17,520 and pull up graphs or charts 28 00:01:17,520 --> 00:01:19,320 to kind of see what's going on with the data 29 00:01:19,320 --> 00:01:21,993 that way, as well. Similar concept. 30 00:01:22,850 --> 00:01:26,040 So the takeaway from this is it's making sense 31 00:01:26,040 --> 00:01:30,320 of the data you have, before going too deep with it. 32 00:01:30,320 --> 00:01:33,900 Now, in order to look at exploratory data analysis, 33 00:01:33,900 --> 00:01:36,890 we're actually going to talk about Data Explorer, 34 00:01:36,890 --> 00:01:39,320 which is an Azure tool that you can use 35 00:01:39,320 --> 00:01:40,840 to take a look at code, 36 00:01:40,840 --> 00:01:43,870 and we're going to use just a few basic queries 37 00:01:44,830 --> 00:01:45,930 in order to do that. 38 00:01:45,930 --> 00:01:48,520 So, here's a code example. 39 00:01:48,520 --> 00:01:51,810 Let's say that we have traffic events, 40 00:01:51,810 --> 00:01:54,400 and we're taking a look at traffic events 41 00:01:54,400 --> 00:01:56,290 across the United States. 42 00:01:56,290 --> 00:01:59,480 So we could start off by sorting by the start time 43 00:01:59,480 --> 00:02:01,520 of that traffic event, 44 00:02:01,520 --> 00:02:05,320 and just pulling 10 results to kind of see what we have. 45 00:02:05,320 --> 00:02:06,670 So once we have that, 46 00:02:06,670 --> 00:02:09,660 we can then go on and get a little bit more information. 47 00:02:09,660 --> 00:02:11,850 So we could say, let's take a look at the traffic events 48 00:02:11,850 --> 00:02:14,460 and based on what we saw, let's sort by the start time. 
49 00:02:14,460 --> 00:02:16,790 But instead of getting a hundred columns, 50 00:02:16,790 --> 00:02:19,350 I'm only interested in the start time, the end time, 51 00:02:19,350 --> 00:02:22,030 the state that it occurred in, the type of event, 52 00:02:22,030 --> 00:02:24,370 so maybe that was a single car crash, 53 00:02:24,370 --> 00:02:27,170 and then the damaged property detail, right? 54 00:02:27,170 --> 00:02:30,080 So I could specify those columns 55 00:02:30,080 --> 00:02:32,420 by using a project statement in Kusto. 56 00:02:32,420 --> 00:02:33,530 And then again, I could say 57 00:02:33,530 --> 00:02:36,730 I'm only interested in taking a look at 10 results. 58 00:02:36,730 --> 00:02:37,650 So what I'm doing 59 00:02:37,650 --> 00:02:41,900 is I'm performing that exploratory data analysis 60 00:02:41,900 --> 00:02:43,830 by quickly running through 61 00:02:43,830 --> 00:02:46,230 and looking at little bits of data, 62 00:02:46,230 --> 00:02:48,773 and kind of seeing what's going on with that data. 63 00:02:49,610 --> 00:02:52,210 So once I get that information, I could then go on and say, 64 00:02:52,210 --> 00:02:54,300 all right, let's summarize all of this. 65 00:02:54,300 --> 00:02:56,310 I could summarize the event, 66 00:02:56,310 --> 00:02:58,620 take a look at an average, sort it by state, 67 00:02:58,620 --> 00:02:59,940 and you can see that here. 68 00:02:59,940 --> 00:03:02,500 And then I could render that in a column chart. 69 00:03:02,500 --> 00:03:03,430 So I could move 70 00:03:03,430 --> 00:03:05,750 and actually start to take a look at the data 71 00:03:05,750 --> 00:03:07,840 by looking at a column chart. 72 00:03:07,840 --> 00:03:10,410 So now, within the span of--what's that-- 73 00:03:10,410 --> 00:03:12,300 maybe 10 lines of code, 74 00:03:12,300 --> 00:03:15,690 I've been able to pull some data out of my database. 
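[Editor's sketch] The sort, project, and take steps described above can be illustrated with a small in-memory analogue in plain Python. This is only a sketch under stated assumptions, not the lesson's actual query: the `traffic_events` rows, the column names (`StartTime`, `EndTime`, `State`, `EventType`, `DamageProperty`, `Source`), and the helper functions are all hypothetical. The comment at the top shows roughly what an equivalent Kusto query might look like, assuming a table named `TrafficEvents`.

```python
from datetime import datetime

# Roughly equivalent Kusto query, assuming a hypothetical table and columns:
#   TrafficEvents
#   | sort by StartTime desc
#   | project StartTime, EndTime, State, EventType, DamageProperty
#   | take 10

# Made-up sample rows; none of these values come from the lesson.
traffic_events = [
    {"StartTime": datetime(2023, 1, 3, 8, 15), "EndTime": datetime(2023, 1, 3, 9, 0),
     "State": "CA", "EventType": "Single car crash", "DamageProperty": 1200, "Source": "sensor"},
    {"StartTime": datetime(2023, 1, 5, 17, 40), "EndTime": datetime(2023, 1, 5, 18, 10),
     "State": "FL", "EventType": "Road closure", "DamageProperty": 0, "Source": "report"},
    {"StartTime": datetime(2023, 1, 4, 12, 5), "EndTime": datetime(2023, 1, 4, 13, 30),
     "State": "CA", "EventType": "Multi car crash", "DamageProperty": 5400, "Source": "sensor"},
    {"StartTime": datetime(2023, 1, 2, 6, 30), "EndTime": datetime(2023, 1, 2, 7, 0),
     "State": "TX", "EventType": "Single car crash", "DamageProperty": 800, "Source": "report"},
]


def sort_by(rows, column, descending=True):
    """Order rows by one column (Kusto's `sort by` also defaults to descending)."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


def project(rows, columns):
    """Keep only the named columns, like Kusto's `project`."""
    return [{name: row[name] for name in columns} for row in rows]


def take(rows, count):
    """Return at most `count` rows, like Kusto's `take`."""
    return rows[:count]


wanted = ["StartTime", "EndTime", "State", "EventType", "DamageProperty"]
preview = take(project(sort_by(traffic_events, "StartTime"), wanted), 10)

for row in preview:
    print(row)
```

Because the sample holds only four rows, asking for 10 simply returns all four, newest first, with the extra `Source` column dropped.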
75 00:03:15,690 --> 00:03:18,070 I've been able to look at what's going on 76 00:03:18,070 --> 00:03:20,730 and build some pretty basic charts. 77 00:03:20,730 --> 00:03:24,380 That gives me a decent idea of where we're at. 78 00:03:24,380 --> 00:03:28,410 For instance, why is Florida so much lower than California? 79 00:03:28,410 --> 00:03:31,510 So you can look for anomalies and start to get ideas 80 00:03:31,510 --> 00:03:34,470 on the data itself to figure out what's happening. 81 00:03:34,470 --> 00:03:38,773 So at its core, that's what exploratory data analysis is. 82 00:03:40,620 --> 00:03:43,580 So, kind of wrapping all this back up together, 83 00:03:43,580 --> 00:03:45,340 exploratory data analysis, 84 00:03:45,340 --> 00:03:47,850 it lets you make sense of the data you have 85 00:03:47,850 --> 00:03:52,070 before you build giant complex processes around it. 86 00:03:52,070 --> 00:03:54,150 And using Azure Data Explorer, 87 00:03:54,150 --> 00:03:57,010 it allows you to perform that real-time analysis 88 00:03:57,010 --> 00:04:00,810 on large volumes of data very quickly. 89 00:04:00,810 --> 00:04:02,660 And we use Kusto queries. 90 00:04:02,660 --> 00:04:05,730 And that's how we, just like in the code example, 91 00:04:05,730 --> 00:04:09,100 that's how we interact with and explore the data. 92 00:04:09,100 --> 00:04:10,680 Really, that's all there is to this. 93 00:04:10,680 --> 00:04:12,830 We don't need to dive any farther into it. 94 00:04:12,830 --> 00:04:16,180 So I'm not going to give you tons of examples on code. 95 00:04:16,180 --> 00:04:19,110 This is what you need to understand for the DP-203. 96 00:04:19,110 --> 00:04:20,140 I hope this has been helpful. 97 00:04:20,140 --> 00:04:22,490 And with that, I'll see you in the next lesson.
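[Editor's sketch] The summarize-and-chart step mentioned in the lesson (take an average, group it by state, then render a column chart) might look something like the following in plain Python. This is an assumption-heavy illustration: the lesson does not say which value is averaged, so averaging a hypothetical `DamageProperty` column is a guess, the sample numbers are invented, and a text bar chart stands in for Kusto's `render columnchart`.

```python
from collections import defaultdict

# Roughly equivalent Kusto query, assuming a hypothetical table and columns:
#   TrafficEvents
#   | summarize avg(DamageProperty) by State
#   | sort by State asc
#   | render columnchart

# Made-up sample rows; the values are invented for illustration only.
events = [
    {"State": "CA", "DamageProperty": 1200},
    {"State": "CA", "DamageProperty": 5400},
    {"State": "FL", "DamageProperty": 300},
    {"State": "FL", "DamageProperty": 500},
    {"State": "TX", "DamageProperty": 800},
]


def summarize_avg_by(rows, value_column, key_column):
    """Average `value_column` per distinct `key_column`, ordered by key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        bucket = totals[row[key_column]]
        bucket[0] += row[value_column]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


avg_damage = summarize_avg_by(events, "DamageProperty", "State")

# Text stand-in for a column chart: one '#' per 100 units of average damage.
for state, average in avg_damage.items():
    bar = "#" * round(average / 100)
    print(f"{state} {average:8.1f} {bar}")

# The state with the smallest average is a natural starting point for questions.
lowest_state = min(avg_damage, key=avg_damage.get)
print("Lowest average:", lowest_state)
```

With this invented data the lowest bar belongs to FL, which is the kind of gap that prompts a follow-up question such as why one state sits so far below another.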