1 00:00:00,690 --> 00:00:02,410 Now let's talk about how to get 2 00:00:02,410 --> 00:00:04,000 data into Kinesis and for this 3 00:00:04,000 --> 00:00:06,920 we are going to leverage Kinesis Producers. 4 00:00:06,920 --> 00:00:10,767 So they're used to send data into our data streams 5 00:00:10,767 --> 00:00:13,580 and a data record, as we said, consists of a sequence number 6 00:00:13,580 --> 00:00:16,600 which is unique per partition key within the shard, 7 00:00:16,600 --> 00:00:19,990 the partition key as well, which we must specify 8 00:00:19,990 --> 00:00:22,110 while we put records into a stream 9 00:00:22,110 --> 00:00:24,720 and the data blob, which is up to one megabyte. 10 00:00:24,720 --> 00:00:27,560 The producers can be anything from the SDK 11 00:00:27,560 --> 00:00:29,700 to create a very simple producer 12 00:00:29,700 --> 00:00:32,600 to using the Kinesis Producer Library, the KPL, 13 00:00:32,600 --> 00:00:34,080 which supports different languages such as 14 00:00:34,080 --> 00:00:37,640 C++ or Java and it is built on top of the SDK 15 00:00:37,640 --> 00:00:40,850 but has some advanced capabilities available as APIs, 16 00:00:40,850 --> 00:00:45,200 so for example, batching, compression and retries. 17 00:00:45,200 --> 00:00:48,910 The Kinesis Agent is another way to send data into Kinesis. 18 00:00:48,910 --> 00:00:51,760 It is built on top of the Kinesis Producer Library 19 00:00:51,760 --> 00:00:54,150 and this is going to be used to monitor log files 20 00:00:54,150 --> 00:00:57,123 and stream those into your Kinesis data streams. 21 00:00:58,010 --> 00:00:59,900 In terms of write throughput, we know this already. 22 00:00:59,900 --> 00:01:01,520 We get one megabyte per second 23 00:01:01,520 --> 00:01:04,300 or 1,000 records per second per shard. 24 00:01:04,300 --> 00:01:07,120 And the API to send data into Kinesis 25 00:01:07,120 --> 00:01:09,610 is called the PutRecord API.
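As a rough sketch of what a very simple SDK-style producer could look like in Python: the function below sends one record through a client that exposes `put_record` with the `StreamName`, `Data` and `PartitionKey` parameters, as the boto3 Kinesis client does. The function name `send_reading`, the JSON payload and the reading of "one megabyte" as 1,048,576 bytes are assumptions made for illustration only.

```python
import json

# Assumption: "one megabyte" is read as 1,048,576 bytes for this sketch.
MAX_BLOB_BYTES = 1024 * 1024


def send_reading(client, stream_name, device_id, reading):
    """Send one record through a client exposing put_record (hypothetical helper)."""
    # The data blob is just bytes; here we serialize a dict as JSON.
    data = json.dumps(reading).encode("utf-8")
    if len(data) > MAX_BLOB_BYTES:
        raise ValueError("data blob exceeds the one-megabyte limit")
    # The partition key must be specified on every put.
    return client.put_record(
        StreamName=stream_name,
        Data=data,
        PartitionKey=device_id,
    )
```

In a real application the client would typically come from `boto3.client("kinesis")`; passing it in as an argument keeps the sketch easy to try out with a stub.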
26 00:01:09,610 --> 00:01:12,600 If we use batching with the PutRecords API we can reduce costs 27 00:01:12,600 --> 00:01:14,900 and also increase throughput, 28 00:01:14,900 --> 00:01:17,470 which is something that the Kinesis Producer Library 29 00:01:17,470 --> 00:01:19,290 does for us already. 30 00:01:19,290 --> 00:01:21,210 So if we look at the producer side 31 00:01:21,210 --> 00:01:23,950 we have a stream with say six shards 32 00:01:23,950 --> 00:01:27,830 and we have some producers, for example, IoT devices. 33 00:01:27,830 --> 00:01:29,610 So they're going to send data at a rate of up to 34 00:01:29,610 --> 00:01:33,180 one megabyte per second or 1,000 records per second 35 00:01:33,180 --> 00:01:37,583 per shard, and say we have one device ID, 111222333. 36 00:01:38,450 --> 00:01:40,080 So it's going to produce some data 37 00:01:40,080 --> 00:01:42,160 and we choose the partition key 38 00:01:42,160 --> 00:01:43,800 to be the device ID. 39 00:01:43,800 --> 00:01:45,480 So as you can see the partition key is the 40 00:01:45,480 --> 00:01:46,980 device ID in this case. 41 00:01:46,980 --> 00:01:49,210 And so what will happen is that it will go through 42 00:01:49,210 --> 00:01:51,960 a hash function, which is a mathematical function 43 00:01:51,960 --> 00:01:55,260 that will take as an input the partition key 44 00:01:55,260 --> 00:01:59,040 and will figure out to which shard to send the data. 45 00:01:59,040 --> 00:02:01,790 And so, for example, it goes to shard one, 46 00:02:01,790 --> 00:02:03,710 thanks to the hash function and so that means that 47 00:02:03,710 --> 00:02:06,420 all the data that shares the same partition key 48 00:02:06,420 --> 00:02:08,220 is going to go to the same shard. 49 00:02:08,220 --> 00:02:13,220 So all the data of the device ID will end up in shard one.
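To make the hash function step a bit more concrete, here is a minimal sketch in Python. Kinesis maps partition keys to 128-bit integers with MD5 and then looks up which shard's hash key range contains that value. The sketch assumes the shards split the hash space evenly, which is a simplification, and `shard_for_key` is a hypothetical helper name, not a Kinesis API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key, num_shards):
    """Return a 0-based shard index, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the 128-bit hash into one of num_shards equal slices.
    return hash_value * num_shards // HASH_SPACE


# The same device ID always lands on the same shard index.
for device_id in ("111222333", "444555666"):
    print(device_id, "->", shard_for_key(device_id, 6))
```

The point to take away is the determinism: the same partition key always produces the same hash, so it always resolves to the same shard.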
50 00:02:13,340 --> 00:02:15,370 Now, if you have a second device 51 00:02:15,370 --> 00:02:18,970 with a different device ID, so 444555666, 52 00:02:18,970 --> 00:02:21,330 then the data record is going to have 53 00:02:21,330 --> 00:02:23,600 a different partition key but it's going to go through 54 00:02:23,600 --> 00:02:25,510 the same hash function. 55 00:02:25,510 --> 00:02:27,830 And this time the hash function may decide 56 00:02:27,830 --> 00:02:30,930 to send this data into shard two and so that means 57 00:02:30,930 --> 00:02:33,550 that this device ID will have all its data 58 00:02:33,550 --> 00:02:35,970 sent to shard number two. 59 00:02:35,970 --> 00:02:40,130 And so this is how you produce to Kinesis data streams 60 00:02:40,130 --> 00:02:41,900 with a partition key. 61 00:02:41,900 --> 00:02:46,470 Now, as you can see, if one device is very chatty 62 00:02:46,470 --> 00:02:49,580 and sends a lot of data it may overwhelm a shard. 63 00:02:49,580 --> 00:02:51,850 Also, you need to make sure your partition key 64 00:02:51,850 --> 00:02:54,770 is very well distributed to avoid the problem 65 00:02:54,770 --> 00:02:57,840 of a hot partition, because then you will have one shard 66 00:02:57,840 --> 00:03:00,400 that's going to have more throughput than the others 67 00:03:00,400 --> 00:03:01,900 and that will bring some imbalance. 68 00:03:01,900 --> 00:03:03,860 So you need to think about a way to have 69 00:03:03,860 --> 00:03:05,420 a distributed partition key. 70 00:03:05,420 --> 00:03:08,580 For example, if you have six shards and 10,000 users 71 00:03:08,580 --> 00:03:10,890 the user ID is very well distributed.
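Here is a small sketch of that six-shards, 10,000-users case in Python, counting how many user IDs land on each shard. It assumes MD5 hashing over evenly split hash key ranges, and the `user-N` ID format and helper names are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 6
HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key, num_shards=NUM_SHARDS):
    """0-based shard index, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // HASH_SPACE


def shard_counts(keys, num_shards=NUM_SHARDS):
    """Count how many of the given partition keys map to each shard."""
    counts = Counter(shard_for_key(key, num_shards) for key in keys)
    return [counts.get(index, 0) for index in range(num_shards)]


# Hypothetical user IDs; with many distinct keys every shard gets a share.
user_ids = [f"user-{n}" for n in range(10_000)]
print(shard_counts(user_ids))
```

With thousands of distinct keys the counts come out roughly even, which is what a well-distributed partition key gives you. A key with only a handful of distinct values can never touch more shards than it has values.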
72 00:03:10,890 --> 00:03:14,700 But if you have six shards and you just look at Chrome, 73 00:03:14,700 --> 00:03:17,710 Firefox and Safari as web browsers 74 00:03:17,710 --> 00:03:20,330 and use the name of the web browser as the partition key, 75 00:03:20,330 --> 00:03:23,420 then it's going to be very hot maybe for Chrome, 76 00:03:23,420 --> 00:03:25,690 because there are many, many Chrome users in the world 77 00:03:25,690 --> 00:03:27,430 versus Firefox or Safari. 78 00:03:27,430 --> 00:03:28,480 Who knows, right? 79 00:03:28,480 --> 00:03:30,020 So you need to make sure that you use a 80 00:03:30,020 --> 00:03:32,380 distributed partition key. 81 00:03:32,380 --> 00:03:34,944 Which brings us into the area of 82 00:03:34,944 --> 00:03:37,110 ProvisionedThroughputExceeded. 83 00:03:37,110 --> 00:03:39,110 So when we produce to Kinesis data streams 84 00:03:39,110 --> 00:03:41,120 from our applications, we know that we can produce 85 00:03:41,120 --> 00:03:43,460 1 megabyte per second or 1,000 records per second per shard 86 00:03:43,460 --> 00:03:46,040 and so as long as we do so it goes well. 87 00:03:46,040 --> 00:03:49,590 But if we start over-producing into a shard 88 00:03:49,590 --> 00:03:50,740 we're going to get an exception 89 00:03:50,740 --> 00:03:54,260 because we are going over the provisioned throughput. 90 00:03:54,260 --> 00:03:57,600 So we get the ProvisionedThroughputExceeded exception. 91 00:03:57,600 --> 00:03:59,500 And so the solution to this is number one, 92 00:03:59,500 --> 00:04:02,000 to make sure you are using a highly distributed 93 00:04:02,000 --> 00:04:04,150 partition key because if not, then this error 94 00:04:04,150 --> 00:04:05,670 will happen a lot. 95 00:04:05,670 --> 00:04:10,100 Second, we need to implement retries with exponential backoff 96 00:04:10,100 --> 00:04:13,010 to ensure that we can retry these exceptions 97 00:04:13,010 --> 00:04:15,320 and this is something that comes up in the exam.
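A minimal sketch of retries with exponential backoff in Python might look like the following. The `ProvisionedThroughputExceeded` class here is a local stand-in rather than the SDK's own exception type, and the `put_with_backoff` name, the default delays and the random jitter are assumptions added for illustration.

```python
import random
import time


class ProvisionedThroughputExceeded(Exception):
    """Local stand-in for the throttling error (not the SDK's class)."""


def put_with_backoff(put, max_attempts=5, base_delay=0.1,
                     max_delay=5.0, sleep=time.sleep):
    """Call put() and retry on throttling with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return put()
        except ProvisionedThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # The wait ceiling doubles each attempt: 0.1s, 0.2s, 0.4s, ...
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (an addition in this sketch) spreads out competing producers.
            sleep(random.uniform(0, ceiling))
```

Here `put` is any zero-argument callable that performs the write, so the same wrapper works for a single put or a batched one. Backoff only smooths over short bursts; if the partition key is hot, the retries will keep hitting the same shard.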
98 00:04:15,320 --> 00:04:17,960 And finally, you may need to scale the shards, 99 00:04:17,960 --> 00:04:19,210 which is called shard-splitting, 100 00:04:19,210 --> 00:04:22,920 to split a shard in two and increase the throughput 101 00:04:22,920 --> 00:04:26,820 and so we'll see shard-splitting in a future lecture. 102 00:04:26,820 --> 00:04:28,370 But increasing the number of shards 103 00:04:28,370 --> 00:04:31,370 can help address throughput exceptions obviously, 104 00:04:31,370 --> 00:04:34,030 because if you go from six shards to seven shards 105 00:04:34,030 --> 00:04:35,954 we have one more megabyte per second available 106 00:04:35,954 --> 00:04:37,460 in our stream. 107 00:04:37,460 --> 00:04:39,000 So that's it for this lecture. 108 00:04:39,000 --> 00:04:41,950 I hope you liked it and I will see you in the next lecture.