Use AWS Lambda, Amazon Route 53, and Amazon SNS to monitor read replica endpoints for a Redis cluster (cluster mode disabled)
In Amazon ElastiCache for Redis, you connect to an ElastiCache node or cluster through its endpoint. According to ElastiCache Components and Features in the ElastiCache for Redis User Guide, a multi-node Redis (cluster mode disabled) cluster has two types of endpoints: the primary endpoint, which always resolves to the primary node, and the individual endpoints of the read replicas.
The ElastiCache for Redis primary endpoint always resolves consistently to the endpoint of the current primary node. AWS customers appreciate this feature.
As a best practice, and to balance the workload, you should send read requests to the read replicas. However, when a failover occurs, a read replica you were using can be promoted to the primary role. If you keep sending read requests to the same endpoint, the load on the new primary (the former read replica) can increase. In this case, it is convenient to have read replica endpoints that always point to replicas, even after a failover.
To achieve this, you can set up an AWS Lambda function that monitors and updates the read replica endpoints. The idea is to create a custom CNAME for each read replica in an Amazon Route 53 private zone and to use these CNAMEs in your Redis clients.
When a failover occurs, a push notification is delivered to an Amazon Simple Notification Service (Amazon SNS) topic. The Lambda function that listens to this SNS topic updates the CNAMEs to point to the correct read replica endpoints. As a result, the Redis clients always have CNAMEs that point to read replica endpoints, in addition to the regular ElastiCache primary endpoint used for write operations on the primary node.
This article walks through an AWS Lambda function that listens to Amazon SNS and updates the CNAMEs used with an Amazon ElastiCache for Redis cluster (cluster mode disabled).
Solution overview
The architecture of this solution is as follows:
Client application
In this example, the client uses the ElastiCache primary endpoint for writes. For reads, it uses custom CNAMEs that point to five read replicas.
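The read/write split described above can be sketched on the client side as follows. This is only an illustration, not the article's client code: the `EndpointRouter` class, its round-robin policy, and the `primary.example.internal` host are assumptions, and the CNAMEs simply follow the ReadOnlyN pattern used in the benchmark section.

```python
import itertools


class EndpointRouter:
    """Pick a host per operation: primary endpoint for writes, CNAMEs for reads."""

    def __init__(self, primary_endpoint, read_cnames):
        if not read_cnames:
            raise ValueError("at least one read CNAME is required")
        self.primary_endpoint = primary_endpoint
        # Round-robin over the read CNAMEs to spread read traffic (an assumption).
        self._reads = itertools.cycle(read_cnames)

    def endpoint_for(self, operation):
        """Return the host to use for a 'read' or a 'write'."""
        if operation == "write":
            return self.primary_endpoint
        if operation == "read":
            return next(self._reads)
        raise ValueError("unknown operation: {}".format(operation))


# Hypothetical primary endpoint; the CNAMEs follow the ReadOnlyN naming pattern.
router = EndpointRouter(
    "primary.example.internal",
    ["ReadOnly{}.private.redisdub.pl".format(n) for n in range(1, 6)],
)
```

Because the application only ever sees the primary endpoint and the CNAMEs, a failover changes what the names resolve to, not the names themselves.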
Amazon ElastiCache
In this example, an SNS topic is selected for an ElastiCache cluster (cluster mode disabled) that has one primary node and five replicas.
Amazon Route 53
Create a private DNS zone (private.redisdub.pl) and use one CNAME per read replica in this zone, for example ReadOnly1.private.redisdub.pl.
AWS Identity and Access Management (IAM)
The Lambda function has an IAM role that grants the permissions it needs to run.
AWS Lambda
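The Lambda function listens to the cluster's SNS topic. Whenever an event arrives on this topic, the function detects whether it is a failover or the addition or removal of a read replica. When any of these three events occurs, the Lambda function makes an API call to get the latest layout of the Redis cluster (elasticache.describe_replication_groups).
Based on the response, the function makes another API call to update or create CNAMEs in the Route 53 private zone (route53.change_resource_record_sets). For a failover, it updates the existing "read" CNAME. For a node creation or deletion, CNAMEs are added or removed accordingly.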
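The event detection described above can be sketched as a small pure function. This is only an illustration: the event names are the ones checked by the Lambda function shown later in this article, and the sample message assumes the single-key `"ElastiCache:<Event>": "<node>"` shape that the function parses. The helper names `parse_sns_event` and `needs_cname_refresh` are hypothetical.

```python
import json

# Event names taken from the Lambda function shown later in this article.
HANDLED_EVENTS = {
    "CacheNodeReplaceComplete",
    "TestFailoverApiCalled",
    "FailoverComplete",
    "CacheClusterProvisioningComplete",
}


def parse_sns_event(event):
    """Return (event_name, node_id) from an SNS-triggered Lambda event."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Assumption: the message holds a single "ElastiCache:<Event>" key.
    msg_type = next(iter(message))
    event_name = msg_type.split(":")[1]
    return event_name, message[msg_type]


def needs_cname_refresh(event_name):
    """True when the event should trigger a CNAME update."""
    return event_name in HANDLED_EVENTS


# Hypothetical sample event, reduced to the fields the parser reads.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({"ElastiCache:FailoverComplete": "testdns-002"})}}
    ]
}
event_name, node_id = parse_sns_event(sample_event)
```

Keeping the parsing separate from the AWS calls makes it possible to test the filter locally without credentials.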
In this scenario, the application continuously sends read operations to the read replicas, in addition to writing to the primary node through the primary endpoint.
Results and benchmarks
In the following test, a cron job on each of the five client instances runs the following redis-benchmark command.
redis-benchmark -n 10000 -k 0 -h ReadOnly1.private.redisdub.pl -p 6379
-n 10000 runs 10,000 requests.
-k 0 reconnects for every request.
-h ReadOnly1.private.redisdub.pl connects to one of the CNAMEs created for the replicas.
Each of the five clients targets one unique CNAME.
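As a rough illustration of that setup, the per-client commands can be generated from the list of CNAMEs. This sketch assumes the five CNAMEs are named ReadOnly1 through ReadOnly5 in the private zone; only ReadOnly1 appears in this article. The `benchmark_command` helper is hypothetical, and `-h` is the hostname option of redis-benchmark.

```python
def benchmark_command(host, requests=10000, keepalive=0, port=6379):
    """Build the redis-benchmark command line for one read CNAME."""
    args = [
        "redis-benchmark",
        "-n", str(requests),   # total number of requests
        "-k", str(keepalive),  # 0 = reconnect for every request
        "-h", host,            # server hostname (here, a read CNAME)
        "-p", str(port),
    ]
    return " ".join(args)


# Assumed naming: one CNAME per replica, numbered 1 to 5.
READ_CNAMES = ["ReadOnly{}.private.redisdub.pl".format(n) for n in range(1, 6)]
commands = {host: benchmark_command(host) for host in READ_CNAMES}
```

Each client instance would then schedule the command for its own CNAME.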
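The following screenshot shows the Amazon CloudWatch NewConnections metric. It shows that the requests generated by the benchmark are evenly distributed across the read replicas.
Looking at this CloudWatch metric more closely, you can see the failover triggered at 16:00, with the primary testdns-001 failing over to testdns-002.
You can also see that testdns-002 was receiving requests from the benchmark. At 16:00, when the failover is triggered, its number of requests drops because the CNAME record is updated. testdns-002 then becomes the new primary and stops receiving requests through the read CNAME ReadOnly1.private.redisdub.pl.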
Before the failover at 16:00:
Primary endpoint -> primary testdns-001
ReadOnly1.private.redisdub.pl -> replica testdns-002
After the failover at 16:00:
Primary endpoint -> new primary testdns-002
ReadOnly1.private.redisdub.pl -> replica testdns-001
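The before/after mapping can be reproduced with a small sketch that mirrors the naming logic of the Lambda function shown later (a number appended to the first label of the CNAME, replicas only). The node addresses and the base name `ReadOnly.private.redisdub.pl` are assumptions, and sorting the node IDs is a choice made here to get a deterministic numbering.

```python
def numbered_cname(cname, num):
    """Append num to the first DNS label: ReadOnly.zone -> ReadOnly1.zone."""
    labels = cname.split(".")
    labels[0] = labels[0] + str(num)
    return ".".join(labels)


def read_cname_targets(nodes, cname):
    """Map each numbered read CNAME to the address of one replica."""
    targets = {}
    num = 1
    for node_id in sorted(nodes):  # sorted for a stable numbering (assumption)
        if nodes[node_id]["role"] == "replica":
            targets[numbered_cname(cname, num)] = nodes[node_id]["addr"]
            num += 1
    return targets


# Hypothetical node addresses, for illustration only.
before = {
    "testdns-001": {"role": "primary", "addr": "testdns-001.example.internal"},
    "testdns-002": {"role": "replica", "addr": "testdns-002.example.internal"},
}
after = {
    "testdns-001": {"role": "replica", "addr": "testdns-001.example.internal"},
    "testdns-002": {"role": "primary", "addr": "testdns-002.example.internal"},
}
mapping_before = read_cname_targets(before, "ReadOnly.private.redisdub.pl")
mapping_after = read_cname_targets(after, "ReadOnly.private.redisdub.pl")
```

The CNAME name stays the same across the failover; only its target changes.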
As in a normal failover scenario, the former primary node testdns-001 is replaced. Once the node is up and running, you can see it start to receive requests from the benchmark, because ReadOnly1.private.redisdub.pl now points to testdns-001.
testdns-001 could have received requests from 16:06 onward, but the next benchmark run was at 16:10. This is why a flat brown line appears between 16:06 and 16:10.
Detailed steps for implementing this tool
Step 1: Create a Route 53 private zone for the virtual private cloud (VPC) where the Redis cluster and its clients are located.
For more information, see Creating a Private Hosted Zone in the Amazon Route 53 Developer Guide.
You also need to create the read replica CNAMEs.
You can create as many CNAMEs as there are existing read replicas.
After you create the CNAMEs, you can use them in your client application.
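If you prefer to create the CNAMEs with the API rather than the console, the change batch can be sketched as below. The dictionary follows the shape that the Lambda function later in this article passes to route53.change_resource_record_sets. The helper name `cname_upsert_batch`, the target address, and the zone ID in the comment are placeholders, and the TTL of 10 seconds is the value used in that function.

```python
def cname_upsert_batch(name, target, ttl=10):
    """Build a Route 53 change batch that creates or updates one CNAME."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if it exists
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ]
    }


# Hypothetical replica address, for illustration only.
batch = cname_upsert_batch(
    "ReadOnly1.private.redisdub.pl", "testdns-002.example.internal"
)

# With credentials and boto3 available, the batch could be applied like this:
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="YOUR_ZONE_ID", ChangeBatch=batch)
```

Using UPSERT means the same call works both for the initial creation and for later updates.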
Step 2: Add an SNS topic to the cluster.
For more information, see Managing ElastiCache Amazon SNS Notifications in the ElastiCache for Redis User Guide.
Step 3: Create an IAM role for the Lambda function with the following policy.
Create a new policy named RedisReplica_Route53 with the following content.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1511707556511",
      "Action": [
        "route53:GetHostedZone",
        "route53:ChangeResourceRecordSets"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:route53:::hostedzone/Z32WVXIKNNRFKK"
    }
  ]
}
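The hosted zone ID in the policy above belongs to the example environment, so you need to substitute your own. One way to avoid editing the JSON by hand is to generate it, as in this sketch. The `route53_policy` helper is hypothetical, and the optional Sid is omitted.

```python
import json


def route53_policy(hosted_zone_id):
    """Return the policy document scoped to a single hosted zone."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "route53:GetHostedZone",
                    "route53:ChangeResourceRecordSets",
                ],
                "Resource": "arn:aws:route53:::hostedzone/{}".format(hosted_zone_id),
            }
        ],
    }


# The zone ID from the example above; replace it with your own zone ID.
policy_json = json.dumps(route53_policy("Z32WVXIKNNRFKK"), indent=2)
```

Scoping the resource to one hosted zone keeps the function from modifying records in any other zone.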
For more information about IAM policies and roles, see the following topics.
Step 4: Create the Lambda function using the following code.
In the AWS Lambda console, choose Create function, and then choose Author from scratch. Name the function RedisReplica-autocname.
In the role list, choose the Choose an existing role option. Then select the role that you created in Step 3.
Choose [Create Function]. A page with three tabs, [Settings], [Trigger], and [Monitoring], is displayed. On the [Settings] tab, choose [Python 2.7] for [Runtime]. Then copy the following code and paste it into the editor.
from __future__ import print_function

import json
import os
import re

import boto3

# Configuration is read from the Lambda environment variables.
AWS_REGION = os.environ['aws_region']
CNAME = os.environ['cname']
ZONE_ID = os.environ['zone_id']
CLUSTER = os.environ['cluster']


def aws_session(role_arn=None, session_name='my_session'):
    """
    If role_arn is given assumes a role and returns boto3 session
    otherwise return a regular session with the current IAM user/role
    """
    if role_arn:
        client = boto3.client('sts')
        response = client.assume_role(
            RoleArn=role_arn, RoleSessionName=session_name)
        session = boto3.Session(
            aws_access_key_id=response['Credentials']['AccessKeyId'],
            aws_secret_access_key=response['Credentials']['SecretAccessKey'],
            aws_session_token=response['Credentials']['SessionToken'])
        return session
    else:
        return boto3.Session()


def get_nodes(cluster, session):
    """
    return a dictionary of the nodes that make up a cluster
    """
    elasticache = session.client('elasticache', region_name=AWS_REGION)
    repgroups = elasticache.describe_replication_groups()['ReplicationGroups']
    nodes = {}
    for repgroup in repgroups:
        if repgroup['ReplicationGroupId'] == cluster:
            for nodegrp in repgroup['NodeGroups']:
                for cachecluster in nodegrp['NodeGroupMembers']:
                    nodes[cachecluster['CacheClusterId']] = {
                        'role': cachecluster['CurrentRole'],
                        'addr': cachecluster['ReadEndpoint']['Address'],
                    }
    return nodes


def update_cname(nodes, cname, zone, session):
    """
    update CNAME entries from a dictionary of nodes.
    """
    route53 = session.client('route53')
    dzone = route53.get_hosted_zone(Id=zone)
    dzonedomain = dzone["HostedZone"]["Name"]
    # CNAME should be a valid sub-domain of the zone
    if not re.match(r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{1,63})*\.'
                    + dzonedomain, cname):
        return 'Error, cname {} doesnt match domain {}'.format(
            cname, dzonedomain)
    response = {}
    num = 1
    for node_name in nodes.keys():
        node = nodes[node_name]
        if node['role'] == 'replica':
            # Append the replica number to the first label of the CNAME
            realcname = '.'.join(
                [i + str(num) if enum == 0 else i
                 for enum, i in enumerate(cname.split('.'))])
            dns_changes = {
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': realcname,
                            'Type': 'CNAME',
                            'TTL': 10,
                            'ResourceRecords': [
                                {
                                    'Value': node['addr'],
                                }
                            ],
                        }
                    }
                ]
            }
            print("DEBUG - Updating Route53 to create CNAME {} for {}".format(
                realcname, node['addr']))
            response[node_name] = route53.change_resource_record_sets(
                HostedZoneId=zone, ChangeBatch=dns_changes)
            num += 1
    return response


def lambda_handler(event, context):
    """
    Main lambda function
    Parse and check the event validity
    """
    msg = json.loads(event['Records'][0]['Sns']['Message'])
    # list() keeps this line working on both Python 2.7 and Python 3
    msg_type = list(msg.keys())[0]
    msg_event = msg_type.split(':')[1]
    msg_node = msg[msg_type]
    events = ['CacheNodeReplaceComplete', 'TestFailoverApiCalled',
              'FailoverComplete', 'CacheClusterProvisioningComplete']
    if msg_event not in events:
        print('Event {} is not valid for RedisReplica-autocname function'.format(msg_type))
        return
    else:
        print('Event {} is valid, processing with RedisReplica-autocname...'.format(msg_type))
    session = aws_session()
    nodes = get_nodes(CLUSTER, session)
    if msg_node not in nodes:
        print('{} not a node of cluster {}'.format(msg_node, CLUSTER))
        return
    dnsupdate = update_cname(nodes, CNAME, ZONE_ID, session)
    # update_cname returns a dictionary when OK and a string on error
    if isinstance(dnsupdate, str):
        print(dnsupdate)
        return
    # items() works on both Python 2.7 and Python 3
    for response in dnsupdate.items():
        print("DNS record {} R53 status is {}".format(
            response[0], response[1]['ChangeInfo']['Status']))
    return
For more information about creating functions, see Create a Lambda Function in the AWS Lambda Developer Guide.
Next, set four environment variables. These are key-value pairs; the code above reads them under the names aws_region, cname, zone_id, and cluster.
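A missing environment variable makes the function fail at import time with a bare KeyError, so it can help to check all four up front. The following sketch shows one way to do that. The `load_config` helper and the sample values are hypothetical; only the four variable names come from the function's code.

```python
import os

# Variable names read by the Lambda function in this article.
REQUIRED_VARS = ("aws_region", "cname", "zone_id", "cluster")


def load_config(environ=None):
    """Return the four settings, or raise KeyError naming the missing ones."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Placeholder values, for illustration only.
sample_environ = {
    "aws_region": "eu-west-1",
    "cname": "ReadOnly.private.redisdub.pl.",
    "zone_id": "Z32WVXIKNNRFKK",
    "cluster": "testdns",
}
config = load_config(sample_environ)
```

Passing the mapping as a parameter keeps the check testable outside Lambda.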
On the same [Settings] tab of the console, set the function timeout to limit how long the code is allowed to run. We recommend a timeout of 15 seconds. (In our tests the execution time was about 3 seconds, but this can vary depending on your environment.)
Finally, on the Trigger tab, choose SNS and select the topic associated with the Redis cluster.
Step 5: Test the environment.
The last step is to test the environment by running a manual failover. The failover triggers an event on the SNS topic, and the Lambda function detects the failover. It then collects the new primary/replica mapping and updates the CNAMEs in the private zone.
After the Time To Live (TTL) expires (15 seconds in the Amazon EC2 private hosted zone), the client instances get the new DNS record and connect to the new read replica (the former primary) for read operations. No other changes are needed in the application.
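The effect of the TTL can be illustrated with a small model of a caching resolver. This is a simplified sketch, not how any particular DNS client is implemented. The `TtlResolver` class, the fake record table, and the fake clock are assumptions that let the example run without network access.

```python
import time


class TtlResolver:
    """Cache lookups for ttl_seconds, then resolve again."""

    def __init__(self, resolve, ttl_seconds, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # name -> (value, expiry time)

    def lookup(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still within the TTL: serve the cached target
        value = self._resolve(name)
        self._cache[name] = (value, now + self._ttl)
        return value


# Fake DNS table and clock so the sketch runs without network access.
records = {"ReadOnly1.private.redisdub.pl": "testdns-002"}
fake_now = [0.0]
resolver = TtlResolver(
    records.__getitem__, ttl_seconds=15, clock=lambda: fake_now[0]
)
```

In this model, a client keeps using the old target until the cached entry expires, which is why the switch to the new replica is not instantaneous.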
Another test is to add a new node to the replication group. The Lambda function automatically creates a new CNAME. If you are already using read1.myredis.com and read2.myredis.com, read3.myredis.com is created, and you can add this new CNAME to your application. The function always maintains the same number of CNAMEs as read replicas. In other words, if one node is deleted, one CNAME is deleted.
About the Lambda function
The Lambda function listens to the SNS topic directly. It also performs several checks to filter events, such as verifying the event type and confirming that the node belongs to the configured cluster.
Every message on the SNS topic triggers the Lambda function, but keep in mind that only relevant messages result in an action. We recommend using an SNS topic dedicated to the Redis cluster so that you don't pay for more invocations than necessary.
Summary
The ElastiCache for Redis primary endpoint always connects to the primary node of the cluster, even when a different node takes over the primary role. When you send read requests to read replicas, it is convenient to have replica endpoints that always point to replicas, even after a failover. This article described how to create an AWS Lambda function that monitors and updates the read replica endpoints. With this process in place, the Redis clients have CNAMEs that point to read replica endpoints, in addition to the regular ElastiCache primary endpoint used for write operations on the primary node.
You can reuse the structure of this solution in other projects. You can change the event types used to filter Lambda executions and add code that runs in response to other ElastiCache events.
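One way to make the function easier to extend is a dispatch table that maps event names to handlers. This is a sketch of that idea and not part of the function shown in this article. The handler functions are hypothetical stand-ins that only return a description of what they would do; the event names are ones the function already recognizes.

```python
def refresh_read_cnames(node_id):
    """Stand-in for the CNAME update performed by the Lambda function."""
    return "refresh CNAMEs after event on {}".format(node_id)


def announce_new_node(node_id):
    """Hypothetical extra action, for example notifying an operations channel."""
    return "announce new node {}".format(node_id)


# Map each ElastiCache event name to the code that should run for it.
HANDLERS = {
    "FailoverComplete": refresh_read_cnames,
    "CacheNodeReplaceComplete": refresh_read_cnames,
    "CacheClusterProvisioningComplete": announce_new_node,
}


def dispatch(event_name, node_id):
    """Run the handler registered for event_name, or do nothing."""
    handler = HANDLERS.get(event_name)
    if handler is None:
        return None  # unrelated event: ignore it
    return handler(node_id)
```

Adding support for another event then only requires one new entry in `HANDLERS`.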
About the author
Yann Richard is an AWS Cloud Support Engineer and a subject matter expert for ElastiCache. His personal goals are to finish a full marathon in under 4 hours and to move data in under a millisecond.
Julieen Prigent is an AWS Linux Cloud Support Engineer. He likes to push his physical limits, whether in a technical exploration or on a long-distance trail run.