Issue summary
Copy to S3 operations can run very slowly when calls to the EBS API time out.
See the troubleshooting steps and additional performance suggestions below.
Troubleshooting
The most common cause of slow Copy to S3 speed is a timeout when connecting to the EBS API endpoint. Starting with version 3.1a, Copy to S3 related operations use the new EBS Direct API. This API lets us check which blocks changed in a snapshot, which helps reduce the cost and time of the copy process.
If you have this issue, you will see the following error in the /var/log/cmp/c2s3_log.log log on the temporary worker instance:
2021-02-01 13:31:02,483:[1362][140364389091072][yVtOLmSFrH] ERROR: __call__(call_with_retry.py:16) list_snapshot_blocks failed [retry=0]
Traceback (most recent call last):
..
..
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
..
..
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f2930675048>, 'Connection to ebs.east-us-1.amazonaws.com timed out. (connect timeout=20)')
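To confirm that the worker is hitting this error, you can search the log for the timeout. The following is a minimal sketch, assuming the log path given above and the message text shown in the excerpt. The function names count_ebs_timeouts and recent_ebs_failures are hypothetical helpers, not part of the product.

```shell
# Count EBS API connect timeouts recorded in the Copy to S3 worker log.
# The default path is the log file named in this article.
count_ebs_timeouts() {
  log_file="${1:-/var/log/cmp/c2s3_log.log}"
  # grep -c prints the number of matching lines (0 when there are none).
  grep -c 'ConnectTimeoutError' "$log_file"
}

# Show the most recent failed list_snapshot_blocks calls, including the retry counter.
recent_ebs_failures() {
  log_file="${1:-/var/log/cmp/c2s3_log.log}"
  grep 'list_snapshot_blocks failed' "$log_file" | tail -n 5
}
```

Running count_ebs_timeouts with no argument on the worker reads the default log file. A non-zero count suggests that the connectivity checks described below are worth running.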
Version 3.2.1 Troubleshooting:
In this version we have expanded the worker test, which can now also check EBS API connectivity.
Run the test for the relevant worker configuration and make sure that the EBS API test passes.
If the test fails, check what is blocking the communication (a security group, for example). If you are using only a private IP, make sure the EC2 temporary worker can reach the AWS EBS endpoint, either via a proxy or a VPC endpoint.
For manual troubleshooting suggestions, see below.
Version 3.1.0a/b/3.2.0 Troubleshooting:
Connect to the worker instance via SSH (the user is ubuntu, and the private key is the one selected in the worker configuration).
Alternatively, you can launch a new Linux instance in the same VPC and security groups and then follow the steps listed below.
If you don't know how to connect via SSH, you can view the instructions from AWS by selecting the instance and then clicking Connect.
If you connected to the worker instance, turn on termination protection so that the instance is not deleted while you are testing and troubleshooting.
Note: Don't forget to disable this option when done and delete the worker instance.
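Termination protection can also be toggled from the AWS CLI instead of the console. The following is a sketch under the assumption that the AWS CLI is installed and has credentials allowed to modify the instance. The wrapper name set_termination_protection is hypothetical; the underlying command and flags are the standard aws ec2 modify-instance-attribute options.

```shell
# Enable ("on") or disable ("off") termination protection for an instance.
set_termination_protection() {
  instance_id="$1"
  case "$2" in
    on)  flag="--disable-api-termination" ;;
    off) flag="--no-disable-api-termination" ;;
    *)   echo "usage: set_termination_protection <instance-id> on|off" >&2
         return 2 ;;
  esac
  aws ec2 modify-instance-attribute --instance-id "$instance_id" "$flag"
}
```

Call it with the worker instance ID and on before testing, then with off once you are done, so that the worker can be deleted.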
Once connected, you can test connectivity by running this curl command:
Note: Replace us-east-1 with the relevant region, which is the region where the snapshot being copied is located (or the one mentioned in the error log).
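A connectivity check along these lines could look like the sketch below. It assumes the standard regional endpoint naming ebs.<region>.amazonaws.com and reuses the 20-second connect timeout seen in the error log. The function names ebs_endpoint_url and check_ebs_api are hypothetical.

```shell
# Build the regional EBS API endpoint URL (assumed standard naming).
ebs_endpoint_url() {
  printf 'https://ebs.%s.amazonaws.com' "$1"
}

# Report whether the EBS API endpoint for a region can be reached.
check_ebs_api() {
  url="$(ebs_endpoint_url "$1")"
  # Any HTTP response (even an error status) proves the network path is open;
  # curl exits non-zero when it cannot complete the connection, for example on a timeout.
  if curl -sS -o /dev/null --connect-timeout 20 "$url"; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
    return 1
  fi
}
```

For example, check_ebs_api us-east-1 prints a reachable or unreachable line for that region. An HTTP error status from the endpoint still counts as reachable here, because the goal is only to test the network path.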
If you get a timeout, check your VPC and security group settings for what is blocking the communication. Different networks might have different reasons for the timeout.
For the test instance used for this KB, the issue was resolved by doing the following:
1. Adding an EBS VPC endpoint to the worker VPC.
2. Updating the security group attached to the endpoint with an inbound rule.
Note: The security group above was attached only to the endpoint.
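The two steps above can also be performed with the AWS CLI. This is a sketch only: it assumes an interface endpoint for the com.amazonaws.<region>.ebs service and an inbound rule that allows HTTPS (TCP 443) from the worker's CIDR range, which is an assumption because the KB does not state the exact rule. The function names and all IDs passed to them are hypothetical placeholders.

```shell
# Service name of the EBS direct APIs interface endpoint in a region.
ebs_service_name() {
  printf 'com.amazonaws.%s.ebs' "$1"
}

# Create the EBS interface endpoint. Args: region vpc-id subnet-id endpoint-sg-id
create_ebs_endpoint() {
  aws ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-id "$2" \
    --vpc-endpoint-type Interface \
    --service-name "$(ebs_service_name "$1")" \
    --subnet-ids "$3" \
    --security-group-ids "$4"
}

# Allow inbound HTTPS to the endpoint's security group (assumed rule).
# Args: region endpoint-sg-id source-cidr
allow_https_to_endpoint() {
  aws ec2 authorize-security-group-ingress \
    --region "$1" \
    --group-id "$2" \
    --protocol tcp --port 443 --cidr "$3"
}
```

Your network may need a different source range or additional changes, so treat the rule as a starting point rather than the exact fix.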
Once you have updated your network so that the curl command works, you can try to run Copy to S3 again.
Additional suggestions
When copying to S3, it is recommended to locate the S3 bucket in the same region as the snapshot. Having the bucket in the same region should provide better upload speed to the bucket and lower data transfer charges.
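To verify that a bucket is in the same region as the snapshot, you can query its location with the AWS CLI. This is a minimal sketch, assuming configured AWS CLI credentials. The helper names are hypothetical. Note that buckets in us-east-1 report no location constraint, which the CLI prints as None in text output.

```shell
# Map the LocationConstraint reported for a bucket to a region name.
# Buckets in us-east-1 report no constraint, printed as "None" by the AWS CLI.
normalize_bucket_region() {
  case "$1" in
    None|null|"") echo "us-east-1" ;;
    *)            echo "$1" ;;
  esac
}

# Print the region of a bucket. Args: bucket-name
bucket_region() {
  constraint="$(aws s3api get-bucket-location --bucket "$1" \
    --query LocationConstraint --output text)"
  normalize_bucket_region "$constraint"
}

# Succeeds when the bucket is in the given snapshot region.
# Args: bucket-name snapshot-region
same_region() {
  [ "$(bucket_region "$1")" = "$2" ]
}
```

For example, same_region followed by your bucket name and the snapshot region succeeds only when the two match.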
In addition, it is recommended to add an S3 VPC endpoint, which should help improve performance a bit and also ensures that communication with S3 is done over private IPs, which is more secure.
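An S3 VPC endpoint can be added from the AWS CLI. The sketch below assumes a gateway endpoint associated with the route table used by the worker subnet. The function name create_s3_gateway_endpoint and the IDs passed to it are hypothetical placeholders.

```shell
# Create an S3 gateway endpoint in the worker VPC.
# Args: region vpc-id route-table-id
create_s3_gateway_endpoint() {
  aws ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-id "$2" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.$1.s3" \
    --route-table-ids "$3"
}
```

Use the region of the worker VPC, and the route table that the worker subnet is associated with, so that S3 traffic from the worker is routed through the endpoint.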