
Performance Efficiency

Right-size your resources and streamline your monitoring to deliver performant workloads

Performance covers a lot of ground, and here again you need to balance business and technical requirements. What’s an acceptable load time for your customers? Get that wrong and customers become frustrated and you lose business. We could, of course, build something super responsive but waste resources and money, so there is a fine line to tread.

When we think about S3 and performance, we no longer have to consider things like randomizing prefixes to spread data out and avoid hot spots. Amazon listened to its customers and now handles this for us. There are still things we can do for performance, though: from using S3 Batch Operations to process lots of data, to using S3 Select to minimise the data you pull from S3 (which helps with cost too). The easiest way to keep an eye on things is the S3 dashboard, something I geek out about on a regular basis because it has pretty graphs :) We’ll look at some of these operations in this chapter and I’ll show you how to keep an eye on your systems and keep objects flying in and out of your buckets.

1 - SNS and EventBridge

SNS and EventBridge Triggers.

OpEx Sec Rel Perf Cost Sus

Enabling notifications is a bucket-level operation. Amazon S3 stores the notification configuration as XML in the notification subresource that’s associated with a bucket. After you create or change the bucket notification configuration, it usually takes about five minutes for the changes to take effect. When the notification is first enabled, an s3:TestEvent occurs.

Technical Considerations

Using SNS or EventBridge gives you great flexibility to have actions performed when a file is uploaded, deleted, or updated in S3. During this guide you’ll also see that you can use SQS or trigger Lambda directly, and you may wonder why not use those approaches instead. You’d be right: it’s more efficient to go direct to Lambda. However, SNS can deliver to multiple subscribers (Lambda, email, etc.), so it gives you a few more options. EventBridge is also an enhancement over going direct to Lambda because it allows you to filter which events actually trigger Lambda, potentially saving you thousands of unneeded invocations.
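As a sketch of that filtering idea, the rule below only matches object-created events under a hypothetical incoming/ prefix, so objects written elsewhere in the bucket never invoke your target. The rule name and prefix are placeholders, and it assumes you have already enabled EventBridge delivery on the bucket (covered in the next section):

aws events put-rule \
  --name s3-incoming-objects \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
      "bucket": {"name": ["<BUCKET-NAME>"]},
      "object": {"key": [{"prefix": "incoming/"}]}
    }
  }'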

Business Considerations

Using cheaper storage such as S3 has real potential to lower your bill, but you’ll probably want to do something with that data. This chapter shows that S3 can be a powerful hub, allowing your data to be processed automatically on upload or other operations. This way of working can help your business move toward a microservices style of architecture, which helps you roll out new features and updates without affecting the entire business, so you can innovate faster.

1.1 - Enabling EventBridge

EventBridge Triggers.

OpEx Sec Rel Perf Cost Sus

You can enable Amazon EventBridge using the S3 console, AWS Command Line Interface (AWS CLI), or Amazon S3 REST API.

Using the S3 console

To enable EventBridge event delivery in the S3 console:

  • Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  • In the Buckets list, choose the name of the bucket that you want to enable events for.
  • Choose Properties.
  • Navigate to the Event Notifications section and find the Amazon EventBridge subsection. Choose Edit.

Enabling EventBridge

  • Under Send notifications to Amazon EventBridge for all events in this bucket, choose On.

Note After you enable EventBridge, it takes around five minutes for the changes to take effect.

Using the AWS CLI

The following example creates a bucket notification configuration for a bucket with Amazon EventBridge enabled.

aws s3api put-bucket-notification-configuration --bucket <BUCKET-NAME> --notification-configuration '{ "EventBridgeConfiguration": {} }'
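If you want to double-check that the change has been picked up, the same API family can read the configuration back; the output should include an EventBridgeConfiguration block once EventBridge delivery is on:

aws s3api get-bucket-notification-configuration --bucket <BUCKET-NAME>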

Creating EventBridge rules

Once enabled, you can create Amazon EventBridge rules for certain tasks. For example, you can send email notifications when an object is created.
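As a rough sketch of the email example: attach an SNS topic that has an email subscription as a target of a rule matching Object Created events from your bucket, similar to the rule sketched in the Technical Considerations section. The rule name, target ID, and topic ARN here are placeholders, and the topic’s access policy must allow events.amazonaws.com to publish to it:

aws events put-targets \
  --rule s3-incoming-objects \
  --targets '[{"Id": "email-topic", "Arn": "<SNS-TOPIC-ARN>"}]'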

1.2 - SNS Topic Notifications

SNS Triggers.

OpEx Sec Rel Perf Cost Sus

Configuring event notifications via the console

Publish event messages to an SNS Topic

  • Head to the SNS console and create a new topic. Just set the name and leave everything else as standard.
  • Make a note of the ARN; you’ll need it in a second.
  • Now edit the SNS topic’s Access Policy. We are going to narrow the policy down to SNS:Publish from your bucket only. Make sure you replace <SNS-ARN>, <BUCKET-NAME>, and <ACCOUNT-ID> with your details:
{
    "Version": "2012-10-17",
    "Id": "example-ID",
    "Statement": [
        {
            "Sid": "Example SNS topic policy",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": [
                "SNS:Publish"
            ],
            "Resource": "<SNS-ARN>",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:*:*:<BUCKET-NAME>"
                },
                "StringEquals": {
                    "aws:SourceAccount": "<ACCOUNT-ID>"
                }
            }
        }
    ]
}
  • Save your settings
  • Now back in the S3 console, select your bucket
  • Click on the Properties tab and scroll down to Event Notifications

Enable Notifications

  • Create a new notification, follow the settings in the following screenshot, and be sure to select the correct SNS topic!

Add the configuration
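If you prefer to script this instead of clicking through the console, the same notification can be set up with the CLI. This is a minimal sketch that sends all object-created events to the topic; swap in your own bucket name and topic ARN:

aws s3api put-bucket-notification-configuration \
  --bucket <BUCKET-NAME> \
  --notification-configuration '{
    "TopicConfigurations": [
      {
        "TopicArn": "<SNS-ARN>",
        "Events": ["s3:ObjectCreated:*"]
      }
    ]
  }'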

2 - Multipart Uploads

Speeding up uploads

OpEx Sec Rel Perf Cost Sus

If you are looking to upload an object greater than ~100 MB in size, you should consider using multipart uploads. This will speed up your total upload time by using multiple threads. You also get the added benefit that if a part fails to upload, you can re-upload just that part rather than the entire object.

Using multipart upload provides the following advantages:

  • Improved throughput - You can upload parts in parallel to improve throughput.
  • Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.
  • Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.
  • Begin an upload before you know the final object size - You can upload an object as you are creating it.

There are three steps to a multipart upload:

  • You initiate a request to S3 and get an upload ID back
  • You upload the parts
  • You send a complete request to S3, and S3 reconstructs the object for you

When uploading, your file is split into parts, anywhere between 1 and 10,000, and if you are doing this programmatically you need to track the part numbers and the ETag responses from S3 within your application. The good news is the AWS CLI automatically splits large files for you. I’ve included information from the AWS website below showing you how to upload and tweak how many concurrent connections you are making, so if you have a fast and stable connection you can use it to its full potential; equally, you could reduce the number of concurrent uploads to slow things down.
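The CLI also lets you tune when multipart upload kicks in and how big each part is, alongside the max_concurrent_requests setting shown further down. The values here are just examples, not recommendations:

aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 25MB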

To use a high-level aws s3 command for your multipart upload, run this command:

$ aws s3 cp large_test_file s3://DOC-EXAMPLE-BUCKET/

This example uses the command aws s3 cp, but other aws s3 commands that involve uploading objects into an S3 bucket (for example, aws s3 sync or aws s3 mv) also automatically perform a multipart upload when the object is large.

Objects that are uploaded to Amazon S3 using multipart uploads have a different ETag format than objects that are uploaded using a traditional PUT request. To store the MD5 checksum value of the source file as a reference, upload the file with the checksum value as custom metadata. To add the MD5 checksum value as custom metadata, include the optional parameter --metadata md5="examplemd5value1234/4Q" in the upload command, similar to the following:

$ aws s3 cp large_test_file s3://DOC-EXAMPLE-BUCKET/ --metadata md5="examplemd5value1234/4Q"

To use more of your host’s bandwidth and resources during the upload, increase the maximum number of concurrent requests set in your AWS CLI configuration. By default, the AWS CLI uses 10 maximum concurrent requests. This command sets the maximum concurrent number of requests to 20:

$ aws configure set default.s3.max_concurrent_requests 20

Upload the file in multiple parts using low-level (aws s3api) commands

Important: Use this aws s3api procedure only when aws s3 commands don’t support a specific upload need, such as when the multipart upload involves multiple servers, a multipart upload is being manually stopped and resumed, or when the aws s3 command doesn’t support a required request parameter. For other multipart uploads, use aws s3 cp or other high-level s3 commands.

  1. Split the file that you want to upload into multiple parts. Tip: If you’re using a Linux operating system, use the split command.
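For example, on a Linux system with GNU coreutils, something like this splits the file into numbered 100 MB parts whose names match the later steps:

split -b 100MB --numeric-suffixes=1 --suffix-length=3 large_test_file large_test_file.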

  2. Run this command to initiate a multipart upload and to retrieve the associated upload ID. The command returns a response that contains the UploadID:

aws s3api create-multipart-upload --bucket DOC-EXAMPLE-BUCKET --key large_test_file
  3. Copy the UploadID value as a reference for later steps.

  4. Run this command to upload the first part of the file. Be sure to replace all values with the values for your bucket, file, and multipart upload. The command returns a response that contains an ETag value for the part of the file that you uploaded. For more information on each parameter, see upload-part.

aws s3api upload-part --bucket DOC-EXAMPLE-BUCKET --key large_test_file --part-number 1 --body large_test_file.001 --upload-id exampleTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk --content-md5 exampleaAmjr+4sRXUwf0w==
  5. Copy the ETag value as a reference for later steps.

  6. Repeat steps 4 and 5 for each part of the file. Be sure to increase the part number with each new part that you upload.

  7. After you upload all the file parts, run this command to list the uploaded parts and confirm that the list is complete:

aws s3api list-parts --bucket DOC-EXAMPLE-BUCKET --key large_test_file --upload-id exampleTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk
  8. Compile the ETag values for each file part that you uploaded into a JSON-formatted file that is similar to the following:
{
    "Parts": [{
        "ETag": "example8be9a0268ebfb8b115d4c1fd3",
        "PartNumber":1
    },

    ....

    {
        "ETag": "example246e31ab807da6f62802c1ae8",
        "PartNumber":4
    }]
}
  9. Name the file fileparts.json.

  10. Run this command to complete the multipart upload. Replace the value for --multipart-upload with the path to the JSON-formatted file with ETags that you created.

aws s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket DOC-EXAMPLE-BUCKET --key large_test_file --upload-id exampleTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk
  11. If the previous command is successful, then you receive a response similar to the following:
{
    "ETag": "\"exampleae01633ff0af167d925cad279-2\"",
    "Bucket": "DOC-EXAMPLE-BUCKET",
    "Location": "https://DOC-EXAMPLE-BUCKET.s3.amazonaws.com/large_test_file",
    "Key": "large_test_file"
}

Resolve upload failures

If you use the high-level aws s3 commands for a multipart upload and the upload fails (due either to a timeout or a manual cancellation), you must start a new multipart upload. In most cases, the AWS CLI automatically cancels the multipart upload and then removes any multipart files that you created. This process can take several minutes.

If you use aws s3api commands for a multipart upload and the process is interrupted, you must remove incomplete parts of the upload, and then re-upload the parts.

To remove the incomplete parts, use the AbortIncompleteMultipartUpload lifecycle action. Or, use aws s3api commands to remove the incomplete parts by following these steps:

  1. Run this command to list incomplete multipart file uploads. Replace the value for --bucket with the name of your bucket.
aws s3api list-multipart-uploads --bucket DOC-EXAMPLE-BUCKET
  2. The command returns a message with any file parts that weren’t processed, similar to the following:
{
    "Uploads": [
        {
            "Initiator": {
                "DisplayName": "multipartmessage",
                "ID": "290xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            },
            "Initiated": "2016-03-31T06:13:15.000Z",
            "UploadId": "examplevQpHp7eHc_J5s9U.kzM3GAHeOJh1P8wVTmRqEVojwiwu3wPX6fWYzADNtOHklJI6W6Q9NJUYgjePKCVpbl_rDP6mGIr2AQJNKB_A-",
            "StorageClass": "STANDARD",
            "Key": "",
            "Owner": {
                "DisplayName": "multipartmessage",
                "ID": "290xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
            }
        }
    ]
}
  3. Run this command to remove the incomplete parts:
aws s3api abort-multipart-upload --bucket DOC-EXAMPLE-BUCKET --key large_test_file --upload-id examplevQpHp7eHc_J5s9U.kzM3GAHeOJh1P8wVTmRqEVojwiwu3wPX6fWYzADNtOHklJI6W6Q9NJUYgjePKCVpbl_rDP6mGIr2AQJNKB
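If you’d rather not clean these up by hand, here is a minimal sketch of the AbortIncompleteMultipartUpload lifecycle action mentioned above; the rule ID and the seven-day window are arbitrary choices:

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "abort-incomplete-mpu",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket DOC-EXAMPLE-BUCKET --lifecycle-configuration file://lifecycle.json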

3 - Batch

Performing large-scale batch operations on S3

OpEx Sec Rel Perf Cost Sus

You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can perform a single operation on lists of Amazon S3 objects that you specify. A single job can perform a specified operation on billions of objects containing exabytes of data. Amazon S3 tracks progress, sends notifications, and stores a detailed completion report of all actions, providing a fully managed, auditable, and serverless experience. You can use S3 Batch Operations through the AWS Management Console, AWS CLI, Amazon SDKs, or REST API.

Use S3 Batch Operations to copy objects and set object tags or access control lists (ACLs). You can also initiate object restores from S3 Glacier Flexible Retrieval or invoke an AWS Lambda function to perform custom actions using your objects. You can perform these operations on a custom list of objects, or you can use an Amazon S3 Inventory report to easily generate lists of objects. Amazon S3 Batch Operations use the same Amazon S3 APIs that you already use with Amazon S3, so you’ll find the interface familiar.

S3 Batch Operations basics

You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can run a single operation or action on lists of Amazon S3 objects that you specify.

Terminology

This section uses the terms jobs, operations, and tasks, which are defined as follows:

Job

A job is the basic unit of work for S3 Batch Operations. A job contains all of the information necessary to run the specified operation on the objects listed in the manifest. After you provide this information and request that the job begin, the job performs the operation for each object in the manifest.

Operation

The operation is the type of API action, such as copying objects, that you want the Batch Operations job to run. Each job performs a single type of operation across all objects that are specified in the manifest.

Task

A task is the unit of execution for a job. A task represents a single call to an Amazon S3 or AWS Lambda API operation to perform the job’s operation on a single object. Over the course of a job’s lifetime, S3 Batch Operations create one task for each object specified in the manifest.

How an S3 Batch Operations job works

A job is the basic unit of work for S3 Batch Operations. A job contains all of the information necessary to run the specified operation on a list of objects. To create a job, you give S3 Batch Operations a list of objects and specify the action to perform on those objects.

For information about the operations that S3 Batch Operations supports, see Operations supported by S3 Batch Operations.

A batch job performs a specified operation on every object that is included in its manifest. A manifest lists the objects that you want a batch job to process and it is stored as an object in a bucket. You can use a comma-separated values (CSV)-formatted Amazon S3 Inventory report as a manifest, which makes it easy to create large lists of objects located in a bucket. You can also specify a manifest in a simple CSV format that enables you to perform batch operations on a customized list of objects contained within a single bucket.

After you create a job, Amazon S3 processes the list of objects in the manifest and runs the specified operation against each object. While a job is running, you can monitor its progress programmatically or through the Amazon S3 console. You can also configure a job to generate a completion report when it finishes.
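To make that concrete, here’s a rough sketch of creating a job from the CLI with aws s3control create-job. It tags every object listed in a CSV manifest; the account ID, role ARN, manifest location, and manifest ETag are all placeholders you’d replace with your own values:

aws s3control create-job \
  --account-id <ACCOUNT-ID> \
  --operation '{"S3PutObjectTagging": {"TagSet": [{"Key": "processed", "Value": "true"}]}}' \
  --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/manifest.csv", "ETag": "<MANIFEST-ETAG>"}}' \
  --report '{"Bucket": "arn:aws:s3:::DOC-EXAMPLE-BUCKET", "Prefix": "batch-reports", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
  --priority 10 \
  --role-arn <BATCH-OPERATIONS-ROLE-ARN> \
  --no-confirmation-required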

4 - S3-Select

Filtering and retrieving data using Amazon S3 Select

OpEx Sec Rel Perf Cost Sus

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the result are delimited.

You pass SQL expressions to Amazon S3 in the request. Amazon S3 Select supports a subset of SQL. For more information about the SQL elements that are supported by Amazon S3 Select, see SQL reference for Amazon S3 Select and S3 Glacier Select.

You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the Amazon S3 console. The Amazon S3 console limits the amount of data returned to 40 MB. To retrieve more data, use the AWS CLI or the API.

Requirements and limits

The following are requirements for using Amazon S3 Select:

  • You must have s3:GetObject permission for the object you are querying.
  • If the object you are querying is encrypted with a customer-provided encryption key (SSE-C), you must use https, and you must provide the encryption key in the request.

The following limits apply when using Amazon S3 Select:

  • The maximum length of a SQL expression is 256 KB.
  • The maximum length of a record in the input or result is 1 MB.
  • Amazon S3 Select can only emit nested data using the JSON output format.
  • You cannot specify the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, or REDUCED_REDUNDANCY storage classes. For more information about storage classes, see Storage Classes.

Additional limitations apply when using Amazon S3 Select with Parquet objects:

  • Amazon S3 Select supports only columnar compression using GZIP or Snappy. Amazon S3 Select doesn’t support whole-object compression for Parquet objects.
  • Amazon S3 Select doesn’t support Parquet output. You must specify the output format as CSV or JSON.
  • The maximum uncompressed row group size is 512 MB.
  • You must use the data types specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

Constructing a request

When you construct a request, you provide details of the object that is being queried using an InputSerialization object. You provide details of how the results are to be returned using an OutputSerialization object. You also include the SQL expression that Amazon S3 uses to filter the request.

For more information about constructing an Amazon S3 Select request, see SELECT Object Content in the Amazon Simple Storage Service API Reference. You can also see one of the SDK code examples in the following sections.
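For example, here’s a minimal AWS CLI sketch that queries a hypothetical CSV object with a header row and writes the matching rows to a local file; the bucket, key, and column names are placeholders:

aws s3api select-object-content \
  --bucket DOC-EXAMPLE-BUCKET \
  --key sample_data.csv \
  --expression "SELECT s.name FROM S3Object s WHERE s.city = 'Seattle'" \
  --expression-type SQL \
  --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"}' \
  --output-serialization '{"CSV": {}}' \
  "output.csv"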

Requests using scan ranges

With Amazon S3 Select, you can scan a subset of an object by specifying a range of bytes to query. This capability lets you parallelize scanning the whole object by splitting the work into separate Amazon S3 Select requests for a series of non-overlapping scan ranges. Scan ranges don’t need to be aligned with record boundaries. An Amazon S3 Select scan range request runs across the byte range that you specify. A record that starts within the scan range specified but extends beyond the scan range will be processed by the query. For example, the following shows an Amazon S3 object containing a series of records in a line-delimited CSV format:

A,B
C,D
D,E
E,F
G,H
I,J

Using the Amazon S3 Select ScanRange parameter with Start at (Byte) 1 and End at (Byte) 4, the scan range starts at “,” and scans to the end of the record beginning with “C”, returning the result C,D because that is the end of that record.

Amazon S3 Select scan range requests support Parquet, CSV (without quoted delimiters), and JSON objects (in LINES mode only). CSV and JSON objects must be uncompressed. For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed.

Amazon S3 Select scan range requests are available in the AWS CLI, API, and SDKs. You can use the ScanRange parameter in the Amazon S3 Select request for this feature. For more information, see SELECT Object Content in the Amazon Simple Storage Service API Reference.
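Carrying on from the CSV example earlier, a scan range is just an extra parameter on the same select-object-content call; the byte offsets here mirror the A,B / C,D example above, and the bucket and key are placeholders:

aws s3api select-object-content \
  --bucket DOC-EXAMPLE-BUCKET \
  --key sample.csv \
  --expression "SELECT * FROM S3Object" \
  --expression-type SQL \
  --input-serialization '{"CSV": {}}' \
  --output-serialization '{"CSV": {}}' \
  --scan-range '{"Start": 1, "End": 4}' \
  "scan_output.csv"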

Errors

Amazon S3 Select returns an error code and associated error message when an issue is encountered while attempting to run a query. For a list of error codes and descriptions, see the List of SELECT Object Content Error Codes section of the Error Responses page in the Amazon Simple Storage Service API Reference.

5 - SFTP Service

Allowing access to legacy applications

OpEx Sec Rel Perf Cost Sus

If you have applications or customers who need to transfer data in or out via SFTP, AWS Transfer for SFTP can help. It lets clients use the tools they are used to, while you take advantage of cheaper storage such as S3! You can use the console to enable it, but I’ve included source code to get this set up via Terraform for you.

AWS Transfer for SFTP

The code builds on our simple bucket example but adds in the Transfer Family:

resource "aws_transfer_server" "example" {
  security_policy_name = "TransferSecurityPolicy-2020-06"
  tags = {
      Name = local.bucket_name
      Project = "${var.project}"
      Environment = "${var.env}"
      Owner = "${var.owner}"
      CostCenter = "${var.cost}"
      Confidentiality = "${var.conf}"
  }
}

You can now take advantage of S3’s scale and features such as versioning and intelligent tiering, but access it in a traditional way. The final thing you’ll need to do is add a user with an SSH key via the console. If you don’t have an SSH key already, just run:

ssh-keygen

Now head to the AWS Transfer Family pages in the AWS Management Console. Here you will be able to add a user and your public key.

Adding a user
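If you’d rather script the user as well, the Transfer Family CLI can do it. This is a sketch only; the server ID, the IAM role (which needs access to your bucket), the user name, and the key path are all placeholders:

aws transfer create-user \
  --server-id <SERVER-ID> \
  --user-name sftp-user \
  --role <S3-ACCESS-ROLE-ARN> \
  --home-directory /<BUCKET-NAME>

aws transfer import-ssh-public-key \
  --server-id <SERVER-ID> \
  --user-name sftp-user \
  --ssh-public-key-body "$(cat ~/.ssh/id_rsa.pub)"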

You can now connect with the sftp command, substituting your private key, user name, and the server’s endpoint:

sftp -i <PRIVATE-KEY-FILE> <USERNAME>@<SERVER-ENDPOINT>

Once connected, put localfile remote_file_directory/. will upload a file. Or use a visual tool; either way, you’ll need the private key to connect.