Large-Scale S3 to Glacier Migration Using Step Functions

The Challenge

During a recent consulting project, I encountered a challenging situation. We needed to move terabytes of data, comprising around 130 million objects, from S3 Standard to Glacier Deep Archive to reduce costs. Using a standard lifecycle policy for this migration would have been expensive: lifecycle transitions are billed per object, so with this many objects the request charges add up quickly. We needed an alternative solution that would be cost-effective and wouldn't take months to complete.

Exploring Solutions

Initially, I explored various options such as using Athena, Glue, and other methods. Eventually, I discovered AWS Step Functions with Distributed Map, which proved to be the ideal solution for our problem.

The Solution: Step Functions with Distributed Map

Using Step Functions with Distributed Map, we were able to complete the entire migration in just a couple of days. Remarkably, this approach reduced our migration costs to approximately 5% of what we would have spent using the traditional lifecycle policy. Here's how we implemented the solution:

We created a Step Function with a Map state that invokes a Lambda function. To limit Lambda invocations, we processed objects in batches of 1,100.
The Map state's source was our S3 bucket containing millions of objects and terabytes of data. The Map state automatically handled listing all objects and managed pagination.
For every batch of 1,100 objects, one Lambda invocation zipped the objects together and pushed the resulting archive to Glacier. Because each batch became a single archive, this approach significantly reduced our Glacier API calls, resulting in substantial cost savings.
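
The Lambda function itself isn't shown in this post, so here is a minimal Python sketch of what the per-batch step could look like. It assumes the payload carries the batch's keys in an S3Key list (as in the state machine definition), that the archive is written as a zip object with the DEEP_ARCHIVE storage class, and that every object in a batch fits in Lambda memory. The destination bucket name, the archive naming scheme, and the helper names are hypothetical.

```python
import hashlib
import io
import zipfile

SOURCE_BUCKET = "bucket123"        # bucket name used in the state machine definition
ARCHIVE_BUCKET = "archive-bucket"  # hypothetical destination bucket


def build_archive(s3, bucket, keys):
    """Download every key and pack the objects into one in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for key in keys:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            archive.writestr(key, body)
    return buffer.getvalue()


def archive_key_for(keys):
    """Derive a deterministic name so a retried batch overwrites its own archive."""
    digest = hashlib.sha256("\n".join(keys).encode("utf-8")).hexdigest()
    return f"archives/{digest[:32]}.zip"


def handler(event, context, s3=None):
    """Lambda entry point; event["S3Key"] is the list of keys in one batch."""
    if s3 is None:
        import boto3  # imported lazily so the helpers can be exercised without AWS
        s3 = boto3.client("s3")
    keys = event["S3Key"]
    data = build_archive(s3, SOURCE_BUCKET, keys)
    target = archive_key_for(keys)
    # One PUT per batch instead of one transition request per object.
    s3.put_object(
        Bucket=ARCHIVE_BUCKET,
        Key=target,
        Body=data,
        StorageClass="DEEP_ARCHIVE",
    )
    return {"archiveKey": target, "objectCount": len(keys)}
```

Passing the client in as an optional argument is only there to make the sketch easy to try locally with a stub. A real function would also need to record which keys went into which archive, so individual objects can be found again later.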

Here's our Step Function definition for reference:

{
  "Comment": "Batch S3 objects and invoke a Lambda that archives each batch to Glacier Deep Archive",
  "StartAt": "Map",
  "States": {
    "Map": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "STANDARD"
        },
        "StartAt": "Lambda Invoke",
        "States": {
          "Lambda Invoke": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:us-east-1:123456:function:transfer:$LATEST",
              "Payload": {
                "S3Key.$": "$.Items[*].Key",
                "executionId.$": "$$.Execution.Id"
              }
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException",
                  "Lambda.TooManyRequestsException"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 1,
                "BackoffRate": 2
              }
            ],
            "End": true
          }
        }
      },
      "Label": "Map",
      "MaxConcurrency": 50,
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {
          "Bucket": "bucket123"
        },
        "ReaderConfig": {}
      },
      "ItemBatcher": {
        "MaxItemsPerBatch": 1100,
        "MaxInputBytesPerBatch": 262144
      },
      "End": true,
      "ToleratedFailurePercentage": 99
    }
  }
}
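
To get a feel for why batching matters, the short Python sketch below works out how many batches the MaxItemsPerBatch value of 1,100 produces for roughly 130 million objects, and compares a per-request charge at both scales. The price is a placeholder picked for illustration, not a quoted AWS rate, and the function names are hypothetical.

```python
import math


def batch_count(total_objects, batch_size):
    """Number of Lambda invocations (and archives) needed to cover all objects."""
    return math.ceil(total_objects / batch_size)


def request_cost(request_count, price_per_1000_requests):
    """Cost of a per-request charge that is billed per 1,000 requests."""
    return request_count / 1000 * price_per_1000_requests


TOTAL_OBJECTS = 130_000_000  # approximate object count from this migration
BATCH_SIZE = 1_100           # MaxItemsPerBatch in the definition above
PRICE_PER_1000 = 0.05        # placeholder USD price, an assumption for illustration

batches = batch_count(TOTAL_OBJECTS, BATCH_SIZE)
print(batches)  # 118182

# Same per-request price, applied per object versus per archive.
print(request_cost(TOTAL_OBJECTS, PRICE_PER_1000))
print(request_cost(batches, PRICE_PER_1000))
```

This only compares request counts. The real bill also includes Lambda, Step Functions, and the S3 reads needed to build each archive, so treat it as a rough guide rather than a reproduction of our numbers.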

Handling Additional Constraints

For more complex scenarios, such as migrating objects older than 6 months, we utilized S3 Inventory. The process was as follows:

We generated an S3 Inventory report containing object metadata, including object keys and last modified dates.
Using a simple Python script, we filtered objects based on the last modified date and output the results to a CSV file.
This CSV file then served as the source for our Step Function's Distributed Map state machine.
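
The filtering script isn't included here, but a minimal sketch of the idea could look like the following. It assumes the inventory CSV has no header row, that the columns are bucket, key, and last modified date in that order (the real order depends on the inventory configuration), that keys are URL-encoded, and that timestamps are ISO 8601 with a trailing Z. The function names and file paths are hypothetical.

```python
import csv
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote

# Assumed column positions; check the inventory manifest for the real order.
KEY_COLUMN = 1
LAST_MODIFIED_COLUMN = 2


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2023-01-15T10:30:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keys_older_than(inventory_lines, cutoff):
    """Yield decoded keys whose last-modified time is before the cutoff."""
    for row in csv.reader(inventory_lines):
        if parse_timestamp(row[LAST_MODIFIED_COLUMN]) < cutoff:
            yield unquote(row[KEY_COLUMN])


def write_key_csv(keys, out_file):
    """Write a one-column CSV with a Key header, one object key per row."""
    writer = csv.writer(out_file)
    writer.writerow(["Key"])
    for key in keys:
        writer.writerow([key])


def main(inventory_path, output_path, days=182):
    """Filter one inventory file; 182 days approximates six months."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    with open(inventory_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        write_key_csv(keys_older_than(src, cutoff), dst)
```

Calling main("inventory.csv", "to_migrate.csv") would produce the filtered file. Writing a Key header row means each CSV row can be read as an item with a Key field, which should keep the $.Items[*].Key path in the definition working once the Map state's ItemReader is switched from listing the bucket to reading this CSV file.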

Conclusion

By leveraging AWS Step Functions with Distributed Map, we were able to efficiently migrate a large volume of data from S3 to Glacier Deep Archive. This approach not only saved us significant costs but also dramatically reduced the migration time.

References and Further Reading

https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/

