Nadeera3784/aws-textract-lamda-function

AWS Textract Lambda Function

This project contains source code and supporting files for a serverless application that extracts text from documents using AWS Textract. The application can be deployed with the SAM CLI and includes the following files and folders:

  • core/ - Code for the application's Lambda function that processes documents with Textract
  • events/ - Invocation events that you can use to invoke the function
  • tests/ - Unit tests for the application code
  • template.yaml - A template that defines the application's AWS resources

The application uses several AWS resources, including a Lambda function and an API Gateway API, which are defined in the template.yaml file in this project. At runtime the function calls the AWS Textract service.

Overview

This Lambda function:

  • Accepts GET requests with a document_name query parameter
  • Retrieves documents from a specified S3 bucket
  • Uses AWS Textract to extract text from the document
  • Returns the extracted text in a structured JSON response
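The flow above can be sketched as a single handler. This is a minimal illustration rather than the code in core/: the function name lambda_handler, the optional textract_client argument (added so the sketch can be exercised with a stub), and the choice to point Textract directly at the S3 object are all assumptions.

```python
import json
import os


def _response(status, body):
    """Build an API Gateway proxy-style response."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body, default=str),
    }


def lambda_handler(event, context, textract_client=None):
    """Hypothetical handler: validate input, call Textract, return the lines."""
    params = event.get("queryStringParameters") or {}
    document_name = params.get("document_name")
    bucket = os.environ.get("S3_BUCKET_NAME")
    if not document_name or not bucket:
        return _response(400, {"error": "document_name and S3_BUCKET_NAME are required"})

    if textract_client is None:
        import boto3  # imported lazily so a stub client can be injected in tests
        textract_client = boto3.client("textract")

    # Assumption: Textract reads the object from S3 itself; the real
    # function may download the bytes first.
    result = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": document_name}}
    )
    # Keep only LINE blocks; PAGE and WORD blocks stay in full_response.
    lines = [b["Text"] for b in result.get("Blocks", []) if b.get("BlockType") == "LINE"]
    return _response(200, {
        "document_name": document_name,
        "extracted_text": lines,
        "full_response": result,
    })
```

Error handling for S3 and Textract failures is omitted here to keep the sketch short.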

API Usage

Endpoint

GET https://{id}.execute-api.us-east-1.amazonaws.com/Prod/extract-text?document_name=your-document.pdf

Parameters

  • document_name (required): Name of the document file in the S3 bucket
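A small client can make the query-parameter encoding explicit, which matters when document names contain spaces or slashes. The sketch below uses only the standard library; build_url and extract_text are hypothetical helper names, and base_url stands for the endpoint shown in your deployment output.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, document_name):
    """Append document_name as a URL-encoded query parameter."""
    query = urllib.parse.urlencode({"document_name": document_name})
    return f"{base_url}?{query}"


def extract_text(base_url, document_name, timeout=30):
    """Call the endpoint and return the decoded JSON body."""
    url = build_url(base_url, document_name)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Non-2xx responses raise urllib.error.HTTPError, so callers that want to inspect the error statuses need to catch it.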

Response

{
  "document_name": "sample-document.pdf",
  "extracted_text": [
    "Line 1 of text",
    "Line 2 of text",
    "..."
  ],
  "full_response": {
    "Blocks": [...]
  }
}
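On the consuming side, the response shape above can be reduced to plain text plus a quick summary of what Textract returned. This is an illustrative sketch; summarize is a hypothetical helper, and it assumes full_response carries the Textract Blocks list unchanged.

```python
import json
from collections import Counter


def summarize(body):
    """Return the joined text and a count of Textract block types."""
    data = json.loads(body)
    text = "\n".join(data.get("extracted_text", []))
    blocks = data.get("full_response", {}).get("Blocks", [])
    block_types = Counter(b.get("BlockType") for b in blocks)
    return text, dict(block_types)
```

For DetectDocumentText results the block types are PAGE, LINE, and WORD, so the counts give a rough sense of document size without walking the full structure.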

Error Responses

  • 400: Missing required parameters or environment variables
  • 403: Access denied to S3 bucket or Textract service
  • 404: Document or bucket not found
  • 500: Internal server error
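One way to arrive at these statuses is to translate the error code carried by a botocore ClientError. The mapping below is an assumption about how the codes might be grouped, not a description of the repository's handler; status_for_error is a hypothetical name.

```python
# Assumed mapping from AWS error codes to the HTTP statuses listed above.
STATUS_BY_ERROR_CODE = {
    "AccessDenied": 403,           # S3 refused the request
    "AccessDeniedException": 403,  # Textract refused the request
    "NoSuchKey": 404,              # S3 object does not exist
    "NoSuchBucket": 404,           # S3 bucket does not exist
}


def status_for_error(error):
    """Map an exception to an HTTP status code, defaulting to 500."""
    # botocore's ClientError exposes the service reply as error.response.
    response = getattr(error, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return STATUS_BY_ERROR_CODE.get(code, 500)
```

When Textract is pointed at an S3 object it cannot read, it reports InvalidS3ObjectException, which does not say whether the object is missing or forbidden; this sketch leaves that case at 500.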

Prerequisites

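To follow this README you need the SAM CLI, plus the tools its commands rely on: the AWS CLI (for the aws s3 commands), Docker (for sam build --use-container and local invocation), and Python with pip (for the tests).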

Setup

  1. Create an S3 bucket to store your documents:

    aws s3 mb s3://my-textract-documents

  2. Upload sample documents to your S3 bucket:

    aws s3 cp sample-document.pdf s3://my-textract-documents/

Deploy the application

To build and deploy your application for the first time, run the following in your shell:

sam build --use-container
sam deploy --guided

During the guided deployment, you will be prompted to provide:

  • Stack Name: The name of the stack to deploy to CloudFormation
  • AWS Region: The AWS region you want to deploy your app to
  • S3BucketName: The name of your S3 bucket containing documents
  • Confirm changes before deploy: If set to yes, any change sets will be shown for manual review
  • Allow SAM CLI IAM role creation: Required for creating IAM roles for Lambda and Textract access

You can find your API Gateway Endpoint URL in the output values displayed after deployment.

Test locally

Build your application with the sam build --use-container command:

sam build --use-container

Test a single function by invoking it directly with a test event:

sam local invoke TextractFunction --event events/event.json
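To invoke the function with a different document, you can generate your own event file. The sketch below builds a minimal API Gateway proxy-style event; the exact fields in events/event.json are not shown in this README, so the shape here is an assumption, and build_event is a hypothetical helper.

```python
import json


def build_event(document_name):
    """Minimal API Gateway proxy-style event carrying the query parameter."""
    return {
        "httpMethod": "GET",
        "path": "/extract-text",
        "queryStringParameters": {"document_name": document_name},
    }


if __name__ == "__main__":
    # Print the event; redirect the output to a file to use it with sam local invoke.
    print(json.dumps(build_event("sample-document.pdf"), indent=2))
```

Redirect the output to a file of your choice and pass that path to the --event option.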

Run the API locally on port 3000, then call it from a second terminal:

sam local start-api
curl "http://localhost:3000/extract-text?document_name=sample-document.pdf"

Environment Variables

The Lambda function requires the following environment variable:

  • S3_BUCKET_NAME: Name of the S3 bucket containing documents to process

This is automatically set during deployment via the SAM template.

IAM Permissions

The Lambda function requires the following permissions:

  • textract:DetectDocumentText - To extract text from documents
  • textract:AnalyzeDocument - For advanced document analysis
  • s3:GetObject - To read documents from the specified S3 bucket

These permissions are automatically configured in the SAM template.
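For reference, the three permissions correspond to an IAM policy document roughly like the one built below. This is a sketch of the equivalent policy, not the contents of template.yaml; build_policy is a hypothetical helper, and scoping s3:GetObject to every object in the bucket is an assumption.

```python
import json


def build_policy(bucket_name):
    """Policy document covering the three permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["textract:DetectDocumentText", "textract:AnalyzeDocument"],
                "Resource": "*",  # these Textract actions are not scoped to a resource
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy("my-textract-documents"), indent=2))
```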

Tests

Tests are defined in the tests folder. Install test dependencies and run tests:

pip install -r tests/requirements.txt --user
# Run unit tests
python -m pytest tests/unit -v
# Run integration tests (requires deployed stack)
AWS_SAM_STACK_NAME="your-stack-name" python -m pytest tests/integration -v

Cleanup

To delete the application and all its resources:

sam delete --stack-name "your-stack-name"

Supported Document Formats

AWS Textract supports the following document formats:

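  • PDF
  • PNG
  • JPEG
  • TIFF

Limits depend on the type of Textract operation. The synchronous operations listed under IAM Permissions (DetectDocumentText and AnalyzeDocument) accept files up to 10 MB and only single-page PDF or TIFF documents. The limits of 3,000 pages and 500 MB apply to PDF and TIFF files processed with Textract's asynchronous operations.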
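A cheap pre-check on the file extension can reject obviously unsupported documents before any call to Textract. The sketch below is a heuristic only, since Textract inspects the file content rather than the name; is_supported and the suffix set are assumptions derived from the format list above.

```python
from pathlib import PurePosixPath

# Extensions commonly used for the formats listed above.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(document_name):
    """Return True if the S3 key ends in a supported extension."""
    return PurePosixPath(document_name).suffix.lower() in SUPPORTED_SUFFIXES
```

The comparison is case-insensitive and works on keys with prefixes, such as reports/2024/scan.TIFF.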
