This project contains source code and supporting files for a serverless application that extracts text from documents using AWS Textract. The application can be deployed with the SAM CLI and includes the following files and folders:
- core/ - Code for the application's Lambda function that processes documents with Textract
- events/ - Invocation events that you can use to invoke the function
- tests/ - Unit tests for the application code
- template.yaml - A template that defines the application's AWS resources
The application uses several AWS resources, including Lambda functions, an API Gateway API, and the AWS Textract service. These resources are defined in the template.yaml file in this project.
This Lambda function:
- Accepts GET requests with a document_name query parameter
- Retrieves documents from a specified S3 bucket
- Uses AWS Textract to extract text from the document
- Returns the extracted text in a structured JSON response
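The steps above can be sketched as a single handler. This is a minimal illustration, not the code shipped in core/: the function name lambda_handler, the optional textract_client argument, and the choice to let Textract read the object straight from S3 are all assumptions made for the example.

```python
import json
import os


def lambda_handler(event, context, textract_client=None):
    """Hypothetical handler: read document_name, call Textract, return JSON."""
    params = event.get("queryStringParameters") or {}
    document_name = params.get("document_name")
    bucket = os.environ.get("S3_BUCKET_NAME")
    if not document_name or not bucket:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Missing document_name or S3_BUCKET_NAME"}),
        }

    if textract_client is None:
        import boto3  # imported lazily so the sketch can run with a stub client

        textract_client = boto3.client("textract")

    # Assumption: Textract is pointed at the S3 object rather than given raw bytes.
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": document_name}}
    )
    # Keep only LINE blocks for the human-readable text list.
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "document_name": document_name,
                "extracted_text": lines,
                "full_response": response,
            }
        ),
    }
```

Passing the client as an argument is only a convenience for testing; in a deployed function the default boto3 client would be used.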
GET https://{id}.execute-api.us-east-1.amazonaws.com/Prod/extract-text?document_name=your-document.pdf
- document_name (required): Name of the document file in the S3 bucket
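Document names that contain spaces or other reserved characters need to be URL-encoded before they go into the query string. The helper below is a small sketch of that; build_extract_url and the abc123 API id are hypothetical, so substitute the endpoint from your own deployment.

```python
from urllib.parse import quote, urlencode


def build_extract_url(base_url, document_name):
    """Return the extract-text URL with document_name percent-encoded."""
    query = urlencode({"document_name": document_name}, quote_via=quote)
    return f"{base_url.rstrip('/')}/extract-text?{query}"


# abc123 is a placeholder API id, not a real endpoint.
url = build_extract_url(
    "https://abc123.execute-api.us-east-1.amazonaws.com/Prod",
    "quarterly report.pdf",
)
print(url)
```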
{
"document_name": "sample-document.pdf",
"extracted_text": [
"Line 1 of text",
"Line 2 of text",
"..."
],
"full_response": {
"Blocks": [...]
}
}
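A client would typically read extracted_text for the plain lines and fall back to full_response only when it needs geometry or confidence data. The following sketch shows one way to unpack a response body with the shape above; summarize_response is a hypothetical helper and the sample body is made up for illustration.

```python
import json


def summarize_response(body_text):
    """Return (document_name, joined text, number of Textract blocks)."""
    payload = json.loads(body_text)
    text = "\n".join(payload.get("extracted_text", []))
    blocks = payload.get("full_response", {}).get("Blocks", [])
    return payload.get("document_name"), text, len(blocks)


# Illustrative body only; a real response carries many more block fields.
sample_body = json.dumps(
    {
        "document_name": "sample-document.pdf",
        "extracted_text": ["Line 1 of text", "Line 2 of text"],
        "full_response": {"Blocks": [{"BlockType": "PAGE"}, {"BlockType": "LINE"}]},
    }
)
name, text, block_count = summarize_response(sample_body)
print(name, block_count)
print(text)
```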
- 400: Missing required parameters or environment variables
- 403: Access denied to S3 bucket or Textract service
- 404: Document or bucket not found
- 500: Internal server error
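One plausible way to produce these status codes is to map the error code reported by AWS onto an HTTP status. The table below is a sketch under that assumption: the error code strings are ones S3 and Textract can return, but which of them the real function checks for, and the mapping of InvalidS3ObjectException to 404, are guesses made for the example.

```python
import json

# Assumed mapping from AWS error codes to the HTTP statuses listed above.
ERROR_STATUS = {
    "AccessDenied": 403,  # S3
    "AccessDeniedException": 403,  # Textract
    "NoSuchKey": 404,  # S3 object missing
    "NoSuchBucket": 404,  # S3 bucket missing
    "InvalidS3ObjectException": 404,  # Textract could not read the S3 object
}


def status_for_error(error_code):
    """Return the HTTP status for an AWS error code, defaulting to 500."""
    return ERROR_STATUS.get(error_code, 500)


def error_response(error_code, message):
    """Build an API Gateway proxy-style error response."""
    return {
        "statusCode": status_for_error(error_code),
        "body": json.dumps({"error": message}),
    }
```

With boto3, the code would usually come from a caught botocore ClientError via err.response["Error"]["Code"].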
To use the SAM CLI, you need the following tools:
- SAM CLI - Install the SAM CLI
- Python 3 installed
- Docker - Install Docker community edition
- AWS CLI configured with appropriate permissions
- Create an S3 bucket to store your documents:
aws s3 mb s3://my-textract-documents
- Upload sample documents to your S3 bucket:
aws s3 cp sample-document.pdf s3://my-textract-documents/
To build and deploy your application for the first time, run the following in your shell:
sam build --use-container
sam deploy --guided
During the guided deployment, you will be prompted to provide:
- Stack Name: The name of the stack to deploy to CloudFormation
- AWS Region: The AWS region you want to deploy your app to
- S3BucketName: The name of your S3 bucket containing documents
- Confirm changes before deploy: If set to yes, any change sets will be shown for manual review
- Allow SAM CLI IAM role creation: Required for creating IAM roles for Lambda and Textract access
You can find your API Gateway Endpoint URL in the output values displayed after deployment.
Build your application with the sam build --use-container command:
sam build --use-container
Test a single function by invoking it directly with a test event:
sam local invoke TextractFunction --event events/event.json
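The contents of events/event.json are not reproduced here. As a rough guide, an API Gateway proxy event for this endpoint needs at least the fields below; the make_event helper is hypothetical and the field set is a minimal assumption, not the full event that API Gateway sends.

```python
import json


def make_event(document_name):
    """Build a minimal API Gateway proxy-style GET event for local testing."""
    return {
        "httpMethod": "GET",
        "path": "/extract-text",
        "queryStringParameters": {"document_name": document_name},
    }


# Print the event so it can be saved to a file and passed to --event.
print(json.dumps(make_event("sample-document.pdf"), indent=2))
```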
Run the API locally on port 3000:
sam local start-api
curl "http://localhost:3000/extract-text?document_name=sample-document.pdf"
The Lambda function requires the following environment variable:
- S3_BUCKET_NAME: Name of the S3 bucket containing documents to process
This is automatically set during deployment via the SAM template.
The Lambda function requires the following permissions:
- textract:DetectDocumentText - To extract text from documents
- textract:AnalyzeDocument - For advanced document analysis
- s3:GetObject - To read documents from the specified S3 bucket
These permissions are automatically configured in the SAM template.
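For reference, the permissions listed above correspond to an IAM policy document roughly like the one this sketch prints. It is an illustration only: build_policy is a hypothetical helper, and the actual template.yaml may express the same permissions differently, for example through SAM policy templates.

```python
import json


def build_policy(bucket_name):
    """Return an IAM policy document granting the three listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "textract:DetectDocumentText",
                    "textract:AnalyzeDocument",
                ],
                # These Textract actions are not scoped to a specific resource.
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Object-level access to every key in the documents bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


print(json.dumps(build_policy("my-textract-documents"), indent=2))
```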
Tests are defined in the tests folder. Install test dependencies and run tests:
pip install -r tests/requirements.txt --user
# Run unit tests
python -m pytest tests/unit -v
# Run integration tests (requires deployed stack)
AWS_SAM_STACK_NAME="your-stack-name" python -m pytest tests/integration -v
To delete the application and all its resources:
sam delete --stack-name "your-stack-name"
AWS Textract supports the following document formats:
- PDF (up to 3000 pages)
- PNG
- JPEG
- TIFF
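Maximum file size: 500 MB for PDF and TIFF documents stored in S3 and processed with asynchronous operations. Synchronous operations such as DetectDocumentText accept files up to 10 MB.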
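Checking the file extension before calling Textract can turn an unsupported upload into a clear 400 response instead of a service error. The check below is a sketch based on the formats listed above; is_supported_document and the exact extension set are assumptions, and a file's extension does not guarantee its real format.

```python
from pathlib import PurePosixPath

# Extensions assumed to correspond to the supported formats.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported_document(document_name):
    """Return True if the S3 key ends with a supported extension."""
    suffix = PurePosixPath(document_name).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS


for key in ("sample-document.pdf", "scans/page-01.TIFF", "notes.txt"):
    print(key, is_supported_document(key))
```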