A PHP library that transforms DOCX documents into clean, semantic HTML or JSON. Built on top of PHPWord, it adds style-based transformation, custom class mapping, and powerful CLI tools for batch processing.
- Style-Based Transformation: Map DOCX paragraph styles to custom HTML elements (e.g., Quote → blockquote)
- Clean Output: Semantic HTML with CSS classes, not inline styles
- Image Extraction: Extract images to separate assets directory with automatic deduplication and relative path references
- CLI Tools: Command-line interface for single-file and batch conversion
- YAML Configuration: Flexible style mapping and transformation rules
- Leverages PHPWord: Uses PHPWord's robust DOCX parsing, adds transformation layer
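To make the style-based transformation concrete, here is a minimal sketch of how one style-map entry could be turned into a semantic element with a CSS class. The `renderParagraph()` helper is hypothetical and not part of the library. It only assumes the `convertTo` and `className` keys shown in the style-map example later in this README.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the library's API): renders one paragraph as an
 * HTML element chosen by its DOCX style name.
 */
function renderParagraph(string $styleName, string $text, array $styleMap): string
{
    $rule = $styleMap[$styleName] ?? [];

    // Assumption: styles without a 'convertTo' entry fall back to <p>.
    // A real converter would likely keep the style's natural element (e.g. h1).
    $tag = $rule['convertTo'] ?? 'p';

    // Emit a CSS class rather than inline styles.
    $class = isset($rule['className'])
        ? ' class="' . htmlspecialchars($rule['className'], ENT_QUOTES) . '"'
        : '';

    return sprintf('<%s%s>%s</%s>', $tag, $class, htmlspecialchars($text, ENT_QUOTES), $tag);
}

$styleMap = [
    'Quote' => ['convertTo' => 'blockquote', 'className' => 'pullquote'],
];

echo renderParagraph('Quote', 'Stay curious.', $styleMap), PHP_EOL;
// <blockquote class="pullquote">Stay curious.</blockquote>
echo renderParagraph('Normal', 'Plain text.', $styleMap), PHP_EOL;
// <p>Plain text.</p>
```

Keeping the mapping in a plain array means the same rules could come from PHP code or from a parsed YAML file.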
# Install dependencies
composer install
# Convert a DOCX file to HTML
./docx-converter/bin/docx-converter convert document.docx -o output.html
# With custom style mapping
./docx-converter/bin/docx-converter convert document.docx -s styles.yaml -o output.html
# Extract images to an assets directory
./docx-converter/bin/docx-converter convert document.docx -o output.html --assets-dir ./assets
# JSON output
./docx-converter/bin/docx-converter convert document.docx -f json -o output.json
📚 Complete documentation is in the docs/ folder:
- Documentation Index - Start here for navigation
- Product Requirements - Features, priorities, roadmap
- Technical Specification - Architecture and implementation
- PHPWord Integration Strategy - How we leverage PHPWord
- CLI Implementation Guide - Step-by-step development guide
- Project Tracker - Current tasks and milestones
Phase: 🟡 Phase 1 - CLI-First MVP (Active Development)
Target: November 2025
Progress: See Project Tracker
- PHP 8.0 or higher
- Composer
- PHPWord 1.1+
- Symfony Console 6.0+
- Symfony YAML 6.0+
# Clone the repository
git clone <repository-url>
cd DocxStruct
# Install dependencies
composer install
# Make CLI executable
chmod +x docx-converter/bin/docx-converter
use DocxConverter\DocxConverter;
$converter = new DocxConverter();
$html = $converter->loadDocument('input.docx')->toHtml();
file_put_contents('output.html', $html);
$styleMap = [
'Quote' => [
'convertTo' => 'blockquote',
'className' => 'pullquote'
],
'Heading1' => [
'className' => 'section-title'
]
];
$html = $converter->loadDocument('input.docx')
->withCustomStyleMap($styleMap)
->toHtml();
When converting documents with images, you can extract them to a separate assets directory. The converter will:
- Extract all images to the specified directory
- Use relative paths in the generated HTML
- Create a manifest file for deduplication (an image used multiple times is extracted only once)
- Hash image content to ensure uniqueness
$converter = new DocxConverter();
$converter->loadDocument('document.docx')
->withAssetsDir('./output/assets')
->setOutputFilePath('./output/document.html')
->toHtml();
The assets directory will contain:
- Image files named by their content hash (e.g., f829b914fc47cfc9c0747c119c27cf1b.png)
- assets-manifest.json - A manifest mapping content hashes to filenames
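The content-hash naming and manifest described above can be sketched as follows. This illustrates the deduplication idea only and is not the converter's actual implementation. `storeImage()` is a hypothetical helper, MD5 is an assumption (the 32-character example filename is consistent with it, but the real algorithm is not stated), and the JSON layout of the manifest is also assumed.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: stores image bytes under a content-hash filename and
 * records them in a manifest, so identical images are written only once.
 */
function storeImage(string $bytes, string $extension, string $assetsDir, array &$manifest): string
{
    $hash = md5($bytes); // assumption: MD5; the real hash algorithm may differ

    if (!isset($manifest[$hash])) {
        if (!is_dir($assetsDir)) {
            mkdir($assetsDir, 0777, true);
        }
        $manifest[$hash] = $hash . '.' . $extension;
        file_put_contents($assetsDir . '/' . $manifest[$hash], $bytes);
    }

    // Repeated content resolves to the filename recorded the first time.
    return $manifest[$hash];
}

$assetsDir = sys_get_temp_dir() . '/docx-assets-demo';
$manifest = [];

$first  = storeImage('fake-png-bytes', 'png', $assetsDir, $manifest);
$second = storeImage('fake-png-bytes', 'png', $assetsDir, $manifest);

// Persist the hash-to-filename map next to the images.
file_put_contents(
    $assetsDir . '/assets-manifest.json',
    json_encode($manifest, JSON_PRETTY_PRINT)
);

var_dump($first === $second); // bool(true): the image was stored once
```

Hashing the bytes rather than the original filename means two differently named copies of the same picture still collapse to one file.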
CLI Usage:
./docx-converter/bin/docx-converter convert document.docx \
-o output/document.html \
--assets-dir output/assets
Relative Paths:
When an output file path is provided (via -o option or setOutputFilePath()), image references in the HTML will be relative to the output file location. Without an output file path, absolute paths are used.
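As a rough illustration of how an image reference could be made relative to the output file, consider the sketch below. `relativePath()` is a hypothetical helper, not the library's API. It assumes both paths use forward slashes, are given in the same form (both relative to the same directory, or both absolute), and contain no `..` segments.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: computes the path from the directory of $fromFile
 * to $toFile.
 */
function relativePath(string $fromFile, string $toFile): string
{
    // Split into segments, dropping empty and "." segments.
    $split = fn(string $path): array => array_values(
        array_filter(explode('/', $path), fn(string $s): bool => $s !== '' && $s !== '.')
    );

    $from = $split(dirname($fromFile));
    $to   = $split($toFile);

    // Remove the leading segments both paths share.
    while ($from !== [] && $to !== [] && $from[0] === $to[0]) {
        array_shift($from);
        array_shift($to);
    }

    // Climb out of what remains of the source directory, then descend.
    return str_repeat('../', count($from)) . implode('/', $to);
}

echo relativePath('./output/document.html', './output/assets/logo.png'), PHP_EOL;
// assets/logo.png
echo relativePath('output/html/document.html', 'output/assets/logo.png'), PHP_EOL;
// ../assets/logo.png
```

References computed this way keep working when the output folder is moved as a whole, which is the practical benefit of relative over absolute paths.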
# Create batch config (batch-config.yaml)
# files:
# - input: doc1.docx
# output: doc1.html
# - input: doc2.docx
# output: doc2.json
# format: json
./docx-converter/bin/docx-converter batch --config batch-config.yaml
See the CLI Implementation Guide for detailed development instructions.
# Run tests
vendor/bin/phpunit
# Regenerate autoloader
composer dump-autoload
- Review the Product Requirements Document
- Check Project Tracker for current tasks
- Follow coding guidelines in Technical Specification
- Reference AI Coding Instructions for patterns
If you're using GitHub Copilot to work on this project, see GitHub Copilot Tools Configuration to learn how to configure which tools are enabled by default.
[Add license information]
Built on top of PHPWord - A pure PHP library for reading and writing Word documents.