The Automated Machine Learning Pipeline (AMLP) provides an integrated framework that unifies the entire workflow from dataset creation to model validation. It leverages large language model (LLM) agents to assist with electronic-structure code selection, thereby reducing the manual effort typically required. AMLP also incorporates automated dataset tools for efficient input generation—including geometry (or cell) optimizations and ab initio molecular dynamics (AIMD)—as well as output conversion and preparation of data in the MACE-compatible format. It supports three DFT packages—Gaussian, VASP, and CP2K—ensuring flexibility across different electronic-structure environments. Its analysis module, AMLP-Analysis (built on ASE), further supports a broad range of molecular simulations, enabling systematic evaluation and validation of machine learning interatomic potentials.
- Multi-Agent DFT Research System
- AMLP-analysis: Automated Machine Learning Pipeline - Analysis Module
The Automated Machine Learning Pipeline uses a multi-agent DFT research system as an integrated framework that combines:
- AI-driven research analysis - Uses specialized AI agents to analyze research topics and generate summaries
- DFT code expertise - Provides expert recommendations for Gaussian, VASP, and CP2K simulations
- Input file generation - Efficiently processes crystallographic structures for DFT calculations
- Output data processing - Extracts and formats simulation results for analysis or ML model training
AMLP includes multiple AI agents to assist with different aspects of computational chemistry research:
- Experimental Chemist Agent: Summarizes and interprets experimental aspects of research topics.
- Theoretical Chemist Agent: Analyzes theoretical foundations and computational methodologies.
- DFT Expert Agents: Specialized agents for Gaussian, VASP, and CP2K that provide code-specific recommendations.
- Supervisor Agents: Integrate information from all agents and generate comprehensive reports.
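The supervisor pattern above can be sketched in a few lines; the class and function names here are hypothetical stand-ins, not AMLP's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: each specialist agent returns a report section,
# and a supervisor merges them into one combined report.

@dataclass
class AgentReport:
    role: str      # e.g. "Experimental Chemist"
    summary: str   # agent-generated text (stubbed here)

def supervise(reports):
    """Merge specialist sections into a single combined report."""
    lines = ["# Combined Research Report"]
    for r in reports:
        lines.append(f"## {r.role}\n{r.summary}")
    return "\n".join(lines)

reports = [
    AgentReport("Experimental Chemist", "Key synthesis routes ..."),
    AgentReport("Theoretical Chemist", "Suggested DFT functionals ..."),
]
print(supervise(reports))
```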
- Multi-code support: Generate inputs for CP2K, VASP, and Gaussian
- Batch processing: Convert multiple structure files automatically
- Format conversion: Process CIF and XYZ files with validation
- Supercell creation: Build supercells with custom dimensions
- Interactive guidance: Step-by-step parameter selection for DFT calculations
- DFT output extraction: Extract energies, forces, and coordinates from simulation results
- ML-ready dataset creation: Convert DFT outputs to HDF5 format for machine learning potentials
- AIMD processing: Generate AIMD inputs from optimized structures at multiple temperatures
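As an illustration of the supercell feature, a minimal pure-Python replication sketch (not AMLP's actual implementation):

```python
# Replicate atoms along each lattice vector by integer factors.
# Illustrative only; AMLP's supercell builder may differ.

def make_supercell(positions, cell, dims):
    """positions: list of (x, y, z) Cartesian tuples; cell: 3 lattice
    vectors; dims: (na, nb, nc) integer replication factors."""
    new_positions = []
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                shift = [i * cell[0][d] + j * cell[1][d] + k * cell[2][d]
                         for d in range(3)]
                for p in positions:
                    new_positions.append(tuple(p[d] + shift[d] for d in range(3)))
    return new_positions

cell = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
atoms = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
super_atoms = make_supercell(atoms, cell, (2, 2, 1))
print(len(super_atoms))  # 2 atoms × 2×2×1 = 8
```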
- Python 3.8+
- Required Python packages:
- NumPy
- PyYAML
- ASE (Atomic Simulation Environment, optional but recommended)
- openai (for AI agent functionality)
- requests
- Clone the repository:
```bash
git clone https://github.com/adamlaho/AMLP.git
cd AMLP
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The AI agents in this system use OpenAI's API for text generation. Follow these steps to configure API access:
- Get an API key:
- Sign up for an account at OpenAI Platform
- Navigate to the API keys section and create a new secret key
- Copy the key (you will not be able to view it again)
- Set the environment variable:
🔑 The system looks for the API key in the `OPENAI_API_KEY` environment variable:

```bash
# On Linux/macOS
export OPENAI_API_KEY="your-api-key-here"

# On Windows
set OPENAI_API_KEY=your-api-key-here
```
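A minimal Python pre-flight check for this variable (illustrative; `have_api_key` is not part of AMLP):

```python
import os

# Check that the OPENAI_API_KEY variable is set before the agents run.
def have_api_key(env=os.environ):
    return bool(env.get("OPENAI_API_KEY"))

print("API key configured:", have_api_key())
```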
- Model configuration:
By default, the agents use predefined settings. These can be customized in the configuration file:
`AMLP/multi_agent_dft/config/default_config.yaml`

In this file, you can adjust:
- The type of AI models (e.g., OpenAI models)
- Publication API parameters
- Other runtime conditions for agent behavior
- Usage monitoring:
- Be aware of your OpenAI API usage limits
- The AI agent functionality will consume tokens based on the length of inputs and outputs
- The system implements basic retry logic for API rate limiting (3 attempts with exponential backoff)
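The retry behavior described above can be sketched as follows; `with_retries` and the `RuntimeError` stand-in for a rate-limit error are illustrative, not AMLP's real client code:

```python
import time

# Retry a call up to 3 times with exponential backoff (1 s, 2 s, ...).
def with_retries(call, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for a rate-limit error
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```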
Run the main script to start the system:
```bash
python3 amlpt.py
```
The system will present a menu with five operation modes:
- AI-agent feedback (research summaries & reports)
- Input generation (CP2K/VASP/Gaussian)
- Output processing (extract forces, energies, coordinates)
- ML potential dataset creation (JSON to MACE HDF5)
- AIMD processing (JSON to CP2K AIMD inputs)
This mode helps you explore research topics with AI assistance:
- Enter a research topic or question
- The system will refine your query and analyze literature
- Review reports from Experimental and Theoretical Chemist agents
- Examine DFT-specific recommendations from expert agents
- Use the generated reports to guide your computational research
Example:
```
Enter your research topic or question: Metal oxide catalysts for water splitting
```
The system supports the following structure file formats:
- CIF (Crystallographic Information File)
- XYZ (Cartesian coordinates)
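A minimal XYZ reader sketch showing the validation idea (illustrative; AMLP's own parser may differ):

```python
# XYZ format: line 1 = atom count, line 2 = comment, then "symbol x y z" rows.
def parse_xyz(text):
    lines = text.strip().splitlines()
    natoms = int(lines[0])
    symbols, coords = [], []
    for line in lines[2:2 + natoms]:
        sym, x, y, z = line.split()[:4]
        symbols.append(sym)
        coords.append((float(x), float(y), float(z)))
    if len(symbols) != natoms:
        raise ValueError("atom count does not match header")
    return symbols, coords

xyz = """3
water molecule
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""
symbols, coords = parse_xyz(xyz)
print(symbols)  # ['O', 'H', 'H']
```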
Generate input files for DFT calculations using either batch mode or guided mode:
Automatically convert all supported files using default templates:
```
Batch-mode: which DFT code? (CP2K/VASP/Gaussian): cp2k
Path to file or directory: ./structures
Output directory: ./cp2k_inputs
```
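The batch loop can be sketched as a mapping from input structures to planned output paths; `plan_batch` is a hypothetical helper, not AMLP's code:

```python
from pathlib import Path

# Find all CIF/XYZ files under a directory and map each to an output
# .inp path (illustrative; real conversion would also parse each file).
def plan_batch(src_dir, out_dir, exts=(".cif", ".xyz")):
    src, out = Path(src_dir), Path(out_dir)
    return {p: out / (p.stem + ".inp")
            for p in sorted(src.rglob("*")) if p.suffix.lower() in exts}
```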
Step through detailed parameter selection for your DFT calculation:
```
Which DFT code? (CP2K/VASP/Gaussian): VASP
```
Extract data from DFT calculation outputs:
```
Select DFT code (1/2/3): 1
Path to CP2K input file (.inp): ./cp2k_calcs/input.inp
Path to CP2K output file: ./cp2k_calcs/output.out
Path for output JSON file [output_data.json]: results.json
```
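Extraction of this kind typically comes down to pattern matching on the output text. A hedged sketch (the `ENERGY|` line format is a CP2K-style assumption; real outputs may differ):

```python
import re

# Pull total energies from lines like
# "ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:   -17.164358"
ENERGY_RE = re.compile(r"ENERGY\|.*?:\s*(-?\d+\.\d+)")

def extract_energies(text):
    return [float(m.group(1)) for m in ENERGY_RE.finditer(text)]

sample = """
ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:   -17.164358
ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:   -17.201122
"""
print(extract_energies(sample))  # [-17.164358, -17.201122]
```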
Generate AIMD inputs at multiple temperatures from optimized structures, using the processed .json output of a cell/geometry optimization:
```
Path to your JSON file or directory: ./optimized_structures
Output directory for generated files: ./aimd_inputs
Select template (1-5) [1]: 2
```
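Generating one input per temperature is essentially template substitution; the template below is a hypothetical fragment, not a complete CP2K input:

```python
# One AIMD input per requested temperature (illustrative fragment only).
TEMPLATE = "&MD\n  TEMPERATURE {temp}\n  STEPS 5000\n&END MD\n"

def aimd_inputs(temperatures):
    return {f"aimd_{t}K.inp": TEMPLATE.format(temp=t) for t in temperatures}

files = aimd_inputs([100, 300, 500])
print(sorted(files))  # ['aimd_100K.inp', 'aimd_300K.inp', 'aimd_500K.inp']
```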
Convert DFT outputs to machine learning potential training data:
```
Full path to JSON file containing DFT data: ./results/dft_data.json
Output directory for HDF5 datasets [current directory]: ./ml_datasets
Dataset base name [dft_data]: water_system
```
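Before conversion, each frame needs a consistent energy/forces/positions record. A sketch of such validation (the JSON layout shown is an assumption, not AMLP's exact schema):

```python
import json

# Validate DFT frames before dataset conversion: forces and positions
# must have the same length, and the energy must be numeric.
def validate_frames(raw):
    frames = json.loads(raw)
    for i, f in enumerate(frames):
        if len(f["forces"]) != len(f["positions"]):
            raise ValueError(f"frame {i}: forces/positions length mismatch")
        float(f["energy"])  # must be numeric
    return frames

raw = json.dumps([
    {"energy": -17.16,
     "positions": [[0, 0, 0], [0, 0, 0.96]],
     "forces": [[0, 0, -0.01], [0, 0, 0.01]]},
])
print(len(validate_frames(raw)))  # 1
```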
Depending on the mode, the system generates:
- CP2K: .inp input files
- VASP: INCAR, POSCAR, KPOINTS, and POTCAR files in subdirectories
- Gaussian: .com input files
- Research Reports: .txt report files
- Processed Data: .json and .h5 data files
- Authentication Errors: Verify your API key is correct and properly set in the environment or config file
- Rate Limiting: If you see `RateLimitError`, the system will automatically retry with exponential backoff
- Model Not Available: Ensure you're using a model that's available to your API key level
- File validation errors: Check if your CIF or XYZ files follow standard format
- Missing cell parameters: Ensure cell information is properly defined for periodic systems
- ASE import errors: Install ASE for full functionality with `pip install ase`
AMLP-A is a tool for analyzing atomic structures with machine learning interatomic potentials. It combines several analysis methods into one workflow:
- ⚡ Single Point calculation
- 🔄 Geometry optimization
- 📦 Cell optimization
- 🌡️ Molecular dynamics simulations with different ensembles
- 📈 Structural analysis (RDF, coordination, and energy drift)
- Use Pre-trained Models: Works with MACE machine learning potentials
- Run Multiple Analyses: Perform different analyses in a single workflow
- Easy Configuration: Change simulation settings using a simple YAML file
- Reproducible Research: Get consistent results for scientific work
- Python 3.7 or newer (Python 3.9 recommended)
- Required packages: numpy, matplotlib, pyyaml, torch, tqdm, scipy, ase, mace-torch
Run the main analysis script:
```bash
python3 amlpa.py <input_file.xyz> config.yaml
```

Create a `config.yaml` file to customize your analysis. Here's what you can configure:
```yaml
base_name: 'acridine_test'
output_dir: './test_results'

# Cell parameters
readcell_info: true        # Try to read cell from XYZ header
cell_params: null          # Fallback: [a, b, c, alpha, beta, gamma]
pbc: true                  # Enable periodic boundary conditions

# Cell replication
replicate_initial: true    # Set to true to test a supercell
replicate_dims: [2, 2, 1]  # Integer factors or [true, true, false] for 2D

# System settings
device: 'gpu'              # 'gpu' or 'cpu'
gpus: ['cuda:0']           # GPU devices list
model_paths:
  - '/path/of/your/own/model'
```

Use your own MACE potential, or one of the MACE foundation models available at https://github.com/ACEsuit/mace-mp.
```yaml
# Analysis tasks
single_point: false        # Single point energy calculation
geo_opt: false             # Geometry optimization
cell_opt: false            # Cell optimization
run_rmsd: false            # RMSD analysis after optimization
run_coordination: false    # Coordination number analysis
run_energy_drift: true     # Energy drift analysis (NVE only)

# Geometry optimization
optimizer: 'BFGS'          # Options: BFGS, LBFGS, FIRE
fmax: 0.05                 # Force convergence (eV/Å) - relaxed for quick test
                           # Normally use fmax: 0.001 for production

# Cell optimization
cell_optimizer: 'BFGS'     # Options: BFGS, LBFGS, FIRE
cell_fmax: 0.05            # Stress convergence - relaxed for quick test
scalar_pressure: 0.0       # Target pressure (GPa)
```

Run MD at multiple temperatures to test different features:
- Use NVE at one temperature to test energy drift analysis
- Use other thermostats at other temperatures to test thermostat functionality
```yaml
temperatures: [300]   # List of temperatures (K) to simulate
md_steps: 1000        # Total MD steps
timestep: 1.0         # MD timestep (fs)
save_interval: 50     # Save trajectory every N steps
log_interval: 50      # Log interval of the simulation
```

Thermostat options: 'nve', 'langevin', 'berendsen', 'nose-hoover' (or 'nh')
- 'nve': Microcanonical ensemble (constant N, V, E); use this to test energy conservation and drift analysis
- 'langevin': Canonical ensemble with stochastic dynamics
- 'berendsen': Canonical ensemble with velocity rescaling (NEW)
- 'nose-hoover': Canonical ensemble with deterministic thermostat
```yaml
thermostat: 'berendsen'

# Thermostat-specific parameters (only used when applicable)
friction: 0.01        # Langevin friction coefficient (1/time)
taut: 100.0           # Berendsen time constant (fs)

# Nosé-Hoover specific parameters
nh_tdamp: 50.0        # Damping time in fs (recommended: 100 × timestep)
nh_tchain: 3          # Chain length (longer = better canonical ensemble)
nh_tloop: 1           # Sub-steps (higher = more accurate but slower)

# Energy drift analysis
energy_drift_start_time_ps: 0.1   # Skip initial equilibration (ps), NVE only
```
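The drift metric can be illustrated as a least-squares slope of total energy versus time after discarding the equilibration window (a sketch, not AMLP's implementation):

```python
# Fit a line to total energy vs. time (after start_ps) and return the slope.
def energy_drift(times_ps, energies_ev, start_ps=0.1):
    pts = [(t, e) for t, e in zip(times_ps, energies_ev) if t >= start_ps]
    n = len(pts)
    mt = sum(t for t, _ in pts) / n
    me = sum(e for _, e in pts) / n
    num = sum((t - mt) * (e - me) for t, e in pts)
    den = sum((t - mt) ** 2 for t, _ in pts)
    return num / den  # slope in eV per ps

times = [0.0, 0.1, 0.2, 0.3, 0.4]
energies = [-100.0, -100.0, -99.999, -99.998, -99.997]
print(round(energy_drift(times, energies), 4))  # ≈ 0.01 eV/ps
```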
RDF uses the SIMULATION cell:
- The cutoff is automatically validated: rmax <= 0.5 * min(cell_length)
- Atom pairs for RDF analysis are specified as a list of [type1, type2] pairs:
```yaml
atom_pairs:
  - ['N', 'N']
  - ['C', 'N']
  - ['C', 'C']

# RDF parameters
rdf_rmin: 0.5    # Minimum distance (Å)
rdf_rmax: 8.0    # Maximum distance (Å); auto-validated against cell size
rdf_nbins: 100   # Number of histogram bins
rdf_nframes: 50  # Frames to sample from trajectory (null = all frames)
```
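The binning behind these parameters can be illustrated with a minimal pair-distance histogram (no periodic images, so this is only a sketch of the idea):

```python
import math

# Count pair distances into nbins equal-width bins between rmin and rmax.
def pair_histogram(coords, rmin=0.5, rmax=8.0, nbins=100):
    counts = [0] * nbins
    width = (rmax - rmin) / nbins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            if rmin <= r < rmax:
                counts[int((r - rmin) / width)] += 1
    return counts

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
hist = pair_histogram(coords, rmin=0.5, rmax=4.5, nbins=4)
print(hist)  # [1, 2, 0, 0]
```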
```yaml
# RDF smoothing for plots
rdf_smoothing_sigma: 1.0        # Gaussian smoothing sigma (0 = no smoothing)

# Coordination analysis
coordination_cutoff: 3.0        # Cutoff distance for neighbor counting (Å)
coordination_atom_types: 'all'  # Analyze all atoms, or specify ['N', 'C']
```

Contributions are welcome! Fork the repository and submit pull requests with improvements or bug fixes. For major changes, please open an issue first to discuss your ideas.
- This project utilizes several open-source packages including ASE, NumPy, PyYAML, and MACE
- DFT code parameters are based on best practices from the computational chemistry community
- The AI agent system utilizes OpenAI's GPT models for text generation
Disclaimer: Parts of this project are currently under active development. All features, APIs, and documentation may change as new functionalities are implemented and improvements are made.
If you use this code or data in your research, please cite:
```bibtex
@misc{lahouari2025automatedmachinelearningpipeline,
  title={Automated Machine Learning Pipeline for Training and Analysis Using Large Language Models},
  author={Adam Lahouari and Jutta Rogal and Mark E. Tuckerman},
  year={2025},
  eprint={2509.21647},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci},
  url={https://arxiv.org/abs/2509.21647},
}
```