Minerva 4 AI

Optimizing AI models on European infrastructure.

These guides are part of the MINERVA European Support Centre's effort to help AI researchers, engineers, and support teams make effective use of European High Performance Computing (HPC) and AI infrastructure. They focus on the practical and conceptual knowledge needed to develop, train, optimize, deploy, and interpret AI models on shared supercomputing and GPU systems.

Modern AI workflows rarely depend on model code alone. Performance, reliability, cost, and reproducibility are shaped by the full infrastructure stack: accelerators, memory, interconnects, storage, schedulers, software libraries, training strategies, evaluation pipelines, and responsible deployment practices. This guidebook brings these layers together with an emphasis on scalable AI workloads in European research and industrial contexts.

What to expect

The style of the guidebook is intentionally mixed. Some chapters are more theoretical and in-depth, because understanding compute architectures, networking, storage systems, and performance metrics is essential when diagnosing bottlenecks or choosing infrastructure. Other chapters are more operational, with concrete patterns, commands, recipes, and workflow advice for running jobs on shared HPC systems.

The goal is not to replace vendor documentation, academic textbooks, or site-specific user manuals. Instead, the guidebook provides a bridge between AI practice and HPC practice: enough background to understand why systems behave as they do, and enough practical guidance to make better decisions when scaling real workloads.

Who this is for

These materials are intended for AI practitioners working with large models and shared infrastructure, including:

  • researchers preparing training, fine-tuning, inference, or evaluation workloads;
  • machine learning engineers and research software engineers porting AI workflows to HPC systems;
  • HPC support teams helping users optimize jobs, diagnose failures, and use resources efficiently;
  • project teams moving from local, workstation, or cloud workflows toward European supercomputing environments.

Guidebook map

Infrastructure Foundations for Scalable LLM Training introduces the systems layer: compute devices, GPU architectures, networking, storage, data pipelines, orchestration, and scheduling. It combines architectural explanations with practical guidance for understanding where performance and reliability issues come from.

Efficient and Reliable LLM Pretraining focuses on the training phase: parallelism strategies, scaling techniques, efficiency, sustainability, fault tolerance, reproducibility, precision, checkpointing, and evaluation during training.

Post-Training: Alignment and User-Centered Objectives covers the stages after pretraining, including instruction tuning, reinforcement learning for alignment, efficient inference, and test-time compute.

Explainable AI introduces core explainability concepts and methods, then connects them to practical examples, notebooks, and use cases across modalities.

How to use this guidebook

The guides can be read linearly, especially if you are new to HPC-AI workflows. They are also designed for selective reading:

  • start with Infrastructure Foundations if you need to understand the hardware and system environment;
  • start with Efficient and Reliable LLM Pretraining if you are preparing or improving large training runs;
  • start with Post-Training if you already have a model and are working on specialization, alignment, or inference;
  • start with Explainable AI if your focus is interpretability, analysis, or responsible model use.

The sidebar provides the full table of contents. Site-specific details such as partitions, modules, quotas, container policies, and access procedures vary across European HPC centres, so always read the guidance here alongside the documentation of the system you are using.

License

This guidebook is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, provided appropriate credit is given. The accompanying code examples, scripts, and Jupyter notebooks are released under the Apache License, Version 2.0, which permits use, reproduction, and distribution in both open and proprietary contexts.

Funding

This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101182737. The JU receives support from the Digital Europe Programme. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or EuroHPC Joint Undertaking. Neither the European Union nor the granting authority can be held responsible for them.