6 References

Adebayo, Julius, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. “Sanity Checks for Saliency Maps.” Advances in Neural Information Processing Systems 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/294a8ed24b1ad22ec2e7efea049b8737-Abstract.html.

Advanced Micro Devices, Inc. 2021. AMD CDNA 2 Architecture. Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.

Advanced Micro Devices, Inc. 2023. AMD CDNA 3 Architecture. Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.

Advanced Micro Devices, Inc. 2024. AMD CDNA 4 Architecture. Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-4-architecture-whitepaper.pdf.

Andersch, Michael, Greg Palmer, Ronny Krashinsky, et al. 2022. NVIDIA Hopper Architecture In-Depth. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.

Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Anyscale. 2023. Ray: Job Submission and Execution. https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html.

Apptainer Contributors. 2023. Apptainer User Documentation. https://apptainer.org/docs/user/main/.

Armato III, Samuel G., Geoffrey McLennan, Luc Bidaut, et al. 2011. “The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans.” Medical Physics 38 (2): 915–31. https://doi.org/10.1118/1.3528204.

Artificial Analysis. 2024. Artificial Analysis. https://artificialanalysis.ai/.

ashworks1706. 2023. RLHF from Scratch Tutorial Notebook [Google Colab Notebook]. https://colab.research.google.com/github/ashworks1706/rlhf-from-scratch/blob/main/tutorial.ipynb | https://github.com/ashworks1706/rlhf-from-scratch/.

Bai, Yuntao, Andy Jones, Kamal Ndousse, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv Preprint arXiv:2204.05862.

Bakouch, Elie et al. 2025. SmolLM3: Smol, Multilingual, Long-Context Reasoner. https://huggingface.co/blog/smollm3.

BeeGFS. n.d. BeeGFS Architecture Overview. https://doc.beegfs.io/latest/architecture/overview.html.

Bekman, Stas. 2023-2026. “Machine Learning Engineering Open Book.” In GitHub Repository. Stasosphere Online Inc. https://github.com/stas00/ml-engineering.

Besta, Maciej, Nils Blach, Ales Kubicek, et al. 2024. “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” Proceedings of the AAAI Conference on Artificial Intelligence 38: 17682–90.

Blomer, Jakob. 2015. “A Survey on Distributed File System Technology.” Journal of Physics: Conference Series 608: 012039.

Borthakur, Dhruba et al. 2008. “HDFS Architecture Guide.” Hadoop Apache Project 53 (1-13): 2.

Brandon, William, Aniruddha Nrusimha, Kevin Qian, et al. 2023. Striped Attention: Faster Ring Attention for Causal Transformers. https://arxiv.org/abs/2311.09431.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Brinkmann, André, Kathryn Mohror, Weikuan Yu, et al. 2020. “Ad Hoc File Systems for High-Performance Computing.” Journal of Computer Science and Technology 35 (1): 4–26.

Cai, Weilin, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. “A Survey on Mixture of Experts.” IEEE Transactions on Knowledge and Data Engineering, 1–20. https://doi.org/10.1109/tkde.2025.3554028.

Cardoso, M. Jorge, Wenqi Li, Richard Brown, et al. 2022. MONAI: An Open-Source Framework for Deep Learning in Healthcare. https://arxiv.org/abs/2211.02701.

Carneiro, André Ramos, Jean Luca Bez, Carla Osthoff, Lucas Mello Schnorr, and Philippe OA Navaux. 2023. “Uncovering I/O Demands on HPC Platforms: Peeking Under the Hood of Santos Dumont.” Journal of Parallel and Distributed Computing 182: 104744.

Casper, Stephen, Xander Davies, Claudia Shi, et al. 2023. “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.” arXiv Preprint arXiv:2307.15217.

Chattopadhyay, Aditya, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. 2017. “Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks.” In arXiv.org. https://doi.org/10.1109/WACV.2018.00097.

Chen, Chaofan, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. 2019. “This Looks Like That: Deep Learning for Interpretable Image Recognition.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc. https://dl.acm.org/doi/10.5555/3454287.3455088.

Chen, Lichang, Shiyang Li, Jun Yan, et al. 2024. “Alpagasus: Training a Better Alpaca with Fewer Data.” International Conference on Learning Representations 2024: 34767–97.

Chen, Mark, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. https://arxiv.org/abs/2107.03374.

Chen, Sihong, Kai Ma, and Yefeng Zheng. 2019. Med3D: Transfer Learning for 3D Medical Image Analysis. https://arxiv.org/abs/1904.00625.

Chen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. “Training Deep Nets with Sublinear Memory Cost.” arXiv Preprint arXiv:1604.06174. https://arxiv.org/abs/1604.06174.

Chen, Zixiang, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.” arXiv Preprint arXiv:2401.01335.

Chien-Chin Huang, Fegin, Lucas Pasqualin, Iris Zhang, Less Wright, Pradeep Fernando, and Will Constable. 2024. Optimizing Checkpointing Efficiency with PyTorch DCP. https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250.

Chitty-Venkata, Krishna Teja, Siddhisanket Raskar, Bharat Kale, et al. 2024. “LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators.” SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1362–79.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, et al. 2023. “PaLM: Scaling Language Modeling with Pathways.” Journal of Machine Learning Research 24 (240): 1–113.

Chowdhury, Fahim, Yue Zhu, Todd Heer, et al. 2019. “I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning.” Proceedings of the 48th International Conference on Parallel Processing, 1–10.

Cineca. n.d. File Systems and Data Management. https://docs.hpc.cineca.it/hpc/hpc_data_storage.html.

CleverX. n.d. The Complete Guide to RLHF. https://cleverx.com/guides/the-complete-guide-to-rlhf/.

Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021. “Training Verifiers to Solve Math Word Problems.” CoRR abs/2110.14168. https://arxiv.org/abs/2110.14168.

Conover, Mike, Matt Hayes, Ankit Mathur, et al. 2023. “Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM.” https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.

Contributors, OpenCompass. 2023. OpenCompass: A Universal Evaluation Platform for Foundation Models. https://github.com/open-compass/opencompass.

Corbalán, Julita, Lluis Alonso, Jordi Aneas, and Luigi Brochard. 2020. “Energy Optimization and Analysis with EAR.” 2020 IEEE International Conference on Cluster Computing (CLUSTER), 464–72.

Corporation, NVIDIA. 2024. MLPerf Training Benchmark Results. https://developer.nvidia.com/blog/nvidia-sets-new-generative-ai-performance-and-scale-records-in-mlperf-training-v4-0/.

Daly, John. 2003. “A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps.” International Conference on Computational Science, 3–12.

Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.

Darshan. n.d. Darshan HPC I/O Characterization Tool. https://wordpress.cels.anl.gov/darshan/.

Databricks. 2018. MLflow: An Open Source Platform for the Machine Learning Lifecycle. https://mlflow.org/.

Dayan, Omer, Emre Karabulut, and Noam Neria. 2024. A Benchmarking Study: Which Serving Technology to Choose for LLMs? https://pages.run.ai/hubfs/PDFs/Serving-Large-Language-Models-Run-ai-Benchmarking-Study.pdf.

Deepfa. 2025. RLHF: How Artificial Intelligence Learns from Human Feedback? https://deepfa.ir/en/blog/reinforcement-learning-human-feedback-rlhf-guide.

DeepSeek-AI. 2024a. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv Preprint arXiv:2402.03300.

DeepSeek-AI. 2024b. DeepSeek-V3 Technical Report. DeepSeek. https://arxiv.org/abs/2412.19437.

Dettmers, Tim. 2021. bitsandbytes: 8-Bit CUDA Functions for PyTorch. https://github.com/TimDettmers/bitsandbytes.

Dettmers, Tim, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2021. “8-Bit Optimizers via Block-Wise Quantization.” arXiv Preprint arXiv:2110.02861. https://arxiv.org/abs/2110.02861.

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. “QLoRA: Efficient Finetuning of Quantized LLMs.” Advances in Neural Information Processing Systems 36: 10088–115.

Dimino, Fabrizio, Krati Saxena, Bhaskarjit Sarmah, and Stefano Pasquali. 2025. “Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5.” Proceedings of the 6th ACM International Conference on AI in Finance (ICAIF ’25). https://doi.org/10.1145/3768292.3770394.

Ding, Nan, and Samuel Williams. 2019. “An Instruction Roofline Model for GPUs.” 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 7–18.

Dolas, Sagar, Ana Varbanescu, and Benjamin Czaja. 2022. “Making Scientific Research on Dutch National Supercomputer Energy Efficient.” ERCIM News, no. 131. https://ercim-news.ercim.eu/en131/special/making-scientific-research-on-dutch-national-supercomputer-energy-efficient.

Domyn. 2025. Domyn Swarm. https://github.com/igeniusai/domyn-swarm.

Doshi-Velez, Finale, and Been Kim. 2017. “Towards a Rigorous Science of Interpretable Machine Learning.” arXiv Preprint arXiv:1702.08608.

Duan, Jiangfei, Shuo Zhang, Zerui Wang, et al. 2024. “Efficient Training of Large Language Models on Distributed Infrastructures: A Survey.” arXiv Preprint arXiv:2407.20018.

El-Sayed, Nosayba, and Bianca Schroeder. 2014. “To Checkpoint or Not to Checkpoint: Understanding Energy-Performance-I/O Tradeoffs in HPC Checkpointing.” 2014 IEEE International Conference on Cluster Computing (CLUSTER), 93–102.

Ettinger, Allyson, Amanda Bertsch, Bailey Kuehl, et al. 2025. “OLMo 3.” CoRR abs/2512.13961. https://doi.org/10.48550/ARXIV.2512.13961.

Exxact Corporation. 2017. Discover the Difference Between Deep Learning Training and Inference. https://www.exxactcorp.com/blog/HPC/discover-the-difference-between-deep-learning-training-and-inference.

Faiz, Ahmad, Sotaro Kaneda, Ruhan Wang, et al. 2023. “LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models.” arXiv Preprint arXiv:2309.14393.

Frantar, Elias, Saleh Ashkboos, Bert Hoover, Liane Dery, David Grangier, and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” arXiv Preprint arXiv:2210.17323.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5): 1189–232. https://doi.org/10.1214/aos/1013203451.

Fu, Yichao, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. 2025. Deep Think with Confidence. https://arxiv.org/abs/2508.15260.

Fu, Zhenxiao, Fan Chen, Shan Zhou, Haitong Li, and Lei Jiang. 2025. “LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences.” ACM SIGENERGY Energy Informatics Review 5 (2): 63–68.

Gao, Leo, Jonathan Tow, Baber Abbasi, et al. 2023. A Framework for Few-Shot Language Model Evaluation. Version v0.4.0. Zenodo. https://doi.org/10.5281/zenodo.10256836.

George, Anjus, Andreas Dilger, Michael J Brim, et al. 2025. “Lustre Unveiled: Evolution, Design, Advancements, and Current Trends.” ACM Transactions on Storage 21 (3): 1–109.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. 2003. “The Google File System.” Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 29–43.

Goldberg, David. 1991. What Every Computer Scientist Should Know about Floating-Point Arithmetic. ACM Computing Surveys. https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.

Google. 2019. TensorBoard: TensorFlow’s Visualization Toolkit. https://www.tensorflow.org/tensorboard.

Google. 2025. Logistic Regression: Loss and Regularization. https://developers.google.com/machine-learning/crash-course/logistic-regression/loss-regularization.

Google Cloud. 2019. BFloat16: The Secret to High Performance on Cloud TPUs. Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.

Google Cloud. n.d. Reinforcement Learning from Human Feedback on Google Cloud. https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud.

Google LLC. 2020. Coral Accelerator Module Datasheet. PDF. https://www.coral.ai/static/files/Coral-Accelerator-Module-datasheet.pdf.

Gossman, Mikaila, Avinash Maurya, Bogdan Nicolae, and Jon Calhoun. 2026. “Understanding LLM Checkpoint/Restore I/O Strategies and Patterns.” Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 263–73.

Grafana Labs. n.d. Grafana. https://grafana.com/.

Grattafiori, Aaron et al. 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024a. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024b. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783.

Grosse, Roger, Juhan Bae, Cem Anil, et al. 2023. Studying Large Language Model Generalization with Influence Functions. arXiv. https://doi.org/10.48550/arXiv.2308.03296.

Gurumoorthy, Karthik S., Amit Dhurandhar, Guillermo Cecchi, and Charu Aggarwal. 2019. “Efficient Data Representation by Selecting Prototypes with Importance Weights.” 2019 IEEE International Conference on Data Mining (ICDM), November, 260–69. https://doi.org/10.1109/ICDM.2019.00036.

Guthmann, François. 2023. Occupancy Explained. GPUOpen. https://gpuopen.com/learn/occupancy-explained/.

Ha, V. Q. 2025. Building an RLHF Pipeline for LLMs: A Beginner-Friendly Tutorial. https://medium.com/@vi.ha.engr/building-an-rlhf-pipeline-for-llms-a-beginner-friendly-tutorial-21112bfcff9b.

Habib, Nathan, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. 2023. LightEval: A Lightweight Framework for LLM Evaluation. Version 0.11.0. https://github.com/huggingface/lighteval.

Hendrycks, Dan, Collin Burns, Steven Basart, et al. 2020. “Measuring Massive Multitask Language Understanding.” arXiv Preprint arXiv:2009.03300.

Hildebrand, Dean, and Denis Serenyi. 2021. Colossus Under the Hood: A Peek into Google’s Scalable Storage System. https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system.

HPE. n.d. About Progressive File Layout (PFL). https://support.hpe.com/hpesc/public/docDisplay?docId=a00114946en_us&page=About_Progressive_File_Layout.html&docLocale=en_US.

Hu, Edward J, Yelong Shen, Phillip Wallis, et al. 2022. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 1 (2): 3.

IBM. 2026a. Basic Structure of IBM Storage Scale. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=overview-basic-structure-storage-scale.

IBM. 2026b. Cache Usage. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=considerations-cache-usage.

IBM. 2026c. GPFS Architecture. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=overview-gpfs-architecture.

IEEE Standards Association. 2019. IEEE Standard for Floating-Point Arithmetic. IEEE.

InfluxData. n.d. InfluxData. https://www.influxdata.com/.

IOR. n.d. IOR. https://github.com/hpc/ior.

Ip, Jeffrey, and Kritin Vongthongsri. 2026. deepeval. Version 3.9.7. https://github.com/confident-ai/deepeval.

Jacovi, Alon, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. https://arxiv.org/abs/2305.10160.

Jain, Naman, King Han, Alex Gu, et al. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. https://arxiv.org/abs/2403.07974.

Jegham, Nidhal, Marwan Abdelatti, Chan Young Koh, Lassad Elmoubarki, and Abdeltawab Hendawi. 2025. “How Hungry Is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference.” arXiv Preprint arXiv:2505.09598.

Jennings, Joseph et al. n.d. NeMo-Curator: A Toolkit for Data Curation. https://github.com/NVIDIA-NeMo/Curator.

Jiang, Albert Q., Alexandre Sablayrolles, Antoine Roux, et al. 2024. Mixtral of Experts. https://arxiv.org/abs/2401.04088.

Jiang, Peng, Christian Sonne, Wangliang Li, Fengqi You, and Siming You. 2024. “Preventing the Immense Increase in the Life-Cycle Energy and Carbon Footprints of LLM-Powered Intelligent Chatbots.” Engineering 40: 202–10.

Jiang, Ziheng, Haibin Lin, Yinmin Zhong, et al. 2024. “MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.” 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 745–60.

Kalamkar, Dhiraj, Dheevatsa Mudigere, Naveen Mellempudi, et al. 2019. A Study of BFLOAT16 for Deep Learning Training. https://arxiv.org/abs/1905.12322.

Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.

Karun, A Kala, and K Chitharanjan. 2013. “A Review on Hadoop—HDFS Infrastructure Extensions.” 2013 IEEE Conference on Information & Communication Technologies, 132–37.

Kim, Been, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. “Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability.” Advances in Neural Information Processing Systems 29. https://papers.nips.cc/paper_files/paper/2016/hash/5680522b8e2bb01943234bce7bf84534-Abstract.html.

Koh, Pang Wei, and Percy Liang. 2017. “Understanding Black-Box Predictions via Influence Functions.” Proceedings of the 34th International Conference on Machine Learning, July, 1885–94. https://proceedings.mlr.press/v70/koh17a.html.

Köpf, Andreas, Yannic Kilcher, Dimitri von Rütte, et al. 2023. “OpenAssistant Conversations Dataset (OASST1).” Hugging Face. https://doi.org/10.48550/arXiv.2304.07327.

Korthikanti, Vijay Anand, Jared Casper, Sangkug Lym, et al. 2023. “Reducing Activation Recomputation in Large Transformer Models.” Proceedings of Machine Learning and Systems 5: 341–53.

Korthikanti, Vijay, Jared Casper, Sangkug Lym, et al. 2022. Reducing Activation Recomputation in Large Transformer Models. https://arxiv.org/abs/2205.05198.

Krashinsky, Ronny, Olivier Giroux, Stephen Jones, Nick Stam, and Sridhar Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, et al. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” Proceedings of the 29th Symposium on Operating Systems Principles, 611–26.

Kydlíček, Hynek. n.d. Math-Verify: Math Verification Library. Version 0.6.1. https://github.com/huggingface/math-verify.

Lacoste, Alexandre, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. “Quantifying the Carbon Emissions of Machine Learning.” arXiv Preprint arXiv:1910.09700.

Lamy-Poirier, Joel. 2023. Breadth-First Pipeline Parallelism. https://arxiv.org/abs/2211.05953.

Lewis, Noah, Jean Luca Bez, and Suren Byna. 2025. “I/O in Machine Learning Applications on HPC Systems: A 360-Degree Survey.” ACM Computing Surveys 57 (10): 1–41.

Lhoest, Quentin, Albert Villanova Del Moral, Yacine Jernite, et al. 2021. “Datasets: A Community Library for Natural Language Processing.” Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 175–84.

Lian, Wing, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". 2023. “OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces.” In HuggingFace Repository. HuggingFace. https://huggingface.co/datasets/Open-Orca/OpenOrca.

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, et al. 2023. Holistic Evaluation of Language Models. https://arxiv.org/abs/2211.09110.

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, et al. 2023. “Holistic Evaluation of Language Models.” Transactions on Machine Learning Research. https://openreview.net/forum?id=iO4LZibEqW.

Lightly.ai. n.d. Reinforcement Learning from Human Feedback (RLHF). https://www.lightly.ai/blog/rlhf-reinforcement-learning-from-human-feedback.

Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summarization.” Text Summarization Branches Out: Proceedings of the ACL-04 Workshop 8.

Lin, Ji, Jiaming Tang, Haotian Tang, Shaohan Yang, Xingyu Dang, and Song Han. 2023. “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration.” arXiv Preprint arXiv:2306.00978.

Liu, Aixin, Bei Feng, Bing Xue, et al. 2024. “DeepSeek-V3 Technical Report.” arXiv Preprint arXiv:2412.19437.

Liu, Biao, Ning Xu, Junming Yang, Hao Xu, and Xin Geng. 2026. “VRM: Teaching Reward Models to Understand Authentic Human Preferences.” arXiv Preprint arXiv:2603.04974.

Liu, Hao, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. https://arxiv.org/abs/2310.01889.

Liu, Junnan, Hongwei Liu, Linchen Xiao, et al. 2025. Are Your LLMs Capable of Stable Reasoning? https://arxiv.org/abs/2412.13147.

LMSYS. 2023. Chatbot Arena Conversations (Version 1.0) [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.

Lockwood, Glenn. 2017. What’s so Bad about POSIX I/O? https://www.nextplatform.com/code/2017/09/11/whats-so-bad-about-posix-i/o/1636132.

Lottick, Kadan, Silvia Susai, Sorelle A Friedler, and Jonathan P Wilson. 2019. “Energy Usage Reports: Environmental Awareness as Part of Algorithmic Accountability.” arXiv Preprint arXiv:1911.08354.

Lucas, Ryan, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, and Rahul Mazumder. 2025. “Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction.” arXiv Preprint arXiv:2509.12464.

Lundberg, Scott M., Gabriel Erion, Hugh Chen, et al. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.” Nature Machine Intelligence 2 (1): 56–67. https://doi.org/10.1038/s42256-019-0138-9.

Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” Proceedings of the 31st International Conference on Neural Information Processing Systems (Red Hook, NY, USA), NIPS’17, December, 4768–77. https://dl.acm.org/doi/10.5555/3295222.3295230.

Lüttgau, Jakob, Michael Kuhn, Kira Duwe, et al. 2018. “Survey of Storage Systems for High-Performance Computing.” Supercomputing Frontiers and Innovations 5 (1).

Mangrulkar, Sourab, Sylvain Gugger, Lysandre Debut, et al. 2022. PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods. https://github.com/huggingface/peft.

Maurya, Avinash, Robert Underwood, M Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. “DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models.” Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 227–39.

May, Philip et al. 2024. UltraChat-200k-ShareGPT-Clean Dataset. HuggingFace. https://huggingface.co/datasets/PhilipMay/UltraChat-200k-ShareGPT-clean.

McCandlish, Sam, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. “An Empirical Model of Large-Batch Training.” CoRR abs/1812.06162. http://arxiv.org/abs/1812.06162.

Micikevicius, Paulius, Dusan Stosic, Neil Burgess, Marius Cornea, Ryan Prenger, and Hassan Mayer. 2022. “FP8 Formats for Deep Learning.” arXiv Preprint arXiv:2209.05433.

Microsoft. 2020. DeepSpeed: Deep Learning Optimization Library. https://github.com/microsoft/DeepSpeed.

Microsoft. 2021. DeepSpeed Elasticity. https://www.deepspeed.ai/.

Microsoft. 2024. Model Checkpointing. https://www.deepspeed.ai/tutorials/checkpointing/.

MMEngine Contributors. 2022. OpenMMLab Foundational Library for Training Deep Learning Models. https://github.com/open-mmlab/mmengine.

Mölder, Felix, Kim Philipp Jablonski, Brice Letcher, et al. 2021. Snakemake: A Scalable Bioinformatics Workflow Engine. https://snakemake.github.io/.

Molnar, Christoph. 2025. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 3rd ed. https://christophm.github.io/interpretable-ml-book.

Moritz, Philipp, Robert Nishihara, Stephanie Wang, et al. 2018. “Ray: A Distributed Framework for Emerging AI Applications.” 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 561–77.

Murdoch, W. James, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. 2019. “Definitions, Methods, and Applications in Interpretable Machine Learning.” Proceedings of the National Academy of Sciences 116 (44): 22071–80. https://doi.org/10.1073/pnas.1900654116.

NERSC. 2024. Optimizing I/O Performance. https://docs.nersc.gov/performance/io/.

Nextflow Contributors. 2023. Nextflow Documentation. https://www.nextflow.io/docs/latest/index.html.

Niu, Chenxu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025. “Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines.” ACM SIGENERGY Energy Informatics Review 5 (2): 56–62.

NVIDIA. 2020a. Accelerating AI Training with TF32 Tensor Cores. NVIDIA Developer Blog. https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/.

NVIDIA. 2020b. NVIDIA Ampere GA-100 GPU Architecture. NVIDIA. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.

NVIDIA. 2022. NVIDIA Hopper GPU Architecture In-Depth. NVIDIA Developer Blog. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.

NVIDIA. 2023. NVIDIA System Management Interface (nvidia-smi). https://developer.nvidia.com/system-management-interface.

NVIDIA. 2024a. Automatic Mixed Precision for Deep Learning. https://developer.nvidia.com/automatic-mixed-precision.

NVIDIA. 2024b. Mixed Precision Training. NVIDIA Deep Learning Performance Documentation. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/.

NVIDIA. 2024c. nvidia-ml-py: Python Bindings for the NVIDIA Management Library. https://pypi.org/project/nvidia-ml-py/.

NVIDIA Corporation. 2020. NVIDIA Ampere GA100 GPU Architecture. NVIDIA Corporation. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.

NVIDIA Corporation. 2022. NVIDIA Hopper H100 Tensor Core GPU Architecture. NVIDIA Corporation. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c?ncid=no-ncid.

NVIDIA Corporation. 2023. NVIDIA Grace Hopper Superchip Architecture. NVIDIA Corporation. https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper?ncid=no-ncid.

NVIDIA Corporation. 2024a. NVIDIA Blackwell Architecture. NVIDIA Data Center Architecture Page. https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/.

NVIDIA Corporation. 2024b. NVIDIA Nsight Systems User Guide. https://docs.nvidia.com/nsight-systems/.

NVIDIA Corporation. 2024c. NVIDIA TensorRT-LLM. GitHub repository, release 0.5.0. https://github.com/NVIDIA/TensorRT-LLM.

NVIDIA Corporation. 2025a. NVIDIA DGX B200 User Guide. NVIDIA Documentation. https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html.

NVIDIA Corporation. 2025b. NVIDIA RTX Blackwell PRO GPU Architecture. PDF. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/pdf/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1_1.pdf.

NVIDIA Corporation. 2025c. NVIDIA RTX PRO 6000 Blackwell Series. NVIDIA Product Page. https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000-family/.

NVIDIA Corporation. 2025d. NVIDIA RTX PRO 6000 Blackwell Server Edition. NVIDIA Product Page. https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/.

Olmo, Team, Allyson Ettinger, Amanda Bertsch, et al. 2025. “Olmo 3.” arXiv Preprint arXiv:2512.13961.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730–44.

Özcan, Miray, Philipp Wiesner, Philipp Weiß, and Odej Kao. 2025. “Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations.” arXiv Preprint arXiv:2507.11417.

Pan, Xueting, Ziqian Luo, and Lisang Zhou. 2024. “Navigating the Landscape of Distributed File Systems: Architectures, Implementations, and Considerations.” arXiv Preprint arXiv:2403.15701.

Pandey, Richa, and SP Sah. 2016. “A Review on Google File System.” International Journal of Computer Science Trends and Technology (IJCST) 4: 177–80.

Panigutti, Cecilia, Ronan Hamon, Isabelle Hupont, et al. 2023. “The Role of Explainable AI in the Context of the AI Act.” Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’23, June, 1139–50. https://doi.org/10.1145/3593013.3594069.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: A Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18.

Patel, Zeeshan, Ethan He, Parth Mannan, et al. 2025. “Training Video Foundation Models with NVIDIA NeMo.” arXiv Preprint arXiv:2503.12964.

Pathak, Shreyasi, Jörg Schlötterer, Jeroen Veltman, Jeroen Geerdink, Maurice van Keulen, and Christin Seifert. 2024. “Prototype-Based Interpretable Breast Cancer Prediction Models: Analysis and Challenges.” In Explainable Artificial Intelligence, edited by Luca Longo, Sebastian Lapuschkin, and Christin Seifert. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-63787-2_2.

Patil, Shishir G., Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large Language Model Connected with Massive APIs. https://arxiv.org/abs/2305.15334.

Peddie, Jon. 2023. The History of the GPU-Eras and Environment. Springer.

Penedo, Guilherme et al. DataTrove: Large Scale Data Processing. https://github.com/huggingface/datatrove.

Petrini, Fabrizio. 2002. “Scaling to Thousands of Processors with Buffered Coscheduling.” Scaling to New Heights Workshop.

Petsiuk, Vitali, Abir Das, and Kate Saenko. 2018. RISE: Randomized Input Sampling for Explanation of Black-Box Models. arXiv. https://doi.org/10.48550/arXiv.1806.07421.

Mazurek, Piotr, and Felix Gabriel. 2025. “LLM Inference Economics from First Principles.” Substack, May. https://www.tensoreconomics.com/p/moe-inference-economics-from-first.

Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2023. “Efficiently Scaling Transformer Inference.” Proceedings of Machine Learning and Systems 5: 606–24.

PrimeIntellect. 2024. AIME 2024 Dataset. https://huggingface.co/datasets/PrimeIntellect/AIME-24.

PrimeIntellect. 2025. AIME 2025 Dataset. https://huggingface.co/datasets/PrimeIntellect/AIME-25.

Pruthi, Garima, Frederick Liu, Satyen Kale, and Mukund Sundararajan. 2020. “Estimating Training Data Influence by Tracing Gradient Descent.” Advances in Neural Information Processing Systems 33: 19920–30. https://proceedings.neurips.cc/paper/2020/hash/e6385d39ec9394f2f3a354d9d2b88eec-Abstract.html.

PyTorch Contributors. 2022. TorchElastic: Fault-Tolerant Distributed Training. https://docs.pytorch.org/docs/stable/elastic/run.html.

PyTorch Contributors. 2023a. Anomaly Detection in Autograd. https://pytorch.org/docs/stable/autograd.html#anomaly-detection.

PyTorch Contributors. 2023b. Fully Sharded Data Parallel (FSDP). https://docs.pytorch.org/docs/stable/fsdp.html.

PyTorch Contributors. 2023c. Getting Started with Fully Sharded Data Parallel (FSDP). https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html.

PyTorch Contributors. 2023d. Reproducibility and Randomness. https://pytorch.org/docs/stable/notes/randomness.html.

PyTorch Contributors. 2023e. torch.distributed.checkpoint. https://pytorch.org/docs/stable/distributed.checkpoint.html.

PyTorch Contributors. 2023f. torch.nn.functional.scaled_dot_product_attention. https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html.

PyTorch Contributors. 2024. Automatic Mixed Precision Examples. https://docs.pytorch.org/docs/stable/notes/amp_examples.html#typical-mixed-precision-training.

Qi, Penghui, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. https://arxiv.org/abs/2401.10241.

Qin, Yujia, Shihao Liang, Yining Ye, et al. 2023. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs.” arXiv Preprint arXiv:2307.16789.

Qwen Team. 2025. Qwen3-8B. https://huggingface.co/Qwen/Qwen3-8B.

Radford, Alec, Jeffrey Wu, Rewon Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems 36: 53728–41.

Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–16.

Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021a. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. https://arxiv.org/abs/2104.07857.

Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021b. “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14.

Regulation (EU) 2016/679 of the European Parliament and of the Council: General Data Protection Regulation (GDPR), Pub. L. No. 2016/679 (2016). https://eur-lex.europa.eu/eli/reg/2016/679/oj.

Rein, David, Betty Li Hou, Asa Cooper Stickland, et al. 2023. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. https://arxiv.org/abs/2311.12022.

Ren, Jie, Samyam Rajbhandari, Reza Yazdani Aminabadi, et al. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. https://arxiv.org/abs/2101.06840.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “"Why Should I Trust You?": Explaining the Predictions of Any Classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA), KDD ’16, August, 1135–44. https://doi.org/10.1145/2939672.2939778.

Rocklin, Matthew et al. 2015. “Dask: Parallel Computation with Blocked Algorithms and Task Scheduling.” SciPy, 126–32.

Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1 (5): 206–15. https://doi.org/10.1038/s42256-019-0048-x.

Salloum, Salman, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, and Joshua Zhexue Huang. 2016. “Big Data Analytics on Apache Spark.” International Journal of Data Science and Analytics 1 (3): 145–64.

Scheda, Riccardo, Domitilla Brandoni, Laura Cavalli, and Laura Morselli. 2025. “Evaluation of Distributed Asynchronous Checkpointing in High-Performance Computing.” International Conference on High Performance Computing, 162–71.

Schmid, Philipp. 2024. Pass@k Vs Pass^k: Understanding Agent Reliability. https://www.philschmid.de/agents-pass-at-k-pass-power-k.

Schmuck, Frank, and Roger Haskin. 2002. “GPFS: A Shared-Disk File System for Large Computing Clusters.” Proceedings of the 1st USENIX Conference on File and Storage Technologies.

Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint arXiv:1707.06347.

Sel, Bilgehan, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. “Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models.” arXiv Preprint arXiv:2308.10379.

Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2020. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.” International Journal of Computer Vision 128 (2): 336–59. https://doi.org/10.1007/s11263-019-01228-7.

Shapley, Lloyd S. 1952. A Value for n-Person Games. RAND Corporation. https://doi.org/10.7249/P0295.

Singh, Shivalika, Freddie Vargus, Daniel Dsouza, et al. 2024. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. https://arxiv.org/abs/2402.06619.

Smilkov, Daniel, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. 2017. SmoothGrad: Removing Noise by Adding Noise. arXiv. https://doi.org/10.48550/arXiv.1706.03825.

Smith, Alan. 2024. AMD Instinct MI300X Hot Chips 2024 Presentation. Presentation at Hot Chips 2024. https://hc2024.hotchips.org/assets/program/conference/day1/23_HC2024.AMD.MI300X.ASmith(MI300X).v1.Final.20240817.pdf.

Song, Xinyuan, Keyu Wang, PengXiang Li, Lu Yin, and Shiwei Liu. 2025. “Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning.” arXiv Preprint arXiv:2510.02091.

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2015. Striving for Simplicity: The All Convolutional Net. arXiv. https://doi.org/10.48550/arXiv.1412.6806.

Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. arXiv. https://doi.org/10.48550/arXiv.1703.01365.

Taori, Rohan, Ishaan Gulrajani, Tianyi Zhang, et al. 2023. “Stanford Alpaca: An Instruction-Following LLaMA Model.” GitHub repository. https://github.com/tatsu-lab/stanford_alpaca.

Tazi, Nouamane et al. 2025. The Ultra-Scale Playbook: Training LLMs on GPU Clusters. https://huggingface.co/spaces/nanotron/ultrascale-playbook.

TensorFlow. TensorFlow Dataset. https://www.tensorflow.org/api_docs/python/tf/data/Dataset.

TensorFlow Contributors. 2024. Mixed Precision. https://www.tensorflow.org/guide/mixed_precision.

TOP500 Project. 2025. TOP500 List. top500.org. https://www.top500.org/.

Vef, Marc-André, Vasily Tarasov, Dean Hildebrand, and André Brinkmann. 2018. “Challenges and Solutions for Tracing Storage Systems: A Case Study with Spectrum Scale.” ACM Trans. Storage (New York, NY, USA) 14 (2). https://doi.org/10.1145/3149376.

Wan, Borui, Gaohong Liu, Zuquan Song, et al. 2025. “Robust LLM Training Infrastructure at ByteDance.” Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 186–203.

Wang, Yizhong, Yeganeh Kordi, Swaroop Mishra, et al. 2023. “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 13484–508.

Wang, Yunsong, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, and Samuel Williams. 2020. “Time-Based Roofline for Deep Learning Performance Analysis.” 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), 10–19.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35: 24824–37.

Wei, Tianwen, Liang Zhao, Lichang Zhang, et al. 2023. “Skywork: A More Open Bilingual Foundation Model.” arXiv Preprint arXiv:2310.19341.

Weights & Biases. 2020. Weights & Biases. https://wandb.ai/.

Wen, Xumeng, Zihan Liu, Shun Zheng, et al. 2025. “Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs.” https://arxiv.org/abs/2506.14245.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76.

Workshop, BigScience, Teven Le Scao, Angela Fan, et al. 2022. “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.” arXiv Preprint arXiv:2211.05100.

Xu, Can, Qingfeng Sun, Kai Zheng, et al. 2023. “WizardLM: Empowering Large Language Models to Follow Complex Instructions.” arXiv Preprint arXiv:2304.12244.

Xu, Weizheng, Youtao Zhang, and Xulong Tang. 2021. “Parallelizing DNN Training on GPUs: Challenges and Opportunities.” Companion Proceedings of the Web Conference 2021, 174–78.

Yang, Charlene, Yunsong Wang, Thorsten Kurth, Steven Farrell, and Samuel Williams. 2021. “Hierarchical Roofline Performance Analysis for Deep Learning Applications.” Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2, 473–91.

Yao, Shunyu, Dian Yu, Jeffrey Zhao, et al. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” Advances in Neural Information Processing Systems 36: 11809–22.

Yao, Shunyu, Jeffrey Zhao, Dian Yu, et al. 2022. “ReAct: Synergizing Reasoning and Acting in Language Models.” The Eleventh International Conference on Learning Representations.

Young, John W. 1974. “A First Order Approximation to the Optimum Checkpoint Interval.” Communications of the ACM 17 (9): 530–31.

Yu, Guangba, Zirui Wang, Yujie Huang, et al. 2026. “Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs.” arXiv Preprint arXiv:2601.13655.

Yu, Qiying, Zheng Zhang, Ruofei Zhu, et al. 2025. “DAPO: An Open-Source LLM Reinforcement Learning System at Scale.” https://arxiv.org/abs/2503.14476.

Yun, Jiwoong, Sangmin Choi, François Rameau, Byeongho Kang, and Ziyan Fu. 2025. Revisiting 16-Bit Neural Network Training: A Practical Approach for Resource-Limited Learning. https://doi.org/10.48550/arXiv.2305.10947.

Z, Haitao. 2025. What Is the MFU for DeepSeek-V3 Training? Medium.com. https://medium.com/@dlrover/what-is-the-mfu-for-deepseek-v3-training-0d9ea4d42eb4.

Zech, John R., Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. 2018. “Variable Generalization Performance of a Deep Learning Model to Detect Pneumonia in Chest Radiographs: A Cross-Sectional Study.” PLOS Medicine 15 (11): e1002683. https://doi.org/10.1371/journal.pmed.1002683.

Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Franziska Roesner. 2019. “HellaSwag: Can a Machine Really Finish Your Sentence?” arXiv Preprint arXiv:1905.07830.

Zhang, Jixiao, and Chunsheng Zuo. 2025. “GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models.” Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 5642–65.

Zhang, Shengyu, Linfeng Dong, Xiaoya Li, et al. 2026. “Instruction Tuning for Large Language Models: A Survey.” ACM Computing Surveys 58 (7): 1–36.

Zhao, R., B. Vogel, and T. Ahmed. 2019. “Adaptive Loss Scaling for Mixed Precision Training.” arXiv Preprint arXiv:1910.12385. https://doi.org/10.48550/arXiv.1910.12385.

Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. https://arxiv.org/abs/2306.05685.

Zhong, Tianle, Jiechen Zhao, Xindi Guo, Qiang Su, and Geoffrey Fox. 2024. “Optimizing Data I/O for LLM Datasets on Remote Storage.” AIOps.

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2015. Learning Deep Features for Discriminative Localization. arXiv. https://doi.org/10.48550/arXiv.1512.04150.

Zhou, Chunting, Pengfei Liu, Puxin Xu, et al. 2023. “LIMA: Less Is More for Alignment.” Advances in Neural Information Processing Systems 36: 55006–21.

Zhou, Denny, Nathanael Schärli, Le Hou, et al. 2022. “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.” arXiv Preprint arXiv:2205.10625.

Zmora, Neta, Hao Wu, and Jay Rodge. 2021. “Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT.” NVIDIA Technical Blog. https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/.