6 References
Adebayo, Julius, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz
Hardt, and Been Kim. 2018. “Sanity Checks for
Saliency Maps.” Advances in
Neural Information Processing
Systems 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/294a8ed24b1ad22ec2e7efea049b8737-Abstract.html.
Advanced Micro Devices, Inc. 2021. AMD CDNA 2 Architecture.
Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf.
Advanced Micro Devices, Inc. 2023. AMD CDNA 3 Architecture.
Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
Advanced Micro Devices, Inc. 2024. AMD CDNA 4 Architecture.
Advanced Micro Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-4-architecture-whitepaper.pdf.
Andersch, Michael, Greg Palmer, Ronny Krashinsky, et al. 2022.
NVIDIA Hopper Architecture in-Depth.
NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016.
“Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Anyscale. 2023. Ray: Job Submission and Execution. https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html.
Apptainer Contributors. 2023. Apptainer User Documentation. https://apptainer.org/docs/user/main/.
Armato III, Samuel G., Geoffrey McLennan, Luc Bidaut, et al. 2011.
“The Lung Image Database Consortium (LIDC) and Image Database
Resource Initiative (IDRI): A Completed Reference Database of Lung
Nodules on CT Scans.” Medical Physics 38 (2): 915–31.
https://doi.org/10.1118/1.3528204.
Artificial Analysis. 2024. Artificial Analysis. https://artificialanalysis.ai/.
ashworks1706. 2023. RLHF from Scratch
Tutorial Notebook [Google Colab Notebook]. https://colab.research.google.com/github/ashworks1706/rlhf-from-scratch/blob/main/tutorial.ipynb;
https://github.com/ashworks1706/rlhf-from-scratch/.
Bai, Yuntao, Andy Jones, Kamal Ndousse, et
al. 2022. “Training a Helpful and Harmless Assistant with
Reinforcement Learning from Human Feedback.” arXiv Preprint
arXiv:2204.05862.
Bakouch, Elie et al. 2025. SmolLM3: Smol, Multilingual, Long-Context
Reasoner. https://huggingface.co/blog/smollm3.
BeeGFS. BeeGFS Architecture Overview. https://doc.beegfs.io/latest/architecture/overview.html.
Bekman, Stas. 2023-2026. “Machine Learning Engineering Open
Book.” In GitHub Repository. Stasosphere Online Inc. https://github.com/stas00/ml-engineering.
Besta, Maciej, Nils Blach, Ales Kubicek, et
al. 2024. “Graph of Thoughts: Solving Elaborate Problems
with Large Language Models.” Proceedings of the AAAI
Conference on Artificial Intelligence 38: 17682–90.
Blomer, Jakob. 2015. “A Survey on Distributed File System
Technology.” Journal of Physics: Conference Series 608:
012039.
Borthakur, Dhruba et al. 2008. “HDFS
Architecture Guide.” Hadoop Apache Project 53 (1-13): 2.
Brandon, William, Aniruddha Nrusimha, Kevin Qian, et al. 2023.
Striped Attention: Faster Ring Attention for Causal
Transformers. https://arxiv.org/abs/2311.09431.
Breiman, Leo. 2001. “Random Forests.”
Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Brinkmann, André, Kathryn Mohror, Weikuan Yu, et al. 2020. “Ad Hoc
File Systems for High-Performance Computing.” Journal of
Computer Science and Technology 35 (1): 4–26.
Cai, Weilin, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi
Huang. 2025. “A Survey on Mixture of Experts.” IEEE
Transactions on Knowledge and Data Engineering, 1–20. https://doi.org/10.1109/tkde.2025.3554028.
Cardoso, M. Jorge, Wenqi Li, Richard Brown, et al. 2022. MONAI: An
Open-Source Framework for Deep Learning in Healthcare. https://arxiv.org/abs/2211.02701.
Carneiro, André Ramos, Jean Luca Bez, Carla Osthoff, Lucas Mello
Schnorr, and Philippe OA Navaux. 2023. “Uncovering I/O Demands on
HPC Platforms: Peeking Under the Hood of Santos Dumont.”
Journal of Parallel and Distributed Computing 182: 104744.
Casper, Stephen, Xander Davies, Claudia Shi, et
al. 2023. “Open Problems and Fundamental Limitations of
Reinforcement Learning from Human Feedback.” arXiv Preprint
arXiv:2307.15217.
Chattopadhyay, Aditya, Anirban Sarkar, Prantik Howlader, and Vineeth N.
Balasubramanian. 2017. “Grad-CAM++:
Improved Visual Explanations for
Deep Convolutional
Networks.” In arXiv.org. https://doi.org/10.1109/WACV.2018.00097.
Chen, Chaofan, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su,
and Cynthia Rudin. 2019. “This Looks Like That: Deep Learning for
Interpretable Image Recognition.” In Proceedings of the 33rd
International Conference on
Neural Information Processing
Systems. Curran Associates Inc. https://dl.acm.org/doi/10.5555/3454287.3455088.
Chen, Lichang, Shiyang Li, Jun Yan, et al.
2024. “Alpagasus: Training a Better Alpaca with Fewer
Data.” International Conference on Learning
Representations 2024: 34767–97.
Chen, Mark, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large
Language Models Trained on Code. https://arxiv.org/abs/2107.03374.
Chen, Sihong, Kai Ma, and Yefeng Zheng. 2019. Med3D: Transfer
Learning for 3D Medical Image Analysis. https://arxiv.org/abs/1904.00625.
Chen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016.
“Training Deep Nets with Sublinear Memory Cost.” arXiv
Preprint arXiv:1604.06174. https://arxiv.org/abs/1604.06174.
Chen, Zixiang, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu.
2024. “Self-Play Fine-Tuning Converts Weak Language Models to
Strong Language Models.” arXiv Preprint
arXiv:2401.01335.
Chien-Chin Huang, Fegin, Lucas Pasqualin, Iris Zhang, Less Wright,
Pradeep Fernando, and Will Constable. 2024. Optimizing Checkpointing
Efficiency with PyTorch DCP. https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250.
Chitty-Venkata, Krishna Teja, Siddhisanket Raskar, Bharat Kale, et al.
2024. “LLM-Inference-Bench: Inference Benchmarking of Large
Language Models on AI Accelerators.” SC24-w: Workshops of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, 1362–79.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2022. “PaLM: Scaling Language Modeling with
Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2023. “PaLM: Scaling Language Modeling with
Pathways.” Journal of Machine Learning Research 24
(240): 1–113.
Chowdhury, Fahim, Yue Zhu, Todd Heer, et al. 2019. “I/O
Characterization and Performance Evaluation of BeeGFS for Deep
Learning.” Proceedings of the 48th International Conference
on Parallel Processing, 1–10.
Cineca. File Systems and Data Management. https://docs.hpc.cineca.it/hpc/hpc_data_storage.html.
CleverX. n.d. The Complete Guide to RLHF. https://cleverx.com/guides/the-complete-guide-to-rlhf/.
Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021.
“Training Verifiers to Solve Math Word Problems.”
CoRR abs/2110.14168. https://arxiv.org/abs/2110.14168.
Conover, Mike, Matt Hayes, Ankit Mathur, et al. 2023. “Free Dolly:
Introducing the World’s First Truly Open Instruction-Tuned LLM.”
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
Contributors, OpenCompass. 2023. OpenCompass: A Universal Evaluation
Platform for Foundation Models. https://github.com/open-compass/opencompass.
Corbalán, Julita, Lluis Alonso, Jordi Aneas, and Luigi Brochard. 2020.
“Energy Optimization and Analysis with EAR.”
2020 IEEE International Conference on Cluster Computing
(CLUSTER), 464–72.
Corporation, NVIDIA. 2024. MLPerf Training Benchmark Results.
https://developer.nvidia.com/blog/nvidia-sets-new-generative-ai-performance-and-scale-records-in-mlperf-training-v4-0/.
Daly, John. 2003. “A Model for Predicting the Optimum Checkpoint
Interval for Restart Dumps.” International Conference on
Computational Science, 3–12.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
2022. “FlashAttention: Fast and Memory-Efficient Exact Attention
with IO-Awareness.” arXiv Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.
Darshan. Darshan HPC I/O Characterization Tool. https://wordpress.cels.anl.gov/darshan/.
Databricks. 2018. MLflow: An Open Source Platform for the Machine
Learning Lifecycle. https://mlflow.org/.
Dayan, Omer, Emre Karabulut, and Noam Neria. 2024. A Benchmarking
Study: Which Serving Technology to Choose for LLMs? https://pages.run.ai/hubfs/PDFs/Serving-Large-Language-Models-Run-ai-Benchmarking-Study.pdf.
Deepfa. 2025. RLHF: How Artificial Intelligence Learns from Human
Feedback? https://deepfa.ir/en/blog/reinforcement-learning-human-feedback-rlhf-guide.
DeepSeek-AI. 2024a. “DeepSeekMath: Pushing the Limits of
Mathematical Reasoning in Open Language Models.” arXiv
Preprint arXiv:2402.03300.
DeepSeek-AI. 2024b. DeepSeek-V3 Technical Report.
DeepSeek. https://arxiv.org/abs/2412.19437.
Dettmers, Tim. 2021. Bitsandbytes: 8-Bit CUDA Functions for
PyTorch. https://github.com/TimDettmers/bitsandbytes.
Dettmers, Tim, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2021.
“8-Bit Optimizers via Block-Wise Quantization.” arXiv
Preprint arXiv:2110.02861. https://arxiv.org/abs/2110.02861.
Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer.
2023. “QLoRA: Efficient Finetuning of Quantized LLMs.”
Advances in Neural Information Processing Systems 36:
10088–115.
Dimino, Fabrizio, Krati Saxena, Bhaskarjit Sarmah, and Stefano Pasquali.
2025. “Tracing Positional Bias in
Financial Decision-Making:
Mechanistic Insights from
Qwen2.5.” Proceedings of the 6th
ACM International Conference on
AI in Finance (ICAIF ’25).
https://doi.org/10.1145/3768292.3770394.
Ding, Nan, and Samuel Williams. 2019. “An Instruction Roofline
Model for GPUs.” 2019 IEEE/ACM Performance Modeling,
Benchmarking and Simulation of High Performance Computer Systems
(PMBS), 7–18.
Dolas, Sagar, Ana Varbanescu, and Benjamin Czaja. 2022. “Making
Scientific Research on Dutch National Supercomputer Energy
Efficient.” ERCIM News, no. 131. https://ercim-news.ercim.eu/en131/special/making-scientific-research-on-dutch-national-supercomputer-energy-efficient.
Domyn. 2025. Domyn Swarm. https://github.com/igeniusai/domyn-swarm.
Doshi-Velez, Finale, and Been Kim. 2017. “Towards a Rigorous
Science of Interpretable Machine Learning.” arXiv Preprint
arXiv:1702.08608.
Duan, Jiangfei, Shuo Zhang, Zerui Wang, et
al. 2024. “Efficient Training of Large Language Models on
Distributed Infrastructures: A Survey.” arXiv Preprint
arXiv:2407.20018.
El-Sayed, Nosayba, and Bianca Schroeder. 2014. “To Checkpoint or
Not to Checkpoint: Understanding Energy-Performance-I/O Tradeoffs in HPC
Checkpointing.” 2014 IEEE International Conference on Cluster
Computing (CLUSTER), 93–102.
Ettinger, Allyson, Amanda Bertsch, Bailey Kuehl, et al. 2025.
“OLMo 3.” CoRR abs/2512.13961. https://doi.org/10.48550/ARXIV.2512.13961.
Exxact Corporation. 2017. Discover the Difference Between Deep
Learning Training and Inference. https://www.exxactcorp.com/blog/HPC/discover-the-difference-between-deep-learning-training-and-inference.
Faiz, Ahmad, Sotaro Kaneda, Ruhan Wang, et al. 2023. “LLMCarbon:
Modeling the End-to-End Carbon Footprint of Large Language
Models.” arXiv Preprint arXiv:2309.14393.
Frantar, Elias, Saleh Ashkboos, Bert Hoover, Liane Dery, David Grangier,
and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization
for Generative Pre-Trained Transformers.” arXiv Preprint
arXiv:2210.17323.
Friedman, Jerome H. 2001. “Greedy Function Approximation:
A Gradient Boosting Machine.” The Annals of
Statistics 29 (5): 1189–232. https://doi.org/10.1214/aos/1013203451.
Fu, Yichao, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. 2025. Deep
Think with Confidence. https://arxiv.org/abs/2508.15260.
Fu, Zhenxiao, Fan Chen, Shan Zhou, Haitong Li, and Lei Jiang. 2025.
“LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM
Inferences.” ACM SIGENERGY Energy Informatics Review 5
(2): 63–68.
Gao, Leo, Jonathan Tow, Baber Abbasi, et al. 2023. A Framework for
Few-Shot Language Model Evaluation. Version v0.4.0. Zenodo. https://doi.org/10.5281/zenodo.10256836.
George, Anjus, Andreas Dilger, Michael J Brim, et
al. 2025. “Lustre Unveiled: Evolution, Design,
Advancements, and Current Trends.” ACM Transactions on
Storage 21 (3): 1–109.
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. 2003. “The
Google File System.” Proceedings of the Nineteenth ACM
Symposium on Operating Systems Principles, 29–43.
Goldberg, David. 1991. What Every Computer Scientist Should Know
about Floating-Point Arithmetic. ACM Computing Surveys. https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.
Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015.
“Peeking Inside the Black
Box: Visualizing Statistical
Learning With Plots of
Individual Conditional
Expectation.” Journal of Computational and
Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.
Google. 2019. TensorBoard: TensorFlow’s Visualization Toolkit.
https://www.tensorflow.org/tensorboard.
Google. 2025. Logistic Regression: Loss and Regularization. https://developers.google.com/machine-learning/crash-course/logistic-regression/loss-regularization.
Google Cloud. 2019. BFloat16: The Secret to High Performance on
Cloud TPUs. Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
Google Cloud. n.d. Reinforcement Learning from Human Feedback on
Google Cloud. https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud.
Google LLC. 2020. Coral Accelerator Module Datasheet. PDF. https://www.coral.ai/static/files/Coral-Accelerator-Module-datasheet.pdf.
Gossman, Mikaila, Avinash Maurya, Bogdan Nicolae, and Jon Calhoun. 2026.
“Understanding LLM Checkpoint/Restore I/O Strategies and
Patterns.” Proceedings of the Supercomputing Asia and
International Conference on High Performance Computing in Asia Pacific
Region Workshops, 263–73.
GrafanaLabs. Grafana. https://grafana.com/.
Grattafiori, Aaron et al. 2024. The
Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav
Jauhri, et al. 2024a. “The Llama 3 Herd of Models.”
arXiv Preprint arXiv:2407.21783.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav
Jauhri, et al. 2024b. “The Llama 3 Herd of Models.”
arXiv Preprint arXiv:2407.21783.
Grosse, Roger, Juhan Bae, Cem Anil, et al. 2023. Studying
Large Language Model
Generalization with Influence
Functions. arXiv. https://doi.org/10.48550/arXiv.2308.03296.
Gurumoorthy, Karthik S., Amit Dhurandhar, Guillermo Cecchi, and Charu
Aggarwal. 2019. “Efficient Data
Representation by Selecting
Prototypes with Importance
Weights.” 2019 IEEE
International Conference on Data
Mining (ICDM), November, 260–69. https://doi.org/10.1109/ICDM.2019.00036.
Guthmann, François. 2023. Occupancy Explained. GPUOpen. https://gpuopen.com/learn/occupancy-explained/.
Ha, V. Q. 2025. Building an RLHF Pipeline for LLMs: A
Beginner-Friendly Tutorial. https://medium.com/@vi.ha.engr/building-an-rlhf-pipeline-for-llms-a-beginner-friendly-tutorial-21112bfcff9b.
Habib, Nathan, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and
Lewis Tunstall. 2023. LightEval: A Lightweight Framework for LLM
Evaluation. Version 0.11.0. https://github.com/huggingface/lighteval.
Hendrycks, Dan, Collin Burns, Steven Basart, et al. 2020.
“Measuring Massive Multitask Language Understanding.”
arXiv Preprint arXiv:2009.03300.
Hildebrand, Dean, and Denis Serenyi. 2021. Colossus Under the Hood:
A Peek into Google’s Scalable Storage System. https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system.
HPE. About Progressive File Layout (PFL). https://support.hpe.com/hpesc/public/docDisplay?docId=a00114946en_us&page=About_Progressive_File_Layout.html&docLocale=en_US.
Hu, Edward J, Yelong Shen, Phillip Wallis, et
al. 2022. “LoRA: Low-Rank Adaptation of Large Language
Models.” ICLR 1 (2): 3.
IBM. 2026a. Basic Structure of IBM Storage Scale. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=overview-basic-structure-storage-scale.
IBM. 2026b. Cache Usage. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=considerations-cache-usage.
IBM. 2026c. GPFS Architecture. https://www.ibm.com/docs/en/storage-scale/6.0.0?topic=overview-gpfs-architecture.
IEEE Standards Association. 2019. IEEE Standard for Floating-Point
Arithmetic. IEEE.
influxdata. Influxdata. https://www.influxdata.com/.
IOR. IOR. https://wordpress.cels.anl.gov/darshan/.
Ip, Jeffrey, and Kritin Vongthongsri. 2026. deepeval. Version 3.9.7. https://github.com/confident-ai/deepeval.
Jacovi, Alon, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023.
Stop Uploading Test Data in Plain Text: Practical Strategies for
Mitigating Data Contamination by Evaluation Benchmarks. https://arxiv.org/abs/2305.10160.
Jain, Naman, King Han, Alex Gu, et al. 2024. LiveCodeBench: Holistic
and Contamination Free Evaluation of Large Language Models for
Code. https://arxiv.org/abs/2403.07974.
Jegham, Nidhal, Marwan Abdelatti, Chan Young Koh, Lassad Elmoubarki, and
Abdeltawab Hendawi. 2025. “How Hungry Is AI? Benchmarking Energy,
Water, and Carbon Footprint of LLM Inference.” arXiv Preprint
arXiv:2505.09598.
Jennings, Joseph et al. NeMo-Curator: A Toolkit for Data
Curation. https://github.com/NVIDIA-NeMo/Curator.
Jiang, Albert Q., Alexandre Sablayrolles, Antoine Roux, et al. 2024.
Mixtral of Experts. https://arxiv.org/abs/2401.04088.
Jiang, Peng, Christian Sonne, Wangliang Li, Fengqi You, and Siming You.
2024. “Preventing the Immense Increase in the Life-Cycle Energy
and Carbon Footprints of LLM-Powered Intelligent Chatbots.”
Engineering 40: 202–10.
Jiang, Ziheng, Haibin Lin, Yinmin Zhong, et
al. 2024. “MegaScale: Scaling Large Language Model Training
to More Than 10,000 GPUs.” 21st USENIX Symposium on Networked
Systems Design and Implementation (NSDI 24), 745–60.
Kalamkar, Dhiraj, Dheevatsa Mudigere, Naveen Mellempudi, et al. 2019.
A Study of BFLOAT16 for Deep Learning Training. https://arxiv.org/abs/1905.12322.
Kaplan, Jared, Sam McCandlish, Tom Henighan, et
al. 2020. “Scaling Laws for Neural Language Models.”
arXiv Preprint arXiv:2001.08361.
Karun, A Kala, and K Chitharanjan. 2013. “A Review on Hadoop—HDFS
Infrastructure Extensions.” 2013 IEEE Conference on
Information & Communication Technologies, 132–37.
Kim, Been, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. “Examples
Are Not Enough, Learn to Criticize! Criticism for
Interpretability.” Advances in
Neural Information Processing
Systems 29. https://papers.nips.cc/paper_files/paper/2016/hash/5680522b8e2bb01943234bce7bf84534-Abstract.html.
Koh, Pang Wei, and Percy Liang. 2017. “Understanding
Black-Box Predictions via
Influence Functions.” Proceedings
of the 34th International Conference on
Machine Learning, July, 1885–94. https://proceedings.mlr.press/v70/koh17a.html.
Köpf, Andreas, Yannic Kilcher, Dimitri von Rütte,
et al. 2023. “OpenAssistant Conversations Dataset
(OASST1).” Hugging Face. https://doi.org/10.48550/arXiv.2304.07327.
Korthikanti, Vijay Anand, Jared Casper, Sangkug Lym, et al. 2023.
“Reducing Activation Recomputation in Large Transformer
Models.” Proceedings of Machine Learning and Systems 5:
341–53.
Korthikanti, Vijay, Jared Casper, Sangkug Lym, et al. 2022. Reducing
Activation Recomputation in Large Transformer Models. https://arxiv.org/abs/2205.05198.
Krashinsky, Ronny, Olivier Giroux, Stephen Jones, Nick Stam, and Sridhar
Ramaswamy. 2020. NVIDIA Ampere Architecture
in-Depth. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody
Hao Yu, Joseph E. Gonzalez, et al. 2023. Efficient Memory Management
for Large Language Model Serving with PagedAttention.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody
Hao Yu, Joseph Gonzalez, et al. 2023. “Efficient Memory Management
for Large Language Model Serving with PagedAttention.”
Proceedings of the 29th Symposium on Operating Systems
Principles, 611–26.
Kydlíček, Hynek. n.d. Math-Verify: Math Verification
Library. Version 0.6.1. https://github.com/huggingface/math-verify.
Lacoste, Alexandre, Alexandra Luccioni, Victor Schmidt, and Thomas
Dandres. 2019. “Quantifying the Carbon Emissions of Machine
Learning.” arXiv Preprint arXiv:1910.09700.
Lamy-Poirier, Joel. 2023. Breadth-First Pipeline Parallelism.
https://arxiv.org/abs/2211.05953.
Lewis, Noah, Jean Luca Bez, and Suren Byna. 2025. “I/O in Machine
Learning Applications on HPC Systems: A 360-Degree Survey.”
ACM Computing Surveys 57 (10): 1–41.
Lhoest, Quentin, Albert Villanova Del Moral, Yacine
Jernite, et al. 2021. “Datasets: A Community Library for
Natural Language Processing.” Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, 175–84.
Lian, Wing, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet
Vong, and "Teknium". 2023. “OpenOrca: An Open Dataset of GPT
Augmented FLAN Reasoning Traces.” In HuggingFace
Repository. https://huggingface.co/datasets/Open-Orca/OpenOrca;
HuggingFace.
Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu,
Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya
Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian
Cosgrove, et al. 2023. Holistic Evaluation of Language Models.
https://arxiv.org/abs/2211.09110.
Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu,
Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya
Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian
Alexander Cosgrove, et al. 2023. “Holistic Evaluation of Language
Models.” Transactions on Machine Learning Research. https://openreview.net/forum?id=iO4LZibEqW.
Lightly.ai. n.d. Reinforcement Learning from Human Feedback
(RLHF). https://www.lightly.ai/blog/rlhf-reinforcement-learning-from-human-feedback.
Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of
Summarization.” Text Summarization Branches Out: Proceedings
of the ACL-04 Workshop 8.
Lin, Ji, Jiaming Tang, Haotian Tang, Shaohan Yang, Xingyu Dang, and Song
Han. 2023. “AWQ: Activation-Aware Weight Quantization for LLM
Compression and Acceleration.” arXiv Preprint
arXiv:2306.00978.
Liu, Aixin, Bei Feng, Bing Xue, et al. 2024.
“DeepSeek-V3 Technical Report.” arXiv Preprint
arXiv:2412.19437.
Liu, Biao, Ning Xu, Junming Yang, Hao Xu, and Xin Geng. 2026.
“VRM: Teaching Reward Models to Understand Authentic Human
Preferences.” arXiv Preprint arXiv:2603.04974.
Liu, Hao, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention
with Blockwise Transformers for Near-Infinite Context. https://arxiv.org/abs/2310.01889.
Liu, Junnan, Hongwei Liu, Linchen Xiao, et al. 2025. Are Your LLMs
Capable of Stable Reasoning? https://arxiv.org/abs/2412.13147.
LMSYS. 2023. Chatbot Arena Conversations (Version 1.0)
[Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
Lockwood, Glenn. 2017. What’s so Bad about POSIX I/O? https://www.nextplatform.com/code/2017/09/11/whats-so-bad-about-posix-i/o/1636132.
Lottick, Kadan, Silvia Susai, Sorelle A Friedler, and Jonathan P Wilson.
2019. “Energy Usage Reports: Environmental Awareness as Part of
Algorithmic Accountability.” arXiv Preprint
arXiv:1911.08354.
Lucas, Ryan, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, and
Rahul Mazumder. 2025. “Reasoning Models Can Be Accurately Pruned
via Chain-of-Thought Reconstruction.” arXiv Preprint
arXiv:2509.12464.
Lundberg, Scott M., Gabriel Erion, Hugh Chen, et al. 2020. “From
Local Explanations to Global Understanding with Explainable
AI for Trees.” Nature Machine Intelligence
2 (1): 56–67. https://doi.org/10.1038/s42256-019-0138-9.
Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to
Interpreting Model Predictions.” Proceedings of the 31st
International Conference on
Neural Information Processing
Systems (Red Hook, NY, USA), NIPS’17,
December, 4768–77. https://dl.acm.org/doi/10.5555/3295222.3295230.
Lüttgau, Jakob, Michael Kuhn, Kira Duwe, et al. 2018. “Survey of
Storage Systems for High-Performance Computing.”
Supercomputing Frontiers and Innovations 5 (1).
Mangrulkar, Sourab, Sylvain Gugger, Lysandre Debut, et al. 2022.
PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning
Methods. https://github.com/huggingface/peft.
Maurya, Avinash, Robert Underwood, M Mustafa Rafique, Franck Cappello,
and Bogdan Nicolae. 2024. “DataStates-LLM: Lazy Asynchronous
Checkpointing for Large Language Models.” Proceedings of the
33rd International Symposium on High-Performance Parallel and
Distributed Computing, 227–39.
McCandlish, Sam, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018.
“An Empirical Model of Large-Batch Training.” CoRR
abs/1812.06162. http://arxiv.org/abs/1812.06162.
Micikevicius, Paulius, Dusan Stosic, Neil Burgess, Marius Cornea, Ryan
Prenger, and Hassan Mayer. 2022. “FP8 Formats for Deep
Learning.” arXiv Preprint arXiv:2209.05433.
Microsoft. 2020. DeepSpeed: Deep Learning Optimization Library.
https://github.com/microsoft/DeepSpeed.
Microsoft. 2021. DeepSpeed Elasticity. https://www.deepspeed.ai/.
Microsoft. 2024. Model Checkpointing. https://www.deepspeed.ai/tutorials/checkpointing/.
MMEngine Contributors. 2022. OpenMMLab
Foundational Library for Training Deep Learning Models. https://github.com/open-mmlab/mmengine.
Mölder, Felix, Kim Philipp Jablonski, Brice
Letcher, et al. 2021. Snakemake: A Scalable Bioinformatics
Workflow Engine. https://snakemake.github.io/.
Molnar, Christoph. 2025. Interpretable Machine Learning: A Guide for
Making Black Box Models Explainable. 3rd ed. https://christophm.github.io/interpretable-ml-book.
Moritz, Philipp, Robert Nishihara, Stephanie Wang,
et al. 2018. “Ray: A Distributed Framework for Emerging AI
Applications.” 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 18), 561–77.
Murdoch, W. James, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin
Yu. 2019. “Definitions, Methods, and Applications in Interpretable
Machine Learning.” Proceedings of the National Academy of
Sciences 116 (44): 22071–80. https://doi.org/10.1073/pnas.1900654116.
NERSC. 2024. Optimizing I/O Performance. https://docs.nersc.gov/performance/io/.
Nextflow Contributors. 2023. Nextflow Documentation. https://www.nextflow.io/docs/latest/index.html.
Niu, Chenxu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025.
“Energy Efficient or Exhaustive? Benchmarking Power Consumption of
LLM Inference Engines.” ACM SIGENERGY Energy Informatics
Review 5 (2): 56–62.
NVIDIA. 2020a. Accelerating AI Training with TF32 Tensor Cores.
NVIDIA Developer Blog. https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/.
NVIDIA. 2020b. NVIDIA Ampere GA-100 GPU Architecture. NVIDIA.
https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
NVIDIA. 2022. NVIDIA Hopper GPU Architecture in-Depth. NVIDIA
Developer Blog. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
NVIDIA. 2023. NVIDIA System Management Interface (nvidia-smi).
https://developer.nvidia.com/system-management-interface.
NVIDIA. 2024a. Automatic Mixed Precision for Deep Learning. https://developer.nvidia.com/automatic-mixed-precision.
NVIDIA. 2024b. Mixed Precision Training. NVIDIA Deep Learning
Performance Documentation. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/.
NVIDIA. 2024c. nvidia-ml-py: Python Bindings for the NVIDIA
Management Library. https://pypi.org/project/nvidia-ml-py/.
NVIDIA Corporation. 2020. NVIDIA Ampere GA100 GPU
Architecture. NVIDIA Corporation. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
NVIDIA Corporation. 2022. NVIDIA Hopper H100 Tensor
Core GPU Architecture. NVIDIA Corporation. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c?ncid=no-ncid.
NVIDIA Corporation. 2023. NVIDIA Grace Hopper Superchip
Architecture. NVIDIA Corporation. https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper?ncid=no-ncid.
NVIDIA Corporation. 2024a. NVIDIA Blackwell
Architecture. NVIDIA Data Center Architecture Page. https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/.
NVIDIA Corporation. 2024b. NVIDIA Nsight Systems User Guide. https://docs.nvidia.com/nsight-systems/.
NVIDIA Corporation. 2024c. NVIDIA TensorRT-LLM.
GitHub repository, release 0.5.0. https://github.com/NVIDIA/TensorRT-LLM.
NVIDIA Corporation. 2025a. NVIDIA DGX B200 User
Guide. NVIDIA Documentation. https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html.
NVIDIA Corporation. 2025b. NVIDIA RTX Blackwell PRO GPU
Architecture. PDF. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/pdf/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1_1.pdf.
NVIDIA Corporation. 2025c. NVIDIA RTX PRO 6000 Blackwell
Series. NVIDIA Product Page. https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000-family/.
NVIDIA Corporation. 2025d. NVIDIA RTX PRO 6000 Blackwell
Server Edition. NVIDIA Product Page. https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/.
Olmo, Team, Allyson Ettinger, Amanda Bertsch, et
al. 2025. “Olmo 3.” arXiv Preprint
arXiv:2512.13961.
Ouyang, Long, Jeffrey Wu, Xu Jiang, et al.
2022. “Training Language Models to Follow Instructions with Human
Feedback.” Advances in Neural Information Processing
Systems 35: 27730–44.
Özcan, Miray, Philipp Wiesner, Philipp Weiß, and Odej Kao. 2025.
“Quantifying the Energy Consumption and Carbon Emissions of LLM
Inference via Simulations.” arXiv Preprint
arXiv:2507.11417.
Pan, Xueting, Ziqian Luo, and Lisang Zhou. 2024. “Navigating the
Landscape of Distributed File Systems: Architectures, Implementations,
and Considerations.” arXiv Preprint arXiv:2403.15701.
Pandey, Richa, and SP Sah. 2016. “A Review on Google File
System.” International Journal of Computer Science Trends and
Technology (IJCST) 4: 177–80.
Panigutti, Cecilia, Ronan Hamon, Isabelle Hupont, et al. 2023.
“The Role of Explainable AI in the Context of the
AI Act.” Proceedings of the 2023
ACM Conference on Fairness,
Accountability, and Transparency (New
York, NY, USA), FAccT ’23, June, 1139–50. https://doi.org/10.1145/3593013.3594069.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.
“BLEU: A Method for Automatic Evaluation of Machine
Translation.” Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, 311–18.
Patel, Zeeshan, Ethan He, Parth Mannan, et
al. 2025. “Training Video Foundation Models with NVIDIA
NeMo.” arXiv Preprint arXiv:2503.12964.
Pathak, Shreyasi, Jörg Schlötterer, Jeroen Veltman, Jeroen Geerdink,
Maurice van Keulen, and Christin Seifert. 2024.
“Prototype-Based Interpretable
Breast Cancer Prediction
Models: Analysis
and Challenges.” In Explainable
Artificial Intelligence, edited by Luca
Longo, Sebastian Lapuschkin, and Christin Seifert. Springer Nature
Switzerland. https://doi.org/10.1007/978-3-031-63787-2_2.
Patil, Shishir G., Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez.
2023. Gorilla: Large Language Model Connected with Massive
APIs. https://arxiv.org/abs/2305.15334.
Peddie, Jon. 2023. The History of the GPU-Eras and Environment.
Springer.
Penedo, Guilherme et al. DataTrove: Large Scale Data
Processing. https://github.com/huggingface/datatrove.
Petrini, Fabrizio. 2002. “Scaling to Thousands of Processors with
Buffered Coscheduling.” Scaling to New Heights Workshop.
Petsiuk, Vitali, Abir Das, and Kate Saenko. 2018. RISE:
Randomized Input Sampling for
Explanation of Black-Box
Models. arXiv. https://doi.org/10.48550/arXiv.1806.07421.
Mazurek, Piotr, and Felix Gabriel. 2025. “LLM Inference Economics from
First Principles.” Substack, May. https://www.tensoreconomics.com/p/moe-inference-economics-from-first.
Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2023.
“Efficiently Scaling Transformer Inference.”
Proceedings of Machine Learning and Systems 5: 606–24.
PrimeIntellect. 2024. AIME 2024 Dataset. https://huggingface.co/datasets/PrimeIntellect/AIME-24.
PrimeIntellect. 2025. AIME 2025 Dataset. https://huggingface.co/datasets/PrimeIntellect/AIME-25.
Pruthi, Garima, Frederick Liu, Satyen Kale, and Mukund Sundararajan.
2020. “Estimating Training Data
Influence by Tracing Gradient
Descent.” Advances in Neural
Information Processing
Systems 33: 19920–30. https://proceedings.neurips.cc/paper/2020/hash/e6385d39ec9394f2f3a354d9d2b88eec-Abstract.html.
PyTorch Contributors. 2022. TorchElastic: Fault-Tolerant Distributed
Training. https://docs.pytorch.org/docs/stable/elastic/run.html.
PyTorch Contributors. 2023a. Anomaly Detection in Autograd. https://pytorch.org/docs/stable/autograd.html#anomaly-detection.
PyTorch Contributors. 2023b. Fully Sharded Data Parallel
(FSDP). https://docs.pytorch.org/docs/stable/fsdp.html.
PyTorch Contributors. 2023c. Getting Started with Fully Sharded Data
Parallel (FSDP). https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html.
PyTorch Contributors. 2023d. Reproducibility and Randomness. https://pytorch.org/docs/stable/notes/randomness.html.
PyTorch Contributors. 2023e. Torch.distributed.checkpoint. https://pytorch.org/docs/stable/distributed.checkpoint.html.
PyTorch Contributors. 2023f.
Torch.nn.functional.scaled_dot_product_attention. https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html.
PyTorch Contributors. 2024. Automatic Mixed Precision Examples.
https://docs.pytorch.org/docs/stable/notes/amp_examples.html#typical-mixed-precision-training.
Qi, Penghui, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero
Bubble Pipeline Parallelism. https://arxiv.org/abs/2401.10241.
Qin, Yujia, Shihao Liang, Yining Ye, et al.
2023. “ToolLLM: Facilitating Large Language Models to Master
16000+ Real-World APIs.” arXiv Preprint
arXiv:2307.16789.
Qwen Team. 2025. Qwen3-8B. https://huggingface.co/Qwen/Qwen3-8B.
Radford, Alec, Jeffrey Wu, Rewon Child, et
al. 2019. “Language Models Are Unsupervised Multitask
Learners.” OpenAI Blog 1 (8): 9.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D Manning,
Stefano Ermon, and Chelsea Finn. 2023. “Direct Preference
Optimization: Your Language Model Is Secretly a Reward Model.”
Advances in Neural Information Processing Systems 36: 53728–41.
Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020.
“ZeRO: Memory Optimizations Toward Training Trillion Parameter
Models.” SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis, 1–16.
Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and
Yuxiong He. 2021a. ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning. https://arxiv.org/abs/2104.07857.
Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and
Yuxiong He. 2021b. “ZeRO-Infinity: Breaking the GPU Memory Wall
for Extreme Scale Deep Learning.” Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, 1–14.
Regulation (EU) 2016/679 of the European Parliament and of the Council:
General Data Protection Regulation (GDPR), Pub. L. No. 2016/679 (2016).
https://eur-lex.europa.eu/eli/reg/2016/679/oj.
Rein, David, Betty Li Hou, Asa Cooper Stickland, et al. 2023. GPQA:
A Graduate-Level Google-Proof Q&A Benchmark. https://arxiv.org/abs/2311.12022.
Ren, Jie, Samyam Rajbhandari, Reza Yazdani Aminabadi, et al. 2021.
ZeRO-Offload: Democratizing Billion-Scale Model
Training. https://arxiv.org/abs/2101.06840.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016.
“‘Why Should I Trust You?’: Explaining the Predictions of Any
Classifier.” Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (New York, NY, USA),
KDD ’16, August, 1135–44. https://doi.org/10.1145/2939672.2939778.
Rocklin, Matthew et al. 2015. “Dask:
Parallel Computation with Blocked Algorithms and Task
Scheduling.” SciPy, 126–32.
Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning
Models for High Stakes Decisions and Use Interpretable Models
Instead.” Nature Machine Intelligence 1 (5): 206–15. https://doi.org/10.1038/s42256-019-0048-x.
Salloum, Salman, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, and
Joshua Zhexue Huang. 2016. “Big Data Analytics on Apache
Spark.” International Journal of Data Science and
Analytics 1 (3): 145–64.
Scheda, Riccardo, Domitilla Brandoni, Laura Cavalli, and Laura Morselli.
2025. “Evaluation of Distributed Asynchronous Checkpointing in
High-Performance Computing.” International Conference on High
Performance Computing, 162–71.
Schmid, Philipp. 2024. Pass@k Vs Pass^k: Understanding Agent
Reliability. https://www.philschmid.de/agents-pass-at-k-pass-power-k.
Schmuck, Frank, and Roger Haskin. 2002. “GPFS: A Shared-Disk File
System for Large Computing Clusters.” Proceedings of the 1st
USENIX Conference on File and Storage Technologies.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg
Klimov. 2017. “Proximal Policy Optimization Algorithms.”
arXiv Preprint arXiv:1707.06347.
Sel, Bilgehan, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming
Jin. 2023. “Algorithm of Thoughts: Enhancing Exploration of Ideas
in Large Language Models.” arXiv Preprint
arXiv:2308.10379.
Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. 2020.
“Grad-CAM: Visual
Explanations from Deep Networks
via Gradient-Based Localization.”
International Journal of Computer Vision 128 (2): 336–59. https://doi.org/10.1007/s11263-019-01228-7.
Shapley, Lloyd S. 1952. A Value for n-Person Games. RAND
Corporation. https://doi.org/10.7249/P0295.
Singh, Shivalika, Freddie Vargus, Daniel Dsouza, et al. 2024. Aya
Dataset: An Open-Access Collection for Multilingual Instruction
Tuning. https://arxiv.org/abs/2402.06619.
Smilkov, Daniel, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin
Wattenberg. 2017. SmoothGrad: Removing Noise by Adding
Noise. arXiv. https://doi.org/10.48550/arXiv.1706.03825.
Smith, Alan. 2024. AMD Instinct MI300X Hot Chips 2024
Presentation. Presentation at Hot Chips 2024. https://hc2024.hotchips.org/assets/program/conference/day1/23_HC2024.AMD.MI300X.ASmith(MI300X).v1.Final.20240817.pdf.
Song, Xinyuan, Keyu Wang, PengXiang Li, Lu Yin, and Shiwei Liu. 2025.
“Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and
Reasoning.” arXiv Preprint arXiv:2510.02091.
Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin
Riedmiller. 2015. Striving for Simplicity:
The All Convolutional
Net. arXiv. https://doi.org/10.48550/arXiv.1412.6806.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. Axiomatic
Attribution for Deep
Networks. arXiv. https://doi.org/10.48550/arXiv.1703.01365.
Taori, Rohan, Ishaan Gulrajani, Tianyi Zhang, et al. 2023.
“Stanford Alpaca: An Instruction-Following LLaMA Model.”
GitHub repository. https://github.com/tatsu-lab/stanford_alpaca.
Tazi, Nouamane et al. 2025. The
Ultra-Scale Playbook: Training LLMs on GPU
Clusters. https://huggingface.co/spaces/nanotron/ultrascale-playbook.
TensorFlow. TensorFlow Dataset. https://www.tensorflow.org/api_docs/python/tf/data/Dataset.
TensorFlow Contributors. 2024. Mixed Precision. https://www.tensorflow.org/guide/mixed_precision.
TOP500 Project. 2025. TOP500 List. top500.org. https://www.top500.org/.
Vef, Marc-André, Vasily Tarasov, Dean Hildebrand, and André Brinkmann.
2018. “Challenges and Solutions for Tracing Storage Systems: A
Case Study with Spectrum Scale.” ACM Trans. Storage (New
York, NY, USA) 14 (2). https://doi.org/10.1145/3149376.
Wan, Borui, Gaohong Liu, Zuquan Song, et al.
2025. “Robust LLM Training Infrastructure at ByteDance.”
Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems
Principles, 186–203.
Wang, Yizhong, Yeganeh Kordi, Swaroop Mishra, et al. 2023.
“Self-Instruct: Aligning Language Models with Self-Generated
Instructions.” Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers),
13484–508.
Wang, Yunsong, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth,
and Samuel Williams. 2020. “Time-Based Roofline for Deep Learning
Performance Analysis.” 2020 IEEE/ACM Fourth Workshop on Deep
Learning on Supercomputers (DLS), 10–19.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et
al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in
Large Language Models.” Advances in Neural Information
Processing Systems 35: 24824–37.
Wei, Tianwen, Liang Zhao, Lichang Zhang, et
al. 2023. “Skywork: A More Open Bilingual Foundation
Model.” arXiv Preprint arXiv:2310.19341.
Weights & Biases. 2020. Weights & Biases. https://wandb.ai/.
Wen, Xumeng, Zihan Liu, Shun Zheng, et al.
n.d. “Reinforcement Learning with Verifiable Rewards Implicitly
Incentivizes Correct Reasoning in Base LLMs.” arXiv Preprint
arXiv:2506.14245, 2025. https://arxiv.org/abs/2506.14245.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009.
“Roofline: An Insightful Visual Performance Model for Multicore
Architectures.” Communications of the ACM 52 (4): 65–76.
Workshop, BigScience, Teven Le Scao, Angela Fan, et
al. 2022. “BLOOM: A 176B-Parameter Open-Access Multilingual
Language Model.” arXiv Preprint arXiv:2211.05100.
Xu, Can, Qingfeng Sun, Kai Zheng, et al. 2023. “WizardLM:
Empowering Large Language Models to Follow Complex Instructions.”
arXiv Preprint arXiv:2304.12244.
Xu, Weizheng, Youtao Zhang, and Xulong Tang. 2021. “Parallelizing
DNN Training on GPUs: Challenges and Opportunities.”
Companion Proceedings of the Web Conference 2021, 174–78.
Yang, Charlene, Yunsong Wang, Thorsten Kurth, Steven Farrell, and Samuel
Williams. 2021. “Hierarchical Roofline Performance Analysis for
Deep Learning Applications.” Intelligent Computing:
Proceedings of the 2021 Computing Conference, Volume 2, 473–91.
Yao, Shunyu, Dian Yu, Jeffrey Zhao, et al. 2023. “Tree of
Thoughts: Deliberate Problem Solving with Large Language Models.”
Advances in Neural Information Processing Systems 36: 11809–22.
Yao, Shunyu, Jeffrey Zhao, Dian Yu, et al. 2022. “ReAct:
Synergizing Reasoning and Acting in Language Models.” The
Eleventh International Conference on Learning Representations.
Young, John W. 1974. “A First Order Approximation to the Optimum
Checkpoint Interval.” Communications of the ACM 17 (9):
530–31.
Yu, Guangba, Zirui Wang, Yujie Huang, et al. 2026. “Why Does the
LLM Stop Computing: An Empirical Study of User-Reported Failures in
Open-Source LLMs.” arXiv Preprint arXiv:2601.13655.
Yu, Qiying, Zheng Zhang, Ruofei Zhu, et al.
2025. “DAPO: An Open-Source LLM Reinforcement Learning System at
Scale.” arXiv Preprint arXiv:2503.14476. https://arxiv.org/abs/2503.14476.
Yun, Jiwoong, Sangmin Choi, François Rameau, Byeongho Kang, and Ziyan
Fu. 2025. Revisiting 16-Bit Neural Network Training: A Practical
Approach for Resource-Limited Learning. https://doi.org/10.48550/arXiv.2305.10947.
Z, Haitao. 2025. What Is the MFU for DeepSeek-V3 Training?
Medium.com. https://medium.com/@dlrover/what-is-the-mfu-for-deepseek-v3-training-0d9ea4d42eb4.
Zech, John R., Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph
J. Titano, and Eric Karl Oermann. 2018. “Variable Generalization
Performance of a Deep Learning Model to Detect Pneumonia in Chest
Radiographs: A Cross-Sectional Study.” PLOS Medicine 15
(11): e1002683. https://doi.org/10.1371/journal.pmed.1002683.
Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Franziska
Roesner. 2019. “HellaSwag: Can a Machine Really Finish Your
Sentence?” arXiv Preprint arXiv:1905.07830.
Zhang, Jixiao, and Chunsheng Zuo. 2025. “GRPO-LEAD: A
Difficulty-Aware Reinforcement Learning Approach for Concise
Mathematical Reasoning in Language Models.” Proceedings of
the 2025 Conference on Empirical Methods in Natural Language
Processing, 5642–65.
Zhang, Shengyu, Linfeng Dong, Xiaoya Li, et
al. 2026. “Instruction Tuning for Large Language Models: A
Survey.” ACM Computing Surveys 58 (7): 1–36.
Zhao, R., B. Vogel, and T. Ahmed. 2019. “Adaptive Loss Scaling for
Mixed Precision Training.” arXiv Preprint
arXiv:1910.12385, ahead of print. https://doi.org/10.48550/arXiv.1910.12385.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging
LLM-as-a-Judge with MT-Bench and Chatbot Arena. https://arxiv.org/abs/2306.05685.
Zhong, Tianle, Jiechen Zhao, Xindi Guo, Qiang Su, and Geoffrey Fox.
2024. “Optimizing Data I/O for LLM Datasets on Remote
Storage.” AIOps.
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. 2015. Learning Deep Features for
Discriminative Localization. arXiv. https://doi.org/10.48550/arXiv.1512.04150.
Zhou, Chunting, Pengfei Liu, Puxin Xu, et
al. 2023. “LIMA: Less Is More for Alignment.”
Advances in Neural Information Processing Systems 36: 55006–21.
Zhou, Denny, Nathanael Schärli, Le Hou, et
al. 2022. “Least-to-Most Prompting Enables Complex
Reasoning in Large Language Models.” arXiv Preprint
arXiv:2205.10625.
Zmora, Neta, Hao Wu, and Jay Rodge. 2021. “Achieving FP32 Accuracy
for INT8 Inference Using Quantization Aware Training with NVIDIA
TensorRT.” NVIDIA Technical Blog. https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/.