Zhiyu Ding

High Performance Computing

Zhiyu Ding is from Weifang, Shandong, the "Kite Capital" of China. He enrolled in 2023 as an undergraduate majoring in Data Science and Big Data Technology at Southwest Petroleum University, where he currently ranks 1st out of 64 students in his major. He has a solid foundation in high-performance computing and is proficient in parallel programming with C/C++, CUDA, MPI, and OpenMP. Through competitions such as the ASC Student Supercomputer Challenge, IndySCC@SC24, the Marine Computing Challenge, and the Operator Development Challenge, he has gained hands-on experience in performance optimization, including ocean simulation parallelization, deep learning operator development, and SpMV algorithm optimization on heterogeneous platforms. He focuses on algorithm optimization and performance tuning on CPU, GPU, and Sunway heterogeneous platforms, and brings strong problem analysis and execution skills. He continues to deepen his study of high-performance computing and hopes to build a long-term career in the field.

Education

Southwest Petroleum University

School of Computer and Software Engineering, Data Science and Big Data Technology
– Present

Academic Performance: GPA: 4.1/5.0, Major Ranking: 1/65

Courses
  • Big Data Platform Technology and Applications (98), Python (96), Object-Oriented Programming (95)

  • Statistics Principles (93), Introduction to Artificial Intelligence (92), Linear Algebra (91), Data Structures and Algorithms (90)

Projects

Parallel Computing Optimization for Oil Spill Prediction Model

Team Leader at Team: Dream Brook
Competition

This project comes from the finals of the 2024 Marine Computing Challenge. Starting from a self-developed two-dimensional oil spill prediction model, the team accelerated the program with parallel computing techniques while preserving the correctness of the Euler method used to solve the trajectory equations and of the vector-based test that determines whether oil particles are adsorbed onto the shore.

  • Responsibilities: Implemented hybrid MPI and OpenMP parallelization of the original serial program, using load balancing to make full use of the 2-node, 128-core computing resources. For memory access optimization, rearranged the data access order in the Fortran/C code to exploit memory locality and improve cache hit rates. For communication optimization, used non-blocking communication and packed data transfers. At the algorithm level, used fast exclusion of non-intersecting line segments and binary search to quickly determine the relationship between oil particle trajectories and coastlines.

  • Project Results: Passed correctness verification with the test cases provided by the organizing committee. On the initial test case, achieved an approximately 2482.14x speedup over the baseline, ranking fifth among the finalist teams and winning a national third prize.
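
The two algorithmic ideas named above, an explicit Euler step for a particle trajectory and fast exclusion of non-intersecting segments, can be sketched as follows. This is a minimal illustration under stated assumptions, not the competition code: the function names, the 2D tuple representation, and the use of a bounding-box test as the fast-exclusion step are all assumptions made for this example.

```python
from typing import Tuple

Point = Tuple[float, float]


def euler_step(pos: Point, vel: Point, dt: float) -> Point:
    """Advance a particle by one explicit Euler step: x_new = x + v * dt."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def boxes_overlap(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """Cheap rejection test: do the axis-aligned bounding boxes overlap?"""
    return (min(p1[0], p2[0]) <= max(q1[0], q2[0])
            and min(q1[0], q2[0]) <= max(p1[0], p2[0])
            and min(p1[1], p2[1]) <= max(q1[1], q2[1])
            and min(q1[1], q2[1]) <= max(p1[1], p2[1]))


def segments_intersect(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """Return True if segment p1-p2 touches or crosses segment q1-q2."""
    # Fast exclusion: most coastline segments are far from the particle path.
    if not boxes_overlap(p1, p2, q1, q2):
        return False
    # Straddle test: each segment's endpoints lie on opposite sides
    # (or on the line) of the other segment.
    d1 = cross(q1, q2, p1)
    d2 = cross(q1, q2, p2)
    d3 = cross(p1, p2, q1)
    d4 = cross(p1, p2, q2)
    return d1 * d2 <= 0 and d3 * d4 <= 0
```

In a sketch like this, a particle's move from its old to its new position forms one segment, which is tested only against coastline segments that survive the bounding-box check.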

AlphaFold3-based Protein Structure Prediction Inference Optimization Project

Selected from the ASC25 Student Supercomputer Challenge, this project optimizes inference performance for AlphaFold3, the protein structure prediction model developed by Google DeepMind. The task is to minimize inference time on both GPU and CPU platforms while maintaining prediction accuracy across 12 protein sequence samples of different lengths, which involves a complex diffusion model architecture and optimization within the JAX deep learning framework.

  • Responsibilities: Deployed the AlphaFold3 environment on a hybrid NVIDIA A100 GPU and Intel Xeon CPU architecture and profiled inference with cProfile, which showed that JAX JIT compilation consumes a significant share of the time. For GPU optimization, disabled Triton GEMM compilation and tuned the compilation bucket parameters. For CPU optimization, resolved numerical precision issues in the diffusion_head.py module by fixing NaN errors caused by square roots of negative values and adding an epsilon term for numerical stability.

  • Project Results: GPU optimization achieved a 1.2-2.4x performance improvement across different sequence lengths, and CPU optimization achieved a 1.1-5.3x speedup, with the largest gains on short sequences. Prediction confidence values remained consistent with the benchmark code, providing an efficient inference solution for AlphaFold3 applications in bioinformatics and drug design.
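
The epsilon technique mentioned above can be illustrated with a generic guarded square root. This is a sketch of the general pattern only: the actual change made in diffusion_head.py is not shown in this profile, and the helper name `safe_sqrt` and the value of `eps` are assumptions.

```python
import math


def safe_sqrt(x: float, eps: float = 1e-10) -> float:
    """Square root that tolerates tiny negative inputs from round-off.

    Quantities that are non-negative in exact arithmetic (squared
    distances, variances) can come out slightly negative in floating
    point. Array libraries typically return NaN for the square root of
    such a value, while Python's math.sqrt raises ValueError. Clamping
    at zero and adding a small eps keeps the result finite.
    """
    return math.sqrt(max(x, 0.0) + eps)
```

The trade-off is a small bias near zero (here `safe_sqrt(0.0)` is about 1e-5), so eps should be chosen well below the scale of meaningful values.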

NAMD-based Molecular Dynamics Simulation Performance Optimization Challenge

Selected from IndySCC@SC24 International Student Cluster Competition, this project focuses on large-scale molecular dynamics simulation optimization for biomolecular systems. The project covers multiple levels of biological computing challenges including water molecule physical property analysis, protein folding dynamics, and thermodynamic integration free energy calculations, requiring efficient simulation of systems ranging from 100,000 to 20 million atoms within a limited 48-hour timeframe.

  • Responsibilities: Deployed NAMD and configured GPU acceleration on the Jetstream2 cloud platform. Applied advanced sampling techniques including the extended adaptive biasing force (eABF) method, replica exchange molecular dynamics, and thermodynamic integration. Increased the time step from 2 fs to 4 fs through hydrogen mass repartitioning, and used GPU parallel computing strategies to schedule multiple replicas running simultaneously.

  • Project Results: Successfully completed precise calculations of water molecule heat capacity and diffusion coefficients, achieved convergent calculation of deca-alanine protein α-helix folding free energy curves, reached approximately 15 nanoseconds/day simulation performance on A100 GPU, and significantly improved computational efficiency for large-scale biomolecular systems through algorithmic optimization while maintaining computational accuracy.
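
Hydrogen mass repartitioning, mentioned above, moves mass from each heavy atom onto its bonded hydrogens so that the fastest bond vibrations slow down and a larger time step stays stable, while the total mass is unchanged. The sketch below shows only this bookkeeping. The function name, the input layout, and the scaling factor of 3 are assumptions for illustration; it is not NAMD's implementation or the configuration used in the competition.

```python
from typing import List, Tuple


def repartition_hydrogen_mass(
    masses: List[float],
    bonds: List[Tuple[int, int]],
    is_hydrogen: List[bool],
    factor: float = 3.0,  # assumed scaling factor for hydrogen masses
) -> List[float]:
    """Scale each bonded hydrogen's mass and subtract the gain from its partner."""
    new_masses = list(masses)
    for i, j in bonds:
        # Check both orientations of the bond for a hydrogen/heavy-atom pair.
        for h, heavy in ((i, j), (j, i)):
            if is_hydrogen[h] and not is_hydrogen[heavy]:
                delta = masses[h] * (factor - 1.0)
                new_masses[h] += delta
                new_masses[heavy] -= delta  # total mass is conserved
    return new_masses
```

For a carbon bonded to four hydrogens (12.011 and 1.008 amu), this yields hydrogens of 3.024 amu and a carbon of 3.947 amu, with the same total mass as before.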

MLPerf Inference-based BERT Model Inference Performance Optimization

Selected from the MLPerf Inference benchmark challenge of the IndySCC24 international supercomputing competition, this project optimizes inference performance of the BERT-99 large language model on a question-answering task (SQuAD v1.1 dataset). The project requires efficient inference on heterogeneous CPU and GPU platforms and uses the MLCommons CM automation framework for benchmark configuration and result submission, drawing on deep learning inference optimization, parallel computing, and performance tuning.

  • Responsibilities: Deployed the MLPerf environment on a hybrid AMD EPYC 7713 CPU and NVIDIA A100 GPU architecture, resolving permission configuration and file packaging issues along the way. Designed and implemented a batched inference strategy covering collection of multiple input samples into batches, a restructured data preprocessing pipeline, GPU parallel inference, and optimized result post-processing. After analyzing inference bottlenecks, restructured the issue_queries method to prepare data in batches and optimized the process_batch method to improve GPU utilization, achieving end-to-end inference performance gains.

  • Project Results: GPU inference throughput reached 85.447 samples/second, a 26.8x improvement over the CPU's 3.193 samples/second. Batch processing raised GPU utilization from a baseline of 54% to 97%, significantly reducing inference latency while maintaining 90.876% accuracy. The results were submitted on GitHub and validated by MLCommons officials.
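
The batching idea described above can be sketched in a few lines: group incoming samples into fixed-size batches, then pad the token sequences in each batch to a common length so they can be processed together. The helper names `make_batches` and `pad_batch` and the padding id of 0 are hypothetical; they are not part of the MLPerf LoadGen API and do not reproduce the project's issue_queries or process_batch code.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(samples: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


def pad_batch(token_ids: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    if not token_ids:
        return []
    width = max(len(seq) for seq in token_ids)
    return [seq + [pad_id] * (width - len(seq)) for seq in token_ids]


def speedup(fast_throughput: float, slow_throughput: float) -> float:
    """Ratio of two throughputs measured in samples per second."""
    return fast_throughput / slow_throughput
```

With the throughputs reported above, `speedup(85.447, 3.193)` rounds to 26.8, matching the stated improvement.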

PCG Algorithm Optimization on New Generation Sunway Supercomputer

Selected from the 7th Domestic CPU Parallel Application Challenge preliminary competition, this project focuses on many-core optimization of the Preconditioned Conjugate Gradient (PCG) algorithm.

  • Responsibilities: For SpMV, the core hotspot, applied an approximately balanced row partitioning strategy and adjusted memory access patterns to fit the LDM space; analyzed the workflow and used the master core to overlap part of the computation, making full use of the LDM space.

  • Project Results: Passed correctness verification and achieved an average 30x speedup.
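
One common way to realize approximately balanced row partitioning for SpMV is to split the rows of a CSR matrix so that each part holds roughly the same number of nonzeros, rather than the same number of rows. The sketch below assumes CSR storage and uses binary search on the row pointer array; the function names are hypothetical, and it does not model Sunway-specific details such as LDM transfers.

```python
from bisect import bisect_left
from typing import List


def partition_rows_by_nnz(row_ptr: List[int], num_parts: int) -> List[int]:
    """Return num_parts + 1 row boundaries giving each part ~equal nonzeros."""
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = total_nnz * p / num_parts
        # First row index whose cumulative nonzero count reaches the target.
        r = bisect_left(row_ptr, target)
        # Keep boundaries monotonic and inside the matrix.
        bounds.append(min(max(r, bounds[-1]), n_rows))
    bounds.append(n_rows)
    return bounds


def spmv_rows(row_ptr: List[int], col_idx: List[int], vals: List[float],
              x: List[float], r0: int, r1: int) -> List[float]:
    """Compute y[r0:r1] of y = A @ x for a CSR matrix A."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(r0, r1)
    ]
```

Each worker would then call `spmv_rows` on its own `[bounds[p], bounds[p + 1])` range. A single very dense row cannot be split by this scheme, which is why the balance is only approximate.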

Awards

ASC2025 Student Supercomputer Challenge International Second Prize

Awarded by ASC Student Supercomputer Challenge Committee

The ASC Student Supercomputer Challenge, initiated in 2012, is the world's largest supercomputing competition and, together with Germany's ISC and the US SC, one of the world's three major supercomputing competitions. This year's AI challenge required teams to run and optimize the AlphaFold3 structure prediction code on different computing platforms, testing their understanding of the AlphaFold3 inference process and their ability to optimize it.

Invited to Participate in SC24 International Supercomputing Competition Online Track IndySCC

Awarded by SC Student Cluster Competition Committee

The Supercomputing Conference (SC) is the top international conference in the supercomputing field, and IndySCC is the online track of its student competition. Like the in-person SC track, which is one of the three major supercomputing competitions, it requires teams to complete given computational tasks within 48 hours under limited resources while achieving the highest possible computational performance.

Third Prize in National Finals of Marine Computing Challenge 2024

Awarded by China Pacific Society and Beijing Parallel Computing Technology Co., Ltd.

The Marine Computing Challenge (MCC) covers marine big data processing and analysis, marine environment simulation and prediction, marine resource development and utilization, marine disaster warning and emergency response, marine artificial intelligence applications, and other application scenarios, comprehensively assessing participants' skills in various marine application fields.

Third Prize in Tecorigin Operator Development Challenge National Finals

Awarded by Second Open Atom Contest - Open Atom Open Source Foundation

The Tecorigin Operator Development Challenge is based on the Teco-AL (Taichu Acceleration Library) unified operator library and uses the SDAA C programming language for operator performance optimization on the Taichu domestic GPU platform. The competition covers optimization of core deep learning operators such as tecoalArgmax, tecoalActivationBackward, and tecoalConvolutionForward, testing participants' high-performance computing skills in parallel computing, memory access optimization, and vector instruction optimization on domestic GPU architectures.

Second Place in Tianyi Cloud Xirang Cup College AI Competition Sichuan Provincial Competition

Awarded by China Telecom Corporation Limited, Huawei Technologies Co., Ltd.

The operator optimization track of the Tianyi Cloud Xirang Cup College AI Competition is based on the Ascend NPU platform and uses AscendC for operator development and performance optimization. The competition covers high-performance implementation of core deep learning operators such as the NLLLossGrad backward operator and the QuantBatchMatmul+Swiglu fusion operator, testing key techniques such as multi-core parallelization, Cube/Vector pipeline optimization, and memory management on the Ascend 910B architecture. It focuses on fine-grained operator optimization for domestic heterogeneous computing infrastructure and promotes innovative applications of high-performance computing in the CANN ecosystem.

Third Prize in the 15th Blue Bridge Cup National Finals

Awarded by Ministry of Industry and Information Technology Talent Exchange Center, Blue Bridge Cup Competition Committee

Competed in the Blue Bridge Cup Python Programming Group. The competition covers basic algorithms, data structures, dynamic programming, graph theory, string processing, mathematical calculations, and other problem types. Participants must complete multiple programming problems within a limited time under the OI competition format, which tests their ability to solve algorithmic problems in Python.

Invited to Participate in 2024 Tencent Kaiwu AI Global Open Invitational

Awarded by 2024 Tencent Kaiwu AI Global Open Competition Committee

Participated in the 2024 Tencent Kaiwu AI Global Open Invitational and successfully completed the 'AIPC High-Performance Gaming Track'.

Outstanding Student First-Class and Second-Class Scholarships

Awarded by Southwest Petroleum University

Certificates

Skills

C/C++, Fortran

CPU, GPU Architecture

CUDA, HIP

CUDA Operator Optimization

OpenMP, MPI

Sunway Supercomputer

Languages

English CET-6

CET6: 478

Mandarin Chinese

Level 2A

English CET-4

CET4: 521

Local Dialects

Limited

Interests

Travel

  • Flying
  • Exploration
  • Hotels

Photography

  • Capturing
  • Recording
  • Memories