English PDF Chinese PDF Standalone HTML

Yonghao Tan

Ph.D. Candidate in Electronic and Computer Engineering, The Hong Kong University of Science and Technology (HKUST)
HKUST, Clear Water Bay, Hong Kong, China +852 53946864 ytanaz@connect.ust.hk
yonghaot1017@gmail.com yonghao-tan.github.io ORCID 0000-0001-5372-5863

Education

The Hong Kong University of Science and Technology (HKUST): Hong Kong, China

Supervisor: Prof. Tim Kwang-Ting Cheng

Southern University of Science and Technology (SUSTech): Shenzhen, Guangdong, China

Supervisor: Prof. Fengwei An

5nm UCIe-Enabled Multi-Chiplet Generalizable Rendering Processor: AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong, China; Mar. 2024 - Sept. 2025

Designed the architecture and execution scheduler for a 5nm four-chiplet GeNeRF processor, and built an early-stage simulation framework to evaluate buffer management, cross-die source-view caching, dense/sparse D2D transfer, multi-level sparsity control, and hybrid GeNeRF-SR dataflow.
Developed algorithm-hardware co-design simulation models for projection-area-driven source-view placement, dynamic patch grouping, dense/sparse D2D modes, source-view pruning, tile-level fine-stage sharing, and patch-grained SR routing; used them to validate accuracy-sensitive sparsity and SR decisions and guide hardware-aware end-to-end quantization for silicon deployment.
Contributed to the implementation of inter-chiplet transfer, chip-level control, and nonlinear computation modules; participated in chip testing and system validation of the fabricated 5nm MCM processor, which integrates the four-chiplet GeNeRF engine in a 45 mm x 45 mm package and achieves 91.43 TOPS/W and 55.43 FPS throughput.

55nm ReRAM-on-Logic Stacked LLM Accelerator: AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong, China; Apr. 2024 - Aug. 2025

Contributed to algorithm-hardware co-design optimizations for a 55nm ReRAM-on-logic stacked edge LLM accelerator, targeting decoding-stage memory and scheduling bottlenecks with local-rotation-based W4A8 quantization, ReRAM-resident block-clustered codebook reconstruction, and adaptive speculative decoding.
Implemented and validated the end-to-end algorithm flow, including layer-wise quantized LLM evaluation, codebook-based draft-model weight reconstruction, acceptance/rejection-aware speculative decoding analysis, and hardware-mapping studies that balance target-model EMA reduction against rejected-draft overhead.
Designed RTL for nonlinear computation modules and related control/datapath logic; assisted chip testing and validation of the 55nm logic die stacked with four ReRAM dies via face-to-face bumps, achieving 14.08 to 135.69 token/s on a 55.98 mm² logic die.

28nm CNN-Transformer Accelerator for Semantic Segmentation: AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong, China; Nov. 2021 - Sept. 2024

Contributed to the development of algorithm-hardware co-design optimizations for a 28nm CNN-Transformer semantic-segmentation accelerator, including hybrid attention processing, data-reuse-oriented layer fusion, and cascaded feature-map pruning for high-resolution ConvFormer workloads.
Built a hardware energy simulation framework to quantify optimization impact and guide architecture and tape-out decisions; implemented the algorithm validation flow for VA/LA hybrid attention, KV/weight reuse scheduling, non-overlap layer fusion, and mask-based cascaded pruning.
Designed RTL for attention/layer-fusion control and pruning-related datapath logic; assisted silicon testing and validation of the 13.93 mm² 28nm chip, achieving 0.22 uJ/token and up to 52.90 TOPS/W peak efficiency.

A 5nm 91.43 TOPS/W 4-Chiplet Generalizable-Rendering-Processor with UCIe-Enabled Cross-Die-Cache and Balance-Aware Progressive Multi-Level Sparsity: Tan, Y.*, Ma, S.*, Dong, P., Luo, P., Lei, Z., Lu, W., Ying, G., ... & Cheng, K. T.; 2026 IEEE Custom Integrated Circuits Conference (CICC), IEEE.

A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding: Dong, P., Tan, Y., Liu, X., Luo, P., Liu, Y., Pang, D., Ma, S., ... & Cheng, K. T.; 2026 IEEE International Solid-State Circuits Conference (ISSCC), IEEE.

A 28nm 0.22uJ/Token Memory-Compute-Intensity-Aware CNN-Transformer Accelerator with Hybrid-Attention-Based Layer-Fusion and Cascaded Pruning for Semantic-Segmentation: Dong, P.*, Tan, Y.*, Liu, X., Luo, P., Liu, Y., Liang, L., ... & Cheng, K. T.; 2025 IEEE International Solid-State Circuits Conference (ISSCC), IEEE.

APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design: Tan, Y.*, Dong, P.*, Wu, Y., Liu, Y., Liu, X., Luo, P., Liu, S. Y., Huang, X., Zhang, D., Liang, L., & Cheng, K. T.; 2025 62nd ACM/IEEE Design Automation Conference (DAC), IEEE.

Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers: Dong, P.*, Tan, Y.*, Zhang, D., Ni, T., Liu, X., Liu, Y., ... & Cheng, K. T.; 2024 61st ACM/IEEE Design Automation Conference (DAC), IEEE.

A Reconfigurable Coprocessor for Simultaneous Localization and Mapping Algorithms in FPGA: Tan, Y.*, Deng, H.*, Sun, M., Zhou, M., Chen, Y., Chen, L., ... & An, F.; IEEE Transactions on Circuits and Systems II: Express Briefs, 70(1), 286-290, 2022.

A Reconfigurable Visual-Inertial Odometry Accelerator with High Area and Energy Efficiency for Autonomous Mobile Robots: Tan, Y.*, Sun, M.*, Deng, H., Wu, H., Zhou, M., Chen, Y., ... & An, F.; Sensors, 22(19), 7669, 2022.

* Authors marked with an asterisk contributed equally to the corresponding work.

Best Teaching Assistant Award, Department of Electronic and Computer Engineering, HKUST: Aug. 2025

First-Class Outstanding Students Scholarship with the highest score: Sept. 2022

Shenzhen Longsys Electronics Company Award (Top 2% in the School of Microelectronics): Dec. 2021

First Prize, 2021 National College Students FPGA Innovation Design Competition (Top 22 out of 1,341 teams): Dec. 2021

First Prize, 2021 International Competition of Autonomous Running Robots (1st place out of 34 finalist teams): Oct. 2021

Research Interests: Software/Hardware Co-Design, Model Compression, 3D Processing

Programming Languages: C, C++, Java, Python, SystemVerilog, Verilog HDL, VHDL

Professional Software: AutoCAD, Cadence, Design Compiler, IC Compiler II, MATLAB, Multisim, Silvaco

Languages: English (fluent), Mandarin (native), Cantonese (native)