About
  • Systems/AI engineer specializing in C++ kernels and LLM inference on NPU/GPU.
  • At Microsoft (AI Frameworks, Vancouver): work on MAIA kernels and inference pipelines; build high-performance AI software in C/C++/Python; collaborate with HW/ML teams to optimize large-scale training/inference on Microsoft accelerators.
  • Previously at Huawei: co-developed sgmv/bgmv LoRA operators for vLLM (Ascend) and led database-engine performance work.
  • Core skills: profiling-driven optimization (SIMD, tiling, memory-layout design) and throughput tuning across kernels, runtime, and I/O.
  • MASc, University of Waterloo (persistent memory).
  • Open source: Ascend-vLLM LoRA kernels.
  • Tutoring: Algorithms, Data Structures, C++, Python.
Basic Information
Email: moridi@uwaterloo.ca
Location: Vancouver, British Columbia, Canada
Experience
Software Engineer II

Nov 2024 - Present | Microsoft — AI Frameworks | Vancouver, Canada

  • Optimize C++ LLM kernels and Python bindings for Microsoft AI accelerators (MAIA); improved inference throughput/latency in internal benchmarks.
  • Implemented MAIA-specific kernels, delivering approximately 40% lower latency than prior Triton implementations across targeted ops.
  • Co-designed microbenchmarks, productionized changes with CI, and profiled end-to-end paths (kernels, runtime, I/O); reduced stalls via SIMD, tiling, cache-friendly layouts, and double buffering.
Senior Research Engineer

Sep 2023 - Nov 2024 | Huawei Technologies Canada | Burnaby, Canada

LLM Inference (May 2024 - Nov 2024):

  • Co-developed Ascend NPU operators (sgmv, bgmv) to accelerate LoRA in vLLM; validated on multi-card nodes; subsequently open-sourced by the team under the vLLM Ascend repo.
  • Achieved 3-5x speedups over a BMM baseline across diverse LoRA distribution scenarios.
  • Built latency/throughput regression tests and CI checks; profiled kernels and runtime to remove bottlenecks.

Database Systems (Sep 2023 - May 2024):

  • Led R&D for an in-memory OLAP DB; designed and implemented asynchronous scan operators.
  • Increased CPU utilization by 20-40% on TPC-H queries via vectorized execution, cache-aware algorithms, and I/O overlap.

Open source: the Ascend kernels were released at vllm-project/vllm-ascend, documenting how the NPU-specific sgmv/bgmv paths integrate with vLLM to speed up LoRA serving on Ascend hardware.

Graduate Research Assistant

Sep 2021 - Sep 2023 | University of Waterloo | Waterloo, Canada

Conducted research on shared-memory algorithms.

Designed and implemented snapshotting mechanisms for persistent memory-mapped files and detectable objects for persistent memory.

Graduate Teaching Assistant

Jan 2022 - Apr 2023 | University of Waterloo | Waterloo, Canada

Assisted in teaching courses in Algorithm Design, Database Systems, Systems Programming, and Data Abstraction.

Blockchain Researcher Intern

May 2022 - Aug 2022 | Huawei Technologies Canada Co., Ltd. | Waterloo, Canada

Worked on a collaborative project titled "Towards a High-velocity Permissioned Blockchain".

Undergraduate Teaching Assistant

Oct 2018 - Jul 2021 | University of Tehran | Tehran, Iran

Assisted in more than 15 courses, including Advanced Programming and Operating Systems.

Research Data Scientist

Oct 2019 - Apr 2021 | University of Tehran Science & Technology Park | Tehran, Iran

Analyzed social network behavior in the Ethereum network using graph analysis.

Research Data Science Intern

Jul 2019 - Sep 2019 | University of Tehran Science & Technology Park | Tehran, Iran

Participated in several Kaggle competitions and developed solutions using Scikit-Learn.

Top Skills
C/C++
Python
Low-Level Programming
Multi-Core Programming
Linux
Java
Education
Master's degree, Computer Software Engineering

Sep 2021 - Sep 2023 | University of Waterloo | Waterloo, Canada

Grade: 94 / 100. Thesis on snapshotting mechanisms for persistent memory-mapped files.

Bachelor's degree, Computer Engineering

Sep 2016 - Jul 2021 | University of Tehran | Tehran, Iran

Grade: 90 / 100. Thesis on social network analysis within the Ethereum ecosystem.

Publications
Snapshotting Mechanisms for Persistent Memory-Mapped Files

Published in ApPLIED@PODC 2024. DOI: 10.1145/3663338.3665832

Investigates ways to enhance the reliability of persistent memory systems, focusing on snapshotting mechanisms and their role in system resilience. Introduces new snapshotting consistency models and mechanisms to improve performance and enhance system responsiveness. Provides experimental analysis demonstrating throughput and latency improvements.

A Closer Look at Detectable Objects for Persistent Memory

Published in ApPLIED@PODC 2022. DOI: 10.1145/3524053.3542749

Explores the adaptation of multi-core algorithms to persistent memory, introducing the "Unified Detectable Sequential Specification" (UDSS), which simplifies interfaces and coding. Experiments conducted using Intel Optane memory demonstrate the performance implications of the implementation.

Open-Source Contributions
vLLM Ascend Kernels (sgmv, bgmv)

Co-developed NPU-specific operators to accelerate LoRA serving on Ascend hardware and integrated them into vLLM's execution path. Validated on multi-card nodes and later open-sourced by the team.

github.com/vllm-project/vllm-ascend/tree/main/csrc/kernels

  • 3-5x speedups over BMM baselines for diverse LoRA distribution scenarios.
  • Latency/throughput regression tests and CI checks to prevent performance regressions.
  • Documented NPU-specific integration for kernel, runtime, and I/O paths.

Montage (UW) - Online Snapshotting

Research code and experiments for online snapshotting used in persistent-memory work at the University of Waterloo.

gitlab.uwaterloo.ca/mmoridi/montage-uw/-/tree/online-snapshotting

UDSS (Unified Detectable Sequential Specification)

Implementation supporting detectable objects / persistent-memory queue experiments with Prof. Golab's group.

git.uwaterloo.ca/wgolab/DSSQueue


These contributions align with my current work on C++ kernels, LLM inference, and profiling-driven optimization (SIMD, tiling, cache-friendly layouts) across kernels, runtime, and I/O.

Certificates
ExpecTAtions Teaching Assistant Workshop

Issued by University of Waterloo in Oct 2021

IEEE Data Science Winter School as Machine Learning Instructor

Issued by IEEE University of Tehran Student Branch in Jan 2020

IEEE Data Science Winter School as Statistical Inference Instructor

Issued by IEEE University of Tehran Student Branch in Jan 2020

Workshop on Teaching Assistant Training

Issued by University of Tehran in Apr 2019

Honors and Awards
2024 Q2 President's Spot Award

Issued by Huawei Technologies Canada in Oct 2024.

Received the Best Team Award from the President of the Canadian Research Center for outstanding contributions to LLM serving performance.

ETHGlobal Waterloo 2023 Finalist

Issued by ETHGlobal in Jun 2023.

Project: Smarter Contract.

Winner of Best Usecase of Hyperlane

Issued by Hyperlane in Jun 2023.

Project: Smarter Contract.

First Place in the Axelar Track at OlympiHacks

Issued by Waterloo Blockchain in May 2023.

Project: Glue.

International Master's Award of Excellence

Issued by University of Waterloo in Sep 2021.


Graduate Research Studentship

Issued by University of Waterloo in Sep 2021.


References
Contact Me


Tutoring

Online/In-Person Tutoring