- Systems/AI engineer specializing in C++ kernels and LLM inference on NPU/GPU.
- At Microsoft (AI Frameworks, Vancouver): work on MAIA kernels and inference pipelines; build high-performance AI software in C/C++/Python; collaborate with HW/ML teams to optimize large-scale training/inference on Microsoft accelerators.
- Previously at Huawei: co-developed sgmv/bgmv LoRA operators for vLLM (Ascend) and led database-engine performance work.
- Core skills: profiling-driven SIMD, tiling, memory-layout design, and throughput tuning across kernels, runtime, and I/O.
- MASc, University of Waterloo (persistent memory).
- Open source: Ascend-vLLM LoRA kernels.
- Tutoring: Algorithms, Data Structures, C++, Python.
Nov 2024 - Present | Microsoft — AI Frameworks | Vancouver, Canada
- Optimize C++ LLM kernels and Python bindings for Microsoft AI accelerators (MAIA); improved inference throughput/latency in internal benchmarks.
- Implemented MAIA-specific kernels, delivering ~40% lower latency versus prior Triton implementations across targeted ops.
- Co-designed microbenchmarks, productionized changes with CI, and profiled end-to-end paths (kernels, runtime, I/O); reduced stalls via SIMD, tiling, cache-friendly layouts, and double buffering.
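MAIA kernel code is internal, so as a rough illustration of the tiling and double-buffering pattern named above, here is a minimal CPU-side C++ sketch; the tile size, the std::async staging, and the tiled_sum reduction are assumptions for the example, not MAIA APIs.

```cpp
// Illustrative sketch only, not MAIA code: tiled processing with double
// buffering, where the next tile is staged while the current tile is computed.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <future>
#include <numeric>
#include <vector>

constexpr std::size_t kTile = 4096;  // illustrative tile size (elements)

// Stage one tile into a local buffer (stand-in for a DMA / async copy on an
// accelerator).
static void load_tile(const float* src, float* dst, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(float));
}

float tiled_sum(const std::vector<float>& data) {
    std::vector<float> buf[2] = {std::vector<float>(kTile), std::vector<float>(kTile)};
    float acc = 0.0f;
    const std::size_t n = data.size();
    std::future<void> pending;  // staging of the *next* tile, if any

    for (std::size_t off = 0, cur = 0; off < n; off += kTile, cur ^= 1) {
        const std::size_t len = std::min(kTile, n - off);
        if (off == 0) load_tile(data.data(), buf[cur].data(), len);  // prime first tile
        else pending.get();                                          // wait for staged tile

        // Kick off staging of the next tile into the other buffer.
        const std::size_t next_off = off + kTile;
        if (next_off < n) {
            const std::size_t next_len = std::min(kTile, n - next_off);
            pending = std::async(std::launch::async, load_tile,
                                 data.data() + next_off, buf[cur ^ 1].data(), next_len);
        }

        // Compute on the current tile while the next one is being staged.
        acc = std::accumulate(buf[cur].begin(), buf[cur].begin() + len, acc);
    }
    return acc;
}
```

On an accelerator the staging step would be an asynchronous copy into local/scratch memory rather than a host thread, but the ping-pong structure is the same.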
Sep 2023 - Nov 2024 | Huawei Technologies Canada | Burnaby, Canada
LLM Inference (May 2024 - Nov 2024):
- Co-developed Ascend NPU operators (sgmv, bgmv) to accelerate LoRA in vLLM; validated on multi-card nodes; subsequently open-sourced by the team under the vLLM Ascend repo (simplified reference sketch after these bullets).
- Achieved 3-5x speedups over a BMM baseline across diverse LoRA distribution scenarios.
- Built latency/throughput regression tests and CI checks; profiled kernels and runtime to remove bottlenecks.
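The actual operators live in vllm-project/vllm-ascend; below is only a minimal CPU reference of a bgmv-style LoRA "shrink" step. The function name, argument layout, and the -1 "no adapter" convention are assumptions for illustration, not the Ascend kernel interface.

```cpp
// Illustrative CPU reference only, not the Ascend operators: a bgmv-style
// LoRA "shrink" step where each token in the batch selects its own low-rank
// adapter A[idx] and computes out[b] = x[b] * A[idx]^T.
#include <cstddef>
#include <vector>

// x:       [batch, hidden]            row-major, flattened
// A:       [num_loras, rank, hidden]  row-major, flattened
// indices: [batch]                    LoRA id per token (-1 = no adapter)
// out:     [batch, rank]              row-major, flattened (pre-sized by caller)
void bgmv_shrink_ref(const std::vector<float>& x,
                     const std::vector<float>& A,
                     const std::vector<int>& indices,
                     std::vector<float>& out,
                     std::size_t batch, std::size_t hidden, std::size_t rank) {
    for (std::size_t b = 0; b < batch; ++b) {
        const int lora = indices[b];
        if (lora < 0) continue;  // token has no LoRA adapter attached
        const float* xb = &x[b * hidden];
        const float* Ab = &A[static_cast<std::size_t>(lora) * rank * hidden];
        for (std::size_t r = 0; r < rank; ++r) {
            float acc = 0.0f;
            for (std::size_t h = 0; h < hidden; ++h)
                acc += xb[h] * Ab[r * hidden + h];
            out[b * rank + r] = acc;
        }
    }
}
```

Roughly, sgmv does the same work on tokens pre-grouped into contiguous per-adapter segments, so adapter weights can be reused across a segment instead of being re-fetched per token.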
Database Systems (Sep 2023 - May 2024):
- Led R&D for an in-memory OLAP DB; designed and implemented asynchronous scan operators.
- Increased CPU utilization by 20-40% on TPC-H queries via vectorized execution, cache-aware algorithms, and I/O overlap.
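As an illustration of the vectorized-execution style referenced above (not the actual engine code), here is a minimal batch-at-a-time filter with a selection vector; the batch size, column layout, and function names are assumptions for the example.

```cpp
// Illustrative sketch only: vectorized (batch-at-a-time) predicate evaluation
// producing a selection vector, the pattern behind cache-friendly scan operators.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBatch = 1024;  // one vector of values per call

// Evaluate "price < limit" over one batch of a column, writing positions of
// qualifying rows into sel and returning how many matched. The tight,
// branch-light loop is what makes auto-vectorization (SIMD) possible.
std::size_t filter_lt(const double* price, double limit,
                      std::size_t n, std::uint32_t* sel) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sel[k] = static_cast<std::uint32_t>(i);
        k += (price[i] < limit) ? 1 : 0;  // branch-free append
    }
    return k;
}

// Scan a column one batch at a time; downstream operators consume
// (batch, selection vector) pairs instead of individual rows.
double sum_under_limit(const std::vector<double>& price, double limit) {
    std::vector<std::uint32_t> sel(kBatch);
    double total = 0.0;
    for (std::size_t off = 0; off < price.size(); off += kBatch) {
        const std::size_t n = std::min(kBatch, price.size() - off);
        const std::size_t k = filter_lt(price.data() + off, limit, n, sel.data());
        for (std::size_t j = 0; j < k; ++j) total += price[off + sel[j]];
    }
    return total;
}
```

The I/O overlap mentioned above would sit around this loop, prefetching the next column block while the current batch is filtered, similar in spirit to the double-buffering sketch earlier.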
Open source: the Ascend kernels were released at vllm-project/vllm-ascend, documenting how the NPU-specific sgmv/bgmv paths integrate with vLLM to speed up LoRA serving on Ascend hardware.
Sep 2021 - Sep 2023 | University of Waterloo | Waterloo, Canada
Research experience in shared-memory algorithms.
Designed and implemented snapshotting mechanisms for persistent memory-mapped files and detectable objects for persistent memory.
Jan 2022 - Apr 2023 | University of Waterloo | Waterloo, Canada
Assisted in teaching courses in Algorithm Design, Database Systems, Systems Programming, and Data Abstraction.
May 2022 - Aug 2022 | Huawei Technologies Canada Co., Ltd. | Waterloo, Canada
Worked on a collaborative project called Towards a High-velocity Permissioned Blockchain.
Oct 2018 - Jul 2021 | University of Tehran | Tehran, Iran
Assisted in teaching more than 15 courses, including Advanced Programming and Operating Systems.
Oct 2019 - Apr 2021 | University of Tehran Science & Technology Park | Tehran, Iran
Analyzed social network behavior in the Ethereum network using graph analysis.
Jul 2019 - Sep 2019 | University of Tehran Science & Technology Park | Tehran, Iran
Participated in several Kaggle competitions and developed solutions using scikit-learn.
Sep 2021 - Sep 2023 | University of Waterloo | Waterloo, Canada
Grade: 94 / 100. Thesis on snapshotting mechanisms for persistent memory-mapped files.
Sep 2016 - Jul 2021 | University of Tehran | Tehran, Iran
Grade: 90 / 100. Thesis on social network analysis within the Ethereum ecosystem.
Published in ApPLIED@PODC 2024. DOI: 10.1145/3663338.3665832
Investigates ways to enhance the reliability of persistent-memory systems, focusing on snapshotting mechanisms and their role in system resilience. Introduces new snapshotting consistency models and mechanisms that improve performance and responsiveness, and provides experimental analysis demonstrating throughput and latency improvements.
Published in ApPLIED@PODC 2022. DOI: 10.1145/3524053.3542749
Explores the adaptation of multi-core algorithms to persistent memory, introducing the "Unified Detectable Sequential Specification" (UDSS), which simplifies interfaces and coding. Experiments conducted using Intel Optane memory demonstrate the performance implications of the implementation.
vLLM Ascend Kernels (sgmv, bgmv)
Co-developed NPU-specific operators to accelerate LoRA serving on Ascend hardware and integrated them into vLLM's execution path. Validated on multi-card nodes and later open-sourced by the team.
github.com/vllm-project/vllm-ascend/tree/main/csrc/kernels
- 3-5x speedups over BMM baselines for diverse LoRA distribution scenarios.
- Latency/throughput regression tests and CI checks to prevent performance regressions.
- Documented NPU-specific integration for kernel, runtime, and I/O paths.
Montage (UW) - Online Snapshotting
Research code and experiments for online snapshotting used in persistent-memory work at the University of Waterloo.
gitlab.uwaterloo.ca/mmoridi/montage-uw/-/tree/online-snapshotting
UDSS (Unified Detectable Sequential Specification)
Implementation supporting detectable objects / persistent-memory queue experiments with Prof. Golab's group.
git.uwaterloo.ca/wgolab/DSSQueue
These contributions align with my current work on C++ kernels, LLM inference, and profiling-driven optimization (SIMD, tiling, cache-friendly layouts) across kernels, runtime, and I/O.
Issued by University of Waterloo in Oct 2021
Issued by IEEE University of Tehran Student Branch in Jan 2020
Issued by IEEE University of Tehran Student Branch in Jan 2020
Issued by University of Tehran in Apr 2019
Issued by Huawei Technologies Canada in Oct 2024.
Awarded the Best Team Award by the President of the Canadian Research Center for outstanding contributions to LLM serving performance.
Issued by ETHGlobal in Jun 2023.
Project: Smarter Contract.
Issued by Hyperlane in Jun 2023.
Project: Smarter Contract.
Issued by Waterloo Blockchain in May 2023.
Project: Glue.
Issued by University of Waterloo in Sep 2021.
