Omid Mashayekhi

I am a graduate student at Stanford University, currently, a Ph.D. candidate in Electrical Engineering, and a Ph.D. minor candidate in Computer Science. My interests include cloud computing, distributed systems, and networking systems. I work with Professor Philip Levis as a member of Stanford Information Networks Group (SING). I received M.S. degree from Stanford University in 2013, and B.S. degree from Sharif University in 2011, both in Electrical Engineering.


Main Research Projects


Nimbus is a general purpose cloud computing system specially designed for computations with short tasks. Nimbus is developed in C++ and the API offers a data model similar to Spark and Naiad. The key difference between Nimbus and its counterparts is the novel control plane abstraction called Execution Templates. For jobs with short tasks the runtime overhead becomes comparable to the task itself. For example for an application with 4ms tasks, Sparks runtime overhead is about 98%. Execution templates enable Nimbus to schedule and drive jobs with tasks as short as 100us over large number of workers at scale with negligible runtime overhead. For more information visit Nimbus home page.


In this project, we investigate novel techniques to design MAC protocol for the Full Duplex radio. This project follows SING’s earlier paper on single channel full duplex wireless network (please see here), which explores feasible implementation of the physical layer for efficient simultaneous transmission and reception. We consider the very fundamental characteristics of a full duplex network and try to design a MAC protocol which perfectly takes the advantages of the freedom provided by the physical layer. Specially, the goal is to improve the throughput as much as two times of the ordinary half duplex systems while remaining fair to all the users. You can find our paper here.

Older Research Projects

Overloaded Codes

As my undergraduate research thesis, I worked on developing overloaded codes with efficient decoders for CDMA systems. These codes not only have really simple decoder but also show the bit-error-rate (BER) performance almost as good as the maximum-likelihood (ML) decoder. The novel coding matrices are called logical signature matrices (LSM), which results in efficient decoding on the receiver side with only few comparisons. We have patented the results, and you can find a copy of it here.

Selected Short Projects

x86 Runtime Prediction

Predicting x86 program runtime for Intel processor. As a part of the machine learning course at Stanford (CS229), we developed a series of supervised learning methods for predicting the runtime length of serialized x86 programs for Intel processors. Our generative model predicts runtime for programs with arbitrary numbers of instructions with 12% error.

Packet Classification

Novel and optimized methods for packet classification when both rules and keys contain wildcard expressions. This work introduces a software based algorithm which is memory efficient, does not need expansion for rules and keys containing wildcards, is scalable, and provides multi-match results.

DCell with OpenFlow

Simulating DCell topology and routing for data centers using Mininet and OpenFlow controller. Each year, advanced networking course at Stanford (CS244) publishes a blog on reproducing the results reported in the literature by students. Yo can find our report here .

Cognitive Radio Devices

This work is a literature survey on cognitive radio devices as a part of wireless networking course at Stanford (EE359).



Fast, Distributed Computations in the Cloud
Omid Mashayekhi
Philip Levis (primary advisor), Mendel Rosenblum (advisor), Alex Aiken (advisor)
Stanford University Doctoral Dissertation, 2017
Today, cloud computing frameworks adopt one of two strategies to schedule their computations over hundreds to thousands of machines. In the centralized strategy, a controller dispatches computation units, called tasks, to worker nodes. Centralization allows a framework to quickly reschedule, respond to faults, and mitigate stragglers. However, a centralized controller can only schedule a few thousand tasks per second and becomes a bottleneck. In the distributed strategy, there is no central node. These systems install data flow graphs on each node, which then independently execute and exchange data. By distributing the control plane and turning it into data flow, these frameworks can execute hundreds of thousands of tasks per second, and do not have a control plane bottleneck. However, data flow graphs describe a static schedule; even small changes, such as migrating a task between two nodes, requires stopping all nodes, recompiling the flow graph and reinstalling it on every node. This leads to high latency or wasteful resource utilization for rescheduling, fault recovery, or straggler mitigation. This dissertation presents a new, third strategy, called execution templates. Execution templates leverage a program’s repetitive control flow to cache control plane decisions in templates. A template is a parametrisable block of tasks: it caches some information (e.g., task dependencies) but instantiation requires some parameters (e.g., task identifiers). Executing the cached tasks in a template requires sending a single message that loads the new parameters. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Execution templates are not bound to a static control flow and efficiently capture nested loops and data dependent branches. Evaluation of execution templates in Nimbus, a cloud computing framework, shows that they provide the fine-grained scheduling flexibility of centralized control planes while matching the performance of the distributed ones. Execution templates in Nimbus support not only the traditional data analytics, but also complex, scientific applications such as hybrid graphical simulations.


Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly. We present execution templates, a control plane abstraction that can schedule hundreds of thousands of tasks per second while supporting fine-grained, per-task scheduling decisions. Execution templates leverage a program’s repetitive control flow to cache blocks of frequently-executed tasks. Executing a task in a template requires sending a single message. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Evaluations of execution templates in Nimbus, a data analytics framework, find that they provide the fine-grained scheduling flexibility of centralized control planes while matching the strong scaling of distributed ones. Execution templates support complex, real-world applications, such as a fluid simulation with a triply nested loop and data dependent branches.
Janus: A Novel MAC Protocol for Full Duplex Radio
Jae Young Kim, Omid Mashayekhi, Hang Qu, Maria Kazandjieva, and Philip Levis
Stanford CSTR 2013-02, 2013
This paper presents Janus, a novel MAC protocol for full–duplex wireless networks. Unlike other full-duplex MACs, Janus allows partially interfering nodes to cooperate by finding the appropriate transmission rates based on interference levels, making better use of the channel. Computing the optimal schedule and transmission rates is NPComplete, so Janus uses a cheaper heuristic approach. Janus also ensures that channel access time is shared fairly between all nodes. Janus has lower per-packet overhead compared to CSMA/CA because it eliminates random back-off and lets nodes transmit multiple packets with a single set of control packets. We show that for a setup with one access point and three nodes, Janus achieves 2.5X the throughput of half–duplex system based on CSMA/CA.
Uniquely Decodable Codes with Fast Decoder for Overloaded Synchronous CDMA Systems
Omid Mashayekhi, Farokh Marvasti
IEEE Transactions on Communication, vol. 60, no. 11, pp. 3145-3149, November 2012
In this paper, we introduce a new class of signature matrices for overloaded synchronous CDMA systems that have a very low complexity decoder. While overloaded systems are more efficient from the bandwidth point of view, the Maximum Likelihood (ML) implementation for decoding is impractical even for moderate dimensions. Simulation results show that the performance of the proposed decoder is very close to that of the ML decoder. Indeed, the proposed decoding scheme needs neither multiplication nor addition and requires only a few comparisons . Furthermore, the computational complexity and the probability of error vs. Signal to Noise Ratios (SNR) are derived analytically.


Scalable, Fast Cloud Computing with Execution Templates
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
Poster presentation at 2016 USENIX Symposium on Operating Systems Design and Implementation (OSDI'16)

Technical Reports

Distributed Graphical Simulation in the Cloud
Omid Mashayekhi, Chinmayee Shah, Hang Qu, Andrew Lim, Philip Levis
arXiv:1606.01966 [cs.DC], 2016
Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphical simulations in the cloud that decouples control logic, computations and data exchanges. This allows a central controller to balance load by redistributing computations, and recover from failures. Evaluations show that the architecture can run existing, state-of-the-art simulations in the presence of stragglers and failures, thereby enabling this large class of applications to use the computing cloud for the first time.
Scalable, Fast Cloud Computing with Execution Templates
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
arXiv:1606.01972 [cs.DC], 2016
Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (4-5x) speedups because their control planes cannot schedule work fast enough. This paper presents execution templates, a control plane abstraction for CPU-bound cloud applications, such as machine learning. Execution templates leverage highly repetitive control flow to cache scheduling decisions as templates. Rather than reschedule hundreds of thousands of tasks on every loop execution, nodes instantiate these templates. A controller's template specifies the execution across all worker nodes, which it partitions into per-worker templates. To ensure that templates execute correctly, controllers dynamically patch templates to match program control flow. We have implemented execution templates in Nimbus, a C++ cloud computing framework. Running in Nimbus, analytics benchmarks can run 16-43 times faster than in Naiad and Spark. Nimbus's control plane can scale out to run these faster benchmarks on up to 100 nodes (800 cores).
Canary: A Scheduling Architecture for High Performance Cloud Computing
Hang Qu, Omid Mashayekhi, David Terei, Philip Levis
Stanford CSTR 2016-01, 2016
We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally. We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9−90× speedup over Spark and up to 4× speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores.


You can find a copy of my resume here.