Zhen Cao

PhD Candidate
File systems and Storage Lab (FSL)
Advisor: Erez Zadok

Email: zhccao@cs.stonybrook.edu

Research Interests: File Systems, Storage Systems, Operating Systems

About Me

I am a fourth-year Ph.D. candidate in the Computer Science Department at Stony Brook University. I joined FSL in May 2014, and my advisor is Prof. Erez Zadok. My research interests lie primarily in operating systems and in file and storage systems.

Education

Industry Experience

  • Software Engineering Intern, Google (May 2018 - August 2018)
  • Software Engineering Intern, Google (May 2017 - August 2017)
  • Research Summer Intern, IBM Research, Almaden (May 2016 - August 2016)
  • Research Summer Intern, IBM Research, Almaden (June 2015 - August 2015)

Research Projects

  • Evolutionary Storage Systems (May 2014 - Present)
  • Computers are becoming ubiquitous and consume more than 10% of the world's energy, which makes optimizing the performance and energy efficiency of computer systems both urgent and promising. Among all system components, storage systems, which persistently store and retrieve data, are especially important. In many production systems, the time that applications spend waiting for I/O operations is the main contributor to total execution time. Our previous work has shown that tuning even a small subset of parameters can have a large impact on performance and energy efficiency. However, manually tuning storage systems is impractical because the search space is large and each configuration is difficult to evaluate; moreover, many metrics are sensitive to the hardware and to the workloads running on it. In this project we propose to develop an auto-tuning algorithm for optimizing storage systems. By exploring various (intelligent) optimization techniques and running experiments, we found several that have the potential to optimize storage systems effectively and efficiently, including meta-heuristics, supervised machine learning, and reinforcement learning. We argue that all of these promising optimization algorithms work by balancing the trade-off among exploration, exploitation, and history information. Our long-term goal is to develop an auto-tuning algorithm, built on any helpful existing techniques, that is general enough to auto-tune any storage system for various optimization goals. Current work in this project includes:
    • Benchmark tens of thousands of file system configurations under multiple workloads.
    • Theoretical work on understanding the essence of optimization algorithms.
    • Automatic parameter selection of complex software systems.
    • Workload characterization and recognition.
    • Understanding the unstable behavior of many file system configurations under certain workloads.
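
The trade-off among exploration, exploitation, and history information described above can be illustrated with a minimal sketch. It is not the project's algorithm: the parameter names, their values, and the `synthetic_throughput` scoring function are all hypothetical stand-ins for a real configuration space and a real benchmark run.

```python
import random

# Hypothetical parameter space; names and values are illustrative only.
PARAM_SPACE = {
    "file_system": ["ext4", "xfs", "btrfs"],
    "block_size": [1024, 2048, 4096],
    "io_scheduler": ["noop", "deadline", "cfq"],
}


def synthetic_throughput(config):
    """Stand-in for a real benchmark run; returns a made-up score."""
    score = {"ext4": 3.0, "xfs": 4.0, "btrfs": 2.0}[config["file_system"]]
    score += config["block_size"] / 1024.0
    score += {"noop": 1.0, "deadline": 2.0, "cfq": 0.5}[config["io_scheduler"]]
    return score


def random_config(rng):
    """Pick every parameter uniformly at random."""
    return {name: rng.choice(values) for name, values in PARAM_SPACE.items()}


def mutate(config, rng):
    """Re-draw one parameter of a known configuration."""
    neighbor = dict(config)
    name = rng.choice(sorted(PARAM_SPACE))
    neighbor[name] = rng.choice(PARAM_SPACE[name])
    return neighbor


def auto_tune(evaluate, budget=200, explore_prob=0.2, seed=0):
    """Search for a high-scoring configuration within a fixed budget."""
    rng = random.Random(seed)
    history = {}  # configuration (as a sorted tuple) -> measured score
    best = random_config(rng)
    best_score = evaluate(best)
    history[tuple(sorted(best.items()))] = best_score
    for _ in range(budget):
        if rng.random() < explore_prob:
            candidate = random_config(rng)   # exploration: jump anywhere
        else:
            candidate = mutate(best, rng)    # exploitation: refine the best
        key = tuple(sorted(candidate.items()))
        if key not in history:               # history: never re-run a config
            history[key] = evaluate(candidate)
        if history[key] > best_score:
            best, best_score = candidate, history[key]
    return best, best_score, len(history)


if __name__ == "__main__":
    config, score, evaluated = auto_tune(synthetic_throughput)
    print(config, score, evaluated)
```

In a real setting `evaluate` would launch a benchmark, so the history cache matters far more than it does here: each avoided re-run saves minutes of measurement rather than microseconds.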

Publication

Presentation

Poster

  • On the Performance Variation in Modern Storage Stacks
    Zhen Cao, Vasily Tarasov, Hari Prasath Raman, Dean Hildebrand and Erez Zadok
    15th USENIX Conference on File and Storage Technologies (FAST 17), February 2017, Santa Clara, CA.
  • Parametric Optimization of Storage Systems
    Erez Zadok, Aashray Arora, Zhen Cao, Akhilesh Chaganti, Arvind Chaudhary, and Sonam Mandal
    7th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 15), July 2015, Santa Clara, CA.
  • Evolutionary Storage Systems
    Akhilesh Chaganti, Aashray Arora, Zhen Cao, Arvind Chaudhary, Sonam Mandal, and Erez Zadok
    13th USENIX Conference on File and Storage Technologies (FAST 15).

Academic Projects

  • Mining Temporal Pattern in Online Social Media (September 2014 - December 2014)
  • Online media has become highly dynamic, and much work has been done to find temporal patterns in online content. Though researchers have found that "idiom"-based and political content follow different patterns, much remains to be done. In this project, we mainly used time series clustering techniques to analyze data collected from Weibo. We aimed to find the different patterns that exist in users' hashtag and retweeting behaviors. We showed that conventional time series clustering techniques are generally not suitable for this kind of pattern mining. We used a shapelet-based clustering method and proposed a new metric to evaluate the shapelet candidates. We also provided several optimized versions of the naïve algorithm. We found multiple previously unseen patterns and provided interpretations for nearly all of them. Language: Python, Matlab, Java.
  • 64-bit JOS (September 2014 - November 2014)
  • Implemented the core part of a real operating system including a bootloader, memory protection, memory relocation, multiprogramming, a rudimentary file system, and a command shell. Language: C, Assembly.
  • Authorship Attribution Using Graph-Mining Techniques (February 2014 - May 2014)
  • Much work has been done on authorship attribution. However, most previous research on computational stylometric analysis has relied on shallow lexico-syntactic patterns. Recent work proposed that PCFG models can detect distributional differences in syntactic style. In this project, I used the parse trees directly for classification, i.e., I found frequent subgraph patterns in the parse trees without considering the linguistic meaning behind those patterns. Experiments showed that this approach can achieve about 80% classification accuracy. Finally, I present some insights into those subgraph features. Language: Python, Java.
  • x86_64 Simulator (February 2014 - May 2014)
  • Implemented a functional processor core for a subset of the x86_64 ISA, including an in-order 7-stage pipeline and 2-way set-associative caches. Language: SystemVerilog, C++.

Past Projects

  • Characterizing High-frequency Subscriber Sessions in Cellular Data Networks
  • The tremendous growth in cellular data network usage confronts operators with unprecedented signaling overloads and threatens the stability of the network. Understanding the characteristics of resource-inefficient subscribers is important for capacity planning and optimal resource allocation. In this work, we perform a large-scale investigation of the session characteristics of a particular category of mobile subscribers, called high-frequency subscribers, who access the network frequently, based on anonymized traces of a city-wide operational 3G network. We found that high-frequency subscribers can be extremely inefficient in their use of signaling resources. We also propose a novel approach for discovering their session patterns and study, at the semantic level, the apps that generate such patterns. We observed that the frequency of subscribers' session activation is positively correlated with the periodicity of session intervals. We also found that periodic session activations are correlated with abnormal behaviors. We demonstrate that our findings have significant implications for network optimization.
  • Privacy in Social-Driven Location Sharing
  • Location sharing in SNSs has developed rapidly in recent years. Privacy is undoubtedly a barrier to the adoption of such location sharing services. In this work, we investigate what factors affect users' privacy concerns and how privacy concerns may in turn affect users' location sharing behaviors. We conducted one quantitative survey and 8 interviews on popular Chinese SNSs. We identified 1) the factors that make users care more about their privacy when posting location information on social networks, 2) the types of private information that users think might be exposed through location sharing, and 3) users' behavior patterns in coping with privacy concerns. We also discuss the theoretical and practical implications for improving location sharing services.
  • GenRe: A General Replication Scheme over an Abstraction of DHTs
  • In P2P (peer-to-peer) systems, file replication is widely used to reduce hot spots and achieve high availability. In this work, we propose a General Replication (GenRe) scheme for structured P2P systems. GenRe is built on an abstraction of DHTs and can consequently be applied to any DHT implementation. GenRe chooses replica nodes by performing several bitwise exclusive operations, and all replica nodes can reply directly to queries, achieving load balancing in both data placement and data queries. For any given data object, all the replicas are updated through a virtual binary tree, achieving high efficiency and scalability. Simulation results demonstrate the efficiency and effectiveness of GenRe.
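
As a rough illustration of replica placement over a DHT abstraction, the sketch below derives replica identifiers from an object identifier with bitwise XOR and maps each one to a node with a generic successor lookup. This is not the published GenRe algorithm: reading "bitwise exclusive operations" as XOR is an assumption, and the 16-bit identifier space, the top-bit masks, and the function names `key_id`, `replica_ids`, and `lookup` are all hypothetical.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # small identifier space, chosen only for illustration


def key_id(name):
    """Hash an object name into the 16-bit identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def replica_ids(object_id, num_replicas):
    """Derive replica identifiers by XOR-ing the top bits with 0..n-1.

    Assumes num_replicas is a power of two, so the masks cover every
    combination of the top log2(num_replicas) bits exactly once.
    """
    shift = ID_BITS - (num_replicas.bit_length() - 1)
    return [object_id ^ (i << shift) for i in range(num_replicas)]


def lookup(node_ids, target):
    """Abstract DHT lookup: the first node at or after target on the ring.

    node_ids must be sorted; the search wraps around past the largest ID.
    """
    idx = bisect_left(node_ids, target)
    return node_ids[idx % len(node_ids)]


if __name__ == "__main__":
    nodes = sorted([100, 20000, 40000, 60000])  # hypothetical node IDs
    for rid in replica_ids(key_id("example-object"), 4):
        print(rid, "->", lookup(nodes, rid))
```

Because the masks change only the top bits, the replica identifiers are spread evenly around the identifier space, and any replica's identifier can be recomputed from the object identifier alone, which is what would let every replica node answer queries directly.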

Courses

Teaching Assistant