Barret Zoph

I am currently a co-founder and CTO at Thinking Machines.

Previously I was a VP of Research (Post-Training) at OpenAI. My team worked on alignment, tool use, evaluations, ChatGPT, search, multi-modality and trained the models shipped into ChatGPT and the API.

Before OpenAI I was a Staff Research Scientist at Google Brain, working on training large sparse language models and applying them to various applications. I was also a research TL for MuM and worked extensively on AutoML across research and product, an effort for which my work on Neural Architecture Search (NAS) was an impetus.

Prior to Google Brain, I worked at the Information Sciences Institute with Kevin Knight and Daniel Marcu on statistical machine translation.

(Last update: February 2023)

Email / Google Scholar / Twitter / LinkedIn

Talks and Media

Yannic Kilcher's Podcast on Sparsity for Large Language Models
Towards Data Science Podcast on Sparsity for Large Language Models
The Data Exchange Podcast on Efficiently Scaling Large Language Models
ICCV 2019 Neural Architects Workshop Talk
UC Berkeley Lecture on Deep Reinforcement Learning
Koç University Lecture on Transformer Sparsity
Featured in NYT article on AutoML
Technology Review Article on Google's AutoML/Neural Architecture Search

Research

Representative papers are highlighted. The * denotes equal author contribution.

Recent

ST-MoE: Designing Scalable and Transferable Sparse Expert Models
Barret Zoph*, Irwan Bello*, Sameer Kumar, Nan Du, Yanping Huang, Noam Shazeer, William Fedus*.
[ArXiv 2022] [Yannic Kilcher's Tutorial]

Sparse Mixture of Experts models suffer from training instabilities and finetuning issues at scale. We design improved methods for modeling, pretraining and finetuning sparse models and successfully finetune the largest sparse encoder-decoder model ever trained. We achieve state-of-the-art results on many NLP benchmarks such as SuperGLUE and ARC Easy / ARC Challenge.

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui.
[ArXiv 2021]

Successfully train a 1.2 trillion parameter mixture-of-experts language model. It requires 1/3 the energy to train compared to GPT-3, uses half the computational FLOPs for inference and still achieves better zero/one-shot performance on 29 NLP tasks.

Multi-Task Self-Training for Learning General Representations
Golnaz Ghiasi*, Barret Zoph*, Ekin D Cubuk*, Quoc V Le, Tsung-Yi Lin.
[ICCV 2021] (Spotlight Presentation)

Introduce multi-task self-training (MuST), which harnesses the knowledge in independent teacher models into a general student model. Scalable multi-task algorithm that works well with unlabeled and partially labeled data.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus*, Barret Zoph*, Noam Shazeer.
[ArXiv 2021] [Yannic Kilcher's Tutorial]

Achieved 4-7x pre-training speedups over T5 models and successfully trained the first trillion parameter language model through model sparsity. Achieved state-of-the-art results on NLP benchmarks like ANLI, Natural Questions, WebQuestions and TriviaQA.

Simple Training Strategies and Model Scaling for Object Detection
Xianzhi Du*, Barret Zoph*, Wei-Chih Hung, Tsung-Yi Lin.
[ArXiv 2021]

Extends the ResNet-RS work in image classification to object detection and instance segmentation. Finds that a significant share of the improvement in object detection is due to training and scaling.

Revisiting ResNets: Improved Training and Scaling Strategies
Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph.
[NeurIPS 2021] [Github] [Google Cloud] [Blog posts 1, 2, 3] (Spotlight Presentation)

Disentangle the impact of architectures vs training and scaling. Reveals that improvements in image classification have been primarily driven by improved training and scaling. Identifies general scaling strategies that improve vision models across training setups.

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
Golnaz Ghiasi*, Yin Cui*, Aravind Srinivas*, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V. Le, Barret Zoph.
[CVPR 2021]

Systematically study and scale up a simple copy-paste data augmentation method for instance segmentation.

Rethinking Pre-training and Self-training
Barret Zoph*, Golnaz Ghiasi*, Tsung-Yi Lin*, Yin Cui, Hanxiao Liu, Ekin D Cubuk, Quoc V. Le.
[NeurIPS 2020] (Oral Presentation)

Improved semi-supervised learning using pseudo labeling for object detection and semantic segmentation. Scaled up results to achieve state-of-the-art on Pascal segmentation.

2019 & 2020

Randaugment: Practical automated data augmentation with a reduced search space
Ekin D Cubuk*, Barret Zoph*, Jonathon Shlens, Quoc V. Le.
[NeurIPS 2020] [Henry AI Labs Video]

Significantly simplifies and speeds up learned data augmentation procedures in computer vision and achieves state-of-the-art ImageNet accuracy.

Learning data augmentation strategies for object detection
Barret Zoph*, Ekin D Cubuk*, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, Quoc V Le.
[ECCV 2020]

Expands previous learned data augmentation methods to object detection. Achieves state-of-the-art on COCO object detection and gets strong improvements across a range of models and datasets.

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan.
[ICLR 2020]

Proposes AugMix, which is a data augmentation technique that improves model robustness and uncertainty metrics on several image classification benchmarks.

Specaugment: A simple data augmentation method for automatic speech recognition
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, Quoc V. Le.
[Interspeech 2019]

Introduce SpecAugment, which is a simple data augmentation method for speech recognition. It operates directly on the feature inputs of the neural network and masks blocks of frequency channels and timesteps.
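The masking described above can be illustrated with a minimal sketch. It is not the paper's implementation: it covers only the frequency and time masking mentioned here, and the function name, parameter names, default widths, and the list-of-lists feature layout are all assumptions chosen for illustration.

```python
import random


def spec_augment(features, num_freq_masks=1, max_freq_width=2,
                 num_time_masks=1, max_time_width=3,
                 mask_value=0.0, rng=None):
    """Mask random blocks of frequency channels and time steps.

    `features` is a list of time steps, each a list of per-channel values
    (a hypothetical layout; real pipelines typically use arrays).
    Returns a masked copy and leaves the input untouched.
    """
    rng = rng or random.Random()
    num_steps = len(features)
    num_channels = len(features[0])
    out = [row[:] for row in features]

    # Frequency masking: blank a contiguous band of channels at every step.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_channels))
        start = rng.randint(0, num_channels - width)
        for row in out:
            for f in range(start, start + width):
                row[f] = mask_value

    # Time masking: blank a contiguous run of time steps across all channels.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_steps))
        start = rng.randint(0, num_steps - width)
        for t in range(start, start + width):
            out[t] = [mask_value] * num_channels

    return out


if __name__ == "__main__":
    spectrogram = [[1.0] * 8 for _ in range(10)]  # 10 time steps, 8 channels
    for row in spec_augment(spectrogram, rng=random.Random(0)):
        print(row)
```

Because the masks are applied to the input features rather than to the raw audio, a sketch like this can sit directly in front of the network during training.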

Attention Augmented Convolutional Networks
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le.
[ICCV 2019]

Trained the first fully attentional image classifier and showed that self-attention is a competitive replacement for convolutions in image classification. Hybrid architectures that combine self-attention and convolution yield sizable improvements on image classification and object detection.

AutoAugment: Learning Augmentation Policies from Data
Ekin D. Cubuk*, Barret Zoph*, Dandelion Mane, Vijay Vasudevan, Quoc V. Le.
[CVPR 2019] [CVPR Talk] [Google Blog Post] [Henry AI Labs Tutorial] (Oral Presentation)

Introduce AutoAugment, which automatically searches for good data augmentation methods to improve generalization for a given problem and dataset. AutoAugment achieved state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet.

2017 & 2018

Efficient Neural Architecture Search via Parameter Sharing
Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean.
[ICML 2018]

Faster neural architecture search by training and evaluating subgraphs in a larger network.

Progressive Neural Architecture Search
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy.
[ECCV 2018]

Develop a new neural network architecture search algorithm using sequential model-based optimization (SMBO). Gets up to 5x speedups over reinforcement learning approaches.

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le.
[CVPR 2018] (Spotlight Presentation)

Learn modular neural network architectures through reinforcement learning on CIFAR-10, which also transfer well to ImageNet to obtain state-of-the-art accuracy.

Backprop evolution
Maximilian Alber*, Irwan Bello*, Barret Zoph, Pieter-Jan Kindermans, Prajit Ramachandran, Quoc Le.
[ArXiv 2018]

Starting from random or known propagation rules, evolution searches for backpropagation variants that maximize generalization performance.

Searching for activation functions
Prajit Ramachandran, Barret Zoph, Quoc V. Le.
[ArXiv 2017]

Do a large scale exploration of different activation functions and discover interesting trends and new variants that perform well across many computer vision benchmarks.

Intriguing Properties of Adversarial Examples
Ekin D. Cubuk, Barret Zoph, Samuel S. Schoenholz, Quoc V. Le.
[ArXiv 2017]

Argue that adversarial examples arise from inherent uncertainty that neural networks have about their predictions. Furthermore, we study if we can find good neural network architectures that are more adversarially robust than others.

Neural Optimizer Search with Reinforcement Learning
Irwan Bello*, Barret Zoph*, Vijay Vasudevan, Quoc V. Le.
[ICML 2017]

Automated discovery of optimization methods by generating update rules with an RL-trained controller. Discovered two new optimizers and learning rate schedules which experimentally lead to faster convergence in image classification and machine translation.

Neural Architecture Search with Reinforcement Learning
Barret Zoph, Quoc V. Le.
[ICLR 2017] [Google Blog Post] [Wikipedia] (Oral Presentation)

Use reinforcement learning to automatically design good neural network architectures for computer vision and language modeling.

2015 & 2016

Transfer Learning for Low-Resource Neural Machine Translation
Barret Zoph, Deniz Yuret, Jonathan May, Kevin Knight.
[EMNLP 2016]

Develop a transfer learning algorithm to improve low-resource machine translation.

Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
Barret Zoph*, Ashish Vaswani*, Jonathan May, Kevin Knight.
[NAACL 2016]

Design an algorithm to train large vocabulary language models efficiently on GPUs using Noise Contrastive Estimation (NCE).

Multi-Source Neural Translation
Barret Zoph, Kevin Knight.
[NAACL 2016] (Oral Presentation)

Built a new neural architecture that can take in translations in n languages in order to produce a better translation in the (n+1)th language.

How Much Information Does a Human Translator Add to the Original?
Barret Zoph, Marjan Ghazvininejad, Kevin Knight.
[EMNLP 2015] [Microsoft Research Talk] [EMNLP Talk] (Oral Presentation)

How much information does a human translator add to an original text? We provide a bound using compression and develop new text compression algorithms and a benchmark.

(website template credits)