Currently, I am visiting the University of Edinburgh working with Philipp Koehn. My home department is the Language Technologies Institute at Carnegie Mellon where I am a PhD student advised by Alon Lavie. My interests are machine translation, machine learning, distributed systems, and theoretical computer science.
I work on efficient language model intersection, particularly for machine translation. Language models are widely applied in natural language modeling and make output more fluent. Language model performance (speed, memory, and accuracy) substantially impacts overall system performance. My open-source code, dubbed KenLM, is simultaneously faster, smaller, and at least as accurate as other packages in common cases.
Previously, I worked on system combination for machine translation. System combination builds on top of other translation systems (e.g. Babelfish and Google Translate) to produce one improved translation. The 2011 Workshop on Machine Translation invited system combination teams at various universities to submit translations and asked human judges to rank their quality. The workshop found that human judges preferred my submission in six of eight language pairs. The code is open-source.
Before Carnegie Mellon, I worked at Google on Book Search and Picasa, at Caltech in Netlab and GALEX while earning a BSc in Mathematics and Computer Science, and in Bangalore at Infosys as a research intern. My Curriculum Vitæ is available in HTML and PDF.
Publications
- Paper, Poster
- Heafield, Hoang, Koehn, Kiso, and Federico. Left Language Model State for Syntactic Machine Translation. Proc. International Workshop on Spoken Language Translation, San Francisco, CA, December 8-9, 2011.
- Paper, Talk, and Code
- Heafield. KenLM: Faster and Smaller Language Model Queries. Proc. EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 30-31, 2011.
- Paper, Poster
- Heafield and Lavie. CMU System Combination in WMT 2011. Proc. EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 30-31, 2011.
- Paper
- Heafield and Lavie. Voting on N-grams for Machine Translation System Combination. Proc. Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, October 31-November 5, 2010.
- Paper, Poster, Boaster, and Evaluation
- Heafield and Lavie. CMU Multi-Engine Machine Translation for WMT 2010. Proc. ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, Uppsala, Sweden, July 15-16, 2010. In the evaluation, my submission (cmu-heafield-combo) received 6 wins, more than any other submission.
- Paper, Presentation, and Code
- Heafield and Lavie. Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics 93, pages 27-36, 2010. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0008-4.
- Description, Presentation, and Evaluation
- Heafield. CMU-StatXfer Group System Combination. Proc. NIST Open MT Workshop 2009 at MT Summit XII, Ottawa, Canada, August 31-September 1, 2009. I also did Arabic and formal system combination; the system descriptions for these are similar.1
- Paper, Poster, and Evaluation
- Heafield, Hanneman, and Lavie. Machine Translation System Combination with Flexible Word Ordering. Proc. EACL 2009 Fourth Workshop on Statistical Machine Translation, Athens, Greece, March 30-31, 2009.
- Patent
- Curtis and Heafield, 2008. Systems and Methods for Identifying Similar Documents. US Patent 7,958,136.
- Paper and Patent Application
- Rama, Sarkar, and Heafield. Mining Business Topics in Source Code using Latent Dirichlet Allocation. Proc. 1st India Software Engineering Conference, pages 113-120, Hyderabad, India, February 19-22, 2008.2
- Poster
- Browne, Wheatley, Welsh, Seibert, Heafield, Rich, and the GALEX Science Team. RR Lyrae Stars in the Far Ultraviolet: GALEX Observations Compared with Theoretical Predictions. Bulletin of the American Astronomical Society, January, 2006.
- Journal Paper
- Welsh, Wheatley, Heafield, Seibert, et al. The GALEX Ultraviolet Variability Catalog. The Astronomical Journal 130, pages 825-831, 2005.
- Poster
- Welsh, Wheatley, Heafield, Seibert, Browne, and the GALEX Science Team. The Flaring UV Sky. Bulletin of the American Astronomical Society, January, 2005.
Reports
National Science Foundation Graduate Research Fellowship

I have been a National Science Foundation Graduate Research Fellow since August 2008.3
- Past Research
- Application essay about my past research
- Desire
- Application essay about wanting to be a graduate student
- Plan
- A viable research plan in natural language processing
Google


From March 2007 to August 2008, I worked at Google as a Software Engineer on Picasa Web Albums and Google Book Search. To share Google's approach to distributed systems, I lectured on the Hadoop MapReduce framework as part of a 3-day class at MIT. I wrote and delivered the introduction, basic join, and entropy lectures.4 Involved employees received a Site Award and a Peer Bonus.
- Intro
- Intended to follow a lecture on MapReduce theory, this introduces basic Hadoop programming
- Diff
- A few slides to explain reducers as joining data from separate sources
- k-Means
- Run through of the Hadoop API followed by k-means clustering
- Entropy
- Introduces an entropy-based word weighting scheme and uses it to motivate performance strategies
Netlab


In 2005, I worked for Netlab at Caltech as a Richard and Dena Krown Summer Undergraduate Research Fellow. Professor Low hired me after the summer and I continued until my Infosys internship in June 2006. These reports were prepared for the fellowship.
- Paper and Presentation
- Heafield, 2005. Detecting Network Anomalies With Kernel Principal Component Analysis.
- Proposal
- Heafield and Low, 2005. Locality Preservation in Manifolds to Reduce Dimensionality. Accepted for Summer Undergraduate Research Fellowship 2005.
Galaxy Evolution Explorer
Galaxy Evolution Explorer (GALEX) is a NASA satellite observatory with science operations at Caltech. Starting in 2004 as a Summer Undergraduate Research Fellow, I found about 90 variable stars and asteroids in its 193 million measurements. The GALEX team hired me to continue working with the data until I graduated in March 2007. Results are published and therefore listed under Publications, above.
- Presentation
- Heafield and Seibert, 2004. Transiting and Variable Objects: A Search Through Galaxy Evolution Explorer Observations.
Information Management Systems and Services
I worked for Caltech's IT department as a student representative and later as a security tester. They hired me as a security tester after I sent them this video:
- Exploit
- As part of a class project to make a course registration system, I found a simple hole in Caltech's production system. This shows how to use my roommate's login to read my grades. It has been patched.
- 1
- NIST serves to coordinate the NIST Open MT evaluations in order to support machine translation research and to help advance the state-of-the-art in machine translation technologies. NIST Open MT evaluations are not viewed as a competition, as such results reported by NIST are not to be construed, or represented, as endorsements of any participant's system, or as official findings on the part of NIST or the U.S. Government.
- 2
- © ACM, 2008. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the Proceedings of the 1st India Software Engineering Conference, Hyderabad, India, February 19-22, 2008.
- 3
- This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- 4
- © Google, 2008. Except as otherwise noted, this presentation is released under the Creative Commons Attribution 2.5 license.
