Supervision
I will be supervising Part II, Part III, and M.Phil. projects for students currently at Cambridge. If you are interested in working with me, please feel free to email me with your CV and transcript (email: XYZ, where X=yc632, Y=@ and Z=cam.ac.uk). Before sending the email, please take a look at my Google Scholar to better understand my research area.
Research Interests
Broadly speaking, I am interested in Natural Language Processing (NLP). My specific research interests include: Automatic Fact-Checking, Text Summarisation (particularly controllable text summarisation and dialogue summarisation), and Large Language Models (in particular, evaluating hallucination in LLMs and improving their factuality).
Note that I will NOT supervise projects outside the scope of NLP/AI, but I am open to interdisciplinary projects such as AI for Science (AI4Science) and NLP for Education (NLP4Education).
Potential Topics
Here are some potential topics for Part II projects.
Watermarking for LLMs: Since the advent of Large Language Models (LLMs), they have been misused to generate text for coursework, academic papers, and other content, raising concerns such as copyright infringement and academic misconduct. Watermarking is a technique that inserts unique, imperceptible patterns or codes into text or images to indicate the source or verify the authenticity of the content. In the context of LLMs, watermarking can serve as a tool to improve the detection of LLM-generated text, helping to address issues like academic dishonesty, copyright infringement, and the unethical use of AI-generated content.
In this project, students will be asked to implement different watermarking and detection methods as described in this paper. They will apply the methods to generative models such as BART, GPT-2, or T5. They will also conduct a human evaluation to judge whether watermarked text can be detected by humans. As an extension, they can explore how robust the watermark is, e.g., by corrupting the watermarked text and checking whether it can still be detected by the model.
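To give a flavour of the detection side, here is a minimal sketch of one well-known family of watermarking schemes: each generated token is biased towards a "green list" of vocabulary items pseudo-randomly seeded by the previous token, and detection counts green-token hits and reports a z-score. This is only an illustrative simplification, not the exact method in the referenced paper; all function names and parameters below are hypothetical.

```python
import hashlib
import math

def green_list(prev_token: str, vocab: list[str], gamma: float = 0.5) -> set[str]:
    """Pseudo-randomly split the vocabulary into a 'green list',
    seeded by the previous token (a simplified hash-based partition).
    gamma is the expected fraction of the vocabulary that is green."""
    green = set()
    for tok in vocab:
        digest = hashlib.sha256(f"{prev_token}|{tok}".encode()).digest()
        if digest[0] < gamma * 256:  # first hash byte decides membership
            green.add(tok)
    return green

def detect(tokens: list[str], vocab: list[str], gamma: float = 0.5) -> float:
    """Count how many tokens fall in their context's green list and
    return a z-score; a large positive value suggests watermarked text."""
    n = len(tokens) - 1
    hits = sum(
        1 for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab, gamma)
    )
    expected = gamma * n
    variance = n * gamma * (1 - gamma)
    return (hits - expected) / math.sqrt(variance)
```

Text generated by always sampling from the green list scores a high z-score, while natural (unwatermarked) text lands near zero in expectation; robustness experiments would then corrupt the token sequence and re-run `detect`.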
One student is already working on this topic. Multiple students are allowed to work on the same topic, but new students are encouraged to propose other topics where possible.
Part III and M.Phil. projects are expected to address more open research questions. You can either extend the above project or propose your own idea. Please drop me an email to discuss further.
Students/Mentees
I am fortunate to work/have worked with a number of amazing students, many of whom have progressed to Ph.D. programmes at world-leading institutions.
Past Students
Shiduo Qian (Co-supervised with Prof. Yue Zhang at Westlake University [Jul 2020 - Jan 2021], working on dialogue summarisation and financial data analysis. Undergraduate and postgraduate at Imperial College London [2017 - 2021]. Now Modeling/Forecasting Senior Analyst at TD Bank. Incoming M.Sc. student at Georgia Institute of Technology)
Liang Chen (Co-supervised with Prof. Yue Zhang at Westlake University [Jun 2020 - Jan 2021], working on dialogue summarisation. Undergraduate at Jilin University [2018 - 2022]. Now Ph.D. student at Peking University)
Yijie Zhou (Co-supervised with Prof. Yue Zhang at Westlake University [Sep 2022 - Jan 2023], working on cross-lingual summarisation. Undergraduate at Zhejiang University [2020 - 2024]. Incoming Ph.D. student at the University of Cambridge)
Yinghao Yang (Co-supervised with Prof. Yue Zhang at Westlake University [Jul 2023 - Dec 2023], working on evaluating LLMs. Undergraduate at Westlake University [2023 - present])
Pingchuan (Maestro) Yan (Co-supervised with Prof. Yue Zhang at Westlake University [Jul 2023 - Dec 2023], working on meta-evaluation of LLMs for MT. Undergraduate at University College London [2023 - present])