I am currently a Lecturer (Assistant Professor) at the University of Glasgow, based within the world-leading Information Retrieval Group and the IDA Section of the School of Computing Science. I am also an Affiliated Lecturer at the Language Technology Lab (LTL) of the University of Cambridge. Prior to that, I was a Postdoctoral Researcher (Research Associate) at the LTL of the University of Cambridge, and a Postdoctoral Researcher (Research Assistant) in the Information Retrieval Group of the University of Glasgow.
I currently lead a small research team of PhD, master's and undergraduate students working on Natural Language Processing (particularly large language models such as ChatGPT) and Knowledge Extraction, Representation & Reasoning (particularly biomedical applications). If you are interested in working with me, or in doing a PhD with me, please read this post for more information.
PhD in Computer Science, 2018
Sun Yat-sen University
2023-10-9: Two papers were accepted by EMNLP 2023 on Unsupervised Biomedical NER and Multimodal Generative Language Model.
2023-09-26: Our survey paper on Knowledge Graph Embedding was accepted by ACM Computing Surveys.
2023-08-14: Our survey paper on Multimodal Language Modelling was accepted by ACM Transactions on Multimedia Computing, Communications, and Applications.
2023-08-05: One paper was accepted by CIKM 2023 on Knowledge-enhanced Passage Ranking.
2023-06-26: Invited to serve as one of the Area Chairs for the track “Interpretability, Interactivity and Analysis of Models for NLP” of EMNLP 2023.
2023-06-09: I gave an invited talk on “Probing and Infusing Biomedical Knowledge for Pre-trained Language Models” in “NLP for Social Good (NSG) Symposium 2023” hosted by Dr Procheta Sen at the University of Liverpool.
2023-05-22: One paper was accepted by Findings of ACL 2023 on Generative Event Extraction.
2023-05-02: One paper was accepted by ACL 2023 on Few-shot NER.
2023-03-02: One paper was accepted by Transactions on the Web (TWEB) on Conversational Recommendation Systems.
2023-02-10: I gave a guest lecture on “Word Sense and WordNet” for the “LI18 - Computational Linguistics” course offered by Professor Nigel Collier at the University of Cambridge.
2023-01-24: I am an Area Chair for the “Interpretability and Analysis of Models for NLP” track of ACL 2023.
2022-10-25: I will be attending EMNLP 2022 (Abu Dhabi, UAE 🇦🇪) in person.
2022-10-06: One paper was accepted by EMNLP 2022 on Parameter-Efficient Tuning.
2022-10-04: Our paper entitled Graph Neural Pre-training for Recommendation with Side Information was accepted at ACM TOIS.
2022-09-19: Our paper entitled Enhancing Conversational Recommendation Systems with Representation Fusion was accepted at ACM TWEB.
2022-09-02: Gave a short talk at the University of Glasgow Computational Biology Conference on the topic of biomedical knowledge probing and infusion with pretrained language models.
2022-07-12: One paper entitled Dynamic Co-embedding Model for Temporal Attributed Networks was accepted at IEEE TNNLS.
Acquiring factual knowledge with Pretrained Language Models (PLMs) has attracted increasing attention, showing promising performance on many knowledge-intensive tasks. This strong performance has led the community to believe that the models do possess a modicum of reasoning competence rather than merely memorising the knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we present two main findings: (i) PLMs inadequately generalise learned logic rules and perform inconsistently under simple adversarial surface-form edits; (ii) while deductive-reasoning fine-tuning does improve PLMs’ performance on reasoning over unseen knowledge facts, it results in catastrophic forgetting of previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, demonstrating the importance of controlled examination and probing of PLMs’ reasoning abilities; by reaching beyond (misleading) task performance, we reveal that PLMs are still far from human-level reasoning capabilities, even for simple deductive tasks.
BioCaster was launched in 2008 to provide an ontology-based text mining system for early disease detection from open news sources. Following a 6-year break, we have re-launched the system in 2021. Our goal is to systematically upgrade the methodology using state-of-the-art neural network language models, whilst retaining the original benefits that the system provided in terms of logical reasoning and automated early detection of infectious disease outbreaks. Here, we present recent extensions such as neural machine translation in 10 languages, neural classification of disease outbreak reports and a new cloud-based visualization dashboard. Furthermore, we discuss our vision for further improvements, including combining risk assessment with event semantics and assessing the risk of outbreaks with multi-granularity. We hope that these efforts will benefit the global public health community.
Leveraging the side information associated with entities (i.e. users and items) to enhance the performance of recommendation systems has been widely recognized as an important modelling dimension. While many existing approaches focus on the integration scheme to incorporate entity side information – by combining the recommendation loss function with an extra side information-aware loss – in this paper, we propose instead a novel pre-training scheme for leveraging the side information. In particular, we first pre-train a representation model using the side information of the entities, and then fine-tune it using an existing general representation-based recommendation model. Specifically, we propose two pre-training models, named GCN-P and COM-P, by considering the entities and their relations constructed from side information as two different types of graphs respectively, to pre-train entity embeddings. For the GCN-P model, two single-relational graphs are constructed from all the users’ and items’ side information respectively, to pre-train entity representations by using the Graph Convolutional Networks. For the COM-P model, two multi-relational graphs are constructed to pre-train the entity representations by using the Composition-based Graph Convolutional Networks. An extensive evaluation of our pre-training models fine-tuned under four general representation-based recommender models, i.e. MF, NCF, NGCF and LightGCN, shows that effectively pre-training embeddings with both the user’s and item’s side information can significantly improve these original models in terms of both effectiveness and stability.
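As a minimal NumPy sketch of the pre-training step only (hypothetical graph, feature sizes and random weights — not the paper's actual GCN-P implementation), one symmetrically normalised GCN propagation over a toy single-relational side-information graph looks like this:

```python
import numpy as np

# Toy single-relational side-information graph over 4 entities.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                        # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
P = D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalisation

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # initial entity features
W = rng.normal(size=(8, 8))                  # layer weights

H = np.tanh(P @ X @ W)                       # one GCN propagation layer
```

In the paper's scheme, pre-trained embeddings like `H` would then initialise the entity embeddings of a recommender such as MF, NCF, NGCF or LightGCN before fine-tuning with the recommendation loss.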
Parameter-efficient tuning (PETuning) methods have been deemed by many as the new paradigm for using pretrained language models (PLMs). By tuning just a small fraction of the parameters compared to full-model finetuning, PETuning methods claim to achieve performance on par with or even better than finetuning. In this work, we take a step back and re-examine these claims by conducting the first comprehensive investigation into the training and evaluation of PETuning methods. We find that the problematic validation and testing practices in current studies, compounded by the unstable nature of PETuning methods, have led to unreliable conclusions. When compared under a truly fair evaluation protocol, PETuning cannot yield consistently competitive performance, while finetuning remains the best-performing method in medium- and high-resource settings. We delve deeper into the cause of the instability and observe that model size does not explain the phenomenon, whereas the number of training iterations correlates positively with stability.
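The abstract's "fraction of the parameters" point can be made concrete with a back-of-the-envelope count for bottleneck adapters on a BERT-base-sized model; all sizes here are rough assumptions for illustration, not figures from the paper.

```python
# Hypothetical sizes for a BERT-base-like model with bottleneck adapters.
d_model, n_layers, bottleneck = 768, 12, 64
full_params = 110_000_000                    # rough BERT-base parameter count

# Each adapter = down-projection + up-projection (weights and biases);
# assume two adapters per transformer layer.
adapter_params = n_layers * 2 * (d_model * bottleneck + bottleneck
                                 + bottleneck * d_model + d_model)
fraction = adapter_params / full_params      # roughly 2% of the full model
```

Only this ~2% of parameters receives gradient updates under adapter-style PETuning, which is exactly why fair comparison against full finetuning (tuning all 100%) requires the careful evaluation protocol the paper argues for.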
Knowledge probing is crucial for understanding the knowledge transfer mechanism behind pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To catalyse research in this direction, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, constructed from the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% acc@10. While highlighting various sources of domain-specific challenges that contribute to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To demonstrate this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes acc@10 to 28%, a notable performance gap remains. Our human expert evaluation suggests that the probing performance of Contrastive-Probe is still underestimated, as UMLS does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further development of probing techniques better suited to this domain.
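Contrastive-Probe's exact recipe is described in the paper; purely as an illustration of the kind of contrastive objective involved, this NumPy sketch computes an InfoNCE-style loss for one query against a positive and several negatives (all vectors are synthetic):

```python
import numpy as np

def info_nce(query, keys, temperature=0.07):
    """InfoNCE loss for one query; keys[0] is the positive example."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = k @ q / temperature             # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                     # negative log-likelihood of positive

rng = np.random.default_rng(1)
q = rng.normal(size=16)
pos = q + 0.05 * rng.normal(size=16)         # positive: slight perturbation of q
negs = rng.normal(size=(7, 16))              # negatives: unrelated vectors
loss = info_nce(q, np.vstack([pos, negs]))
```

Minimising such a loss pulls representations of matched pairs together and pushes mismatched ones apart, which is the general mechanism a contrastive probing adjustment exploits.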
Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that handles a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (incl. NLI, QA and classification); the results show that MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performance on five of the evaluated datasets.
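A schematic of the mixture-of-adapters idea (hypothetical shapes and random, untrained weights — not the released MoP models): K sub-graph adapter outputs are combined by a softmax gate conditioned on the hidden state.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, d, bottleneck = 4, 32, 8                  # 4 sub-graph adapters
h = rng.normal(size=d)                       # hidden state from a BERT layer

# Each adapter: down-project, nonlinearity, up-project, residual connection.
downs = 0.1 * rng.normal(size=(K, d, bottleneck))
ups = 0.1 * rng.normal(size=(K, bottleneck, d))
adapter_outs = np.stack([h + np.tanh(h @ downs[k]) @ ups[k] for k in range(K)])

gate_w = 0.1 * rng.normal(size=(d, K))
gates = softmax(h @ gate_w)                  # mixture weights over the adapters
mixed = gates @ adapter_outs                 # combined representation
```

During fine-tuning, the gate learns which sub-graph's knowledge to weight most for the target task, while each adapter stays cheap relative to the full BERT.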
Despite the widespread success of self-supervised learning via masked language models (MLMs), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking, where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.
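SapBERT's metric-learning objective builds on the Multi-Similarity loss; the NumPy sketch below implements a simplified version (no online hard-pair mining, toy 2-D embeddings) to show that well-aligned synonym clusters score a lower loss than misaligned ones. The toy points and labels are invented for illustration.

```python
import numpy as np

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Simplified Multi-Similarity loss over L2-normalised embeddings
    (no online hard-pair mining, unlike the full SapBERT objective)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = x @ x.T                              # pairwise cosine similarities
    loss, n = 0.0, len(labels)
    for i in range(n):
        pos = np.array([S[i, j] for j in range(n)
                        if j != i and labels[j] == labels[i]])
        neg = np.array([S[i, j] for j in range(n) if labels[j] != labels[i]])
        if pos.size:                         # pull synonyms together
            loss += np.log1p(np.exp(-alpha * (pos - lam)).sum()) / alpha
        if neg.size:                         # push non-synonyms apart
            loss += np.log1p(np.exp(beta * (neg - lam)).sum()) / beta
    return loss / n

labels = [0, 0, 1, 1]                        # two toy synonym groups
aligned = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
shuffled = np.array([[1.0, 0.0], [0.0, 1.0], [0.99, 0.1], [0.1, 0.99]])
```

With these points, `multi_similarity_loss(aligned, labels)` is lower than `multi_similarity_loss(shuffled, labels)`: self-alignment rewards embedding spaces in which synonyms of the same concept sit close together.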