I am Tommy Chien (Hongjin Qian), a postdoctoral researcher specializing in Natural Language Processing (NLP), jointly affiliated with Peking University and the Beijing Academy of Artificial Intelligence (BAAI). I earned my PhD in 2024 from the Gaoling School of Artificial Intelligence (GSAI) at Renmin University of China, under the supervision of Prof. Zhicheng Dou and Prof. Ji-Rong Wen. I hold a Master’s degree from the University of Sydney (2019) and a Bachelor’s degree from Nankai University (2017). My experience includes research internships at Huawei and WeChat Group, as well as contributions to AI startups.
Oct 2024 - Present, Beijing, China
Nov 2023 - Present, Beijing, China
Beijing Academy of Artificial Intelligence (BAAI) is a non-profit research institute dedicated to promoting collaboration between academia and industry, fostering top AI talent, and pursuing long-term research on the fundamentals of AI technology.
Jun 2023 - Oct 2023, Beijing, China
Apr 2022 - May 2023, Beijing, China
Sept 2020 - Jun 2024, Beijing, China
The Gaoling School of Artificial Intelligence (GSAI) at Renmin University of China (RUC) is a prestigious institution dedicated to shaping the future of AI. According to CSRankings, GSAI ranked first worldwide in Information Retrieval every year from 2022 to 2024.
Elensdata is a start-up offering high-calibre data science and AI solutions for businesses in media, finance, and other sectors.
23 Papers in Total
15 Conference Papers
10 First-Author Papers
20 Patents in Total
18 Granted Patents
6 First-Inventor Patents
Reviewer:
NeurIPS, ICLR, ACL, EMNLP,
EACL, ACL ARR, SIGKDD, TheWebConf, TOIS
This paper explores two fine-tuning phenomena in Large Language Models (LLMs): the superior performance of optimizing only the Q and K matrices over full parameter optimization, and the benefit of distinct learning rates for faster convergence. Through theoretical and empirical analysis, the authors propose a new, efficient fine-tuning strategy that enhances generalization, memory efficiency, and optimization speed, validated on benchmark datasets.
This survey examines recent advancements in conversational search, a next-generation paradigm that uses natural language dialogue and LLMs to enable intuitive, multi-turn information retrieval, highlighting critical modules, challenges, and future directions for enhancing user experience and system intelligence in search engines.
This survey introduces a unified framework to evaluate the trustworthiness of Retrieval-Augmented Generation (RAG) systems across six key dimensions—factuality, robustness, fairness, transparency, accountability, and privacy—offering a structured benchmark and comprehensive evaluations to guide future research and enhance RAG reliability in real-world applications.
RAG-Studio is an efficient self-aligned training framework that adapts general RAG models to specialized domains solely through synthetic data, producing a domain-specific RAG system that outperforms the use of human-annotated data for fine-tuning.
MemoRAG is a novel retrieval-augmented generation framework that incorporates long-term memory to handle tasks with ambiguous information needs, which standard RAG systems struggle with. By using a dual-system architecture to form global memory and generate draft answers for guiding retrieval, MemoRAG outperforms conventional RAG in both complex and straightforward tasks.
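The dual-system idea can be illustrated with a toy sketch. This is not the released MemoRAG code: every model call below is a stub invented for illustration, standing in for the lightweight memory model (which drafts clue answers over the full context) and the full-strength generator (which answers from retrieved evidence).

```python
# Illustrative sketch of MemoRAG's dual-system pipeline (all functions are
# toy stand-ins, not the actual implementation).

def memory_model_draft(question, full_context):
    """Stub for the long-range memory model: returns draft 'clue' strings.
    A real system runs a compressed-memory LLM over the whole corpus."""
    clues = []
    for sentence in full_context.split("."):
        if any(w in sentence.lower() for w in question.lower().split()):
            clues.append(sentence.strip())
    return clues or [question]

def retrieve(clues, corpus, k=2):
    """Stub retriever: rank passages by word overlap with the clues."""
    clue_words = set(" ".join(clues).lower().split())
    def score(passage):
        return len(clue_words & set(passage.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def generate(question, evidence):
    """Stub standing in for the full-strength generator LLM."""
    return f"Answer to '{question}' based on: " + " | ".join(evidence)

corpus = [
    "The report discusses revenue growth in 2023.",
    "Weather patterns shifted in the northern region.",
    "Revenue grew because of new subscription products.",
]
question = "Why did revenue grow?"
clues = memory_model_draft(question, " ".join(corpus))
evidence = retrieve(clues, corpus)
print(generate(question, evidence))
```

The key design point the sketch captures: retrieval is guided by the memory model's draft clues rather than by the raw (possibly ambiguous) question.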
This work challenges the necessity of long-LLMs for long-context tasks by introducing LC-Boost, a framework that enables short-LLMs to effectively handle long-context tasks by adaptively accessing and utilizing relevant context, achieving improved performance with fewer resources.
We extended the context length of Llama-3-8B-Instruct from 8K to 80K using QLoRA fine-tuning, achieving superior performance across various long-context tasks while preserving short-context capabilities. The entire process completed in just 8 hours on an 8xA800 GPU machine and was driven by only 3.5K synthetic samples from GPT-4, highlighting the underexplored potential of LLMs for context extension.
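A common ingredient in context extension (alongside fine-tuning on long synthetic data) is enlarging the RoPE rotary base, which stretches every rotary frequency's wavelength so distant positions stay distinguishable. The toy calculation below illustrates that effect only; the base values are illustrative and are not the paper's exact settings.

```python
import math

def rope_wavelengths(base, dim=128):
    """Wavelength (in tokens) of each RoPE frequency pair: 2*pi*base^(2i/d)."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

short = rope_wavelengths(base=10_000)    # illustrative "short-context" base
long_ = rope_wavelengths(base=500_000)   # illustrative enlarged base

# Enlarging the base grows every non-trivial wavelength, so positional
# rotations cycle more slowly and cover longer sequences.
assert all(w_l >= w_s for w_s, w_l in zip(short, long_))
print(f"longest wavelength: {short[-1]:.0f} -> {long_[-1]:.0f} tokens")
```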
This paper introduces a Chunking-Free In-Context (CFIC) retrieval method for Retrieval-Augmented Generation (RAG) systems, improving evidence retrieval accuracy and efficiency by eliminating the need for document chunking and utilizing advanced decoding strategies.
DKGen is a novel framework that improves factual accuracy in text generation by iteratively generating short text segments, dynamically selecting relevant references to avoid knowledge mix-up, and leveraging cross-attention distribution for better use of external knowledge, outperforming baseline models in experiments.
This research explores techniques for improving conditional question answering by learning from structured documents.
EdiRCS is a highly efficient conversational query rewriting model that enhances search performance by selecting most of the rewrite tokens directly from the dialogue and generating only a few new tokens, supplemented by search-oriented training objectives; it outperforms state-of-the-art models on benchmarks with low latency and robustness to varied dialogues.
LLM4CS is a prompting framework that leverages large language models to enhance conversational search by generating multiple query rewrites and hypothetical responses, integrating them into a robust representation of users’ contextual search intent, and significantly outperforming existing methods on key benchmarks.
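The aggregation step in LLM4CS can be sketched with stdlib Python. This is not the released code: the embedding function below is a toy bag-of-words stand-in for a real dense encoder, and the rewrites and hypothetical responses are invented examples. It only illustrates mean-pooling several generated texts into one intent vector.

```python
from collections import Counter

# Tiny fixed vocabulary for the toy embedding (illustration only).
VOCAB = ["jazz", "history", "origin", "music", "new", "orleans"]

def embed(text):
    """Toy embedding: word counts over a tiny fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def aggregate(texts):
    """Mean-pool the embeddings of all rewrites/responses into one vector."""
    vectors = [embed(t) for t in texts]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(VOCAB))]

# Hypothetical LLM outputs for the turn "where did it start?" in a
# conversation about jazz (invented for illustration).
rewrites = ["where did jazz music originate", "origin of jazz"]
hypothetical_responses = ["jazz music began in new orleans"]

intent_vector = aggregate(rewrites + hypothetical_responses)
print(intent_vector)
```

Pooling over multiple rewrites and responses makes the final representation less sensitive to any single flawed generation, which is the robustness argument behind the framework.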
This paper introduces WebBrain, a new NLP task focused on generating short, factually-correct articles with references by mining supporting evidence from the web, and presents ReGen, a framework that enhances factual accuracy through improved evidence retrieval and task-specific pre-training, outperforming state-of-the-art methods on newly constructed large-scale datasets.
LeCoRE is a sparse lexical-based conversational retriever that enhances conversational search by generating denoised and interpretable session representations through knowledge distillation and external query rewrites, significantly improving performance on public datasets compared to existing methods.
TopReC is a topic-enhanced personalized retrieval-based chatbot that deconstructs long and noisy dialogue histories into topic-dependent segments, filtering out irrelevant data to learn a more accurate and consistent user personality, significantly outperforming previous state-of-the-art methods on large datasets.
CRDR is a unified framework for conversational search that combines query rewriting and context modeling, enhancing accuracy and efficiency by making minimal modifications to the original query and improving contextualized query embeddings through explicit term highlighting, outperforming baseline models in experiments on TREC CAsT-19 and CAsT-20 datasets.
ConvTrans is a data augmentation method that automatically transforms web search sessions into conversational search sessions, addressing the data scarcity problem in conversational dense retrieval by bridging the gaps in session quality and query form between the two settings; models trained on ConvTrans-generated data achieve performance comparable to those trained on expensive, manually-created datasets.
The MSP model refines and utilizes extensive dialogue history to enhance personalized response generation by extracting key information and leveraging data from similar users, outperforming existing methods in generating more informative and personalized responses.
This paper introduces a pre-training approach that leverages the hierarchical structure of HTML web pages and their DOM trees to enhance language models for information retrieval, demonstrating significant improvements in ranking performance over traditional pre-trained models by incorporating structured web data.
COTED is a novel framework for few-shot conversational dense retrieval that enhances context denoising through curriculum contrastive learning, progressively training the model to filter out noisy conversational turns, and demonstrating superior performance on CAsT-19 and CAsT-20 datasets compared to state-of-the-art baselines.
IMPChat is a retrieval-based personalized chatbot model that learns an implicit user profile by separately modeling the user’s personalized language style and preferences, dynamically weighting context-relevant history, and fusing these signals for response ranking, outperforming baseline models in experiments on large datasets.
Pchatbot is a large-scale Chinese dialogue dataset, significantly larger than existing datasets, that has been meticulously normalized and includes anonymized user IDs and timestamps, enabling the development of personalized dialogue models that learn implicit user personality from dialogue history, with preliminary benchmarks provided for future comparisons.
The Initiative-Imitate model addresses the challenge of overly proactive dialogue agents by balancing the chatbot’s role between speaker and listener, enhancing conversational fluency and engagement through adaptive initiative, and showing competitive results in both automatic and manual evaluations.
This patent proposes a semantic parsing method that combines rules and learning-based approaches.
MemoRAG is a next-generation retrieval-augmented generation system with long-term memory, enabling superior context-aware information retrieval and enhanced performance on complex tasks where traditional RAG systems struggle.