Publications

Manuscript

Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Jinyan Su, Claire Cardie

Preprint

Reasoning RL LLM

Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces — even for simple queries — leading to increased inference costs and latency. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" — producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy.
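
As a rough illustration of the adaptive trade-off described in this abstract, the minimal sketch below adjusts a length-penalty coefficient from batch accuracy. The names `update_penalty`, `shaped_reward`, `acc_ref`, `step`, and `max_len`, as well as the linear update rule itself, are assumptions made for illustration and are not taken from the paper.

```python
def update_penalty(lam: float, batch_acc: float,
                   acc_ref: float = 0.8, step: float = 0.1) -> float:
    """Hypothetical update rule: raise the length penalty when batch accuracy
    is above the reference level, lower it when accuracy falls below it.
    The coefficient is clipped at zero so the penalty never becomes a bonus."""
    return max(0.0, lam + step * (batch_acc - acc_ref))


def shaped_reward(correct: bool, length: int, lam: float,
                  max_len: int = 1000) -> float:
    """Correctness reward minus a length penalty scaled by lam.
    Length is normalized by an assumed maximum response length."""
    return float(correct) - lam * (length / max_len)


if __name__ == "__main__":
    lam = 0.5
    for acc in (0.95, 0.90, 0.60):  # made-up batch accuracies
        lam = update_penalty(lam, acc)
        print(f"accuracy={acc:.2f} penalty={lam:.3f}")
```

With this rule, a run of accurate batches steadily raises the pressure toward shorter outputs, and a drop in accuracy relaxes it.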

Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs

Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie

Preprint

Reasoning RL LLM

Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Our findings highlight generation length as a meaningful signal for reasoning behavior.

Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control

Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie

Preprint

Reasoning RAG Human-Centered LLM

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies — overretrieving when unnecessary or failing to retrieve iteratively when required for complex reasoning. Recent adaptive retrieval strategies predict only based on query complexity and lack user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter α, users can seamlessly navigate between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications.
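
One way to picture the control parameter α is as an interpolation weight between the two classifiers' predictions. The sketch below is only an assumption about how such a blend could work: the function names, the linear interpolation, and the three strategy labels are hypothetical and do not come from the paper.

```python
from typing import Sequence

# Hypothetical retrieval strategies, ordered from cheapest to most expensive.
STRATEGIES = ("no_retrieval", "single_step", "multi_step")


def blend(p_cost: Sequence[float], p_acc: Sequence[float], alpha: float) -> list:
    """Linearly interpolate two probability vectors.
    alpha = 0 follows the cost-oriented classifier; alpha = 1 follows the
    accuracy-oriented classifier."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_cost) != len(p_acc):
        raise ValueError("classifier outputs must have the same length")
    return [(1.0 - alpha) * c + alpha * a for c, a in zip(p_cost, p_acc)]


def choose_strategy(p_cost: Sequence[float], p_acc: Sequence[float],
                    alpha: float) -> str:
    """Pick the strategy with the highest blended score."""
    scores = blend(p_cost, p_acc, alpha)
    best = max(range(len(scores)), key=scores.__getitem__)
    return STRATEGIES[best]


if __name__ == "__main__":
    p_cost = [0.7, 0.2, 0.1]  # made-up output of the cost-oriented classifier
    p_acc = [0.1, 0.3, 0.6]   # made-up output of the accuracy-oriented classifier
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, choose_strategy(p_cost, p_acc, alpha))
```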

Towards More Robust Retrieval-Augmented Generation: Evaluating RAG Under Adversarial Poisoning Attacks

Jinyan Su, Jinpeng Zhou, Zhengxin Zhang, Preslav Nakov, Claire Cardie

Preprint

RAG Safety LLM

Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to mitigate LLM hallucinations and enhance their performance in knowledge-intensive domains. However, these systems are vulnerable to adversarial poisoning attacks, where malicious passages injected into the retrieval corpus can mislead models into producing factually incorrect outputs. In this paper, we present a rigorously controlled empirical study of how RAG systems behave under such attacks and how their robustness can be improved. On the generation side, we introduce a structured taxonomy of context types — adversarial, untouched, and guiding — and systematically analyze their individual and combined effects on model outputs. Our findings reveal that skeptical prompting can activate LLMs' internal reasoning, enabling partial self-defense against adversarial passages, though its effectiveness depends strongly on the model's reasoning capacity.

Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

Preprint

Human-Centered Alignment LLM

Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, with human accuracy often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.

2025

Corpus Poisoning via Approximate Greedy Gradient Descent

Jinyan Su, Preslav Nakov, Claire Cardie

Findings of ACL 2025

RAG Safety LLM

Dense retrievers are widely used in information retrieval and have also been successfully extended to other knowledge-intensive areas such as Retrieval-Augmented Generation (RAG) systems. Unfortunately, they have recently been shown to be vulnerable to corpus poisoning attacks. In this work, we propose Approximate Greedy Gradient Descent (AGGD), a new attack on dense retrieval systems based on the widely used HotFlip method for efficiently generating adversarial passages. We demonstrate that AGGD can select a higher-quality set of token-level perturbations than HotFlip by replacing its random token sampling with a more structured search. Notably, our method achieves attack success rates that are 15.24% and 17.44% higher on the NQ and MS MARCO datasets, respectively, compared to HotFlip.

MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight

Jinyan Su, Rohan Banerjee, Jiankai Sun, Wen Sun, Sarah Dean

RLC 2025

Human-Centered RL Safety Theory

The integration of AI into high-stakes decision-making domains demands safety and accountability. Traditional contextual bandit algorithms must balance exploration and exploitation, posing significant risks when applied to critical environments where exploratory actions can lead to severe consequences. We propose MixUCB, a flexible human-in-the-loop contextual bandit framework that enhances safe exploration by incorporating human expertise and oversight with machine automation. Based on the model's confidence and the associated risks, MixUCB intelligently determines when to seek human intervention. Theoretically, we analyze the regret and query complexity. Empirically, we validate the effectiveness through extensive experiments on both synthetic and real-world datasets.

2024

Learning from Streaming Data when Users Choose

Jinyan Su, Sarah Dean

ICML 2024

Human-Centered Alignment Theory

In digital markets composed of many competing services, each user chooses between multiple service providers according to their preferences, and the chosen service makes use of the user data to incrementally improve its model. The service providers' models influence which service the user will choose at the next time step, and the user's choice, in turn, influences the model update, leading to a feedback loop. In this paper, we formalize the above dynamics and develop a simple and efficient decentralized algorithm to locally minimize the overall user loss. Theoretically, we show that our algorithm asymptotically converges to stationary points of the overall loss almost surely.

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

ACL 2024

Safety LLM

The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. In this work, we introduce M4GT-Bench, a new benchmark based on a multilingual, multi-domain, and multi-generator corpus. The benchmark comprises three tasks: (1) monolingual and multilingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection. We see that obtaining good performance in MGT detection usually requires access to training data from the same domain and generators.

Adapting Fake News Detection to the Era of Large Language Models

Jinyan Su, Claire Cardie, Preslav Nakov

Findings of NAACL 2024

Safety LLM

In the age of large language models (LLMs) and the widespread adoption of AI-driven content creation, robustly and effectively discerning the veracity of news articles has become an intricate challenge. A significant gap exists in understanding the interplay between machine-paraphrased real news, machine-generated fake news, human-written fake news, and human-written real news. Our experiments reveal that detectors trained exclusively on human-written articles can perform well at detecting machine-generated fake news, but not vice versa. Building on our findings, we provide a practical strategy for the development of robust fake news detectors.

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji, Preslav Nakov

EACL 2024 Resource paper award

Safety LLM

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. We introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. These results show that the problem is far from solved and that there is a lot of room for improvement.

2023

DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text

Jinyan Su, Terry Yue Zhuo, Di Wang, Preslav Nakov

Findings of EMNLP 2023

Safety LLM

With the rapid progress of large language models (LLMs) and the huge amount of text they generate, it has become impractical to manually distinguish whether a text is machine-generated. In this paper, we introduce two novel zero-shot methods for detecting machine-generated text by leveraging log-rank information. One is called DetectLLM-LRR, which is fast and efficient, and the other is called DetectLLM-NPR, which is more accurate. Our experiments on three datasets and seven language models show that our proposed methods improve over the state of the art by 3.9 and 1.75 AUROC points absolute.
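
To make "log-rank information" concrete, the sketch below computes the rank of each observed token under a scoring model's next-token scores and averages the log of those ranks. It uses plain Python lists in place of real model outputs; the function names and the toy logits are assumptions for illustration. The abstract does not spell out how DetectLLM-LRR and DetectLLM-NPR combine these quantities, so that step is deliberately not shown.

```python
import math
from typing import Sequence


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """1-based rank of token_id among the scores (1 = highest-scoring token)."""
    target = logits[token_id]
    return 1 + sum(1 for score in logits if score > target)


def token_log_prob(logits: Sequence[float], token_id: int) -> float:
    """Log-softmax probability of token_id, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return logits[token_id] - log_norm


def mean_log_rank(step_logits: Sequence[Sequence[float]],
                  token_ids: Sequence[int]) -> float:
    """Average log-rank of the observed tokens over all positions."""
    ranks = [token_rank(l, t) for l, t in zip(step_logits, token_ids)]
    return sum(math.log(r) for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Toy example: two positions over a three-token vocabulary (made-up scores).
    step_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
    token_ids = [0, 2]
    print("mean log-rank:", mean_log_rank(step_logits, token_ids))
    print("log-probs:", [token_log_prob(l, t) for l, t in zip(step_logits, token_ids)])
```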

Leveraging Large Language Models for Structure Learning in Prompted Weak Supervision

Jinyan Su, Peilin Yu, Jieyu Zhang, Stephen H Bach

IEEE BigData 2023

LLM

Prompted weak supervision (PromptedWS) applies pre-trained large language models (LLMs) as the basis for labeling functions (LFs) in a weak supervision framework to obtain large labeled datasets. We further extend the use of LLMs to address one of the key challenges in weak supervision: learning the statistical dependency structure among supervision sources. We propose a Structure Refining Module, a simple yet effective approach based on the similarities of the prompts by taking advantage of the intrinsic structure in the embedding space. We show that our Structure Refining Module improves the PromptedWS pipeline by up to 12.7 points on the benchmark tasks.

Differentially Private Stochastic Convex Optimization in (Non)-Euclidean Space Revisited

Jinyan Su, Changhong Zhao, Di Wang

UAI 2023

Theory

In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) in Euclidean and general ℓp spaces. We focus on three settings: (1) DP-SCO over a constrained and bounded set in Euclidean space; (2) unconstrained DP-SCO in ℓp space; (3) DP-SCO with heavy-tailed data. For both convex and strongly convex loss functions, we propose methods whose outputs achieve excess population risks dependent on the Gaussian width of the constraint set rather than the dimension. We also show the bound for strongly convex functions is optimal up to a logarithmic factor.

2022

Privacy Model with Public Unlabeled Data

Jinyan Su, Jinhui Xu, Di Wang

ACML 2022 Best paper award

Theory

In this paper, we study the problem of PAC learning halfspaces in the non-interactive local differential privacy model (NLDP). To breach the barrier of exponential sample complexity, previous results studied a relaxed setting where the server has access to some additional public but unlabeled data. We propose two approaches based on the Massart noise model and self-supervised learning and show that it is possible to achieve sample complexities that are only linear in the dimension and polynomial in other terms, significantly improving on previous results.

Faster Rates of Private Stochastic Convex Optimization

Jinyan Su, Lijie Hu, Di Wang

ALT 2022

Theory

In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results for general convex and strongly convex functions. We study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC). We show that under mild assumptions, there is an algorithm whose output achieves improved upper bounds. We also establish matching lower bounds, showing our results are tight.