https://doi.org/10.1140/epjds/s13688-025-00533-1
Research
Addressing long-tailed distribution in judicial text for criminal motive classification: a balanced contrastive learning approach
1 Shandong University, Wenhuaxi Road, 264209, Weihai, Shandong, China
2 The National Police University for Criminal Justice, Qiyizhong Road, 071000, Baoding, Hebei, China
Received: 29 September 2024
Accepted: 14 February 2025
Published online: 19 February 2025
Understanding criminal motives is crucial for analyzing criminal psychology and predicting judicial outcomes. Traditional methods for crime motive analysis rely heavily on statistical techniques and require specialized knowledge and substantial human resources. With the increasing availability of judicial data, such as legal documents, machine learning approaches hold great potential in this domain. However, comprehensive datasets for training such models are lacking, and the distribution of crime motive categories in publicly available legal texts often exhibits a long-tailed imbalance. This imbalance can bias a model toward predicting the more common criminal motives. To address these challenges, we collected 11,589 legal documents from China Judgements Online (2019–2024) to create a crime motive text dataset. To mitigate the long-tail problem, we propose a Category-Aware Balanced Contrastive Learning (CA-BCL) method, which enhances the model’s representation of long-tailed data. Specifically, CA-BCL first balances the sampling process to alleviate class imbalance during prototype construction, and then applies balanced contrastive learning to improve the model’s ability to generalize to long-tailed categories, leading to better overall classification performance. Our experimental results demonstrate that CA-BCL significantly outperforms existing text classification models in crime motive classification, while also showing strong generalization capabilities on a standard text classification benchmark.
Key words: Criminal motive / Text classification / Long-tailed distribution / Contrastive learning
© The Author(s) 2025
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.