FORMING AN AUTHORSHIP PROFILE THROUGH N-GRAM TRACING
Keywords:
Authorship Analysis, Forensic Linguistics, Linguistic Fingerprinting, Multilingual Text Analysis, N-Gram Tracing
Abstract
In forensic linguistics, authorship profiling has become an important tool for identifying and characterizing individuals based on their writing style. This study employs N-gram tracing, a method that combines corpus linguistics with computational analysis to investigate recurring sequences of words or characters in text. By analyzing WhatsApp (WA) messages extracted from case evidence, the study examines how N-gram patterns can reveal linguistic features associated with author identity. The dataset consists of personal texts and microblogs containing approximately 2.1 tokens. To ensure data integrity, non-linguistic elements such as hyperlinks and media files were removed during a preprocessing phase. Character-level and word-level N-grams of lengths 1 to 5 (N1-N5) were extracted and examined for diversity, frequency, distribution, and contextual usage. To assess stylistic consistency across texts attributed to a particular individual, a machine learning model was used to calculate a similarity index and evaluate these linguistic fingerprints. Initial results suggest that certain N-gram patterns, such as those reflecting orthographic and lexical choices, are highly indicative of an individual author. Profiling is further enhanced by linguistic markers such as abbreviations, code-switching, and idiosyncratic styles found in informal communication. The study demonstrates that N-gram tracing is not only effective in identifying authorship but can also provide information on demographic and psychological features such as age, gender, and communication preferences. It contributes to forensic linguistics and computational authorship analysis by providing a robust and scalable technique for profiling authors based on real-world data. It also highlights implications for legal contexts, emphasizing the potential of N-gram tracing to aid investigations in which digital communication is critical.
Future steps include extending the analysis to multilingual data and integrating semantic-level profiling to improve accuracy.
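The pipeline summarized in the abstract (cleaning, N-gram extraction, similarity scoring) can be illustrated with a minimal Python sketch. This is not the study's implementation: the abstract does not specify its machine learning model, so a simple overlap coefficient stands in for the similarity index. The function names, the URL pattern, and the `<Media omitted>` placeholder are all assumptions made for illustration.

```python
import re
from typing import Set, Tuple

# Assumed patterns: a basic URL matcher and a media placeholder
# of the kind found in exported chat logs (hypothetical).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MEDIA_RE = re.compile(r"<Media omitted>", re.IGNORECASE)


def clean(text: str) -> str:
    """Remove hyperlinks and media placeholders, then collapse whitespace.

    Case is preserved because orthographic choices may be informative.
    """
    text = URL_RE.sub(" ", text)
    text = MEDIA_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word N-gram types in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of character N-gram types in a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(questioned: set, known: set) -> float:
    """Share of N-gram types in the questioned text that also occur
    in the known author's sample (a stand-in similarity index)."""
    if not questioned:
        return 0.0
    return len(questioned & known) / len(questioned)


if __name__ == "__main__":
    known = clean("otw ya, nanti gue kabarin https://example.com")
    questioned = clean("otw ya, gue kabarin nanti <Media omitted>")
    for n in range(1, 6):  # N1-N5, as in the abstract
        score = overlap(char_ngrams(questioned, n), char_ngrams(known, n))
        print(f"char {n}-gram overlap: {score:.2f}")
```

Scoring only the questioned text's N-gram types against the known sample keeps the measure asymmetric, which suits the case where a short disputed message is compared with a larger body of known writing.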
License
Copyright (c) 2025 Devi Ambarwati Puspitasari

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.