Authorship Attribution of Morsi Gameel Aziz’s Lyrics: A Clustering-Based Stylometry Approach

Abdulfattah Omar

Abstract


Numerous studies have addressed the authorship of Morsi Gameel Aziz’s lyrics. These studies have traditionally relied on chronological criteria to determine the real authors of the disputed lyrics, and to date they have reached no agreement. This can mainly be attributed to selectivity and a lack of empirical evidence, which raise questions about the reliability of such approaches. With the advent of machine learning systems and data mining techniques, it is now possible to process thousands of texts using replicable methods. This study therefore draws on these advances and applies a clustering-based stylometry approach to the authorship of Morsi Gameel Aziz’s lyrics. The hypothesis is that lyrics clustered together are more likely to have been written by the same poet. A corpus of 1,089 lyrics was built, comprising all known lyrics attributed to Aziz and the lyrics of the poets thought to be the real authors of the disputed lyrics. The lyrics were clustered using the Gibbs sampling Dirichlet multinomial mixture (GSDMM) technique and were assigned to 4 main classes, with the 12 disputed lyrics clustered within Aziz’s class. These results indicate that the GSDMM model is effective and reliable in clustering short documents in Arabic. They also show that machine learning systems and stylometric authorship techniques can be used to resolve many authorship questions that remain controversial and unanswered in Arabic literature.


Keywords


authorship attribution; clustering; Gibbs sampling Dirichlet multinomial mixture; letter pairs; lyrics; Morsi Gameel Aziz; stylometry
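
As an illustration of the clustering step named in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for a Dirichlet multinomial mixture over letter-pair features. It is not the study's implementation: the function names (letter_pairs, gsdmm), the hyperparameter values (K, alpha, beta, n_iters), the within-word bigram extraction, and the toy strings are all assumptions made for the example, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def letter_pairs(text):
    """Return overlapping letter pairs (character bigrams) taken within each word."""
    pairs = []
    for word in text.split():
        pairs.extend(a + b for a, b in zip(word, word[1:]))
    return pairs


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Cluster tokenised documents with collapsed Gibbs sampling for a
    Dirichlet multinomial mixture. K is an upper bound on the number of
    clusters (an assumed value); clusters may end up empty.
    Returns one cluster label per document."""
    rng = random.Random(seed)
    D = len(docs)
    V = len({tok for doc in docs for tok in doc})   # vocabulary size
    counts = [Counter(doc) for doc in docs]         # token counts per document
    m = [0] * K                                     # documents in each cluster
    n = [0] * K                                     # tokens in each cluster
    nw = [Counter() for _ in range(K)]              # per-cluster token counts
    z = [0] * D                                     # current label of each document

    def move(i, k, sign):
        # Add (sign=+1) or remove (sign=-1) document i from cluster k.
        m[k] += sign
        n[k] += sign * len(docs[i])
        for tok, c in counts[i].items():
            nw[k][tok] += sign * c

    # Random initial assignment.
    for i in range(D):
        z[i] = rng.randrange(K)
        move(i, z[i], +1)

    for _ in range(n_iters):
        for i in range(D):
            move(i, z[i], -1)                       # take the document out
            logp = []
            for k in range(K):
                # Log of the unnormalised conditional probability of cluster k.
                lp = math.log(m[k] + alpha)
                for tok, c in counts[i].items():
                    for j in range(c):
                        lp += math.log(nw[k][tok] + beta + j)
                for j in range(len(docs[i])):
                    lp -= math.log(n[k] + V * beta + j)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
            move(i, z[i], +1)                       # put it back in the sampled cluster
    return z


# Toy placeholder strings, not drawn from the study's corpus.
toy_texts = ["lalala lala la", "lala lalala", "tiktok tik tok", "tok tik tiktok"]
toy_docs = [letter_pairs(t) for t in toy_texts]
labels = gsdmm(toy_docs, K=4, seed=1)
```

Under the abstract's hypothesis, documents that receive the same label would be read as more likely to share an author. The sampler is stochastic, so labels depend on the seed and on the assumed hyperparameters; a real analysis would need to check stability across runs.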






Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Journal of Language and Linguistic Studies
ISSN 1305-578X (Online)
Copyright © 2005-2022 by Journal of Language and Linguistic Studies