Ensembles of machine learning models for detecting writing style changes at the sentence level
Abstract/ Overview
Writing style change detection models aim to establish the exact number of authors who collaborated in writing a document. However, existing models fail to adequately detect style changes in documents where each author contributes very short texts, in the form of sentences, that are randomly distributed in the document. In addition, a number of features have been used to detect writing styles, but few studies have determined their suitability for this task. For writing style change detection models to remain relevant, there is a need for models that can detect writing style changes at the sentence level. The aim of this study was to develop ensembles of machine learning models for detecting writing style changes at the sentence level. The specific objectives were: to design ensembles of machine learning models for detecting writing style changes in documents; to implement these ensembles; to determine optimal feature sets for detecting writing style changes; and to evaluate the effectiveness of the ensemble models in detecting writing style changes at the sentence level. The independent variable was the ensembles of machine learning models, while the dependent variable was the detection of writing style changes at the sentence level. Other variables considered were the feature sets, model evaluation at the sentence level, and the performance of the models in detecting writing style changes at the sentence level. A mixed research design was used: an exploratory design was used to identify stylometric features, and an experimental design was used to carry out four experiments. Features whose importance scores were greater than zero were considered optimal and were used in the experiments. The first experiment selected the optimal document-level features and the second selected the optimal sentence-level features, both using feature importance scores.
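The selection rule described above (keep every feature whose importance score is greater than zero) can be sketched in a few lines. This is a minimal illustration only: the function name, the feature names, and the scores are hypothetical placeholders, not the study's features or results, and the source of the scores (for example, a tree-based model's importance estimates) is an assumption.

```python
from typing import Dict, List


def select_optimal_features(importances: Dict[str, float]) -> List[str]:
    """Return the names of features whose importance score is > 0,
    ordered from most to least important."""
    kept = [(name, score) for name, score in importances.items() if score > 0.0]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical scores for illustration only (not taken from the study).
example_scores = {
    "mean_word_length": 0.31,
    "comma_frequency": 0.12,
    "digit_ratio": 0.0,
    "function_word_ratio": 0.57,
}

optimal = select_optimal_features(example_scores)
print(optimal)
```

The same rule would be applied twice under this reading, once to document-level scores and once to sentence-level scores, yielding two separate feature sets.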
The third experiment classified documents as either single-authored or multi-authored. The last experiment detected the number of writing style changes in documents classified as multi-authored. The PAN at CLEF 2019 style change data set was used to train, validate, and test the models. The corpus consisted of 5088 documents, of which 50% were used for training, 25% for validation, and 25% for testing. Half of the documents were single-authored and the other half were multi-authored. Results show that 19 features were optimal at the document level, while 22 features were optimal at the sentence level. The models classified single-authored and multi-authored documents with an accuracy of 0.91 and an F1 score of 0.90. Similarly, the study achieved an Ordinal Classification Index (OCI) of 0.731 in detecting the number of writing style changes in multi-authored documents, outperforming state-of-the-art models, which achieved 0.808 (lower OCI values indicate better performance). The better performance is attributed to the use of optimal feature sets, ensemble learning models, and sentence-level representation. The main contribution of this study is a set of ensembles of machine learning models able to detect writing style changes at the sentence level. In addition, the study identified two feature sets, the optimal document-level and sentence-level feature sets, which can be used for writing style change detection with improved performance.
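The two-stage flow described in the abstract (first classify a document as single-authored or multi-authored, then estimate the number of style changes only for multi-authored documents) might be sketched as follows. This is a sketch under stated assumptions, not the study's implementation: the abstract does not say how the ensemble members are combined, so simple majority voting is assumed here, and the names `majority_vote` and `detect_style_changes`, the toy models, and the flat feature vectors are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A "model" here is any callable that maps a feature vector to a label.
Model = Callable[[Sequence[float]], int]


def majority_vote(models: List[Model], features: Sequence[float]) -> int:
    """Combine the base models' predictions by simple majority vote
    (an assumed combination rule). Ties go to the smallest label,
    which is an arbitrary choice."""
    votes = Counter(model(features) for model in models)
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


def detect_style_changes(
    doc_features: Sequence[float],
    sentence_features: Sequence[float],
    doc_models: List[Model],
    change_models: List[Model],
) -> int:
    """Stage 1: decide single-authored (0) vs multi-authored (1).
    Stage 2: only for multi-authored documents, predict the number of
    style changes. Single-authored documents have zero changes."""
    if majority_vote(doc_models, doc_features) == 0:
        return 0
    return majority_vote(change_models, sentence_features)


# Toy stand-ins for trained classifiers (hypothetical thresholds and outputs).
doc_models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(sum(x) > 1.0),
]
change_models = [lambda x: 2, lambda x: 2, lambda x: 3]

multi_result = detect_style_changes([0.9, 0.8], [0.1], doc_models, change_models)
single_result = detect_style_changes([0.1, 0.2], [0.1], doc_models, change_models)
print(multi_result, single_result)
```

Gating the second stage on the first keeps the change-count ensemble from being applied to documents already judged to have a single author, which mirrors the order of the third and fourth experiments.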