Features and machine learning classification of connected speech samples from patients with autopsy proven Alzheimer's disease with and without additional vascular pathology.
Rentoumi V., Raoufian L., Ahmed S., de Jager CA., Garrard P.
Mixed vascular and Alzheimer-type dementia and pure Alzheimer's disease (AD) are both associated with changes in spoken language. These changes have, however, seldom been subjected to systematic comparison. In the present study, we analyzed language samples obtained during the course of a longitudinal clinical study from patients in whom one or the other pathology was verified at post mortem. The aims of the study were twofold: first, to confirm, using quantitative methods of evaluation, that the language produced by members of the two groups differs; and second, to ascertain the most informative sources of variation between the groups. We adopted a computational approach to evaluate digitized transcripts of connected speech along a range of language-related dimensions. We then used machine learning text classification to assign the samples to one of the two pathological groups on the basis of these features. The classifiers' accuracies were tested using simple lexical features, syntactic features, and more complex statistical and information-theoretic characteristics. Maximum accuracy was achieved when word occurrences and frequencies alone were used. Features based on syntactic and lexical complexity yielded lower discrimination scores, but all combinations of features performed significantly better than a baseline condition in which every transcript was assigned randomly to one of the two classes. The classification results illustrate differences in the spoken language of the two groups that are specific to word content. In addition, patients with mixed pathology were found to exhibit a marked reduction in lexical variation and complexity compared with their pure AD counterparts.
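The idea of classifying transcripts from word occurrences and frequencies alone can be illustrated with a minimal Python sketch. It is not the study's pipeline: the abstract does not name the classifier or the feature extraction, so the multinomial naive Bayes model, the `tokenize` helper, the `WordCountNaiveBayes` class, and the placeholder labels `group_a` and `group_b` are all assumptions made for illustration. The type-token ratio is likewise only one common proxy for lexical variation, not necessarily the measure used. The two training sentences are invented toy strings, not patient data.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]


def type_token_ratio(text):
    """Distinct words divided by total words (a simple proxy for
    lexical variation; an assumption, not the study's measure)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


class WordCountNaiveBayes:
    """Multinomial naive Bayes over raw word counts (hypothetical helper)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen in training.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # Log prior plus Laplace-smoothed log likelihood of each word.
            score = math.log(self.doc_counts[c] / n_docs)
            for w in tokens:
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Invented toy transcripts with placeholder group labels.
train_texts = [
    "the boy takes a cookie from the jar",
    "water water runs over the sink",
]
train_labels = ["group_a", "group_b"]

clf = WordCountNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("cookie jar"))
print(clf.predict("water sink"))
print(type_token_ratio("the the the cat"))
```

In a realistic setting the classifier would be evaluated with cross-validation and compared against the random-assignment baseline described above; this sketch omits both for brevity.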