FINDING THE SMOKE SIGNAL: Smoking Status Classification with a Weakly Supervised Paradigm in Sparsely Labelled Dutch Free Text in Electronic Medical Records
Smoking status is a clinical variable in (primary) healthcare that records whether a patient smokes or has ever smoked cigarettes or cigars. However, it is currently under-reported by GP (General Practitioner) offices in the Netherlands. GPs experience a heavy documentation load, and often opt for describing complaints in the free text of the consultation (the 'SOEP' text) rather than formally documenting smoking status in a variable - even though a documented variable is easier to retrieve for clinical professionals who need this information later. This thesis uses Natural Language Processing (NLP) and Machine Learning (ML) to automatically classify smoking statuses recorded in the free text of consultation reports from Dutch GP offices. A specific problem arises: smoking status is under-documented and sparsely labelled in EMRs (Electronic Medical Records), while modern NLP approaches require large labelled datasets. We use weak supervision as well as a Transfer learning approach to combat this "small dataset problem". We attempt to answer the following question: "How can we best automatically detect and classify the smoking status in primary care patients' EMRs on the basis of the free text in GPs' notes?" We worked with medical data storage company Topicus to obtain 17,873 EMRs from 6 GP offices in the Netherlands, of which only a subset is labelled for smoking status (4,978 training examples, 651 development examples and 628 test examples) with one of three classes: non-smoker, ex-smoker, and smoker. Our results indicate that Transfer learning is a potentially fruitful approach to smoking status classification. We found that a fine-tuned, pre-trained Transformer model, BERTje, performs well (F1 (micro) = .79) and outperforms our rule-based baseline (F1 = .55). However, our results do not match those of earlier work, in which rule-based methods already obtained high performance scores (F1 = .91) on similar smoking status tasks in English.
We cannot replicate these high-performing rule-based methods, but our Transfer learning approach with BERTje is relatively effective at correctly detecting the non-smoker and ex-smoker classes in particular. Increasing the training set size in a weak supervision approach with a generative labelling model does not increase the overall performance of BERTje (F1 (micro) = .79), though it does lead to better classification of ex-smoker and non-smoker examples. Thus, we find Transfer learning with BERTje a potentially interesting approach for smoking status classification in Dutch EMRs, even with small datasets. These now popular pre-trained models could help move research into smoking status classification away from rule-based methods.
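The observation that micro-averaged F1 can stay the same while per-class results shift can be illustrated with a minimal sketch. The label sequences below are invented toy data, not taken from the thesis, and the helper names `per_class_f1` and `micro_f1` are hypothetical. The sketch assumes single-label classification over the three classes named above.

```python
from collections import Counter

LABELS = ("non-smoker", "ex-smoker", "smoker")


def per_class_f1(gold, pred, labels=LABELS):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f1(gold, pred):
    """Pool TP, FP and FN over all classes before computing F1.

    With exactly one label per example, every error counts as one
    false positive and one false negative, so this equals accuracy.
    """
    tp = sum(g == p for g, p in zip(gold, pred))
    errors = len(gold) - tp
    denom = 2 * tp + 2 * errors
    return 2 * tp / denom if denom else 0.0


# Toy data (invented for illustration): two hypothetical models with the
# same number of correct predictions, distributed differently over classes.
gold = ["non-smoker", "non-smoker", "ex-smoker", "ex-smoker", "smoker", "smoker"]
pred_a = ["non-smoker", "non-smoker", "ex-smoker", "smoker", "smoker", "ex-smoker"]
pred_b = ["non-smoker", "ex-smoker", "ex-smoker", "ex-smoker", "smoker", "non-smoker"]

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, round(micro_f1(gold, pred), 3), per_class_f1(gold, pred))
```

In this toy case both prediction sets have the same micro F1, yet their per-class scores differ, which is why reporting per-class scores alongside the micro average is informative for a three-class task like this one.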
Faculteit der Letteren