Solving the current challenges around conducting research with administrative data
Work with patients and the public
Looking forward and reflecting back
Practical questions around using DataLoch data
At a most basic level, creatures that live in groups, from insects to humans, share information between themselves such as the presence of danger and the location of food. This is because it is a good method of protecting the group and helping it to flourish. Whilst living in caves, our gestures, grunts and groans gradually became more sophisticated allowing us to share more detailed information that evolved into language. However, even today ninety percent of our communication (and therefore information) is still non-verbal. You can tell things about a person just by such things as their facial expressions, how they sit or move their body, their tone and volume of voice, the level of eye contact. We all sub-consciously and consciously do this to enhance the communication of our thoughts and feelings. It helps us to form relationships and friendships. Surely, acquiring information is the reason we send our children to school and why we study. We exchange information about our thoughts and our feelings when we socialise.
But when it comes to personal medical information this is, of course, a little bit different – or is it? Whilst many of us like to share some of this information, there may be some aspects that we feel we want to keep to ourselves. Of course, it is our right to keep that information to ourselves if we wish, or to tell a trusted person in confidence.
Medical data is just bits of information held electronically. But information, when held as data, can be easily shared with others for both benefit and, potentially, disadvantage. However, if that data is anonymised (in other words, all information that might identify us is removed) and it is added to information from thousands of other people, might we hold a more relaxed view? And if that data were only accessible by trusted people, authorised to access that information only for a very specific and approved purpose, should we have any substantial concern?
As current or future patients, we benefit from improved treatments and services because previous patients shared their medical information. Do we not, in turn, have a moral obligation to share our information to benefit our children, grand-children and future generations of humanity?
I believe that, provided the current legally required data security controls are in place and those that hold the data are open and transparent (about who data is accessed by and why), there is no logical reason why we should not share our anonymised medical data – for the benefit of us all.
I’ve yet to do much work with our main variables of interest, as we were only recently granted access to a few of the data sets we requested. However, while we were working on obtaining and waiting for access, we followed some side avenues, in part to prepare ourselves for working with the data, and in part because we thought of research questions that were interesting in their own right. For example, we are interested in how early life socioeconomic conditions, commonly represented by the father’s occupational social class, relate to mental health later on in life. However, our data set is based on the participants of the Scottish Mental Survey 1947; these individuals were all born in 1936, and because of World War II, reports of fathers’ occupations from censuses carried out during participants’ early lives are unreliable, not representative, and often missing. In order to improve on our data set, we dug deeper into the data we were aiming to link, pulling out additional, historical occupation information, and coding these data ourselves. This in turn led to a machine learning approach to classifying historical social class data, which can be used in the future by others working with such data. So it goes to show how much interesting, useful work you can wind up doing along the way!
The process is long and convoluted, seemingly at every turn. I was fortunate because I joined the project relatively late, although when I joined we thought we would have access to the data in a few months’ time, rather than two years later. I did what I could to help with the application processes, but ultimately this work predominantly falls on the shoulders of a single person, and most of one’s time in this area is spent not working on forms, but waiting for other people to get back to you.
A large amount of time and effort goes into processing and preparing data before linkage, but that does not mean that the data are clean and easy to work with once you get a hold of them. You are likely going to need to spend significant time cleaning and otherwise processing your data before you can analyse them.
There are advantages to having to lay out analyses in advance during the application process: essentially, this forces you to pre-register your work, which is an important step in doing reproducible science. However, a run-of-the-mill pre-registration has considerable flexibility, and this is not so much the case with the analyses we plan for our data. All output must be checked for privacy and security concerns, so if we want to tweak an analysis or run a sensitivity analysis, for instance at the request of a reviewer, every different analysis that we want to take out of the safe haven environment needs to be checked, and that process can take weeks.
You ought to think very carefully about timing; in particular, you ought to expect significant delays. If possible, try to plan for multiple scenarios, and make sure you have meaningful work you can do while you wait out the access process. The processes for accessing data are supposedly being streamlined and improved, but it is worth investing in your relationships with the people along the data access pipeline, as they are best placed to help you manage your expectations.
It can be a difficult and frustrating area to work in, but there are big potential payoffs, including large sample sizes and long-term follow-up, sometimes across many decades. These are types of data that sometimes cannot be obtained in any other way, and this allows for novel, meaningful research questions to be asked and answered.
If you want to jump to a specific question you can click on the questions below. Otherwise, just keep scrolling!
Patient data, for example hospital records and GP records, are collected as part of routine National Health Service (NHS) care. They constitute one of the largest sources of health data in existence. Over the years, researchers, policy makers and others have sought to harness their potential in carrying out evidence-based research, seeking to enhance our understanding of disease and improve patient care and service delivery. At no other time has using patient data for research been more in the spotlight than during the current COVID-19 pandemic.
As researchers, we have a duty to ensure that we recognise the individuals who sit behind that data. But even more than that, we should seek to involve patients in our research, because really, who understands what they have experienced better than them?
As an Early Career Researcher (ECR) working with patient data, and coming from a non-clinical background, appreciating the individuals behind the ‘numbers’ is not something that my training in Econometrics prepared me for. Of course, my primary motivation for pursuing my career in health research is to make a difference to individuals’ lives. Nonetheless, it is all too easy to become so buried in the methods, producing fancy charts and output displaying significance stars, that the people behind the numbers become blurred in the background.
The use of patient data by social scientists and ECRs, who are often limited in resources, contacts and time, is becoming more common. With this comes an increased need to ensure that those researchers know how to recognise and include the patient voice in their research, and how to be transparent about their uses of patient data.
What follows are some questions and answers from UPD’s Communications Officer Grace Annan-Callcott, who kindly agreed to talk to eCRUSADers about using patient data in research and in particular about public/patient engagement.
How much and what sort of public/patient engagement work does UPD do?
Grace pointed me to a couple of recent things they have been working on. Firstly, the Fair Partnerships Report, one of UPD’s “largest pieces of engagement work to date, which looked into what the public thinks about different kinds of businesses and organisations using NHS data”.
So, how do patients feel about the use of their data?
The Fair Partnerships work was a mixed methods public engagement programme consisting of round table discussions, citizens’ juries and an online survey (completed by just over 2,000 adults from across the UK). A key finding of the report was that “all NHS data partnerships must aim to improve health and care”. I believe this point will resonate with many ECRs, who often have difficulty answering the question “How will your research benefit the public?” Will our PhDs or first post-doc research projects actually translate into patient/public benefit? We can get ourselves all worked up when writing applications to use patient data, trying to demonstrate and perhaps exaggerate the public/patient benefit of our research. Could making false promises undermine trust further?
Whilst we are entirely motivated by the hope that our early career research will translate into public/patient benefit, it is likely that it will not, at least not to begin with. But as ECRs working with administrative health records, we discover things that we did not set out to, we develop skills in analysing complex data sets, we generate new research questions, all of which could have patient/public benefit in the future. That being said, the responsibility lies with us to be both realistic and transparent about the aims of our research and the potential public/patient benefit that it could have. After we have carried out our research, we must be transparent and document what we have learned and how that learning will go on to contribute towards patient/public benefit at a later stage. We need today’s ECRs to be trained in analysing patient data, otherwise tomorrow’s patient/public benefit might not emerge.
Have you done any public/patient engagement with Scottish patients?
It is great to hear that UPD are hoping to do work with Scottish patients. I am not aware of any groups in Scotland who are carrying out similar work with public/patients across the board (do get in touch if you are!). For now, can we assume the views from the Fair Partnerships participants would also hold for the Scottish population? As Research Data Scotland (RDS) looms on the horizon, it appears Scotland has much further to go in terms of gathering views from the public on how their data is used.
Should all researchers working with administrative health data do public/patient involvement?
In an ideal world, we would carry out public/patient involvement in our PhDs and post-docs. However, ECRs may have limited contacts, resources and time, meaning it might not be feasible to do so. In particular, if you are working with a large national dataset, would it be realistic to capture representative views of the country on how you plan to use their data?
Well maybe not, but there are other things we can do. For one, Grace pointed out that “use MY data have created a data citation to help researchers acknowledge the contribution patients make to research”. This citation is a means to show gratitude to patients for allowing researchers access to their data, as well as enhancing the visibility of that use.
Another thing that crossed my mind was getting someone you know, with no knowledge about the research you are doing, to read your research proposal. Can they see the public/patient benefit in what you are proposing to do?
The outbreak of COVID-19 has clearly pushed the use of patient data into the headlines and accelerated the use of patient data in research (see the OpenSAFELY project in England). I asked Grace whether UPD feel this presents an opportunity to demonstrate how we can safely and successfully use patient data in research, or a challenge to maintaining public trust in the use of their data.
Are there any other UPD resources that you would recommend to eCRUSADers working with Scottish administrative health data?
Thanks very much for taking the time to answer these questions, Grace. There’s clearly some great work going on at UPD and there is definitely a lot that researchers who are working with patient data can learn from that work. It would be great to see more public and patient engagement work on using patient data in Scotland; if anyone reading is familiar with any then do get in touch!
Look out for our next People Make Data post where we will be hearing from useMYdata.
My role involves a variety of tasks – however, primarily my role is the statistical reporting of trials run from within ECTU. I typically have up to eight active trials throughout the year. My role varies on these – I am Trial Statistician for approximately half of them, and the ‘reporting’ statistician for the other half. When I have my reporting statistician hat on, I’m responsible for the statistical programming and generating the analysis and results.
Since I joined ECTU in 2014, I have worked on three trials using administrative data. Two of them used solely routine healthcare data and the third one is running currently, based on a blend of routine data plus data captured within the trial.
The use of administrative data in the trials setting is definitely becoming more common since clinical trials are known to be expensive and time-consuming. The use of administrative healthcare data is viewed as a more efficient means of understanding the health of the population using readily available data. However, there is a trade-off in terms of the quality of the data being captured.
High-Sensitivity Troponin in the Evaluation of patients with suspected Acute Coronary Syndrome (High-STEACS) was a stepped-wedge, cluster-randomised controlled trial. In plain English this means…
It’s a relatively recent study design that’s increasingly being used to evaluate service delivery type interventions. The design involves crossover of clusters (usually hospitals or other healthcare settings) from control (standard care) to an alternative intervention until all the clusters are exposed to the intervention. This differs from traditional parallel studies, where only half of the clusters receive the intervention and the other half receive the control. This diagram helps to demonstrate the difference in designs:
The population of interest was patients presenting in hospital with heart attack symptoms. The trial sought to test a new high-sensitivity cardiac troponin assay against the standard care contemporary assay. Specifically, it tested whether the new assay could detect heart attacks earlier and with a more accurate diagnosis.
Stepped-wedge trials usually randomise at a cluster (hospital) level, rather than randomising patients individually, so this was the main difference from a standard trial. Patients were therefore enrolled in the trial rather than randomised into it. Standard trials require patient consent before randomisation, but in this context, individual patient consent was not needed because the randomisation was performed at hospital level. Appropriate approvals for consent were sought through the hospitals.
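To make the crossover pattern concrete, here is a minimal sketch of how a stepped-wedge schedule could be generated. It is an illustration only and not the High-STEACS randomisation procedure: the function name `stepped_wedge_schedule`, the hospital labels, the number of steps and the seed are all assumptions made for the example.

```python
import random


def stepped_wedge_schedule(clusters, n_steps, seed=0):
    """Return {cluster: [0 or 1 per period]}, where 0 = control and 1 = intervention.

    Hypothetical helper: clusters are shuffled (randomisation at cluster level)
    and then spread as evenly as possible across the crossover steps.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    n_periods = n_steps + 1  # period 0 is a baseline where every cluster is on control
    schedule = {}
    for i, cluster in enumerate(order):
        step = 1 + (i * n_steps) // len(order)  # period in which this cluster crosses over
        schedule[cluster] = [1 if period >= step else 0 for period in range(n_periods)]
    return schedule


# Made-up example: six hospitals crossing over in three steps.
schedule = stepped_wedge_schedule([f"hospital_{k}" for k in range(1, 7)], n_steps=3)
for cluster, row in sorted(schedule.items()):
    print(cluster, row)
```

Each row starts on control and ends on intervention, so by the final period every cluster has been exposed to the new assay. In a parallel design, by contrast, each row would be all 0s or all 1s.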
If patients presenting with heart attack symptoms at any of the hospitals were eligible for the trial (based on our pre-specified inclusion/exclusion criteria), then we had permission (at hospital level) to include them in the study and use their securely anonymised data.
Approximately 48,000 patients were enrolled from 10 hospital sites in NHS Lothian (3 sites) and NHS Greater Glasgow and Clyde (7 sites), over a period of just under three years.
We used a total of 12 distinct data sources, which were a combination of general administrative datasets and datasets more specific to our area of research from locally held electronic health care records. Prescribing data were obtained from the Prescribing Information System; we also obtained ECG data and general patient demographics. Trial-specific outcome data were obtained from the Scottish Morbidity Record (SMR01) and also from the register of deaths (National Records of Scotland).
All data were captured separately for each Health Board – there is currently no amalgamated data source which holds all data. Health Boards are the owners of their own data.
The main linking mechanism for these 12 data sources was the patient CHI (Community Health Index) number. To ensure patient anonymity, CHI numbers were securely encrypted prior to use.
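As a rough illustration of linking on a protected identifier, the sketch below replaces each identifier with a keyed hash (HMAC-SHA256) and joins two small record sets on the result. This is a swapped-in technique for illustration: the source does not say how CHI numbers were encrypted, and the key, the function names `pseudonymise` and `link`, and all values shown are made up.

```python
import hashlib
import hmac


def pseudonymise(chi, secret_key):
    """Map an identifier to a stable pseudonym using a keyed hash (illustrative only)."""
    return hmac.new(secret_key, chi.encode("utf-8"), hashlib.sha256).hexdigest()


def link(left, right, key="pseudo_id"):
    """Left-join two lists of dict records on a shared pseudonymous key."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    linked = []
    for rec in left:
        # Records with no match on the right are kept unchanged.
        for match in index.get(rec[key], [{}]):
            linked.append({**match, **rec})
    return linked


# Made-up key and identifiers; in practice the key would be held by a trusted third party.
SECRET = b"example-key-not-for-real-use"
demographics = [{"pseudo_id": pseudonymise("0000000001", SECRET), "age": 70}]
prescribing = [{"pseudo_id": pseudonymise("0000000001", SECRET), "drug": "aspirin"}]
print(link(demographics, prescribing))
```

Because the same identifier always maps to the same pseudonym under a given key, records from different sources can be joined without the analyst ever seeing the original number.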
Approvals were required at a number of levels. We required ethics approval, approval to use patient data without consent and Health and Social Care approval (through the Privacy Approvals Committee, predecessor to the Public Benefit Privacy Panel). There were also health board specific approvals required for local data to be released. In addition, we required data supplier approval. Finally, approval was needed for the data to be hosted on the Safe Haven platform.
This process was long! This was ongoing throughout the duration of the trial. Although the data was being captured automatically via routine records, the final dataset wasn’t confirmed until relatively late on in the process due to complexities of mapping locally held healthcare records. One of the advantages of the national datasets is that they are the same across all health boards.
Datasets from NHS Lothian and NHS GG&C were supplied separately in their own Safe Havens. The combined dataset was hosted on the NHS Lothian Safe Haven space on the National Safe Haven analysis platform.
The data sources from both health boards were combined and hosted on the National Safe Haven analysis platform. This wasn’t a straightforward process. Although we’d anticipated capturing exactly the same patient data across both health boards, the reality was quite different.
Data were captured in different formats with different variable names and different definitions. So there was an unexpected element of data cleaning required before the data could effectively be merged into one large analysis dataset.
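The kind of cleaning described above can be sketched as a mapping from each board's local conventions onto one shared schema. Everything in this example is hypothetical: the board labels, variable names, date formats and sex codes are invented to show the pattern, not taken from the trial's data.

```python
from datetime import datetime

# Hypothetical per-board mappings from local variable names to a shared schema.
COLUMN_MAPS = {
    "board_a": {"PatientID": "pseudo_id", "AdmDate": "admission_date", "Sex": "sex"},
    "board_b": {"pat_id": "pseudo_id", "date_of_admission": "admission_date", "gender": "sex"},
}
# Hypothetical date formats and coding schemes used by each source.
DATE_FORMATS = {"board_a": "%d/%m/%Y", "board_b": "%Y-%m-%d"}
SEX_CODES = {"1": "male", "2": "female", "m": "male", "f": "female"}


def harmonise(record, board):
    """Rename variables, parse dates and recode values into the shared schema."""
    out = {"board": board}
    for local_name, value in record.items():
        shared_name = COLUMN_MAPS[board].get(local_name)
        if shared_name is not None:  # variables outside the shared schema are dropped
            out[shared_name] = value
    out["admission_date"] = datetime.strptime(out["admission_date"], DATE_FORMATS[board]).date()
    out["sex"] = SEX_CODES.get(str(out["sex"]).strip().lower(), "unknown")
    return out


rows = [
    harmonise({"PatientID": "x1", "AdmDate": "03/02/2014", "Sex": "1"}, "board_a"),
    harmonise({"pat_id": "y7", "date_of_admission": "2014-02-03", "gender": "F"}, "board_b"),
]
print(rows)
```

Keeping the mappings in plain dictionaries makes them easy for the clinical team to review, which matters when definitions (not just names) differ between sources.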
The final linkage was done using the securely encrypted CHI number for each patient.
Use of administrative data in this context is a more efficient process: fewer resources are spent on the administrative aspects of trial enrolment, e.g. capturing demographic details such as age, sex, postcode or medical history.
Using administrative data also gave us the opportunity to research a large representative patient population, in comparison to the setting of an RCT, where a strict pre-specified population, not necessarily representative of the target population, is studied.
From the data side of things, ensuring the correct data was extracted was difficult. The diagram above is a very over-simplified view of what happened! The reality of picking up the required variables from two separate health boards which capture data very differently was difficult.
Another challenging aspect was ensuring that a patient wasn’t enrolled more than once in the study. Patients can present in any hospital with heart attack symptoms more than once, so we needed to ensure they weren’t included in the study each time they came to hospital. This required a de-duplication algorithm using encrypted and de-identified patient data.
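One simple way to de-duplicate is to keep only each patient's earliest presentation, keyed on the pseudonymous identifier. The sketch below assumes that rule for illustration; the source does not describe the actual algorithm, and the field names `pseudo_id` and `presented_on` and the dates are made up.

```python
from datetime import date


def first_presentation(presentations):
    """Keep one record per pseudonymous patient ID: the earliest presentation.

    Assumed rule for illustration; a real trial would follow its protocol's definition.
    """
    earliest = {}
    for rec in presentations:
        pid = rec["pseudo_id"]
        if pid not in earliest or rec["presented_on"] < earliest[pid]["presented_on"]:
            earliest[pid] = rec
    return sorted(earliest.values(), key=lambda r: (r["presented_on"], r["pseudo_id"]))


# Made-up records: patient p1 presents twice, possibly at different hospitals.
records = [
    {"pseudo_id": "p1", "hospital": "A", "presented_on": date(2014, 5, 2)},
    {"pseudo_id": "p2", "hospital": "B", "presented_on": date(2014, 1, 9)},
    {"pseudo_id": "p1", "hospital": "C", "presented_on": date(2013, 11, 30)},
]
for rec in first_presentation(records):
    print(rec)
```

Because the key is the pseudonymous identifier rather than the hospital, repeat presentations are caught even when they occur at different sites.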
However, I think the biggest challenge was for those in the team tasked with obtaining the correct approvals. We underestimated how complex this would be. While approval for the national datasets was straightforward and the eDRIS team were very helpful, processes for locally held data were not established at the time of trial set-up. Legislation around patient data confidentiality was continually changing, so we had to keep abreast of new legislation as time progressed. The safe haven networks are now more established and, hopefully, the processes are more straightforward.
I think the data validation aspect of the trial is crucial. Ideally we would have had more time spent on this in order to ensure the data was as correct as possible. Involving the clinical team much sooner in this process would have helped – they have a really important role to play in terms of ensuring the data picked up makes sense from a clinical perspective.
For High-STEACS, the access to the data was highly restricted and did not include the clinical team. Many of the data discrepancies were only picked up at the final review stage once data and results had been released out of the Safe Haven area.
Working within the Safe Haven environment creates time lags on both sides of the process – data being imported into the Safe Haven and also results exported out at the end take time. We hadn’t considered this time lag when working to tight timelines.
The High-STEACS trial was directly followed by the HiSTORIC trial, addressing similar research questions and using many of the same data sources. So we have been through the loop again which has made for a more streamlined process.
Other trials within ECTU are also making use of the learnings from High-STEACS, particularly from the governance and approvals side of things.
Thanks for sharing this with us Catriona! It is great to see that administrative data are being utilised alongside clinical trials in Scotland. It is also interesting to hear that despite being part of a trials unit like ECTU, the High-STEACS team still faced many of the same challenges that we eCRUSADers have experienced when using administrative data for research. In particular, we can relate to the issues of permissions, timing and working within the Safe Haven environment. Overall, it seems that the timing issues were due to the use of the locally held data rather than using the national data.
Using the linked data set described above, the focus of my research has been investigating the association between multimorbidity (more than one long-term condition) and social care receipt. I am also analysing interactions between health and social care services, with a particular interest in unscheduled care.
Good social care data has been difficult to come by in the past – not just in Scotland, but internationally. I have been lucky to be one of the first group of researchers to get access to the Social Care Survey collected by the Scottish Government in a format that can be linked to health-based data sources.
So far, provisional results show us that increasing age and severity of multimorbidity are associated with higher social care receipt. This was anticipated, but we have never been able to show it empirically before the cross-sectoral linkage.
We have also been able to describe the receipt of social care by socioeconomic position (SEP) using the Scottish Index of Multiple Deprivation (SIMD). This is new and, to my knowledge, hasn’t been described elsewhere on such a large scale. Here we find that those with lower SEP are more likely to receive social care. (All these patterns are shown in the figure below). However, due to a lack of good measures, we can’t tell if the provision of care matches need for care.
My latest piece of work has been looking at whether receipt of social care influences unplanned admission to hospital. Using time-to-event (survival) analysis we can see that, for those over 65, people who receive social care are twice as likely to have an unplanned admission (again these results are provisional at the moment).
The barriers I have faced are, no doubt, similar to those faced by others using linked data; the main one is time. Approvals, extraction, linkage etc. all take considerable time and as a researcher you are not in control of these timescales. A good example is a sub-project for my PhD, which was to use social care data from one local authority area only. The council in question were exceptionally helpful and keen to share data. They were very patient whilst I organised ethics and approvals on the academic side. However, by the time I was ready to talk data sharing agreements they had operational pressures (specifically the 2017 local elections) which tied up their legal team. After this we were all hopeful about making progress, but a certain Prime Minister went for a walk in the woods at Easter and decided to call a general election! Cue another 6-week delay until the legal team could start negotiating an agreement. We eventually got there, but this illustrates that the data controllers are at the mercy of higher forces as well and it is impossible to set meaningful deadlines.
I am very fortunate to be in a position to keep working with my PhD data in my current role and keep asking questions of the large amount of data we have. However, I have moved university in order to do this. This means I now have to repeat the process of ethics, data sharing agreements, privacy impact assessments etc. This is absolutely necessary as my current employers need to make sure that all legal aspects are covered, but there is nothing more soul-destroying than recreating the (significant) amount of work that goes into the required forms (initially completed two years previously). Fortunately, work is afoot at the Scottish Government to make this process obsolete and centralise access to research data sets. However, this is still in its early stages and we are currently unsure as to when this will be operational or what exactly will be available. For now, the pain must endure!
Although there are difficulties in using administrative data for research purposes and delays can be frustrating at times, it is still (incredibly) a really rewarding process. The ability to gain new insights from previously unseen data is something that should excite any researcher. More importantly, data linkage offers the potential to improve society by answering questions that can’t be asked with traditional methods. Well worth an extra ethics form (even if I grumble about it!).