Course Round Up: An Introduction to Data Science for Administrative Data Research
Dates of course: May 2021
Organised by: Scottish Centre for Administrative Data Research (SCADR)
Post summary: I was lucky enough to bag myself a spot on this year’s SCADR course “An introduction to data science for administrative data research”. In this post, I present an overview of the course and its content. I also share my thoughts on the parts I found most useful for fellow eCRUSADers, as well as where I felt things were missing.
Overall, the course is designed to introduce researchers to the world of administrative data, with a particular focus on Scotland. The course includes a series of lectures and practical sessions, allowing participants to get some hands-on experience working with a synthetic administrative dataset.
The course was very well structured and organised, the content was extremely useful and the instructors were incredibly knowledgeable. For anyone starting out in research using administrative data, I would most definitely recommend it.
Week 1: Week 1 kicked off with an introduction to the course, its learning outcomes and administrative data. Chris Dibben, SCADR Director, welcomed the participants and gave a very useful summary of the history of research using administrative data in Scotland and the UK. He also outlined the key stages a researcher will go through when carrying out research using administrative data (see figure below). The following lectures, delivered by experienced SCADR researchers, covered some of the different sources of administrative data in the UK, the benefits and limitations of working with administrative data, and the Five Safes Framework. There was also a brief introduction to programming in R to assist with the practical sessions.
Week 2: The focus of week 2 was to introduce participants to some of the administrative datasets available for research in Scotland (and the UK). These included:
Scottish Government Education and Analytical Services data and the education datasets (Scotland focus)
Department for Work and Pensions data (UK focus)
National Records of Scotland data including Scottish Longitudinal Study (Scotland focus)
Health data (Scotland focus)
Each lecturer discussed the datasets available, the type of information they contain and how to access them, and provided links and email addresses for finding out more.
A further lecture in week 2 covered record linkage, the different methods of linkage and some of the implications of incorrect linkage on research. The final two lectures to assist with the practical sessions looked at working with dates and times, and indexing, linking and joining data.
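To give a flavour of why linkage errors matter, here is a minimal sketch comparing exact-match (deterministic) linkage with a simple similarity-based approach. It is written in Python using only the standard library, whereas the course itself used R. The records, field names, function names and the 0.9 threshold are all made up for illustration, and this is not the method taught on the course; real linkage work often relies on more sophisticated (for example probabilistic) techniques.

```python
from difflib import SequenceMatcher

# Hypothetical records from two sources that share no common identifier.
education = [
    {"name": "Isla McDonald", "dob": "2003-04-12"},
    {"name": "Callum Reid", "dob": "2002-11-30"},
]
health = [
    {"name": "isla macdonald", "dob": "2003-04-12"},  # spelling variant
    {"name": "Callum Reid ", "dob": "2002-11-30"},    # trailing space
]


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def exact_link(left, right):
    """Deterministic linkage: pair records whose normalised name and dob agree exactly."""
    index = {(normalise(r["name"]), r["dob"]): j for j, r in enumerate(right)}
    pairs = []
    for i, rec in enumerate(left):
        key = (normalise(rec["name"]), rec["dob"])
        if key in index:
            pairs.append((i, index[key]))
    return pairs


def fuzzy_link(left, right, threshold=0.9):
    """Similarity-based linkage: same dob and name similarity at or above the threshold."""
    pairs = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            score = SequenceMatcher(
                None, normalise(a["name"]), normalise(b["name"])
            ).ratio()
            if a["dob"] == b["dob"] and score >= threshold:
                pairs.append((i, j))
    return pairs


exact_pairs = exact_link(education, health)
fuzzy_pairs = fuzzy_link(education, health)
```

In this toy example the exact match links only the second person, because the spelling variant in the first name blocks the match, while the similarity-based approach links both. Lowering the threshold would recover more true matches but also risks linking records that belong to different people, which is the trade-off behind the "implications of incorrect linkage" mentioned above.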
Week 3: Week 3 explored data provenance, law, trust and public engagement. The data provenance lecture, delivered by SCADR co-Director Iain Atherton, outlined the importance of understanding where your data come from, because ultimately this will affect what you get out of them. He also provided a useful outline of the steps researchers should take when working with administrative datasets.
The lecture on law, trust and public engagement was incredibly helpful for any researcher starting out in administrative data research (I would go as far as to say that attendance at this lecture should be mandatory for anyone who wants to access administrative data for research!). The session highlighted how important it is to develop a social licence when carrying out research on unconsented personal information. It also offered some useful questions to ask ourselves when thinking about public engagement. For example, to whom does your research relate? Who will be affected by the outcomes of your research? What conditions or social issues does your project explore? Do the findings have the potential to affect a large proportion of the public?
The final week 3 lecture explored data visualisation in R.
Week 4: The lectures in week 4 gave participants some insight into some existing research that has used administrative data in Scotland. These included Scotland’s maternity dataset (SMR02), linked survey and DWP data, drug consignment data, and Scottish Government data on looked after children.
Each of the speakers also talked about the main challenges they faced during those research projects and shared their words of wisdom for other researchers. The lecturers also pointed towards some published work using Scottish administrative data.
Week 1: The practical sessions started in week 1 with an introduction to the R environment. Note that the course is designed for people who have used R before. However, even if you have no experience in R, the course instructors were very willing to help and the instructions for carrying out the data cleaning and analysis were extremely thorough.
The practical sessions use a synthetic Not in Education, Employment or Training (NEET) dataset. This dataset was created to mimic the relationships between variables in the original NEET data, but the observations do not pertain to real individuals.
Week 2: The first part of week 2 focused on handling the date and time variables in the dataset, creating new variables and checking for anomalies, all of which are typical exercises when you work with administrative datasets. In the second part, participants had to convert the dataset from long to wide format, another common task when working with administrative datasets, and merge datasets together.
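For readers who have not met these steps before, the sketch below walks through them on a tiny made-up dataset: parsing dates, deriving a new variable, flagging an anomaly, reshaping from long to wide and merging in a second dataset. It uses Python's standard library rather than R (which the course used), and every record, column name, function name and cut-off date here is an assumption for illustration; none of it comes from the course materials or the synthetic NEET dataset.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical long-format records: one row per person per year.
long_rows = [
    {"id": "A", "year": 2018, "status": "in_education", "start": "2018-08-20"},
    {"id": "A", "year": 2019, "status": "neet", "start": "2019-09-01"},
    {"id": "B", "year": 2018, "status": "employed", "start": "2018-07-15"},
    {"id": "B", "year": 2019, "status": "employed", "start": "2091-07-15"},  # implausible date
]


def parse_start(row):
    """Parse the `start` string into a date and derive a new `start_month` variable."""
    parsed = datetime.strptime(row["start"], "%Y-%m-%d").date()
    return {**row, "start": parsed, "start_month": parsed.month}


def flag_anomalies(rows, latest=date(2021, 5, 31)):
    """Return rows whose start date falls after an assumed extraction cut-off."""
    return [r for r in rows if r["start"] > latest]


def long_to_wide(rows):
    """Pivot to one record per id, with one status_<year> column per year."""
    wide = defaultdict(dict)
    for r in rows:
        wide[r["id"]][f"status_{r['year']}"] = r["status"]
    return {pid: {"id": pid, **cols} for pid, cols in wide.items()}


def left_join(wide, lookup):
    """Attach lookup fields by id, keeping every record in `wide`."""
    return [{**rec, **lookup.get(pid, {})} for pid, rec in wide.items()]


parsed = [parse_start(r) for r in long_rows]
anomalies = flag_anomalies(parsed)
clean = [r for r in parsed if r not in anomalies]
wide = long_to_wide(clean)

# Hypothetical second dataset keyed on the same id; person B has no entry.
schools = {"A": {"school": "S1"}}
merged = left_join(wide, schools)
```

Note that dropping the anomalous row leaves person B without a 2019 status in the wide table, and the left join leaves B without a school. Gaps like these are exactly what you need to check for after reshaping and merging.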
Week 3: Week 3 was all about using visualisation to explore the data, look for patterns and check for anomalies.
Week 4: The final practical session was focussed on modelling the data and producing tables of output.
The course is most suitable for those who are new to the world of administrative data research, particularly in Scotland. However, the course would also be useful for anyone working with administrative data in the UK, as many of the lessons learned will translate.
8. How much does it cost and what time commitment is involved?
In 2021, the course cost £120 per person. The course runs over four weeks, with both lectures and practical sessions. In total, SCADR estimate that you will need around 4-6 hours per week to watch the lectures and read the teaching materials, and a further 3-4 hours per week to join the practical sessions.