CIM students use a range of software in class and for their research projects, including the programming language R. Whether students are scraping data from websites, modelling spatial patterns or exploring data through graphs and maps, R has become an essential toolbox that supports the demands of data processing, analysis and visualisation. It’s not just CIM that uses R. A wide range of potential employers use R to make sense of data, from businesses and government to charities and the media. R is used at Google and Facebook, and also at media organisations such as the BBC.
We spoke to Christine Jeavans and colleagues in the BBC Data Journalism team to find out more about how they use R, and other software, in their work. CIM students are keen to learn software and coding skills that are relevant to careers in a world influenced by data. Our modules use R, but also software such as QGIS for spatial analysis and Gephi for network visualisation. Christine and her team have some excellent insights into working with data and how to get started (and how to keep going!).
CIM – What software do you use at the BBC?
We use several data analysis and visualisation tools in the BBC Data Journalism team. These include R for analysis, mapping and charts; QGIS and Carto for maps; Excel, Tableau, Python and other tools. We also work with designers and web developers for bespoke visualisations and interactive content.
CIM – How does working with R and code assist your work at BBC News?
In lots of ways: it means a much more reproducible workflow as all the code and data can be reused and reviewed, which makes us better at checking each other’s work and saves us heaps of time when a project needs updating.
With the ggplot2 package we can make production-ready charts in the same workspace we analyse our data in, which is a more convenient way of working, and again, the charts are more easily reproducible.
Code documents each step of the analytic process. Each line of code captures the steps followed to produce a given output (graphic, data table etc). This makes the process of passing a project between team members seamless.
It also provides scalability; a script written to process a single data file can be leveraged with minimal effort to be run over hundreds of similar data files. And we can automate scripts to perform repetitive tasks such as checking election results as they come in.
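To illustrate the scalability point, here is a minimal sketch in Python, one of the languages the team mentions. It is not the BBC team's code: the file names, the `votes` column and the function names are all made up for the example. The idea is simply that a function written for one data file can be applied unchanged to every file in a folder.

```python
import csv
import tempfile
from pathlib import Path


def summarise_file(path):
    """Return the row count and the total of the (hypothetical) 'votes' column for one CSV file."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {
        "file": Path(path).name,
        "rows": len(rows),
        "total_votes": sum(int(row["votes"]) for row in rows),
    }


def summarise_folder(folder):
    """Apply the single-file function to every CSV file in a folder, in name order."""
    return [summarise_file(path) for path in sorted(Path(folder).glob("*.csv"))]


def write_demo_files(folder):
    """Create two tiny made-up CSV files so the sketch can run anywhere."""
    (Path(folder) / "area_a.csv").write_text("party,votes\nRed,10\nBlue,20\n")
    (Path(folder) / "area_b.csv").write_text("party,votes\nRed,5\nBlue,7\nGreen,1\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_folder:
        write_demo_files(demo_folder)
        for summary in summarise_folder(demo_folder):
            print(summary)
```

Going from two files to hundreds would only mean pointing `summarise_folder` at a bigger folder; the single-file function stays the same.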
CIM – Many of our students will learn to use R for the first time at CIM. How did you learn to code?
As someone with a solid spreadsheet background, I found the book R for Excel Users useful as an introduction to thinking in terms of R and “translating” existing Excel skills.
R for Journalists also tackles learning R from a journalist’s point of view rather than a programmer’s, which is pretty useful, and has a lot of projects which I think are good as inspiration. Also, I’d really recommend the R package called swirl, which teaches you how to use R in R. You can give it a go at https://swirlstats.com/.
I learned all I know on the job. I had covered slightly beyond the basics of Excel at university but after a few introductory sessions at work, taught by a colleague who did know how to code, the rest has been self-taught, solving problems as they arise using Stack Overflow (https://stackoverflow.com/questions/tagged/r), other resources and helpful colleagues.
Bookdown (https://bookdown.org/) now hosts a lot of good free R books by renowned contributors such as Hadley Wickham, an avid R package author, and Roger D. Peng, a machine learning expert, to name just a couple.
CIM – There are a lot of great suggestions there and a lot of free resources. It sounds like the R package called ‘swirl’ could be a great place to start. Our students take introductory workshops in R and the teaching materials are freely available at https://warwick.ac.uk/fac/cross_fac/cim/people/james-tripp/teaching/rworkshop.
CIM – Learning to code can be challenging! What would you tell CIM students when the going gets tough?
If you’re new to coding there are a lot of new concepts which can feel very alien, and it takes lots of repetition before these concepts sink in properly. Don’t worry if you feel like you can’t understand what you’re doing when following a tutorial. Just keep repeating the steps and understanding will come in time.
Also, don’t worry if you can’t remember how to do things – that’s what Stack Overflow is for. You don’t need to memorise every function, you just need to know that something is possible to do.
Just keep going. Stack Overflow is your friend. Most importantly do not be intimidated by red error messages. Befriend them. They are your only guide as you try to debug your scripts and seek help when things don’t work as you expect.
Always think back to how far you have come and what you have actually learnt. There is never a stage where you feel you know all there is to know and are on top of things. Knowing that there are so many things I still do not understand is something I have had to come to terms with.
Don’t try to learn too much at once and have in mind where you want to be in a week and what it will take to be there. The best way I have found to learn is learning to make things or create projects, not learning in an abstract way.
It is great that R is modular and you are working in chunks of code, so you gradually build up your knowledge. Try to solve a specific problem or visualisation and just keep chipping away at it.
Make sure you write descriptive notes in your #comments so that when you come back to a project you can understand what you were doing. It does get easier!
CIM – Students studying ‘Urban Science’ use GIS for spatial analysis and digital maps. How does the BBC use R for mapping?
We’ve begun mapping in R from start to finish. These are mainly (but not exclusively) choropleth maps which shade or colour regions in maps to show data values (see below). We’ve managed to speed up the process through style templating and storing commonly-used shapefiles in a shared drive.
The basic process involves loading, simplifying and preparing the shapefiles, joining them to the data we want to map, and producing the map polygons through ggplot. Once labels and annotations are added, the map is saved as a .png or .pdf file using functions which also add the BBC logo and styling.
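The join step at the heart of a choropleth can be sketched in a few lines. This is a hedged illustration in Python rather than the team's R and ggplot workflow: the region codes, names, values, class breaks and colours are all invented, and the drawing of the polygons themselves is left out. It shows only how data values are matched to regions by a shared code and then binned into colour classes.

```python
from bisect import bisect_right

# Hypothetical lookup of region codes to names, standing in for a shapefile's attribute table.
REGION_NAMES = {"R1": "Northshire", "R2": "Eastfield", "R3": "Westmoor"}

# Hypothetical data values keyed by the same region codes.
VALUES = {"R1": 4.2, "R2": 11.8, "R3": 7.5}

# Assumed class breaks and fill colours: below 5, from 5 up to 10, and 10 or more.
BREAKS = [5, 10]
COLOURS = ["#fee8c8", "#fdbb84", "#e34a33"]
NO_DATA_COLOUR = "#cccccc"


def colour_for(value, breaks=BREAKS, colours=COLOURS):
    """Pick the fill colour of the class that the value falls into."""
    return colours[bisect_right(breaks, value)]


def join_and_shade(region_names, values):
    """Join values to regions by code and attach a fill colour to each region."""
    shaded = []
    for code, name in region_names.items():
        value = values.get(code)  # None when a region has no matching data row
        fill = colour_for(value) if value is not None else NO_DATA_COLOUR
        shaded.append({"code": code, "name": name, "value": value, "fill": fill})
    return shaded


if __name__ == "__main__":
    for region in join_and_shade(REGION_NAMES, VALUES):
        print(region)
```

Regions with no matching data row get a neutral grey here, which makes failed joins easy to spot before a map is published.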
Producing maps in R has several advantages: We can do our data analysis and map-making all in the same place, meaning data doesn’t have to be prepared in a different program before being sent to mapping software.
By exporting from R to pdf, we can also quickly get maps produced as Adobe Illustrator/Photoshop files ready for language translation if necessary.
CIM – In our ‘Big Data Research’ module, one of the coding labs uses R for web scraping. How do you use R to gather and organise data?
Everyone on the team probably uses R differently, but in terms of data gathering I use packages like rvest to scrape data. Otherwise we all use R a lot for data manipulation, so packages like dplyr and tidyr are extremely useful.
We use Python and R for getting data from various sources including websites and APIs, which are interfaces that allow you to connect R to other applications or data resources. APIs usually take the least effort. The choice of which language to use depends on the team member’s fluency with a given language. R is usually the language of choice, given that more team members use it.
When using the Python language we use a package called Beautiful Soup for parsing HTML pages. The equivalent in R is the rvest package.
The process of scraping websites with either R or Python involves a level of understanding of how HTML is structured. We use tools like the DOM inspection functionality provided by the browser’s console (you can press F12 in Google Chrome or Internet Explorer to see what this looks like). This enables us to learn the HTML tags to target when doing our scraping.
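As a small illustration of targeting HTML tags, the sketch below pulls values out of table cells with a given class. To keep it runnable without installing anything, it uses Python's built-in `html.parser` module instead of Beautiful Soup or rvest. The sample HTML, the class names and the helper names are all invented for the example.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a page you have inspected in the browser.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Team</th><th>Shots</th></tr>
  <tr><td class="team">Reds</td><td class="shots">12</td></tr>
  <tr><td class="team">Blues</td><td class="shots">9</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "td" and dict(attrs).get("class") == self.target_class:
            self.inside_target = True
            self.values.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a targeted cell.
        if self.inside_target:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_target = False


def extract_cells(html, target_class):
    """Return the stripped text of all matching cells, in page order."""
    parser = CellCollector(target_class)
    parser.feed(html)
    return [value.strip() for value in parser.values]


if __name__ == "__main__":
    print(extract_cells(SAMPLE_HTML, "team"))
    print(extract_cells(SAMPLE_HTML, "shots"))
```

In practice the tag and class to target would come from inspecting the real page in the browser, as described above, and a dedicated scraping package would handle messier HTML more gracefully.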
R is not the only thing we use to gather data, conduct analyses or visualise. Microsoft Excel and other spreadsheet programmes are always handy and we also use Tabula to extract data tables from PDFs.
For visualising geographic data, the team is experienced in using open source mapping software QGIS, which we have used to produce maps for many news stories.
I use R to conduct data analysis and gather data, as well as to write and share code with colleagues with whom I am collaborating on projects. A lot of the work I do in R uses the tidyverse set of packages, but not exclusively. I also use ggplot to visualise data once it has been cleaned and analysed.
CIM – What challenges and opportunities do you face for visualising your data? How does R help? CIM students use, investigate or create visualisations across many of our modules to understand how we interact with data, information, media and ‘facts’.
We use R for exploration, so for example to try to identify patterns in the data and to see if we can spot interesting points that we should investigate. So being able to visualise our data really helps us to find stories.
In the past, when styling charts in Adobe Illustrator after doing analysis in programs like R, we’d have a problem with elements moving around when resizing the image. If we make something start to end in R, we can be confident that the data is going to be plotted how we intended it to be.
A good example is this piece that we made for the World Cup: https://www.bbc.co.uk/sport/football/44388118. Without being able to do all the styling in R, we would have had a job on our hands making sure that the plot of where the shots came from was placed on the pitch-map template in exactly the right place.
CIM – Thank you for your time, it was great to talk to you about what you do and how you do it. There are lots of useful tips, resources and inspiration that our students can put to good use. Thanks again!