Useful Software for Statistics PhDs
This document lists a number of computer tasks which are useful for PhD students in Statistics, and itemizes software solutions for each of the tasks. Acknowledgment: it was adapted from a list first produced by David Firth, Wilfrid Kendall and Ioannis Kosmidis in the early years of APTS operation. We hope you will find the list useful, not only because it gives an impression of the range of possible software solutions for each task, but also because you may find it helpful to note the tasks themselves as useful ways to help you get the most out of your time as a PhD student – and beyond. The list was current when it was prepared; no doubt the range of possibilities will change as time passes so please feel free to let us know of software which you have found useful.
Email · Calendar/Diary · Computation · Office software · Scientific document preparation · Bibliography · Graphics · Back-up and synchronization · Version control · Source code editors (SCE)/Integrated development environments (IDE) · Further tasks
We have organized each of the task-lists in four columns, corresponding to Windows, Linux (using Ubuntu as a prototypical distribution; it's a good choice for a first-time Linux user), Mac, and web applications. These columns correspond to the basic choice of operating system which one makes when setting up a computer system. Much has been written about this choice, some of it helpful. Here are some basic pros and cons:
- Windows. Pro: often already installed. Con: not free, not open-source; applications are often not well integrated with the command line; updating and adding new software is often clumsy.
- Linux. Pro: free, open-source; excellent software repositories make it easy to update and to add most software; unix-based so good command-line integration. Con: some say Linux can lack final design polish; not many games.
- Mac. Pro: excellent design; close linkage to hardware means it often “just works”; unix-based so good command-line integration. Con: not free, not open-source; close linkage to hardware means it is expensive.
- Web. Pro: doesn't depend on operating system; available via web connection wherever you are; someone else's responsibility to backup, maintain, and update. Con: is the privacy policy acceptable to you? How secure is the application in question? No access when offline?
Be aware of mix-and-match possibilities. Dual- or even triple-booting allows something of the best of all worlds. Linux systems can use Wine to run Windows applications within a Linux environment (often a good solution for business software, though it can be rather less successful for games). The VirtualBox application allows one to run complete operating systems within other operating systems. Windows users can install Cygwin to access a huge number of Unix programs, well integrated into a Unix-style command line. But this is enough about operating systems; let us consider some of the fundamental tasks which you will need your computer to perform.
We have noted below where software is free, and where it is open-source. Free software has the obvious advantage of zero-cost; however a further advantage is that upgrades are usually free(!) and it can be installed on several machines without concern for licence conditions. Open-source software has some security advantages; in principle security holes can be publicly exposed and assessed, while the public availability of source-code gives some assurance that the software is really doing what it says it is doing. (However, both these advantages are often overstated.) We have not given web links for these software packages. As ever, Google is your friend!
Email
Windows | Linux | Mac | Web |
Outlook Express | Evolution (free; open source) | | Gmail (free) |
Outlook | kmail (free; open source) | Microsoft Entourage | |
Thunderbird (free; open source) | Thunderbird (free; open source) | Thunderbird (free; open source) | |
Also: Alpine (free; open source) and Mutt (free; open source) |
The first three columns list email clients, which can be used to read and / or download your email from your university or other email provider. It is often convenient to mix and match between the Web column and one of the other columns, as most email clients can also be configured to download email from a Web email application.
Calendar/Diary
Windows | Linux | Mac | Web |
Outlook | Evolution (free; open source) | iCal | Google Calendar (free) |
 | Korganizer (free; open source) | Microsoft Entourage | |
Thunderbird (free; open source) | Thunderbird (free; open source) | Thunderbird (free; open source) | |
Do not underestimate the usefulness of running a computerized calendar / diary for forthcoming events! Good choices can email you reminders, display appointments for the month ahead (to check for deadline clashes), and serve as a record of what you did and when (vital at some later stage when you have to write activity reports). The habit of maintaining a calendar will serve you well throughout your professional life.
Computation
Windows | Linux | Mac | Web |
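Mathematica | Mathematica | Mathematica | Wolfram alpha |
Matlab | Matlab | Matlab | |
Maple | Maple | Maple | |
R (free; open source) | R (free; open source) | R (free; open source) | |
Python (free; open source) | Python (free; open source) | Python (free; open source) | |
Also: octave (like matlab) and maxima (like maple/mathematica), both free and open source |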
R is by far the most widely used of these by statisticians and consequently has an excellent range of available packages. Python is good for scripting and can interface with R; it is more widely used outside of statistics than is R. In general, computational speed-ups can often be obtained by using libraries and functions which work on whole matrices and vectors, rather than on individual numerical quantities (and for more experienced programmers the Rcpp and RcppArmadillo packages can be invaluable). Note that useful environments for R exist: in particular the celebrated cross-platform editor Emacs (free; open source) can supply a very good R environment.
There are many other possibilities too. Do not underestimate the merits of open-source software for scientific computation. Eventually you are likely to come to the point of needing to know exactly how a certain calculation is achieved, and then the ability to inspect the source-code can be invaluable.
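The advice above about working on whole vectors rather than individual numerical quantities can be illustrated with a minimal Python sketch. It uses only the standard library, and the function names and test data are invented for illustration. In practice R's vectorized functions or a Python array library would play the role that map() and math.fsum() play here.

```python
import math
import operator


def dot_elementwise(xs, ys):
    """Dot product accumulated one numerical quantity at a time."""
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_whole_vector(xs, ys):
    """Dot product expressed as operations on the whole vectors.

    map() and math.fsum() consume the sequences directly, so no explicit
    index loop is written in Python.
    """
    return math.fsum(map(operator.mul, xs, ys))


# Illustrative data: xs[i] * ys[i] equals i exactly for every i.
xs = [0.5 * i for i in range(1000)]
ys = [2.0 for _ in range(1000)]
print(dot_elementwise(xs, ys), dot_whole_vector(xs, ys))
```

Both functions return the same value. The second hands the iteration to built-in functions, which is the habit that tends to pay off when moving to matrix and vector libraries.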
Office software
Windows | Linux | Mac | Web |
Microsoft Office | | Microsoft Office | Google docs (free) |
 | | iWork | |
Libreoffice (free; open source) | Libreoffice (free; open source) | Libreoffice (free; open source) | |
If careful attention is paid to installing and working with the correct fonts, then Libreoffice can be a very successful substitute for non-free alternatives.
Scientific document preparation
Windows | Linux | Mac | Web |
Miktex (free; open source) | TexLive (free; open source) | mactex (free; open source) | overleaf (free) |
winedt (shareware) | Kile (free; open source) | TexShop (free; open source) | |
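emacs [with various addons] (free; open source) | emacs [with various addons] (free; open source) | emacs [with various addons] (free; open source) | |
lyx (free; open source) | lyx (free; open source) | lyx (free; open source) | |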
Get used to using latex to prepare scientific documents. Any document containing any amount of mathematics will look far better in latex! Using pdflatex and a latex package such as beamer, you can prepare pdf-based presentations which match and surpass PowerPoint in quality, and are easily accessible to others. Finally, a latex environment (as listed in the last three rows for the three operating systems) pays dividends (a) when you can't remember the exact latex command, (b) when you want access to a template to start a new document, (c) when spell-checking, and (d) at the compilation stage, when the environment can take care of the multiple latex runs needed to resolve labelling issues and so forth.
Overleaf is a relatively recent innovation which allows for collaborative writing of latex documents. However, you should check that your collaborators wish to use such a thing: they may be less enthusiastic and many other good ways of collaborating exist, particularly by making use of version control software.
Bibliography
Windows | Linux | Mac | Web |
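bibtex (free; open source) | bibtex (free; open source) | bibtex (free; open source) | MR Lookup (free) |
jabref (free; open source) | jabref (free; open source) | jabref (free; open source) | Google Scholar (free) |
Mendeley (free) | Mendeley (free) | Mendeley (free) | |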
Speaking generally, get used to using bibtex to keep track of references to scientific papers. This uses a bibtex source file (which lists scientific papers in a curious format) to insert references and bibliography in a latex document. In practice you can build up a single bibtex source file containing all the articles you ever read, and use this again and again as you write various papers. The bibtex format is cumbersome; however you can typically download bibtex entries directly from the web (MR Lookup, Google Scholar, and other sources), and bibliography managers such as Mendeley and Jabref can be used to build databases with graphical interfaces linking directly to PDFs of the relevant articles on your computer.
Note also the major online bibliographic databases that can export to BibTeX format, notably ISI Web of Knowledge (subscription service, most major universities subscribe; translation to BibTeX via a published Perl script, isi2bibtex) and Current Index to Statistics (subscription service, but with free access to records older than about 5 years), as well as Zetoc (free for scholars at a wide range of institutions) and Google Scholar (free).
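To give an impression of the "curious format" of a bibtex source file, here is a small Python sketch that prints one entry. The helper function, the citation key and the reference itself are all hypothetical placeholders, not a real publication; in practice you would download the entry or let a bibliography manager write it.

```python
def format_bibtex_entry(entry_type, key, fields):
    """Return a BibTeX entry as a string.

    `fields` maps BibTeX field names (author, title, ...) to their values.
    """
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical placeholder reference, not a real publication.
example = format_bibtex_entry(
    "article",
    "author2020example",
    {
        "author": "Author, A. N. and Other, B.",
        "title": "An Example Title",
        "journal": "Journal of Examples",
        "year": "2020",
    },
)
print(example)
```

Once such an entry sits in a bibtex source file, a latex document can refer to it by its key, here \cite{author2020example}, and bibtex supplies the formatted reference.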
Graphics
Windows | Linux | Mac | Web |
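gimp (free; open source) | gimp (free; open source) | gimp (free; open source) | |
dia (free; open source) | dia (free; open source) | dia (free; open source) | |
xfig (free; open source) | xfig (free; open source) | xfig (free; open source) | |
inkscape (free; open source) | inkscape (free; open source) | inkscape (free; open source) | |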
Gimp has capabilities approaching those of Photoshop. Diagrams for papers can be generated quickly in dia or xfig, while inkscape produces vector graphics. Other tools are useful for specific tasks: graphviz for drawing graphs (the sort with vertices and edges) and tikz for embedding vector graphics within LaTeX.
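As a sketch of how graphviz is typically fed, the following Python function writes a description of an undirected graph in graphviz's DOT language. The function name and the three-vertex example are invented for illustration.

```python
def to_dot(edges, name="G"):
    """Return Graphviz DOT source for an undirected graph.

    `edges` is an iterable of (vertex, vertex) pairs.
    """
    lines = ["graph %s {" % name]
    for u, v in edges:
        lines.append('  "%s" -- "%s";' % (u, v))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-vertex example: a path X - Y - Z.
dot_source = to_dot([("X", "Y"), ("Y", "Z")])
print(dot_source)
```

Assuming graphviz is installed and the printed text has been saved as graph.dot (a hypothetical file name), the command dot -Tpdf graph.dot -o graph.pdf should produce a PDF of the drawing.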
Back-up and synchronization
Windows | Linux | Mac | Web |
unison (free; open source) | unison (free; open source) | unison (free; open source) | dropbox (free for 2GB) |
duplicati (free; open source) | duplicity (free; open source) | Time Machine | SpiderOak (free for 4GB; encrypted) |
One day your computer's hard-disk is going to fail. What will you do then? Wise researchers ensure their work is backed up onto an external disk, or via some web-service. The application unison will enable you to specify directories which you wish to be kept identical between computer and hard-disk, or indeed between two computers. Meanwhile duplicity and Time Machine can keep encrypted incremental backups, so that a given file can be accessed as it was last Tuesday evening. (NB: duplicati is a re-write of duplicity, of which we have no direct experience.)
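The idea of keeping a directory mirrored on an external disk can be sketched in a few lines of Python. This is only a simplified one-way illustration with an invented function name: it is not how unison works (unison synchronizes in both directions), and it keeps neither history nor encryption as duplicity does.

```python
import filecmp
import shutil
from pathlib import Path


def mirror(source, backup):
    """Copy new or changed files from `source` into `backup`.

    One-way only: files deleted from `source` are left in `backup`.
    Returns the list of relative paths that were copied.
    """
    source, backup = Path(source), Path(backup)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source)
        target = backup / relative
        # shallow=False compares file contents, not just size and timestamp.
        if target.exists() and filecmp.cmp(path, target, shallow=False):
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(relative.as_posix())
    return copied
```

A call such as mirror("work", "/media/external/work-backup"), where both paths are hypothetical, copies only what has changed since the previous run. For real backups, prefer one of the dedicated tools listed above.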
Version control
Windows | Linux | Mac | Web |
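rcs (free; open source) | rcs (free; open source) | rcs (free; open source) | bitbucket |
git & gitzilla or sourcetree (free; open source) | git & gitzilla or sourcetree (free; open source) | git & gitzilla or sourcetree (free; open source) | github |
subversion (free; open source) | subversion (free; open source) | subversion (free; open source) | |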
It is convenient to keep a record of the changes made to the paper or code on which you are currently working. In principle this can be recovered from a sequence of incremental back-ups; however it is simpler to use revision control. Every time one finishes a session of work on a document, one checks it in to the revision control system. Should one discover that the last week's work has involved accidentally deleting a magic paragraph, then this is easily recoverable by checking out an earlier version. Moreover one can use file comparison utilities to determine what parts have been changed recently. Whilst less general than the above, the perl script latexdiff can also be useful for comparing two versions of a latex file: it produces a latex document in which the differences are marked up. Once set up, these systems are easy to use, and invaluable when needed.
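To show what a file comparison utility reports, here is a minimal Python sketch using the standard difflib module. The function name, the labels and the two short "versions" are invented for illustration; a revision control system produces this kind of comparison for you between any two checked-in versions.

```python
import difflib


def show_changes(old_text, new_text, old_label="previous", new_label="current"):
    """Return a unified diff between two versions of a document."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff)


# Two hypothetical versions of a short document.
old = "Introduction\nA magic paragraph.\nConclusion"
new = "Introduction\nConclusion"
print(show_changes(old, new))
```

Lines beginning with a minus sign are present only in the earlier version, so the accidentally deleted paragraph shows up immediately.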
Source code editors (SCE), integrated development environments (IDE)
Windows | Linux | Mac | Web |
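emacs (free; open source) | emacs (free; open source) | emacs (free; open source) | |
vi, vim, gvim (free; open source) | vi, vim, gvim (free; open source) | vi, vim, gvim (free; open source) | |
Notepad++ (free; open source) | gedit (free; open source) | xcode (free) | |
Microsoft Visual Studio | kate (free; open source) | | |
eclipse (free; open source) | eclipse (free; open source) | eclipse (free; open source) | |
Rstudio (free; open source) | Rstudio (free; open source) | Rstudio (free; open source) | |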
Most of you at some point in your careers will need to write a computer program. A source code editor (SCE) is an application that facilitates the process of editing source code for computer programs and in many cases also automates many steps in the programming process. Many SCEs are standalone applications (like the vi family, Notepad++, gedit, Kate, Kwrite, Kdevelop) while others are part of an Integrated Development Environment (IDE), which is an application that provides a comprehensive collection of tools (compilers, debuggers, source code editors, etc) that make a programmer's life easier. The table above provides a list of SCEs and IDEs that we have used in the past. One of us uses Emacs for all of his source code editing and scripting needs (R, C, C++, Python) and also for LaTeX. Actually, Emacs is so feature rich that it may be used for everything: email, web-browsing, file management, ....
There is no such thing as the best SCE or IDE (although many people that you meet might try to convince you that their personal favourite is exactly that). We have met people who find the minimalistic interface of the vi family attractive, others who prefer using the feature-rich Emacs editor for as many things as possible, and others who use a full-blown IDE for their programming needs. The general advice is to choose an SCE or IDE that fits your programming needs and try to understand why it does things the way it does. Then it will serve you well, possibly for the rest of your career.
Further tasks
We have chosen not to cover a number of other tasks, of which we do not have much experience. For instance, the researcher of the future is likely to make much more use of blogging, to keep track of developments in the literature using an RSS newsreader, and to establish a professional presence using social websites. Be ready to make the most of such opportunities.