Reinstalling software and configuring settings on a new computer is a pain. After my latest hard drive failure set the stage for yet another round of download-extract-install and configuration file twiddling, it was time to overhaul my approach. “Enough is enough!”
This post walks through
- how to back up and automate the installation and configuration process
- how to set up a minimal framework for data science
We’ll use a dotfiles repository on Github to illustrate both points in parallel.
Dotfiles are named after the configuration files that start with a dot in Unix-based systems. These files are hidden from view in your home directory, but visible with a $ ls -a
command. Some examples are .bashrc
(for configuring the bash shell), .gitconfig
(for configuring git), and .emacs
(for configuring the Emacs text editor).
Let’s provide a concrete example of a customization: suppose you have a hard time remembering the syntax to extract a file (“Is it tar -xvf, -jxvf, or -zxvf?”). If you’re using a bash shell, you can define a function, extract()
in your .bashrc
file that makes life a little easier:
extract() {
if [ -f "$1" ]; then
case "$1" in
*.tar.bz2) tar -jxvf "$1" ;;
*.tar.gz) tar -zxvf "$1" ;;
*.bz2) bunzip2 "$1" ;;
*.dmg) hdiutil mount "$1" ;;
*.gz) gunzip "$1" ;;
*.tar) tar -xvf "$1" ;;
*.tbz2) tar -jxvf "$1" ;;
*.tgz) tar -zxvf "$1" ;;
*.zip) unzip "$1" ;;
*.ZIP) unzip "$1" ;;
*.pax) cat "$1" | pax -r ;;
*.pax.Z) uncompress "$1" --stdout | pax -r ;;
*.Z) uncompress "$1" ;;
*) echo "'$1' cannot be extracted/mounted via extract()" ;;
esac
else
echo "'$1' is not a valid file to extract"
fi
}
So the next time you have to extract a file some_file.tar.bz2
, just type extract some_file.tar.bz2
in bash. (This example was found in this dotfiles repo.)
The structure of my dotfiles takes after the repo described by Lars Kappert in the article “Getting Started With Dotfiles”. However, my repo is pared down significantly, with minor modifications for my Linux Mint system (his is OS X) and a focus on packages for data science.
A framework for data science
This starter environment only has a few parts. We need a text editor — preferably one that can support multiple languages encountered in data science — and a way to manage scientific/statistical software packages.
Components
The setup consists of:
- Emacs — a powerful text editor that can be customized to provide an IDE-like experience for both python and R, while providing syntax highlighting for other languages, e.g. markdown, LaTeX, shell, lisp, and so on. (More on customizing Emacs in a future post.)
- Conda — both a package manager and environment manager. Advantages:
- Packages are easy to install compared to pip, e.g. see a post by the author of numpy.
- Conda is language agnostic in terms of both managing packages and environments for different languages (as opposed to pip/virtualenv/venv). This feature is great if you use both python and R.
- Standard python scientific computing libraries like numpy, scipy, matplotlib, etc. are available in the conda repository.
I use the system package manager (i.e. apt-get install ...
) to install a few packages like git, but otherwise rely on Conda to install Python (multiple versions are okay!), R, and their libraries.
I like how clean the conda installation feels. Any packages installed by Conda, as well as different versions of Python itself, are neatly organized under the miniconda3
directory in my home directory. In contrast, my previous Linux setups were littered with software installations from various package managers, along with sometimes unsuccessful attempts to compile software from source.
Workflow
My workflow with Conda follows this helpful post by Tim Hopper. Each project gets its own directory and is associated with an environment whose dependencies are specified by an environment.yml
file.
For example, create a folder for a project, my_proj. Within the project folder, create a bare-bones environment.yml
file to specify a dependency on python 3 and matplotlib:
name: my_proj
dependencies:
- python=3
- matplotlib
Then, to create the conda environment named after that directory, run $ conda env create
inside the my_proj directory. To activate the virtual environment, run $ source activate my_proj
.
Activating a conda environment can be further automated with autoenv. Autoenv automatically activates the environment for you when you $ cd
into a project directory. You just need to create a .env
file that contains the command to activate your environment, e.g. source activate my_proj
, under the project directory.
Tim has written a convenient bash function, conda-env-file
(see below), for generating a basic environment.yml
file and .env
file, which I’ve incorporated into my own dotfiles, along with autoenv. The order of commands that I type in bash then follows:
mkdir my_proj
# create project foldercd my_proj
# enter project directoryconda-env-file
# execute homemade function to create environment.yml and .envconda env create
# conda creates an environment “my_proj” that is named after the project directory (using environment.yml)cd ..
cd my_proj
# autoenv automatically activates environment (using the file .env) when you re-enter the directory
The dotfiles layout
Below is the layout of the directories and files (generated by the tree
command) in the dotfiles repo.
.
├── install
│ ├── apt-get.sh
│ ├── conda.sh
│ ├── git.sh
│ ├── install-emacs.sh
│ └── install-miniconda.sh
├── install.sh
├── runcom
│ ├── .bash_profile
│ ├── .bashrc
│ └── .profile
└── system
├── env
├── functions
└── path
Configuration
There any number of dotfiles that can be configured (for example, see the collection here), but this repo only provides customizations for the dotfiles .profile
, .bash_profile
, and .bashrc
— located in the directory, runcom
(which stands for “run commands”) — that contain commands that are executed at login or during interactive non-login shell sessions. For details about the role of shell initialization dotfiles, see the end of this post.
Instead of putting all our customizations in one long, unwieldy dotfile, it’s helpful to divide them into chunks, which we keep in the subfolder, system
.
The files env
, functions
, path
are sourced in a loop by the dotfiles in runcom
. For example, .bashrc
sources functions
and env
:
for DOTFILE in "$DOTFILES_DIR"/system/{functions,env}; do
[ -f "$DOTFILE" ] && . "$DOTFILE"
done
Let’s take a look at the configurations in each of these files:
env - enables autoenv for activating virtual environments
[ -f /opt/autoenv/activate.sh ] && . /opt/autoenv/activate.sh
functions - defines a custom function, conda-env-file
, that generates an environment.yml
that lists the dependencies for a conda virtual environment, and a one-line file .env
(not to be confused with env
in the previous bullet point) used by autoenv. (In addition to pip and python, I include the dependencies ipython, jedi, and flake8 needed by my Emacs python IDE setup.)
function conda-env-file {
# Create conda environment.yml file and autoenv activation file
# based on directory name.
autoenvfilename='.env'
condaenvfilename='environment.yml'
foldername=$(basename $PWD)
if [ ! -f $condaenvfilename ]; then
printf "name: $foldername\ndependencies:\n- pip\n- python\n- ipython\n- jedi\n- flake8" > $condaenvfilename
echo "$condaenvfilename created."
else
echo "$condaenvfilename already exists."
fi
if [ ! -f $autoenvfilename ]; then
printf "source activate $foldername\n" > $autoenvfilename
echo "$autoenvfilename created."
else
echo "$autoenvfilename already exists."
fi
}
path - prepends the miniconda3 path to the PATH environment variable. For example, calls to python will default to the Miniconda3 version (3.5.1 in my case) rather than my system version (2.7).
export PATH="/home/$USER/miniconda3/bin:$PATH"
Now, we’re done with configuring the dotfiles in this repo (apart from Emacs, which is treated separately). We just have to create symlinks in our home directory to the dotfiles in runcom
, which is performed by the shell script, install.sh
:
## ...
ln -sfv "$DOTFILES_DIR/runcom/.bash_profile" ~
ln -sfv "$DOTFILES_DIR/runcom/.profile" ~
ln -sfv "$DOTFILES_DIR/runcom/.bashrc" ~
## ...
Installation
In addition to setting up dotfiles symlinks, install.sh
automates the installation of all our data science tools via calls to each of the scripts in the install
subfolder. Each script is named after the mechanism of installation (i.e. apt-get
, conda
, git
) or purpose (to install Miniconda and Emacs).
- apt-get.sh - installs a handful of programs using the system package manager, including
build-essentials
, which is needed to compile programs from source. Also enables source-code repositories (not enabled by default in Linux Mint 17), to be used for compiling emacs from source. - install-emacs.sh - build Emacs 24.4 from source, which is needed for compatibility with the Magit plug-in (git for Emacs). At the time of writing, only Emacs 24.3 was available on the system repo.
- install-miniconda.sh - miniconda includes just conda, conda-build, and python. I prefer this lightweight version to the Anaconda version, which comes with more than 150 scientific packages by default. A note from the Miniconda downloads page: “There are two variants of the installer: Miniconda is Python 2 based and Miniconda3 is Python 3 based… the choice of which Miniconda is installed only affects the root environment. Regardless of which version of Miniconda you install, you can still install both Python 2.x and Python 3.x environments. The other difference is that the Python 3 version of Miniconda will default to Python 3 when creating new environments and building packages.” (I chose Miniconda3.)
- conda.sh - Use conda to install popular scientific packages for python, R, some popular R packages, and packages for IDE support in Emacs.
- git.sh - Install autoenv for working with virtual environment directories. Also clone the configurations from my Emacs repo.
Conclusion
The dotfiles repo discussed in this post will remain in this minimal state on GitHub so that it can be easily parsed and built upon. It’s the most straightforward to adopt if you are on a similar system (Linux Mint or Ubuntu 14.04), as I haven’t put in checks for OSX. If you don’t like Emacs, feel free to comment out the relevant lines in install.sh
and install/git.sh
, and replace with your editor of choice.
Also take a look at other collections of awesome dotfiles for nuggets (like the extract()
function) to co-opt. And enjoy the peace of mind that comes with having dotfiles insurance!
Notes on shell initialization dotfiles
The handling of the dotfiles .profile, .bash_profile, and .bashrc is frequently a source of confusion that we’ll try to clear up here.
For example, .profile and .bash_profile are both recommended for setting environment variables, so what’s the point of having both?
.profile
.profile is loaded upon login to a Unix system (for most distributions) and is where you should put customizations that apply to your whole session, e.g. environment variable assignments like PATH
that are not specifically related to bash. .profile holds anything that should be (1) available to graphical applications — like launching a program from a GUI by clicking on an icon or menu — or (2) to sh
, which is run by graphical display managers like GDM/LightDM/MDM when your computer boots up in graphics mode (the most common scenario these days). Note, even though the default login shell is bash in Ubuntu, the default system shell that is used during the bootup process in Ubuntu is dash, not bash, (readlink -f /bin/sh
outputs /bin/dash
).
Let’s give a concrete example of case (1): the miniconda installer provides a default option to add the miniconda binaries to the search path in .bashrc: export PATH="/home/$USER/miniconda3/bin:$PATH"
. Assuming you’ve used conda
(not apt-get
) to install python scientific computing libraries and have set the path in .bashrc, if Emacs is launched from an icon on the desktop, then Emacs plugins that depend on those libraries (e.g. ein
, a plugin that integrates IPython with Emacs) will throw an error; since the graphical invocation only loads .profile, the miniconda binaries would not be in the search path. (On the other hand, there would be no problem launching Emacs from the terminal via $ emacs
.) For this reason, it’s preferable to add the miniconda path in .profile instead of .bashrc.
For changes to .profile to take effect, you have to log out entirely and then log back in.
.bash_profile
Like .profile, .bash_profile should contain environment variable definitions. I haven’t yet encountered a situation where a configuration can be set in .bash_profile that can’t be set in .profile or .bashrc.
Therefore, my .bash_profile just loads .profile and .bashrc. Some choose to bypass .bash_profile entirely and only have .profile (which bash reads if .bash_profile or .bash_login don’t exist) and .bashrc.
.bashrc
Definitions of alias, functions, and other settings you’d want in an interactive command line should be put in .bashrc. .bashrc is sourced by interactive, non-login shells.
login, non-login, interactive, and non-interactive shells
- To check if you’re in a login shell, type on the command line
echo $0
. If the output is-bash
, then you’re in a login shell. If the output isbash
, then it’s not a login shell (seeman bash
). - Usually, a shell started from a new terminal in a GUI will be an interactive, non-login shell. The notable exception is OSX, whose terminal defaults to starting login shells. Thus, an OSX user may blithely sweep customizations that would ordinarily be placed in .bashrc — like aliases and functions — into .bash_profile and not bother with creating a .bashrc at all. However, those settings would not be properly initialized if the terminal default is changed to non-login shells.
- If you ssh in or login on a text console, then you get an interactive, login shell.
- More examples in this StackExchange thread.
This discussion might seem pedantic since you can often get away with a less careful setup. In my experience, though, what can go wrong will probably go wrong, so best to be proactive.