You should have brought many things, he thought. But you did not bring them, old man. Now is no time to think of what you do not have. Think of what you can do with what there is.

With increasingly sophisticated tools available, it may seem silly to mess around with shell scripts, which are nearly half a century old and have a clunky, unintuitive syntax. Nevertheless, shell scripts are fast, ubiquitous on UNIX-based systems, and, if they live in your ~/.bashrc (or equivalent) file, always at hand. Much of their power lies in the pipe operator (|), which is basically function composition in reverse, allowing commands to be strung together without storing intermediate results. In addition, you can add custom tab completion to any function in one line with the complete command. This can make them an excellent replacement for high-level tools that, while powerful, are often built on an Indra’s net of dependencies. Below are some of the everyday tasks I’ve used the shell to make simpler.
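As a small illustration of that composition, here is a sketch of a word-frequency counter. The name top_words is made up for this example; every stage reads the previous stage's output, so no intermediate file is ever written.

```shell
# Print the n most frequent words on standard input (default: 5).
# top_words is a hypothetical helper, shown only to illustrate pipes.
# Stages: split into one word per line, fold case, count duplicates,
# sort by count (highest first), keep the top n.
top_words () {
    tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c |
        sort -rn |
        head -n "${1:-5}"
}
```

Read as nested function calls, this would be head(sort(uniq(sort(tr(tr(input)))))); the pipe lets you write the stages in the order they run. Usage would look like top_words 3 < notes.txt.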

1 Data science

A lot of data processing can be done using standard commands such as cut, paste, column, comm, and combine. For example, the following displays the first ten lines of a CSV file of any size, even a multi-gigabyte file too large to load into pandas.

head -10 file.csv | column -ts, | less -S

Here, column -ts, formats the file into columns with a comma delimiter, and less -S cuts long lines off at the screen edge (you can scroll sideways with the arrow keys) rather than wrapping them. It’s simple enough to package this into a function that can live in your ~/.bashrc:

# Show the first few lines of a (potentially large) CSV file
csv_head () {
    local file=${1:?Usage: csv_head file [rows]}
    head -n "${2:-10}" "$file" | column -ts, | less -S
}

Using sed and cut, we can also write a function that navigates through CSV files by position, à la pandas’ .iloc[] indexer, taking either single numbers or ranges (1-2) as arguments:

# Navigate through cells in a CSV file
csv_loc () {
    local file=${1:?Usage: csv_loc file.csv [rows] [cols]}
    sed -n "${2/-/,}p" "$file" | cut -d, -f "${3:-1-}"
}
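A quick usage sketch, run against a throwaway file of made-up data. The function is repeated so that the block runs on its own; note that row 1 is the header line.

```shell
csv_loc () {
    local file=${1:?Usage: csv_loc file.csv [rows] [cols]}
    sed -n "${2/-/,}p" "$file" | cut -d, -f "${3:-1-}"
}

# Hypothetical sample data
sample=$(mktemp)
printf '%s\n' 'name,qty,price' 'apple,3,0.50' 'pear,7,0.25' 'plum,2,0.80' > "$sample"

# Rows 2-3, columns 1 and 3: prints "apple,0.50" and "pear,0.25"
csv_loc "$sample" 2-3 1,3

# Row 1, all columns: prints the header, "name,qty,price"
csv_loc "$sample" 1

rm -f "$sample"
```

The row argument 2-3 is rewritten to the sed address 2,3 and the column argument is handed to cut unchanged, so anything cut -f accepts (1,3 or 2-) works there.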

You can do more interesting file processing tasks as well. For a CSV file that’s truly massive, the split utility can break it up into chunks of roughly equal size, and a simple wrapper script can do this while preserving the header in each file:

split_csv () {
    local file=${1:?Usage: split_csv file}
    local base="${file%.*}"
    local ext="${file##*.}"

    # Write out header
    local header=$(head -1 "$file")

    # Split into 100MB chunks, minus header
    tail -n +2 "$file" | split -d -C 100MB - "$base-" \
        --additional-suffix=".$ext"

    # Write header onto each chunk
    for chunk in "$base"-*."$ext"; do
        cat <(echo "$header") "$chunk" | sponge "$chunk"
    done
}

Now you can process the file one piece at a time.
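The sponge command used above comes from the moreutils package, which may not be installed everywhere. If it is missing, a temporary file can do the same job. A minimal sketch, where prepend_header is a hypothetical helper name:

```shell
# Prepend a header line to a file in place, without sponge.
# prepend_header is a hypothetical name, not part of split_csv.
prepend_header () {
    local header=$1
    local file=${2:?Usage: prepend_header header file}
    local tmp
    tmp=$(mktemp) || return 1

    # Build the new contents first, then overwrite the original
    { printf '%s\n' "$header"; cat "$file"; } > "$tmp" &&
        cat "$tmp" > "$file"
    local status=$?
    rm -f "$tmp"
    return "$status"
}
```

Inside the loop, prepend_header "$header" "$chunk" would then stand in for the cat ... | sponge line. Writing back with cat rather than mv keeps the chunk's original permissions.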

2 Jupyter

Of course, data analyses of any sophistication require more advanced tools. Jupyter notebooks are a great mini-IDE for creating literate Python programs, but hard to work with in the terminal; one glaring omission is a way to create a new notebook from the command line. This script, which lives in my ~/.bashrc, lets me type jn <filename> (with or without file extension) at the command line, opening the file if it exists and creating it if not.¹

jn () {
    # Complete filename with .ipynb extension if not present
    local filename=${1%.ipynb}.ipynb

    # Create notebook if it doesn't already exist
    if [[ ! -f "$filename" ]]; then
        echo '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}' > "$filename"
    fi

    # Open notebook
    jupyter-notebook "$filename"
}

Sometimes, though, I don’t want to start a whole Jupyter server just to examine the source code in a notebook. You can export the notebooks as pure code with jupyter-nbconvert --to python, but this is slow and includes a lot of cruft like cell numbers. Fortunately, the notebooks themselves are just JSON files, and are easy to parse with a tool like jq. The following function extracts the code from a notebook:

# Extract code cells from a Jupyter notebook
jn_src () {
    jq -j '.cells[] | select(.cell_type == "code").source[], "\n\n"' "$1"
}

This makes it easy to treat the notebooks just like .py files:

# Run code from Jupyter notebook
jn_src notebook.ipynb | python3

# Edit in Vim
jn_src notebook.ipynb | vim -c "setf python" -

# Show differences between two notebooks
diff <(jn_src notebook1.ipynb) <(jn_src notebook2.ipynb)

Finally, you can add tab completion to both these functions to quickly navigate to your notebook files.

# Tab complete files with .ipynb extension
complete -o plusdirs -f -X '!*.ipynb' jn
complete -o plusdirs -f -X '!*.ipynb' jn_src

3 tmux

Another nuisance of Jupyter is that it keeps open the terminal it was launched from for writing log output, which is rarely useful. The solution is to use tmux, a “terminal multiplexer” that allows you to manage terminals by naming them, splitting them onscreen into panes, and moving them into the background (“detaching”) and foreground (“attaching”). I have a shortcut that executes tmux new-session -A -s Jupyter, which opens a “Jupyter” session, creating it if it doesn’t exist. From there, I create a new window in the session with C-b c, give it a name with C-b ,, and launch the notebook server with jn. I can then detach it with C-b d so that it’s running in the background. If I want to close down the server, I can get back to it from any tmux window with C-b w.

For times when I just want to jump from my current session into a detached one, closing the session I came from, I have the following helper function:

# Attach a detached tmux session from tmux
t () {
    local curr="$(tmux display-message -p "#S")"
    tmux switch -t "$1"
    tmux kill-session -t "$curr"
}

# Tab complete with detached tmux sessions
complete -W "$(tmux list-sessions -F "#{?session_attached,,#S}")" t
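One caveat: with -W "$(...)", the session list is computed once, when ~/.bashrc is sourced, so sessions created afterwards won't be offered. A possible variation uses a completion function that re-runs tmux on every Tab press; the names _t_sessions and _t_complete are made up for this sketch.

```shell
# List detached tmux sessions (prints nothing if tmux isn't running)
_t_sessions () {
    tmux list-sessions -F '#{?session_attached,,#S}' 2>/dev/null
}

# Completion function: rebuilds the candidate list on each <Tab>.
# Session names containing spaces are not handled by this sketch.
_t_complete () {
    local cur=${COMP_WORDS[COMP_CWORD]}
    COMPREPLY=( $(compgen -W "$(_t_sessions)" -- "$cur") )
}

complete -F _t_complete t
```

The trade-off is a short tmux call on each completion in exchange for a list that is always current.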

This ability to detach sessions is also invaluable when using ssh. For instance, when logging onto an EC2 instance, I can start a tmux session on the server, name it, launch a long-running command, then detach and close the connection, coming back later when I think the command is almost finished. That way, the session stays alive on the server, and even if my connection breaks, the command won’t. To automate this, I put the following block in my ~/.ssh/config:

Host *.amazonaws.com
User ubuntu
IdentityFile ~/.ssh/aws-ssh.pem
RequestTTY force
RemoteCommand tmux new -A -s aws
LocalForward 9999 localhost:8888

Now, when I ssh ec2-whatever.compute.amazonaws.com, it will automatically log in and attach to the tmux session called aws, creating it if there is none (the RequestTTY line forces ssh to allocate a terminal, which tmux needs in order to start). After running a command, I hit C-b C-b d to detach the remote session,² and I am disconnected from the server, but the command is still running on the server.

(The LocalForward line, incidentally, sets up port forwarding for Jupyter notebooks—using one, though, means I can’t detach the remote session, as that would sever the connection between local notebook interface and remote kernel.)

4 Man page replacements

Lots of small utilities like cheat and tldr offer “example pages”, supplementing man pages by storing oft-used formulas. The pages themselves are extremely useful, making this one of my most-used commands,³ but the utilities themselves tend to be on the bloated side. It’s simple enough to implement this in a few lines of BASH.

export cheatdir="$HOME/.cheat/"

# Easy user-defined man pages
cheat () {
    case "$1" in "-e") shift; $EDITOR "${cheatdir%/}/$1" ;; esac
    cat "${cheatdir%/}/${1:?Usage: cheat [-e] command_name}"
}

# Tab complete entries in $cheatdir
complete -W "$(ls "$cheatdir")" cheat

Just run cheat <command> to view its page, and cheat -e <command> to edit it.

This template can be expanded on; for instance, if you want syntax highlighting, you can install source-highlight, and replace cat with source-highlight --src-lang shell --out-format esc --input.

5 Passwords

I like to use XKCD-style passwords. They are memorable and secure because they draw from a massive entropy pool we already carry around in our heads: the lexicon. A typical English speaker knows on the order of $10^4$ words,⁴ and a typical diceware wordlist contains $6^5 = 7{,}776$ of those. Choosing six words from the list, without replacement, provides $\log_2\left(\frac{(6^5)!}{(6^5-6)!}\right) \approx 77$ bits of entropy.⁵
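The figure is easy to check numerically. Choosing k words without replacement from a list of n gives log2(n) + log2(n-1) + ... + log2(n-k+1) bits. A rough sketch with awk, where diceware_bits is a hypothetical helper name:

```shell
# Bits of entropy in k distinct words drawn from an n-word list
# (defaults: k = 6, n = 6^5 = 7776). Hypothetical helper.
diceware_bits () {
    awk -v k="${1:-6}" -v n="${2:-7776}" 'BEGIN {
        for (i = 0; i < k; i++)
            bits += log(n - i) / log(2)   # log2 of each successive choice
        printf "%.1f\n", bits
    }'
}

diceware_bits        # six words from the standard list
diceware_bits 4      # four words, for comparison
```

For the default six words this prints 77.5, in line with the estimate above. Each additional word adds a little under 13 bits.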

# Generate a memorable passphrase of n words (default: 6)
diceware () {
    shuf --random-source=/dev/random -n ${1:-6} eff_diceware.txt | cut -f 2 | paste -s -d ' '
}

The --random-source flag is important here: it makes shuf draw its random bytes from the kernel’s entropy pool, which is fed by unpredictable hardware events, much like rolling physical dice. By default, the command uses its own pseudorandom number generator, which is deterministic once seeded; such generators can be guessed and even gamed, hurting security by making some permutations of words more likely than others.

Of course, if fatuous rules like including punctuation are enforced, you can easily generate “old-fashioned” passwords:

# Generate a random password of n characters (default: 16)
bad_password () {
    strings /dev/urandom | tr -d '[:space:]' | head -c ${1:-16}; echo
}

6 Music

It’s not really known for it, but Reddit is home to a huge network of music subreddits, each representing a genre or subgenre, to which users post YouTube or SoundCloud links. There are services that aggregate these into an internet radio station, but it is simple enough to parse the JSON yourself and pipe it to VLC (make sure it’s running an updated version of youtube.lua), or your media player of choice.

# Play youtube links from music subreddits
reddit_music () {
    curl "https://www.reddit.com/r/${1:-listentothis}/.json?limit=100" | \
        jq -r '.data.children[].data.url' | \
        vlc --novideo --preferred-resolution=360 --network-caching=30000 - &
}

The --novideo and --preferred-resolution=360 flags cut down on bandwidth and processing by skipping video output and asking for a low-resolution stream, and --network-caching=30000 makes sure the songs buffer long enough that there’s no lag.

We can even add “stations” manually to the tab completion command as a way of bookmarking them:

# Add your favorite genres here
complete -W "
    listentothis
    SoundsVintage
    90sAlternative
    MFPMPPJWFA
" reddit_music

If you want to combine subreddits, you can just concatenate them with a plus sign (swinghouse+triphop); VLC will play the top songs from both.

¹ The nbformat and nbformat_minor keys are for my version of Jupyter. For other versions, you can get the appropriate numbers with:
```
import nbformat
print(
    nbformat.current_nbformat,
    nbformat.current_nbformat_minor
)
```

² The extra prefix tells local tmux to send the command through to the server.

³ In fact, it ties for sixth in the past ten weeks: y is a wrapper for youtube-dl, and lyrics is a BASH script that, well, looks up song lyrics. To get these stats:
```
history \
    | cut -d ' ' -f 4 \
    | sort \
    | uniq -c \
    | sort -n \
    | tail -10
```

⁴ See, e.g., William E. Nagy and Richard C. Anderson, “How Many Words Are There in Printed School English?” Reading Research Quarterly 19, no. 3 (1984): 304–30, doi:10.2307/747823.

⁵ Assuming Kerckhoffs’s principle, which I guess we’ll have to now that I’ve blogged about it.

Alexander Klapheke

Useful shell scripts