Presenting a "Data Slide"


As part of my PhD program, we are required to take a professional development class that focuses on the “soft” skills of being a successful scientist. We recently had a lecture on presenting data, and I thought I would share some of my notes and impressions, along with the project we were asked to do (preparing a single slide to present data from an experiment).

Here are a few of the “cardinal rules.”

No “data dumping”.

Don’t just put data (raw or processed) on your slide. You need to contextualize it somehow.

“Data doesn’t speak for itself…you do.”

Determine your audience.

Think about your audience ahead of time. What are their backgrounds? Are they in your field, or are they lay people? What do they need from you in order to follow your ideas? This can be tricky: you need to tailor your content to match your audience’s familiarity and literacy to get your point across. In practice, this means spelling out acronyms, using images and simple graphs, and providing background where necessary.

Why is this important?

This one is simple: people will tune you out (eventually, and perhaps quickly) if you don’t tell them why they should listen to you.

Organization

Before you start your presentation, make an outline. When you present, you are trying to tell a story with a beginning (your question and its relevance), middle (your experiment, the data you collected), and end (results, graphs/figures). Craft your narrative and determine what evidence you need to support it at each point during your talk. You might end up with a bunch of little stories lined up end to end. If that’s the case, make sure you add summaries and overarching statements to tie it all together.

Visualization

This topic is too complex to sum up appropriately here, and there are already countless resources on it, so I won’t reinvent the wheel. My main takeaway from this lecture was the flow chart below, which I found a useful aid for determining which type of visualization best suits your data and message.

[Flow chart (PDF): choosing a visualization type for your data and message]

The other message I got was to think a little like a graphic designer: use Gestalt principles from psychology to optimize a visualization for your particular message, and combine them with Weber’s Law, which implies that differences in some visual features are easier for us to perceive than differences in others. Essentially, we should encode our primary message using the features we are naturally best at visually discriminating.
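To make this concrete, here is a minimal sketch (the data and object names are invented for illustration) of encoding the same small comparison two ways in ggplot2: as position along a common axis, which we discriminate well, versus as angle/area in a pie-style chart, which we discriminate poorly.

```r
library(ggplot2)

# Hypothetical data: three group means that differ only slightly
df <- data.frame(group = c("A", "B", "C"), value = c(3.1, 3.4, 3.2))

# Position along a common axis: the small differences are easy to see
p_position <- ggplot(df, aes(x = group, y = value)) +
  geom_col()

# The same values as angles in a pie-style chart: the differences
# between slices are much harder to judge by eye
p_angle <- ggplot(df, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")
```

The point isn’t that bar charts are always right, only that whichever quantity carries your main message should get the most discriminable encoding available.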

A few rules of thumb:

  1. Always include plot title, axis titles, legend, and scale. Label everything.
  2. Use features like color, angle, volume, etc. deliberately (not simply to “jazz up” a graph). If a feature isn’t there to make a point, leave it out; otherwise it may mislead.
  3. Make your figures accessible to everyone who views them. Prepare for colorblindness or differing levels of literacy.
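As a sketch of these rules of thumb in ggplot2 (the data here is invented for illustration): label everything, keep the encodings purposeful, and use a colorblind-safe palette, such as the Okabe-Ito palette that ships with base R (4.0+) via `palette.colors()`.

```r
library(ggplot2)

# Hypothetical dose-response data
df <- data.frame(
  dose = rep(c(1, 2, 5, 10), 2),
  response = c(2.1, 3.0, 4.4, 5.1, 1.8, 2.2, 2.9, 3.3),
  treatment = rep(c("Drug", "Placebo"), each = 4)
)

# Okabe-Ito is a colorblind-safe palette built into base R
cb_palette <- palette.colors(palette = "Okabe-Ito")

p <- ggplot(df, aes(x = dose, y = response, color = treatment)) +
  geom_line() +
  geom_point() +
  # unname() so ggplot maps colors by position, not by palette name
  scale_color_manual(values = unname(cb_palette)) +
  labs(
    title = "Response vs Dose",  # plot title
    x = "Dose (mg)",             # axis titles, with units
    y = "Response (a.u.)",
    color = "Treatment"          # legend title
  )
```

Everything a viewer needs to decode the figure (title, axes with units, legend) is on the figure itself, and the palette remains distinguishable under common forms of colorblindness.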

Class presentation/project

For the class project, we needed to prepare a single “data slide” that we might include in a presentation about our work, and then present it to our classmates. My slide(s) and code are below. For context, I decided to benchmark the k-means clustering algorithms included in base R (the stats package’s kmeans()) to see how their run time is affected by increasing the number of randomly initialized starts.

Code

library(palmerpenguins)
library(tidyverse)
library(ggthemes)
library(ggsci)
library(colorblindr)
library(cowplot)
library(knitr)
library(kableExtra)

peng <- penguins %>%
  na.omit() %>%
  select(bill_length_mm, bill_depth_mm, body_mass_g, flipper_length_mm) %>%
  scale()
peng_labels <- penguins %>%
  na.omit() %>%
  pull(species)


starts <- c(seq(1, 1e4, 50), 1e4) # number of starting positions
algs <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen") # algorithms to try

# function to run clustering and record timing for a given algorithm
runClus <- function(alg, starts){
  n_starts <- length(starts)
  run_times <- numeric(n_starts)
  converge_iters <- numeric(n_starts)
  for(i in 1:n_starts){
    start_time <- Sys.time()
    km <- kmeans(peng, 3, nstart = starts[i], algorithm = alg)
    end_time <- Sys.time()
    run_times[i] <- as.numeric(difftime(end_time, start_time, units = "secs"))
    converge_iters[i] <- km$iter
  }
  df <- data.frame(
    "run.times" = run_times
    , "convergence" = converge_iters
    , "algorithm" = rep(alg, n_starts)
  )
  return(df)
}

hw <- runClus(algs[1], starts)
ll <- runClus(algs[2], starts)
fo <- runClus(algs[3], starts)
mq <- runClus(algs[4], starts)

all_algs <- rbind(
  hw, ll, fo, mq
)

tol_palette <- c("#4477AA", "#CCBB44", "#228833", "#66CCEE", "#EE6677", "#AA3377", "#BBBBBB")
tol_palette2 <- c("#000000", "#DDAA33", "#BB5566", "#004488", "#FFFFFF")
alg_plot <- all_algs %>%
  add_column(
    n_starts = rep(starts, length(algs))
  ) %>%
  ggplot() +
  aes(
    x = n_starts
    , y = run.times
    , group = algorithm
    , color = algorithm
  ) +
  geom_line(
    size = 1
  ) +
  scale_color_manual(
    values = tol_palette2
  ) +
  labs(
    x = "Number of Starts"
    , y = "Run Time (s)"
    , title = "Number of Random Starts vs Algorithm Runtime"
    , subtitle = "The MacQueen algorithm outperforms other base R k-means"
    , color = "Algorithm"
  ) +
  theme_few() +
  theme(
    plot.title = element_text(size = 12)
    , plot.subtitle = element_text(size = 10)
    , panel.grid.major = element_line(color = "#DDDDDD", linetype = 2)
    , legend.position = "bottom"
    , axis.title = element_text(size = 10)
    , legend.title = element_text(size = 10)
  )

pca_res <- prcomp(peng, scale = FALSE)
pca_vals <- pca_res[["x"]] %>%
  data.frame() %>%
  add_column(
    label = peng_labels
  )

pca_plot <- pca_vals %>%
  ggplot() +
  aes(
    x = PC1
    , y = PC2
    , color = label
  ) +
  geom_point() +
  scale_color_manual(
    values = tol_palette
  ) +
  labs(
    x = "Principal Component 1"
    , y = "Principal Component 2"
    , title = "Palmer's Penguins"
    , color = "Species"
  ) +
  theme_few() +
  theme(
    panel.grid.major = element_line(color = "#DDDDDD", linetype = 2)
    , legend.position = "bottom"
    , plot.title = element_text(size = 12)
    , axis.title = element_text(size = 10)
    , legend.title = element_text(size = 10)
  )

prow <- plot_grid(
  pca_plot
  , alg_plot
  , labels = c('A', 'B')
  , label_size = 12
  , rel_widths = c(1.3, 2)
  )
# now add the title
title <- ggdraw() + 
  draw_label(
    "Which clustering algorithm is faster with the Palmer's Penguins dataset?",
    fontface = 'bold',
    x = 0,
    hjust = 0
  ) +
  theme(
    # add margin on the left of the drawing canvas,
    # so title is aligned with left edge of first plot
    plot.margin = margin(0, 0, 0, 7)
  )

figure <- plot_grid(
  title, prow,
  ncol = 1,
  # rel_heights values control vertical title margins
  rel_heights = c(0.1, 1)
)

Slide(s)
