Dec 6, 2020

Colorizing points in a base R plot

By default, base \(R\)’s plot function draws points as hollow circles. That is perfectly adequate for a single data set, but less so for multivariate data, because the edges are too thin for color to stand out well. My go-to option: set the pch argument to 16 and the col argument to the color of my choice.

Background

pch is the argument that specifies the shape of a point in a plot. The three basic selections for a circle shape are:

pch   colorizing options
1     (or the default when pch is not given) the edge color can be changed but not the interior
16    (a so-called “solid circle”) the whole symbol takes one color; the edge cannot be colored separately from the interior
21    (a so-called “filled circle”) the interior and the edge can be different

The pch=16 and pch=21 colorizing options apply to other shapes that also fall into their respective “pch groups”: 15-20 for solid shapes and 21-25 for filled shapes. To see which shapes correspond to which pch value, check out help("points") as well as many posts on the web such as this one.
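For a quick look at the shapes without leaving the console, here is a minimal sketch that draws symbols 0 through 25 in a grid. The variable names and the grid layout are my own choices for illustration, and the blue and red colors are arbitrary. With both col and bg supplied, only the symbols in 21:25 should show the second color.

```r
# Draw symbols 0 through 25 in a grid, each labelled with its pch value.
# col colors the edge (and the whole symbol for 15:20); bg only affects 21:25.
pch_values <- 0:25
grid_x <- (pch_values %% 7) + 1     # column position, 1 to 7
grid_y <- 4 - (pch_values %/% 7)    # row position, top row first

plot(grid_x, grid_y,
     pch = pch_values, cex = 2,
     col = "#4477AA", bg = "#EE6677",
     xlim = c(0.5, 7.5), ylim = c(0.5, 4.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "pch 0 to 25")
# Put the pch number underneath each symbol.
text(grid_x, grid_y, labels = pch_values, pos = 1, offset = 1)
```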

Circles and squares are illustrated in light and dark backgrounds below.1 IMO, points “pop” from their interior, not from their edge.2

Notice how

  • For the default (pch == 1)
    • the interior color cannot be changed and is always transparent so the background always shows through
    • the default edge color is black so the point virtually disappears on a dark background
  • For solid shapes (pch in 15:20)
    • col specifies both the interior and edge colors, necessarily the same
    • bg – specified or not – has no impact
  • For filled shapes (pch in 21:25)
    • col specifies the edge color
    • bg specifies the interior color, defaulting to “transparent” if unspecified3
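The behavior listed above can be checked with a small sketch like the following. The helper show_circles is hypothetical (not part of base R), and the "grey20" dark background is an arbitrary choice. It draws the three circle types with the same col and bg on a light and then a dark background.

```r
green <- "#228833"
magenta <- "#AA3377"

# Hypothetical helper: draw pch 1, 16 and 21 side by side on a given background.
show_circles <- function(background) {
  old_par <- par(bg = background)   # device background for the next plot
  on.exit(par(old_par))             # restore the previous setting afterwards
  label_col <- if (background == "white") "black" else "white"
  plot(1:3, rep(1, 3),
       pch = c(1, 16, 21),          # default, solid, filled
       col = green,                 # edge for 1 and 21, whole symbol for 16
       bg = magenta,                # interior, used by 21 only
       cex = 4, lwd = 2,
       xlim = c(0.5, 3.5), ylim = c(0.5, 1.5),
       axes = FALSE, xlab = "", ylab = "")
  text(1:3, 1, labels = paste("pch =", c(1, 16, 21)),
       pos = 1, offset = 3, col = label_col)
  invisible(NULL)
}

show_circles("white")
show_circles("grey20")
```

Only the pch = 21 circle should pick up the magenta interior; the pch = 1 circle lets the background show through in both plots.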

Example

Here is a bivariate example using the mtcars dataset in \(R\) and Paul Tol’s “bright” palette.4 5

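green <- "#228833"
magenta <- "#AA3377"
# build plot title -- see stackoverflow citation in footnote
# "in" is a reserved word in R, so the unit is assembled from "(i" and n^3
title_text <- quote(paste("miles per gallon vs displacement (i"))
title_unit <- quote(n^3)
title_close <- quote(")")
# a * b * c juxtaposes the three pieces in plotmath
plot_title <- substitute(a * b * c,
                         list(a = title_text, b = title_unit, c = title_close))
# vs is 0 for a V-shaped engine and 1 for a straight engine
with(mtcars, 
     plot(disp, mpg
          , pch = 16
          , col = c(green, magenta)[vs + 1]
          , main = plot_title
          )
     )
legend("topright", c("v-engine", "straight-block"), col = c(green, magenta), pch = 16)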

Not only do smaller engines get better gas mileage, but none of the 1973–74 models in mtcars pairs a straight-block engine with a high displacement.

Bottom line

Use base \(R\)’s default black circles to quickly visualize sequential data.

For colored circles use pch = 16 and col = color_of_your_choice.

Use pch = 21 when it is useful to differentiate a point’s edge from its interior.
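As a rough illustration of the pch = 21 case, here is a sketch that redraws the mtcars scatterplot with a black edge around each colored interior. The axis labels, the cex value and the name engine_cols are my own choices, not part of the example above.

```r
green <- "#228833"
magenta <- "#AA3377"
# vs is 0 for a V-shaped engine and 1 for a straight engine
engine_cols <- c(green, magenta)[mtcars$vs + 1]

plot(mtcars$disp, mtcars$mpg,
     pch = 21,
     col = "black",       # edge color
     bg = engine_cols,    # interior color
     cex = 1.3,
     xlab = "displacement (cu. in.)", ylab = "miles per gallon")
# In legend(), the interior color of filled symbols is given by pt.bg.
legend("topright", c("v-engine", "straight-block"),
       pch = 21, col = "black", pt.bg = c(green, magenta))
```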

Use color-blind friendly colors whenever possible.

Try pch = "." for dots when you have many points but you don’t want lines.
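A small sketch of the pch = "." suggestion, using simulated data because mtcars has only 32 rows. The sample size, seed and variable names are arbitrary.

```r
set.seed(1)                    # reproducible simulated data
n_points <- 20000
x <- rnorm(n_points)
y <- x + rnorm(n_points)       # y is correlated with x
# Each observation is drawn as a small dot, so dense regions stay readable.
plot(x, y, pch = ".", col = "#228833")
```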

Postscript

The complementary colors green (#228833) and magenta (#AA3377) used in this post come from Paul Tol’s color-blind friendly bright color palette.6 Tol’s “Notes” page is worth visiting for other helpful colorizing advice. This non-color-blind author would be interested in reader feedback regarding the distinguishability of the colors used in this post.


The end of business today, 12/12/2020, marks 251.387 months since the end of the last millennium.
Generated with Rmarkdown in RStudio.


R Environment

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mondate_0.10.01.02

loaded via a namespace (and not attached):
 [1] compiler_4.0.3  magrittr_1.5    tools_4.0.3     htmltools_0.5.0
 [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   knitr_1.28     
 [9] stringr_1.4.0   xfun_0.14       digest_0.6.25   rlang_0.4.6    
[13] evaluate_0.14  

  1. Regarding the color of the default background, the italicised phrases are from \(R\) help pages: normally “white” from help("par"), often transparent from help("frame")

  2. Called “border” in \(R\)

  3. Per help("par"): “For many devices the initial value [of the plot background] is set from the bg argument of the device, and for the rest it is normally "white".”

  4. Okabe and Ito have a wonderful site that discusses Color Universal Design: https://jfly.uni-koeln.de/color/. Okabe/Ito and Tol palettes can be displayed with \(R\) code downloadable from here: Goedhart, Joachim. (2019, August 29). Material related to the blog “Dataviz with Flying Colors”. Zenodo. http://doi.org/10.5281/zenodo.3381072

  5. Technique for superscript in title from https://stackoverflow.com/questions/34193276/concatenate-several-math-expressions-in-r

  6. For additional perspectives on color-impaired visualizations, see https://thenode.biologists.com/data-visualization-with-flying-colors/research/ and https://venngage.com/blog/color-blind-friendly-palette/ and https://jfly.uni-koeln.de/color/.

4 comments:

  1. What is the best option for when data points overlap?

  2. Some R packages have sophisticated methods for dealing with overlapping data, but in base R the choices are limited. The best choice that does not depend on the device is via the function jitter. Try this site for that and other options: http://www.rensenieuwenhuis.nl/r-sessions-13-overlapping-data-points/
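
    A minimal sketch of the jitter approach, assuming the overlap comes from discrete values such as cyl and gear in mtcars. The variable names, seed and axis labels are illustrative only.

```r
magenta <- "#AA3377"
set.seed(42)                           # jitter() adds random noise
jittered_cyl <- jitter(mtcars$cyl)     # nudge each value slightly
jittered_gear <- jitter(mtcars$gear)
plot(jittered_cyl, jittered_gear,
     pch = 16, col = magenta,
     xlab = "number of cylinders", ylab = "number of forward gears")
```

    Without jitter, the 32 cars collapse onto a handful of (cyl, gear) positions; with it, each car gets its own visible point.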
