Beyond tibbles - Creating your own tibble subclass and making it work with the tidyverse

Sometimes your data just isn't rectangular

By Martin Helm in R

May 5, 2024

Introduction

The tidyverse is the most popular analysis framework for R. It is based on the tibble class, which extends data.frames with some more intuitive behavior. But at the core, it still is a rectangular table format.

While this is sufficient for many cases, sometimes you need to store additional data that doesn’t fit easily into the table format, or would introduce a lot of redundant data. For example, if you want to track single values, such as a date of an analysis, you could add a date column, but it would hardly be efficient.

In other circumstances, you have data that belongs together from a domain perspective, but because they have different dimensions, you cannot simply join the two tables. Finally, sometimes your data needs to be processed differently than the standard tidyverse verbs do, for example for time series data some functions might need to take into account the order of the rows over time.

For all of these problems, you can create a subclass of a tibble. This lets you easily store additional attributes and also modify the behavior of functions in the tidyverse when they are applied to your custom class. This is an advanced topic, for which unfortunately not much documentation exists. In this post I will lay out how you can:

  1. Implement your own tibble subclass
  2. Modify tidyverse functions by implementing custom S3 methods
  3. Handle the additional steps needed when doing this in a package

Throughout the article I will use the :: notation to make it clear where each command comes from.

Creating your own S3 tibble subclass

Instantiating a tibble subclass

Tibbles are an S3 class at their core, which inherits from data.frame. The class is defined in the {tibble} package. There are many good introductions to the S3 class system, e.g. Hadley Wickham’s Advanced R, so I will not cover it in detail here and assume that you know the basics of S3 and inheritance. If you need to refresh your knowledge, pause now and briefly read through the chapter linked above.

Very fortunately for us, the tibble class was designed to be extended. This can be done via the low-level constructor function, as described in Advanced R. We can now easily create a new subclass just by passing in a class argument to the new_tibble function.

We will use the class tbl_sub throughout the article. Replace it with your desired class name. Also note that we need to provide some initial data. This can be a list, data.frame or tibble.

my_tibble <- tibble::new_tibble(tibble::tibble(), class = "tbl_sub")
my_tibble
## # A tibble: 0 × 0
class(my_tibble)
## [1] "tbl_sub"    "tbl_df"     "tbl"        "data.frame"

This in itself is not very useful, as we cannot store additional data (but would be sufficient to modify tidyverse’s behavior for our custom class). Let’s assume that the data we are working with are some kind of measurements recorded during an experiment, and we want to store the person who did the experiment as well. We could in theory add another column with this information, but that would be redundant and can waste a lot of memory for large data sets.

Instead we will add another attribute scientist to store this:

my_tibble_sub <- tibble::new_tibble(tibble::tibble(), scientist = "Martin", class = "tbl_sub")
my_tibble_sub
## # A tibble: 0 × 0

As you notice, the attribute is not printed by default. To show it, we need to modify the print functionality so that it knows about the new attribute; this will be done a bit later. But the attribute exists, and we can inspect it with attributes or access it directly with attr:

attributes(my_tibble_sub)
## $class
## [1] "tbl_sub"    "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## integer(0)
## 
## $names
## character(0)
## 
## $scientist
## [1] "Martin"
attr(my_tibble_sub, "scientist")
## [1] "Martin"

Congratulations, you already created your tibble subclass with a custom attribute! But to make our solution more visible to the user, easier to use and more robust, I want to introduce two more topics before we move on to modifying dplyr and tidyr: constructor functions and attribute access.

Constructor functions

The S3 class system of R is very flexible, but at the same time this provides ample opportunity for the user to misuse our class. To prevent unintended behaviors (for example preventing that the user sets a number for our scientist attribute), one typically creates constructor functions for the object itself, and provides accessor functions for the attributes.

With these functions, we can provide sensible defaults and check the user input before it is passed on to low-level functions. These low-level functions are then the ones that do the actual work, but we can have much more confidence that the input they receive is appropriate. (Note that even the low-level functions are not hidden from the user, who can still access them with the ::: operator, but this makes it clear that doing so is not intended.) This is similar to other object-oriented programming approaches, where you have constructors, builder and factory patterns etc.

We will therefore create two new functions, tbl_sub as the user facing function, and new_tbl_sub as the low-level constructor:

tbl_sub <- function(x, scientist) {
  new_tbl_sub(x, scientist = scientist)
}

new_tbl_sub <- function(x, scientist = NULL) {
  tibble::new_tibble(x, scientist = scientist, class = "tbl_sub")
}

my_tibble_sub <- tbl_sub(tibble::tibble(), scientist = "Martin")

Note that for the user-facing function, we require a scientist argument, whereas the low-level function has a default of NULL. This gives us as the developer the possibility to create a tbl_sub without a scientist attribute, if we come across a use case for this in the future, but still makes it clear to the user that every experiment must have an associated scientist.

In addition to the constructor function, we also implement a check function, to verify that an instance of our tbl_sub class is well behaved. We can later use this to check that the user didn’t meddle with our class in unintended ways (e.g. by using the ::: operator). Since these checks will be quite frequent, they should be light-weight, as you will run them basically in all of your custom functions and methods.

Typical checks would be that the input object is really of our custom class, as well as checking that the attributes are present.

# This function is usually user-facing
is_tbl_sub <- function(x) {
  inherits(x, "tbl_sub")
}

# This is mostly for internal use
check_attributes <- function(x) {
  required_attributes <- "scientist"
  all(required_attributes %in% names(attributes(x)))
}

# Our actual function to perform the checks
check_tbl_sub <- function(x) {
  stopifnot("x is not a tbl_sub" = is_tbl_sub(x))
  stopifnot("At least one required attribute is missing" = check_attributes(x))
}

If you want to do more heavy-weight checks, I recommend creating your check function with an argument like thorough = FALSE, and only applying the more intensive checks when this is set to TRUE. You can then use this when it really matters, while defaulting to the small checks.
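
A minimal sketch of such a two-level check could look like the following. The function name validate_tbl_sub and the specific thorough check (no missing values in scientist) are assumptions for illustration only. To keep the sketch runnable on its own, the test object is built with base R's structure() instead of the constructor from above.

```r
# Hypothetical validator with an opt-in thorough mode (names are illustrative)
validate_tbl_sub <- function(x, thorough = FALSE) {
  # Cheap checks that always run
  stopifnot("x is not a tbl_sub" = inherits(x, "tbl_sub"))
  stopifnot(
    "At least one required attribute is missing" =
      "scientist" %in% names(attributes(x))
  )

  # Stricter checks that only run on request
  if (thorough) {
    scientist <- attr(x, "scientist")
    stopifnot(
      "scientist must be a character without missing values" =
        is.character(scientist) && !anyNA(scientist)
    )
  }

  invisible(x)
}

# Stand-in object built with base R, so the sketch runs without {tibble}
obj <- structure(
  data.frame(var1 = 1:3),
  class = c("tbl_sub", "data.frame"),
  scientist = "Martin"
)

validate_tbl_sub(obj)                   # fast path only
validate_tbl_sub(obj, thorough = TRUE)  # fast path plus the stricter checks
```

Returning the input invisibly makes it possible to use the validator inside a pipe without changing the data.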

Accessor functions

As we have seen above, accessing our custom attribute is not very convenient so far, as we need to call attr(x, "scientist") and attr(x, "scientist") <- value to interact with it. Therefore, you typically provide get and set functions to do this. Within these functions you can also check the input of the user, to prevent misuse.

Alternatively, you could also create get and set methods using the S3 system, but this is typically much more complicated and doesn’t add a lot of value.

get_scientist <- function(x) {
  check_tbl_sub(x)
  attr(x, "scientist")
}

set_scientist <- function(x, scientist) {
  check_tbl_sub(x)
  stopifnot("scientist must be a character" = is.character(scientist))
  attr(x, "scientist") <- scientist
  x
}

get_scientist(my_tibble_sub)
## [1] "Martin"

Setting a number for scientist is not allowed:

x <- set_scientist(my_tibble_sub, 1)
## Error in set_scientist(my_tibble_sub, 1): scientist must be a character
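
For comparison, the S3-based alternative mentioned above could be sketched roughly as follows. The generic names scientist and scientist<- are assumptions for illustration and are not used elsewhere in this article. The test object is again a base R stand-in.

```r
# Hypothetical getter generic and its method for our class
scientist <- function(x) {
  UseMethod("scientist")
}

scientist.tbl_sub <- function(x) {
  attr(x, "scientist")
}

# Hypothetical replacement generic: enables `scientist(x) <- value`
`scientist<-` <- function(x, value) {
  UseMethod("scientist<-")
}

`scientist<-.tbl_sub` <- function(x, value) {
  stopifnot("scientist must be a character" = is.character(value))
  attr(x, "scientist") <- value
  x
}

# Stand-in object built with base R
obj <- structure(
  data.frame(var1 = 1:3),
  class = c("tbl_sub", "data.frame"),
  scientist = "Martin"
)

scientist(obj) <- "Someone Else"
scientist(obj)
## [1] "Someone Else"
```

This needs four definitions instead of two, which illustrates why plain get and set functions are usually sufficient.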

Setting up for the future

Throughout the article, and in your own implementation, you will need to reference what you expect your custom tibble subclass to look like in several places. To prevent errors and keep your code DRY, I highly recommend storing this format in a central place, making it easy for you to reference it.
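
One possible way to do this is a single list that describes the class, which all checks then read from. This is only a sketch: the names tbl_sub_spec, required_attributes and has_valid_attributes are hypothetical, and the test object is a base R stand-in.

```r
# Hypothetical central description of the tbl_sub format
tbl_sub_spec <- list(
  class = "tbl_sub",
  # attribute name -> predicate that a valid value must satisfy
  attributes = list(
    scientist = is.character
  )
)

# Helpers read from the spec instead of hard-coding attribute names
required_attributes <- function() {
  names(tbl_sub_spec$attributes)
}

has_valid_attributes <- function(x) {
  all(vapply(
    required_attributes(),
    function(nm) {
      value <- attr(x, nm, exact = TRUE)
      !is.null(value) && isTRUE(tbl_sub_spec$attributes[[nm]](value))
    },
    logical(1)
  ))
}

# Stand-in object built with base R
obj <- structure(
  data.frame(var1 = 1:3),
  class = c(tbl_sub_spec$class, "data.frame"),
  scientist = "Martin"
)

has_valid_attributes(obj)
## [1] TRUE
```

Adding a second attribute later then only requires one new entry in the spec.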

Modifying tidyverse methods for our custom class

Recap of S3 methods and generics

Generics are foundational functions for R’s S3 class system. They allow for polymorphism, meaning that a function with the same name behaves differently depending on what object it was called on. For example, print is a generic that handles a multitude of classes, providing custom printing for each one.

To create an S3 method, you first need to define a generic with UseMethod (if this hasn’t been defined already by base R or a package you are extending).

my_generic <- function(x) {
  UseMethod("my_generic")
}

Then you can define your S3 methods with the typical function syntax, with one difference: The name of the function needs to be in the format functionname.classname:

my_generic.list <- function(x) {
  # Do something specific for lists
}

my_generic.data.frame <- function(x) {
  # Do something specific for data.frames
}

When the function my_generic is called, R will now check the class of x and call the corresponding method for it.

Some objects have multiple classes. In our example, tibbles actually have the class c("tbl_df", "tbl", "data.frame"). In these cases, method dispatch proceeds from left to right. First it checks if a method for tbl_df is defined; if not, it checks if a method for tbl is defined, and so on.

One pattern we will see frequently in this post is handing off to a downstream method. This allows us to write basic behavior for lower-level classes, which can be reused by more specific up-stream classes. We do this by calling NextMethod after having performed the more specific parts for the class at hand.
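
The hand-off can be illustrated with a small base R sketch. The generic describe and its methods are made up for this illustration; the object is a plain data.frame with an extra class and attribute.

```r
# Hypothetical generic for illustration
describe <- function(x, ...) {
  UseMethod("describe")
}

# Fallback for anything without a more specific method
describe.default <- function(x, ...) {
  "an R object"
}

# General behavior shared by all data.frames
describe.data.frame <- function(x, ...) {
  paste0("a table with ", nrow(x), " rows")
}

# Specific part for our class, then hand off to the next class in line
describe.tbl_sub <- function(x, ...) {
  paste0(
    "an experiment by ", attr(x, "scientist"),
    ", stored as ", NextMethod()
  )
}

obj <- structure(
  data.frame(var1 = 1:3),
  class = c("tbl_sub", "data.frame"),
  scientist = "Martin"
)

describe(obj)
## [1] "an experiment by Martin, stored as a table with 3 rows"
```

Dispatch first finds describe.tbl_sub; the call to NextMethod then continues with the remaining classes and lands on describe.data.frame.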

tibble

Printing

So far, we can create a custom class and store the attribute scientist, but when printing our object it appears as a regular tibble. To make it clearer to the user that it is a different custom class, we will first modify the printing behavior so that it also shows the additional information.

To do so, we implement a method for the tbl_sum generic (defined in the {pillar} package, which is responsible for tibble printing). Alternatively, we could also overwrite the print method, which in turn calls tbl_sum.

We also see a pattern that will continue throughout the post, the call to NextMethod. Very often you might want to keep a lot of functionality of the tidyverse and only modify some elements. To do this, we can simply hand off to the respective method in the tidyverse and modify the result afterwards according to our needs.

In the printing case, the call to NextMethod is important, as otherwise the default tbl printing would not occur (call stack is our method -> next specific method etc). If you want to only show your custom print, then leave it out.

tbl_sum.tbl_sub <- function(x, ...) {
  c(
    "A tbl_sub containing data from an experiment by " =  get_scientist(x),
    NextMethod()
  )
}

my_tibble_sub
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         0 × 0

as_tibble

Another peculiarity I found is that tibble::as_tibble() strips the class, but retains the attribute.

my_tibble_sub %>% tibble::as_tibble() %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"
my_tibble_sub %>% tibble::as_tibble() %>% attributes()
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## integer(0)
## 
## $names
## character(0)
## 
## $scientist
## [1] "Martin"

I personally found this inconsistent and implemented a custom as_tibble method:

as_tibble.tbl_sub <- function(x, 
                              ..., 
                              .rows = NULL, 
                              .name_repair = c("check_unique", "unique", "universal", "minimal"),
                              rownames = pkgconfig::get_config("tibble::rownames", NULL)
) {
  tbl <- NextMethod()
  # Use an empty tibble as a template for what attributes to keep
  tbl_attributes <- tibble::tibble() %>% 
    attributes() %>% 
    names()
  
  attributes(tbl) <- attributes(x)[tbl_attributes]
  
  tbl
}

tibble::as_tibble(my_tibble_sub) %>% attributes()
## $class
## [1] "tbl_sub"    "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## integer(0)
## 
## $names
## character(0)

Other methods to consider for implementation

Another function that you might find useful to implement is as_tbl_sub, which converts objects of other classes to your subclass.
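
A minimal sketch of such a coercion function, assuming the low-level constructor new_tbl_sub from earlier in the article (repeated here so the block runs on its own). The generic as_tbl_sub and its methods are hypothetical and only show one possible design.

```r
# Low-level constructor as defined earlier in the article
new_tbl_sub <- function(x, scientist = NULL) {
  tibble::new_tibble(x, scientist = scientist, class = "tbl_sub")
}

# Hypothetical coercion generic
as_tbl_sub <- function(x, scientist, ...) {
  UseMethod("as_tbl_sub")
}

# Already a tbl_sub: keep the object and only update the attribute if requested
as_tbl_sub.tbl_sub <- function(x, scientist = attr(x, "scientist"), ...) {
  attr(x, "scientist") <- scientist
  x
}

# data.frames and plain tibbles: convert to a tibble first, then wrap
as_tbl_sub.data.frame <- function(x, scientist, ...) {
  new_tbl_sub(tibble::as_tibble(x), scientist = scientist)
}

# Everything else is handed to as_tibble(), which errors for unsuitable input
as_tbl_sub.default <- function(x, scientist, ...) {
  new_tbl_sub(tibble::as_tibble(x), scientist = scientist)
}

converted <- as_tbl_sub(data.frame(var1 = 1:3), scientist = "Martin")
class(converted)
## [1] "tbl_sub"    "tbl_df"     "tbl"        "data.frame"
attr(converted, "scientist")
## [1] "Martin"
```

Because the tbl_sub method is listed first in the class vector, converting an existing tbl_sub does not rebuild the object.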

dplyr

Next we will extend dplyr to work with our custom attribute. Unfortunately, there is very little documentation on how to do this exactly, except the extend-dplyr vignette.

I observed that most dplyr functions work nicely with custom classes (they don’t drop the class and preserve attributes). Grouping, on the other hand, introduces a lot of problems, as it strips the custom class and attributes by default. To prevent this, we need to implement custom group_by and ungroup methods. Also, functions that combine data.frames, such as bind_cols, need to know what to do with your class as well. Otherwise they will silently drop information.

Let’s tackle these problems:

Implementing custom grouping generics

The grouping functions group_by and ungroup drop classes. This is because the default data.frame method for group_by always constructs a new grouped_df to return.

my_tibble_sub <- tbl_sub(
  tibble::tibble(
    var1 = 1:10, 
    var2 = sample(c("a", "b"), 10, replace = TRUE), 
    var3 = sample(c("c", "d"), 10, replace = TRUE)), 
  scientist = "Martin")
my_tibble_sub %>% group_by(var2) %>% ungroup() %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"
my_tibble_sub %>% group_by(var2) %>% ungroup() %>% get_scientist()
## Error in check_tbl_sub(x): x is not a tbl_sub

The culprit lines in dplyr are here:

# grouped-df.R
grouped_df <- function(data, vars, drop = group_by_drop_default(data)) {
  # ... Some preceding code left out for clarity
  if (length(vars) == 0) {
    as_tibble(data) # This will remove your custom class and attributes if you don't have any grouping variables.
  } else {
    groups <- compute_groups(data, vars, drop = drop)
    new_grouped_df(data, groups) # This removes your custom class and attributes if you have grouping variables, as it creates a new grouped_df object from scratch
  }
}

Subclassing grouped_df

I found subclassing grouped_df directly to be the best solution. The problem with subclassing is that you lose the tbl_sub class information, so you still need to modify the class vector manually to insert tbl_sub at the right position. Passing it to dplyr::new_grouped_df via the class argument would place it between grouped_tbl_sub and grouped_df, but it must come after grouped_df:

tbl <- tibble::as_tibble(my_tibble_sub)
grouped_tbl <- dplyr::group_by(tbl, var2)
grouping_structure <- dplyr::group_data(grouped_tbl)

# This implementation adds the tbl_sub class at the wrong position
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
  dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub", "tbl_sub"))
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "tbl_sub"         "grouped_df"      "tbl_df"         
## [5] "tbl"             "data.frame"
# This implementation would lose the information that it is a tbl_sub, i.e. after ungrouping this would be lost completely.
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
  dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub"))
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "grouped_df"      "tbl_df"          "tbl"            
## [5] "data.frame"
# Therefore the correct implementation needs to meddle with the class identifier
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
  x <- dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub"))
  tbl_df_location <- grep("tbl_df", class(x), fixed = TRUE)
  class(x) <- append(class(x), "tbl_sub", after = tbl_df_location -1)
  x
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "grouped_df"      "tbl_sub"         "tbl_df"         
## [5] "tbl"             "data.frame"

Since this is only for internal use later on, we don’t implement a user-facing function. For reference, the tsibble package does not implement such a constructor at all; it does everything inside the other method calls, such as group_by and ungroup.

Group_by

With our current use case, we can simply call the dplyr::dplyr_reconstruct function to restore the lost attributes to a grouped tbl:

my_tibble_sub_after_grouping <- my_tibble_sub %>% group_by(var2) %>% ungroup()
my_tibble_sub_reconstructed <- dplyr::dplyr_reconstruct(my_tibble_sub_after_grouping, my_tibble_sub)
my_tibble_sub_reconstructed
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d
class(my_tibble_sub_reconstructed)
## [1] "tbl_sub"    "tbl_df"     "tbl"        "data.frame"
get_scientist(my_tibble_sub_reconstructed)
## [1] "Martin"

To make group_by itself preserve our class, the custom method needs to perform the following steps:

  • Run the default group_by method to get the grouping information
  • Restore our custom attributes and class. This will remove the grouping information!
  • Add the grouping information back from the default group_by
  • Modify the class identifier of the returned object, to make it clear that it is a subclass of grouped_df. This is necessary so that we can later also handle grouped instances of our subclass for other dplyr verbs

group_by.tbl_sub <- function(.data, ..., .add = FALSE, .drop = dplyr::group_by_drop_default(.data)) {
  grouped_tbl <- NextMethod()
  
  if (dplyr::is.grouped_df(grouped_tbl)) {
    # Extract grouping information from default method  
    grouping_structure <- dplyr::group_data(grouped_tbl)
    # Restore original attributes and add grouping information
    x <- new_grouped_tbl_sub(.data, groups = grouping_structure, scientist = get_scientist(.data))
  } else {
    # Edge case: no groups are actually provided. Then simply return a regular subclass.
    # If we don't implement this, a call like my_tibble_sub %>% group_by(var2, var3) %>% ungroup(var2, var3) would still have a groups attribute
    x <- new_tbl_sub(x = .data, scientist = get_scientist(.data))
  }  
  x
}

my_tibble_sub %>% dplyr::group_by(var2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
## # Groups:                                           var2 [2]
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d
my_tibble_sub %>% dplyr::group_by(var2) %>% class()
## [1] "grouped_tbl_sub" "grouped_df"      "tbl_sub"         "tbl_df"         
## [5] "tbl"             "data.frame"
my_tibble_sub %>% dplyr::group_by(var2) %>% get_scientist()
## [1] "Martin"
my_tibble_sub %>% dplyr::group_by(var2) %>% attributes()
## $class
## [1] "grouped_tbl_sub" "grouped_df"      "tbl_sub"         "tbl_df"         
## [5] "tbl"             "data.frame"     
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $names
## [1] "var1" "var2" "var3"
## 
## $scientist
## [1] "Martin"
## 
## $groups
## # A tibble: 2 × 2
##   var2        .rows
##   <chr> <list<int>>
## 1 a             [6]
## 2 b             [4]
# It also works with multiple levels of grouping:
my_tibble_sub %>% dplyr::group_by(var2, var3)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
## # Groups:                                           var2, var3 [4]
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d

Ungrouping

Now we still need to implement a custom ungrouping method, because currently after ungrouping everything is stripped again:

my_tibble_sub %>% dplyr::group_by(var2) %>% dplyr::ungroup()
## # A tibble: 10 × 3
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d

Thanks to our custom grouped_tbl_sub class, we can now identify when to call our custom ungrouping method. The general logic is again very similar: run the default ungrouping and then restore our attributes. But we need to be a bit careful about what to restore, as a simple restore would write back all groups (since groups are a simple attribute under the hood) and the class vector. At the same time, we need to support partial ungrouping, so we cannot call as_tibble, as this would strip all grouping information.

ungroup.grouped_tbl_sub <- function(x, ...) {
  # Run default ungrouping.
  tbl <- NextMethod()
  
  if (dplyr::is_grouped_df(tbl)) {
    # If the tibble is still grouped, we don't need to do anything, as the default dplyr method doesn't call as_tibble
    x <- tbl
  } else {
    # Otherwise the tibble is completely ungrouped and we need to reapply custom attributes, but remove grouping
    # This is most easily done by creating a new tibble subclass
    x <- tbl_sub(tbl, scientist = get_scientist(x))
  }
  x
}

my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup()
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d
my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup(var2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
## # Groups:                                           var3 [2]
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d
# Also works with ungrouping all groups explicitly
my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup(var2, var3)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
##     var1 var2  var3 
##    <int> <chr> <chr>
##  1     1 b     c    
##  2     2 a     d    
##  3     3 a     c    
##  4     4 b     c    
##  5     5 a     d    
##  6     6 a     c    
##  7     7 a     d    
##  8     8 a     c    
##  9     9 b     c    
## 10    10 b     d

Slicing

Slicing a custom tibble works out of the box:

my_tibble_sub %>% slice_max(var1, n = 2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         2 × 3
##    var1 var2  var3 
##   <int> <chr> <chr>
## 1    10 b     d    
## 2     9 b     c

But slicing a grouped custom tibble does not. Notice that it got turned into a regular tibble again:

my_tibble_sub %>% group_by(var2) %>% slice_max(var1, n = 2)
## # A tibble: 4 × 3
## # Groups:   var2 [2]
##    var1 var2  var3 
##   <int> <chr> <chr>
## 1     8 a     c    
## 2     7 a     d    
## 3    10 b     d    
## 4     9 b     c

Checking the dplyr documentation, we see that we need to implement a custom dplyr_row_slice method. The default method creates a new grouped_df from scratch, which again removes custom class and attributes. By now you will know the pattern:

dplyr_row_slice.grouped_tbl_sub <- function(data, i, ..., preserve = FALSE) {
  tbl <- NextMethod()
  groups <- group_data(tbl)
  new_grouped_tbl_sub(tbl, groups, scientist = get_scientist(data))
} 

my_tibble_sub %>% group_by(var2) %>% slice_max(var1, n = 2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         4 × 3
## # Groups:                                           var2 [2]
##    var1 var2  var3 
##   <int> <chr> <chr>
## 1     8 a     c    
## 2     7 a     d    
## 3    10 b     d    
## 4     9 b     c

Mutate

The same pattern happens for mutate, where again only grouped subclasses are lost:

my_tibble_sub %>% 
  dplyr::mutate(new_var = var1 + 10)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 4
##     var1 var2  var3  new_var
##    <int> <chr> <chr>   <dbl>
##  1     1 b     c          11
##  2     2 a     d          12
##  3     3 a     c          13
##  4     4 b     c          14
##  5     5 a     d          15
##  6     6 a     c          16
##  7     7 a     d          17
##  8     8 a     c          18
##  9     9 b     c          19
## 10    10 b     d          20
my_tibble_sub %>% 
  dplyr::group_by(var2) %>% 
  dplyr::mutate(new_var = var1 + 10)
## # A tibble: 10 × 4
## # Groups:   var2 [2]
##     var1 var2  var3  new_var
##    <int> <chr> <chr>   <dbl>
##  1     1 b     c          11
##  2     2 a     d          12
##  3     3 a     c          13
##  4     4 b     c          14
##  5     5 a     d          15
##  6     6 a     c          16
##  7     7 a     d          17
##  8     8 a     c          18
##  9     9 b     c          19
## 10    10 b     d          20
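
Again, the fix is a method for our grouped subclass that restores the class and attributes after the default method has run:

mutate.grouped_tbl_sub <- function(.data,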
                                   ...,
                                   .by = NULL,
                                   .keep = c("all", "used", "unused", "none"),
                                   .before = NULL,
                                   .after = NULL) {
  tbl <- NextMethod()
  groups <- group_data(tbl)
  new_grouped_tbl_sub(tbl, groups, scientist = get_scientist(.data))
}

my_tibble_sub %>% 
  dplyr::group_by(var2) %>% 
  dplyr::mutate(new_var = var1 + 10)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 4
## # Groups:                                           var2 [2]
##     var1 var2  var3  new_var
##    <int> <chr> <chr>   <dbl>
##  1     1 b     c          11
##  2     2 a     d          12
##  3     3 a     c          13
##  4     4 b     c          14
##  5     5 a     d          15
##  6     6 a     c          16
##  7     7 a     d          17
##  8     8 a     c          18
##  9     9 b     c          19
## 10    10 b     d          20

Combining multiple tbl_sub instances

I don’t cover every method here, just some examples. The exact behavior depends on your actual data and what you need to happen.

bind_rows

bind_rows silently drops the attributes of all inputs except the first one, whose information it preserves. This is because it doesn’t know how to combine them.

my_tibble_sub <- tbl_sub(data.frame(var1 = 1:5), scientist = "Martin")
my_tibble_sub_2 <- tbl_sub(data.frame(var1 = 10:14), scientist = "Someone Else")

my_tibble_sub %>% 
  bind_rows(my_tibble_sub_2) %>% 
  get_scientist()
## [1] "Martin"

Depending on your class, you need to think about what makes sense. In this case, one could either just combine the scientist information from both tbl_sub instances, losing the information about which data was provided by which scientist, or additionally create a new column that records this. Or you could prevent this behavior completely by throwing an error (or a warning if you only combine the two scientist attributes).

Unfortunately, since bind_rows is not a generic, we cannot override it with a custom method. There are three options:

  • Ask the users to provide an input to the .id argument to identify the individual scientists later on. This still gives incorrect attributes and printing
  • Implement an rbind method, which is fortunately generic. But a user might use bind_rows instead without looking too closely
  • Overwrite dplyr’s bind_rows implementation. This makes it more obvious to the user.

As an example, I implement our own bind_rows here and also call it from an rbind method. I found this to be the most obvious solution, and the user can still opt for dplyr’s version with dplyr::bind_rows.

bind_rows <- function(..., .id = NULL) {
  dots <- rlang::list2(...)
  
  scientists <- lapply(dots, get_scientist) %>% unlist()
  
  # If the user didn't specify custom identifiers, use the scientists by default
  if (!rlang::is_named(dots)) {
    names(dots) <- scientists  
  }
  
  # If the user didn't specify a custom name for the id column, use scientist
  if (is.null(.id)) {
    .id <- "scientist"
  }
  
  tbl_concatenated <- dplyr::bind_rows(dots, .id = .id) %>% 
    set_scientist(scientists)
  
  tbl_concatenated
}

rbind.tbl_sub <- function(...) {
  bind_rows(...) # Forward to our custom bind_rows and use defaults for arguments not available in rbind
}

bind_rows(my_tibble_sub, my_tibble_sub_2)
## # A tbl_sub containing data from an experiment by 1: Martin
## # A tbl_sub containing data from an experiment by 2: Someone Else
## # A tibble:                                          10 × 2
##    scientist     var1
##    <chr>        <int>
##  1 Martin           1
##  2 Martin           2
##  3 Martin           3
##  4 Martin           4
##  5 Martin           5
##  6 Someone Else    10
##  7 Someone Else    11
##  8 Someone Else    12
##  9 Someone Else    13
## 10 Someone Else    14

bind_cols

A similar case happens for bind_cols. In this case you would probably modify the column names to retain the information. This is left as an exercise for the reader ;)

tbl_sub_1 <- tbl_sub(data.frame(var1 = 1:5), scientist = "Martin")
tbl_sub_2 <- tbl_sub(data.frame(var2 = letters[1:5]), scientist = "Someone Else")

tbl_sub_1 %>% 
  bind_cols(tbl_sub_2) %>% 
  get_scientist()
## [1] "Martin"

Rowwise data.frames

So far in the article, I focused mostly on grouped data.frames as I personally use them more often. But the same problems that we discussed also happen for rowwise data.frames, as the dplyr functions construct the rowwise data.frames from scratch.

my_tibble_sub %>% 
  dplyr::rowwise() %>% 
  dplyr::mutate(new_var = var1 + 10)
## # A tibble: 5 × 2
## # Rowwise: 
##    var1 new_var
##   <int>   <dbl>
## 1     1      11
## 2     2      12
## 3     3      13
## 4     4      14
## 5     5      15

To prevent this you will need to implement a custom rowwise class and also the same methods for your rowwise subclass. Since this article is very long and the pattern gets repetitive, I don’t show examples here. If you are facing problems, feel free to reach out so I can give you some guidance.

tidyr

Similar to dplyr, several of the tidyr verbs drop attributes. Here I will show how to implement methods for the pivot_* functions, as they are probably among the most widely used.

When developing these methods, we need to be careful with what users pass to the tidy-select arguments, like names_from. As these accept both character strings and unquoted column names, we first need to defuse the input with rlang::enquo() and then inject it into the tidyr method using the !! operator.

pivot_wider.tbl_sub <- function(data,
                                ...,
                                id_cols = NULL,
                                id_expand = FALSE,
                                names_from = name,
                                names_prefix = "",
                                names_sep = "_",
                                names_glue = NULL,
                                names_sort = FALSE,
                                names_vary = "fastest",
                                names_expand = FALSE,
                                names_repair = "check_unique",
                                values_from = value,
                                values_fill = NULL,
                                values_fn = NULL,
                                unused_fn = NULL) {
  
  # We cannot use NextMethod directly. Instead defuse the arguments and inject into tidyr method.
  
  names_from <- rlang::enquo(names_from)
  values_from <- rlang::enquo(values_from)
  id_cols <- rlang::enquo(id_cols)
  
  x <- tidyr:::pivot_wider.data.frame(data,
                                      ...,
                                      id_cols = !!id_cols,
                                      id_expand = id_expand,
                                      names_from = !!names_from,
                                      names_prefix = names_prefix,
                                      names_sep = names_sep,
                                      names_glue = names_glue,
                                      names_sort = names_sort,
                                      names_vary = names_vary,
                                      names_expand = names_expand,
                                      names_repair = names_repair,
                                      values_from = !!values_from,
                                      values_fill = values_fill,
                                      values_fn = values_fn,
                                      unused_fn = unused_fn)
  
  x <- dplyr::dplyr_reconstruct(x, data)
  x
}

pivot_longer.tbl_sub <- function(data,
                                 cols,
                                 ...,
                                 cols_vary = "fastest",
                                 names_to = "name",
                                 names_prefix = NULL,
                                 names_sep = NULL,
                                 names_pattern = NULL,
                                 names_ptypes = NULL,
                                 names_transform = NULL,
                                 names_repair = "check_unique",
                                 values_to = "value",
                                 values_drop_na = FALSE,
                                 values_ptypes = NULL,
                                 values_transform = NULL) {
  
  cols <- rlang::enquo(cols)
  
  x <- tidyr:::pivot_longer.data.frame(data = data,
                                       cols = !!cols,
                                       ...,
                                       cols_vary = cols_vary,
                                       names_to = names_to,
                                       names_prefix = names_prefix,
                                       names_sep = names_sep,
                                       names_pattern = names_pattern,
                                       names_ptypes = names_ptypes,
                                       names_transform = names_transform,
                                       names_repair = names_repair,
                                       values_to = values_to,
                                       values_drop_na = values_drop_na,
                                       values_ptypes = values_ptypes,
                                       values_transform = values_transform
  )
  
  x <- dplyr::dplyr_reconstruct(x, data)
  x
}

experiment_data <- tbl_sub(
  tibble::tibble(
    var1 = 1:10, 
    experiment = rep(c("a", "b"),each = 5), 
    cell_type = rep(c("type1", "type2", "type3", "type4", "type5"), times = 2)), 
  scientist = "Martin")

experiment_data %>% 
  pivot_wider(values_from = var1, names_from = cell_type)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         2 × 6
##   experiment type1 type2 type3 type4 type5
##   <chr>      <int> <int> <int> <int> <int>
## 1 a              1     2     3     4     5
## 2 b              6     7     8     9    10

experiment_data %>% 
  pivot_wider(values_from = var1, names_from = cell_type) %>% 
  pivot_longer(cols = dplyr::starts_with("type"), names_to = "cell_type", values_to = "var1")
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble:                                         10 × 3
##    experiment cell_type  var1
##    <chr>      <chr>     <int>
##  1 a          type1         1
##  2 a          type2         2
##  3 a          type3         3
##  4 a          type4         4
##  5 a          type5         5
##  6 b          type1         6
##  7 b          type2         7
##  8 b          type3         8
##  9 b          type4         9
## 10 b          type5        10
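The defuse-and-inject pattern used in both methods can also be tried in isolation. The sketch below uses a hypothetical wrapper, pivot_wider_forward(), around the exported tidyr::pivot_wider() generic, so it does not depend on the tbl_sub class at all. It only illustrates that bare and quoted column names both survive the hand-off.

```r
# Hypothetical wrapper, only meant to illustrate the enquo() / !! hand-off.
pivot_wider_forward <- function(data, names_from = name, values_from = value) {
  # Defuse: capture what the caller typed instead of evaluating it here.
  names_from <- rlang::enquo(names_from)
  values_from <- rlang::enquo(values_from)

  # Inject: splice the captured expressions back into the tidyr call.
  tidyr::pivot_wider(
    data,
    names_from = !!names_from,
    values_from = !!values_from
  )
}

measurements <- tibble::tibble(
  experiment = c("a", "a", "b", "b"),
  cell_type  = c("type1", "type2", "type1", "type2"),
  var1       = 1:4
)

# Unquoted column names, as most users would write them.
wide_bare <- pivot_wider_forward(measurements, names_from = cell_type, values_from = var1)

# Character strings are accepted by the same arguments.
wide_quoted <- pivot_wider_forward(measurements, names_from = "cell_type", values_from = "var1")

# Compare the two results.
identical(wide_bare, wide_quoted)
```

Without the enquo() step, the wrapper would try to evaluate cell_type as a regular variable in its own environment and fail before tidyr ever sees the argument.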

Implementing tibble subclasses in a package

When you develop a custom class, you are most likely doing so inside a package. First I will list what you need to do to make your custom methods work, and then give some guidelines on which of the other new functions you should export.

Importing and exporting methods and generics

When you are in a package context, you need to explicitly import the generics and export the methods we define here. Otherwise R will not know that a function is a generic, and S3 dispatch is never invoked for your class.

In a nutshell you need to do the following:

  • Add the @export tag to the Roxygen docstring of all your custom methods
  • Import the generics from the original packages. This is best done in a Roxygen docstring as well, using @importFrom. Note that you don't need to import generics from base R, such as rbind.

The second part is often done in a dedicated package description docstring. This docstring doesn't document a function, but rather describes your package in general. It might look like this:

# file my_awesome_package.R

#' My awesome package
#' 
#' This package implements the `tbl_sub` class that can store a scientist attribute.
#' Otherwise it behaves like a `tibble`.
#' 
#' @importFrom tibble as_tibble tbl_sum
#' @importFrom dplyr group_by ungroup mutate dplyr_row_slice
#' @importFrom tidyr pivot_longer pivot_wider
"_PACKAGE"

Don't forget to run devtools::document() after writing the docstrings, otherwise the NAMESPACE file will not be updated. Even during development, the methods are only used once the NAMESPACE includes them.
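For the first bullet in the list above, a method with its Roxygen tags could look like the following sketch. The header text is a placeholder and the scientist is assumed to be stored as a plain attribute. What matters here are the @importFrom and @export tags.

```r
#' Header line for printing a `tbl_sub`
#'
#' @param x A `tbl_sub` object.
#' @param ... Passed on to the next method.
#'
#' @importFrom tibble tbl_sum
#' @export
tbl_sum.tbl_sub <- function(x, ...) {
  # Placeholder header; assumes the scientist is a plain attribute.
  scientist <- paste(attr(x, "scientist"), collapse = ", ")
  c("A tbl_sub by" = scientist, NextMethod())
}

# After devtools::document(), the NAMESPACE file should contain entries
# along the lines of:
#   S3method(tbl_sum,tbl_sub)
#   importFrom(tibble,tbl_sum)

# Quick interactive check outside of a package:
example_tbl <- tibble::new_tibble(
  list(var1 = 1:3), nrow = 3L, scientist = "Martin", class = "tbl_sub"
)
tibble::tbl_sum(example_tbl)
```

Placing @importFrom next to the method is an alternative to collecting all imports in the package docstring. Both end up in the same NAMESPACE file, so which one you choose is mostly a matter of taste.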

Define which functions are user facing

As with any other function, use the @export tag for new functions that you want to be user-facing.

In our example I would not export the low-level constructor new_tbl_sub, but only tbl_sub. Other useful functions to export are the accessor functions, like get_scientist and set_scientist.
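A sketch of how that split could look with Roxygen tags is shown below. The function bodies are minimal stand-ins that assume the scientist is a plain attribute. They are not the implementation from earlier in the article, and the input checks in tbl_sub are only an illustration of what a user-facing constructor might add.

```r
#' Low-level constructor, kept internal (no @export tag)
#' @noRd
new_tbl_sub <- function(x, scientist) {
  tibble::new_tibble(as.list(x), nrow = nrow(x), scientist = scientist, class = "tbl_sub")
}

#' Create a `tbl_sub`
#'
#' @param x A data frame or tibble.
#' @param scientist Name of the scientist who ran the experiment.
#' @export
tbl_sub <- function(x, scientist) {
  # Illustrative input checks before handing over to the internal constructor.
  stopifnot(is.data.frame(x), is.character(scientist))
  new_tbl_sub(x, scientist)
}

#' Get or set the scientist of a `tbl_sub`
#'
#' @param x A `tbl_sub` object.
#' @param scientist New value for the scientist attribute.
#' @export
get_scientist <- function(x) attr(x, "scientist")

#' @rdname get_scientist
#' @export
set_scientist <- function(x, scientist) {
  attr(x, "scientist") <- scientist
  x
}

my_tbl <- tbl_sub(data.frame(var1 = 1:5), scientist = "Martin")
get_scientist(set_scientist(my_tbl, "Someone Else"))
```

Keeping new_tbl_sub internal means users can only create objects through the checked constructor, while the accessors give them a stable way to read and change the attribute without touching attr() directly.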

References

Last but not least, I am standing on the shoulders of giants in providing this work. Many thanks to the developers of the tidyverse for making it extensible in this way, and a special shout-out to the tsibble package, whose implementation of a custom class I referenced heavily.

If you want to read up on more details, below are some resources that I found helpful: