Beyond tibbles - Creating your own tibble subclass and making it work with the tidyverse
Sometimes your data just isn't rectangular
By Martin Helm in R
May 5, 2024
Introduction
The tidyverse is the most popular analysis framework for R. It is based on the tibble class, which extends data.frames with some more intuitive behavior. But at its core, it is still a rectangular table format.
While this is sufficient for many cases, sometimes you need to store additional data that doesn't fit easily into the table format, or that would introduce a lot of redundancy. For example, if you want to track single values, such as the date of an analysis, you could add a date column, but that would hardly be efficient.
In other circumstances, you have data that belongs together from a domain perspective, but because the pieces have different dimensions, you cannot simply join the two tables. Finally, sometimes your data needs to be processed differently than the standard tidyverse verbs do it; for time series data, for example, some functions might need to take the order of the rows over time into account.
For both problems, you can create a subclass of a tibble
. This lets you easily store additional attributes and also modify the behavior of functions in the tidyverse when they are applied to your custom class.
This is an advanced topic, for which unfortunately not much documentation exists. In this post I will lay out how you can:
- Implement your own tibble subclass
- Modify tidyverse functions by implementing custom S3 methods
- Handle the additional requirements when doing this in a package
Throughout the article I will use the ::
notation to make it clear where
each command comes from.
Creating your own S3 tibble subclass
Instantiating a tibble subclass
Tibbles are at their core an S3 class that inherits from data.frame
. It is defined in the {tibble}
package. As there are a lot of nice introductions on how the S3 class system works, e.g. in Hadley Wickham's
Advanced R, I will not cover it in detail here and assume that you know the basics of S3 and inheritance. If you need to refresh your knowledge, pause now and briefly read through the chapter linked above.
Very fortunately for us, the tibble
class was designed to be extended. This can be
done via the low-level constructor function, as described in Advanced R.
We can now easily create a new subclass just by passing in a class
argument to the new_tibble
function.
We will use the class tbl_sub
throughout the article. Replace it with your desired class name.
Also note that we need to provide some initial data. This can be a list
, data.frame
or tibble
.
my_tibble <- tibble::new_tibble(tibble::tibble(), class = "tbl_sub")
my_tibble
## # A tibble: 0 × 0
class(my_tibble)
## [1] "tbl_sub" "tbl_df" "tbl" "data.frame"
This in itself is not very useful, as we cannot store additional data yet (but it would already be sufficient to modify the tidyverse's behavior for our custom class). Let's assume that the data we are working with are some kind of measurements recorded during an experiment, and we want to store the person who did the experiment as well. We could in theory add another column with this information, but that would be redundant and can waste a lot of memory for large data sets.
Instead we will add another attribute scientist
to store this:
my_tibble_sub <- tibble::new_tibble(tibble::tibble(), scientist = "Martin", class = "tbl_sub")
my_tibble_sub
## # A tibble: 0 × 0
As you can see, the attribute is not printed by default. This is because we need to modify
the print functionality so that it knows about the new attribute; this will be done a bit later. But the attribute exists and we can access it with attr
:
attributes(my_tibble_sub)
## $class
## [1] "tbl_sub" "tbl_df" "tbl" "data.frame"
##
## $row.names
## integer(0)
##
## $names
## character(0)
##
## $scientist
## [1] "Martin"
attr(my_tibble_sub, "scientist")
## [1] "Martin"
Congratulations, you have already created your tibble subclass with a custom attribute!
But to make our solution more visible to the user, easier to use and more robust, I want to introduce two more topics before we move on to modifying dplyr
and tidyr
: constructor functions and attribute access.
Constructor functions
The S3 class system of R is very flexible, but at the same time this provides ample opportunity for the user to misuse our class. To prevent unintended behavior (for example, to prevent the user from setting a number for our scientist
attribute), one typically creates constructor functions for the object itself and provides accessor functions for the attributes.
With these functions, we can provide sensible defaults and check the user input before it is passed on to low-level functions. These low-level functions are then the ones that do the actual work, but we can have much more confidence that the input they receive is appropriate. (Note that even the low-level functions are not truly hidden from the user, who can still access them with the :::
operator, but it makes clear that this is not intended.) This is similar to other object-oriented programming approaches, where you have constructors, builder and factory patterns, etc.
We will therefore create two new functions, tbl_sub
as the user facing function, and new_tbl_sub
as the low-level constructor:
tbl_sub <- function(x, scientist) {
new_tbl_sub(x, scientist = scientist)
}
new_tbl_sub <- function(x, scientist = NULL) {
tibble::new_tibble(x, scientist = scientist, class = "tbl_sub")
}
my_tibble_sub <- tbl_sub(tibble::tibble(), scientist = "Martin")
Note that for the user-facing function, we require the scientist
argument to be set, whereas the
low-level function defaults to NULL
. This gives us as developers the possibility to create
a tbl_sub
without a scientist
attribute, should we come across a use case for this in the future,
while still making it clear to the user that every experiment must have an associated scientist.
In addition to the constructor function, we also implement a check function to verify that an instance of our tbl_sub
class is well behaved. We can later use this to check that the user didn't meddle with our class in unintended ways (e.g. by using the :::
operator). Since these checks will run frequently, basically in all of your custom functions and methods, they should be lightweight.
Typical checks would be that the input object is really of our custom class, as well as checking that the attributes are present.
# This function is usually user-facing
is_tbl_sub <- function(x) {
inherits(x, "tbl_sub")
}
# This is mostly for internal use
check_attributes <- function(x) {
required_attributes <- "scientist"
all(required_attributes %in% names(attributes(x)))
}
# Our actual function to perform the checks
check_tbl_sub <- function(x) {
stopifnot("x is not a tbl_sub" = is_tbl_sub(x))
stopifnot("At least one required attribute is missing" = check_attributes(x))
}
If you want to do more heavy-weight checks, I recommend creating your check function with an argument like thorough = FALSE
, and only applying
the more intensive checks when it is set to TRUE. You can then use them when it really matters, while defaulting to the cheap checks, as sketched below.
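A minimal sketch of what this could look like (the thorough argument and the extra check shown here are illustrative, not part of the implementation above):
check_tbl_sub <- function(x, thorough = FALSE) {
  # Cheap checks that always run
  stopifnot("x is not a tbl_sub" = is_tbl_sub(x))
  stopifnot("At least one required attribute is missing" = check_attributes(x))
  if (thorough) {
    # More expensive checks, only run when explicitly requested
    stopifnot("scientist must be a character" = is.character(attr(x, "scientist")))
  }
  invisible(x)
}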
Accessor functions
As we have seen above, accessing our custom attribute is so far not very convenient, as we need to call attr()
and attr() <- value
to interact with it.
Therefore, you typically provide get and set functions to do this.
Within these functions you can also check the user's input to prevent misuse.
Alternatively, you could also create get and set methods using the S3 system, but this is typically much more complicated and doesn't add a lot of value.
get_scientist <- function(x) {
check_tbl_sub(x)
attr(x, "scientist")
}
set_scientist <- function(x, scientist) {
check_tbl_sub(x)
stopifnot("scientist must be a character" = is.character(scientist))
attr(x, "scientist") <- scientist
x
}
get_scientist(my_tibble_sub)
## [1] "Martin"
Setting a number for scientist is not allowed:
x <- set_scientist(my_tibble_sub, 1)
## Error in set_scientist(my_tibble_sub, 1): scientist must be a character
Setting up for the future
Throughout the article, or your own implementation, you will need to reference what you expect your custom tibble subclass to look like in several places. To prevent errors and keep your code DRY, I highly recommend storing this format in a central place, making it easy to reference, for example like in the sketch below.
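One possible way to do this (the tbl_sub_format object below is purely illustrative; adapt the entries to your own class):
# Keep the expected shape of the class in one place ...
tbl_sub_format <- list(
  class = c("tbl_sub", "tbl_df", "tbl", "data.frame"),
  required_attributes = "scientist"
)
# ... and reference it instead of hard-coding strings, e.g. in the check from above
check_attributes <- function(x) {
  all(tbl_sub_format$required_attributes %in% names(attributes(x)))
}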
Modifying tidyverse methods for our custom class
Recap of S3 methods and generics
Generics are foundational functions for R's S3 class system. They allow for polymorphism,
meaning that a function with the same name behaves differently depending on the object it was called on. For example, print
is a generic that handles a multitude of classes, providing custom printing for each one.
To create an S3 method, you first need to define a generic with UseMethod
(if it hasn't been defined already by base R or a package you are extending).
my_generic <- function(x) {
UseMethod("my_generic")
}
Then you can define your S3 methods with the typical function
syntax, with one difference:
The name of the function needs to be in the format functionname.classname
:
my_generic.list <- function(x) {
# Do something specific for lists
}
my_generic.data.frame <- function(x) {
# Do something specific for data.frames
}
When the function my_generic
is called, R will now check the class of x
and call the corresponding method for it.
Some objects have multiple classes; in our example, tibbles
actually have class c("tbl_df", "tbl", "data.frame")
. In these cases, method dispatch proceeds from left to right: first R checks whether a method for tbl_df
is defined, if not whether a method for tbl
is defined, and so on.
One pattern we will see frequently in this post is handing off to a downstream method. This allows us to write basic behavior for lower-level classes, which can be reused by more specific upstream classes. We do this by calling NextMethod
after having performed the more specific parts for the class at hand, as in the small sketch below.
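As a small toy example with our class (the methods below are made up purely for illustration):
my_generic.tbl_sub <- function(x) {
  cat("tbl_sub-specific behavior first\n")
  NextMethod() # hand off to the next class in class(x)
}
my_generic.data.frame <- function(x) {
  cat("generic data.frame behavior\n")
}
# class(my_tibble_sub) is c("tbl_sub", "tbl_df", "tbl", "data.frame"); there are
# no methods for tbl_df or tbl, so NextMethod falls through to data.frame
my_generic(my_tibble_sub)
## tbl_sub-specific behavior first
## generic data.frame behavior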
tibble
Printing
So far, we can create a custom class and store the attribute scientist
, but when printing our object it appears as a regular tibble. To make it clearer to the user that this is a different custom class, we will first modify the printing behavior so that it also shows the additional information.
To do so, we overwrite the tbl_sum
function (implemented in the pillar
package, which is responsible for tibble printing). Alternatively, we could also overwrite the print
method, which in turn calls tbl_sum
.
We also see a pattern here that will continue throughout the post: the call to NextMethod
. Very often you might want to keep most of the functionality of the tidyverse and only modify some elements. To do this, we can simply hand off to the respective method in the tidyverse and modify the result afterwards according to our needs.
In the printing case, the call to NextMethod
is important, as otherwise the default tbl printing would not occur (the call stack is our method -> next most specific method, and so on). If you only want to show your custom print, then leave it out.
tbl_sum.tbl_sub <- function(x, ...) {
c(
"A tbl_sub containing data from an experiment by " = get_scientist(x),
NextMethod()
)
}
my_tibble_sub
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 0 × 0
as_tibble
Another peculiarity I found is that tibble::as_tibble()
strips the class, but retains the attribute.
my_tibble_sub %>% tibble::as_tibble() %>% class()
## [1] "tbl_df" "tbl" "data.frame"
my_tibble_sub %>% tibble::as_tibble() %>% attributes()
## $class
## [1] "tbl_df" "tbl" "data.frame"
##
## $row.names
## integer(0)
##
## $names
## character(0)
##
## $scientist
## [1] "Martin"
I personally found this inconsistent and implemented a custom as_tibble
method:
as_tibble.tbl_sub <- function(x,
...,
.rows = NULL,
.name_repair = c("check_unique", "unique", "universal", "minimal"),
rownames = pkgconfig::get_config("tibble::rownames", NULL)
) {
tbl <- NextMethod()
# Use an empty tibble as a template for what attributes to keep
tbl_attributes <- tibble::tibble() %>%
attributes() %>%
names()
attributes(tbl) <- attributes(x)[tbl_attributes]
tbl
}
tibble::as_tibble(my_tibble_sub) %>% attributes()
## $class
## [1] "tbl_sub" "tbl_df" "tbl" "data.frame"
##
## $row.names
## integer(0)
##
## $names
## character(0)
Other methods to consider for implementation
Another method that you might find interesting to implement is as_tbl_sub
, to convert objects of other
classes to your subclass; a minimal sketch follows below.
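Such a converter is not shown in this article's code, but it could look roughly like this (the data.frame method and its arguments are an assumption about what you might need):
as_tbl_sub <- function(x, ...) {
  UseMethod("as_tbl_sub")
}
as_tbl_sub.data.frame <- function(x, scientist, ...) {
  # Reuse the user-facing constructor so all input checks still apply
  tbl_sub(tibble::as_tibble(x), scientist = scientist)
}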
dplyr
Next we will extend dplyr to work with our custom attribute. Unfortunately, there is very little documentation on how to do this exactly, except for the extend-dplyr vignette.
I observed that most dplyr functions work nicely with custom classes (they keep the class and preserve attributes). Grouping, on the other hand, introduces a lot of problems, as it strips the custom class and attributes by default. To prevent this, we need to implement custom group_by
and ungroup
methods. Also, functions that combine data.frames
, such as bind_cols
, need to know what to do with your class as well. Otherwise they will silently drop information.
Let’s tackle these problems:
Implementing custom grouping generics
The grouping functions group_by
and ungroup
drop classes. This is because the default data.frame
method for group_by
always constructs a new grouped_df
to return.
my_tibble_sub <- tbl_sub(
tibble::tibble(
var1 = 1:10,
var2 = sample(c("a", "b"), 10, replace = TRUE),
var3 = sample(c("c", "d"), 10, replace = TRUE)),
scientist = "Martin")
my_tibble_sub %>% group_by(var2) %>% ungroup() %>% class()
## [1] "tbl_df" "tbl" "data.frame"
my_tibble_sub %>% group_by(var2) %>% ungroup() %>% get_scientist()
## Error in check_tbl_sub(x): x is not a tbl_sub
The culprit lines in dplyr are here:
# grouped-df.R
grouped_df <- function(data, vars, drop = group_by_drop_default(data)) {
# ... Some preceding code left out for clarity
if (length(vars) == 0) {
as_tibble(data) # This will remove your custom class and attributes if you don't have any grouping variables.
} else {
groups <- compute_groups(data, vars, drop = drop)
new_grouped_df(data, groups) # This removes your custom class and attributes if you have grouping variables, as it creates a new grouped_df object from scratch
}
}
Subclassing grouped_df
I found subclassing the grouped_df directly to be the best solution.
The problem with subclassing is that you lose the tbl_sub
class information, and you still need to modify the class vector manually to insert it at the right position. Passing it to dplyr::new_grouped_df
would add it in between grouped_tbl_sub
and grouped_df
, but it must come after grouped_df
:
tbl <- tibble::as_tibble(my_tibble_sub)
grouped_tbl <- dplyr::group_by(tbl, var2)
grouping_structure <- dplyr::group_data(grouped_tbl)
# This implementation adds the tbl_sub class at the wrong position
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub", "tbl_sub"))
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "tbl_sub" "grouped_df" "tbl_df"
## [5] "tbl" "data.frame"
# This implementation would lose the information that it is a tbl_sub, i.e. after ungrouping this would be lost completely.
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub"))
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "grouped_df" "tbl_df" "tbl"
## [5] "data.frame"
# Therefore the correct implementation needs to adjust the class vector manually
new_grouped_tbl_sub <- function(x, groups, scientist = NULL) {
x <- dplyr::new_grouped_df(x = x, groups = groups, scientist = scientist, class = c("grouped_tbl_sub"))
tbl_df_location <- grep("tbl_df", class(x), fixed = TRUE)
class(x) <- append(class(x), "tbl_sub", after = tbl_df_location -1)
x
}
new_grouped_tbl_sub(my_tibble_sub, groups = grouping_structure, scientist = "Martin") %>% class()
## [1] "grouped_tbl_sub" "grouped_df" "tbl_sub" "tbl_df"
## [5] "tbl" "data.frame"
Since this is only for internal use later on, we don't implement a user-facing counterpart.
As a reference, the tsibble
package does not implement such a constructor at all; they do everything
inside the other method definitions, such as group_by
and ungroup
.
Group_by
With our current use case, we could simply call the dplyr::dplyr_reconstruct
function to restore the attributes that were lost by grouping and ungrouping:
my_tibble_sub_after_grouping <- my_tibble_sub %>% group_by(var2) %>% ungroup()
my_tibble_sub_reconstructed <- dplyr::dplyr_reconstruct(my_tibble_sub_after_grouping, my_tibble_sub)
my_tibble_sub_reconstructed
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
class(my_tibble_sub_reconstructed)
## [1] "tbl_sub" "tbl_df" "tbl" "data.frame"
get_scientist(my_tibble_sub_reconstructed)
## [1] "Martin"
We need to do the following steps:
- Run the default group_by method to get the grouping information
- Restore our custom attributes and class. This will remove the grouping information!
- Add the grouping information back from the default group_by
- Modify the class vector of the returned object, to make it clear that it is a subclass of
grouped_df
. This is necessary so that we can later also handle grouped instances of our subclass for other dplyr verbs
group_by.tbl_sub <- function(.data, ..., .add = FALSE, drop = dplyr::group_by_drop_default(.data)) {
grouped_tbl <- NextMethod()
if (dplyr::is.grouped_df(grouped_tbl)) {
# Extract grouping information from default method
grouping_structure <- dplyr::group_data(grouped_tbl)
# Restore original attributes and add grouping information
x <- new_grouped_tbl_sub(.data, groups = grouping_structure, scientist = get_scientist(.data))
} else {
# This is an edge case if no groups are actually provided. Then simply return a regular subclass.
# If we don't implement this, a call like my_tibble_sub %>% group_by(var2, var3) %>% ungroup(var2, var3) would still have a groups attribute
x <- new_tbl_sub(x = .data, scientist = get_scientist(.data))
}
x
}
my_tibble_sub %>% dplyr::group_by(var2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## # Groups: var2 [2]
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
my_tibble_sub %>% dplyr::group_by(var2) %>% class()
## [1] "grouped_tbl_sub" "grouped_df" "tbl_sub" "tbl_df"
## [5] "tbl" "data.frame"
my_tibble_sub %>% dplyr::group_by(var2) %>% get_scientist()
## [1] "Martin"
my_tibble_sub %>% dplyr::group_by(var2) %>% attributes()
## $class
## [1] "grouped_tbl_sub" "grouped_df" "tbl_sub" "tbl_df"
## [5] "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $names
## [1] "var1" "var2" "var3"
##
## $scientist
## [1] "Martin"
##
## $groups
## # A tibble: 2 × 2
## var2 .rows
## <chr> <list<int>>
## 1 a [6]
## 2 b [4]
# It also works with multiple levels of grouping:
my_tibble_sub %>% dplyr::group_by(var2, var3)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## # Groups: var2, var3 [4]
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
Ungrouping
Now we still need to implement a custom ungrouping method, because currently after ungrouping everything is stripped again:
my_tibble_sub %>% dplyr::group_by(var2) %>% dplyr::ungroup()
## # A tibble: 10 × 3
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
Thanks to our custom grouped_tbl_sub
class, we can now identify when to call our custom ungrouping method. The general logic is again very similar: run the default ungrouping and then restore our attributes.
But we need to be a bit careful about what to restore, as a naive restore would write back all groups (groups are just an attribute under the hood) as well as the old class vector. At the same time, we need to support partial ungrouping, so we cannot call as_tibble
, as this would strip all grouping information.
ungroup.grouped_tbl_sub <- function(x, ...) {
# Run default ungrouping.
tbl <- NextMethod()
if (dplyr::is_grouped_df(tbl)) {
# If the tibble is still grouped, we don't need to do anything, as the default dplyr method doesn't call as_tibble
x <- tbl
} else {
# Otherwise the tibble is completely ungrouped and we need to reapply custom attributes, but remove grouping
# This is most simply done by creating a new tibble subclass
x <- tbl_sub(tbl, scientist = get_scientist(x))
}
x
}
my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup()
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup(var2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## # Groups: var3 [2]
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
# Also works with ungrouping all groups explicitly
my_tibble_sub %>% dplyr::group_by(var2, var3) %>% dplyr::ungroup(var2, var3)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## var1 var2 var3
## <int> <chr> <chr>
## 1 1 b c
## 2 2 a d
## 3 3 a c
## 4 4 b c
## 5 5 a d
## 6 6 a c
## 7 7 a d
## 8 8 a c
## 9 9 b c
## 10 10 b d
Slicing
Slicing with our custom tibble works out of the box:
my_tibble_sub %>% slice_max(var1, n = 2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 2 × 3
## var1 var2 var3
## <int> <chr> <chr>
## 1 10 b d
## 2 9 b c
But slicing with a grouped custom tibble does not; notice that the result got turned into a regular tibble again:
my_tibble_sub %>% group_by(var2) %>% slice_max(var1, n = 2)
## # A tibble: 4 × 3
## # Groups: var2 [2]
## var1 var2 var3
## <int> <chr> <chr>
## 1 8 a c
## 2 7 a d
## 3 10 b d
## 4 9 b c
Checking the dplyr documentation, we see that we need to implement a custom dplyr_row_slice
method. The default method creates a new grouped_df
from scratch, which again removes
custom class and attributes. By now you will know the pattern:
dplyr_row_slice.grouped_tbl_sub <- function(data, i, ..., preserve = FALSE) {
tbl <- NextMethod()
groups <- group_data(tbl)
new_grouped_tbl_sub(tbl, groups, scientist = get_scientist(data))
}
my_tibble_sub %>% group_by(var2) %>% slice_max(var1, n = 2)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 4 × 3
## # Groups: var2 [2]
## var1 var2 var3
## <int> <chr> <chr>
## 1 8 a c
## 2 7 a d
## 3 10 b d
## 4 9 b c
Mutate
The same pattern occurs for mutate, where again the subclass is lost only for grouped data:
my_tibble_sub %>%
dplyr::mutate(new_var = var1 + 10)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 4
## var1 var2 var3 new_var
## <int> <chr> <chr> <dbl>
## 1 1 b c 11
## 2 2 a d 12
## 3 3 a c 13
## 4 4 b c 14
## 5 5 a d 15
## 6 6 a c 16
## 7 7 a d 17
## 8 8 a c 18
## 9 9 b c 19
## 10 10 b d 20
my_tibble_sub %>%
dplyr::group_by(var2) %>%
dplyr::mutate(new_var = var1 + 10)
## # A tibble: 10 × 4
## # Groups: var2 [2]
## var1 var2 var3 new_var
## <int> <chr> <chr> <dbl>
## 1 1 b c 11
## 2 2 a d 12
## 3 3 a c 13
## 4 4 b c 14
## 5 5 a d 15
## 6 6 a c 16
## 7 7 a d 17
## 8 8 a c 18
## 9 9 b c 19
## 10 10 b d 20
mutate.grouped_tbl_sub <- function(.data,
...,
.by = NULL,
.keep = c("all", "used", "unused", "none"),
.before = NULL,
.after = NULL) {
tbl <- NextMethod()
groups <- group_data(tbl)
new_grouped_tbl_sub(tbl, groups, scientist = get_scientist(.data))
}
my_tibble_sub %>%
dplyr::group_by(var2) %>%
dplyr::mutate(new_var = var1 + 10)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 4
## # Groups: var2 [2]
## var1 var2 var3 new_var
## <int> <chr> <chr> <dbl>
## 1 1 b c 11
## 2 2 a d 12
## 3 3 a c 13
## 4 4 b c 14
## 5 5 a d 15
## 6 6 a c 16
## 7 7 a d 17
## 8 8 a c 18
## 9 9 b c 19
## 10 10 b d 20
Combining multiple tbl_sub instances
I don't cover every method here, just some examples. The right behavior depends on your actual data and what you need to happen.
bind_rows
bind_rows only preserves the attributes of the first input and silently drops the rest, because it doesn't know how to combine them.
my_tibble_sub <- tbl_sub(data.frame(var1 = 1:5), scientist = "Martin")
my_tibble_sub_2 <- tbl_sub(data.frame(var1 = 10:14), scientist = "Someone Else")
my_tibble_sub %>%
bind_rows(my_tibble_sub_2) %>%
get_scientist()
## [1] "Martin"
Depending on your class, you need to think about what makes sense.
In this case, one could simply concatenate the scientist information from both tbl_sub
instances,
losing the information about exactly which data was provided by which scientist,
or you could additionally create a new column that records this information.
Or you could prevent this behavior completely by throwing an error (or a warning if you only concatenate the two scientist attributes).
Unfortunately, since bind_rows
is not a generic, we cannot override it with a custom
method.
There are 3 options:
- Ask the users to provide an input to the .id argument themselves, to identify the individual scientists later on. This still gives incorrect attributes and printing.
- Implement an rbind method, which fortunately is a generic. But a user might use bind_rows instead without looking too closely.
- Overwrite dplyr's bind_rows implementation. This makes the behavior more obvious to the user.
As an example, I implement our own bind_rows
here and also reference it from the
rbind
generic. I found this to be the most obvious solution, and the user can
still opt for dplyr's version with dplyr::bind_rows
.
bind_rows <- function(..., .id = NULL) {
dots <- rlang::list2(...)
scientists <- lapply(dots, get_scientist) %>% unlist()
# If the user didn't specify custom identifiers, use the scientists by default
if (!rlang::is_named(dots)) {
names(dots) <- scientists
}
# If the user didn't specify a custom name for the id column, use scientist
if (is.null(.id)) {
.id <- "scientist"
}
tbl_concatenated <- dplyr::bind_rows(dots, .id = .id) %>%
set_scientist(scientists)
tbl_concatenated
}
rbind.tbl_sub <- function(...) {
bind_rows(...) # Forward to our custom bind_rows and use defaults for arguments not available in rbind
}
bind_rows(my_tibble_sub, my_tibble_sub_2)
## # A tbl_sub containing data from an experiment by 1: Martin
## # A tbl_sub containing data from an experiment by 2: Someone Else
## # A tibble: 10 × 2
## scientist var1
## <chr> <int>
## 1 Martin 1
## 2 Martin 2
## 3 Martin 3
## 4 Martin 4
## 5 Martin 5
## 6 Someone Else 10
## 7 Someone Else 11
## 8 Someone Else 12
## 9 Someone Else 13
## 10 Someone Else 14
bind_cols
A similar case happens for bind_cols
. In this case you would probably modify the column names to retain the information.
This is left as an exercise for the reader ;) but one possible direction is sketched after the example.
tbl_sub_1 <- tbl_sub(data.frame(var1 = 1:5), scientist = "Martin")
tbl_sub_2 <- tbl_sub(data.frame(var2 = letters[1:5]), scientist = "Someone Else")
tbl_sub_1 %>%
bind_cols(tbl_sub_2) %>%
get_scientist()
## [1] "Martin"
Rowwise data.frames
So far in this article, I have focused mostly on grouped data.frames, as I personally use them more often. But the same problems we discussed also occur for rowwise data.frames, as the dplyr functions construct rowwise data.frames from scratch.
my_tibble_sub %>%
dplyr::rowwise() %>%
dplyr::mutate(new_var = var1 + 10)
## # A tibble: 5 × 2
## # Rowwise:
## var1 new_var
## <int> <dbl>
## 1 1 11
## 2 2 12
## 3 3 13
## 4 4 14
## 5 5 15
To prevent this you will need to implement a custom rowwise class and also the same methods for your rowwise subclass. Since this article is very long and the pattern gets repetitive, I don’t show examples here. If you are facing problems, feel free to reach out so I can give you some guidance.
tidyr
Similar to dplyr, several of the tidyr verbs drop attributes. Here I will show how to implement methods for the pivot_*
functions, as they are probably among the functions most commonly used.
When developing these methods, we need to be careful with what the users pass
to the data-masked arguments, like names_from
. As these accept both character
and
unquoted input, we first need to defuse the arguments and then inject them into the tidyr method
using the !!
operator.
pivot_wider.tbl_sub <- function(data,
...,
id_cols = NULL,
id_expand = FALSE,
names_from = name,
names_prefix = "",
names_sep = "_",
names_glue = NULL,
names_sort = FALSE,
names_vary = "fastest",
names_expand = FALSE,
names_repair = "check_unique",
values_from = value,
values_fill = NULL,
values_fn = NULL,
unused_fn = NULL) {
# We cannot use NextMethod directly. Instead defuse the arguments and inject into tidyr method.
names_from <- rlang::enquo(names_from)
values_from <- rlang::enquo(values_from)
id_cols <- rlang::enquo(id_cols)
x <- tidyr:::pivot_wider.data.frame(data,
...,
id_cols = !!id_cols,
id_expand = id_expand,
names_from = !!names_from,
names_prefix = names_prefix,
names_sep = names_sep,
names_glue = names_glue,
names_sort = names_sort,
names_vary = names_vary,
names_expand = names_expand,
names_repair = names_repair,
values_from = !!values_from,
values_fill = values_fill,
values_fn = values_fn,
unused_fn = unused_fn)
x <- dplyr::dplyr_reconstruct(x, data)
x
}
pivot_longer.tbl_sub <- function(data,
cols,
...,
cols_vary = "fastest",
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
names_ptypes = NULL,
names_transform = NULL,
names_repair = "check_unique",
values_to = "value",
values_drop_na = FALSE,
values_ptypes = NULL,
values_transform = NULL) {
cols <- rlang::enquo(cols)
x <- tidyr:::pivot_longer.data.frame(data = data,
cols = !!cols,
...,
cols_vary = cols_vary,
names_to = names_to,
names_prefix = names_prefix,
names_sep = names_sep,
names_pattern = names_pattern,
names_ptypes = names_ptypes,
names_transform = names_transform,
names_repair = names_repair,
values_to = values_to,
values_drop_na = values_drop_na,
values_ptypes = values_ptypes,
values_transform = values_transform
)
x <- dplyr::dplyr_reconstruct(x, data)
x
}
experiment_data <- tbl_sub(
tibble::tibble(
var1 = 1:10,
experiment = rep(c("a", "b"),each = 5),
cell_type = rep(c("type1", "type2", "type3", "type4", "type5"), times = 2)),
scientist = "Martin")
experiment_data %>%
pivot_wider(values_from = var1, names_from = cell_type)
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 2 × 6
## experiment type1 type2 type3 type4 type5
## <chr> <int> <int> <int> <int> <int>
## 1 a 1 2 3 4 5
## 2 b 6 7 8 9 10
experiment_data %>%
pivot_wider(values_from = var1, names_from = cell_type) %>%
pivot_longer(cols = dplyr::starts_with("type"), names_to = "cell_type", values_to = "var1")
## # A tbl_sub containing data from an experiment by : Martin
## # A tibble: 10 × 3
## experiment cell_type var1
## <chr> <chr> <int>
## 1 a type1 1
## 2 a type2 2
## 3 a type3 3
## 4 a type4 4
## 5 a type5 5
## 6 b type1 6
## 7 b type2 7
## 8 b type3 8
## 9 b type4 9
## 10 b type5 10
Implementing tibble subclasses in a package
When you develop a custom class, it is highly likely that you are doing so inside a package. First I will list what you need to do to make your custom methods work, and then give some guidelines on which of the other new functions you should export.
Importing and exporting methods and generics
When you are in a package context, you need to explicitly import the generics and export the methods we define here. Otherwise R will not know that a function is an S3 method and dispatch via the S3 system is not invoked.
In a nutshell you need to do the following:
- Add the @export tag to the Roxygen docstring for all your custom methods
- Import the generics from the original packages. This is best done also via the Roxygen docstring, using @importFrom. Note that you don't need to import generics from base R, such as rbind.
The second part is often done in a dedicated package description docstring. This docstring doesn't document a function; rather, it describes your package in general. It might look like this:
# file my_awesome_package.R
#' My awesome package
#'
#' This package implements the `tbl_sub` class that can store a scientist attribute.
#' Otherwise it behaves like a `tibble`
#'
#' @importFrom tibble as_tibble tbl_sum
#' @importFrom dplyr group_by ungroup mutate dplyr_row_slice
#' @importFrom tidyr pivot_longer pivot_wider
"_PACKAGE"
Don't forget to run devtools::document
after writing the docstrings, otherwise the NAMESPACE
file will not be updated. Even during development, the methods are only picked up once the namespace includes them. A sketch of the entries that end up in the NAMESPACE file is shown below.
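For orientation, the generated entries would look roughly like this (an abridged, illustrative excerpt; the exact content depends on which methods and functions you tag):
# NAMESPACE (generated by roxygen2, excerpt)
S3method(group_by, tbl_sub)
S3method(ungroup, grouped_tbl_sub)
S3method(tbl_sum, tbl_sub)
export(tbl_sub)
export(get_scientist)
export(set_scientist)
importFrom(dplyr, group_by)
importFrom(tidyr, pivot_wider)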
Define which functions are user facing
As with any other function, use the @export
tag for new functions that you want to be user-facing.
In our example I would not export the low-level constructor new_tbl_sub
, but only tbl_sub
.
Other useful functions to export are the accessor functions, like get_scientist
and set_scientist
.
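For example, the user-facing constructor could be tagged like this (a sketch; the documentation text itself is made up):
#' Create a tbl_sub
#'
#' @param x A list, data.frame or tibble containing the measurement data.
#' @param scientist The scientist who performed the experiment.
#' @export
tbl_sub <- function(x, scientist) {
  new_tbl_sub(x, scientist = scientist)
}

# No @export tag here: new_tbl_sub stays internal to the package
new_tbl_sub <- function(x, scientist = NULL) {
  tibble::new_tibble(x, scientist = scientist, class = "tbl_sub")
}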
References
Last but not least, I am standing on the shoulders of giants to provide you this work. Many thanks to the developers of the tidyverse for making it extensible in this way, and also a special
shout-out to the tsibble
package, whose implementation of their custom class I referenced heavily.
If you want to read up on more details, you will also find below some resources that I found helpful: