Join specifications — join

Source: R/join-by.R

join_by.Rd

join_by() constructs a specification that describes how to join two tablesusing a small domain specific language. The result can be supplied as theby argument to any of the join functions (such as left_join()).

Usage

join_by(...)

Arguments

...

Expressions specifying the join.

Each expression should consist of one of the following:

Equality condition: ==
Inequality conditions: >=, >, <=, or <
Rolling helper: closest()
Overlap helpers: between(), within(), or overlaps()

Other expressions are not supported. If you need to perform a join ona computed variable, e.g. join_by(sales_date - 40 >= promo_date),you'll need to precompute and store it in a separate column.

Column names should be specified as quoted or unquoted names. By default,the name on the left-hand side of a join condition refers to the left-handtable, unless overridden by explicitly prefixing the column name witheither x$ or y$.

If a single column name is provided without any join conditions, it isinterpreted as if that column name was duplicated on each side of ==,i.e. x is interpreted as x == x.

Join types

The following types of joins are supported by dplyr:

Equality joins
Inequality joins
Rolling joins
Overlap joins
Cross joins

Equality, inequality, rolling, and overlap joins are discussed in more detailbelow. Cross joins are implemented through cross_join().

Equality joins

Equality joins require keys to be equal between one or more pairs of columns,and are the most common type of join. To construct an equality join usingjoin_by(), supply two column names to join with separated by ==.Alternatively, supplying a single name will be interpreted as an equalityjoin between two columns of the same name. For example, join_by(x) isequivalent to join_by(x == x).

Inequality joins

Inequality joins match on an inequality, such as >, >=, <, or <=, andare common in time series analysis and genomics. To construct an inequalityjoin using join_by(), supply two column names separated by one of the abovementioned inequalities.

Note that inequality joins will match a single row in x to a potentiallylarge number of rows in y. Be extra careful when constructing inequalityjoin specifications!

Rolling joins

Rolling joins are a variant of inequality joins that limit the resultsreturned from an inequality join condition. They are useful for "rolling" theclosest match forward/backwards when there isn't an exact match. To constructa rolling join, wrap an inequality with closest().

closest(expr)
expr must be an inequality involving one of: >, >=, <, or <=.
For example, closest(x >= y) is interpreted as: For each value in x,find the closest value in y that is less than or equal to that x value.

closest() will always use the left-hand table (x) as the primary table,and the right-hand table (y) as the one to find the closest match in,regardless of how the inequality is specified. For example,closest(y$a >= x$b) will always be interpreted as closest(x$b <= y$a).

Overlap joins

Overlap joins are a special case of inequality joins involving one or twocolumns from the left-hand table overlapping a range defined by two columnsfrom the right-hand table. There are three helpers that join_by()recognizes to assist with constructing overlap joins, all of which can beconstructed from simpler inequalities.

between(x, y_lower, y_upper, ..., bounds = "[]")
For each value in x, this finds everywhere that value falls between[y_lower, y_upper]. Equivalent to x >= y_lower, x <= y_upper bydefault.
bounds can be one of "[]", "[)", "(]", or"()" to alter the inclusiveness of the lower and upper bounds. Thischanges whether >= or > and <= or < are used to build theinequalities shown above.
Dots are for future extensions and must be empty.
within(x_lower, x_upper, y_lower, y_upper)
For each range in [x_lower, x_upper], this finds everywhere that rangefalls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
The inequalities used to build within() are the same regardless of theinclusiveness of the supplied ranges.
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
For each range in [x_lower, x_upper], this finds everywhere that rangeoverlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
bounds can be one of "[]", "[)", "(]", or"()" to alter the inclusiveness of the lower and upper bounds."[]" uses <= and >=, but the 3 other options use < and >and generate the exact same inequalities.
Dots are for future extensions and must be empty.

These conditions assume that the ranges are well-formed and non-empty, i.e.x_lower <= x_upper when bounds are treated as "[]", andx_lower < x_upper otherwise.

Column referencing

When specifying join conditions, join_by() assumes that column names on theleft-hand side of the condition refer to the left-hand table (x), and nameson the right-hand side of the condition refer to the right-hand table (y).Occasionally, it is clearer to be able to specify a right-hand table name onthe left-hand side of the condition, and vice versa. To support this, columnnames can be prefixed by x$ or y$ to explicitly specify which table theycome from.

Examples

sales <- tibble( id = c(1L, 1L, 1L, 2L, 2L), sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01")))sales#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01promos <- tibble( id = c(1L, 1L, 2L), promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02")))promos#> # A tibble: 3 × 2#> id promo_date#> <int> <date> #> 1 1 2019-01-01#> 2 1 2019-01-05#> 3 2 2019-01-02# Match `id` to `id`, and `sale_date` to `promo_date`by <- join_by(id, sale_date == promo_date)left_join(sales, promos, by)#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01# For each `sale_date` within a particular `id`,# find all `promo_date`s that occurred before that particular saleby <- join_by(id, sale_date >= promo_date)left_join(sales, promos, by)#> # A tibble: 6 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 1 2019-01-05 2019-01-05#> 5 2 2019-01-04 2019-01-02#> 6 2 2019-01-01 NA # For each `sale_date` within a particular `id`,# find only the closest `promo_date` that occurred before that saleby <- join_by(id, closest(sale_date >= promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-05#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # If you want to disallow exact matching in rolling joins, use `>` rather# than `>=`. Note that the promo on `2019-01-05` is no longer considered the# closest match for the sale on the same date.by <- join_by(id, closest(sale_date > promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # Same as before, but also require that the promo had to occur at most 1# day before the sale was made. We'll use a full join to see that id 2's# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.sales <- mutate(sales, sale_date_lower = sale_date - 1)by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)full_join(sales, promos, by)#> # A tibble: 6 × 4#> id sale_date sale_date_lower promo_date#> <int> <date> <date> <date> #> 1 1 2018-12-31 2018-12-30 NA #> 2 1 2019-01-02 2019-01-01 2019-01-01#> 3 1 2019-01-05 2019-01-04 2019-01-05#> 4 2 2019-01-04 2019-01-03 NA #> 5 2 2019-01-01 2018-12-31 NA #> 6 2 NA NA 2019-01-02# ---------------------------------------------------------------------------segments <- tibble( segment_id = 1:4, chromosome = c("chr1", "chr2", "chr2", "chr1"), start = c(140, 210, 380, 230), end = c(150, 240, 415, 280))segments#> # A tibble: 4 × 4#> segment_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 140 150#> 2 2 chr2 210 240#> 3 3 chr2 380 415#> 4 4 chr1 230 280reference <- tibble( reference_id = 1:4, chromosome = c("chr1", "chr1", "chr2", "chr2"), start = c(100, 200, 300, 415), end = c(150, 250, 399, 450))reference#> # A tibble: 4 × 4#> reference_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 100 150#> 2 2 chr1 200 250#> 3 3 chr2 300 399#> 4 4 chr2 415 450# Find every time a segment `start` falls between the reference# `[start, end]` range.by <- join_by(chromosome, between(start, start, end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450# If you wanted the reference columns first, supply `reference` as `x`# and `segments` as `y`, then explicitly refer to their columns using `x$`# and `y$`.by <- join_by(chromosome, between(y$start, x$start, x$end))full_join(reference, segments, by)#> # A tibble: 5 × 7#> reference_id chromosome start.x end.x segment_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 100 150 1 140 150#> 2 2 chr1 200 250 4 230 280#> 3 3 chr2 300 399 3 380 415#> 4 4 chr2 415 450 NA NA NA#> 5 NA chr2 NA NA 2 210 240# Find every time a segment falls completely within a reference.# Sometimes using `x$` and `y$` makes your intentions clearer, even if they# match the default behavior.by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))inner_join(segments, reference, by)#> # A tibble: 1 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150# Find every time a segment overlaps a reference in any way.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 3 chr2 380 415 4 415 450#> 5 4 chr1 230 280 2 200 250# It is common to have right-open ranges with bounds like `[)`, which would# mean an end value of `415` would no longer overlap a start value of `415`.# Setting `bounds` allows you to compute overlaps with those kinds of ranges.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450

Join specifications — join_by (2024)