Source: R/join-by.R
join_by.Rd
join_by()
constructs a specification that describes how to join two tablesusing a small domain specific language. The result can be supplied as theby
argument to any of the join functions (such as left_join()
).
Usage
join_by(...)
Arguments
- ...
Expressions specifying the join.
Each expression should consist of one of the following:
Equality condition:
==
Inequality conditions:
>=
,>
,<=
, or<
Rolling helper:
closest()
See AlsoHow to merge data in R using R merge, dplyr, or data.tableA Scientist's Guide to R: Step 2.2 - Joining Data with dplyr | Craig Hutton, PhDjoin_by: Join specifications in tidyverse/dplyr: A Grammar of Data ManipulationHow to perform merges (joins) on two or more data frames with base R, tidyverse and data.tableOverlap helpers:
between()
,within()
, oroverlaps()
Other expressions are not supported. If you need to perform a join ona computed variable, e.g.
join_by(sales_date - 40 >= promo_date)
,you'll need to precompute and store it in a separate column.Column names should be specified as quoted or unquoted names. By default,the name on the left-hand side of a join condition refers to the left-handtable, unless overridden by explicitly prefixing the column name witheither
x$
ory$
.If a single column name is provided without any join conditions, it isinterpreted as if that column name was duplicated on each side of
==
,i.e.x
is interpreted asx == x
.
Join types
The following types of joins are supported by dplyr:
Equality joins
Inequality joins
Rolling joins
Overlap joins
Cross joins
Equality, inequality, rolling, and overlap joins are discussed in more detailbelow. Cross joins are implemented through cross_join()
.
Equality joins
Equality joins require keys to be equal between one or more pairs of columns,and are the most common type of join. To construct an equality join usingjoin_by()
, supply two column names to join with separated by ==
.Alternatively, supplying a single name will be interpreted as an equalityjoin between two columns of the same name. For example, join_by(x)
isequivalent to join_by(x == x)
.
Inequality joins
Inequality joins match on an inequality, such as >
, >=
, <
, or <=
, andare common in time series analysis and genomics. To construct an inequalityjoin using join_by()
, supply two column names separated by one of the abovementioned inequalities.
Note that inequality joins will match a single row in x
to a potentiallylarge number of rows in y
. Be extra careful when constructing inequalityjoin specifications!
Rolling joins
Rolling joins are a variant of inequality joins that limit the resultsreturned from an inequality join condition. They are useful for "rolling" theclosest match forward/backwards when there isn't an exact match. To constructa rolling join, wrap an inequality with closest()
.
closest(expr)
expr
must be an inequality involving one of:>
,>=
,<
, or<=
.For example,
closest(x >= y)
is interpreted as: For each value inx
,find the closest value iny
that is less than or equal to thatx
value.
closest()
will always use the left-hand table (x
) as the primary table,and the right-hand table (y
) as the one to find the closest match in,regardless of how the inequality is specified. For example,closest(y$a >= x$b)
will always be interpreted as closest(x$b <= y$a)
.
Overlap joins
Overlap joins are a special case of inequality joins involving one or twocolumns from the left-hand table overlapping a range defined by two columnsfrom the right-hand table. There are three helpers that join_by()
recognizes to assist with constructing overlap joins, all of which can beconstructed from simpler inequalities.
between(x, y_lower, y_upper, ..., bounds = "[]")
For each value in
x
, this finds everywhere that value falls between[y_lower, y_upper]
. Equivalent tox >= y_lower, x <= y_upper
bydefault.bounds
can be one of"[]"
,"[)"
,"(]"
, or"()"
to alter the inclusiveness of the lower and upper bounds. Thischanges whether>=
or>
and<=
or<
are used to build theinequalities shown above.Dots are for future extensions and must be empty.
within(x_lower, x_upper, y_lower, y_upper)
For each range in
[x_lower, x_upper]
, this finds everywhere that rangefalls completely within[y_lower, y_upper]
. Equivalent tox_lower >= y_lower, x_upper <= y_upper
.The inequalities used to build
within()
are the same regardless of theinclusiveness of the supplied ranges.overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
For each range in
[x_lower, x_upper]
, this finds everywhere that rangeoverlaps[y_lower, y_upper]
in any capacity. Equivalent tox_lower <= y_upper, x_upper >= y_lower
by default.bounds
can be one of"[]"
,"[)"
,"(]"
, or"()"
to alter the inclusiveness of the lower and upper bounds."[]"
uses<=
and>=
, but the 3 other options use<
and>
and generate the exact same inequalities.Dots are for future extensions and must be empty.
These conditions assume that the ranges are well-formed and non-empty, i.e.x_lower <= x_upper
when bounds are treated as "[]"
, andx_lower < x_upper
otherwise.
Column referencing
When specifying join conditions, join_by()
assumes that column names on theleft-hand side of the condition refer to the left-hand table (x
), and nameson the right-hand side of the condition refer to the right-hand table (y
).Occasionally, it is clearer to be able to specify a right-hand table name onthe left-hand side of the condition, and vice versa. To support this, columnnames can be prefixed by x$
or y$
to explicitly specify which table theycome from.
Examples
sales <- tibble( id = c(1L, 1L, 1L, 2L, 2L), sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01")))sales#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01promos <- tibble( id = c(1L, 1L, 2L), promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02")))promos#> # A tibble: 3 × 2#> id promo_date#> <int> <date> #> 1 1 2019-01-01#> 2 1 2019-01-05#> 3 2 2019-01-02# Match `id` to `id`, and `sale_date` to `promo_date`by <- join_by(id, sale_date == promo_date)left_join(sales, promos, by)#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01# For each `sale_date` within a particular `id`,# find all `promo_date`s that occurred before that particular saleby <- join_by(id, sale_date >= promo_date)left_join(sales, promos, by)#> # A tibble: 6 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 1 2019-01-05 2019-01-05#> 5 2 2019-01-04 2019-01-02#> 6 2 2019-01-01 NA # For each `sale_date` within a particular `id`,# find only the closest `promo_date` that occurred before that saleby <- join_by(id, closest(sale_date >= promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-05#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # If you want to disallow exact matching in rolling joins, use `>` rather# than `>=`. Note that the promo on `2019-01-05` is no longer considered the# closest match for the sale on the same date.by <- join_by(id, closest(sale_date > promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # Same as before, but also require that the promo had to occur at most 1# day before the sale was made. We'll use a full join to see that id 2's# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.sales <- mutate(sales, sale_date_lower = sale_date - 1)by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)full_join(sales, promos, by)#> # A tibble: 6 × 4#> id sale_date sale_date_lower promo_date#> <int> <date> <date> <date> #> 1 1 2018-12-31 2018-12-30 NA #> 2 1 2019-01-02 2019-01-01 2019-01-01#> 3 1 2019-01-05 2019-01-04 2019-01-05#> 4 2 2019-01-04 2019-01-03 NA #> 5 2 2019-01-01 2018-12-31 NA #> 6 2 NA NA 2019-01-02# ---------------------------------------------------------------------------segments <- tibble( segment_id = 1:4, chromosome = c("chr1", "chr2", "chr2", "chr1"), start = c(140, 210, 380, 230), end = c(150, 240, 415, 280))segments#> # A tibble: 4 × 4#> segment_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 140 150#> 2 2 chr2 210 240#> 3 3 chr2 380 415#> 4 4 chr1 230 280reference <- tibble( reference_id = 1:4, chromosome = c("chr1", "chr1", "chr2", "chr2"), start = c(100, 200, 300, 415), end = c(150, 250, 399, 450))reference#> # A tibble: 4 × 4#> reference_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 100 150#> 2 2 chr1 200 250#> 3 3 chr2 300 399#> 4 4 chr2 415 450# Find every time a segment `start` falls between the reference# `[start, end]` range.by <- join_by(chromosome, between(start, start, end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450# If you wanted the reference columns first, supply `reference` as `x`# and `segments` as `y`, then explicitly refer to their columns using `x$`# and `y$`.by <- join_by(chromosome, between(y$start, x$start, x$end))full_join(reference, segments, by)#> # A tibble: 5 × 7#> reference_id chromosome start.x end.x segment_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 100 150 1 140 150#> 2 2 chr1 200 250 4 230 280#> 3 3 chr2 300 399 3 380 415#> 4 4 chr2 415 450 NA NA NA#> 5 NA chr2 NA NA 2 210 240# Find every time a segment falls completely within a reference.# Sometimes using `x$` and `y$` makes your intentions clearer, even if they# match the default behavior.by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))inner_join(segments, reference, by)#> # A tibble: 1 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150# Find every time a segment overlaps a reference in any way.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 3 chr2 380 415 4 415 450#> 5 4 chr1 230 280 2 200 250# It is common to have right-open ranges with bounds like `[)`, which would# mean an end value of `415` would no longer overlap a start value of `415`.# Setting `bounds` allows you to compute overlaps with those kinds of ranges.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450