Join specifications — join_by (2024)

Join specifications — join_by (1)

Source: R/join-by.R

join_by.Rd

join_by() constructs a specification that describes how to join two tablesusing a small domain specific language. The result can be supplied as theby argument to any of the join functions (such as left_join()).

Usage

join_by(...)

Arguments

...

Expressions specifying the join.

Each expression should consist of one of the following:

  • Equality condition: ==

  • Inequality conditions: >=, >, <=, or <

  • Rolling helper: closest()

  • Overlap helpers: between(), within(), or overlaps()

Other expressions are not supported. If you need to perform a join ona computed variable, e.g. join_by(sales_date - 40 >= promo_date),you'll need to precompute and store it in a separate column.

Column names should be specified as quoted or unquoted names. By default,the name on the left-hand side of a join condition refers to the left-handtable, unless overridden by explicitly prefixing the column name witheither x$ or y$.

If a single column name is provided without any join conditions, it isinterpreted as if that column name was duplicated on each side of ==,i.e. x is interpreted as x == x.

Join types

The following types of joins are supported by dplyr:

  • Equality joins

  • Inequality joins

  • Rolling joins

  • Overlap joins

  • Cross joins

Equality, inequality, rolling, and overlap joins are discussed in more detailbelow. Cross joins are implemented through cross_join().

Equality joins

Equality joins require keys to be equal between one or more pairs of columns,and are the most common type of join. To construct an equality join usingjoin_by(), supply two column names to join with separated by ==.Alternatively, supplying a single name will be interpreted as an equalityjoin between two columns of the same name. For example, join_by(x) isequivalent to join_by(x == x).

Inequality joins

Inequality joins match on an inequality, such as >, >=, <, or <=, andare common in time series analysis and genomics. To construct an inequalityjoin using join_by(), supply two column names separated by one of the abovementioned inequalities.

Note that inequality joins will match a single row in x to a potentiallylarge number of rows in y. Be extra careful when constructing inequalityjoin specifications!

Rolling joins

Rolling joins are a variant of inequality joins that limit the resultsreturned from an inequality join condition. They are useful for "rolling" theclosest match forward/backwards when there isn't an exact match. To constructa rolling join, wrap an inequality with closest().

  • closest(expr)

    expr must be an inequality involving one of: >, >=, <, or <=.

    For example, closest(x >= y) is interpreted as: For each value in x,find the closest value in y that is less than or equal to that x value.

closest() will always use the left-hand table (x) as the primary table,and the right-hand table (y) as the one to find the closest match in,regardless of how the inequality is specified. For example,closest(y$a >= x$b) will always be interpreted as closest(x$b <= y$a).

Overlap joins

Overlap joins are a special case of inequality joins involving one or twocolumns from the left-hand table overlapping a range defined by two columnsfrom the right-hand table. There are three helpers that join_by()recognizes to assist with constructing overlap joins, all of which can beconstructed from simpler inequalities.

  • between(x, y_lower, y_upper, ..., bounds = "[]")

    For each value in x, this finds everywhere that value falls between[y_lower, y_upper]. Equivalent to x >= y_lower, x <= y_upper bydefault.

    bounds can be one of "[]", "[)", "(]", or"()" to alter the inclusiveness of the lower and upper bounds. Thischanges whether >= or > and <= or < are used to build theinequalities shown above.

    Dots are for future extensions and must be empty.

  • within(x_lower, x_upper, y_lower, y_upper)

    For each range in [x_lower, x_upper], this finds everywhere that rangefalls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.

    The inequalities used to build within() are the same regardless of theinclusiveness of the supplied ranges.

  • overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")

    For each range in [x_lower, x_upper], this finds everywhere that rangeoverlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.

    bounds can be one of "[]", "[)", "(]", or"()" to alter the inclusiveness of the lower and upper bounds."[]" uses <= and >=, but the 3 other options use < and >and generate the exact same inequalities.

    Dots are for future extensions and must be empty.

These conditions assume that the ranges are well-formed and non-empty, i.e.x_lower <= x_upper when bounds are treated as "[]", andx_lower < x_upper otherwise.

Column referencing

When specifying join conditions, join_by() assumes that column names on theleft-hand side of the condition refer to the left-hand table (x), and nameson the right-hand side of the condition refer to the right-hand table (y).Occasionally, it is clearer to be able to specify a right-hand table name onthe left-hand side of the condition, and vice versa. To support this, columnnames can be prefixed by x$ or y$ to explicitly specify which table theycome from.

Examples

sales <- tibble( id = c(1L, 1L, 1L, 2L, 2L), sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01")))sales#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01promos <- tibble( id = c(1L, 1L, 2L), promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02")))promos#> # A tibble: 3 × 2#> id promo_date#> <int> <date> #> 1 1 2019-01-01#> 2 1 2019-01-05#> 3 2 2019-01-02# Match `id` to `id`, and `sale_date` to `promo_date`by <- join_by(id, sale_date == promo_date)left_join(sales, promos, by)#> # A tibble: 5 × 2#> id sale_date #> <int> <date> #> 1 1 2018-12-31#> 2 1 2019-01-02#> 3 1 2019-01-05#> 4 2 2019-01-04#> 5 2 2019-01-01# For each `sale_date` within a particular `id`,# find all `promo_date`s that occurred before that particular saleby <- join_by(id, sale_date >= promo_date)left_join(sales, promos, by)#> # A tibble: 6 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 1 2019-01-05 2019-01-05#> 5 2 2019-01-04 2019-01-02#> 6 2 2019-01-01 NA # For each `sale_date` within a particular `id`,# find only the closest `promo_date` that occurred before that saleby <- join_by(id, closest(sale_date >= promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-05#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # If you want to disallow exact matching in rolling joins, use `>` rather# than `>=`. Note that the promo on `2019-01-05` is no longer considered the# closest match for the sale on the same date.by <- join_by(id, closest(sale_date > promo_date))left_join(sales, promos, by)#> # A tibble: 5 × 3#> id sale_date promo_date#> <int> <date> <date> #> 1 1 2018-12-31 NA #> 2 1 2019-01-02 2019-01-01#> 3 1 2019-01-05 2019-01-01#> 4 2 2019-01-04 2019-01-02#> 5 2 2019-01-01 NA # Same as before, but also require that the promo had to occur at most 1# day before the sale was made. We'll use a full join to see that id 2's# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.sales <- mutate(sales, sale_date_lower = sale_date - 1)by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)full_join(sales, promos, by)#> # A tibble: 6 × 4#> id sale_date sale_date_lower promo_date#> <int> <date> <date> <date> #> 1 1 2018-12-31 2018-12-30 NA #> 2 1 2019-01-02 2019-01-01 2019-01-01#> 3 1 2019-01-05 2019-01-04 2019-01-05#> 4 2 2019-01-04 2019-01-03 NA #> 5 2 2019-01-01 2018-12-31 NA #> 6 2 NA NA 2019-01-02# ---------------------------------------------------------------------------segments <- tibble( segment_id = 1:4, chromosome = c("chr1", "chr2", "chr2", "chr1"), start = c(140, 210, 380, 230), end = c(150, 240, 415, 280))segments#> # A tibble: 4 × 4#> segment_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 140 150#> 2 2 chr2 210 240#> 3 3 chr2 380 415#> 4 4 chr1 230 280reference <- tibble( reference_id = 1:4, chromosome = c("chr1", "chr1", "chr2", "chr2"), start = c(100, 200, 300, 415), end = c(150, 250, 399, 450))reference#> # A tibble: 4 × 4#> reference_id chromosome start end#> <int> <chr> <dbl> <dbl>#> 1 1 chr1 100 150#> 2 2 chr1 200 250#> 3 3 chr2 300 399#> 4 4 chr2 415 450# Find every time a segment `start` falls between the reference# `[start, end]` range.by <- join_by(chromosome, between(start, start, end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450# If you wanted the reference columns first, supply `reference` as `x`# and `segments` as `y`, then explicitly refer to their columns using `x$`# and `y$`.by <- join_by(chromosome, between(y$start, x$start, x$end))full_join(reference, segments, by)#> # A tibble: 5 × 7#> reference_id chromosome start.x end.x segment_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 100 150 1 140 150#> 2 2 chr1 200 250 4 230 280#> 3 3 chr2 300 399 3 380 415#> 4 4 chr2 415 450 NA NA NA#> 5 NA chr2 NA NA 2 210 240# Find every time a segment falls completely within a reference.# Sometimes using `x$` and `y$` makes your intentions clearer, even if they# match the default behavior.by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))inner_join(segments, reference, by)#> # A tibble: 1 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150# Find every time a segment overlaps a reference in any way.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 3 chr2 380 415 4 415 450#> 5 4 chr1 230 280 2 200 250# It is common to have right-open ranges with bounds like `[)`, which would# mean an end value of `415` would no longer overlap a start value of `415`.# Setting `bounds` allows you to compute overlaps with those kinds of ranges.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))full_join(segments, reference, by)#> # A tibble: 5 × 7#> segment_id chromosome start.x end.x reference_id start.y end.y#> <int> <chr> <dbl> <dbl> <int> <dbl> <dbl>#> 1 1 chr1 140 150 1 100 150#> 2 2 chr2 210 240 NA NA NA#> 3 3 chr2 380 415 3 300 399#> 4 4 chr1 230 280 2 200 250#> 5 NA chr2 NA NA 4 415 450
Join specifications — join_by (2024)
Top Articles
2022 fantasy football draft kit: Rankings, cheat sheets, mock drafts, sleepers and analysis
Osmumten's Fang Ge
NYT Mini Crossword today: puzzle answers for Tuesday, September 17 | Digital Trends
Victory Road Radical Red
Patreon, reimagined — a better future for creators and fans
Walgreens Pharmqcy
Trabestis En Beaumont
What to Serve with Lasagna (80+ side dishes and wine pairings)
Boat Jumping Female Otezla Commercial Actress
Blue Beetle Showtimes Near Regal Swamp Fox
Bahsid Mclean Uncensored Photo
Clear Fork Progress Book
Best Uf Sororities
3S Bivy Cover 2D Gen
Candy Land Santa Ana
Schedule 360 Albertsons
Strange World Showtimes Near Roxy Stadium 14
Gayla Glenn Harris County Texas Update
China’s UberEats - Meituan Dianping, Abandons Bike Sharing And Ride Hailing - Digital Crew
Icivics The Electoral Process Answer Key
Znamy dalsze plany Magdaleny Fręch. Nie będzie nawet chwili przerwy
Dr Seuss Star Bellied Sneetches Pdf
Publix Near 12401 International Drive
Star Wars Armada Wikia
Roseann Marie Messina · 15800 Detroit Ave, Suite D, Lakewood, OH 44107-3748 · Lay Midwife
Bridgestone Tire Dealer Near Me
Productos para el Cuidado del Cabello Después de un Alisado: Tips y Consejos
Craigslist Free Stuff San Gabriel Valley
Wake County Court Records | NorthCarolinaCourtRecords.us
Rust Belt Revival Auctions
Gerber Federal Credit
Gyeon Jahee
Hypixel Skyblock Dyes
Arcane Odyssey Stat Reset Potion
Covalen hiring Ai Annotator - Dutch , Finnish, Japanese , Polish , Swedish in Dublin, County Dublin, Ireland | LinkedIn
Foolproof Module 6 Test Answers
Robeson County Mugshots 2022
Tillman Funeral Home Tallahassee
The All-New MyUMobile App - Support | U Mobile
Xxn Abbreviation List 2023
Gravel Racing
Craigslist - Pets for Sale or Adoption in Hawley, PA
Achieving and Maintaining 10% Body Fat
Kent And Pelczar Obituaries
Kb Home The Overlook At Medio Creek
Thotsbook Com
'The Night Agent' Star Luciane Buchanan's Dating Life Is a Mystery
Autozone Battery Hold Down
Plasma Donation Greensburg Pa
Shiftselect Carolinas
Gelato 47 Allbud
Latest Posts
Article information

Author: Msgr. Benton Quitzon

Last Updated:

Views: 5751

Rating: 4.2 / 5 (63 voted)

Reviews: 86% of readers found this page helpful

Author information

Name: Msgr. Benton Quitzon

Birthday: 2001-08-13

Address: 96487 Kris Cliff, Teresiafurt, WI 95201

Phone: +9418513585781

Job: Senior Designer

Hobby: Calligraphy, Rowing, Vacation, Geocaching, Web surfing, Electronics, Electronics

Introduction: My name is Msgr. Benton Quitzon, I am a comfortable, charming, thankful, happy, adventurous, handsome, precious person who loves writing and wants to share my knowledge and understanding with you.