Complex conditions with chained joins

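Question:

I need to update a data frame based on values in two other data frames that are linked to the first one in a chain.

The target data frame t_offices has 4 fields of interest here: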

       administrative_area_level_1 administrative_area_level_2       country     locality
1                          Arizona             Maricopa County United States      Phoenix
2             District of Columbia                        <NA> United States   Washington
3                             <NA>                        <NA>         India         <NA>
4                         New York               Albany County United States       Albany
5                          Utrecht                  Nieuwegein   Netherlands   Nieuwegein
6                      Connecticut            Fairfield County United States     Stamford
707                       Illinois                        <NA> United States         <NA>
4241                      Illinois                        <NA> United States West Chicago
999998                     Alabama                        <NA> United States      Altoona
999999                Pennsylvania                        <NA> United States   Washington

For U.S. records, I need to replace the NA values in administrative_area_level_2 with the county. The county is found in the data frame t_places:

      state_ab          place_name                  county_name place_nameshort
1           AL          Abanda CDP              Chambers County          Abanda
2           AL      Abbeville city                 Henry County       Abbeville
3           AL     Adamsville city             Jefferson County      Adamsville
4           AL        Addison town               Winston County         Addison
5           AL          Akron town                  Hale County           Akron
6           AL      Alabaster city                Shelby County       Alabaster
12          AL        Altoona town Blount County, Etowah County         Altoona
4298        DC     Washington city         District of Columbia      Washington
7527        IL   West Chicago city                DuPage County    West Chicago
32611       PA Washington township             Armstrong County      Washington
32612       PA Washington township                 Berks County      Washington

place_nameshort is a truncated version of place_name without the designation (e.g. "city", "town", etc.).

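I join t_offices and t_places on state and place to get the correct county. This may return multiple counties, 1) because county_name may contain several comma-separated counties, and 2) because the truncated place_nameshort may return homonyms within the same state. I need only the cases where the county is unambiguous (a single county is returned).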
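The two sources of ambiguity described above can be flagged separately. The following is only a minimal base R sketch on toy rows modelled on the sample; the data frame `places` and the helper vectors `multi_county`, `key` and `homonym` are hypothetical names, not part of the original data.

```r
# Toy stand-in for t_places; the rows are illustrative, not the real data.
places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  stringsAsFactors = FALSE
)

# Ambiguity 1: county_name lists more than one county.
multi_county <- grepl(",", places$county_name, fixed = TRUE)

# Ambiguity 2: the same (state_ab, place_nameshort) key occurs on several rows.
key <- paste(places$state_ab, places$place_nameshort, sep = "|")
homonym <- key %in% key[duplicated(key)]

# Keep only the rows that resolve to exactly one county.
unambiguous <- places[!multi_county & !homonym, ]
print(unambiguous)
```

Note that this sketch looks for homonyms across all rows, including multi-county ones, which is slightly stricter than filtering out multi-county rows first.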

Since t_places only contains state_ab, I need a third data frame, r_states, to get state_name:

   state_ab           state_name
1        AL              Alabama
2        AK               Alaska
3        AZ              Arizona
4        AR             Arkansas
5        CA           California
6        CO             Colorado
9        DC District of Columbia
17       IL             Illinois
42       PA         Pennsylvania

By joining t_places to r_states on state_ab, I can get the state_name to match against t_offices$administrative_area_level_1.
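As a rough illustration of that lookup step, a base R `merge` on the shared state_ab column attaches the full state name to each place. The data frames `places` and `states` below are toy stand-ins with hypothetical names and only a few illustrative rows.

```r
# Toy stand-ins for t_places and r_states (illustrative rows only).
places <- data.frame(
  state_ab        = c("AL", "IL", "PA"),
  place_nameshort = c("Altoona", "West Chicago", "Washington"),
  stringsAsFactors = FALSE
)
states <- data.frame(
  state_ab   = c("AL", "IL", "PA"),
  state_name = c("Alabama", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# Inner join on the state abbreviation; each place gains its state_name,
# which can then be compared with administrative_area_level_1.
places_named <- merge(places, states, by = "state_ab")
print(places_named)
```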

Here is my attempt. It is incomplete, because it does not handle multiple counties caused by homonyms within a state, and it does not work.

no_county <- (!is.na(t_offices$country) 
      & t_offices$country == "United States" 
      & !is.na(t_offices$administrative_area_level_1) 
      & is.na(t_offices$administrative_area_level_2) 
      & !is.na(t_offices$locality)) 

t_offices$administrative_area_level_2[no_county] <- 
    t_places$county_name[!grepl(",", t_places$county_name) 
         & match(t_places$place_nameshort, t_offices$locality[no_county]) 
         & match(t_places$state_ab, 
           r_states$state_ab[match(r_states$state_name, 
                 t_offices$administrative_area_level_1[no_county])])]

Edit: following @r2evans's comment, this is my new attempt at the code, which still does not work correctly:

# split multiple counties into columns 
library(splitstackshape) 
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = F, type.convert = F) 

# merge state names into places 
places_statename <- merge(t_places, r_states[,2:3]) 

# define condition to select t_offices records in U.S. with state and place but no county 
no_county <- (
    # country is U.S. 
    !is.na(t_offices$country) 
    & t_offices$country == "United States" 
    # with state 
    & !is.na(t_offices$administrative_area_level_1) 
    # blank county 
    & is.na(t_offices$administrative_area_level_2) 
    # with place 
    & !is.na(t_offices$locality)) 

# update blank counties 
t_offices$administrative_area_level_2[no_county] <- 
    # unambiguous counties 
    places_statename$county_name_1[is.na(places_statename$county_name_2) 
           # locality matches place 
           & match(t_offices$locality[no_county], places_statename$place_nameshort) 
           # administrative_area_level_1 matches state 
           & match(t_offices$administrative_area_level_1[no_county],places_statename$state_name)]

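I suggest you reshape your data in favor of direct joins (via 'merge' or 'dplyr::left_join' and friends). That makes everything easier, more robust, and easier to work with and troubleshoot. For a start: if 'county_name' can contain multiple comma-separated values, split them with 'tidyr::separate' and 'tidyr::gather' (so the join is more intuitive/simple). It would also help to make the question reproducible; right now, we don't have representative data that meets all of your requirements. – r2evans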

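@r2evans Thanks for your suggestions! I have added (real and made-up) sample data to make the question reproducible. Is your first suggestion that I should merge t_places and r_states, melt the county names into one table, and then join that table with t_offices? – syre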

@r2evans I won't melt, but I will convert to multiple columns – syre
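For reference, the direct-join idea from the comments could look roughly like the sketch below. It uses base R only (`strsplit`, `aggregate`, `merge`) in place of the tidyr/dplyr functions named in the comment, and it runs on toy rows; `offices`, `places`, `states`, `places_long`, `lookup` and `row_id` are all hypothetical names chosen for illustration.

```r
# Toy stand-ins for t_offices, t_places and r_states (illustrative rows only).
offices <- data.frame(
  administrative_area_level_1 = c("District of Columbia", "Illinois",
                                  "Alabama", "Pennsylvania"),
  administrative_area_level_2 = NA_character_,
  locality = c("Washington", "West Chicago", "Altoona", "Washington"),
  stringsAsFactors = FALSE
)
places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  stringsAsFactors = FALSE
)
states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# Long format: one row per (state, place, county).
county_list <- strsplit(places$county_name, ", ", fixed = TRUE)
places_long <- unique(data.frame(
  state_ab        = rep(places$state_ab, lengths(county_list)),
  place_nameshort = rep(places$place_nameshort, lengths(county_list)),
  county          = unlist(county_list),
  stringsAsFactors = FALSE
))

# Keep only (state, place) keys that map to exactly one county.
n_counties <- aggregate(county ~ state_ab + place_nameshort,
                        data = places_long, FUN = length)
names(n_counties)[3] <- "n"
single <- merge(places_long,
                n_counties[n_counties$n == 1, c("state_ab", "place_nameshort")])

# Attach state names, then left-join onto offices (row_id restores the order).
lookup <- merge(single, states, by = "state_ab")
offices$row_id <- seq_len(nrow(offices))
joined <- merge(offices, lookup,
                by.x = c("administrative_area_level_1", "locality"),
                by.y = c("state_name", "place_nameshort"),
                all.x = TRUE)
joined <- joined[order(joined$row_id), ]

# Fill only the missing counties.
fill <- is.na(joined$administrative_area_level_2)
joined$administrative_area_level_2[fill] <- joined$county[fill]
print(joined[, c("administrative_area_level_1", "administrative_area_level_2",
                 "locality")])
```

In this toy run the two-county Altoona and the homonymous Washington in Pennsylvania stay NA, while the other two rows receive a county.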

Answer

Here is my long solution. There may well be a shorter, more elegant one.

# load both packages up front (cSplit returns a data.table)
library(splitstackshape)
library(data.table)

# split multiple counties into columns county_name_1, county_name_2, ...
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE)
# subset original places with single county
# (columns selected by name; assumed to be the ones the index c(1,8,9) referred to)
places_singlecounty <- t_places[is.na(county_name_2),
          .(state_ab, place_nameshort, county_name_1)]
# subset truncated places with single county (name occurs once within its state)
places_singlecounty <- merge(places_singlecounty,
          places_singlecounty[, .N, by = c("state_ab", "place_nameshort")][N == 1,
              .(state_ab, place_nameshort)])
# merge state names into single-county truncated places
places_statename <- merge(places_singlecounty,
          r_states[, c("state_ab", "state_name")], by = "state_ab")

# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
    # country is U.S.
    !is.na(t_offices$country)
    & t_offices$country == "United States"
    # with state
    & !is.na(t_offices$administrative_area_level_1)
    # NA county
    & is.na(t_offices$administrative_area_level_2)
    # with place
    & !is.na(t_offices$locality))

# update t_offices NA counties based on single-county truncated places
# (update join: for each selected row in .SD, look up its county; no match gives NA)
setDT(t_offices)
t_offices[no_county, administrative_area_level_2 :=
      places_statename[.SD, county_name_1,
          on = c(state_name = "administrative_area_level_1",
            place_nameshort = "locality")]]
