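Complex conditions for chained joins
Problem description:
I need to update a data frame based on values in two other data frames that are linked to the first one in a chain.
The target data frame is t_offices. Four of its fields are of interest here: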
        administrative_area_level_1  administrative_area_level_2  country        locality
1       Arizona                      Maricopa County              United States  Phoenix
2       District of Columbia         <NA>                         United States  Washington
3       <NA>                         <NA>                         India          <NA>
4       New York                     Albany County                United States  Albany
5       Utrecht                      Nieuwegein                   Netherlands    Nieuwegein
6       Connecticut                  Fairfield County             United States  Stamford
707     Illinois                     <NA>                         United States  <NA>
4241    Illinois                     <NA>                         United States  West Chicago
999998  Alabama                      <NA>                         United States  Altoona
999999  Pennsylvania                 <NA>                         United States  Washington
For U.S. records, I need to replace the NA values of administrative_area_level_2 with the county. The county is found in the data frame t_places:
       state_ab  place_name           county_name                   place_nameshort
1      AL        Abanda CDP           Chambers County               Abanda
2      AL        Abbeville city       Henry County                  Abbeville
3      AL        Adamsville city      Jefferson County              Adamsville
4      AL        Addison town         Winston County                Addison
5      AL        Akron town           Hale County                   Akron
6      AL        Alabaster city       Shelby County                 Alabaster
12     AL        Altoona town         Blount County, Etowah County  Altoona
4298   DC        Washington city      District of Columbia          Washington
7527   IL        West Chicago city    DuPage County                 West Chicago
32611  PA        Washington township  Armstrong County              Washington
32612  PA        Washington township  Berks County                  Washington
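place_nameshort is a truncated version of place_name without the designation (e.g. "city", "town", etc.).
I join t_offices and t_places on state and locality to get the right county. This can return multiple counties, 1) because county_name may contain several comma-separated counties, and 2) because the truncated place_nameshort may match homonyms within the same state. I only need the cases where the county is unambiguous (a single county is returned).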
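The two sources of ambiguity described above can be flagged separately before any join. The following is only a minimal base R sketch on a toy subset modelled on the sample rows; the helper names multi_county, key, homonym and unambiguous are made up for illustration and are not part of the original data.

```r
# Toy subset modelled on the sample rows of t_places (an assumption, not the real table)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  place_name      = c("Altoona town", "Washington city", "West Chicago city",
                      "Washington township", "Washington township"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  stringsAsFactors = FALSE
)

# Case 1: several comma-separated counties listed for one place
multi_county <- grepl(",", t_places$county_name, fixed = TRUE)

# Case 2: the same truncated name occurs more than once within a state
key <- paste(t_places$state_ab, t_places$place_nameshort, sep = "|")
homonym <- key %in% key[duplicated(key)]

# Keep only places whose county is unambiguous
unambiguous <- t_places[!multi_county & !homonym, ]
```

In this sketch homonyms are detected across all rows, so a place is dropped even if only one of its namesakes has a single county. That is a more conservative choice than filtering multi-county rows first and counting afterwards.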
Since t_places only contains state_ab, I need a third data frame, r_states, for state_name:
    state_ab  state_name
1   AL        Alabama
2   AK        Alaska
3   AZ        Arizona
4   AR        Arkansas
5   CA        California
6   CO        Colorado
9   DC        District of Columbia
17  IL        Illinois
42  PA        Pennsylvania
By joining t_places with r_states on state_ab, I can get the state_name to match against t_offices$administrative_area_level_1.
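As a rough illustration of this first link in the chain, base R's merge() can attach state_name to each place. The two small data frames below are toy stand-ins built from the sample rows, not the real tables.

```r
# Toy stand-ins for r_states and t_places (assumed shapes, reduced to the needed columns)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)
t_places <- data.frame(
  state_ab        = c("DC", "IL"),
  place_nameshort = c("Washington", "West Chicago"),
  county_name     = c("District of Columbia", "DuPage County"),
  stringsAsFactors = FALSE
)

# Inner join on the state abbreviation; the result gains a state_name column
places_statename <- merge(t_places, r_states, by = "state_ab")
```

Naming the key with by = "state_ab" avoids relying on column positions, which differ between the sample and the full tables.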
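Here is my attempt. It is incomplete, because it does not handle multiple counties caused by homonyms within a state, and it does not work in any case.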
no_county <- (!is.na(t_offices$country)
              & t_offices$country == "United States"
              & !is.na(t_offices$administrative_area_level_1)
              & is.na(t_offices$administrative_area_level_2)
              & !is.na(t_offices$locality))
t_offices$administrative_area_level_2[no_county] <-
  t_places$county_name[!grepl(",", t_places$county_name)
                       & match(t_places$place_nameshort, t_offices$locality[no_county])
                       & match(t_places$state_ab,
                               r_states$state_ab[match(r_states$state_name,
                                                       t_offices$administrative_area_level_1[no_county])])]
Edit: following @r2evans's comment, here is my new attempt, which still does not work properly:
# split multiple counties into columns
library(splitstackshape)
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = F, type.convert = F)
# merge state names into places
places_statename <- merge(t_places, r_states[,2:3])
# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
  # country is U.S.
  !is.na(t_offices$country)
  & t_offices$country == "United States"
  # with state
  & !is.na(t_offices$administrative_area_level_1)
  # blank county
  & is.na(t_offices$administrative_area_level_2)
  # with place
  & !is.na(t_offices$locality))
# update blank counties
t_offices$administrative_area_level_2[no_county] <-
  # unambiguous counties
  places_statename$county_name_1[is.na(places_statename$county_name_2)
                                 # locality matches place
                                 & match(t_offices$locality[no_county], places_statename$place_nameshort)
                                 # administrative_area_level_1 matches state
                                 & match(t_offices$administrative_area_level_1[no_county], places_statename$state_name)]
Answer
Here is my long solution. A shorter, more elegant one probably exists.
library(splitstackshape)
library(data.table)
# split multiple counties into columns (county_name_1, county_name_2, ...)
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE)
# subset original places with single county
# (columns picked by name instead of positions 1, 8, 9, assuming those positions
#  are state_ab, place_nameshort and county_name_1 in the full table)
places_singlecounty <- t_places[is.na(county_name_2),
                                .(state_ab, place_nameshort, county_name_1)]
# subset truncated places with single county:
# keep only (state_ab, place_nameshort) pairs that occur exactly once
setDT(places_singlecounty)
places_singlecounty <- merge(places_singlecounty,
                             places_singlecounty[, .N, by = c("state_ab", "place_nameshort")][N == 1, 1:2],
                             by = c("state_ab", "place_nameshort"))
# merge state names into single-county truncated places
places_statename <- merge(places_singlecounty,
                          r_states[, c("state_ab", "state_name")], by = "state_ab")
# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
  # country is U.S.
  !is.na(t_offices$country)
  & t_offices$country == "United States"
  # with state
  & !is.na(t_offices$administrative_area_level_1)
  # NA county
  & is.na(t_offices$administrative_area_level_2)
  # with place
  & !is.na(t_offices$locality))
# update t_offices NA counties based on single-county truncated places
setDT(t_offices)
t_offices[no_county, administrative_area_level_2 :=
            places_statename[.(.SD), county_name_1,
                             on = c(state_name = "administrative_area_level_1",
                                    place_nameshort = "locality")]]
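For comparison, the final update step can also be sketched in base R by matching on a composite key. This is not the method used above, only a rough equivalent under stated assumptions: lookup stands for a table already reduced to unambiguous places with state names, and lookup, office_key, lookup_key and hit are hypothetical names.

```r
# Toy offices modelled on the sample rows; every county starts as NA
t_offices <- data.frame(
  administrative_area_level_1 = c("Illinois", "Illinois", "Alabama",
                                  "Pennsylvania", "District of Columbia"),
  administrative_area_level_2 = NA_character_,
  country                     = "United States",
  locality                    = c(NA, "West Chicago", "Altoona",
                                  "Washington", "Washington"),
  stringsAsFactors = FALSE
)

# Assumed lookup: unambiguous places only, with state names already merged in
lookup <- data.frame(
  state_name      = c("District of Columbia", "Illinois"),
  place_nameshort = c("Washington", "West Chicago"),
  county_name     = c("District of Columbia", "DuPage County"),
  stringsAsFactors = FALSE
)

# U.S. records with state and locality but no county
no_county <- !is.na(t_offices$country) &
  t_offices$country == "United States" &
  !is.na(t_offices$administrative_area_level_1) &
  is.na(t_offices$administrative_area_level_2) &
  !is.na(t_offices$locality)

# match() returns row positions in lookup (NA when there is no match),
# so it is used as an index here rather than as a logical condition
office_key <- paste(t_offices$administrative_area_level_1, t_offices$locality, sep = "|")
lookup_key <- paste(lookup$state_name, lookup$place_nameshort, sep = "|")
hit <- match(office_key[no_county], lookup_key)
t_offices$administrative_area_level_2[no_county] <- lookup$county_name[hit]
```

Offices whose place is ambiguous or absent from the lookup simply keep NA, because indexing with an NA position yields NA.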
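I'd suggest you reshape your data in favor of direct joins (via `merge` or `dplyr::left_join` and friends). That makes everything easier, more robust, and easier to work with and troubleshoot. To start with: if `county_name` can contain multiple comma-separated values, you can split them with `tidyr::separate` and `tidyr::gather`, so that the join becomes more intuitive and simpler. It would also help to make the question reproducible; right now, we don't have representative data that meets all your requirements. – r2evans
@r2evans Thanks for your suggestions! I have added (real and made-up) sample data to make the question reproducible. Regarding your first suggestion: should I merge t_places and r_states, melt the county names into one table, and then join that table with t_offices? – syre
@r2evans Not melting, but converting into multiple columns. – syre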