clickhouse FULL JOIN 全连接遇到的坑

声明:只在我实际使用中碰到的一个问题

clickhouse官网文档中并没有找到full join 的信息,但是实际使用的时候是可以用 all full join 关联将两个不同接口的表通过某些字段关联成一张表。

问题:

我用两个子查询查出的结果用 all full join 关联,关联keyword_id和fr字段,但是两张表中除了这两个字段以外还有一个同名的account_id字段,遇到的问题是,第一个子查询得到的结果是44条,第二个子查询得到的结果是5条,能根据keyword_id 和fr关联上的数据只有两条。结果是47条数据,这个是没问题的,而有问题的是其中有3条是第二张表的与第一张没对应上的结果account_id变成空了。

如下:

clickhouse FULL JOIN 全连接遇到的坑

sql:

 
	SELECT

		*
		
	FROM
		(
			SELECT
				fr,
				keyword_id,
				account_id,
				sum(cost) AS ad_cost,
				sum(cost_real) AS ad_cost_real,
				sum(impression) AS ad_impression,
				sum(click) AS ad_click
			FROM
				marketing.sem_keyword_report
			WHERE
				1 = 1
			AND the_day >= '2018-12-12'
			AND the_day <= '2018-12-18'
			AND lower(fr) IN ('bd_sem')
			AND account_id IN ('18091503')
			GROUP BY
				account_id,
				keyword_id,
				fr
		) ALL
	FULL JOIN (
		SELECT
			fr,
			keyword_id,
		 
			account_id,
			clue_all,
			clue_all_new,
			c1_kpi_daily_new_customer_amount,
			c1_kpi_new_customer_amount,
			c2_kpi_daily_new_customer_amount,
			c2_kpi_new_customer_amount,
			c2c_c1_create,
			c2b_c1_create,
			c2c_c1_onsite,
			c2b_c1_onsite,
			c2c_c1_onsale,
			c2b_c1_onsale,
			c2c_c2_appoint,
			b2c_c2_appoint,
			ssss_c2_appoint,
			c2c_c2_finish_appoint,
			b2c_c2_finish_appoint,
			ssss_c2_finish_appoint,
			c2c_c2_order,
			b2c_c2_order,
			weighting_number,
			ssss_c2_order
		FROM
			(
				SELECT
					fr,
					kid AS keyword_id,
					sum(clue_all) AS clue_all,
					sum(clue_all_new) AS clue_all_new,
					sum(
						c1_kpi_daily_new_customer_amount
					) AS c1_kpi_daily_new_customer_amount,
					sum(c1_kpi_new_customer_amount) AS c1_kpi_new_customer_amount,
					sum(
						c2_kpi_daily_new_customer_amount
					) AS c2_kpi_daily_new_customer_amount,
					sum(c2_kpi_new_customer_amount) AS c2_kpi_new_customer_amount,
					sum(c2c_c1_create) AS c2c_c1_create,
					sum(c2b_c1_create) AS c2b_c1_create,
					sum(c2c_c1_onsite) AS c2c_c1_onsite,
					sum(c2b_c1_onsite) AS c2b_c1_onsite,
					sum(c2c_c1_onsale) AS c2c_c1_onsale,
					sum(c2b_c1_onsale) AS c2b_c1_onsale,
					sum(c2c_c2_appoint) AS c2c_c2_appoint,
					sum(b2c_c2_appoint) AS b2c_c2_appoint,
					sum(ssss_c2_appoint) AS ssss_c2_appoint,
					sum(c2c_c2_finish_appoint) AS c2c_c2_finish_appoint,
					sum(b2c_c2_finish_appoint) AS b2c_c2_finish_appoint,
					sum(ssss_c2_finish_appoint) AS ssss_c2_finish_appoint,
					sum(c2c_c2_order) AS c2c_c2_order,
					sum(b2c_c2_order) AS b2c_c2_order,
					sum(weighting_number) AS weighting_number,
					sum(ssss_c2_order) AS ssss_c2_order
				FROM
					marketing.market_kid_stat_new_v5
				WHERE
					1 = 1
				AND dims = 'clue'
				AND dts >= '2018-12-12'
				AND dts <= '2018-12-18'
				AND lower(fr) IN ('bd_sem')
				GROUP BY
					keyword_id,
					fr
			) ANY
		LEFT JOIN (
			SELECT
				fr,
				keyword_id,
				account_id
			FROM
				marketing.sem_keyword_type
		) USING keyword_id,
		fr
	WHERE
		1 = 1
	AND lower(fr) IN ('bd_sem')
	AND account_id IN ('18091503')
	) USING keyword_id,fr
	

继续排查:

将右边的account_id 起别名显示时 account_id as aid

结果如下: aid显示的是第二张表能查到的5条数据对应的account_id ,

clickhouse FULL JOIN 全连接遇到的坑

将第一张表与第二张表颠倒顺序,结果就是相反的。只有5条数据有accountid而其他42条没有account_id
数据字段是没有问题的,没有的为0,但是keyword_id和fr被补齐了,而account_id 却没有。
看了一下sql 可能是因为fr和keyword_id作为关联条件using了。那么尝试一下把account_id 也加入using
结果account_id被补全了
结果:

clickhouse FULL JOIN 全连接遇到的坑

 

结论:

虽然我不是很清楚到底是什么机制导致了上述的问题,也不明白为什么最后在using中加入那个字段,结果就被补全了

但是有一点儿我能明白的是clickhouse使用 full join 还是隐约跟左右表有关,并不是真正实现将左右两张表完全放到根据条件相同完全补全字段。同名字段可能会存在问题。

 

以上是我遇到的问题以及解决,仅作为自己工作记录

感谢观看