Merge 2 dataframes by text similarity using pandas
I run a query like the one below:
select *
from sd_sms LEFT JOIN categories_phrases
on sd_sms.body like concat('%',categories_phrases.phrase1,'%')
and sd_sms.body like concat('%',categories_phrases.phrase2,'%')
and sd_sms.body like concat('%',categories_phrases.phrase3,'%')
and sd_sms.body like concat('%',categories_phrases.phrase4,'%')
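Basically, it joins the two tables when a field in table A contains all of several phrases from table B, but now I need to do this in Python.
Is there a simple way to merge these two tables using pandas so that it gives me the same result?
Please advise.
This code example works with text data and a LIKE condition in the join clause.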
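from pandasql import sqldf
import pandas as pd

# Helper that runs a query against the dataframes defined at module level
pysqldf = lambda q: sqldf(q, globals())

df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age": [11, 12, 13]})
df2 = pd.DataFrame({"name": ['Antony', 'Gill', 'John']})

# SQLite concatenates strings with ||, so the pattern is built as '%' || df2.name || '%'
q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'"""
df = pysqldf(q)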
These are just dummy dataframes with sample data, but I applied a condition similar to the one in your question.
Hope it is useful.
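For readers who would rather stay in pandas without pandasql, the same kind of join can be sketched as a cross join followed by a substring filter. This is only a minimal sketch under assumptions: the sample rows and the `category` column are hypothetical, the column names `body` and `phrase1` to `phrase4` are taken from the SQL in the question, and `how="cross"` needs pandas 1.2 or later. The substring test here is case-sensitive, which may differ from how `LIKE` behaves in your database.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question's SQL.
sd_sms = pd.DataFrame({"body": [
    "your card payment of 20 USD was declined",
    "meeting moved to 3pm",
]})
categories_phrases = pd.DataFrame({
    "category": ["card_declined", "card_approved"],  # hypothetical label column
    "phrase1": ["card", "card"],
    "phrase2": ["payment", "payment"],
    "phrase3": ["declined", "approved"],
    "phrase4": ["USD", "USD"],
})
phrase_cols = ["phrase1", "phrase2", "phrase3", "phrase4"]

# Keep the original row position so unmatched messages can be restored later.
left = sd_sms.reset_index().rename(columns={"index": "sms_id"})

# Cross join: pair every message with every phrase row.
pairs = left.merge(categories_phrases, how="cross")

# A pair matches only if the body contains all four phrases (case-sensitive).
is_match = pairs.apply(
    lambda row: all(row[c] in row["body"] for c in phrase_cols), axis=1
)
matches = pairs.loc[is_match, ["sms_id", "category"] + phrase_cols]

# Left-merge back so messages without any match keep NaN, as LEFT JOIN would.
result = left.merge(matches, on="sms_id", how="left")
print(result[["body", "category"]])
```

The cross join builds len(sd_sms) × len(categories_phrases) rows, so for large tables it may be worth filtering in chunks or sticking with the SQL approach.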
A nice example, it almost works. I tried to play with it and modified the code as follows:
from pandasql import *
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age": [11, 12, 13]})
df2 = pd.DataFrame({"name": ['Anto', 'Mark Gill', 'John']})
q = """SELECT df1.name, df1.age, df2.name FROM df1 LEFT JOIN df2 ON df1.name LIKE '%{}%'"""
q.format(df2.name)
df = pysqldf(q)
df
The last column shows None. Can you help improve it?
pandasql uses SQLite syntax, so you have to concatenate the strings in the LIKE clause: q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'""". I have edited the answer above.
Thanks for the answer above.
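The point made in the comments can be checked without pandasql, using Python's built-in sqlite3 module directly. The sketch below is an illustration only: it assumes two in-memory tables named df1 and df2 holding the sample values from the comment, and builds the LIKE pattern with SQLite's `||` operator. Note also that `str.format` returns a new string rather than changing `q` in place, so the `'%{}%'` pattern in the modified code was sent to SQLite unchanged.

```python
import sqlite3

# In-memory SQLite database standing in for the two dataframes (assumed setup).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df1 (name TEXT, age INTEGER)")
conn.execute("CREATE TABLE df2 (name TEXT)")
conn.executemany(
    "INSERT INTO df1 VALUES (?, ?)",
    [("Antony", 11), ("Mark", 12), ("Jacob", 13)],
)
conn.executemany(
    "INSERT INTO df2 VALUES (?)",
    [("Anto",), ("Mark Gill",), ("John",)],
)

# The pattern is assembled per row inside SQL: '%' || df2.name || '%'.
q = """
SELECT df1.name, df1.age, df2.name
FROM df1
LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'
ORDER BY df1.age
"""
rows = conn.execute(q).fetchall()
for row in rows:
    print(row)
```

Only 'Antony' contains one of the df2 values ('Anto'), so it is the only row with a non-NULL third column; 'Mark' does not contain 'Mark Gill', because the match runs from df1.name to df2.name and not the other way round.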
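I don't understand what kind of data you have, because your question is missing some sample data; but if you need to query pandas dataframes with SQL-like syntax, you can try the pandasql package. It is built on SQLAlchemy and runs queries against SQLite by default.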
from pandasql import *
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
df
                           date    beef    births
0    1975-01-01 00:00:00.000000  2106.0  265775.0
1    1975-02-01 00:00:00.000000  1845.0  241045.0
2    1975-03-01 00:00:00.000000  1891.0  268849.0
3    1975-04-01 00:00:00.000000  1895.0  247455.0
4    1975-05-01 00:00:00.000000  1849.0  254545.0
5    1975-06-01 00:00:00.000000  1849.0  254096.0
6    1975-07-01 00:00:00.000000  1916.0  275163.0
7    1975-08-01 00:00:00.000000  1961.0  281300.0
8    1975-09-01 00:00:00.000000  2065.0  270738.0
9    1975-10-01 00:00:00.000000  2270.0  265494.0
10   1975-11-01 00:00:00.000000  1970.0  251973.0
11   1975-12-01 00:00:00.000000  2055.0  260532.0
12   1976-01-01 00:00:00.000000  2208.0  257455.0
13   1976-01-01 00:00:00.000000  2208.0  259173.0
14   1976-02-01 00:00:00.000000  1966.0  236551.0
15   1976-02-01 00:00:00.000000  1966.0  238153.0
16   1976-03-01 00:00:00.000000  2318.0  257951.0
17   1976-03-01 00:00:00.000000  2318.0  261608.0
18   1976-04-01 00:00:00.000000  2015.0  246469.0
19   1976-04-01 00:00:00.000000  2015.0  250992.0
20   1976-05-01 00:00:00.000000  1969.0  256986.0
21   1976-05-01 00:00:00.000000  1969.0  261572.0
22   1976-06-01 00:00:00.000000  2161.0  250525.0
23   1976-06-01 00:00:00.000000  2161.0  255734.0
24   1976-07-01 00:00:00.000000  2111.0  279630.0
25   1976-07-01 00:00:00.000000  2111.0  279744.0
26   1976-08-01 00:00:00.000000  2233.0  279937.0
27   1976-08-01 00:00:00.000000  2233.0  286496.0
28   1976-09-01 00:00:00.000000  2274.0  273750.0
29   1976-09-01 00:00:00.000000  2274.0  283718.0
..                          ...     ...       ...
533  2010-06-01 00:00:00.000000  2320.0       NaN
534  2010-07-01 00:00:00.000000  2229.6       NaN
535  2010-08-01 00:00:00.000000  2286.6       NaN
536  2010-09-01 00:00:00.000000  2252.2       NaN
537  2010-10-01 00:00:00.000000  2234.9       NaN
538  2010-11-01 00:00:00.000000  2235.5       NaN
539  2010-12-01 00:00:00.000000  2270.9       NaN
540  2011-01-01 00:00:00.000000  2122.9  356457.0
541  2011-02-01 00:00:00.000000  2020.4  338521.0
542  2011-03-01 00:00:00.000000  2266.2  350630.0
543  2011-04-01 00:00:00.000000  2052.5  346397.0
544  2011-05-01 00:00:00.000000  2131.9  354886.0
545  2011-06-01 00:00:00.000000  2375.0  348587.0
546  2011-07-01 00:00:00.000000  2134.1  375384.0
547  2011-08-01 00:00:00.000000  2386.9  373333.0
548  2011-09-01 00:00:00.000000  2215.2  367965.0
549  2011-10-01 00:00:00.000000  2215.1  357875.0
550  2011-11-01 00:00:00.000000  2148.8  323788.0
551  2011-12-01 00:00:00.000000  2126.3  353871.0
552  2012-01-01 00:00:00.000000  2113.8  337980.0
553  2012-02-01 00:00:00.000000  2009.0  316641.0
554  2012-03-01 00:00:00.000000  2159.8  347803.0
555  2012-04-01 00:00:00.000000  1990.6  337272.0
556  2012-05-01 00:00:00.000000  2232.0  345257.0
557  2012-06-01 00:00:00.000000  2252.1  346971.0
558  2012-07-01 00:00:00.000000  2200.8  368450.0
559  2012-08-01 00:00:00.000000  2367.5  359554.0
560  2012-09-01 00:00:00.000000  2016.0  361922.0
561  2012-10-01 00:00:00.000000  2343.7  347625.0
562  2012-11-01 00:00:00.000000  2206.6  320195.0
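The repo is here: https://github.com/yhat/pandasql and there is a nice quick-start tutorial at http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html
My data is text, and I want to join the two dataframes using something like the SQL LIKE clause. Can that be done?
Does this support the SQL LIKE clause? My boss tried this, but it didn't work.
I just posted the LIKE clause.
Can you provide some sample data and your desired output? – Allen
You can download sample data from this link: https://drive.google.com/file/d/0B9sctdRURN0PSXk2ZUxGMU9JdU0/view?usp=sharing
Basically I need something like https://blog.ouseful.info/2012/09/26/merge-data-sets-based-partially-matched-data-elements/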
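If what is needed is partial matching in the looser sense of that last link, rather than strict substring containment, one possible direction is fuzzy matching with Python's standard difflib module. This is not the technique used in the answers above, only a hedged sketch: the helper name `best_match` and the 0.6 cutoff are assumptions chosen for illustration, and the names are the dummy values from the comments.

```python
import difflib

left_names = ["Antony", "Mark", "Jacob"]
right_names = ["Anto", "Mark Gill", "John"]

def best_match(name, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if nothing reaches the cutoff.

    Hypothetical helper: similarity is difflib's SequenceMatcher ratio,
    and the cutoff value is an arbitrary choice for this sketch.
    """
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Map each left-hand name to its closest right-hand name (or None).
matched = {name: best_match(name, right_names) for name in left_names}
for name, partner in matched.items():
    print(name, "->", partner)
```

The resulting mapping could then be added as a key column on the left dataframe and used with an ordinary pandas merge. The cutoff would need tuning on real data, since too low a value pairs unrelated strings.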