How do I convert an IndexedTable to a DataFrame in Julia?

Question:

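In a quick exploratory test, IndexedTables seemed much faster than DataFrames for working on individual elements (e.g. selecting or "updating"), but DataFrames has a richer ecosystem of functionality, e.g. plotting and exporting.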

So, at some point in the workflow, I would like to convert the IndexedTable to a DataFrame, e.g. from

using DataFrames, IndexedTables, IndexedTables.Table 

tn = Table(
    Columns(
     param = String["price","price","price","price","waterContent","waterContent"], 
     item = String["banana","banana","apple","apple","banana", "apple"], 
     region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
     value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
    ) 
) 

to >>

df_tn = DataFrame(
    param  = String["price","price","price","price","waterContent","waterContent"], 
    item  = String["banana","banana","apple","apple","banana", "apple"], 
    region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA], 
    value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
) 

t = Table(
    Columns(
     String["price","price","price","price","waterContent","waterContent"], 
     String["banana","banana","apple","apple","banana", "apple"], 
     Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
     Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
    ) 
) 

to >>

df_t = DataFrame(
    x1 = String["price","price","price","price","waterContent","waterContent"], 
    x2 = String["banana","banana","apple","apple","banana", "apple"], 
    x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA], 
    x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8] 
) 

I can get the individual "row" values by iterating over the table with pairs():

for (i,pair) in enumerate(pairs(tn)) 
    rowValues = [] 
    for (j,section) in enumerate(pair) 
     for item in section 
      push!(rowValues,item) 
     end 
    end 
    println(rowValues) 
end 

However, I can't get the column names and types, and I suspect that working column by column would be more efficient.

EDIT: With the code below I managed to get the column types; now I only need the column names, if there are any:

colTypes = Union{Union,DataType}[] 

for item in tn.index.columns 
    push!(colTypes, eltype(item)) 
end 
for item in tn.data.columns 
    push!(colTypes, eltype(item)) 
end 

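EDIT2: As requested, here is an example of an IndexedTable for which the (current) answer by Dan Getz fails to convert the column names, because the "index" columns are a named tuple while the "data" columns are a plain tuple: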

t_named_idx = Table(
    Columns(
     param = String["price","price","price","price","waterContent","waterContent"], 
     item = String["banana","banana","apple","apple","banana", "apple"], 
     region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    ) 
) 

The problem seems to lie in the IndexedTable API, specifically in the columns(t) function, which does not distinguish between index and values.

The following conversion functions:

toDataFrame(cols::Tuple, prefix="x") = 
    DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...) 

toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") = 
    DataFrame(;(c => cols[c] for c in fieldnames(cols))...) 

toDataFrame(t::IndexedTable) = toDataFrame(columns(t)) 

give (on Julia 0.6, with tn and t defined as in the question):

julia> tn 
param   item  region │ value2000 value2010 
─────────────────────────────────┼───────────────────── 
"price"   "apple" "FR" │ 1.1  1.2 
"price"   "apple" "UK" │ 0.8  0.8 
"price"   "banana" "FR" │ 2.8  3.2 
"price"   "banana" "UK" │ 2.7  2.9 
"waterContent" "apple" NA  │ 0.7  0.8 
"waterContent" "banana" NA  │ 0.2  0.2 

julia> df_tn = toDataFrame(tn) 
6×5 DataFrames.DataFrame 
│ Row │ param   │ item  │ region │ value2000 │ value2010 │ 
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤ 
│ 1 │ "price"  │ "apple" │ "FR" │ 1.1  │ 1.2  │ 
│ 2 │ "price"  │ "apple" │ "UK" │ 0.8  │ 0.8  │ 
│ 3 │ "price"  │ "banana" │ "FR" │ 2.8  │ 3.2  │ 
│ 4 │ "price"  │ "banana" │ "UK" │ 2.7  │ 2.9  │ 
│ 5 │ "waterContent" │ "apple" │ NA  │ 0.7  │ 0.8  │ 
│ 6 │ "waterContent" │ "banana" │ NA  │ 0.2  │ 0.2  │ 

The type information is mostly retained:

julia> typeof(df_tn[:,1]) 
DataArrays.DataArray{String,1} 

julia> typeof(df_tn[:,4]) 
DataArrays.DataArray{Float64,1} 

And for the unnamed columns:

julia> t 
───────────────────────────────┬───────── 
"price"   "apple" "FR" │ 1.1 1.2 
"price"   "apple" "UK" │ 0.8 0.8 
"price"   "banana" "FR" │ 2.8 3.2 
"price"   "banana" "UK" │ 2.7 2.9 
"waterContent" "apple" NA │ 0.7 0.8 
"waterContent" "banana" NA │ 0.2 0.2 

julia> df_t = toDataFrame(t) 
6×5 DataFrames.DataFrame 
│ Row │ x1    │ x2  │ x3 │ x4 │ x5 │ 
├─────┼────────────────┼──────────┼──────┼─────┼─────┤ 
│ 1 │ "price"  │ "apple" │ "FR" │ 1.1 │ 1.2 │ 
│ 2 │ "price"  │ "apple" │ "UK" │ 0.8 │ 0.8 │ 
│ 3 │ "price"  │ "banana" │ "FR" │ 2.8 │ 3.2 │ 
│ 4 │ "price"  │ "banana" │ "UK" │ 2.7 │ 2.9 │ 
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │ 
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │ 

EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:

toDataFrame(t::IndexedTable) = 
    hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t)))) 

and then the mixed case gives a result like:

julia> toDataFrame(tn2) 
6×5 DataFrames.DataFrame 
│ Row │ param   │ item  │ region │ x1 │ x2 │ 
├─────┼────────────────┼──────────┼────────┼─────┼─────┤ 
│ 1 │ "price"  │ "apple" │ "FR" │ 1.1 │ 1.2 │ 
│ 2 │ "price"  │ "apple" │ "UK" │ 0.8 │ 0.8 │ 
│ 3 │ "price"  │ "banana" │ "FR" │ 2.8 │ 3.2 │ 
│ 4 │ "price"  │ "banana" │ "UK" │ 2.7 │ 2.9 │ 
│ 5 │ "waterContent" │ "apple" │ NA  │ 0.7 │ 0.8 │ 
│ 6 │ "waterContent" │ "banana" │ NA  │ 0.2 │ 0.2 │ 
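The naming rule behind the mixed case can be shown in isolation. The following is only a sketch under stated assumptions: it uses Base Julia 1.x (where NamedTuple is built in) rather than the Julia 0.6 / NamedTuples.jl setup of this answer, and `column_names` and `merged_names` are hypothetical helper names that do not belong to IndexedTables or DataFrames. It mirrors the idea of naming the index part and the data part independently, with the "y" and "x" prefixes used above.

```julia
# Sketch for Julia 1.x, Base only. `column_names` and `merged_names`
# are hypothetical helpers, not part of IndexedTables or DataFrames.

# Unnamed columns (plain Tuple): generate prefix1, prefix2, ...
column_names(cols::Tuple, prefix="x") = [Symbol(prefix, i) for i in 1:length(cols)]

# Named columns (NamedTuple): keep the existing names
column_names(cols::NamedTuple, prefix="x") = collect(keys(cols))

# Name the index part and the data part independently, then concatenate
merged_names(index_cols, data_cols) =
    vcat(column_names(index_cols, "y"), column_names(data_cols, "x"))

# Mixed case: named index columns, unnamed data columns
index_cols = (param = ["price", "price"], item = ["banana", "apple"])
data_cols  = ([2.8, 1.1], [3.2, 1.2])

println(merged_names(index_cols, data_cols))
```

Because each part is dispatched on separately, a named index keeps its names even when the data columns fall back to generated ones.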

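It is faster than my solution (microseconds, even for large datasets) and clearly much more elegant, but when only one of {index, data} is a NamedTuple, all columns get converted to xi names. In general, this answer shows that it is better to look at a module's API than to dig into an object's internals. – Antonello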


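I noticed the problem with mixed named and unnamed tuples; in fact, the earlier iterating solution handles it fine. This solution can also be tweaked to handle it. I'll take a look. –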


Added an edit to the answer that handles the mixed case. –

An ugly, quick and dirty "solution" (I hope this can be done in some other way):

julia> df = DataFrame(
     permutedims( # <- structural transpose 
      vcat(
      reshape([j for i in keys(t) for j in i], :, length(t)) , 
      reshape([j for i in t  for j in i], :, length(t)) 
      ), 
      (2,1) 
     ) 
     ) 
6×5 DataFrames.DataFrame 
│ Row │ x1    │ x2  │ x3 │ x4 │ x5 │ 
├─────┼────────────────┼──────────┼──────┼─────┼─────┤ 
│ 1 │ "price"  │ "apple" │ "FR" │ 1.1 │ 1.2 │ 
│ 2 │ "price"  │ "apple" │ "UK" │ 0.8 │ 0.8 │ 
│ 3 │ "price"  │ "banana" │ "FR" │ 2.8 │ 3.2 │ 
│ 4 │ "price"  │ "banana" │ "UK" │ 2.7 │ 2.9 │ 
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │ 
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │ 

Thank you.. I am learning from it.. As it stands, the converted df is made up of 'Any' columns and it does not keep the column names, if any.. – Antonello
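Why the columns end up as 'Any' can be reproduced without any table package. This is a minimal sketch in Base Julia 1.x (an assumption; the thread itself is on Julia 0.6), with made-up two-row data: flattening rows of mixed types into one array forces a common element type, while extracting each column separately keeps its concrete type.

```julia
# Sketch for Julia 1.x, Base only; the data is illustrative.
rows = [("price", 2.8), ("price", 1.1)]

# Row-wise flattening mixes String and Float64 in one array,
# so the element type widens to Any
flat = [v for r in rows for v in r]
println(eltype(flat))

# Column-wise extraction keeps one concrete type per column
col1 = first.(rows)
col2 = last.(rows)
println(eltype(col1), " ", eltype(col2))
```

This is the same effect as building the DataFrame from a reshaped matrix: a matrix has a single element type, so per-column types are lost before the DataFrame is constructed.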

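This is an initial attempt at writing a conversion function.. it keeps the column names and types. It would be nice if it could be cleaned up and implemented in either the DataFrames or the IndexedTables package, as convert(DataFrame,t::IndexedArray)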

function toDataFrame(t::IndexedTable) 

    # Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple 

    # Getting the column types.. this is independent if it is a keyed or normal IndexedArray 
    colTypes = Union{Union,DataType}[] 
    for item in t.index.columns 
     push!(colTypes, eltype(item)) 
    end 
    if(typeof(t.data) <: Vector) # The Data part is a simple Array 
     push!(colTypes, eltype(t.data)) 
    else       # The data part is a Tuple 
     for item in t.data.columns 
      push!(colTypes, eltype(item)) 
     end 
    end 
    # Getting the column names.. this change if it is a keyed or normal IndexedArray 
    colNames = Symbol[] 
    lIdx = length(t.index.columns) 
    if(eltype(t.index.columns) <: AbstractVector) # normal Tuple 
     [push!(colNames, Symbol("x",i)) for i in 1:lIdx] 
    else           # NamedTuple 
     for (k,v) in zip(keys(t.index.columns), t.index.columns) 
      push!(colNames, k) 
     end 
    end 
    if(typeof(t.data) <: Vector) # The Data part is a simple single Array 
     push!(colNames, Symbol("x",lIdx+1)) 
    else 
     lData = length(t.data.columns) 
     if(eltype(t.data.columns) <: AbstractVector) # normal Tuple 
      [push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)] 
     else           # NamedTuple 
      for (k,v) in zip(keys(t.data.columns), t.data.columns) 
       push!(colNames, k) 
      end 
     end 
    end 
    # building an empty DataFrame.. 
    df = DataFrame() 
    for i in 1:length(colTypes) 
     df[colNames[i]] = colTypes[i][] 
    end 
    # and finally filling the df with values.. 
    for (i,pair) in enumerate(pairs(t)) 
     rowValues = [] 
     for (j,section) in enumerate(pair) 
      for item in section 
       push!(rowValues,item) 
      end 
     end 
     push!(df, rowValues) 
    end 
    return df 
end 

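Dan's solution looks good. Have you tested it? Is there a problem when the data is a Vector? By the way, we may get a way to "glue" packages together in the future. See: https://github.com/JuliaLang/julia/issues/2025#issuecomment-338005473 – Liso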


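@Liso OK, I tried it, and Dan Getz's solution works well (microseconds, even for large datasets), except that when only one of {index, data} is a NamedTuple, all columns are converted to xi names. – Antonello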


Could you add a small example where it does not work for the column names? – Liso

Just install IterableTables and then

using IterableTables 
df = DataFrames.DataFrame(it) 

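Good to hear about IterableTables (and require.jl!), but it only converts when the source can be reduced to a named tuple. In my example it works with 'tn', but not (at least not without a preliminary conversion) with 't'. – Antonello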
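One way to picture the "preliminary conversion" mentioned here is to give generated names to an unnamed tuple of columns. The sketch below is an assumption-laden illustration, not IterableTables or IndexedTables API: it uses Base Julia 1.x named tuples, and `with_default_names` is a hypothetical helper. Rebuilding an actual IndexedTable from the named columns is not shown.

```julia
# Sketch for Julia 1.x, Base only; `with_default_names` is a hypothetical helper.

# A plain Tuple of columns gets generated names prefix1, prefix2, ...
function with_default_names(cols::Tuple, prefix="x")
    names = ntuple(i -> Symbol(prefix, i), length(cols))
    return NamedTuple{names}(cols)
end

# Columns that are already named are returned unchanged
with_default_names(cols::NamedTuple, prefix="x") = cols

unnamed = (["price", "price"], [2.8, 1.1])
named = with_default_names(unnamed)

println(keys(named))
println(named.x2)
```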