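How to convert an IndexedTable to a DataFrame in Julia?

Question:

In some quick exploratory work, IndexedTables seems much faster than DataFrames at operating on individual elements (e.g. selecting or "updating"), but DataFrames has a richer ecosystem of functionality, e.g. plotting and exporting.

So, at some point in the workflow, I would like to convert an IndexedTable to a DataFrame, e.g. from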

using DataFrames, IndexedTables, IndexedTables.Table 

tn = Table(
    Columns(
     param = String["price","price","price","price","waterContent","waterContent"], 
     item = String["banana","banana","apple","apple","banana", "apple"], 
     region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
     value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
    ) 
)

to >>

df_tn = DataFrame(
    param  = String["price","price","price","price","waterContent","waterContent"], 
    item  = String["banana","banana","apple","apple","banana", "apple"], 
    region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA], 
    value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
)

or from

t = Table(
    Columns(
     String["price","price","price","price","waterContent","waterContent"], 
     String["banana","banana","apple","apple","banana", "apple"], 
     Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
     Float64[3.2,2.9,1.2,0.8,0.2,0.8], 
    ) 
)

to >>

df_t = DataFrame(
    x1 = String["price","price","price","price","waterContent","waterContent"], 
    x2 = String["banana","banana","apple","apple","banana", "apple"], 
    x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA], 
    x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8] 
)

I can get the individual "row" values by iterating over the table with pairs():

for (i,pair) in enumerate(pairs(tn)) 
    rowValues = [] 
    for (j,section) in enumerate(pair) 
     for item in section 
      push!(rowValues,item) 
     end 
    end 
    println(rowValues) 
end

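What I can't get are the column names and types, and I suspect that working column by column would be more efficient.

EDIT: Building on the code above, I did manage to get the "column" types; now I only need the column names, if there are any: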

colTypes = Union{Union,DataType}[] 

for item in tn.index.columns 
    push!(colTypes, eltype(item)) 
end 
for item in tn.data.columns 
    push!(colTypes, eltype(item)) 
end

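EDIT2: As requested, here is an example of an IndexedTable whose column names fail to convert with the (current) answer by Dan Getz, because the "index" columns are a named tuple while the "data" columns are a plain tuple: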

t_named_idx = Table(
    Columns(
     param = String["price","price","price","price","waterContent","waterContent"], 
     item = String["banana","banana","apple","apple","banana", "apple"], 
     region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA] 
    ), 
    Columns(
     Float64[2.8,2.7,1.1,0.8,0.2,0.7], 
    ) 
)

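The problem seems to lie in the IndexedTable API, specifically in the columns(t) function, which does not distinguish between index and values.

Answer

The following conversion functions: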

toDataFrame(cols::Tuple, prefix="x") = 
    DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...) 

toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") = 
    DataFrame(;(c => cols[c] for c in fieldnames(cols))...) 

toDataFrame(t::IndexedTable) = toDataFrame(columns(t))

give (on Julia 0.6, with tn and t defined as in the question):

julia> tn
param           item      region │ value2000  value2010
─────────────────────────────────┼─────────────────────
"price"         "apple"   "FR"   │ 1.1        1.2
"price"         "apple"   "UK"   │ 0.8        0.8
"price"         "banana"  "FR"   │ 2.8        3.2
"price"         "banana"  "UK"   │ 2.7        2.9
"waterContent"  "apple"   NA     │ 0.7        0.8
"waterContent"  "banana"  NA     │ 0.2        0.2

julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1       │ 1.2       │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8       │ 0.8       │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8       │ 3.2       │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7       │ 2.9       │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7       │ 0.8       │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2       │ 0.2       │

The type information is mostly retained:

julia> typeof(df_tn[:,1]) 
DataArrays.DataArray{String,1} 

julia> typeof(df_tn[:,4]) 
DataArrays.DataArray{Float64,1}

And for unnamed columns:

julia> t
───────────────────────────────┬─────────
"price"         "apple"   "FR" │ 1.1  1.2
"price"         "apple"   "UK" │ 0.8  0.8
"price"         "banana"  "FR" │ 2.8  3.2
"price"         "banana"  "UK" │ 2.7  2.9
"waterContent"  "apple"   NA   │ 0.7  0.8
"waterContent"  "banana"  NA   │ 0.2  0.2

julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │

EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:

toDataFrame(t::IndexedTable) = 
    hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))

and then the mixed case gives a result like:

julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param          │ item     │ region │ x1  │ x2  │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR"   │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK"   │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR"   │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK"   │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA     │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA     │ 0.2 │ 0.2 │
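
The naming rule behind the edit above (keep existing names for a named tuple, generate prefixed names for a plain tuple, and treat index and data separately) can be sketched without either package. This is only an illustration, assuming a recent Julia (1.x) where NamedTuple is built in; name_columns and name_all_columns are hypothetical helpers, not part of IndexedTables or DataFrames.

```julia
# Hypothetical helpers for illustration only (not an IndexedTables/DataFrames API).

# Plain tuple of columns: generate names prefix1, prefix2, ...
name_columns(cols::Tuple; prefix="x") =
    Pair{Symbol,AbstractVector}[Symbol(prefix, i) => cols[i] for i in 1:length(cols)]

# Named tuple of columns: keep the existing names (prefix is ignored).
name_columns(cols::NamedTuple; prefix="x") =
    Pair{Symbol,AbstractVector}[name => cols[name] for name in keys(cols)]

# Name index and data columns separately, so that a named index keeps its
# names even when the data columns are unnamed (the "mixed" case).
name_all_columns(index_cols, data_cols) =
    vcat(name_columns(index_cols; prefix="y"), name_columns(data_cols; prefix="x"))

# Mixed case: named index columns, unnamed data columns.
index_cols = (param = ["price", "price"], item = ["banana", "apple"])
data_cols  = ([2.8, 1.1], [3.2, 1.2])

named = name_all_columns(index_cols, data_cols)
println(first.(named))   # [:param, :item, :x1, :x2]
```

Handling the two parts independently is what avoids the all-or-nothing renaming to xi described in the comments.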

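It is faster than my solution (microseconds, even for large datasets) and clearly much more elegant, but when only one of {index, data} is a NamedTuple, all columns get converted to xi names. In general, this answer teaches that it is better to look at a module's API than to dump an object and dig through its internals. – Antonello

I noticed the problem with mixed named and unnamed tuples; in fact, the earlier iterative solution handles it fine. This solution can also be tweaked to handle it. I'll take a look. –

Added an edit to the answer that handles the mixed case. –

Answer

An ugly, quick and dirty "solution" (I hope this is doable some other way):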

julia> df = DataFrame(
     permutedims( # <- structural transpose 
      vcat(
      reshape([j for i in keys(t) for j in i], :, length(t)) , 
      reshape([j for i in t  for j in i], :, length(t)) 
      ), 
      (2,1) 
     ) 
     ) 
6×5 DataFrames.DataFrame
│ Row │ x1             │ x2       │ x3   │ x4  │ x5  │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1   │ "price"        │ "apple"  │ "FR" │ 1.1 │ 1.2 │
│ 2   │ "price"        │ "apple"  │ "UK" │ 0.8 │ 0.8 │
│ 3   │ "price"        │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4   │ "price"        │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5   │ "waterContent" │ "apple"  │ NA   │ 0.7 │ 0.8 │
│ 6   │ "waterContent" │ "banana" │ NA   │ 0.2 │ 0.2 │

Thank you.. I'm learning from it.. As it stands, the converted df is made of 'Any' columns, and it doesn't keep the column names, if there are any.. – Antonello
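
The 'Any' columns mentioned in this comment are what one would expect when rows of mixed types are flattened into a single array before reshaping. A minimal sketch in plain Julia (no packages; the rows below are made-up sample data, not the tables from the question) shows the difference between flattening row by row and extracting column by column:

```julia
# Rows mixing String and Float64 values (illustrative data only).
rows = [("price", 2.8), ("price", 2.7), ("waterContent", 0.2)]

# Row-wise flattening puts Strings and Float64s into one array,
# so the element type is widened to Any.
flat = [x for r in rows for x in r]
println(eltype(flat))                       # Any

# Column-wise extraction keeps each column's concrete element type.
col1 = [r[1] for r in rows]
col2 = [r[2] for r in rows]
println(eltype(col1), " ", eltype(col2))    # String Float64
```

This is one reason a column-wise conversion tends to preserve types where a flatten-and-reshape approach does not.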

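Answer

This is an initial attempt at writing a conversion function.. it keeps the column names and types. It would be nice if it could be cleaned up and implemented in the DataFrame or IndexedTable package as convert(DataFrame,t::IndexedArray).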

function toDataFrame(t::IndexedTable) 

    # Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple 

    # Getting the column types.. this is independent if it is a keyed or normal IndexedArray 
    colTypes = Union{Union,DataType}[] 
    for item in t.index.columns 
     push!(colTypes, eltype(item)) 
    end 
    if(typeof(t.data) <: Vector) # The Data part is a simple Array 
     push!(colTypes, eltype(t.data)) 
    else       # The data part is a Tuple 
     for item in t.data.columns 
      push!(colTypes, eltype(item)) 
     end 
    end 
    # Getting the column names.. this change if it is a keyed or normal IndexedArray 
    colNames = Symbol[] 
    lIdx = length(t.index.columns) 
    if(eltype(t.index.columns) <: AbstractVector) # normal Tuple 
     [push!(colNames, Symbol("x",i)) for i in 1:lIdx] 
    else           # NamedTuple 
     for (k,v) in zip(keys(t.index.columns), t.index.columns) 
      push!(colNames, k) 
     end 
    end 
    if(typeof(t.data) <: Vector) # The Data part is a simple single Array 
     push!(colNames, Symbol("x",lIdx+1)) 
    else 
     lData = length(t.data.columns) 
     if(eltype(t.data.columns) <: AbstractVector) # normal Tuple 
      [push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)] 
     else           # NamedTuple 
      for (k,v) in zip(keys(t.data.columns), t.data.columns) 
       push!(colNames, k) 
      end 
     end 
    end 
    # building an empty DataFrame.. 
    df = DataFrame() 
    for i in 1:length(colTypes) 
     df[colNames[i]] = colTypes[i][] 
    end 
    # and finally filling the df with values.. 
    for (i,pair) in enumerate(pairs(t)) 
     rowValues = [] 
     for (j,section) in enumerate(pair) 
      for item in section 
       push!(rowValues,item) 
      end 
     end 
     push!(df, rowValues) 
    end 
    return df 
end

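Dan's solution looks good. Have you tested it? Is there a problem when the data is a Vector? By the way, we may get a way to write "glue" packages in the future. See: https://github.com/JuliaLang/julia/issues/2025#issuecomment-338005473 – Liso

@Liso OK, I tried it, and Dan Getz's solution works great (microseconds, even for large datasets), except that when only one of {index, data} is a NamedTuple, all columns are converted to xi names. – Antonello

Could you add a small example where it doesn't work for the column names? – Liso

Answer

Just install IterableTables and then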

using IterableTables 
df = DataFrames.DataFrame(it)

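Good to learn about IterableTables (and require.jl!), but it only converts when the source can be reduced to named tuples. In my examples, it works with 'tn' but not (at least not without a preliminary conversion) with 't'. – Antonello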