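How to convert an IndexedTable to a DataFrame in Julia?
In some quick exploratory work, IndexedTables seems much faster than DataFrames for working on individual elements (e.g. selecting or "updating"), but DataFrames has a richer ecosystem of functionality, e.g. plotting and exporting.
So, at some point in the workflow, I would like to convert the IndexedTable to a DataFrame, e.g.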
using DataFrames, IndexedTables, IndexedTables.Table
tn = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
        value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
        value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)
to >>
df_tn = DataFrame(
    param     = String["price","price","price","price","waterContent","waterContent"],
    item      = String["banana","banana","apple","apple","banana", "apple"],
    region    = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    value2000 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    value2010 = Float64[3.2,2.9,1.2,0.8,0.2,0.8],
)
or
t = Table(
    Columns(
        String["price","price","price","price","waterContent","waterContent"],
        String["banana","banana","apple","apple","banana", "apple"],
        Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
        Float64[2.8,2.7,1.1,0.8,0.2,0.7],
        Float64[3.2,2.9,1.2,0.8,0.2,0.8],
    )
)
to >>
df_t = DataFrame(
    x1 = String["price","price","price","price","waterContent","waterContent"],
    x2 = String["banana","banana","apple","apple","banana", "apple"],
    x3 = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA],
    x4 = Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    x5 = Float64[3.2,2.9,1.2,0.8,0.2,0.8]
)
I can get the individual "row" values by iterating over the table with pairs():
for (i,pair) in enumerate(pairs(tn))
    rowValues = []
    for (j,section) in enumerate(pair)
        for item in section
            push!(rowValues,item)
        end
    end
    println(rowValues)
end
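However, I can't get the column names and types, and I guess working by column would be more efficient.
EDIT: I did manage to get the "column" types with the code below; now I just need the column names, if any:
colTypes = Union{Union,DataType}[]
for item in tn.index.columns
    push!(colTypes, eltype(item))
end
for item in tn.data.columns
    push!(colTypes, eltype(item))
end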
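For readers on a current Julia, the same idea (collect eltype over every column, index first, then data) can be sketched without IndexedTables. This is only an illustration under assumptions: it targets Julia 1.x, where missing takes the role that NA has in this thread, column_eltypes is a hypothetical helper, and the two tuples merely stand in for tn.index.columns and tn.data.columns.

```julia
# Sketch for Julia 1.x, Base only. `column_eltypes` is a hypothetical helper,
# not part of IndexedTables or DataFrames.
# `cols` is a Tuple or NamedTuple of column vectors; iterating either one
# yields the column vectors themselves.
column_eltypes(cols::Union{Tuple,NamedTuple}) = Type[eltype(c) for c in cols]

# Stand-ins for the index and data columns of a table (assumed data).
index_cols = (param = ["price", "waterContent"], region = ["FR", missing])
data_cols  = ([2.8, 0.2], [3.2, 0.2])

# Index column types first, then data column types.
col_types = vcat(column_eltypes(index_cols), column_eltypes(data_cols))
```

The result is one element type per column, in table order, whether the columns are named or not.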
EDIT2: As requested, here is an example of an IndexedTable for which the column-name conversion in the (current) answer by Dan Getz fails, because the "index" columns are a named tuple while the "data" columns are a plain tuple:
t_named_idx = Table(
    Columns(
        param  = String["price","price","price","price","waterContent","waterContent"],
        item   = String["banana","banana","apple","apple","banana", "apple"],
        region = Union{String,DataArrays.NAtype}["FR","UK","FR","UK",NA,NA]
    ),
    Columns(
        Float64[2.8,2.7,1.1,0.8,0.2,0.7],
    )
)
The problem seems to be in the IndexedTable API, specifically in the columns(t) function, which does not distinguish between index and values.
The following conversion functions:
toDataFrame(cols::Tuple, prefix="x") =
    DataFrame(;(Symbol("$prefix$c") => cols[c] for c in fieldnames(cols))...)
toDataFrame(cols::NamedTuples.NamedTuple, prefix="x") =
    DataFrame(;(c => cols[c] for c in fieldnames(cols))...)
toDataFrame(t::IndexedTable) = toDataFrame(columns(t))
give (on Julia 0.6, with tn and t defined as in the question):
julia> tn
param item region │ value2000 value2010
─────────────────────────────────┼─────────────────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_tn = toDataFrame(tn)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ value2000 │ value2010 │
├─────┼────────────────┼──────────┼────────┼───────────┼───────────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Type information is mostly retained:
julia> typeof(df_tn[:,1])
DataArrays.DataArray{String,1}
julia> typeof(df_tn[:,4])
DataArrays.DataArray{Float64,1}
And for the unnamed columns:
julia> t
───────────────────────────────┬─────────
"price" "apple" "FR" │ 1.1 1.2
"price" "apple" "UK" │ 0.8 0.8
"price" "banana" "FR" │ 2.8 3.2
"price" "banana" "UK" │ 2.7 2.9
"waterContent" "apple" NA │ 0.7 0.8
"waterContent" "banana" NA │ 0.2 0.2
julia> df_t = toDataFrame(t)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
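EDIT: As noted by @Antonello, the case of mixed named and unnamed tuples is not handled correctly. To handle it correctly, we can define:
toDataFrame(t::IndexedTable) =
    hcat(toDataFrame(columns(keys(t)),"y"),toDataFrame(columns(values(t))))
and then the mixed case gives a result like: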
julia> toDataFrame(tn2)
6×5 DataFrames.DataFrame
│ Row │ param │ item │ region │ x1 │ x2 │
├─────┼────────────────┼──────────┼────────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
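The naming rule behind this result (named columns keep their names, unnamed columns get a prefix plus their position, and index and data are named separately before being joined) can be sketched independently of IndexedTables and DataFrames. The sketch below is an assumption-laden illustration for Julia 1.x, where NamedTuple is built in rather than coming from the NamedTuples package used on 0.6; column_names and merged_columns are hypothetical helpers, and the sample data is made up to mirror the mixed case.

```julia
# Sketch for Julia 1.x, Base only. Both helpers are hypothetical names.
# Named columns keep their names; unnamed columns get prefix + position.
column_names(cols::NamedTuple, prefix) = collect(keys(cols))
column_names(cols::Tuple, prefix) = [Symbol(prefix, i) for i in 1:length(cols)]

# Name index and data columns separately, then concatenate both parts.
function merged_columns(index_cols, data_cols)
    col_names = vcat(column_names(index_cols, "y"), column_names(data_cols, "x"))
    cols = Any[index_cols..., data_cols...]   # splatting yields the column vectors
    return col_names, cols
end

# Mixed case: named index columns, unnamed data columns (assumed data).
index_cols = (param = ["price", "price"], item = ["banana", "apple"])
data_cols  = ([2.8, 1.1], [3.2, 1.2])
col_names, cols = merged_columns(index_cols, data_cols)
# col_names is [:param, :item, :x1, :x2]
```

Because each part is named on its own, a named index no longer forces the data columns to be named as well, which is the point of the hcat-based definition.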
An ugly, quick and dirty "solution" (I hope it can be done some other way):
julia> df = DataFrame(
permutedims( # <- structural transpose
vcat(
reshape([j for i in keys(t) for j in i], :, length(t)) ,
reshape([j for i in t for j in i], :, length(t))
),
(2,1)
)
)
6×5 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
├─────┼────────────────┼──────────┼──────┼─────┼─────┤
│ 1 │ "price" │ "apple" │ "FR" │ 1.1 │ 1.2 │
│ 2 │ "price" │ "apple" │ "UK" │ 0.8 │ 0.8 │
│ 3 │ "price" │ "banana" │ "FR" │ 2.8 │ 3.2 │
│ 4 │ "price" │ "banana" │ "UK" │ 2.7 │ 2.9 │
│ 5 │ "waterContent" │ "apple" │ NA │ 0.7 │ 0.8 │
│ 6 │ "waterContent" │ "banana" │ NA │ 0.2 │ 0.2 │
Thank you.. I am "learning" from it.. As it stands, the converted df is made of 'Any' columns and it does not keep the column names, if any.. – Antonello
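The 'Any' columns mentioned in this comment follow from stacking strings and floats in a single matrix: the only common element type is Any. A minimal sketch of the effect, assuming Julia 1.x and Base only, with plain vectors of tuples standing in for keys(t) and the values of t (made-up data):

```julia
# Sketch for Julia 1.x, Base only. The two vectors of tuples are assumed
# stand-ins for the keys and the values of an IndexedTable.
key_rows = [("price", "apple"), ("price", "banana")]
val_rows = [(1.1, 1.2), (2.8, 3.2)]

# Same shape of computation as the quick and dirty solution:
# one column per row, keys stacked on top of values.
m = vcat(
    reshape([j for i in key_rows for j in i], :, length(key_rows)),
    reshape([j for i in val_rows for j in i], :, length(val_rows)),
)
mt = permutedims(m, (2, 1))   # structural transpose: one row per table row

# Mixing String and Float64 widens the element type to Any.
# Broadcasting `identity` over one column narrows it again.
first_col = identity.(mt[:, 1])
third_col = identity.(mt[:, 3])
```

Here eltype(mt) is Any, while first_col and third_col come back as String and Float64 vectors, so per-column narrowing is one way to recover the types after a matrix-based conversion.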
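Here is an initial attempt at writing a conversion function.. it keeps the column names and types. It would be nice if it could be cleaned up and implemented in either the DataFrames or the IndexedTables package as convert(DataFrame,t::IndexedArray).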
function toDataFrame(t::IndexedTable)
    # Note: the index is always a Tuple (named or not) while the data part can be a simple Array, a tuple or a Named tuple
    # Getting the column types.. this is independent if it is a keyed or normal IndexedArray
    colTypes = Union{Union,DataType}[]
    for item in t.index.columns
        push!(colTypes, eltype(item))
    end
    if(typeof(t.data) <: Vector) # The Data part is a simple Array
        push!(colTypes, eltype(t.data))
    else                         # The data part is a Tuple
        for item in t.data.columns
            push!(colTypes, eltype(item))
        end
    end
    # Getting the column names.. this change if it is a keyed or normal IndexedArray
    colNames = Symbol[]
    lIdx = length(t.index.columns)
    if(eltype(t.index.columns) <: AbstractVector) # normal Tuple
        [push!(colNames, Symbol("x",i)) for i in 1:lIdx]
    else                                          # NamedTuple
        for (k,v) in zip(keys(t.index.columns), t.index.columns)
            push!(colNames, k)
        end
    end
    if(typeof(t.data) <: Vector) # The Data part is a simple single Array
        push!(colNames, Symbol("x",lIdx+1))
    else
        lData = length(t.data.columns)
        if(eltype(t.data.columns) <: AbstractVector) # normal Tuple
            [push!(colNames, Symbol("x",i)) for i in (lIdx+1):(lIdx+lData)]
        else                                         # NamedTuple
            for (k,v) in zip(keys(t.data.columns), t.data.columns)
                push!(colNames, k)
            end
        end
    end
    # building an empty DataFrame..
    df = DataFrame()
    for i in 1:length(colTypes)
        df[colNames[i]] = colTypes[i][]
    end
    # and finally filling the df with values..
    for (i,pair) in enumerate(pairs(t))
        rowValues = []
        for (j,section) in enumerate(pair)
            for item in section
                push!(rowValues,item)
            end
        end
        push!(df, rowValues)
    end
    return df
end
Just install IterableTables and then
using IterableTables
df = DataFrames.DataFrame(it)
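Nice to hear about IterableTables (and require.jl!), but it converts only when the source is reducible to a named tuple. In my example it works with 'tn', but not (at least not without a preliminary conversion) with 't'. – Antonello
It is faster than my solution (by microseconds, also on large datasets) and obviously much more elegant, but when only one of {index, data} is a NamedTuple, all the columns get converted to xi names. Overall, this answer shows that it is better to look at a module's API than to try to hack a goal out of its internals. – Antonello
I noticed the problem with mixed named and unnamed tuples; in fact, the earlier iteration-based solution handles it well. This solution can also be tweaked to handle it. I will have a look. –
Added an edit to the answer that handles the mixed case. –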