R: How to read a CSV file with data.table::fread when the comma is the decimal mark and the point is the thousands separator "."
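I have several CSV files that contain numbers in the local German style, i.e. with a comma as the decimal separator and a point as the thousands separator, e.g. 10.380,45. The values in the CSV files are separated by ";". The files also contain columns of the classes character, date, date & time and logical.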
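The problem with the read.table functions is that you can specify the decimal separator with dec=",", but NOT the thousands separator. (If I'm wrong, please correct me.)
I know that preprocessing is a workaround, but I want to write my code in a way that others can use it without me.
By setting my own classes, I found a way to read the CSV files the way I want with read.csv2, as shown in the following example. It is based on "Most elegant way to load csv with point as thousands separator in R".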
# Create test example
df_test_write <- cbind.data.frame(c("a","b","c","d","e","f","g","h","i","j",rep("k",times=200)),
c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00",rep("3.133,33",times=200)),
c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016",rep("08.08.2016",times=200)),
stringsAsFactors=FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")
# write test csv
write.csv2(df_test_write,file="Test.csv",quote=FALSE,row.names=FALSE)
#### read with read.csv2 ####
# First, define your own class
#define your own numeric class
setClass('myNum')
#define conversion
setAs("character","myNum", function(from) as.numeric(gsub(",","\\.",gsub("\\.","",from))))
# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))
# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
)
)
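To make the conversion inside setAs() easier to follow, here is a minimal sketch of the same idea as a plain function. The name de_to_num is hypothetical and not part of the original post; it only illustrates the two substitutions (drop the points, then turn the comma into a point).

```r
# Hypothetical helper (not from the original post): convert German-style
# number strings such as "1.000.258,25" to numeric values.
de_to_num <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop the thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # turn the decimal comma into a point
  as.numeric(x)
}

converted <- de_to_num(c("5.200,39", "250,36", "1.000.258,25"))
print(converted)
```

Using fixed = TRUE avoids regular-expression escaping, so "." is treated as a literal point rather than as "any character".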
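My problem now is that the different datasets have up to 200 columns and 350,000 rows. With the solution above it takes 40 to 60 seconds to load one CSV file, and I would like to speed this up.
Through my research I found fread() from the data.table package, which is really fast. It takes approximately 3 to 5 seconds to load the CSV file.
Unfortunately, it is also not possible to specify the thousands separator there. So I tried to use my colClasses solution, but there seems to be an issue that you cannot use individual classes with fread: https://github.com/Rdatatable/data.table/issues/491
See also my test code below: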
##### read with fread ####
library(data.table)
# Test without colclasses
df_test_readfread1 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
dec = ",",
sep=";",
verbose=TRUE)
str(df_test_readfread1)
# PROBLEM: In my real dataset it turns the number into a numeric column,
# unfortunately it sees the "." as decimal separator, so it turns e.g. 10.550,
# into 10.5
# Here it keeps everything as character
# Test with colclasses
df_test_readfread2 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
),
sep=";",
verbose=TRUE)
str(df_test_readfread2)
# Keeps everything as character
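So my question is: is there a way to use fread to read CSV files with numeric values like 10.380,45? (Or alternatively: what is the fastest way to read a CSV with such numeric values?)
Thanks in advance for your answers; I hope my question is not too long ;-).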
I have never used the package myself, but it is from Hadley Wickham, so it should be good stuff:
https://cran.r-project.org/web/packages/readr/readr.pdf
It should handle locales:
locale(date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8", asciify = FALSE)
decimal_mark and grouping_mark are what you are looking for.
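As a small illustration of those two arguments, the sketch below parses a few German-style strings with readr's parse_number(). The variable names de_locale and nums are only for illustration, and all other locale settings are left at their defaults.

```r
library(readr)

# Locale that only swaps the decimal and grouping marks (assumption: all
# other locale settings can stay at their defaults for this example)
de_locale <- locale(decimal_mark = ",", grouping_mark = ".")

# parse_number() drops the grouping mark and honours the decimal mark
nums <- parse_number(c("10.380,45", "3,58", "20.500.000,00"), locale = de_locale)
print(nums)
```

The same locale object can be passed to the locale argument of readr's file readers, so the marks are defined once for the whole file.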
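EDIT from PhiSeu: Solution
Thanks for your suggestion. Here are two solutions with read_csv2() from the readr package. For my CSV file with 350,000 rows it takes approximately 8 seconds, which is much faster than the read.csv2 solution. (Another helpful package from Hadley and RStudio, thanks.)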
library(readr)
# solution 1 with specified columns
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
locale = locale("de"),
col_names = TRUE,
col_types = cols(
col_text = col_character(),
col_num = col_number(), # number is automatically recognized through locale=("de")
col_date = col_date(format ="%d.%m.%Y") # Date specification
)
)
# solution 2 with overall definition of date format
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
locale = locale("de",date_format = "%d.%m.%Y"), # specifies the date format for the whole file
col_names = TRUE
)
Maybe remove all the points (the thousands separators) first.
filepath<-paste0(getwd(),"/Test.csv")
filestring<-readChar(filepath, file.info(filepath)$size)
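# remove every "." (note: this also strips the points from dates such as 25.03.2015)
filestring<-gsub('.','',filestring,fixed=TRUE)
# the string contains line breaks, so fread treats it as the data itself
fread(filestring, sep=";", dec=",")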
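An alternative sketch that leaves the raw file text alone: let fread read the affected columns as character and convert them afterwards by reference. This is not from the original answers. The temporary file is a small stand-in for Test.csv, and the column names follow the question's example.

```r
library(data.table)

# Small stand-in for Test.csv, written to a temporary file (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("col_text;col_num;col_date",
             "a;5.200,39;25.03.2015",
             "b;250,36;28.04.2015",
             "c;1.000.258,25;03.05.2016"), tmp)

# Read the German-style columns as plain character so nothing is misparsed
dt <- fread(tmp, sep = ";",
            colClasses = list(character = c("col_num", "col_date")))

# Convert by reference: drop the points, turn the comma into a point
dt[, col_num := as.numeric(gsub(",", ".", gsub(".", "", col_num, fixed = TRUE),
                                fixed = TRUE))]
# Dates keep their points, so they can be parsed with an explicit format
dt[, col_date := as.Date(col_date, format = "%d.%m.%Y")]

str(dt)
```

Because only the numeric columns are touched, the points inside the date column survive, which is not the case when every "." is removed from the whole file.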
See [#1636](https://github.com/Rdatatable/data.table/issues/1636). This strikes me as lacking... not sure why setting options("datatable.fread.dec.locale" = "de_DE.utf8") does not solve the problem. @Arun, isn't that strange? – MichaelChirico