fwrite.Rd
Like write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible. Modern machines almost surely have more than one CPU, so fwrite uses them on all operating systems, including Linux, Mac and Windows.
fwrite(x, file = "", append = FALSE, quote = "auto",
  sep = ",", sep2 = c("","|",""),
  eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
  na = "", dec = ".", row.names = FALSE, col.names = TRUE,
  qmethod = c("double","escape"),
  logical01 = getOption("datatable.logical01", FALSE),  # due to change to TRUE; see NEWS
  logicalAsInt = logical01,  # deprecated
  scipen = getOption('scipen', 0L),
  dateTimeAs = c("ISO","squash","epoch","write.csv"),
  buffMB = 8L, nThread = getDTthreads(verbose),
  showProgress = getOption("datatable.showProgress", interactive()),
  compress = c("auto", "none", "gzip"),
  yaml = FALSE,
  bom = FALSE,
  verbose = getOption("datatable.verbose", FALSE))
x | Any |
---|---|
file | Output file name. |
append | If |
quote | When |
sep | The separator between columns. Default is |
sep2 | For columns of type |
eol | Line separator. Default is |
na | The string to use for missing values in the data. Default is a blank string |
dec | The decimal separator, by default |
row.names | Should row names be written? For compatibility with |
col.names | Should the column names (header row) be written? The default is |
qmethod | A character string specifying how to deal with embedded double quote characters when quoting strings.
|
logical01 | Should |
logicalAsInt | Deprecated. Old name for `logical01`. Name change for consistency with `fread` for which `logicalAsInt` would not make sense. |
scipen |
|
dateTimeAs | How
The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for |
buffMB | The buffer size (MB) per thread in the range 1 to 1024, default 8MB. Experiment to see what works best for your data on your hardware. |
nThread | The number of threads to use. Experiment to see what works best for your data on your hardware. |
showProgress | Display a progress meter on the console? Ignored when |
compress | If |
yaml | If |
bom | If |
verbose | Be chatty and report timings? |
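As a rough illustration of the dateTimeAs argument, the sketch below writes a single Date value with three of the options listed in the usage and reads the result back. The table DT and the helper write_lines are hypothetical names introduced only for this example, and it assumes a data.table version in which fwrite accepts dateTimeAs.

```r
library(data.table)

# Hypothetical one-column table holding a single Date value.
DT <- data.table(d = as.Date("2016-09-12"))

# Helper for this example only (not part of data.table):
# write DT to a temporary file and return the lines that were written.
write_lines <- function(...) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  fwrite(DT, path, ...)
  readLines(path)
}

iso    <- write_lines(dateTimeAs = "ISO")     # date written as yyyy-mm-dd
squash <- write_lines(dateTimeAs = "squash")  # same digits, separators removed
epoch  <- write_lines(dateTimeAs = "epoch")   # days since 1970-01-01

# First element of each result is the header, second is the data row.
print(rbind(iso, squash, epoch))
```

Writing to a temporary file and reading it back keeps the comparison independent of how console output is captured.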
fwrite began as a community contribution with pull request #1613 by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/. Final items were tracked in issue #1664 such as automatic quoting, bit64::integer64 support, decimal/scientific formatting exactly matching write.csv between 2.225074e-308 and 1.797693e+308 to 15 significant figures, row.names, dates (between 0000-03-01 and 9999-12-31), times and sep2 for list columns where each cell can itself be a vector.
To save space, fwrite prefers to write wide numeric values in scientific notation -- e.g. 10000000000 takes up much more space than 1e+10. Most file readers (e.g. fread) understand scientific notation, so there's no fidelity loss. Like in base R, users can control this by specifying the scipen argument, which follows the same rules as options('scipen'). fwrite will see how much space a value will take to write in scientific vs. decimal notation, and will only write in scientific notation if the latter is more than scipen characters wider. For 10000000000, then, 1e+10 will be written whenever scipen<6.
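The scipen rule described above can be tried with a minimal sketch like the following. The table DT and the helper write_lines are hypothetical names for this example, and the two scipen values are chosen well away from the threshold of 6 so that the outcome does not depend on edge cases.

```r
library(data.table)

# Hypothetical table with one wide numeric value.
DT <- data.table(x = 1e10)

# Helper for this example only (not part of data.table):
# write DT with a given scipen and return the lines that were written.
write_lines <- function(scipen) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  fwrite(DT, path, scipen = scipen)
  readLines(path)
}

# "10000000000" is 11 characters and "1e+10" is 5, a difference of 6.
sci <- write_lines(0L)   # 6 > 0, so scientific notation is preferred
dec <- write_lines(50L)  # 6 is not > 50, so decimal notation is kept

print(sci[2])
print(dec[2])
```

Passing scipen explicitly, rather than relying on options('scipen'), keeps the example independent of the session's settings.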
CSVY Support:
The following fields will be written to the header of the file and surrounded by --- on top and bottom:
source - Contains the R version and data.table version used to write the file
creation_time_utc - Current timestamp in UTC time just before the header is written
schema with element fields giving name-type (class) pairs for the table; multi-class objects (e.g. c('POSIXct', 'POSIXt')) will have their first class written.
header - same as col.names (which is header on input)
sep
sep2
eol
na.strings - same as na
dec
qmethod
logical01
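To see what such a header looks like, the sketch below writes a small table with yaml = TRUE and prints the resulting file. It assumes the yaml package is installed, which fwrite needs for this option, and skips the write otherwise; DT and csvy_lines are hypothetical names for this example.

```r
library(data.table)

# Hypothetical two-column table.
DT <- data.table(id = 1:2, label = c("foo", "bar"))

csvy_lines <- character(0)

# Assumption: writing the YAML header relies on the 'yaml' package,
# so the example only runs when that package is available.
if (requireNamespace("yaml", quietly = TRUE)) {
  path <- tempfile(fileext = ".csvy")
  fwrite(DT, path, yaml = TRUE)
  csvy_lines <- readLines(path)
  unlink(path)

  # The header fields sit between two '---' lines, followed by the CSV body.
  writeLines(csvy_lines)
}
```

The creation_time_utc field changes on every run, so the printed header will differ between invocations.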
http://howardhinnant.github.io/date_algorithms.html
https://en.wikipedia.org/wiki/Decimal_mark
#> A,B
#> 1,foo
#> 2,"A,Name"
#> 3,baz

#> A,B
#> 1,foo
#> 2,A,Name
#> 3,baz

fwrite(DF, row.names=TRUE, quote=TRUE)
#> "","A","B"
#> "1",1,"foo"
#> "2",2,"A,Name"
#> "3",3,"baz"

#> "","A","B"
#> "1",1,"foo"
#> "2",2,"A,Name"
#> "3",3,"baz"

DF = data.frame(A=c(2.1,-1.234e-307,pi), B=c("foo","A,Name","bar"))
fwrite(DF, quote='auto')        # Just DF[2,2] is auto quoted
#> A,B
#> 2.1,foo
#> -1.234e-307,"A,Name"
#> 3.14159265358979,bar

#> "A","B"
#> 2.1,"foo"
#> -1.234e-307,"A,Name"
#> 3.14159265358979,"bar"

#> A,B
#> 2,1|2|3
#> 5.6,foo|"A,Name"|bar
#> -3,3.14|6.28|9.42

#> A|B
#> 2|{1,2,3}
#> 5.6|{foo,"A,Name",bar}
#> -3|{3.14,6.28,9.42}

if (FALSE) {

set.seed(1)
DT = as.data.table( lapply(1:10, sample,
         x=as.numeric(1:5e7), size=5e6))                            #     382MB
system.time(fwrite(DT, "/dev/shm/tmp1.csv"))                        #      0.8s
system.time(write.csv(DT, "/dev/shm/tmp2.csv",                      #     60.6s
                      quote=FALSE, row.names=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

set.seed(1)
N = 1e7
DT = data.table(
  str1=sample(sprintf("%010d",sample(N,1e5,replace=TRUE)), N, replace=TRUE),
  str2=sample(sprintf("%09d",sample(N,1e5,replace=TRUE)), N, replace=TRUE),
  str3=sample(sapply(sample(2:30, 100, TRUE), function(n)
     paste0(sample(LETTERS, n, TRUE), collapse="")), N, TRUE),
  str4=sprintf("%05d",sample(sample(1e5,50),N,TRUE)),
  num1=sample(round(rnorm(1e6,mean=6.5,sd=15),2), N, replace=TRUE),
  num2=sample(round(rnorm(1e6,mean=6.5,sd=15),10), N, replace=TRUE),
  str5=sample(c("Y","N"),N,TRUE),
  str6=sample(c("M","F"),N,TRUE),
  int1=sample(ceiling(rexp(1e6)), N, replace=TRUE),
  int2=sample(N,N,replace=TRUE)-N/2
)                                                                   #     774MB
system.time(fwrite(DT,"/dev/shm/tmp1.csv"))                         #      1.1s
system.time(write.csv(DT,"/dev/shm/tmp2.csv",                       #     63.2s
                      row.names=FALSE, quote=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

unlink("/dev/shm/tmp1.csv")
unlink("/dev/shm/tmp2.csv")
}