long/bigint/decimal equivalent datatype in R

Developer https://www.devze.com 2022-12-16 18:27 Source: Internet
What datatype choices do we have to handle large numbers in R? By default, the size of an integer seems to be 32 bits, so bigint numbers from SQL Server, as well as any large numbers passed from Python via rpy2, get mangled.

> 123456789123
[1] 123456789123
> 1234567891234
[1] 1.234568e+12

When I read the bigint value 123456789123456789 using RODBC, it comes back as 123456789123456784 (note the last digit), and the same number deserialized via RJSONIO comes back as -1395630315L (which looks like an additional bug or limitation in RJSONIO).

> fromJSON('[1234567891]')
[1] 1234567891
> fromJSON('[12345678912]')
[1] -539222976

Actually, I do need to be able to handle large numbers coming from JSON, so with RJSONIO's limitation, I may not have a workaround except for finding a better JSON library (which seems like a non-option right now). I would like to hear what experts have to say on this as well as in general.


I understood your question a little differently from the two who posted before I did.

If R's largest default value is not big enough for you, you have a few choices (disclaimer: I have used each of the libraries I mention below, but not through the R bindings; I used them through other language bindings or the native library).

The Brobdingnag package: stores values as their natural logarithms (like Rmpfr, it is implemented with R's S4 class system). I'm always impressed by anyone whose work requires numbers of this scale.

library(Brobdingnag)

googol <- as.brob(1e100)   

The gmp package: R bindings to the venerable GMP (GNU Multiple Precision Arithmetic Library). This must go back 20 years, because I used it at university. The library's motto is "Arithmetic Without Limits," which is a credible claim: integers, rationals, floats, whatever, right up to the limits of the RAM on your box.

library(gmp)

x = as.bigq(8000, 21)

The Rmpfr package: R bindings that interface to both gmp (above) and MPFR (which is itself built on top of GMP and provides multiple-precision floating-point arithmetic with correct rounding). I have used the Python bindings ('bigfloat') and can recommend them highly. This might be your best option of the three, given its scope, given that it appears to be the most actively maintained, and finally given what appears to be the most thorough documentation.

Note: to use either of the last two, you'll need to install the native libraries, GMP and MPFR.
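
Since the question is about bigint values specifically, here is a minimal sketch of exact integer arithmetic with gmp's as.bigz(). It assumes the gmp package and the native GMP library are installed, and it parses the value from a string so that it never passes through a lossy double.

```r
# Assumption: the gmp package is installed; the guard skips the example otherwise.
if (requireNamespace("gmp", quietly = TRUE)) {
  # Parse from a character string, not from a numeric literal,
  # so no precision is lost before gmp sees the value.
  big <- gmp::as.bigz("123456789123456789")

  print(big + 1)                  # exact integer arithmetic
  print(as.character(big * big))  # a product far beyond what a double holds exactly
}
```

Converting back with as.character() keeps every digit; converting back with as.numeric() would reintroduce the rounding.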


See help(integer):

 Note that on almost all implementations of R the range of
 representable integers is restricted to about +/-2*10^9: ‘double’s
 can hold much larger integers exactly.

so I would recommend using numeric (i.e. 'double') -- a double-precision number.

Updated in 2022: This issue still stands and is unlikely ever to change: integer in R is a (signed) int32_t (and hence range limited), and double is a proper double. Package int64 aimed to overcome this by using S4 and a complex (integer) type to give us 64-bit resolution (as in int64_t). Package bit64 does the same by using a double internally, and many packages, from data.table to database interfaces and JSON parsers (including our RcppSimdJson), use it. Our package nanotime relies on it to provide int64_t-based timestamps (i.e. nanoseconds since the epoch). In short, there is no other way. Some JSON packages stick with a string representation too ("expensive", and you need to convert later).


After this question was asked, packages int64 by Romain Francois and bit64 by Jens Oehlschlägel are now available.
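
As a rough illustration of the bit64 route, the sketch below stores the value from the question in an integer64 vector. It assumes the bit64 package is installed, and it reads the value from a string because a numeric literal would already have been rounded to a double.

```r
# Assumption: the bit64 package is installed; the guard skips the example otherwise.
if (requireNamespace("bit64", quietly = TRUE)) {
  # Parse from a string so the value never exists as a double.
  id <- bit64::as.integer64("123456789123456789")

  print(id)                # all 18 digits are kept
  print(id + 1L)           # arithmetic stays in 64-bit integers
  print(as.character(id))  # lossless conversion back to text
}
```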


Dirk is right. You should be using the numeric type (which should be set to double). The other thing to note is that you may not be getting back all the digits. Look at the digits setting:

> options("digits")
$digits
[1] 7

You can extend this:

options(digits=14)

Alternatively, you can reformat the number:

format(big.int, digits=14)
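
A small base-R sketch of the point above: the digits setting only changes how a number is printed, not what is stored. The variable name x is just for illustration.

```r
x <- 1234567891234

# Under the default digits = 7 this prints in scientific notation,
# but the stored value is unchanged.
print(x)

# Ask for more significant digits when formatting.
print(format(x, digits = 14))  # "1234567891234"

# sprintf() with %.0f prints the stored double with no decimals.
print(sprintf("%.0f", x))      # "1234567891234"
```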

I tested your number and am getting the same behavior (even using the double data type), so that may be a bug:

> as.double("123456789123456789")
[1] 123456789123456784
> class(as.double("123456789123456789"))
[1] "numeric"
> is.double(as.double("123456789123456789"))
[1] TRUE
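
The changed last digit is consistent with the precision of a double rather than with a bug: a double has a 53-bit significand, so integers are only guaranteed to be exact up to 2^53 (about 9.007e15). The value 123456789123456789 lies between 2^56 and 2^57, where neighbouring doubles are 16 apart, and the nearest one is 123456789123456784. A minimal base-R sketch of that reasoning:

```r
# Number of significand bits in a double on this platform.
print(.Machine$double.digits)  # 53 on IEEE 754 hardware

# 2^53 is the last point up to which every integer is representable.
limit <- 2^53
print(sprintf("%.0f", limit))  # "9007199254740992"
print(limit + 1 == limit)      # TRUE: 2^53 + 1 rounds back to 2^53

# Gap between neighbouring doubles around the number from the question.
x <- 123456789123456789
spacing <- 2^(floor(log2(x)) - 52)
print(spacing)                 # 16
```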


I had been trying to find a workaround for this issue for the last two days, and today I finally found one. We have 19-digit IDs in our SQL database, and earlier I used RODBC to get bigint data from the server. I tried int64 and bit64, and also set options(digits=19), but RODBC kept giving me problems. I replaced RODBC with RJDBC, and when retrieving bigint data from SQL Server I changed the SQL query to cast the bigint column to a string.

So here is sample code:

# Include the stats package
require(stats)
library(RJDBC)

# Set the working directory
setwd("W:/Users/dev/Apps/R/Data/201401_2")

# Get the JDBC driver
driver <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
               "W:/Users/dev/Apps/R/Data/sqljdbc/enu/sqljdbc4.jar")

# Connect to the database
connection <- dbConnect(driver, "jdbc:sqlserver://DBServer;DatabaseName=DB;",
                        "BS_User", "BS_Password")

# Query string: cast the bigint ID to a string on the server side
sqlText <- paste("SELECT DISTINCT CONVERT(varchar(19), ID) AS ID",
                 "FROM tbl_Sample")

# Execute the query
queryResults <- dbGetQuery(connection, sqlText)

With this solution I got the bigint data without any modification, which did not work with RODBC. The interaction between R and SQL Server is now slower, because RJDBC is slower than RODBC, but it's not too bad.
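
To show why casting to a string helps, here is a small base-R sketch with two made-up 19-digit IDs (hypothetical values, not taken from any real table). As strings they stay distinct; as doubles they collapse into the same value, which would silently break joins and de-duplication.

```r
# Two hypothetical 19-digit IDs that differ only in the last digit.
ids <- c("1234567890123456781", "1234567890123456782")

# Kept as character, the IDs remain distinct.
print(length(unique(ids)))              # 2

# Converted to double, both round to the same representable value.
as_double <- as.numeric(ids)
print(length(unique(as_double)))        # 1
print(as_double[1] == as_double[2])     # TRUE
```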


I fixed a few issues related to integers in rpy2 (Python can switch from int to long when needed, but R does not seem to be able to do that). Integer overflows should now return NA_integer_.

L.
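
For comparison, this is how plain R itself behaves at the 32-bit boundary (a base-R sketch, independent of rpy2): integer arithmetic does not widen automatically, it overflows to NA with a warning, whereas mixing in a double promotes the result to double.

```r
big <- .Machine$integer.max  # 2147483647, the largest R integer

# Integer + integer overflows to NA (R also emits a warning, suppressed here).
overflowed <- suppressWarnings(big + 1L)
print(overflowed)            # NA
print(is.integer(overflowed))

# Integer + double is computed in double precision instead.
promoted <- big + 1
print(promoted)              # 2147483648
print(is.double(promoted))
```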


There are many options in R for working with big numbers. You can also use as.numeric(). The problem with as.numeric() is that I found a bug in the function in R 3.02: if you multiply numbers of the as.numeric() data type and the result is around 16 digits long, you get an incorrect result. This bug in as.numeric() has been tested against many libraries.

There is another option.

I wrote two programs for R: one is called infiX and the other infiXF. The library currently supports only multiplication. Both compute the result exactly, down to the last decimal place, and have been tested more than 100,000 times. infiX works with numbers in string format, whereas infiXF works on a file-system basis.

When you store the number in memory, you are limited to roughly 8 to 128 GB, depending on how much memory you have, and sometimes less if the compiler does not let you use all the available resources. When you calculate on text files, you can work with numbers up to about 1/5 of the hard drive's size. The only problem is the time the calculation would need.

For example, suppose I were multiplying 1 terabyte of digits by another terabyte of digits. That is about 2 trillion digits, which is doable on an 8-terabyte hard drive. But do I have the time to do the calculation?

InfiX for R can be found here. http://kevinhng86.iblog.website/2017/02/21/working-with-number-infinity-multiplication-optimised-the-code-r/
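
To illustrate the general idea of multiplying numbers held as strings, here is a minimal schoolbook (digit-by-digit) sketch in base R. It is not the infiX code; the function name mul_str is hypothetical, and it only handles non-negative integers given as digit strings.

```r
# Schoolbook multiplication of two non-negative integers stored as strings.
# Hypothetical helper for illustration only; not taken from infiX.
mul_str <- function(a, b) {
  stopifnot(grepl("^[0-9]+$", a), grepl("^[0-9]+$", b))

  # Split into digits, least significant digit first.
  x <- rev(as.integer(strsplit(a, "")[[1]]))
  y <- rev(as.integer(strsplit(b, "")[[1]]))

  # Accumulate the partial products per decimal position.
  res <- numeric(length(x) + length(y))
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      k <- i + j - 1
      res[k] <- res[k] + x[i] * y[j]
    }
  }

  # Propagate carries so every position holds a single digit.
  carry <- 0
  for (k in seq_along(res)) {
    total <- res[k] + carry
    res[k] <- total %% 10
    carry <- total %/% 10
  }

  # Most significant digit first, without leading zeros.
  digits <- rev(res)
  first <- which(digits != 0)[1]
  if (is.na(first)) return("0")
  paste(digits[first:length(digits)], collapse = "")
}

print(mul_str("123456789", "987654321"))  # "121932631112635269"
```

The 18-digit product above is already beyond what a double can hold exactly, which is the case the string representation is meant to cover. The nested loops make this quadratic in the number of digits, so it is only practical for modest sizes.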
