long-time R and Python user here. I use R for my daily data analysis and Python for tasks heavier on text processing and shell-scripting. I am working with increasingly large data sets, and these files are often in binary or text files when I get them. The type of things I do normally is to apply statistical/machine learning algorithms and create statistical graphics in most cases. I use R with SQLite sometimes and write C for iteration-intensive tasks; before looking into Hadoop, I am considering investing some time in NumPy/Scipy because I've heard it has better memory management [and the transition to Numpy/Scipy for one with my background seems not that big] - I wonder if anyone has experience using the two and could comment on the imp开发者_如何学Gorovements in this area, and if there are idioms in Numpy that deal with this issue. (I'm also aware of Rpy2 but wondering if Numpy/Scipy can handle most of my needs). Thanks -
R's strength when looking for an environment to do machine learning and statistics is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits cannot be a replacement for CRAN.
Regarding memory usage, R is using a pass-by-value paradigm while Python is using pass-by-reference. Pass-by-value can lead to more "intuitive" code, pass-by-reference can help optimize memory usage. Numpy also allows to have "views" on arrays (kind of subarrays without a copy being made).
Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.
If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
I am not sure you should ditch one for the other but rpy2 can help you explore your options about a possible transition (arrays can be shuttled between R and Numpy without a copy being made).
I use NumPy daily and R nearly so.
For heavy number crunching, i prefer NumPy to R by a large margin (including R packages, like 'Matrix') I find the syntax cleaner, the function set larger, and computation is quicker (although i don't find R slow by any means). NumPy's Broadcasting functionality for instance, i do not think has an analog in R.
For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:
data = NP.loadtxt(data1, delimiter=",") # 'data' is a NumPy array
data -= NP.mean(data, axis=0)
data /= NP.max(data, axis=0)
Also, i find that when coding ML algorithms, i need data structures that i can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose, etc.). NumPy gets this and allows you to create these hybrid structures easily (no operator overloading or subclassing, etc.).
You won't be disappointed by NumPy/SciPy, more likely you'll be amazed.
So, a few recommendations--in general and in particular, given the facts in your question:
install both NumPy and Scipy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).
install the repository versions, particularly w/r/t NumPy because the dev version is 2.0. Matplotlib and NumPy are tightly integrated, you can use one without the other of course, but both are the best in their respective class among python libraries. You can get all three via easy_install, which i assume you already.
NumPy/SciPy have several modules specifically directed to Machine Learning/Statistics, including the Clustering package and the Statistics package.
As well as packages directed to general computation, but which are make coding ML algorithms a lot faster, in particular, Optimization and Linear Algebra.
There are also the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor), and learn (a set of ML/Statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
精彩评论