RStudio have released version 1.0 of their eponymous R IDE. They are calling it their "...biggest [release] ever!" It certainly has a number of very significant features. Integrated support for Spark: Spark and R are core tools for data scientists. While Spark has an R API, support for the machine learning libraries is lagging. So, it's great to hear that RStudio now has integrated support … [Read more...]
RDDs, DataFrames and Datasets
There are now three Spark APIs for working with large volumes of data: RDD, DataFrame and Dataset. Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog. RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs … [Read more...]
Status of Spark MLlib wrappers in SparkR
Wrappers for Spark's MLlib machine learning library in SparkR have been slow to arrive. However, the future looks bright. The imminent 2.0 release will bring k-means support to SparkR, and the 2.1 release is scheduled to include wrappers for the following machine learning stalwarts: Alternating Least Squares (ALS), Decision Trees, Gaussian Mixture Models, Isotonic Regression, Latent Dirichlet … [Read more...]
Free Apache Spark Analytics Made Simple e-book
Databricks have just published a free e-book entitled "Apache Spark Analytics Made Simple". Contents include: an introduction to the Spark API for analytics; tips and tricks to simplify unified data access; and real-world case studies of how various companies are using Spark with Databricks to transform their business. There are more to come. Titles are: Mastering Advanced Analytics with Apache … [Read more...]