There are now three Spark APIs for working with large volumes of data: RDD, DataFrame, and Dataset. Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog. RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs … [Read more...]
Microsoft announces major commitment to Apache Spark
Microsoft have just announced an extensive commitment for Spark to power Microsoft’s big data and analytics offerings, including Cortana Intelligence Suite, Power BI, and Microsoft R Server. Spark 1.6.1 is available on Azure HDInsight, and integration with R Server is to follow. This will allow R functions to be run at scale over thousands of Spark nodes. … [Read more...]
Microsoft R Open 3.2.5 released
Microsoft R Open 3.2.5 is now available. While there are no substantial changes to core R, the CRAN snapshot includes some new packages, such as deeplearning, a deep neural network implementation for regression and classification. … [Read more...]
Microsoft R Server documentation is now online
The complete Microsoft R Server documentation is now available on MSDN and is publicly accessible. It includes comprehensive details of the RevoScaleR High Performance Analytics package. RevoScaleR includes the following analysis functions: rxSummary (basic summary statistics), rxLinMod (linear modeling), rxLogit (logistic regression modeling), rxGlm (generalized linear modeling), rxCovCor … [Read more...]
Designing a data strategy
Learning Tree just published my article on why it's important to think carefully about the data and techniques employed in data-driven decision-making. Throwing machine learning at existing data doesn't cut it. … [Read more...]