There are now three Spark APIs for working with large volumes of data: RDD, DataFrame, and Dataset. Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog. RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs … [Read more...]
Kaggle provide a home for high-quality public datasets
Kaggle have launched Kaggle Datasets, a repository of "high quality public datasets". The repository will support: Access: simple, consistent access to the data with clear licensing; Analysis: a way to explore the data without downloading it; Results: visibility of previous work performed using the data; Conversation: forums for discussing the nuances of the data … [Read more...]