There are now three Spark APIs for working with large volumes of data:
- RDD
- DataFrame
- Dataset
Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog.
RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs make sense when working with unstructured data, such as media or text streams. They are also the best approach if your problem fits neatly within the functional programming paradigm.
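For example, a classic word count over raw text fits the RDD model naturally. Here is a minimal Scala sketch (the input file path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount").getOrCreate()
val sc = spark.sparkContext

// Unstructured text, processed in the functional map/reduce style RDDs were designed for
val counts = sc.textFile("logs.txt")   // hypothetical input file
  .flatMap(_.split("\\s+"))            // split each line into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum the counts per word

counts.take(10).foreach(println)
```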
However, for the majority of data science tasks, the DataFrame and Dataset APIs are likely to be more appropriate. Dataset is a strongly-typed API, whereas DataFrame is untyped: a DataFrame can be thought of as a Dataset of generic Row objects. From Spark 2.0 onward the two APIs are unified, with DataFrame defined simply as an alias for Dataset[Row].
Datasets impose more constraints on the structure of the data, so they are not as flexible as RDDs. However, those constraints allow the API to offer higher-level functionality, enhanced compile-time checks, and significant run-time performance optimizations (via the Catalyst query optimizer and Tungsten execution engine).
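To make that concrete, here is a minimal Scala sketch (the Person case class and people.json file are hypothetical). It shows that a DataFrame is just a Dataset[Row], that `.as[T]` turns it into a typed Dataset, and that the typed API catches mistakes at compile time that a DataFrame only surfaces at run time:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// A DataFrame is an untyped Dataset of generic Row objects
val df: Dataset[Row] = spark.read.json("people.json")

// .as[T] gives a strongly-typed view of the same data
val ds: Dataset[Person] = df.as[Person]

// df.select("agee")   // would compile, but fail at run time with an AnalysisException
ds.filter(_.age > 21)  // checked against Person at compile time; a typo like _.agee would not compile
```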
So, at the risk of oversimplifying, use the Dataset API unless it’s making you jump through hoops. If it is, feel free to use the RDD API. It’s not disappearing anytime soon.
It should be noted that some Spark libraries (such as MLlib) are still being migrated to the DataFrame/Dataset API, so, in the short term, RDDs may still make sense even when working with structured data.
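If you do need to hand data to an RDD-based routine, converting back and forth is straightforward. Continuing the hypothetical Person Dataset from the sketch above:

```scala
// Dataset[Person] -> RDD[Person], e.g. for an RDD-based library routine
val rdd = ds.rdd

// ...and back again once you are done (requires spark.implicits._ to be in scope)
val backToDs = spark.createDataset(rdd)
```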