Decision Mechanics

Insight. Applied.

RDDs, DataFrames and Datasets

July 14, 2016 By editor

There are now three Spark APIs for working with large volumes of data:

  • RDD
  • DataFrame
  • Dataset

Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog.

RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs make sense when working with unstructured data, such as media or text streams. They are also the best approach if your problem fits neatly within the functional programming paradigm.
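The functional style that suits the RDD API can be sketched without Spark at all. The snippet below is a plain-Python analogy, not Spark code: it counts words in a couple of made-up text lines using a flatten, map, reduce pipeline. In Spark the corresponding RDD operations would be flatMap, map and reduceByKey; the variable names here are purely illustrative.

```python
from collections import Counter
from functools import reduce
from itertools import chain

# Unstructured input: free text with no schema (sample lines are made up).
lines = [
    "to be or not to be",
    "that is the question",
]

# flatMap-like step: split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)

# map-like step: pair each word with a count of one.
pairs = map(lambda word: (word, 1), words)

# reduceByKey-like step: sum the counts for each distinct word.
# Counter + Counter returns a new Counter, so no state is mutated.
word_counts = reduce(
    lambda acc, pair: acc + Counter({pair[0]: pair[1]}),
    pairs,
    Counter(),
)

print(word_counts["to"])  # prints 2
```

Nothing in the pipeline needs to know the structure of the data in advance, which is why this style fits free text well.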

However, for the majority of data science tasks, the DataFrame and Dataset APIs are likely to be more appropriate. Dataset is a strongly typed API, whereas DataFrame is untyped. A DataFrame can be thought of as a Dataset of generic (untyped) Row objects. From Spark 2.0 onward the Dataset and DataFrame APIs will be unified.

Datasets impose more constraints on the structure of the data than RDDs do, so they are not as flexible. However, those constraints allow the API to offer higher-level functionality, enhanced compile-time checks and significant run-time performance optimizations.
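One way to see why structure enables optimization is a toy example. The sketch below is not Spark and does not use Spark's API; ColumnStore, filter_opaque and filter_declared are hypothetical names invented for illustration. It shows a single idea, column pruning: when a query names the columns it needs, the engine can skip the rest, whereas an opaque function forces it to read everything. Spark's real optimizer is far more sophisticated than this.

```python
# Toy illustration only: NOT Spark. A tiny column store that counts how many
# columns each query style has to read.

class ColumnStore:
    def __init__(self, columns):
        self.columns = columns   # dict: column name -> list of values
        self.columns_read = 0    # running total of columns materialized
        self.num_rows = len(next(iter(columns.values())))

    def read(self, names):
        """Materialize only the requested columns as a list of row dicts."""
        self.columns_read += len(names)
        return [
            {name: self.columns[name][i] for name in names}
            for i in range(self.num_rows)
        ]


def filter_opaque(store, predicate):
    """Opaque style: the predicate could touch any field, so read every column."""
    rows = store.read(list(store.columns))
    return [row for row in rows if predicate(row)]


def filter_declared(store, column, minimum, select):
    """Declared style: the query names its columns, so only those are read."""
    needed = sorted({column, *select})
    rows = store.read(needed)
    return [
        {name: row[name] for name in select}
        for row in rows
        if row[column] >= minimum
    ]


# Made-up sample data.
data = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [34, 19, 52],
    "city": ["Leeds", "York", "Bath"],
    "bio": ["runner", "chef", "pilot"],
}

opaque_store = ColumnStore(data)
opaque_result = filter_opaque(opaque_store, lambda row: row["age"] >= 21)

declared_store = ColumnStore(data)
declared_result = filter_declared(declared_store, "age", 21, ["name"])

print(opaque_store.columns_read, declared_store.columns_read)  # prints: 4 2
```

Both queries find the same two people, but the declared version reads half as many columns. On wide tables the saving grows with the number of unused columns.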

So, at the risk of oversimplifying, use the Dataset API unless it’s making you jump through hoops. If it is, feel free to use the RDD API. It’s not disappearing anytime soon.

Note that the Spark libraries (such as MLlib) are still being updated to work with the Dataset API so, in the short term, RDDs may still make sense even when working with structured data.

Filed Under: Big data, Data science, Machine learning Tagged With: DataFrame, dataset, RDD, Spark

Copyright © 2025 · Decision Mechanics Limited · info@decisionmechanics.com