Decision Mechanics

Insight. Applied.

RDDs, DataFrames and Datasets

July 14, 2016 By editor

There are now three Spark APIs for working with large volumes of data:

  • RDD
  • DataFrame
  • Dataset

Which one should we use? Good question. Jules Damji provides a pretty comprehensive answer in an article on the Databricks blog.

RDD was the original API for working with large volumes of data. The first thing to note is that the RDD API is not being deprecated. It has an important role to play. RDDs make sense when working with unstructured data, such as media or text streams. They are also the best approach if your problem fits neatly within the functional programming paradigm.
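As a rough illustration of the kind of problem that suits RDDs, the sketch below counts words in free-form text using nothing but functional transformations. It assumes Spark 2.x or later running in local mode; the object name `RddWordCount`, the helper `countWords` and the sample sentences are made up for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddWordCount {
  // Count words in unstructured text. There is no schema here,
  // just strings and ordinary functional transformations.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of one
      .reduceByKey(_ + _)         // sum the counts for each word

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-sketch")
      .getOrCreate()

    val lines = spark.sparkContext.parallelize(
      Seq("to be or not to be", "that is the question"))

    countWords(lines).collect().foreach { case (word, n) => println(s"$word: $n") }
    spark.stop()
  }
}
```

Nothing in this code tells Spark what the data looks like, which is what makes RDDs flexible and also what limits how much Spark can optimize on your behalf.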

However, for the majority of data science tasks, the DataFrame and Dataset APIs are likely to be more appropriate. Dataset is a strongly typed API, whereas DataFrame is untyped. A DataFrame can be thought of as a Dataset of generic (untyped) objects. From Spark 2.0 onward, the Dataset and DataFrame APIs will be unified (in Scala, DataFrame becomes an alias for Dataset[Row]).

Datasets impose more constraints on the structure of the data, so they are not as flexible as RDDs. However, those constraints allow the API to offer higher-level functionality, enhanced compile-time checks and significant run-time performance optimizations.
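To make the typed/untyped distinction concrete, here is a minimal sketch of the same filter written against a DataFrame and against a Dataset. It assumes Spark 2.x or later in local mode; the `Person` case class, the `TypedVsUntyped` object and its two helper methods are hypothetical names introduced purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object TypedVsUntyped {
  // Untyped: the column is looked up by name when Spark analyses the query,
  // so a misspelt name such as col("agee") only fails at run time.
  def adultsUntyped(df: DataFrame): DataFrame =
    df.filter(col("age") >= 21)

  // Typed: `age` is a field of Person, so a misspelt field name
  // is rejected by the Scala compiler before the job ever runs.
  def adultsTyped(ds: Dataset[Person]): Dataset[Person] =
    ds.filter(person => person.age >= 21)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._ // brings in encoders and toDS()/toDF()

    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    adultsUntyped(people.toDF()).show()
    adultsTyped(people).show()
    spark.stop()
  }
}
```

Both versions select the same rows. The difference is when a mistake surfaces: the typed version trades a little flexibility for errors caught at compile time.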

So, at the risk of oversimplifying, use the Dataset API unless it’s making you jump through hoops. If it is, feel free to use the RDD API. It’s not disappearing anytime soon.

Note that the Spark libraries (such as MLlib) are still being updated to work with the Dataset API so, in the short term, RDDs may still make sense even when working with structured data.

Filed Under: Big data, Data science, Machine learning Tagged With: DataFrame, dataset, RDD, Spark

Kaggle provide a home for high-quality public datasets

January 21, 2016 By editor

Kaggle have launched Kaggle Datasets—a repository of “high quality public datasets”.

The repository will support:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility of previous work performed using the data
  • Conversation: forums for discussing the nuances of the data

Filed Under: Data analysis Tagged With: dataset

Copyright © 2021 · Decision Mechanics Limited · info@decisionmechanics.com