OneNet (now called Prajna) is a distributed functional programming platform being developed at Microsoft. As such, it has a lot of similarities to Apache Spark. Both platforms are built using functional languages---F#, in the case of OneNet, and Scala, in the case of Spark---which are also the primary languages for developers using the platforms. OneNet will have support for specialized computing devices, … [Read more...]
Happy 5th birthday, Spark
The Apache Spark project was first open-sourced on 31 March 2010. While much has been made of how quickly interest in Spark has grown, it's worth pausing to remember that it's been around for a whole five years. The project has had time to mature and expand into areas where there are practical requirements (e.g. data frames, ML pipelines). Looking forward to what the team comes up with in … [Read more...]
Birth of a Theorem
I've been listening to extracts from "Birth of a Theorem: A Mathematical Adventure" on BBC Radio 4's Book of the Week. It's Cédric Villani's account of the years leading up to his award of the Fields Medal---the most coveted prize in mathematics. We rarely get a chance to see the creative process at work. All we get to see is the final result---wrapped up in a neat little bow. We don't see the … [Read more...]
Big data and narrative
As the amount of data we collect continues to explode, attention needs to shift to making sense of it. Tools like Hadoop and Spark allow us to analyse these huge datasets, but they don't really make sense of them. Senior managers want insights. They want to have their business illuminated by the data. At present, this is done by data scientists analysing the data and weaving it into a … [Read more...]
Dirty data is the biggest challenge facing data scientists
A recent survey of data scientists by CrowdFlower found that, when it comes to challenges, dirty data is the #1 hurdle… My own experience leads me to agree. Access to good-quality data remains a huge problem. Organizations would do well to invest in improving the quality of their data before boosting their analytics capabilities. Garbage in, garbage out. One error I see made regularly … [Read more...]