Databricks

My boss at work suggested I look into some newer technologies related to my data interests. One of them was Databricks. The name sounds cool, but I did not know much about the company. Databricks was founded by the creators of Apache Spark. It recently raised $250M in capital at a $6.2 billion valuation, and its yearly revenue is $200M.

Databricks produces a web-based big data platform that you use to work with Apache Spark. It provides what the company calls Unified Analytics. You use it to develop data pipelines across different storage systems, and those pipelines are used to build ML models. The data can be passed to third-party tools like Tableau, RStudio, and Snowflake.

Databricks consists of a workspace, a runtime, and a cloud service. The cloud service is a managed service that hosts the tools. The runtime is built on an optimized version of Apache Spark, and you choose which version of Spark to run on. There is also a shared virtual notebook interface similar to the IPython UI.

So you can't understand Databricks without knowing what Spark is. That is a story of its own. In summary, Spark is an open source big data analytics engine and an alternative to MapReduce. It offers lower latency than competing products like Hadoop. The engine runs on a cluster computing framework. It provides a distributed dataset abstraction, the resilient distributed dataset (RDD), which splits data across the machines in the cluster. It also has a machine learning library, MLlib.
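
To make the distributed dataset idea a bit more concrete, here is a rough sketch in plain Python. It does not use Spark at all. It only imitates the pattern: split the data into partitions, do the work on each partition independently, then merge the partial results. The function names (split_into_partitions, count_words, merge_counts) and the sample lines are made up for this illustration and are not part of any Spark API.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count words inside one partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


# Made-up sample data for the illustration.
lines = [
    "Spark is a big data analytics engine",
    "Spark is an alternative to MapReduce",
    "Databricks is built on Spark",
]

partitions = split_into_partitions(lines, num_partitions=2)

# On a real cluster each partition would be processed on a different machine.
# Here everything runs in one process, one partition after another.
partial_counts = [count_words(p) for p in partitions]

word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts["spark"])  # prints 3
```

An engine like Spark takes care of the parts this sketch skips: moving partitions to worker machines, scheduling the work, and recovering when a machine fails.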

And in case you did not know, IPython stands for Interactive Python. It is a command shell interface originally developed for Python. However, it can now be used with multiple programming languages. It also has tools for running parallel computing jobs.