Why I Think Spark Will Have the Staying Power of SQL

Spark is to SQL what calculus is to algebra.

Michael Vedomske

October 27, 2017

Old-timer Just Keeps on Tickin’

SQL has been around in commercial form since 1979, almost 40 years. That’s when Relational Software, Inc. (which later became Oracle) released Oracle version 2 (a marketing renaming of what was really version 1).

Think about that for a second. A hotshot, fresh-out-of-undergrad, SQL-skilled new hire from 1979 would be retiring in just a couple of years. Some people still build their careers on SQL skills. In other words, SQL has had incredible staying power. It still does.

Enter the Young Gun

So what does this have to do with AI? Well, I’m going to go out on a limb and say that we’re three years into a similar journey with another landscape-shifting technology: Spark. Spark 1.0 was released as an Apache project in May of 2014. I happened to be a fresh hire (albeit PhD, not undergrad) and Spark was HOT. I mean, it was exactly what our company (and many others) needed, and every release just got better.

Here are a few reasons I believe Spark will stay meaningful through the years.

1. Daddy Warbucks Got Yer Back

SQL was supported by a strong company (read: had commercial support) while also taking advantage of the open-source efforts of outside contributors, and it was eventually standardized. Spark has the commercial support of Databricks, which is currently valued at nearly $1B only a few years into its existence. And as an Apache project, it is developed at an extremely rapid clip by a vibrant open-source community.

What’s probably even more important is the fact that it is used at so many large companies. In other words, it has weaseled its way into the core toolset of much of the world’s GDP. And that’s just the beginning, because according to Databricks, they’re still working on reaching the other 99%.

2. One Stop ML Shop

One of the things that made Spark really great was that it could pass through HQL and soon had its own SQL-like language, Spark SQL. For the first time, data wrangling and machine learning could be executed on big data in one place, in well-known languages, at extraordinary speed. It was the holy grail of big data science.

Spark meets two primary needs:

Easy data wrangling (in a familiar approach: SQL)
Many of your favorite machine learning algorithms at scale.

In other words, SQL’s staying power, and its natural way of thinking about data, is what will help Spark also have staying power. Yes, many data stores are NoSQL, but the fact that you can use non-relational databases and think about the data in them as if they were relational is what makes Spark so powerful. Notice that SQL is still the reference here. All databases are described by their relation to SQL, and that’s saying something.

Spark can handle pretty much any data store you throw at it, and data scientists can use a common way of thinking about data (SQL) for handling it. You don’t have to use the SQL-like interface, but it’s there, and many take advantage of it. Don’t care for the SQL/HQL approach? That’s fine, you can use Spark like many use bash for data wrangling. Spark spans many skill levels.

3. It Feels Familiar

Because Spark has a machine learning library, you can use it much like you would familiar data science languages like R and Python. The usefulness here goes beyond just syntax; it’s the process that makes it so user-friendly.

Interactively playing with and exploring data is one of the most powerful parts of R and Python. You can very quickly start to peel back the layers and find the stories within the data. Before Spark, that process was painful and slow (sorry MapReduce 🙁 … ). Suddenly with Spark, working with very large data sets felt much more like what we experienced in R and Python. Sure, there was still some waiting, but nothing close to what it was before.

The second powerful part of R and Python is the packages that contain numerous algorithms for machine learning (and just about any other data-related task you can think of). Spark offers this as well, though in a more limited way (due to the parallelization it requires). Spark makes big data feel a little smaller. In today’s parlance, the user experience is solid.

Apache Spark Architecture – See You In 40 Years

SQL made working with data much simpler. For the first time, people could use a straightforward logic and language for getting at previously hidden knowledge. Spark is the next natural step of that evolution. In this step, the hidden knowledge is less explicit, and is found via feature engineering, machine learning, and dipping into vast stores of previously untapped data. Spark makes doing these things simple in the way that SQL made the first step of data exploration simple.

Spark is to SQL what calculus is to algebra. And that’s why I think Spark will have the staying power of SQL.
