Want to learn more about #pyspark #memory profiling? Have questions about PySpark #UDFs, profiling hot loops of a UDF, or profiling the memory of a UDF? This #AMA is a follow-up to the popular post How to Profile PySpark https://lnkd.in/gQ2MzXG5 #apachespark
Apache Spark
Technology, Information and Internet
Berkeley, CA 16,058 followers
Unified engine for large-scale data analytics
About us
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Key Features
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R.
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
The most widely-used engine for scalable computing: Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors to the open source project from industry and academia.
Ecosystem: Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
- Website: https://spark.apache.org/
- Industry: Technology, Information and Internet
- Company size: 51-200 employees
- Headquarters: Berkeley, CA
- Type: Nonprofit
- Specialties: Apache Spark, Big Data, Machine Learning, SQL Analytics, Batch, and Streaming
Locations
- Primary: Berkeley, CA, US
Updates
-
It's well known that you can create Spark functions with Scala/Java/Python, but did you know that you can also create Spark functions with SQL? Just use CREATE FUNCTION and use familiar SQL syntax. Apache Spark is a full-featured SQL engine. You can use Spark if you prefer SQL and don't want to use a programming language.
-
Spark Connect's architecture allows the client to be fully decoupled from the Spark Driver. The client and Spark Driver connect via gRPC. The client sends unresolved logical plans to the Spark Driver, which responds by streaming Arrow record batches. Decoupling the client and Spark Driver provides user benefits like better remote development, IDE integrations, easier debugging, and less onerous update requirements (the Spark Driver can update software versions and the client doesn't necessarily have to update).
-
PySpark now has built-in testing utility functions to assess the equality of DataFrames. In the past, PySpark users had to rely on external libraries like chispa or spark-testing-base to get this DataFrame comparison functionality for unit testing, but now it's built right into Spark. The built-in Spark unit-testing functionality needs to be expanded and it would be a great area to contribute to the project!
-
It's easy to run Spark Connect locally, which is great if you're using Spark Connect in production and want an identical local runtime. You just need to start the Spark driver locally and then connect the client to the driver when establishing the Spark Session. Spark Connect decouples the client and driver.
-
PySpark supports Python user-defined table functions (UDTFs) as of the 3.5 release. A PySpark UDTF returns a table as output (many rows and columns), whereas a PySpark user-defined function (UDF) outputs only a single value. The following example shows how to create a Python UDTF that returns a table with two columns. See the "Introducing Python User-Defined Table Functions" post for more information and an example of how to use UDTFs with langchain.
-
PySpark queries leverage the power of a query optimizer, so they can be easier to write compared with pandas queries that require manual optimizations. For example, when reading a Parquet file, PySpark will automatically apply column pruning and row group filtering based on the query logic. These are important enhancements that can greatly speed up a query. When you read a Parquet file with pandas, you need to apply these enhancements manually. This can be tedious for large queries. It can also be error-prone - incorrect row group filtering logic can cause your query to return the wrong result.
-
It's simple to unit-test Spark DataFrame transformations. DataFrame transformations take a DataFrame as input and return a DataFrame. You can break up a Spark codebase into many DataFrame transformations, unit test them individually, and compose them for different analyses. This makes your code less complex and easier to test. Organizations that abstract core business logic into unit-tested reusable chunks are more productive. Here's a simple example: