Want to learn more about #pyspark #memory profiling? Have questions about PySpark #UDFs, profiling hot loops of a UDF, or profiling the memory of a UDF? This #AMA is a follow-up to the popular post How to Profile PySpark https://lnkd.in/gQ2MzXG5 #apachespark
Apache Spark
Technology, Information and Internet
Berkeley, CA 16,058 followers
Unified engine for large-scale data analytics
About us
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Key Features
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R.
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
The most widely-used engine for scalable computing: Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors to the open source project from industry and academia.
Ecosystem: Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
- Website: https://spark.apache.org/
- Industry: Technology, Information and Internet
- Company size: 51-200 employees
- Headquarters: Berkeley, CA
- Type: Nonprofit
- Specialties: Apache Spark, Big Data, Machine Learning, SQL Analytics, Batch, and Streaming
Locations
- Primary: Berkeley, CA, US
Updates
-
It's well known that you can create Spark functions with Scala/Java/Python, but did you know that you can also create Spark functions with SQL? Just use CREATE FUNCTION and use familiar SQL syntax. Apache Spark is a full-featured SQL engine. You can use Spark if you prefer SQL and don't want to use a programming language.
-
Spark Connect's architecture allows the client to be fully decoupled from the Spark Driver. The client and Spark Driver connect via gRPC. The client sends unresolved logical plans to the Spark Driver, which responds by streaming Arrow record batches. Decoupling the client and Spark Driver provides user benefits like better remote development, IDE integrations, easier debugging, and less onerous update requirements (the Spark Driver can update software versions and the client doesn't necessarily have to update).
-
PySpark now has built-in testing utility functions to assess the equality of DataFrames. In the past, PySpark users had to rely on external libraries like chispa or spark-testing-base to get this DataFrame comparison functionality for unit testing, but now it's built right into Spark. The built-in Spark unit-testing functionality needs to be expanded and it would be a great area to contribute to the project!
-
It's easy to run Spark Connect locally, which is great if you're using Spark Connect in production and want an identical local runtime. You just need to start the Spark driver locally and then connect the client to the driver when establishing the Spark Session. Spark Connect decouples the client and driver.
-
PySpark supports Python user-defined table functions (UDTFs) as of the 3.5 release. A PySpark UDTF returns a table as output (many rows and columns), whereas a PySpark user-defined function (UDF) outputs only a single value. The following example shows how to create a Python UDTF that returns a table with two columns. See the "Introducing Python User-Defined Table Functions" post for more information and an example of how to use UDTFs with langchain.
-
PySpark queries leverage the power of a query optimizer, so they can be easier to write compared with pandas queries that require manual optimizations. For example, when reading a Parquet file, PySpark will automatically apply column pruning and row group filtering based on the query logic. These are important enhancements that can greatly speed up a query. When you read a Parquet file with pandas, you need to apply these enhancements manually. This can be tedious for large queries. It can also be error-prone - incorrect row group filtering logic can cause your query to return the wrong result.
-
It's simple to unit-test Spark DataFrame transformations. DataFrame transformations take a DataFrame as input and return a DataFrame. You can break up a Spark codebase into many DataFrame transformations, unit test them individually, and compose them for different analyses. This makes your code less complex and easier to test. Organizations that abstract core business logic into unit-tested reusable chunks are more productive. Here's a simple example: