Apache Spark

Technology, Information and Internet

Berkeley, CA 16,058 followers

Unified engine for large-scale data analytics

About us

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Key Features
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R.
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

The most widely-used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors from industry and academia have contributed to the open source project.

Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.

Website
https://spark.apache.org/
Industry
Technology, Information and Internet
Company size
51-200 employees
Headquarters
Berkeley, CA
Type
Nonprofit
Specialties
Apache Spark, Big Data, Machine Learning, SQL Analytics, Batch, and Streaming

Updates

    It's well known that you can create Spark functions with Scala, Java, or Python, but did you know that you can also create Spark functions with SQL? Just use CREATE FUNCTION with familiar SQL syntax. Apache Spark is a full-featured SQL engine, so you can use Spark if you prefer SQL and don't want to use a general-purpose programming language.

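    A minimal sketch of the idea, assuming a Spark version that supports SQL-defined scalar functions and an active SparkSession named spark; the function name and logic are made up for illustration:

    # Register a scalar function using nothing but SQL.
    spark.sql("""
        CREATE OR REPLACE TEMPORARY FUNCTION fahrenheit_to_celsius(f DOUBLE)
        RETURNS DOUBLE
        RETURN (f - 32) * 5.0 / 9.0
    """)

    # Use it like any built-in SQL function.
    spark.sql("SELECT fahrenheit_to_celsius(98.6) AS temp_c").show()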

    Spark Connect's architecture allows the client to be fully decoupled from the Spark Driver. The client and Spark Driver communicate via gRPC: the client sends unresolved logical plans to the Spark Driver, which responds by streaming Arrow record batches back. Decoupling the client and Spark Driver gives users benefits like better remote development, IDE integrations, easier debugging, and less onerous upgrade requirements (the Spark Driver can move to a newer software version without the client necessarily having to upgrade).

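    A minimal sketch of the client side, assuming pyspark 3.4+ with Spark Connect support and a Spark Connect server already listening at the hypothetical address below:

    from pyspark.sql import SparkSession

    # The client only needs to speak gRPC to the Spark Connect endpoint;
    # no full Spark installation is required on the client machine.
    spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

    # DataFrame operations are sent as unresolved logical plans to the Spark Driver,
    # and results stream back as Arrow record batches.
    spark.range(10).filter("id % 2 = 0").show()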

    PySpark now has built-in testing utility functions to assert the equality of DataFrames. In the past, PySpark users had to rely on external libraries like chispa or spark-testing-base to get this DataFrame comparison functionality for unit testing, but now it's built right into Spark. The built-in Spark unit-testing functionality still needs to be expanded, and it would be a great area to contribute to the project!

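    A minimal sketch with the built-in helper, assuming pyspark 3.5+ and an active SparkSession named spark; the sample data is made up:

    from pyspark.testing import assertDataFrameEqual

    df_actual = spark.createDataFrame([("a", 1.0), ("b", 2.0)], ["letter", "value"])
    df_expected = spark.createDataFrame([("a", 1.0), ("b", 2.0)], ["letter", "value"])

    # Raises a descriptive assertion error if the schemas or rows differ;
    # does nothing (test passes) when the DataFrames match.
    assertDataFrameEqual(df_actual, df_expected)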

    It's easy to run Spark Connect locally, which is great if you're using Spark Connect in production and want an identical local runtime. You just need to start the Spark driver locally and then connect the client to the driver when establishing the Spark Session. Spark Connect decouples the client and driver.

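    A minimal sketch, assuming a local Apache Spark 3.5 distribution; the package coordinates and port are the documented defaults but may vary with your Spark/Scala version:

    from pyspark.sql import SparkSession

    # 1. Start a local Spark Connect server from the Spark distribution, e.g.:
    #      ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0
    # 2. Point the client at the local endpoint (default port 15002).
    spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

    # The same client code works unchanged against a remote production server.
    spark.range(5).show()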

    PySpark supports Python user-defined table functions (UDTFs) as of the 3.5 release. A PySpark UDTF returns a table as output (many rows and columns) rather than a single value, whereas a PySpark user-defined function (UDF) outputs only a single value. The following example shows how to create a Python UDTF that returns a table with two columns. See the "Introducing Python User-Defined Table Functions" post for more information and an example of how to use UDTFs with LangChain.

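    A minimal sketch of a Python UDTF, assuming pyspark 3.5+ and an active SparkSession; the class name and output schema are made up for illustration:

    from pyspark.sql.functions import lit, udtf

    # A UDTF can yield zero or more rows per input instead of a single value.
    @udtf(returnType="word: string, length: int")
    class SplitWords:
        def eval(self, text: str):
            for word in text.split(" "):
                yield (word, len(word))

    # Calling the UDTF produces a table with two columns: word and length.
    SplitWords(lit("hello world")).show()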

    Spark 3.5 added some awesome new array helper functions. For example, array_append and array_prepend make it easy to add elements to an array column in a PySpark DataFrame. The Apache Spark array_* functions make it quite easy to manipulate array columns any way you'd like.

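    A minimal sketch, assuming pyspark 3.5+ and an active SparkSession named spark; the sample data is made up:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([([1, 2, 3],)], ["nums"])

    df.select(
        F.array_append("nums", F.lit(4)).alias("appended"),    # [1, 2, 3, 4]
        F.array_prepend("nums", F.lit(0)).alias("prepended"),  # [0, 1, 2, 3]
    ).show(truncate=False)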

    PySpark queries leverage the power of a query optimizer, so they can be easier to write than pandas queries, which require manual optimizations. For example, when reading a Parquet file, PySpark will automatically apply column pruning and row group filtering based on the query logic. These are important enhancements that can greatly speed up a query. When you read a Parquet file with pandas, you need to apply these enhancements manually, which can be tedious for large queries. It can also be error-prone: incorrect row group filtering logic can cause your query to return the wrong result.

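    A minimal sketch of the idea, assuming an active SparkSession named spark; the Parquet path and column names are hypothetical:

    df = spark.read.parquet("/data/sales.parquet")

    result = df.filter(df.year == 2023).select("order_id", "amount")

    # The physical plan typically shows the optimizer at work: the Parquet scan
    # reads only the selected columns (column pruning) and lists the year
    # predicate under PushedFilters (used for row group filtering), with no
    # manual tuning required.
    result.explain()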

    It's simple to unit-test Spark DataFrame transformations. DataFrame transformations take a DataFrame as input and return a DataFrame. You can break up a Spark codebase into many DataFrame transformations, unit test them individually, and compose them for different analyses. This makes your code less complex and easier to test. Organizations that abstract core business logic into unit-tested reusable chunks are more productive. Here's a simple example:

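    A minimal sketch of the pattern, assuming pyspark 3.5+ (for the built-in assertDataFrameEqual helper); the function and column names are illustrative:

    from pyspark.sql import DataFrame, functions as F
    from pyspark.testing import assertDataFrameEqual

    # A DataFrame transformation: takes a DataFrame, returns a DataFrame.
    def with_full_name(df: DataFrame) -> DataFrame:
        return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

    # Unit test: build a small input, apply the transformation, compare to expected.
    def test_with_full_name(spark):
        source = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
        expected = spark.createDataFrame(
            [("Ada", "Lovelace", "Ada Lovelace")],
            ["first_name", "last_name", "full_name"],
        )
        assertDataFrameEqual(with_full_name(source), expected)

    # Transformations compose cleanly, e.g. df.transform(with_full_name).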
