Big Data With Pyspark: Processing Large Datasets: A Hands-On Guide To Distributed Data Engineering, Machine Learning And Big Data Pipelines With Apache Spark And Python - Softcover

 
9798290030715: Big Data With Pyspark: Processing Large Datasets: A Hands-On Guide To Distributed Data Engineering, Machine Learning And Big Data Pipelines With Apache Spark And Python

Synopsis

You'll Learn

  • Understand the Foundations of Big Data and Distributed Computing: Gain a solid grasp of Big Data concepts, including the 5 Vs, the challenges of traditional systems, and the fundamental principles of distributed computing like parallelism, fault tolerance, and scalability.

  • Master the PySpark Ecosystem: Learn the architecture of Apache Spark, its core components (Spark SQL, Structured Streaming, MLlib, GraphFrames), and how the PySpark API seamlessly integrates with Python.

  • Set Up Your PySpark Environment: Get hands-on experience setting up a complete development environment on your local machine and learn how to run applications on cloud platforms such as Databricks, AWS EMR, and Google Cloud Dataproc.

  • Process Data with RDDs and DataFrames: Master Spark's core data structures, from the low-level RDDs to the powerful and optimized DataFrames. Learn to apply a wide range of transformations and actions for data manipulation.

  • Perform Advanced Data Wrangling and Feature Engineering: Acquire skills in data cleaning, handling missing values and duplicates, and performing complex transformations using Spark SQL, Window Functions, and User-Defined Functions (UDFs), including high-performance Pandas UDFs.

  • Connect to Diverse Data Sources: Read and write data from various formats (CSV, JSON, Parquet) and connect to external systems like relational databases (JDBC), NoSQL stores (Cassandra, MongoDB), and cloud storage (S3, ADLS).

  • Build Real-Time Data Pipelines: Implement modern, fault-tolerant data ingestion with Structured Streaming, including handling event time, watermarking, and performing stateful transformations for real-time analytics.

  • Apply Machine Learning at Scale with MLlib: Learn to build and evaluate distributed machine learning pipelines for classification, regression, and clustering tasks using Spark's MLlib library.

  • Analyze Graph-Structured Data: Explore the power of GraphFrames to model and analyze complex relationships, run graph algorithms like PageRank, and find patterns in network data.

  • Optimize PySpark Applications for Performance: Dive deep into performance tuning, including understanding DAGs and shuffles, managing partitioning, optimizing joins, and configuring memory settings to make your code run faster and more efficiently.

  • Monitor, Debug, and Deploy Applications: Use the Spark UI to monitor your jobs, troubleshoot common errors, and learn to package and deploy your PySpark applications to cluster managers such as YARN and Kubernetes.

  • Solve Real-World Big Data Problems: Apply your knowledge through practical case studies, including building a recommendation engine, a real-time fraud detection system, and an ETL pipeline, to solidify your skills and build a portfolio.

The synopsis may refer to a different edition of this title.

EUR 8.52 for shipping from USA to Germany

Search results for Big Data With Pyspark: Processing Large Datasets: A...

Publishing, PythQuill
Publisher: Independently published, 2025
ISBN 13: 9798290030715
New Softcover
Print-on-Demand

Seller: California Books, Miami, FL, USA

Seller rating: 5 out of 5 stars

Condition: New. Print on Demand. Seller inventory number: I-9798290030715

Buy new

EUR 20.18
Shipping: EUR 8.52
From USA to Germany

Quantity: More than 20 available
