S3 + Spark (for macOS)
Install Apache Spark on macOS and connect it with AWS S3 using the s3a:// connector.
1. Install Apache Spark
```sh
brew install apache-spark
```
Check the version:
```sh
spark-shell --version
```
You should see `Spark version 4.0.1` (or similar).
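The JAR and configuration steps below reference Spark's install path. Homebrew can tell you where that is; the exact Cellar path used later in this guide is an example and depends on your Homebrew prefix and installed version:

```sh
# Print the Homebrew prefix for Spark; the libexec directory inside it is Spark's home
brew --prefix apache-spark
# Typically resolves to something like /opt/homebrew/opt/apache-spark,
# which symlinks into /opt/homebrew/Cellar/apache-spark/<version>/
```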
2. Configure AWS Credentials
Verify AWS is configured:
```sh
aws configure list
```
This should show your credentials from:
```text
~/.aws/credentials
~/.aws/config
```
Spark automatically picks up these credentials.
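For reference, the two files typically look like this (placeholder values, not real keys; your profile names and region may differ):

```text
# ~/.aws/credentials
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>

# ~/.aws/config
[default]
region = us-east-1
```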
3. Download S3 Connector JARs
IMPORTANT: Spark 4.0.1 uses Hadoop 3.4.1, so you need matching versions of the JARs.
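To confirm which Hadoop version your Spark build bundles, you can list the Hadoop client JARs it ships with (the path below assumes the Homebrew install from step 1; adjust it to your version):

```sh
# The hadoop-client-* JAR names include the bundled Hadoop version, e.g. hadoop-client-api-3.4.1.jar
ls /opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/jars/ | grep hadoop-client
```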
Create a directory for JARs in Spark's installation:
```sh
sudo mkdir -p /opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/jars/s3-connector
cd /opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/jars/s3-connector
```
Download the required JARs (compatible versions):
```sh
# Hadoop AWS 3.4.1
sudo curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.4.1/hadoop-aws-3.4.1.jar -o hadoop-aws-3.4.1.jar

# AWS SDK v1 Bundle
sudo curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.780/aws-java-sdk-bundle-1.12.780.jar -o aws-java-sdk-bundle-1.12.780.jar

# AWS SDK v2 Bundle
sudo curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.29.52/bundle-2.29.52.jar -o bundle-2.29.52.jar
```
Or copy the JARs directly into Spark's jars folder (simpler method):
```sh
# If you already downloaded them to ~/big_data_course/jars/
sudo cp ~/big_data_course/jars/*.jar /opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/jars/
```
4. Configure Spark Defaults (Direct Access - No Script Needed)
Create/edit Spark's default configuration file:
```sh
sudo nano /opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/conf/spark-defaults.conf
```
Add these lines:
```properties
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider
spark.hadoop.fs.s3a.connection.timeout 60000
spark.hadoop.fs.s3a.connection.establish.timeout 60000
spark.hadoop.fs.s3a.socket.timeout 60000
```
Save and exit (Ctrl+X, then Y, then Enter).
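To confirm the defaults are being picked up, you can read one of the values back from inside a Spark session once you restart it (a quick sanity check; the key below is one of the entries added above):

```python
# Run inside pyspark after restarting it; reads the setting loaded from spark-defaults.conf
print(spark.sparkContext.getConf().get("spark.hadoop.fs.s3a.impl"))
# Expected output: org.apache.hadoop.fs.s3a.S3AFileSystem
```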
Note: Using `software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider` (AWS SDK v2) avoids deprecation warnings. The old v1 provider `com.amazonaws.auth.DefaultAWSCredentialsProviderChain` still works but shows warnings.
5. Start Spark with S3 Support (Direct Access)
Now you can use Spark commands directly without any extra parameters:
PySpark Shell (Direct)
```sh
pyspark
```
Spark Shell (Direct)
```sh
spark-shell
```
Spark Submit (Direct)
```sh
spark-submit your_script.py
```
That's it! S3 support is now permanently enabled for all Spark sessions.
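If you prefer not to edit spark-defaults.conf, the same s3a options can be set per session instead (useful in scripts run with spark-submit). A minimal sketch that mirrors the configuration above; the app name is just an example:

```python
from pyspark.sql import SparkSession

# Per-session alternative to spark-defaults.conf: set the same s3a options on the builder
spark = (
    SparkSession.builder
    .appName("S3PerSessionConfig")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider",
    )
    .getOrCreate()
)
```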
6. Test Connection to S3
PySpark Example:
```python
from pyspark.sql import SparkSession

# No extra configuration needed - uses spark-defaults.conf
spark = SparkSession.builder.appName("S3Test").getOrCreate()

# Read from S3
df = spark.read.text("s3a://your-bucket-name/path/file.txt")
df.show()

# Write to S3
df.write.mode("overwrite").parquet("s3a://your-bucket-name/output/")
```
Scala Example (in spark-shell):
```scala
// Read from S3
spark.read.text("s3a://your-bucket-name/path/file.txt").show()

// Write to S3
spark.range(10).write.mode("overwrite").csv("s3a://your-bucket-name/test-output/")
```
Simple Command Line Test:
```sh
# Start PySpark and test immediately
pyspark
```
Then in PySpark:
```python
spark.read.text("s3a://your-bucket-name/file.txt").show()
```
Important Notes
- Always use the `s3a://` protocol (not `s3://` or `s3n://`)
- All 3 JARs are required - missing any of them will cause errors (see the `--jars` sketch after this list)
- Version compatibility matters - Spark 4.0.1 needs Hadoop 3.4.1 JARs
- AWS credentials must be configured via `aws configure`
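If you did not copy the JARs into Spark's jars folder, you can also pass all three on the command line for a one-off run; a sketch assuming the s3-connector directory created in step 3 (adjust the path and script name to your setup):

```sh
# Pass the three connector JARs explicitly for a single spark-submit run (comma-separated list)
JARS_DIR=/opt/homebrew/Cellar/apache-spark/4.0.1_1/libexec/jars/s3-connector
spark-submit \
  --jars $JARS_DIR/hadoop-aws-3.4.1.jar,$JARS_DIR/aws-java-sdk-bundle-1.12.780.jar,$JARS_DIR/bundle-2.29.52.jar \
  your_script.py
```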
Troubleshooting
Error: ClassNotFoundException: S3AFileSystem
- Make sure all 3 JAR files are on Spark's classpath (copied into the jars folder, or passed via the `--jars` parameter)
Error: NoSuchMethodError
- Version mismatch - use Hadoop 3.4.1 JARs (not 3.4.2)
Error: Access Denied
- Check AWS credentials: `aws configure list`
- Verify S3 bucket permissions (see the quick CLI check below)
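A quick way to rule Spark out when you hit Access Denied is to try the same bucket with the AWS CLI first (bucket name is a placeholder):

```sh
# If this fails too, the problem is your credentials or bucket policy, not Spark
aws s3 ls s3://your-bucket-name/
```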
Installation complete! Test your connection with the examples above.