Today we’ll be covering how to pass configuration options to Spark from your Great Expectations (GX) workflow! This is useful anytime you need to use additional flags with Spark.
All of these options are things you could set by connecting directly to your Spark instance and providing flags that way, then yielding the resultant data_frame.
But by setting these options from within GX itself, we can make our days a little easier. For example, if you're tuning something, you can declare a Spark option inline while you're doing it.
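For reference, here's what that direct route might look like with plain PySpark. This is a minimal sketch only; the app name and file path are placeholders, not part of the GX example:

from pyspark.sql import SparkSession

# A minimal sketch of the "direct" route, using plain PySpark outside of GX.
# The app name and file path are placeholders for illustration only.
spark = (
    SparkSession.builder.appName("my_app")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.executor.memory", "768m")
    .getOrCreate()
)

# Yield the resultant data_frame to the rest of your workflow.
data_frame = spark.read.csv("path/to/data.csv", header=True)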
The example we’ll reference for this post is here.
NOTE: You can reference tests and/or docs to find other stellar examples!
How do you do it?
You can pass Spark configuration options from GX both at runtime in your workflow and through your configuration. Pick the option that better fits your preference and/or environment!
At runtime
Let’s review the aforementioned Python example of providing a Spark configuration to the underlying Spark instance:
...
spark_config: Dict[str, str] = {
    ...
    "spark.sql.catalogImplementation": "hive",
    "spark.executor.memory": "768m",
    ...
}
As you can see, providing the flag to Spark is easy: just add the name of the flag as a key to the spark_config dictionary, with your desired value as the other half of the pair.
Since you’re working with Spark, don’t forget to read up on the deets of batch_spec_passthrough!
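If you're curious how that fits in, here's a rough sketch assuming the RuntimeBatchRequest API, where batch_spec_passthrough carries Spark reader details along with the batch request. Every name in it (datasource, connector, asset, path) is a hypothetical placeholder, not something from the example above:

from great_expectations.core.batch import RuntimeBatchRequest

# Rough sketch only: all names below are hypothetical placeholders.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="my_data_asset",
    runtime_parameters={"path": "path/to/data.csv"},
    batch_identifiers={"default_identifier_name": "default_id"},
    # Spark-specific reader details ride along here and reach the underlying reader.
    batch_spec_passthrough={"reader_method": "csv", "reader_options": {"header": True}},
)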
In your configuration
You can also provide the configuration options for Spark in your datasource configuration directly. We’ll use data_context_parameterized_expectation_suite.add_datasource() as our example:
data_context_parameterized_expectation_suite.add_datasource(
    dataset_name,
    class_name="SparkDFDatasource",
    spark_config=spark_config,
    force_reuse_spark_context=False,
    module_name="great_expectations.datasource",
    batch_kwargs_generators={},
)
datasource_config = data_context_parameterized_expectation_suite.get_datasource(
    dataset_name
).config
In short, in this scenario you provide the earlier configuration options for Spark to your datasource configuration directly. The specifics of the configuration are stored in great_expectations/great_expectations.yml.
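If you want to double-check what was stored, one quick sketch (assuming the returned config can be treated as a plain dictionary) is to read the Spark options back off the datasource you just added:

# Sketch: read the Spark options back off the datasource we just added.
stored_config = data_context_parameterized_expectation_suite.get_datasource(
    dataset_name
).config
assert stored_config["spark_config"]["spark.executor.memory"] == "768m"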
Tips
It always helps to validate the configuration! We can easily do that by following this example:
...
source: SparkDFDatasource = SparkDFDatasource(spark_config=spark_config)
spark_session: SparkSession = source.spark
...
After that, confirming the connection is just one spark_session.sparkContext._jsc.sc().isStopped() call away; it returns False while the session is live.
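And if you want to confirm that individual flags actually landed on the session, you can read them back with standard Spark APIs. A quick sketch using the flags from earlier:

# Confirm the session is live and the flags from spark_config took effect.
assert not spark_session.sparkContext._jsc.sc().isStopped()

conf = spark_session.sparkContext.getConf()
assert conf.get("spark.executor.memory") == "768m"
assert conf.get("spark.sql.catalogImplementation") == "hive"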
Before you go
You can view an example with many more options declared for Spark here:
expected_spark_config: Dict[str, str] = {
    "spark.app.name": "default_great_expectations_spark_application",
    "spark.default.parallelism": "4",
    "spark.driver.memory": "6g",
    "spark.executor.id": "driver",
    "spark.executor.memory": "6g",
    "spark.master": "local[*]",
    "spark.rdd.compress": "True",
    "spark.serializer.objectStreamReset": "100",
    "spark.sql.catalogImplementation": "hive",
    "spark.sql.shuffle.partitions": "2",
    "spark.submit.deployMode": "client",
    "spark.ui.showConsoleProgress": "False",
}
Get the full list of possible options, provided by the lovely folks at the Apache Software Foundation, here.
Great Expectations is part of an increasingly flexible and powerful modern data ecosystem. This is just one example of the ways in which Great Expectations is able to give you greater control of your data quality processes within that ecosystem.
We’re committed to supporting and growing the community around Great Expectations. It’s not enough to build a great platform; we want to build a great community as well. Join our public Slack channel here, find us on GitHub, sign up for one of our weekly cloud workshops, or head to https://greatexpectations.io/ to learn more.