
Specify nonstandard delimiters for CSVs in Great Expectations

Great Expectations offers you direct access to the reader methods and options of your Execution Engine via the batch_spec_passthrough parameter.

Austin Robinson
November 15, 2022
Black and white photo of the semicolon key on a typewriter
It’s just a simple parameter adjustment for GX to handle CSVs where the delimiter isn’t actually a comma. (📸: Connor Pope via Unsplash)

Great Expectations can use the Pandas and Spark Execution Engines to access and validate your CSV file data. To do this, you query your data by providing a Batch Request that contains all the details needed to return the expected Batch of data using your Datasource configuration.

For many files, the data is indeed separated by commas, and you can proceed with your normal GX workflow.

Somewhat frequently, though, supposedly comma-separated values are in fact separated by something else, such as a semicolon or a tab. If the reader still expects a comma delimiter, it will generally find no delimiters at all and collapse every row into a single column: each row's values become one long entry.
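To see the effect outside of GX, here is a minimal sketch that uses only Python's standard csv module on a made-up, semicolon-separated sample. The sample data and variable names are illustrative assumptions, not part of Great Expectations.

```python
import csv
import io

# Hypothetical sample: a "CSV" whose values are actually separated by semicolons.
raw = "id;name;city\n1;alice;Oslo\n2;bob;Lima\n"

# Parsed with the default comma delimiter, every row collapses into one field.
comma_rows = list(csv.reader(io.StringIO(raw)))

# Parsed with the real delimiter, the three columns are recovered.
semicolon_rows = list(csv.reader(io.StringIO(raw), delimiter=";"))

print(comma_rows[1])      # one field holding the whole row
print(semicolon_rows[1])  # three separate fields
```

The same thing happens inside an Execution Engine's reader: unless it is told about the real delimiter, it has nothing to split on.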

You can specify the delimiter used to read a file through the reader methods and options of your Execution Engine. Great Expectations provides the batch_spec_passthrough parameter within a Batch Request to give you direct access to those reader methods and options.

To tell Great Expectations how to handle CSVs with non-comma delimiters, simply pass the reader_options appropriate to your file into the batch_spec_passthrough parameter of your Batch Request.

from great_expectations.core.batch import RuntimeBatchRequest

# Read a semicolon-delimited file by passing "sep" through to the reader.
batch_request = RuntimeBatchRequest(
    datasource_name="my_filesystem_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="example_data_asset",
    runtime_parameters={"path": "path/to/data.csv"},
    batch_identifiers={"default_identifier_name": 1234567890},
    batch_spec_passthrough={"reader_options": {"sep": ";"}},
)


Note that you can pass any reader options supported by the execution engine you’re using (Pandas or Spark). 

For example, a batch_spec_passthrough setting a semicolon as the delimiter and interpreting blank lines as null values, read with the Pandas Execution Engine, might look something like:

batch_request = RuntimeBatchRequest(
    datasource_name="my_filesystem_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="example_data_asset",
    runtime_parameters={"path": "path/to/data.csv"},
    batch_identifiers={"default_identifier_name": 1234567890},
    batch_spec_passthrough={
        "reader_options": {"sep": ";", "skip_blank_lines": False}
    },
)
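To check what those two options do before wiring them into a Batch Request, you can try them directly against pandas. The sketch below assumes the reader_options are forwarded to pd.read_csv() as keyword arguments, and it uses a small in-memory sample that is purely illustrative.

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-delimited, with one blank line in the middle.
raw = "id;name\n1;alice\n\n2;bob\n"

# The same dictionary you would place under "reader_options".
reader_options = {"sep": ";", "skip_blank_lines": False}

# Assumption: the options reach pd.read_csv() as keyword arguments.
df = pd.read_csv(io.StringIO(raw), **reader_options)

# For comparison: pandas' default behavior, which skips blank lines.
df_default = pd.read_csv(io.StringIO(raw), sep=";")

print(df)          # the blank line appears as a row of NaN values
print(df_default)  # the blank line is dropped
```

With skip_blank_lines set to False the blank line is kept as a row of missing values, which lets you write Expectations that catch it instead of silently losing it.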


A similar process with the Spark Execution Engine, with tabs as the delimiter, might look like:

batch_request = RuntimeBatchRequest(
    datasource_name="my_filesystem_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="example_data_asset",
    runtime_parameters={"path": "path/to/data.csv"},
    batch_identifiers={"default_identifier_name": 1234567890},
    # "\t" is the tab character; "/t" would be read as a literal slash and "t".
    batch_spec_passthrough={
        "reader_options": {"delimiter": "\t", "mode": "PERMISSIVE"}
    },
)
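The Pandas example above names the delimiter option sep, while the Spark example names it delimiter. If your project switches between engines, a small helper can keep that difference in one place. The function below is a hypothetical convenience written for this post, not part of the Great Expectations API, and it only covers the two option names used in the examples.

```python
def delimiter_passthrough(engine: str, delimiter: str) -> dict:
    """Build a batch_spec_passthrough dict for a delimiter (hypothetical helper)."""
    # Option names as used in the examples in this post.
    option_name = {"pandas": "sep", "spark": "delimiter"}
    try:
        key = option_name[engine.lower()]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine!r}") from None
    return {"reader_options": {key: delimiter}}


pandas_passthrough = delimiter_passthrough("pandas", ";")
spark_passthrough = delimiter_passthrough("spark", "\t")

print(pandas_passthrough)
print(spark_passthrough)
```

Either result can be passed as the batch_spec_passthrough argument of a RuntimeBatchRequest.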


For more on your options with batch_spec_passthrough, check out the Pandas pd.read_csv() documentation and the Spark DataFrameReader documentation.



Great Expectations is part of an increasingly flexible and powerful modern data ecosystem. This is just one example of the ways in which Great Expectations is able to leverage that ecosystem to give you greater control of your data quality processes.

We’re committed to supporting and growing the community around Great Expectations. It’s not enough to build a great platform; we want to build a great community as well. Join our public Slack channel, find us on GitHub, sign up for one of our weekly cloud workshops, or head to https://greatexpectations.io/ to learn more.


©2024 Great Expectations. All Rights Reserved.