Contingency tables in SQL with Workbench
This tutorial can be run as a Jupyter notebook from the 5s-TES notebooks repository
Federated analysis on contingency tables is relatively simple. Counts are easy to federate: each TRE calculates their local count for some group, then these are aggregated by adding the counts together. Each cell of a contingency table is a count, so the table can be federated by requesting these counts, and then statistical analyses can be performed on the aggregate.
The example data we will work with here is produced by the following SQL query:
hypertension_neoplasm_query = """
WITH hypertension AS (
SELECT
person_id,
CASE
WHEN person_id IN (
SELECT person_id
FROM "DelphiDemo".condition_occurrence
WHERE condition_concept_id = 320128
) THEN 'has_hypertension' ELSE 'no_hypertension' END AS hypertension_status
FROM "DelphiDemo".person
)
SELECT
CASE
WHEN p.person_id IN (
SELECT person_id
FROM "DelphiDemo".condition_occurrence
WHERE condition_concept_id = 139750
) THEN 'with'
ELSE 'without'
END AS neoplasm_status,
hypertension.hypertension_status,
COUNT(p.person_id) as n
FROM "DelphiDemo".person p
JOIN hypertension ON p.person_id = hypertension.person_id
GROUP BY neoplasm_status, hypertension_status
"""
This is built into a TES message using the Five Safes TES workbench, to run on the Delphi dataset. This is designed to run on a container which will run SQL queries, such as this one: harbor.federated-analytics.ac.uk/5s-tes-analysis-tools/5s-tes-analysis-tools-tre-sqlpg:1.0.0, which is encoded into the template.
Expand for example JSON
{
"name": "hypertension x neoplasm contingency table",
"description": "Simple SQL Task",
"outputs": [
{
"url": "s3://",
"path": "/outputs",
"type": "DIRECTORY",
"name": "Output",
"description": "Output results"
}
],
"executors": [
{
"image": "harbor.federated-analytics.ac.uk/5s-tes-analysis-tools/5s-tes-analysis-tools-tre-sqlpg:1.0.0",
"command": [
"--Output=/outputs/output.csv",
"--Query=\nWITH hypertension AS (\n SELECT\n person_id,\n CASE\n WHEN person_id IN (\n SELECT person_id\n FROM \"DelphiDemo\".condition_occurrence\n WHERE condition_concept_id = 320128\n ) THEN 'has_hypertension' ELSE 'no_hypertension' END AS hypertension_status\n FROM \"DelphiDemo\".person\n)\n\nSELECT\n CASE\n WHEN p.person_id IN (\n SELECT person_id\n FROM \"DelphiDemo\".condition_occurrence\n WHERE condition_concept_id = 139750\n ) THEN 'with'\n ELSE 'without'\n END AS neoplasm_status,\n hypertension.hypertension_status,\n COUNT(p.person_id) as n\nFROM \"DelphiDemo\".person p\n JOIN hypertension ON p.person_id = hypertension.person_id\nGROUP BY neoplasm_status, hypertension_status\n"
]
}
],
"volumes": [],
"tags": {
"project": "DelphiDemo",
"tres": "Nottingham TRE 01|Nottingham TRE 02"
},
"creation_time": "2026-05-14T16:02:13.342420+00:00"
}
The data returned by this analysis can be read into tables using the contingency_table_utils module supplied.
paths = wb.fetch_outputs()
contingency_paths = [v[0] for k, v in paths.items()]
pre_contingency_tables = [ContingencyTable(pd.read_csv(path)) for path in contingency_paths]
The data can be retrieved from each like this:
pre_contingency_tables[0].data and pre_contingency_tables[1].data.
The data from each of the two TREs is as follows:
Nottingham TRE 01:
| neoplasm_status | hypertension_status | n |
|---|---|---|
| without | no_hypertension | 48955 |
| with | no_hypertension | 532 |
| with | has_hypertension | 13 |
| without | has_hypertension | 270 |
Nottingham TRE 02:
| neoplasm_status | hypertension_status | n |
|---|---|---|
| without | no_hypertension | 49012 |
| with | no_hypertension | 496 |
| with | has_hypertension | 7 |
| without | has_hypertension | 247 |
This is useful, but only gives informtation about the data in each TRE individually. In order to derive more value from the data, and perform statistical tests across both groups (the complete data set), the data must be aggregated.
aggregate_tables checks that your tables have the same variables, and sums the counts if they do.
aggregated = aggregate_tables(pre_contingency_tables)
contingency_table = aggregated.contingency_table["n"]
display(contingency_table)
|hypertension_status | has_hypertension | no_hypertension|
| neoplasm_status | ||
|---|---|---|
| with | 20 | 1019 |
| without | 517 | 97967 |
The contingency_table property organises this data into the format for statistical analyses.
This format can be used for scipy.stats contingency table functions.