2 June 2026
CRDC School Arrest Rates — Bayesian Estimates
Model-based estimates of school-based arrest rates for U.S. school districts and states, by race and sex, derived from the Civil Rights Data Collection.
About this dataset
School-based arrests are one of the sharpest edges of school discipline data, and also among the sparsest. Most district-by-race-by-sex cells in the Civil Rights Data Collection contain very small counts, so a raw rate such as “2 arrests out of 41 students” is too noisy to compare against a district ten times its size. This dataset addresses that by replacing raw rates with Bayesian hierarchical estimates that partially pool across districts: small or noisy cells are pulled toward the broader pattern, and every estimate carries an explicit credible interval rather than a falsely precise point value.
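To make the partial-pooling idea concrete, here is a minimal sketch that uses a conjugate beta-binomial shortcut. It is not the hierarchical brms/Stan model behind this release, only an illustration of how shrinkage behaves. The prior values (`PRIOR_RATE`, `PRIOR_STRENGTH`) and the larger district's counts are hypothetical.

```python
def pooled_rate(arrests, enrolled, prior_rate, prior_strength):
    """Posterior mean of a rate under a Beta prior (beta-binomial shrinkage).

    The prior acts like `prior_strength` pseudo-students observed at
    `prior_rate`. Both values are hypothetical illustration inputs.
    """
    alpha = prior_rate * prior_strength
    beta = (1.0 - prior_rate) * prior_strength
    return (arrests + alpha) / (enrolled + alpha + beta)


# Hypothetical "broader pattern": a 1% rate worth 200 pseudo-students.
PRIOR_RATE, PRIOR_STRENGTH = 0.01, 200

# Same raw rate (about 4.9%) in a small cell and in one ten times larger.
small_raw = 2 / 41
large_raw = 20 / 410
small_pooled = pooled_rate(2, 41, PRIOR_RATE, PRIOR_STRENGTH)
large_pooled = pooled_rate(20, 410, PRIOR_RATE, PRIOR_STRENGTH)

print(f"small cell: raw {small_raw:.4f} -> pooled {small_pooled:.4f}")
print(f"large cell: raw {large_raw:.4f} -> pooled {large_pooled:.4f}")
```

Under these assumed values the small cell moves much further toward the prior than the large one, which is the qualitative behavior partial pooling is meant to deliver.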
The release covers eight demographic cells — race ∈ {AM, BL, HI, WH} crossed with sex ∈ {F, M} — across three CRDC collection years (2015–16, 2017–18, and 2021–22), for U.S. school districts (LEAs) and states.
What’s in the release
Two artifacts are published on Hugging Face under release civilytics-crdc-arrests-2025.1:
- summary.duckdb (~260 MB) is a compact DuckDB database with the tables that power the live API: arrest_summary (LEA grain, ~2.27M rows), state_summary, district_dim (names + geography), and a meta table. This is the file to grab if you want point estimates and intervals.
- parquet/ holds the full raw posterior draws (500 per group), Hive-partitioned by model_id / YEAR / LEA_STATE across ~1,387 shards. Reach for this if you need to propagate uncertainty through your own downstream computation.
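As a rough illustration of what propagating uncertainty with the raw draws can look like, the sketch below computes a rate ratio between two groups once per draw and then summarizes the resulting distribution. The draws here are synthetic stand-ins, not values from the dataset, and the sketch assumes that draw indices are aligned across groups within a model, which should be confirmed against the data dictionary. The interval shown is a simple equal-tailed one, not the highest-posterior-density interval used in the published summaries.

```python
import random
import statistics

random.seed(0)
N_DRAWS = 500  # the release stores 500 posterior draws per group

# Synthetic stand-ins for two groups' posterior rate draws (NOT real data).
rate_a = [random.betavariate(8, 400) for _ in range(N_DRAWS)]
rate_b = [random.betavariate(4, 400) for _ in range(N_DRAWS)]

# Compute the derived quantity once per draw, then summarize the results.
ratio_draws = sorted(a / b for a, b in zip(rate_a, rate_b))


def quantile(sorted_values, q):
    """Simple index-based quantile on an already sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]


ratio_median = statistics.median(ratio_draws)
ratio_interval = (quantile(ratio_draws, 0.025), quantile(ratio_draws, 0.975))

print(f"ratio median: {ratio_median:.2f}")
print(f"95% equal-tailed interval: {ratio_interval[0]:.2f} to {ratio_interval[1]:.2f}")
```

Dividing two point estimates would give a single number with no interval; working draw by draw keeps the uncertainty attached to the derived quantity.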
Quick start
You can query a single slice straight from Hugging Face with DuckDB — no full download required:
INSTALL httpfs; LOAD httpfs;
-- Raw draws for TX, Black males, default model, 2021-22:
SELECT *
FROM read_parquet(
'hf://datasets/civilytics/crdc-school-arrest-rates/parquet/model_id=nat_m2_mod/YEAR=21-22/LEA_STATE=TX/*.parquet'
)
WHERE RACE = 'BL' AND SEX = 'M'
LIMIT 20;

A live API serves the summary estimates at crdc-api.civilytics.org (OpenAPI/Swagger at /api/v1/__docs__/), and the full methodology and data dictionary are documented at pages.civilytics.org/crdc-arrests.
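Because the draws are Hive-partitioned, the path to any slice follows from three keys. The helper below is a hypothetical convenience, not part of the release; it rebuilds the glob used in the quick-start query. Only the `21-22` year label and the `TX` partition appear in the documentation above, so any other label or state passed in (such as `CA` here) is an assumption to check against the data dictionary.

```python
BASE = "hf://datasets/civilytics/crdc-school-arrest-rates/parquet"


def shard_glob(model_id, year, state):
    """Build the glob for one model / year / state partition (hypothetical helper)."""
    return f"{BASE}/model_id={model_id}/YEAR={year}/LEA_STATE={state}/*.parquet"


# The slice from the quick-start query above.
tx_glob = shard_glob("nat_m2_mod", "21-22", "TX")

# Several states at once; "CA" is assumed to exist as a partition.
multi_state_globs = [shard_glob("nat_m2_mod", "21-22", s) for s in ("TX", "CA")]

print(tx_glob)
```

DuckDB's `read_parquet` accepts either a single glob or a list of them, so a list like `multi_state_globs` can be interpolated into a query to read only the partitions you need.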
Provenance
Source: U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection (CRDC), school years 2015–16, 2017–18, and 2021–22. District names and geography are joined from the NCES Common Core of Data (CCD) district directories. The underlying CRDC counts are U.S. federal government public records.
Rates are produced by Bayesian hierarchical binomial models fit with brms/Stan, which partially pool sparse counts across districts to stabilize estimates and carry full posterior uncertainty. Ten specifications (nat_m1–nat_m5, sg_m1–sg_m5) are provided; the defaults (nat_m2 / sg_m2) use the most recent year plus a referral-rate covariate. Estimates include count and rate point estimates with highest-posterior-density intervals at the 50%, 80%, and 95% levels.
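For readers who want to recompute intervals from the raw draws, a highest-posterior-density interval can be approximated as the shortest window that contains the requested share of sorted draws. The sketch below is a generic empirical version, not necessarily the routine used to build the published summaries, and the draws are synthetic.

```python
import random


def hpd_interval(draws, mass):
    """Shortest interval containing roughly `mass` of the draws (empirical HPD)."""
    ordered = sorted(draws)
    n = len(ordered)
    k = max(1, int(round(mass * n)))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]


random.seed(1)
# Synthetic, right-skewed rate draws standing in for one group's posterior.
draws = [random.betavariate(2, 80) for _ in range(500)]

intervals = {level: hpd_interval(draws, level) for level in (0.50, 0.80, 0.95)}
for level, (lo, hi) in intervals.items():
    print(f"{int(level * 100)}% HPD: {lo:.4f} to {hi:.4f}")
```

For skewed posteriors such as small arrest rates, the shortest interval sits closer to zero than an equal-tailed interval would, which is one reason to prefer it here.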
Known caveats
- State summaries are draw-wise population aggregates (summed across LEAs within each posterior draw), which is distinct from the model’s (1|LEA_STATE) random effect; don’t read them as a state-level random intercept.
- Sample restrictions apply: districts with enrollment below 30, plus other data-quality business rules documented in the data dictionary, are excluded.
- Estimates are model-based, not raw counts. They are designed for cross-district comparison with calibrated uncertainty, not as the literal reported arrest tallies.
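The draw-wise aggregation described in the first caveat can be sketched as follows. The LEA names, counts, and enrollments are hypothetical, only three draws are shown instead of 500, and using total enrollment as the rate denominator is an assumption made for illustration.

```python
# Hypothetical posterior count draws for three LEAs in one state
# (three draws each for brevity; the release stores 500 per group).
lea_count_draws = {
    "LEA_A": [2, 3, 1],
    "LEA_B": [10, 12, 9],
    "LEA_C": [0, 1, 0],
}
# Hypothetical enrollments used as the denominator.
lea_enrollment = {"LEA_A": 400, "LEA_B": 2500, "LEA_C": 100}

n_draws = 3
total_enrollment = sum(lea_enrollment.values())

# Sum counts across LEAs within each draw, then convert to one state rate per draw.
state_count_draws = [
    sum(draws[i] for draws in lea_count_draws.values()) for i in range(n_draws)
]
state_rate_draws = [count / total_enrollment for count in state_count_draws]

print(state_count_draws)
print([round(rate, 4) for rate in state_rate_draws])
```

The state-level distribution comes from summing within each draw first and summarizing afterwards; summing per-LEA point estimates or interval endpoints would not give the same result.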