Data / External / 2 June 2026

CRDC School Arrest Rates — Bayesian Estimates

Model-based estimates of school-based arrest rates for U.S. school districts and states, by race and sex, derived from the Civil Rights Data Collection.

Hosted on Hugging Face. This page is a citation-ready stub. The data, documentation, and downloads live at https://huggingface.co/datasets/civilytics/crdc-school-arrest-rates.
Version: v2025.1 · Released · License: ODC-BY 1.0 · Hosted on: Hugging Face
Demographic cells: 8 (race × sex)
School years: 3 (2015–22)
Posterior draws: 500 per group
Models: 10 specs

About this dataset

School-based arrests are one of the sharpest edges of school discipline data, and they are also some of the sparsest. Most district-by-race-by-sex cells in the Civil Rights Data Collection contain very small counts, where a raw rate of “2 arrests out of 41 students” is too noisy to compare against a district ten times its size. This dataset addresses that by replacing raw rates with Bayesian hierarchical estimates that partially pool across districts: small or noisy cells are pulled toward the broader pattern, and every estimate carries an explicit credible interval instead of a false-precision point.
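To make the pull toward the broader pattern concrete, here is a minimal Python sketch using a conjugate Beta-binomial model. It is a simplified stand-in, not the hierarchical models behind this release, and `prior_rate` and `prior_strength` are hypothetical values chosen purely for illustration.

```python
def shrunken_rate(arrests, students, prior_rate, prior_strength):
    """Posterior mean arrest rate under a Beta prior (illustration only).

    prior_rate     -- assumed "broader pattern" rate (hypothetical value)
    prior_strength -- assumed prior weight, in pseudo-students (hypothetical)
    """
    a = prior_rate * prior_strength          # prior pseudo-arrests
    b = (1.0 - prior_rate) * prior_strength  # prior pseudo-non-arrests
    return (a + arrests) / (a + b + students)


# Two cells with the same raw rate (2/41 == 20/410) but different sizes.
small = shrunken_rate(2, 41, prior_rate=0.005, prior_strength=200)
large = shrunken_rate(20, 410, prior_rate=0.005, prior_strength=200)

print(f"raw rate:       {2 / 41:.4f}")
print(f"small district: {small:.4f}")
print(f"large district: {large:.4f}")
```

Both cells share a raw rate of 2/41, but the 41-student cell is pulled much closer to the assumed prior rate because it carries less information.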

The release covers eight demographic cells — race ∈ {AM, BL, HI, WH} crossed with sex ∈ {F, M} — across three CRDC collection years (2015–16, 2017–18, and 2021–22), for U.S. school districts (LEAs) and states.

What’s in the release

Two artifacts are published on Hugging Face under release civilytics-crdc-arrests-2025.1:

  • summary.duckdb (~260 MB) — a compact DuckDB with the tables that power the live API: arrest_summary (LEA grain, ~2.27M rows), state_summary, district_dim (names + geography), and a meta table. This is the file to grab if you want point estimates and intervals.
  • parquet/ — the full raw posterior draws (500 per group), Hive-partitioned by model_id / YEAR / LEA_STATE across ~1,387 shards. Reach for this if you need to propagate uncertainty through your own downstream computation.

Quick start

You can query a single slice straight from Hugging Face with DuckDB — no full download required:

INSTALL httpfs; LOAD httpfs;

-- Raw draws for TX, Black males, default model, 2021-22:
SELECT *
FROM read_parquet(
  'hf://datasets/civilytics/crdc-school-arrest-rates/parquet/model_id=nat_m2_mod/YEAR=21-22/LEA_STATE=TX/*.parquet'
)
WHERE RACE = 'BL' AND SEX = 'M'
LIMIT 20;
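
Because the shards are Hive-partitioned by model_id / YEAR / LEA_STATE, the path in the query above can be assembled programmatically. The following Python helper is a small sketch; `shard_glob` is a hypothetical name, and the only partition values confirmed here are the ones that appear in the query above.

```python
BASE = "hf://datasets/civilytics/crdc-school-arrest-rates/parquet"


def shard_glob(model_id, year, state):
    """Build the glob for one model / year / state partition.

    Partition value formats are taken from the single example in the
    quick start; other values are assumed to follow the same pattern.
    """
    return (
        f"{BASE}/model_id={model_id}/YEAR={year}/LEA_STATE={state}/*.parquet"
    )


# Reproduces the path used in the SQL example.
path = shard_glob("nat_m2_mod", "21-22", "TX")
print(path)
```

The resulting string can be passed to `read_parquet(...)` in place of the literal path.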

A live API serves the summary estimates at crdc-api.civilytics.org (OpenAPI/Swagger at /api/v1/__docs__/), and the full methodology and data dictionary are documented at pages.civilytics.org/crdc-arrests.

Provenance

Source: U.S. Department of Education, Office for Civil Rights, Civil Rights Data Collection (CRDC), school years 2015–16, 2017–18, and 2021–22. District names and geography are joined from the NCES Common Core of Data (CCD) district directories. The underlying CRDC counts are U.S. federal government public records.

Rates are produced by Bayesian hierarchical binomial models fit with brms/Stan, which partially pool sparse counts across districts to stabilize estimates and carry full posterior uncertainty. Ten specifications (nat_m1–nat_m5, sg_m1–sg_m5) are provided; the defaults (nat_m2 / sg_m2) use the most recent year plus a referral-rate covariate. Estimates include count and rate point estimates with highest-posterior-density intervals at the 50%, 80%, and 95% levels.
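
For readers working from the raw draws, a highest-posterior-density interval can be approximated as the shortest window that contains the requested share of sorted draws. The Python sketch below shows that common empirical approach; it assumes a unimodal posterior and is not necessarily the routine used to produce the published intervals. `hpd_interval` is a hypothetical helper name.

```python
import math


def hpd_interval(draws, mass=0.95):
    """Shortest interval containing `mass` of the draws (unimodal case)."""
    if not draws:
        raise ValueError("draws must be non-empty")
    xs = sorted(draws)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws inside the interval
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted draws; keep the narrowest.
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Toy right-skewed draws (made-up numbers, not from the dataset).
toy_draws = [13, 8, 5, 3, 2, 1, 0, 0, 0, 0]
print(hpd_interval(toy_draws, mass=0.5))
```

For skewed posteriors such as arrest rates near zero, this interval sits closer to the mode than an equal-tailed interval would.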

Known caveats

  • State summaries are draw-wise population aggregates (summed across LEAs within each posterior draw), which is distinct from the model’s (1|LEA_STATE) random effect — don’t read them as a state-level random intercept.
  • Sample restrictions apply: districts with enrollment below 30, plus other data-quality business rules documented in the data dictionary, are excluded.
  • Estimates are model-based, not raw counts. They are designed for cross-district comparison with calibrated uncertainty, not as the literal reported arrest tallies.
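
The draw-wise aggregation described in the first caveat can be sketched in Python as follows. This is an illustration under stated assumptions: draws are aligned by index across LEAs, the inputs are plain dictionaries rather than the release's actual tables, and `state_rate_draws` and all values are hypothetical.

```python
def state_rate_draws(lea_count_draws, lea_enrollment):
    """Aggregate LEA-level count draws into state-level rate draws.

    lea_count_draws -- {lea_id: [count in draw 0, count in draw 1, ...]}
    lea_enrollment  -- {lea_id: enrollment}
    """
    lengths = {len(d) for d in lea_count_draws.values()}
    if len(lengths) != 1:
        raise ValueError("every LEA needs the same number of draws")
    total_enrollment = sum(lea_enrollment[lea] for lea in lea_count_draws)
    # Sum counts across LEAs *within* each draw index, then divide once.
    totals = [sum(per_draw) for per_draw in zip(*lea_count_draws.values())]
    return [t / total_enrollment for t in totals]


# Two made-up LEAs with three draws each.
counts = {"A": [1, 2, 3], "B": [0, 1, 5]}
enrollment = {"A": 100, "B": 300}
rates = state_rate_draws(counts, enrollment)
print(rates)
```

Summarizing `rates` (for example with a mean or an interval) gives a state-level population estimate, which is a different quantity from a state random intercept in the model.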