Data Used in the Tutorials
Here are the data sources used throughout the tutorials, in order of appearance:
Census Population Estimates by County
A classic dataset used in data journalism education. From the Census Bureau’s Population Estimates office.
USDA Rural Development Investments Data
Data from the USDA covering grants, loans and other “investments” from the agency. Full data covers 2012 to 2024, though the tutorials use only 5 years. There is a data dictionary if needed.
NTSB Airplane Crash Data
The National Transportation Safety Board makes its civilian accident investigations, going back to 1962, searchable online. The tutorials use only the last 5 years. Once you enter a search, the results are downloadable.
U.S. Department of Education College Scorecard Data
Warning: This dataset is not for the faint of heart. You can bulk download it all here, but it’s big (390 MB zipped) and really, really clunky. There are multiple tables, and the main table is more than 3,000 columns wide. I use this data because it’s highly relevant to college students, but I have to do a lot of pre-processing to make it usable in this format.
U.S. Census Bureau Small Area Health Insurance Estimates (SAHIE) Program
Another Census Bureau program that is not The Census but is very important. The data is cleanly organized and well documented. You can get it from an API or using Census data tools. The data goes back to 2008.
U.S. Environmental Protection Agency Greenhouse Gas Reporting Program Data
Data from large emitting facilities regulated by the EPA. The data currently covers 2023 only.
National Conference of State Legislatures Data on State Legislative Partisan Composition
The data comes in PDFs but is not very hard to format. Note: The Great State of Nebraska has a unicameral legislature, members of which are called senators, which is why the tutorials focus only on state senate compositions.
Groundhog Day Predictions Dataset from Groundhog-Day.com
The fun one of the bunch. If you follow the Get The Data link here, you can bulk download the predictions going back many years. However, the further back you go, the thinner the data gets.