Calculating public burden using OIRA data -- Part Two
An experiment in using open data to make government better
Yesterday, I published an article about using open government data to hunt for paper-based information requests by the government. Based on the data, it looked like a lot of hours are still being spent filling out paper forms. As I noted, though, I ran out of time to do a careful analysis. So today, let's dig deeper.
First, we'll create a histogram to look at the distribution of burden across requests. To do so, we'll use pandas to examine the results data, and specifically its histogram plotting method.
# Set up the graphing environment. Because I'm using jupyter notebooks, first I need to tell
# it to show the graphs inline. I also use the `ggplot` style, because it's less hideous.
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import pandas as pd
data = pd.read_json('results.json')
data.burden.plot.hist()
Wait. Hold on right there. That's not what you'd expect to see. That looks like there's an outlier. Let's see what that might be... To do so, we look for the top ten burdens.
data[["burden", "title"]].sort_values('burden', ascending=False).head(10)
Oh dear. Looks like we've got a pretty obvious mistake here: "U.S. Business Income Tax Return" can definitely be filed electronically. Same with the other things on the list. And that one outlier accounts for 3 billion of the 3.3 billion hours. Oof. So what gives?
Well, it turns out that OIRA displays the burden data such that if any of the forms within an information collection request is not available electronically, the burden for all of the request's forms gets aggregated under it. And unfortunately, there doesn't seem to be an obvious way to back out the burden of the other forms. So that number isn't very useful.
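To see why that aggregation matters so much, here's a small sketch of how a single aggregated outlier can dominate a total. The numbers below are made up for illustration; the real figures come from `results.json`.

```python
import pandas as pd

# Hypothetical burden figures (hours) -- not the real OIRA data.
data = pd.DataFrame({
    "title": ["U.S. Business Income Tax Return", "Form A", "Form B"],
    "burden": [3_000_000_000, 150_000_000, 5_000_000],
})

# The largest single entry's share of the total burden.
share = data.burden.max() / data.burden.sum()
print("{:.0%} of total burden".format(share))  # → 95% of total burden
```

One aggregated entry swamps everything else, which is exactly what the histogram above showed.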
Let's see what the total burden is if you remove the top 20% (the 220 largest) of information collection requests.
"{:,} hours".format(data.burden.sum() - data.sort_values('burden', ascending=False).head(220).burden.sum())
So, that feels a lot more sane, and a lot less exciting: only 5,589,316 hours of public burden for everything but the top 20% of information collection requests.
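Rather than hard-coding the number of rows to drop, the same trim can be expressed with a burden quantile cutoff. Here's a sketch on synthetic data; the quantile value and figures are illustrative, not from the real dataset.

```python
import pandas as pd

# Synthetic burden figures for illustration.
data = pd.DataFrame({"burden": [10, 20, 30, 40, 1000]})

# Burden value at the 80th percentile; rows above it are the "top 20%".
cutoff = data.burden.quantile(0.8)
trimmed_total = data.burden[data.burden <= cutoff].sum()
print("{:,} hours".format(trimmed_total))  # → 100 hours
```

This avoids having to count rows by hand when the dataset changes size.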
In the end, this is a great lesson in how a data schema can lead to incorrect conclusions.
Still, we have some good data near the bottom of the chart.
data.sort_values('burden').head(890).burden.plot.hist(bins=30)
In other words, there are a lot of information collection requests that each account for only a couple hundred hours of public burden. That's not a surprising result, but it may be the more useful one: it suggests there are about 200 forms in the middle that account for much of the remaining burden hours. That seems like a good place to start.
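One way to pin down that middle band is to look at the cumulative share of burden across forms sorted from largest to smallest. This sketch uses made-up numbers and a hypothetical 90% threshold to show the shape of the calculation.

```python
import pandas as pd

# Made-up burden figures for illustration only.
data = pd.DataFrame({"burden": [500, 400, 300, 200, 100, 50, 25]})

# Cumulative share of total burden, largest forms first.
ranked = data.burden.sort_values(ascending=False)
cum_share = ranked.cumsum() / ranked.sum()

# How many forms it takes to cover 90% of the total burden.
n_forms = (cum_share < 0.9).sum() + 1
print(n_forms)  # → 5
```

Run against the real data, the same calculation would tell you how many of those middle forms are worth targeting first.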