Calculating public burden using OIRA data -- Part Two
An experiment in using open data to make government better
Yesterday, I published an article about using open government data to hunt for paper-based information requests by the government. Based on the data, it looked like a lot of hours are still being spent filling out paper forms. As I noted, though, I ran out of time to do a careful analysis. So today, let's dig deeper.
First, we'll create a histogram to look at the distribution of burden across requests. To do so, we'll use pandas to examine the results data, and specifically its histogram plotting method.
# Set up the graphing environment. Because I'm using jupyter notebooks, first I need to tell
# it to show the graphs inline. I also use the `ggplot` style, because it's less hideous.
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import pandas as pd
data = pd.read_json('results.json')
data.burden.plot.hist()
Wait. Hold on right there. That's not what you'd expect to see. That looks like there's an outlier. Let's see what that might be... To do so, we look for the top ten burdens.
data[["burden", "title"]].sort_values('burden', ascending=False).head(10)
Oh dear. Looks like we've got a pretty obvious mistake here: "U.S. Business Income Tax Return" can definitely be filed electronically. Same with the other things on the list. And that one outlier accounts for 3 billion of the 3.3 billion hours. Oof. So what gives?
Well, it turns out that OIRA displays the burden data such that if any of the forms within an information collection request is not available electronically, the burden for all of the request's forms gets aggregated under it. And unfortunately, there doesn't seem to be an obvious way to back out the burden of the other forms. So that number isn't very useful.
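To see why that aggregation matters so much, here's a small sketch of how a single aggregated outlier can dominate a total. The numbers below are made up for illustration; the real figures come from `results.json`.

```python
import pandas as pd

# Hypothetical burden figures (hours) -- not the real OIRA data.
data = pd.DataFrame({
    "title": ["U.S. Business Income Tax Return", "Form A", "Form B"],
    "burden": [3_000_000_000, 150_000_000, 5_000_000],
})

# The largest single entry's share of the total burden.
share = data.burden.max() / data.burden.sum()
print("{:.0%} of total burden".format(share))  # → 95% of total burden
```

One aggregated entry swamps everything else, which is exactly what the histogram above showed.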
Let's see what the total burden is if you remove the top 20% (the 220 largest) of information collection requests.
"{:,} hours".format(data.burden.sum() - data.sort_values('burden', ascending=False).head(220).burden.sum())
So, that feels a lot more sane, and a lot less exciting: only 5,589,316 hours of public burden for everything but the top 20% of information collection requests.
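Rather than hard-coding the number of rows to drop, the same trim can be expressed with a burden quantile cutoff. Here's a sketch on synthetic data; the quantile value and figures are illustrative, not from the real dataset.

```python
import pandas as pd

# Synthetic burden figures for illustration.
data = pd.DataFrame({"burden": [10, 20, 30, 40, 1000]})

# Burden value at the 80th percentile; rows above it are the "top 20%".
cutoff = data.burden.quantile(0.8)
trimmed_total = data.burden[data.burden <= cutoff].sum()
print("{:,} hours".format(trimmed_total))  # → 100 hours
```

This avoids having to count rows by hand when the dataset changes size.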
In the end, this is a great lesson in how a data schema can lead to incorrect conclusions.
Still, we have some good data near the bottom of the chart.
data.sort_values('burden').head(890).burden.plot.hist(bins=30)
In other words, there are a lot of information collection requests that each account for only a couple hundred hours of public burden. That's not a surprising result, but it may be the more useful one: it suggests there are about 200 forms in the middle that account for much of the remaining burden hours. That seems like a good place to start.
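One way to pin down that middle band is to look at the cumulative share of burden across forms sorted from largest to smallest. This sketch uses made-up numbers and a hypothetical 90% threshold to show the shape of the calculation.

```python
import pandas as pd

# Made-up burden figures for illustration only.
data = pd.DataFrame({"burden": [500, 400, 300, 200, 100, 50, 25]})

# Cumulative share of total burden, largest forms first.
ranked = data.burden.sort_values(ascending=False)
cum_share = ranked.cumsum() / ranked.sum()

# How many forms it takes to cover 90% of the total burden.
n_forms = (cum_share < 0.9).sum() + 1
print(n_forms)  # → 5
```

Run against the real data, the same calculation would tell you how many of those middle forms are worth targeting first.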