TBS policy requirements, tagged!

You can view the data by two groupings:

Data from Treasury Board of Canada Secretariat. Wrangled by Lucas Cherkewski (also on Twitter).

Methodology

(For the most up-to-date version of this, see the project README.)

  1. Policies were downloaded from TBS’s site as XML. (Inspired by @sleepycat’s work.) They’re now stored directly in this repo, for easy reproducibility. (Ignore the .xml.txt extension; that’s so my R script would cooperate.)
  2. load.R
    1. Loads every policy as a string of plaintext.
    2. Breaks the policy into pieces. Tries to do this intelligently, looking at HTML elements. Ultimately we break it down to the sentence level, though lists kinda throw this off.
    3. Checks for the name of each policy, and assigns each row a number.
    4. Manually adds in a few policy requirements/lines that exist only as attributes on <section> elements. (Weird stuff.)
    5. Runs scripts/assign_responsibility.R.
  3. scripts/assign_responsibility.R
    1. Pulls together the list of "responsible actors" and "responsible signals". Does some pluralizing and concatenating to create our search strings.
    2. Working policy by policy, rolls through the requirements. If a requirement contains a string that matches one of the "responsible signals", it marks that requirement as "describing responsibility" (is_clause_describing_responsibility). Then it assigns a "responsible actor" based on the signal/actor combos (responsible_actor_standardized).
    3. Notes which clause was the source of the responsible actor for each clause. (Each time we assign responsibility, i.e. where is_clause_describing_responsibility is true, we set that clause’s row number as responsible_clause. Then we fill down the empty spaces.)
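
To make the tagging and fill-down logic in scripts/assign_responsibility.R a bit more concrete, here is a rough sketch of the same idea. It is written in Python for illustration only; it is not the project's R code. The actor and signal lists, the "just add s" pluralizing, the first-match-wins rule, and the function names are all assumptions made up for this example. Only the three column names (is_clause_describing_responsibility, responsible_actor_standardized, responsible_clause) come from the description above.

```python
# Hypothetical vocabularies; the real lists live in the project's R scripts.
ACTORS = ["deputy head", "chief information officer"]
SIGNALS = ["is responsible for", "are responsible for"]


def build_search_strings(actors, signals):
    """Pair every actor form (singular and naive plural) with every signal."""
    combos = []
    for actor in actors:
        for form in (actor, actor + "s"):  # naive pluralizing: just add "s"
            for signal in signals:
                # Keep the singular actor as the standardized label.
                combos.append((f"{form} {signal}", actor))
    return combos


def assign_responsibility(clauses, actors=ACTORS, signals=SIGNALS):
    """Tag one policy's clauses, in order, and fill responsibility downward."""
    combos = build_search_strings(actors, signals)
    rows = []
    current_actor = None   # last actor found, carried down to later clauses
    current_source = None  # row number of the clause that named that actor
    for row_number, text in enumerate(clauses, start=1):
        lowered = text.lower()
        describes = any(signal in lowered for signal in signals)
        if describes:
            # Assumption: the first matching actor/signal combo wins;
            # the actor is None when a signal appears without a known actor.
            current_actor = next(
                (actor for search, actor in combos if search in lowered), None
            )
            current_source = row_number
        rows.append(
            {
                "row": row_number,
                "text": text,
                "is_clause_describing_responsibility": describes,
                "responsible_actor_standardized": current_actor,
                "responsible_clause": current_source,
            }
        )
    return rows


if __name__ == "__main__":
    # Made-up clauses, not taken from any real policy.
    example = [
        "This directive applies to all departments.",
        "Deputy heads are responsible for the following:",
        "Ensuring records are kept.",
        "The chief information officer is responsible for:",
        "Approving system plans.",
    ]
    for row in assign_responsibility(example):
        print(row["row"], row["responsible_actor_standardized"], row["responsible_clause"])
```

Clauses that come before any responsibility clause keep an empty actor and source, which is what filling down (never up) implies. Calling the function once per policy keeps an actor from one policy from leaking into the next.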