Evolution of the Handbook and Contributing
The Handbook is intended to dynamically evolve over time. Releases are called “versions”, starting with v1.0. Because many of the aspects that influence data sharing evolve over time, and because the initial set of case studies cannot cover the full breadth of administrative data sharing, future contributions and updates are envisioned.
Availability
The Handbook’s primary location is on the web (https://admindatahandbook.mit.edu/book/). However, it is also a book, and we will, at regular intervals, make available printed copies and ebook versions. The online version and the ebook will be assigned Digital Object Identifiers (DOIs). Online chapters will receive their own DOIs for ease of citation. Both the ebook and the hard copy will be assigned ISBNs. The hard copy will be available for purchase through a number of outlets. The ebook will be available as a free download from the Handbook’s primary location.
Contributions
Invited or contributed additions will be incorporated into the Handbook by a J-PAL-based editorial team. Not all contributions can be accepted. Each contribution will be subject to editorial review and may be sent out to referees. Contributors will be asked to follow the Content, Format, and Style Guidelines and the Case Study Template. The entire process is mediated by a GitHub workflow, and while referee reports and editorial deliberations may not be public, the contents of the decision letter will be part of the public GitHub workflow.[140]
Updates will be accepted and published on a rolling basis (see Versioning). If there have been any updates, a new version of the print copy will be issued once a year.
We would appreciate it if any improvements, corrections, and other contributions were sent back to us, for instance via the “Issues” feature on GitHub.
Contributing authors retain the copyright to their publication. Contributions need to be licensed liberally, at least under a Creative Commons Attribution-NonCommercial license.
Once accepted, a contribution will appear in at least one major version of the Handbook. The Editors retain the right to remove contributions in later versions of the Handbook, and do not need to notify authors of such decisions.
Content, Format, and Style Guidelines
This handbook is intended as a reference for both researchers (users of administrative data) and data providers (holders of administrative data), as well as organizations that would like to facilitate administrative data collaborations, such as universities. The technical chapters are intended to support these different audiences in learning about options, making decisions based on their individual circumstances, and finding additional resources. The case studies in this handbook should focus on showcasing:
- Mutually beneficial partnerships that have led to innovative research projects while also answering pressing policy questions or helping the data provider understand their own data better;
- A process of finding innovative, robust, and scalable solutions to technical, financial, legal, or ethical challenges which have led to sustainable access to administrative data that does not rely on just one research team or the personal championship of one individual in the partner organization;
- Access to innovative types of administrative data or completely new data sources, including the creation of new datasets, describing the process of making this data useful for both the data provider and the research team;
- Administrative data used in conducting novel RCTs, standalone or linked with primary data collection.
Content Guidelines
The Handbook aims to provide practical, actionable guidance to researchers, administrators, policy makers, and other practitioners faced with the issue of implementing a new data access mechanism, or improving an existing mechanism.
Case studies and chapters should therefore speak to the diverse points of view and interests of data providers, researchers, and third-party data hosting organizations (data custodians, intermediaries). For example, data intermediaries will be interested in particular in umbrella agreements and templates, and automated processes that reduce the cost of individual data use cases. Data providers are interested in access modalities and review processes that ensure that their constituents and interests will be protected and the research is in alignment with their policy priorities. Researchers are usually interested in data representativeness and quality and the right to freely publish findings, as well as in secure and unbureaucratic access processes.
Case studies should be specific enough that they can serve as a starting point for implementing the showcased data access mechanism in another location or with another partner.
- What information would be useful for implementing the same solutions described in the chapter?
- What advice do the authors have in hindsight on the advantages or disadvantages of their approach?
- What considerations and priorities informed their decisions? Understanding the parameters and environment of the setup described, from budget and personnel capacity to data characteristics and legal environment, will help others assess whether their situation is the same or different.
It may not be possible to describe every aspect of the case study in equal detail and depth. In this case the authors should focus on areas in which they innovated or have particularly deep knowledge to share. We expect case studies to be about 15-20 pages long. Case study submissions can include tools, materials, and text samples where appropriate; these can be added in an appendix or as an online download of any length.
All case studies should contain a description of the different data sets being provided for access. This description could be provided in table form if there are several data sets, or can link to an online catalog of such datasets, if a large number of data sets are made available. Where feasible, it should include:
- Name/title of the data set (or group of data sets)
- The population and sample size covered
- Time period covered and frequency
- Any notable information on the data, such as
  - the sensitivity of the data and if PII or location information is included in researcher access;
  - whether there are inclusion rules, updates over time, exclusions, deletions, or data purges that have a significant influence on sample selection or data availability and reliability.
Style Guidelines
Handbook chapters should adhere to an academic writing style for a well-informed, but not expert audience. Readers may not necessarily be familiar with the authors’ academic discipline. This means the writing should be clear, precise, concise, and well organized, but avoid jargon.
Please support assertions with sources and references wherever possible. For example, if a case study mentions a policy decision that was influenced by the research, a newspaper report, legal decision, or public announcement should be cited.
Authors should properly expand all acronyms upon first use and keep an international audience in mind, as well as readers from a different discipline and career background (e.g. readers may be civil servants, research economists, technical consultants, public health specialists, etc.). Jargon or country-specific organizations, customs, legal frameworks, and institutions need to be sufficiently explained (e.g. the specifics of the education system or legal code).
All case studies should follow the same template to make the contents quickly and easily accessible.
Formatting Guidelines
Submit final drafts as either a Microsoft Word doc or a Markdown or RMarkdown file.
Section and subsection headings should use the MS Word “Heading 2” and “Heading 3” formatting styles. For submissions in Markdown, use Markdown level 2 and 3 heading syntax (## and ###). Do not include numbering in section headers (e.g., not “2.1.3 Data Use Examples”). Section numbers will be automatically generated later and are likely to change.
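For Markdown submissions, the no-numbering rule can be checked mechanically before submitting. The following is a minimal sketch, not part of the Handbook's tooling; the function name `find_numbered_headings` is hypothetical, and the check is deliberately simple: it flags any heading whose title starts with a number followed by a space, so a legitimate title such as one beginning with a year would also be flagged, and heading-like lines inside code blocks are not skipped.

```python
import re
from typing import List, Tuple

# A Markdown (ATX) heading: one to six "#" characters, whitespace, then the title.
_HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# A leading section number such as "2", "2.1", or "2.1.3" followed by whitespace.
_NUMBERED_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+")


def find_numbered_headings(markdown: str) -> List[Tuple[int, str]]:
    """Return (line number, title) for every heading that starts with a section number."""
    flagged = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        match = _HEADING_RE.match(line)
        if match is None:
            continue  # not a heading line
        title = match.group(2)
        if _NUMBERED_RE.match(title):
            flagged.append((line_no, title))
    return flagged


# Illustrative input only: the second heading repeats the numbering example from the text.
sample = "## Introduction\n### 2.1.3 Data Use Examples\nSome text.\n"
flagged_headings = find_numbered_headings(sample)
```

Here `flagged_headings` contains the single numbered heading on line 2; the unnumbered level 2 heading passes.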
The title of each case study chapter should reference the admin data provider and the data hosting organization (if different). As an example, authors could use a title naming the data provider, plus an informative subtitle, such as, [title] “City of Cape Town, South Africa” [subtitle] “Aligning internal data capabilities with external research partnerships.”
All authors should list their current primary affiliation. Sample: “Shawn Cole, John G. McLean Professor of Business Administration, Harvard Business School.” The chapter template contains a section (“About the authors”) allowing authors to provide additional information there. Please also add a short paragraph to the summary of the case study (see below) that describes each author’s role in the project and dates of involvement (if not affiliated anymore).
Index
Index terms must be submitted as a list, ideally in a separate file. List the primary word or phrase, plus alternatives, separated by commas. Use a new line for each new index word or phrase. In addition, please go through your chapter and tag each index word by bolding it and enclosing it in “…”. When we implement the index, our publishing platform will automatically tag each occurrence of these words.
Sample index list:
- Family Educational Rights and Privacy Act, FERPA
- Data Use Agreement, Data Use License, DUA
Sample occurrence in text: “The template DUA explicitly references the +Family Educational Rights and Privacy Act|.”
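The index list format (one line per term, primary phrase first, alternatives after commas) is simple enough to validate before submission. The sketch below is one possible way to read such a file; the function name `parse_index_list` is hypothetical and this is not the publishing platform's own parser. Note one limitation that follows from the format itself: a primary term that contains a comma would be split into a term and an alternative.

```python
from typing import Dict, List


def parse_index_list(text: str) -> Dict[str, List[str]]:
    """Map each primary index term to its (possibly empty) list of alternatives."""
    index: Dict[str, List[str]] = {}
    for raw_line in text.splitlines():
        # Tolerate a leading list dash and surrounding whitespace.
        line = raw_line.strip().lstrip("-").strip()
        if not line:
            continue  # skip blank lines
        primary, *alternatives = [part.strip() for part in line.split(",")]
        index[primary] = [alt for alt in alternatives if alt]
    return index


# Terms taken from the sample lists in this section.
sample = """\
Data Use Agreement, Data Use License, DUA
Personally identifiable information, personal data, PII
Data enclave
"""
index_terms = parse_index_list(sample)
```

Each key of `index_terms` is a primary term, and a term without alternatives (such as “Data enclave”) maps to an empty list.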
What are good index words or phrases? Index words or phrases should be key terms from the case study that a reader may be searching for across case studies and that would otherwise be hard to find. The article should not just mention the word in passing, but should provide substantive content related to the indexed term. Terms chosen for the index should not be extremely frequent or general terms, such as “administrative data”. They should also not be used to reference passages that can easily be found using the table of contents and section headers.
Index terms might be important technical words and phrases, for example:
- Umbrella DUA, institutional DUA
- Consent, informed consent
- Personally identifiable information, personal data, PII
- Data enclave
They could reference specific laws relevant to administrative data sharing:
- Family Educational Rights and Privacy Act (FERPA)
- Code of Federal Regulations 20 (Section 603)
- Health Insurance Portability and Accountability Act (HIPAA)
- Workforce Innovation and Opportunity Act (WIOA)
- Supplemental Nutrition Assistance Program (SNAP)
- Temporary Assistance for Needy Families (TANF)
- Ohio Revised Code
- Title 13, United States Code
Other options might be important institutions (U.S. Census, state government, NIH, etc.), technical roles (data manager, data officer) etc., or acronyms (assuming they were abbreviated for frequent use) as long as the chapter provides useful information about these terms related to administrative data access.
Bibliography
The bibliography must be submitted as a separate file compatible with reference management tools. Acceptable formats are BibTeX, RIS, EndNote, XML.
References in the text should use the citation handle from the reference database after the symbol “@”. Citations go inside square brackets and are separated by semicolons. Each citation must have a key, composed of ‘@’ + the citation identifier from the database, and may optionally have a prefix, a locator, and a suffix. For example, “[see @ColeDhalSautVilh2020]” might reference the bibliography entry for this handbook. Full information on how to include inline citations can be found here.
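Because every citation key must exist in the submitted bibliography file, it can help to list the keys used in a draft and compare them against the reference database. The following is a rough sketch under simplifying assumptions: it only looks at bracketed citations, it accepts keys made of letters, digits, and underscores with internal colons, periods, or hyphens, and the names `cited_keys`, `missing_keys`, and `HypotheticalKey2019` are invented for illustration.

```python
import re
from typing import Iterable, List

# A bracketed citation group containing at least one "@", e.g. "[see @Key, p. 3; @Other]".
_BRACKET_RE = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
# A citation key after "@"; internal ":", ".", or "-" are allowed, trailing ones are not.
_KEY_RE = re.compile(r"@([A-Za-z0-9_]+(?:[:.\-][A-Za-z0-9_]+)*)")


def cited_keys(text: str) -> List[str]:
    """Return the distinct citation keys found in bracketed citations, in order of appearance."""
    keys: List[str] = []
    for group in _BRACKET_RE.findall(text):
        for key in _KEY_RE.findall(group):
            if key not in keys:
                keys.append(key)
    return keys


def missing_keys(text: str, bibliography_keys: Iterable[str]) -> List[str]:
    """Return cited keys that do not appear in the bibliography."""
    known = set(bibliography_keys)
    return [key for key in cited_keys(text) if key not in known]


# "HypotheticalKey2019" is a made-up key used only to show the semicolon separator.
sample = (
    "The approach is described elsewhere "
    "[see @ColeDhalSautVilh2020, chap. 1; @HypotheticalKey2019]."
)
unresolved = missing_keys(sample, ["ColeDhalSautVilh2020"])
```

With only the handbook's own entry in the bibliography, `unresolved` lists the made-up key, which is the kind of mismatch worth fixing before submission.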
The content of bibliography entries should follow the Chicago Manual of Style. Note however that we will handle styling within the manuscript.
- URLs pointing to online documents should cite the document with all the usual information (author, publication date, etc.).
- Legal code: expand on acronyms and ideally point to online repositories, so interested parties can easily find the text of the legal code. Example:
  - 5 United States Code (U.S.C.) § 552b. Accessed at https://www.law.cornell.edu/uscode/text/5/552b on 02-28-2020.
- Only when citing a generic webpage, use a footnote citation; where appropriate, indicate the date when last consulted. (e.g. “Available at www.google.com.” or “See J-PAL’s policies for RCT registration, accessed at www.povertyactionlab.org/page/information-affiliates on 03-21-2020.”)
Versioning
As an online-first publication, we adopt the following versioning policy:
- The Handbook follows a variant of Semantic Versioning rules. A version is identified by a numeric version number in the format “vX.Y.Z” (Major.Minor.Patch).
- As new chapters are added or existing chapters modified or removed, new versions of the Handbook are released.
  - Minor edits and corrections are identified internally with a patch number (e.g., v1.0.1) but not identified in the publicly available version.
  - Additions of case studies and chapters necessarily lead to an increment in the minor number (e.g., v1.1). However, such additions can be added in batches.
  - Whenever a chapter is removed, a new major version is released (e.g., v2.0.0).
  - When the curators estimate that a sufficiently large number of additions warrant highlighting through a new major number, such a major version may also be released.
- Much of the development of the Handbook occurs openly. To make clear that a particular online version of the Handbook is not officially published, development versions have a -dev suffix, e.g., v1.2.0-dev if leading up to a proposed v1.2.0 release.
  - Nearly completed versions that are being reviewed will have a -rc[n] suffix (e.g., v1.2.0-rc1), where rc stands for “release candidate.”
Only major versions are printed. Minor versions are made available in ebook formats. Patch versions may be made available in ebook formats.
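The versioning rules above can be made concrete with a small parser. This is an illustrative sketch only, not the Handbook's release tooling: the names `HandbookVersion`, `parse_version`, and `release_formats` are hypothetical, short forms such as v1.1 are assumed to mean v1.1.0, and a published major version is assumed to be available as an ebook as well as in print.

```python
import re
from typing import List, NamedTuple, Optional

# "vX.Y.Z" with an optional "-dev" or "-rcN" suffix; missing parts default to 0 (an assumption).
_VERSION_RE = re.compile(
    r"^v(?P<major>\d+)(?:\.(?P<minor>\d+))?(?:\.(?P<patch>\d+))?"
    r"(?:-(?P<suffix>dev|rc\d+))?$"
)


class HandbookVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    suffix: Optional[str]  # None for a published version, else "dev" or "rcN"


def parse_version(text: str) -> HandbookVersion:
    """Parse a version string such as "v1.2.0-rc1"."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised version string: {text!r}")
    return HandbookVersion(
        major=int(match["major"]),
        minor=int(match["minor"] or 0),
        patch=int(match["patch"] or 0),
        suffix=match["suffix"],
    )


def release_formats(version: HandbookVersion) -> List[str]:
    """Formats implied by the policy: print for major, ebook for minor, optional ebook for patch."""
    if version.suffix is not None:
        return ["online (development)"]  # -dev and -rc versions are not officially published
    if version.minor == 0 and version.patch == 0:
        return ["online", "ebook", "print"]  # ebook for major versions is an assumption
    if version.patch == 0:
        return ["online", "ebook"]
    return ["online", "ebook (optional)"]
```

For example, `release_formats(parse_version("v2.0.0"))` includes print, while `parse_version("v1.2.0-dev")` is reported as a development version only.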
Case Study Template
Case studies should follow the case study template. Below is an updated checklist and guidance for authors and reviewers. In a Word document, please use heading formatting (heading 2 and heading 3) for each section header. Do not number your sections. The template can be downloaded as Word.
Summary (approximately one single-spaced page)
Provide an overview of the key elements of the case study.
The goal is to give potential readers a quick idea if the case study describes a scenario similar to their own, what topics it covers, and how they might apply its lessons. The authors should:
- Give an overview of the data:
  - Summary of data content
  - Interesting uses of the data
  - Data provider and data host: name, location, and range of activities
    - For example: “This chapter describes a partnership between the University of [country] and the traffic and transportation department of [city] in [country], an agency with 68 employees in charge of traffic and parking regulations, road use permits, construction, and public bus transport for xx residents and yy annual visitors.”
- Briefly summarize key parameters and process of making data usable and accessible:
  - Timeline and current status
  - Hosting arrangement: data provider, researcher, or other third party?
  - Cost and resources used, staffing
  - Involved parties; e.g. data collectors and data curators (if different from data provider)
  - Access modalities
- Summarize main insights and contributions of this chapter:
  - Key conclusions and recommendations for others
  - Any new resources or systems innovations described
  - Potentially highlight elements not covered
- Reference/describe 1-3 main sources for the case study, if existing:
  - Role of the authors in the process above
  - An existing publication about the case study or a website with information about data access
Introduction
Motivation and Introduction
Explain what drove the original project to make the data available.
Reasons could include a legal mandate, a promising research project, or a pivot towards data-driven decision making. Was the initiator a researcher or the data provider (business, non-profit, government), and what were their reasons or interests? What were the desired outcomes of the project?
This section may name a legal mandate to make data accessible, but the details (how access is organized and who has access) should be discussed in the next section “Legal Context”.
Data Use Examples
If available, provide examples of particularly interesting, innovative, or policy-relevant uses of the data.
Connect these examples with a description of the available data (see section “content guidelines” point 4 above). Particularly valuable are examples where
- Experiments were conducted using the data, in collaboration with the data provider or independently
- Administrative data was linked with other data sets or survey data
- The data and research were used for policy improvements.
Figures or tables can be used to support the data use examples. Do not describe prospective or speculative uses. Cite published papers as well as any policy decisions that were made based on analysis of the data.
Making Data Usable for Research
Describe the steps for transforming the operational or administrative data records into usable data files (including metadata and documentation) for analysis.
This section should discuss both the preparation of the data itself and the data documentation and metadata. Focus on information that can help others understand the general process and the time and cost involved. Steps might be:
- Retrieving individual data records or operational data sets and standardizing formats
- Standardizing variables and matching records across data sets
- Cleaning, verification, reconciliation, and quality control
- Analysis file documentation
- Data documentation and metadata generation and publication
- Automation of the above steps, if any
Describe the resources and time needed to carry out these steps, from staffing to software and systems. Who does this work – researchers, in-house staff, contractors? Please provide examples of or references to documentation and metadata, if useful and available.
Modifications that are needed to protect the data before researchers can access them (e.g. de-identification) or before research findings and outputs can be published should be described below under “Safe Data” and “Safe Outputs.”
Legal and Institutional Framework
Institutional Setup
Describe how data access is organized from an institutional standpoint.
Who are the parties to the data access mechanism in this case study? Is the data provider also the data custodian/data host, or has a third party (data intermediary) taken on this role? Who provides or provided legal advice to the parties?
As an example, in some cases the data provider (say, a private firm) conducts all preparation of the data for research in house, and contracts directly with researchers (“two-party model”). In other cases, the data provider or data providers (say, different ministries in a national government) contract a data custodian (data intermediary) who aggregates and prepares data and negotiates data access with researchers, representing the data provider(s).
Legal Context For Data Use
Describe the legal context that permits (or mandates) if and how the data provider gives others access to the data.
This section should focus on the legal obligations of the data provider towards those whose personal or proprietary information is contained in the data (respondents, taxpayers, firms, etc.), or those who have curated or created the data (if different). Please provide references to applicable laws (see above on legal citations).
- Does the legal framework mandate data ownership or privacy protections for constituents, clients, etc. whose data are being shared, and how clear and established are these protections?
- Can third parties request purging of records or do data have to be deleted at certain time intervals (e.g. juvenile justice, deleted tweets)? Note: data purging was moved here from a separate section. If the data provider puts such rules in place without a legal mandate, please discuss in the “five safes” section.
- Does the data provider legally need to have consent from the “producers” of the data to share the data with researchers?
- Are there any legal restrictions on who can be given access to the data? Reference legal basis here, but discuss safety restrictions on research access, and how they are implemented, in detail under the “Five safes” rubric.
- Are there any legal sanctions prescribed for data intermediaries or employees of the data provider for allowing for unauthorized sharing, access, or uses of the data? Note: sanctions for data users can be referenced here but their implementation should be described in detail in the “five safes” section.
Note: the agreement between the data provider and the users of the data (e.g. researchers) should be described below and in the “five safes” section (with reference to data sensitivity and confidentiality). The legal framework for access should be described in a way that is intelligible to lay persons and an international readership.
Legal Framework for Granting Data Access
Describe what legal agreements regulate the use of the data by others.
This section should focus on the legal relationship between the data provider and the data users (researchers, other agencies or parts of government, the public etc.). Typically, data access is granted through or governed by some form of agreement - a non-disclosure agreement (NDA), a data use agreement (DUA), a protocol, proposal, or framework. This section should discuss such agreements, or, if no individual agreement is typically made, how else data use is governed (e.g. by applicable laws, informal agreements). If possible, the authors should provide an example or template, pruned of any identifying information.
- How much flexibility and individual negotiation goes into a typical data use agreement?
- Can the data provider impose sanctions (financial, reputational, penal) for unauthorized uses of the data? How were those chosen? Have they ever been imposed? Note: discuss details or reference back to this part in the “five safes” section.
- Can the provider revoke data access (for cause, without cause, etc.)?
- Are there any payments, conditions for access, or other obligations to the researcher specified in the use agreement, for example co-authorship on research output? Note: discuss details or reference back to this part in the “five safes” section where this relates to data protection.
- Does the data provider assert Intellectual Property (IP) on the original data, or on any derivative data or products created by researchers with access to the data, such as tables, research papers, source code, etc.? How is IP enforced? Note: this was previously in a separate “IP” section.
- Is there any review of outputs, e.g. does the data provider have a veto right over outputs? What are the review criteria? Note: review for disclosure risk and sensitive data should be discussed in detail in the “five safes” framework. Any review related to business or political strategy, or intellectual property rights, of the provider should be discussed here.
- Is there a specific copyright or license on the data and code generated by the project?
Authors may perceive some overlap between this section and the “Five safes” section. This and the previous subsection should focus on the broad legal justification and framework for granting research access. For instance, if the criteria described in detail in the Five Safes section are prescribed by a law, then describe the law here, and the details of the access under the Five Safes. Describe here if the formal agreement is called and formulated as a license, a contract, a data use agreement. This section should also describe any protections for the data provider – for instance, liability limitation if a breach occurs by a researcher, intellectual property rights, non-competition clauses.
Protection of Sensitive and Personal Data: The “Five Safes” Framework
This section should focus on aspects of the data access process that relate to ethics, data security, privacy, and protection of confidentiality. It is organized using the “Five Safes” framework. Please assess each of the five aspects of data security below in terms of importance and cost on a scale from 1 (lowest) to 5 (highest).
Safe Projects - Evaluating data analysis projects for appropriateness
Describe the application and review process and approval criteria for data access requests.
If available, include information on application management systems, any IRB or other review requirements, and which expertise is sought for the review of applications (e.g. legal or statistics experts). Describe how long the approval process takes and what it costs. Does the data provider require consent from the “producers” of the data to share the data with researchers?
If there is an online portal, or if information on the application process is publicly posted, please provide a reference/citation. Note that this section should focus on protections for those in the data, e.g. is the data use appropriate, are sensitive data protected in the project, and do the benefits of using the data outweigh the risks. For other aspects of project review – in particular intellectual property rights or strategic business interests – please use the “legal framework” section above.
Safe People - Evaluating researchers who seek data access
Describe what conditions and credentials are required to be granted data access.
- What are the requirements to apply for access and how, and by whom, are they verified - citizenship, professional standing, background security checks…?
- Do researchers need to have human subjects training or any other kind of training? Who provides this training? How frequently does it need to be renewed?
- Does the organization trust researchers for access procedures, for disclosure avoidance procedures, for data handling? Does it audit these? How are violations sanctioned? Note: refer back to the legal framework for sanctions if necessary.
- Do repeat users get fast tracked? How many users can be handled?
If the data provider uses a “circle of trust” model, please describe it.
Safe Settings - How can the data be accessed?
Describe the technical and physical infrastructure for data access and the handling of security breaches.
- Please describe software, systems, and IT resources and weigh their advantages and disadvantages and their cost.
- How is access restricted, e.g. is it possible only at a specific physical location, a secure computer, remote submission, or via secure remote interactive access (sometimes called a virtual data enclave), etc.? Are there multiple access settings?
- Are the various safe settings only used for researchers, or are they also used internally (by staff of the data provider or the data intermediary)?
- How is researcher-provided software and code handled?
- Security: What is considered a breach of the safe settings? How are breaches handled, and who is notified about suspected or actual breaches? What are the penalties? Refer back to the “legal” section if needed.
Refer to the handbook chapter on “Physical security” if necessary. Be concrete and provide references if possible to help others who want to implement the same solutions.
Safe Data - How is disclosure risk in the data verified and mitigated?
Provide concrete information on the data processing steps involved in modifying the analysis data files for different levels of (researcher) access.
- Describe the resources and staff capacity invested in creating researcher-accessible data files. Are these files created as part of your data processing pipeline, or for researchers only; either on-demand, or on a schedule?
- Are synthetic or fake datasets used, combined with remote processing or validation?
- What methods are used to make the data “safe”? Are data elements stripped or coarsened? To what degree are the researcher-accessible data altered from the original data, and what are the consequences for the usability of the data?
- Discuss in particular if it is possible – and what steps are involved – to link the administrative data with other external data sets, such as survey data, that have personal identifiers. Is it possible to conduct field experiments with the data (e.g. by linking treatment information with the administrative data set, or by using the administrative data as a sample frame)? Note: added question.
Note: these are processing steps carried out after the data was made usable for research in order to remove PII and/or obscure indirect identifiers. Discuss other data processing for analysis in the “making data usable” section.
Safe Outputs - How is disclosure risk in statistical analysis results and tabulations verified and mitigated?
Describe processes, requirements, and tools used to verify the disclosure risk from published analysis results.
- Are safe-output rules incorporated into data use agreements?
- How (and how often) are safe outputs verified?
Discuss the practical implementation, including e.g. how long the review takes and who conducts it. Note: refer back to the legal section above regarding output review as needed. The present section should focus on how review for data protection is implemented.
Data Life-Cycle and Replicability
Preservation and Reproducibility of Researcher-Accessible Files
Describe to what degree and over which time period researcher-accessible files and master files are preserved.
- Is there active curation, or only preservation of current bitstreams?
- Does the data provider perform the archival/curation functions itself, is another entity within the organization responsible for this activity, or is a third-party (national, regional, university archive) responsible for this?
- Are persistent identifiers generated for these files?
- Can researcher-accessible files be consistently regenerated, or are they snapshots of a dynamic database or data pipeline? Is the query mechanism reproducible (can similar files be recreated, can older files be retrieved), or manual (with risk of inconsistency, variation across personnel)?
Preservation and Reproducibility of Researcher-Generated Files
Are researcher-generated intermediate and final data files and tables, and processing code and programs preserved?
- Can other researchers access these files, with or without the permission of the original researchers?
- Does the data provider generate persistent identifiers for any of these?
Sustainability and Continued Success
Outreach
Provide examples of successful outreach activities that have made “safe people” aware of the access point and data offerings.
Discuss outreach by data intermediaries/researchers to data providers as part of “Metrics of Success”.
Revenue
Discuss what makes the data access mechanism financially sustainable.
How are incremental costs covered, is there full cost recovery? What is the pricing model?
Where possible, precise numbers should be used.
Metrics of Success
Describe how the success of the data access project is evaluated.
What metrics for success are used, and who is the audience for metrics - funders, data provider, legislature?
Provide sample statistics where possible. Note: section was renamed.
Disclaimer
The views expressed in this paper are those of the authors and not those of any sponsors.
[140] For the technically inclined: it will be the commit message of the pull request, or the rejection of such a pull request.