Notes from the Sunlight/ORGPedia Workshop on Open Organizational Data and Identifiers (April 8, 2011)

The April 8th workshop co-hosted by the Sunlight Foundation and the ORGPedia Project (a project of the Democracy Design Workshop at New York Law School and Rensselaer Polytechnic Institute) brought together 30 representatives of government agencies and interest groups that promote corporate accountability. Representatives of the four agencies in attendance (Environmental Protection Agency (EPA), Securities and Exchange Commission (SEC), Federal Communications Commission (FCC), Department of Labor (DoL)) shared their reflections first and, after lunch, the dot-orgs spoke. Each organization reflected on the identifier schemes it currently uses, as well as the opportunities and challenges of adopting either a single, new, uniform identifier or an open identification ecosystem that facilitates translation between different identifiers.

Existing Identifiers

The EPA’s Office of Information Collection has built an ingestion pipeline to collect (not always very clean) data across 32 federal databases, 57 state databases, and 2.7 million facilities in a Resource Description Framework (RDF) datastore. It has no unique identifier at the organizational level (2 million+ entities) but it does have an ID for each facility and physical plant. Many firms operate thousands of facilities. Whereas firms change from Firm to Firm LLC to Firm Inc., physical facilities remain constant and tangible. This concept of linking identifiers to concrete artifacts such as physical facilities or chartering documents was a point of emphasis for the EPA. The EPA has used Data Universal Numbering System (DUNS) information in the past via the Toxics Release Inventory (TRI) but found it to be unreliable. The proprietary nature of DUNS identifiers also posed logistical problems for the EPA.

Mine Safety in the Department of Labor issues a mine ID number and an operator ID number. The mine ID stays constant and linked to the physical facility. For example, people wanted to know the safety record of the Upper Big Branch Mine, which was easy to provide. But the mine also has an owner who, in turn, contracts to an operator who, in turn, has subcontractors, all of whom might change hands, and this is much harder to track. Of DoL’s constituent agencies, the Mine Safety and Health Administration (MSHA) is among the most advanced. The Occupational Safety and Health Administration (OSHA) has used DUNS in the past, but it proved to be costly and of poor quality. There is considerable variation between agencies in how information is stored, for instance in the location and form associated with audit trail data. Settling on a standard identifier solution across DoL agencies would necessarily impose costs on some but not others, depending on the compatibility of each agency’s legacy identifier systems.

No standard Uniform Resource Identifiers (URIs) exist for “assets” such as pension or health plans, though it would be good to identify these and link them to firms, too.

The FCC has its own number called the FCC Registration Number (FRN). There are 1.3 million issued FRNs, which include licensees and also the lawyers to whom the licensees delegate authority. The FCC has begun using “common names” for licensees, which implicitly groups together various corporate subentities.

The Internal Revenue Service (IRS) uses Employer Identification Numbers (EIN) but there’s no “decoder” for these numbers because of privacy legislation. In any case, EINs don’t provide information related to structure and hierarchy.

The SEC will soon have 10,000 companies filing in Extensible Business Reporting Language (XBRL). It has a numbering scheme for operating companies (the Central Index Key (CIK)) and a different scheme for investment advisors and for funds. There is currently a non-governmental industry initiative driving the creation of a new International Organization for Standardization (ISO)-compliant standard numbering system; this is included in both the Office of Financial Research (OFR) Request for Proposal (RFP) and another rulemaking related to hedge funds and counterparties to swaps. The SEC/ISO process is going to number the ultimate parent but not track any intermediate parents by means of a centralized registration authority. This scheme will cover 1.8-2.5 million companies that participate in financial markets internationally. It will include a single number, the ultimate parent, information about place of incorporation, and date of organization. Hierarchy information will not be embedded in the ID itself; what is reported will be “very, very limited.” Right now it looks like ISO will go down to the branch level for institutions but not to the trading desk level. The relevant ISO committee is Technical Committee (TC) 68. In total, the standard has 6 attributes, including place of incorporation and date of submission, and will be publicly searchable. “The real problem is hierarchical information.” ISO will not have any authority to compel participants to keep the information correct or up to date.

Subsidy Tracker from Good Jobs First (GJF) is tracking 64,000 company-specific entries and tracing who gets what money across different state subsidy programs. Hierarchy information would be a huge help to this effort. GJF is interested in including open identifier proposals in its model legislation.

The National Institute on Money in State Politics’ project tracks lobbyists and lobbying expenditure data. Once someone becomes a large donor, they get an ID. The Institute has been forced to create a sophisticated entity resolution pipeline to make up for the lack of common identifiers. It says 95% of this can be automated, but the remainder is problematic.
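The split between automated and manual entity resolution described above can be illustrated with a minimal sketch. This is not the Institute's pipeline: the function names, the thresholds, and the sample donor record are all assumptions made for illustration, and the matching uses only Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(name, known, auto_threshold=0.92, review_threshold=0.75):
    """Match a reported name against known names (a dict of name -> ID).

    Returns ("auto", id) for confident matches, ("review", id) for matches
    a person should check, and ("new", None) when nothing is close.
    Both thresholds are illustrative guesses, not tuned values.
    """
    best_name, best_score = None, 0.0
    for candidate in known:
        score = similarity(name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    if best_name is not None and best_score >= auto_threshold:
        return ("auto", known[best_name])
    if best_name is not None and best_score >= review_threshold:
        return ("review", known[best_name])
    return ("new", None)


# Fictitious example record; "D-1" is a made-up internal donor ID.
known_donors = {"Acme Widget Company": "D-1"}
print(resolve("ACME WIDGET COMPANY", known_donors))
```

Names that land in the "review" band are the ones that would go to a person, which is roughly where the hard remainder mentioned above would sit. A shared identifier would replace the fuzzy comparison with an exact lookup.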


Typically, corporate activity is tracked by statute. In other words, firms have to complete the same, often duplicative, paperwork across different components of the same agency as well as across agencies.

In addition, there’s no way to compare data across regulatory components or agencies. If we really want to understand, for example, a firm’s record, we have to be able to take a systems perspective and examine environmental compliance across air and water and workplace safety compliance across Wage and Hour, OSHA and other components.

When the Exxon Valdez spill happened, it was a huge burden to figure out how the company had been performing across the agency. Similarly, a third party operated BP’s Deepwater Horizon, making it difficult to provide meaningful safety information. But with an open identifier schema it will become possible to track compliance records.

Also if there’s some consistency to the concept of entity, we can overlay census data on spending data and see if government money is being spent where it needs to go.

We can get early warnings of corporate malfeasance.

Government is doing a lot of data cleaning and processing after data is collected, especially as data is being gathered at both the federal and state level. There are data quality problems at every agency. The goal now is to push that to the front end and try to improve the forms so that data, when solicited, is being entered in more usable formats. For example, with fillable forms it is possible to harmonize between “Corp” and “Corporation.”
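As a rough illustration of the “Corp” versus “Corporation” harmonization, a fillable form could normalize the legal-form suffix at entry time. The sketch below is an assumption about how that might look, not a description of any agency's form logic; the suffix table is hypothetical and deliberately short.

```python
import re

# Hypothetical suffix table; a production form would need a longer, vetted list.
SUFFIXES = {
    "corp": "Corporation",
    "corporation": "Corporation",
    "inc": "Incorporated",
    "incorporated": "Incorporated",
    "co": "Company",
    "company": "Company",
    "llc": "LLC",
    "ltd": "Limited",
    "limited": "Limited",
}


def normalize_name(raw):
    """Collapse whitespace and expand a trailing legal-form suffix."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    if words == [""]:
        return ""
    key = words[-1].rstrip(".,").lower()
    if key in SUFFIXES:
        words[-1] = SUFFIXES[key]
        # Drop the comma in forms such as "Acme, Inc."
        if len(words) > 1:
            words[-2] = words[-2].rstrip(",")
    return " ".join(words)


print(normalize_name("Acme  Corp."))   # Acme Corporation
print(normalize_name("Acme, Inc."))    # Acme Incorporated
```

Normalizing at the point of entry only removes spelling variation; it does not tell two distinct firms with the same name apart, which is the job of an identifier.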

This will also make it possible to auto-populate fields on subsequent forms and reduce the data collection burden.

Several agencies, such as the FCC and SEC, are creating a Chief Data Officer position in each bureau and office. They are starting enterprise data dictionaries, which will help with this work, too.

If a legal entity identifier becomes available, it will also become possible to track hierarchical information and understand the relationship among entities, such as counter-parties to swaps or subsidies to entities by sector across multiple states.


The group articulated that to achieve the goals of greater corporate accountability, any identifier project(s) will need to:

  1. Clearly articulate how the public is going to benefit.
  2. Avoid building only a single “big identifier system,” which will be too brittle and simplistic, in favor of creating an “identifier ecosystem.”
  3. Combine a unique identifier, on the one hand, with the ability to map between schemes on the other.
  4. Standardize around the minimum number of fields, i.e. an entity number, while providing extensibility so that agencies can handle the process of tracking specific, niche information relevant only to one regulatory regime.
  5. Articulate a data dictionary/vocabulary to make sense of key corporate governance relationships.
  6. Make facility a trackable field to enable linking between physical plants and legal entities.
  7. Link back to formation documents, as filed and as amended over time.
  8. Have a field for common name, i.e. Nextel or Verizon even though they do business under the names of thousands of other entities. Common name isn’t the same as corporate name.
  9. Support not just identifying particular entities, but also identifying hierarchies and changes to those hierarchies.
  10. Separate identifier from authentication and login.
  11. Address:
    1. How do you get people into the system?
    2. Where does it get assigned?
    3. Is it compulsory?
    4. How to pay for it? Are registration fees acceptable? Usage fees?
    5. How will the numbering scheme be paid for?
    6. When tracking relationships, will we track vertical or also horizontal?
    7. How to deal with errors, e.g. will a single mistyped keystroke return the wrong result?
    8. What level of resolution, both in terms of entity size and time intervals, is minimally acceptable? E.g. should individuals be included in such a system?
    9. How do you keep the system going?
  12. Consider microgrants to enable groups to document and communicate their own experiences with running identifier systems.
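The pairing of a unique identifier with the ability to map between schemes (item 3) can be sketched as a simple crosswalk table. This is only one possible reading of an “identifier ecosystem”: the `Crosswalk` class, the entity key format, and every identifier value below are hypothetical. Only the scheme names (FRN, CIK) come from the discussion above.

```python
from collections import defaultdict


class Crosswalk:
    """Maps identifiers from different schemes to one shared entity key."""

    def __init__(self):
        self._entity_of = {}  # (scheme, value) -> entity key
        self._ids_of = defaultdict(lambda: defaultdict(set))  # key -> scheme -> values

    def link(self, entity_key, scheme, value):
        """Record that `value` in `scheme` identifies `entity_key`."""
        existing = self._entity_of.get((scheme, value))
        if existing is not None and existing != entity_key:
            raise ValueError(f"{scheme}:{value} is already linked to {existing}")
        self._entity_of[(scheme, value)] = entity_key
        self._ids_of[entity_key][scheme].add(value)

    def translate(self, scheme, value, target_scheme):
        """Return the set of identifiers in `target_scheme` for the same entity."""
        entity = self._entity_of.get((scheme, value))
        if entity is None:
            return set()
        return set(self._ids_of[entity].get(target_scheme, set()))


# All values below are made up for illustration.
cw = Crosswalk()
cw.link("entity-0001", "FRN", "0000000001")
cw.link("entity-0001", "CIK", "0000000002")
print(cw.translate("FRN", "0000000001", "CIK"))  # {'0000000002'}
```

Each agency keeps its own scheme and its own niche fields; only the shared entity key and the links are common, which is roughly the minimal-fields-plus-extensibility idea in item 4. The sketch leaves open who assigns the key and who maintains the links, which are the questions listed under item 11.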

Some Ideas for Next Steps

Develop an index of ontologies, i.e. a catalogue of existing identifier systems and the fields that they contain.

Develop a data dictionary of corporate governance relationships, i.e. parent-child, facility-owner.
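A minimal sketch of what such a data dictionary might hold: a fixed vocabulary of relationship types and dated relationship records, so that hierarchy changes over time can be expressed. The class names, fields, and sample entities are assumptions for illustration, and the lookup assumes each entity has at most one parent on any given date.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class RelationType(Enum):
    PARENT_OF = "parent-child"
    OWNS_FACILITY = "facility-owner"


@dataclass(frozen=True)
class Relationship:
    subject: str                 # e.g. the parent or the owner
    relation: RelationType
    obj: str                     # e.g. the child or the facility
    start: date
    end: Optional[date] = None   # None means still in effect

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day < self.end)


def ultimate_parent(entity: str, rels: List[Relationship], day: date) -> str:
    """Walk parent-child links active on `day` up to the top of the hierarchy."""
    current, seen = entity, {entity}
    while True:
        parents = [r.subject for r in rels
                   if r.relation is RelationType.PARENT_OF
                   and r.obj == current and r.active_on(day)]
        if not parents:
            return current
        current = parents[0]
        if current in seen:
            raise ValueError("cycle in parent-child relationships")
        seen.add(current)


# Fictitious entities: PlantCo changes hands at the start of 2010.
rels = [
    Relationship("HoldCo", RelationType.PARENT_OF, "MidCo", date(2000, 1, 1)),
    Relationship("MidCo", RelationType.PARENT_OF, "PlantCo", date(2005, 1, 1), date(2010, 1, 1)),
    Relationship("NewCo", RelationType.PARENT_OF, "PlantCo", date(2010, 1, 1)),
]
print(ultimate_parent("PlantCo", rels, date(2007, 6, 1)))  # HoldCo
print(ultimate_parent("PlantCo", rels, date(2011, 4, 8)))  # NewCo
```

Keeping the dates on the relationship rather than in the identifier matches the point made earlier that hierarchy information should not be embedded in the ID itself.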

Explore opportunity to create baseline identifier system through the IRS EIN system by means of a rule or statutory change.

In addition, explore with the National Association of Secretaries of State the alternative of building a single-field identifier into state-level registrations of 18 million entities.

Draft model legislative language for making open corporate data a requirement in legislation.

In connection with relevant partners, design and run pilot projects to test linked data strategies for understanding hierarchies.

In connection with relevant partners, develop a plan for a sustainable open corporate data ecosystem.


Comments to OFR Request for Information (RFI) on legal entity identification for financial contracts (9.2 MB .PDF file)

International Organization for Standardization, Financial Services Working Group TC 68 (TC 68/SG 1 – Identifiers): ISO site for the technical committee assigned to standardize the field of banking, securities, and other financial services. Contains both published and unpublished ISO standards.

Regulatory Compliance, Presidential Memorandum (January 18, 2011): directs agencies to share enforcement and compliance information across the Government.

EPA Facilities Registry: a prototype of a linked data approach to modeling and publishing information on facilities.

The U.S. Department of Labor’s Enforcement Data: makes the enforcement data collected by DoL accessible and searchable, with the intent that the public will devise new and creative ways of using the data.

GS1 (UPC): a non-profit organization that administers the Universal Product Code system, providing a globally unique identification number reserved for a single company.


Beth Noveck (NYLS)

Tom Lee (Sunlight Foundation)

Daniel Schuman (Sunlight Foundation)

Jim Harper (Cato)

Francis Avila (Dancing Mammoth)

David Roberts (DOL)

Ed Bender (NIMSP)

Matt Reed (SEC)

Kevin Webb (Open Plans)

Jed Miller (Revenue Watch)

Skye Bender-deMoll (CorpWatch)

Greg Elin (FCC)

Steve Young (EPA)

David Smith (EPA)

Shana Harbour (EPA)

Craig Jennings (OMB Watch)

Susi Alger (CRP)

Jihan Andoni (CRP)

Reed Rushing (CAP)

Phil Mattera (GJF)

Jim Hendler (RPI)