The OrgPedia Open Organizational Data Project

Funded by the Alfred P. Sloan Foundation, the OrgPedia project is developing a free, not-for-profit online directory of data about domestic and international, public and private companies.

The OrgPedia prototype makes available and downloadable via the web a rich array of information about each firm, including:

  • “Business card” information about name, location, and any entity identifier (i.e., ID numbers used by different agencies as well as private numbering schemes like Open Symbology and, eventually, the FINRA financial services LEI).
  • Ownership information about who owns and manages the firm and whom it owns and operates.
  • Other open government data, including securities and patent filings and environmental and workplace safety records.

Using a linked data approach, OrgPedia builds the directory from three sources:

  • Government databases of information collected from firms as part of transparent, regulatory processes.
  • Private but open databases of information (e.g., New York Times financial data).
  • Public contributions to fill in gaps in the business card directory. Public contributions are clearly distinguished from authenticated government data.
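
The separation between authenticated and contributed data described above can be pictured as a provenance tag carried by every fact. The following is a minimal sketch in Python, not OrgPedia's actual data model; the class names, field names, and the sample record are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    GOVERNMENT = "government"      # authenticated regulatory data
    OPEN_PRIVATE = "open_private"  # private but open databases
    PUBLIC = "public"              # contributions from the public


@dataclass(frozen=True)
class Fact:
    entity: str      # hypothetical internal entity key
    attribute: str   # e.g. "address"
    value: str
    source: Source   # provenance stays attached to the value


def business_card(facts, entity):
    """Pick one fact per attribute, preferring authenticated sources.

    Public contributions only fill attributes no other source supplies.
    """
    priority = {Source.GOVERNMENT: 0, Source.OPEN_PRIVATE: 1, Source.PUBLIC: 2}
    card = {}
    for fact in sorted(facts, key=lambda f: priority[f.source]):
        if fact.entity == entity:
            card.setdefault(fact.attribute, fact)
    return card


# Made-up sample records for a fictional firm.
facts = [
    Fact("example-co", "address", "1 Main St", Source.PUBLIC),
    Fact("example-co", "address", "1 Main Street", Source.GOVERNMENT),
    Fact("example-co", "website", "example.com", Source.PUBLIC),
]
card = business_card(facts, "example-co")
for attribute, fact in card.items():
    print(attribute, fact.value, fact.source.value)
```

Because each value keeps its `Source`, a display layer can mark the website as a public contribution while showing the address as government data.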

Like Wikipedia, OrgPedia uses common collaboration features, such as tagging and rating, to provide a check on data quality by enabling companies to flag and correct inaccuracies and report mistakes back to government authorities. A developer’s API enables the information to be easily exported to the websites and tools of other organizations wishing to use this data for their own projects. Unlike Wikipedia, OrgPedia is not a single repository of information but, rather, a clearinghouse with information linking back to myriad sources. OrgPedia therefore promotes, rather than competes with, business intelligence companies that offer analytics above and beyond OrgPedia’s goal of providing a directory of firms in the public interest.

OrgPedia builds upon parallel efforts within the financial services community to develop a global, open, legal entity identifier. We anticipate that a broader array of agencies, from patent to environmental regulators, will want to take advantage of the new universal LEI. Firms, too, will want to use the LEI to obtain information about those with whom they do business. OrgPedia provides the extensible infrastructure and public interest governance to offer a directory of more firms, both public and private; to link the new LEI to existing, legacy LEI systems already in use; and to show additional data vital to every regulator, such as location and address.

By providing an independent source for authenticated data with features to correct data quality problems and fill in gaps, OrgPedia has the potential to bring down the costs of compliance for both companies and regulators by giving them a reliable source of corporate information in a single place. The hope is also to empower consumers and businesses seeking better data about the corporations with whom they do business and in whom they invest. OrgPedia will also be useful to economists and journalists doing research. Researchers can build their own customized view of the information in the system using a standards-based query interface. OrgPedia also provides a source for data-driven policymaking by enabling regulators to compare data across different agencies and databases.

Currently a prototype with thousands of live records, OrgPedia aims to expand its scope through collaboration with government and business. As part of the project, we have also been exploring how to better serve users who have particular privileges (e.g., government users entitled to see EIN identifiers) or access to paid identifiers (such as DUNS numbers), and we are developing a strategy (legal, technical, policy, financial) for ensuring future growth of the platform and the community that will steward it.

Principal Investigators: Beth Noveck and Jim Hendler


  • The Problem
  • What a Solution Means
  • Our Approach

The Problem

The U.S. and other governments now collect and publish, in reusable formats, a vast amount of data about companies, firms, and organizations. That data is effectively unavailable to the public, however, because the absence of a universal legal entity identifier makes it prohibitively labor-intensive and expensive to analyze, visualize, and compare across jurisdictions, across regulatory agencies, or even within the same agency. To promote corporate performance and accountability, support data-driven regulatory policy and risk management, and enhance economic productivity, there is an urgent need to design an implementation strategy for an open and universal mechanism for identifying organizations and the relationships between them.

The OrgPedia Project is embarking on a process to plan the legal, policy and technology framework for a data exchange that will facilitate comparison of data across regulatory schemes and public reuse and annotation of that data.

The current impediment to organizational data transparency is that there are multiple numbering schemes in place for entity identification, many of them considered inadequate, some of them proprietary, and none of which “talk” to one another.

Most legacy systems today serve specific purposes and user communities, and thus effectively create data “silos” that are inaccessible to one another. For example, the Federal Communications Commission has its own entity identifier called the FCC Registration Number (FRN). There are 1.3 million issued FRNs, which include licensees and also the lawyers to whom the licensees delegate authority. The Securities and Exchange Commission has a numbering scheme for tracking operating companies (the Central Index Key (CIK)) and a different scheme for investment advisors and for funds. The Environmental Protection Agency regulates a much larger number of organizations (2+ million) but it does not have a unique ID for the firm. Instead, it has an identification number for each facility and physical plant. None of these can be compared.

This disparity has arisen because the US government collects data about firms pursuant to specific statutes (and the regulations that implement them). Historically, the practices of collecting and organizing the information collected differ by statutory regime and across regulating organizations. The division of the EPA that manages Clean Air collects information differently from the division that manages Clean Water. There is no cooperation between the states that register companies and the other bodies that regulate them.

In short, every government agency “does its own thing,” collecting (sometimes duplicative) administrative information. In addition, the complexity of corporate structures (e.g., changes in control) presents unique challenges to any effort to analyze or compare datasets. The problem is ever more acute in an era of multinational corporations operating under different names and operating structures in a variety of countries.

Without a common numbering scheme (or a way to get existing schemes to communicate), there can be no collaborative ecosystem of shared data.

Contributing to the problem is that many agencies pay the private firm Dun & Bradstreet to “rent,” on an annualized basis, use of its Data Universal Numbering System (DUNS) to track entities. “The D-U-N-S Number is D&B’s unique, nine-digit, location-specific business identification number, which D&B assigns as a means of identifying and tracking companies globally throughout their lifecycle.” The problem with the DUNS numbering scheme is that it is both unreliable and expensive. Agencies and interest groups complain that there are mistakes in the DUNS framework. It is also estimated that the Federal government spends $52.8 million each year on DUNS numbers. This is on top of the fees companies pay to register and renew their DUNS numbers. Dun & Bradstreet asserts copyright protection in its numbering scheme. This means that whereas an authorized government user can integrate the DUNS numbering scheme into a government system, the numbering scheme may not be made public, thereby hindering the ability of third parties to reuse government data.

This opportunity cost is significant. It is obscuring corporate accountability, impeding more data-driven decision making, journalism and research, and reducing economic productivity.

Without a mechanism for consistently tracking US, multinational and international firms and the relationship between these entities over time, the public cannot easily run comparative analyses across databases and gain insight into the activities of organizations. Combining data requires time-consuming analysis. But with a consistent naming convention, journalists could do empirically-based reporting (sometimes called “computational journalism”) on the activities of corporations; interest groups could track more accurately issues of relevance to them; officials could better assess and streamline regulatory policy to mitigate risk to the public; and researchers could achieve greater understanding of the nature and evolution of firms and organizations.

Right now, if the public wants to find out how much the government spends on contracts with McDonnell Douglas, and to “mash” that up with how much the company has given in campaign contributions or with the company’s environmental track record, it can’t get the full picture, because McDonnell Douglas merged with Boeing and each company is still tracked separately.
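
The merger problem can be made concrete with a small sketch. Assuming a successor table that records which entity survives a merger (the table, the entity keys, and the amounts below are all hypothetical placeholders, not real figures), records filed under the old name can be rolled up under the surviving firm:

```python
# Hypothetical successor table: acquired entity -> surviving entity.
SUCCESSOR = {"mcdonnell-douglas": "boeing"}


def current_entity(entity_id, successor=SUCCESSOR):
    """Follow merger links until reaching an entity with no successor."""
    seen = set()
    while entity_id in successor:
        if entity_id in seen:
            raise ValueError(f"cycle in successor table at {entity_id!r}")
        seen.add(entity_id)
        entity_id = successor[entity_id]
    return entity_id


def totals_by_current_entity(records, successor=SUCCESSOR):
    """Sum amounts under the entity that exists today."""
    totals = {}
    for entity_id, amount in records:
        key = current_entity(entity_id, successor)
        totals[key] = totals.get(key, 0) + amount
    return totals


# Placeholder amounts in arbitrary units, not real contract figures.
contracts = [("mcdonnell-douglas", 100), ("boeing", 250), ("other-firm", 40)]
totals = totals_by_current_entity(contracts)
print(totals)
```

Without the successor table, the first two records would be reported as two unrelated firms; with it, they appear as a single total for the surviving company.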

Similarly, the financial services regulatory community has no common identifier between broker-dealer reports and SEC financial reports, which makes risk assessment error-prone and requires manual identification of the ownership relationships among non-broker-dealer financial industry affiliates and parent companies. Representatives of the EPA reported in the April 8th OrgPedia planning meeting that when the Exxon Valdez disaster happened, it was a huge burden to figure out the company’s compliance record across the agency.

For example, in our March 30th workshop, we learned that the New York Times did a Pulitzer Prize-winning piece on worker death at a manufacturing firm. The prize was a reward for doing the next-to-impossible job of investigating the environmental compliance record of that firm, even though preliminary analysis showed that the firm had been turning in the same toxic release statements to regulators each year rather than developing new figures. With the ability to compare government contractor databases against lobbying records, for example, it will be possible to see when a parent company lobbies and its subsidiary gets a lucrative government contract. Having the ability to overcome the “numbering” problem and work with this data will make it much simpler to get early warnings of corporate malfeasance and enable the public to hold corporations accountable.

What a Solution Means

Solving this problem could also benefit firms, first, by bringing down the costs of regulatory compliance and reducing the paperwork load on regulated entities. In other words, with a consistent way to track who is who in the corporate namespace, eventually agencies can pre-populate fillable forms with name, address and other information obviating the need for regulated entities to fill out the same information over and over again.

Second, this work may also have the added benefit of helping firms to improve their own business practices by enabling them to understand their customers, suppliers and competitors better. Many businesses consume data about firms, especially those in the financial services space whose investing depends on business intelligence. OrgPedia will eventually facilitate increased insight into the behavior and evolution of firms. The same business intelligence that we facilitate through our work will be freely available to companies just as much as to public interest organizations and government. Recent research from MIT shows that data-driven decision making increases corporate productivity by 5-6%.

Third, the hope is that by making government data about companies more transparent and useful, we will spur economic growth among existing companies that depend on business intelligence as well as new entrants into the marketplace who build services on top of the data we are helping to make available. Finally, compliance officers and corporate social responsibility teams within firms will better be able to monitor their performance relative to industry peers, and share and adopt best practices.

In an ideal world, in order to make the data collected by government entities susceptible to comparison and analysis, every company–domestic and international–would have its own number. In addition, each company would have another authenticated number telling us which entity is the parent and which is the subsidiary. There would be consistent fields for facilities and all the other information that each agency collects, and every agency would use consistent nomenclature (Inc. not Incorporated, Co. not Company) across every system.
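
Consistent nomenclature is the simplest of these requirements to illustrate. The sketch below rewrites a trailing legal suffix to one spelling, using only the two conventions named above (Inc. and Co.). The function name and the rule that only the final word is treated as a suffix are assumptions for the example; real company names would need many more cases.

```python
import re

# Suffix spellings mapped to the single preferred form.
SUFFIXES = {
    "incorporated": "Inc.",
    "inc": "Inc.",
    "company": "Co.",
    "co": "Co.",
}


def normalize_name(name):
    """Collapse whitespace and rewrite a trailing legal suffix consistently."""
    words = re.sub(r"\s+", " ", name.strip()).split(" ")
    last = words[-1].rstrip(".,").lower()
    # Only treat the last word as a suffix when a name precedes it.
    if len(words) > 1 and last in SUFFIXES:
        words[-1] = SUFFIXES[last]
        # Drop a comma left before the suffix, as in "Acme, Inc".
        words[-2] = words[-2].rstrip(",")
    return " ".join(words)


for raw in ["Acme  Incorporated", "Acme, Inc", "Widget Company"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Normalizing names this way helps with matching records but does not replace an identifier: two different firms can still share the same normalized name.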

While there are specific efforts underway (and others called for under Dodd-Frank) to create a consistent identifier system, such as the Securities and Exchange Commission’s work with its international counterparts to create a numbering scheme for public companies, these efforts apply in only one domain, for a fraction of entities, and often in a limited geographical range (e.g., UK Companies House only tracks UK companies). They do little to address the comprehensive opportunity before us.

Our Approach

The OrgPedia project aims to design the model for an open and universal data exchange to facilitate comparison of data across regulatory schemes. Since all such numbering ontologies cannot be identified up front, especially if we want a system that works outside the United States, we want to design an extensible system that will encourage others to participate in the ecosystem.

Based on our due diligence, we are starting from the assumption that, in the first instance, we need an infrastructure and a universal exchange language, rather than a uniform identifier, that allow data to be shared and communicated for different purposes. Further, it is clear that the system must combine the best features of open systems, such as wikis, which allow input from an extended user community, with support for the import of “authoritative” data from the SEC, EPA, DOJ, and other government agencies.

This exchange language approach will not require the creation of a new über-database of information about firms. It will allow people with different interests to extract only what they need to know about organizations rather than trying to design one system that’s right for everyone. It will facilitate research by providing a secure way to mash up data across existing databases. It will not require that existing systems be replaced, only be modestly upgraded or augmented by middleware. It will allow for extensibility, so that as new fields are added, the system can adapt. It will move us forward toward the goal of greater corporate transparency, without having to wait until we’ve enacted legislation for a single, legal entity identifier, and then subsequently wait years for its full ubiquitous adoption.
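
One way to picture the exchange-language idea is as a set of links between identifiers that agencies already issue, queried on demand rather than copied into a central database. The following is a minimal sketch under stated assumptions: the scheme labels (`sec:cik`, `fcc:frn`, `epa:facility`) and all identifier values are made up for illustration, and this is not OrgPedia's schema or API.

```python
from collections import defaultdict

# Each link asserts that two (scheme, identifier) pairs refer to the same
# organization. Scheme labels and values are hypothetical.
LINKS = [
    (("sec:cik", "0000000001"), ("fcc:frn", "0000000002")),
    (("fcc:frn", "0000000002"), ("epa:facility", "F-0003")),
    (("epa:facility", "F-0004"), ("sec:cik", "0000000001")),
]


def build_index(links):
    """Store links as an undirected graph over identifiers."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def equivalents(graph, start):
    """Return every identifier reachable from `start`, including itself."""
    seen, stack = {start}, [start]
    while stack:
        for neighbor in graph[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def lookup(graph, start, scheme):
    """Extract only the identifiers the caller needs, for one scheme."""
    return sorted(v for s, v in equivalents(graph, start) if s == scheme)


graph = build_index(LINKS)
facilities = lookup(graph, ("fcc:frn", "0000000002"), "epa:facility")
print(facilities)
```

A new numbering scheme enters this structure as additional links, with no change to the existing entries or to the lookup code, which is the kind of extensibility the approach calls for.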


