Publishing Open Data – Do you really need an API?

As open data gains momentum, an increasing number of organizations are thinking about ways to make their data available for others to use. Here are some thoughts on how to approach design issues when publishing open government data.

TL;DR: See if it is possible to publish your open data as file dumps instead of building advanced APIs that force entrepreneurs to integrate their apps with your infrastructure.

A fictional background

It was supposed to be a regular day for John at the server facility of the government weather agency. But when he came in to work that morning, his colleague Mike was in a panic: “Look! We are in the middle of a DDoS attack. The API server is flooded and the database server is on its knees. The meteorologists cannot work.” John started looking at the server logs. Between 7 and 8 a.m. there was a sharp increase in traffic: loads of API calls were made from a lot of different IPs. Then, all of a sudden, server load decreased and everything was back to normal.

What happened?

A year ago, the weather agency had started to make its data available as part of the agency’s open government initiative. They were in a rush at the time and had decided to create an API for their weather data by setting up an internet-facing API server. The API design tried to take into account potential use cases that entrepreneurs might have, but it was hard to know what people wanted. They had settled on three generic API calls.

Fast forward a year, and they discover that an entrepreneur has built a very successful mobile app used by several hundred thousand people. Every morning it wakes you up by announcing today’s weather. To get the necessary data, the app has to make two API calls, which every installed copy did each morning to wake its owner. That promptly crashed the API server, which wasn’t designed to cope with the load.

Let’s call this the direct integration API model.

The Direct Integration API Model

Forcing developers to integrate their apps directly with an agency API may have bad consequences

The direct integration API model has several drawbacks:

  1. Since the APIs [2] were designed for direct integration, this is what developers did. The government agency is now (unknowingly) a critical component in keeping the apps working.
  2. Because API calls are forwarded to the one and only database server, internal applications are affected when there is high API load [1].
  3. High load from one application will impact all other applications using the same API infrastructure.
  4. The APIs were designed from hypothetical use case scenarios, forcing applications to make multiple API calls to get the data they need.

Even if the data were offloaded to a separate database server to isolate external load from internal systems, developers would still need to rely on the capacity of the API.

A better way to publish open government data

An alternative to the direct integration API model is to publish data dumps as files. “Boring!” may be the initial reaction from developers, but they will thank you later. In this model, data from the database is exported, transformed to an open, readable format [1] (e.g. CSV), properly named and stored on the web server [2]. This means entrepreneurs can get all of your data, load it into their own systems and design their APIs according to their use cases. Also, high loads will hit their own infrastructure without affecting other apps.
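
To make this concrete, here is a minimal sketch of such a nightly export job. It assumes, purely for illustration, a SQLite database with an observations table and a web root served as static files; all database, table, column and path names are hypothetical.

    # export_dump.py - nightly job: export yesterday's observations to a
    # dated CSV file that is then served as a plain static download.
    # Database, table, column and path names are hypothetical.
    import csv
    import sqlite3
    from datetime import date, timedelta

    DB_PATH = "weather.db"                      # assumed internal database
    WEB_ROOT = "/var/www/data/weather/country"  # served at data.example.com

    def export_daily_dump(day):
        conn = sqlite3.connect(DB_PATH)
        rows = conn.execute(
            "SELECT station, observed_at, temperature_c, wind_mps "
            "FROM observations WHERE date(observed_at) = ?",
            (day.isoformat(),),
        )
        path = f"{WEB_ROOT}/{day.isoformat()}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["station", "observed_at", "temperature_c", "wind_mps"])
            writer.writerows(rows)
        conn.close()
        return path

    if __name__ == "__main__":
        # export yesterday's data; run from cron once a night
        print(export_daily_dump(date.today() - timedelta(days=1)))

A cron job and a plain web server is all the infrastructure this model needs.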

As an added bonus, it is very simple to publish data dumps on a web server.

Publishing data files makes it possible for developers to design and set up their own API

If files and URLs are named consistently it is easy for entrepreneurs to pick up data over time (e.g. http://data.example.com/weather/country/2012-03-01.csv). An alternative is to create an API designed for mirroring data and its changes (e.g. event sourcing over Atom).
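
On the consumer side, a predictable naming scheme means a local mirror can be kept current with a trivial loop. A sketch, using the hypothetical base URL and file layout from above:

    # fetch_dumps.py - keep a local mirror of the daily dumps by walking
    # the predictable URL scheme (base URL and layout are hypothetical).
    import urllib.error
    import urllib.request
    from datetime import date, timedelta
    from pathlib import Path

    BASE = "http://data.example.com/weather/country"

    def mirror(since, dest):
        dest.mkdir(parents=True, exist_ok=True)
        day = since
        while day <= date.today():
            target = dest / f"{day.isoformat()}.csv"
            if not target.exists():  # fetch only the files we are missing
                try:
                    urllib.request.urlretrieve(f"{BASE}/{day.isoformat()}.csv", target)
                except urllib.error.HTTPError:
                    pass  # dump not published (yet); pick it up on the next run
            day += timedelta(days=1)

    if __name__ == "__main__":
        mirror(date(2012, 3, 1), Path("mirror"))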

These two models give us some background for the design considerations when publishing open data.

Design considerations

1. Do you really need an API? API projects can become expensive and typically compete with other IT projects that may have a higher priority. Also, designing an API involves making decisions on API use cases. Do you know how users will use your data? Will your API design prevent users from making efficient use of your data in their applications? What is your plan to cope with load?

2. Make it easy for entrepreneurs to keep a local copy of your data up to date. With wisely named data dumps it is simple to keep a local database current. More advanced scenarios include using existing protocols and technologies for low-latency distribution (e.g. PubSubHubbub; see the first sketch after this list).

3. Isolate internal systems from the effects of external data publishing.

4. Make sure you can change your technology without breaking URLs. People are building software that depends on your URLs. Don’t force them to rewrite their software just because you are switching to a new platform. An early warning sign is the presence of platform-specific fragments like “aspx” or “jsp” in your URLs. Get rid of those (see the second sketch after this list).
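
On the second consideration: PubSubHubbub pushes change notifications to subscribers, but even plain HTTP has a cheap pull-based fallback, the conditional GET. A client remembers the Last-Modified header it last saw, and the server answers 304 Not Modified when nothing has changed, so frequent polling stays cheap for both sides. A minimal sketch (the function name is hypothetical):

    # conditional_get.py - re-download a dump only when it has changed.
    import urllib.error
    import urllib.request

    def fetch_if_modified(url, last_modified=None):
        """Return (body, last_modified); body is None if unchanged."""
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:  # 304 Not Modified: keep the local copy
                return None, last_modified
            raise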
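
On the fourth consideration: when you do switch platforms, old platform-specific URLs can be kept alive with permanent redirects to the clean ones. How you configure this depends on your web server; the idea, sketched here as a tiny, hypothetical WSGI app, is simply a 301 from the old address to the new one:

    # redirects.py - map legacy platform-specific URLs onto clean,
    # technology-neutral ones with a permanent redirect (a sketch).
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        path = environ["PATH_INFO"]
        for suffix in (".aspx", ".jsp"):
            if path.endswith(suffix):
                clean = path[: -len(suffix)]  # /weather/today.aspx -> /weather/today
                start_response("301 Moved Permanently", [("Location", clean)])
                return [b""]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()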

There are of course other things to take into consideration, such as semantic descriptions, but that is a matter for a later post.

Hopefully this will save you both money and time, as file exports may be a lot cheaper than creating APIs. What do you think?

Also see: Publishing Open Government Data by W3C

In the next blog post I will look at cases where an API makes a lot of sense (real-time data, collaborative processes).

Comments:

  • http://twitter.com/ldodds Leigh Dodds

    Hi Peter,
    Really interesting post as it highlights a number of different issues. Firstly there’s the issue of sustainability around data access provision. The owner of the data may not be best placed to provide the kind of infrastructure required to support a successful API.

Decoupling public and private access to data, both to remove issues of external usage impacting internal operations and to provide scope for entrepreneurs to offer additional value-added services over data, is a good design and operational decision.

I totally agree that open government data needs to be available in a number of different forms. Raw data dumps provide decoupling but also enable easier large-scale analysis of the data. Appropriately designed, scalable APIs can ensure that re-users get access to current (I hesitate to say “live”) data. Approaches to synchronization or syndication fill the gaps in between.

But reading your post I was reminded of this article by Roy Fielding on designing web resources to support “RESTful” APIs:

    http://roy.gbiv.com/untangled/2008/paper-tigers-and-hidden-dragons

Really, an API is just a resource. With appropriate resource design you can have several different ways to consume data that embrace the need for different views over it. APIs don’t have to be backed by live database infrastructure; appropriate layers of caching are essential, and even generated flat files are an equally appropriate way to deliver data.

    To my mind your time-stamped CSV downloads are still an API; they’re all resources. It’s just that typically when designing these systems we equate an API with “live queries onto a dataset”, when that doesn’t have to be the case.

    Cheers,

    L.

  • http://www.peterkrantz.com/ Peter Krantz

    Leigh, I agree. API is a very broad term including properly named CSV dumps. At the center is the use case you want to support with your API. A first step may be to try data dumps. Next up may be to publish resources related to other resources in the world (compare to the 5-star model).

But maybe a distinction should be made between “functional APIs” (as in “get me all reports containing the phrase ‘banana’ from last month”) and resource exposure/syndication (as in “here is an Atom feed with updates of our resources”).

  • http://twitter.com/Nik_G Nik Garkusha

    Does Government really need an API?

It depends: type of data/sets, size, relationships, infrastructure, skills to support, frequency of updates, end-use scenarios, etc.

Some thoughts on where APIs offer an advantage:

    1. Querying large datasets for relevant bits of data (15GB dataset vs. a 100KB slice of that data)

    2. Real-time or frequent update scenarios (GPS bus tracking, current weather vs your historic data scenario)

    3. Exposing relationships – the agency is best suited for defining/exposing relationships in the data vs. meta-data (a slice of multiple data sets with relationships captured)

    4. Powering Gov’t own applications for citizens (using own APIs for visualizing/interacting w/ data)

    You can still download data via APIs that are properly designed.

However, not having an API should not be a roadblock to publishing data. Downloads are fine for the most common scenarios, but APIs offer the next level of dynamic platform for open gov data.

  • http://www.peterkrantz.com/ Peter Krantz

    Agreed. In the next blog post I will look at cases where an API makes a lot of sense (real time data, collaborative processes) as you describe above.

  • Pingback: Do you need an API? It depends! | OpenHalton

  • Henrik Olsson

Fortunately, there is a simple and powerful solution in the cloud: Windows Azure Marketplace (https://datamarket.azure.com/). Data owners can publish to this platform and decide whether, and if so how much, they want to charge consumers. Billing is handled by the platform. Developers can use a REST-based API (OData) to consume the data.

At Softronic, we leverage this platform in our offering http://www.offentligadata.se/.

  • Pingback: To API, or Not to API « Civic Innovations

  • peterkz

Charging for data does not make it open. Also, getting access to data from Azure involves accepting license terms with Microsoft and forces developers to integrate with APIs at a URL that the government agency does not control.

So, if the agency wants to switch to a different platform, all entrepreneurs need to update their applications (see above).

    Some minor changes to the platform could make it a great data hosting service though:

    1) Ability to host it under a URL the Agency is in control of (e.g. data.example.com)

    2) Making it possible to use the data without entering into an agreement with a third party (e.g. Microsoft)

  • http://twitter.com/herbwatkins Herb Watkins

I’d add that with the model you are proposing, the entrepreneur has the flexibility of optimizing the data model for the use case. I can see instances where data transformation and NoSQL databases might be optimal.

  • peterkz

    Precisely my point. API design should be based on use cases. Entrepreneurs would typically know more about their particular use case (and it probably varies between them).

Without the power to design the API, you may be forced to make unnecessary API calls to get the data you want. That could both impact the user experience and lead to unnecessary load.

  • http://apievangelist.com kinlane

    Definitely a question every open data initiative should ask.  What formats would be most valuable for our users?  Can we afford to scale this?

I get a lot of government agencies who oppose using APIs because of exactly the questions you pose…and because they see APIs as a gateway to the “open” part of what they are trying to do.

One other piece I would add on the pro-API side is providing real-time metrics on how your data is being used, not just by your developers but by their users. Sure, there is overhead in these operations, but the insight gained from knowing how people use your data can go a long way in helping you better acquire, onboard, structure and deliver your data in the future.

I was talking with the data guys from the Census…and they do just as you say: provide a healthy download infrastructure with no API. I started talking about the myriad ways their data is being used out there by NYT, InfoChimps, Google and others…and they were blown away. They said they’d love to build relationships with people to understand more, and get insight into how to change their gathering process…to better suit.

So I think that while there is a lot of overhead in delivering an API…and it is not suitable for all initiatives…APIs can build closer relationships with developers and end users that may move your initiative in new directions. Ways you will never know about when just offering a download.

  • http://twitter.com/lukec Luke Closs

In your example above, the entrepreneur is making a very bold and risky move by basing their user experience on a service they do not control. Just because an API is there does not mean an app builder must use it in that hard-coupled fashion.

A smarter app developer would probably have a central server hit the API once a night and then expose an internal API for the apps, optimized for speed and minimal data use.

Providing an API is a risky thing too, because what happens if people USE IT? You need a strategy for dealing with all the scalability and abuse problems that come with APIs. Don’t blame users for (ab)using what you provide to them.

  • http://www.3scalesolutions.net stevenwillmott

Interesting post Peter, and I agree with Leigh. Essentially, whatever delivery infrastructure you have *is* an API – it may not be REST, SOAP or whatever – but it’s a managed data source.

The important things to get right about the data source are: keeping it up to date, trying to avoid it disappearing, making sure usage rules are clear, documenting where the data comes from, etc.

Sometimes a REST or SOAP API is the best way to deliver the data (e.g. real-time traffic data) and sometimes flat files updated every 6M are fine – really, the most important thing is that the “endpoint” delivers the data in a dependable way.

  • peterkz

There are two problems. One is the separation of API load from internal systems, where I of course agree. The other is whether a government agency should spend taxpayers’ money on scaling infrastructure to support load generated by apps built by entrepreneurs. The added side effect is that poor load capacity will affect all entrepreneurs. Not so sure about this one.

  • peterkz

Great input! Not sure I agree, though. You can build a community around your data independently of how you expose it. Agencies already do this by running mailing lists and Facebook groups and by participating in hack events. Having an open API does not automatically create relationships with developers.

  • http://apievangelist.com kinlane

i agree you can build a community without an API. not at all implying that. but you can gain insight through usage that won’t always be volunteered by your community. patterns devs won’t see and volunteer up, and that won’t be available if the data is only downloaded: patterns of consumption, app usage, design patterns that can go into future data gathering and delivery…making for better data all around. i don’t see any of this as either/or…just more questions that need to be considered…just as you pose.

  • Pingback: Government Data: Web APIs vs. Bulk Data Files

  • Christopher Pontac

It’s great to see the interest this has generated.

It seems to me that data dumps offer an API too. The question you are really asking is what sort of API to have – should we have an API with a single operation that fetches a complete dataset, or one with many operations that allows consumers to address identified elements within the dataset? You could, of course, have both.

Whatever you do, you will almost certainly not want to offer the data directly from the operational database, whether you’re offering a fetch-it-all API or something finer-grained. Publishing operational data is a very different exercise from running the operation which generates it. If your data-freshness NFRs for publishing really demand immediate public access to operational data, then you need to think again.

    I think that there are a couple of other considerations when deciding what sort of APIs to have. 

One is that you need to make a judgement as to how important ease of use is to your target consumers. A decision which lifts a burden from the data publisher will place that burden on the data consumer. If it’s easier for the publisher, it’s harder for the consumer. This may or may not be important in a particular case; however, I think that if you want to encourage *everyone* to use your data, you should think very hard about how to make the data easy to use. These days an awful lot of development happens in small and very small enterprises (one-developer and few-developer bands). These are very sensitive to ease-of-use considerations. If you choose to dump your data because it saves you trouble, you should be sure that the trouble you’ve transferred to your target consumers does not actually deter them. Complex data transformations and geographical processing, for example, can represent considerable amounts of work.

The other consideration is API design. A badly designed API is a huge deterrent to use. A data dump (with a single-operation fetch-it-all API) will be easier to use than a difficult API. A dump allows potential consumers to test their transformations against the entire dataset until they’re happy, and doesn’t demand the use of awkward protocols (SOAP and WS-* are a barrier to many). And, as you say, there are the non-functional disadvantages of a fine-grained API to do with hard-to-control load (from the publisher’s point of view) and the third-party dependency (from the consumer’s point of view). The former is the most important, as everyone has indicated. After all, an application which uses a transformed dataset will still be dependent on the hosting for it, wherever that is.

    If you want to encourage the widest possible use of the data by third-party developers, you will indeed do your best to avoid having to predict the uses to which it will be put.  You may also think that your data is most useful if you can put the weight of other published data behind it. 

    I think that publishing your data according to Linked Data principles is one way to make it attractive to developers without constraining its use.  Hosting it on a platform designed to cope with fluctuating load is a way to manage the unpredictable load problem.

    The text-book advantages of Linked Data are all very helpful in the ease-of-use considerations:

- It is schema-less. A consumer can integrate data from many sources without the publishers having to come to detailed agreements about data schemas. No transformations are needed. Two Linked datasets can always be combined if they have any points of reference in common.

- You gain public, stable and useful identifiers (URIs) for the things which are important to you. A machine which harvests Linked Data can be sure, for instance, that the bus stop which the Department of Transport refers to in its data is the same one which that restaurant says is outside.

- Data is retrieved over HTTP. Much easier than SOAP or WS-*, and in line with the trend in public APIs.

- Data is self-documenting – the data model stays with the data.

    - The publisher can exercise as fine a degree of access control and access tracking as they need to.

- The components needed to manage and query Linked Data all follow open standards (RDF, OWL and SPARQL). There are free tools which support them.

    - A sophisticated consumer can add their own data by making inferences on the data using those same tools.

    There are plenty of times when a data dump will be a good choice.  I suggest however, that you should consider publishing your data as Linked Data before making your decision.

    Cheers,

    Christopher Pontac
    Technical Architect SciSys UK Ltd
    christopher.pontac@scisys.co.uk

    (I posted something similar to this at sunlightlabs’ blog as well)

  • Daniel Bennett

This discussion of whether to use an API or just expose the data in well-structured formats at predictable and/or discoverable permanent URLs needs to be fleshed out. First, if the data is in a relational database, picking a method to instantiate the data as static files raises many of the same issues as developing an API, including how to present the data.

In addition, an API does one major thing that static files do not: provide search results. By adding search to a system that otherwise presents the data as static files, you are creating an API of sorts. Searches and APIs generally share the need for an indexed database.

If new data is published in static files as fast as an API can surface it, then the API has no advantage in speed of access. If the index for the API, or for a search over otherwise static files, is up to date, then again there is no advantage in speed.

However, there are some potentially huge advantages to static files.
* Indexes can be external to the publishing site; for example Google, or any person with some resources, can scrape or download the files for their own indexing or application.
* Static files are more likely to have permanent URLs and (if in HTML/XML) fragment URLs, which can be used not just for linking but as metadata object IDs and for external digital signatures that provide better authentication. Since a RESTful API can provide the same capabilities, the real issue is the sustainability of the URLs and server-side technology and the stability of the database being queried.

I would prefer that data be published as well-documented, valid HTML, or as XML with an XSLT into HTML. That way the URLs could be permanent, the data would be easy for non-technical viewers to see in a browser, the data could be better cited and authenticated, and those of us who see a future of pulling data live into apps would be helped. Also, lots of data is not simply tabular, but is better represented as objects, rich documents, richly semantic data and/or graphics.
    Daniel Bennett
    daniel@citizencontact.com

  • Pingback: Implicaciones técnicas de las iniciativas Open Data | Carlos Iglesias' Web Whisperer

  • Pingback: Cuando el diseño se encuentra con el Open Data | Carlos Iglesias' Web Whisperer

  • carterson2

How about WSDLs? I am struggling with picking a good API for use at wikispeedia.org.