Tuesday, July 8, 2008

Protocol Buffers: a flexible, efficient, automated mechanism for serializing structured data

Protocol buffers are Google's data interchange format: a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. You define how you want your data to be structured once, then use generated source code to write and read that structured data to and from a variety of data streams, in a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes that represent those structures in the language of your choice (C++, Java or Python). These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.
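
To make the "get", "set" and single-call serialization idea concrete, here is a minimal sketch in plain Python. It is a hand-written stand-in, not the code the protocol compiler actually generates: the Person class, its two fields, and the method names are all assumptions made up for illustration. The byte layout follows the publicly documented wire encoding (a key of field number and wire type, then a varint or a length-prefixed payload), and only those two wire types are handled.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


class Person:
    """Hypothetical message with two fields: name (field 1) and id (field 2)."""

    def __init__(self) -> None:
        self._name = ""
        self._id = 0

    def get_name(self) -> str:
        return self._name

    def set_name(self, value: str) -> None:
        self._name = value

    def get_id(self) -> int:
        return self._id

    def set_id(self, value: int) -> None:
        self._id = value

    def serialize(self) -> bytes:
        """Write both fields as key/value pairs in a single call."""
        name_bytes = self._name.encode("utf-8")
        return (
            encode_varint((1 << 3) | 2)        # field 1, length-delimited
            + encode_varint(len(name_bytes))
            + name_bytes
            + encode_varint((2 << 3) | 0)      # field 2, varint
            + encode_varint(self._id)
        )

    @classmethod
    def parse(cls, data: bytes) -> "Person":
        """Rebuild a Person from bytes produced by serialize()."""
        person = cls()
        pos = 0
        while pos < len(data):
            key, pos = decode_varint(data, pos)
            field_number, wire_type = key >> 3, key & 0x07
            if field_number == 1 and wire_type == 2:
                length, pos = decode_varint(data, pos)
                person._name = data[pos:pos + length].decode("utf-8")
                pos += length
            elif field_number == 2 and wire_type == 0:
                person._id, pos = decode_varint(data, pos)
            else:
                raise ValueError("field not handled by this sketch")
        return person


person = Person()
person.set_name("Ada")
person.set_id(150)
wire = person.serialize()    # 8 bytes: b"\x0a\x03Ada\x10\x96\x01"
copy = Person.parse(wire)
```

The whole message fits in 8 bytes because field names never travel on the wire, only small numeric keys. With the real toolchain you would not write any of this by hand; the compiler produces the equivalent class from your definition file.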
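
Protocol buffers were originally designed at Google to solve the problems of an older request/response format that was marshalled by hand and explicitly versioned: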
  • New fields could be easily introduced, and intermediate servers that didn't need to inspect the data could simply parse it and pass through the data without needing to know about all the fields.
  • Formats were more self-describing, and could be dealt with from a variety of languages (C++, Java, etc.)
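
The first point above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's implementation: the field numbers, the sample message and the function names are hypothetical, and only two wire types (varint and length-delimited) are handled. Because every field on the wire carries its own number and wire type, a reader built against an older definition can step over fields it has never heard of.

```python
from typing import Iterator, Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (field_number, wire_type, raw payload) for every field, known or not."""
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint value
            start = pos
            _, pos = read_varint(data, pos)
            payload = data[start:pos]
        elif wire_type == 2:                    # length-delimited value
            length, pos = read_varint(data, pos)
            payload = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled by this sketch")
        yield field_number, wire_type, payload


def old_reader_get_id(data: bytes) -> int:
    """A reader that only knows field 2 (an integer id) and ignores the rest."""
    for field_number, wire_type, payload in iter_fields(data):
        if field_number == 2 and wire_type == 0:
            value, _ = read_varint(payload, 0)
            return value
        # Unknown fields are skipped; their wire type says how many bytes to jump.
    return 0


# A message from a newer writer: fields 1 and 2, plus a field 3 added later.
new_message = (
    b"\x0a\x03Ada"        # field 1, string "Ada"
    + b"\x10\x96\x01"     # field 2, varint 150
    + b"\x1a\x05a@b.c"    # field 3, string the old reader does not know about
)

old_id = old_reader_get_id(new_message)
field_numbers = [number for number, _, _ in iter_fields(new_message)]
```

Here old_id comes out as 150 even though the message contains a field the reader was never built for. An intermediate server can go one step further and forward new_message byte for byte, without inspecting the fields at all.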

As the system evolved, it acquired a number of other features and uses:

  • Automatically-generated serialization and deserialization code avoided the need for hand parsing.
  • In addition to being used for short-lived RPC (Remote Procedure Call) requests, people started to use protocol buffers as a handy self-describing format for storing data persistently (for example, in Bigtable).
  • Server RPC interfaces started to be declared as part of protocol files, with the protocol compiler generating stub classes that users could override with actual implementations of the server's interface.

Protocol buffers are now Google's lingua franca for data: at the time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.

Find more information about protocol buffers here:
  • http://google-opensource.blogspot.com/2008/07/protocol-buffers-googles-data.html
  • http://code.google.com/apis/protocolbuffers/docs/overview.html
