Tuesday, November 16, 2010

GFS the Google File System in 199 Lines of Python

GFS, the Google File System, sits as the backbone of the entire Google infrastructure. However, for many it is a mystery, especially for those lucky enough to be more acquainted with high-level python code than low-level C operating system sources. But have no fear, we shall break through the veil and describe an implementation of GFS in 199 lines of python. Naturally, you may want to read about the theory and design of GFS in the original Google Research GFS paper. But we will aim to give the core concepts, with real working python code, in this article. Complete runnable python 2.6 source code, if you wish it, is available at the end of this post.


A brief summary of GFS is as follows. GFS consists of three components: a client, a master, and one or more chunkservers. The client is the only user-visible, that is programmer-accessible, part of the system. It functions similarly to a standard POSIX file library. The master is a single server that holds all metadata for the filesystem. By metadata we mean the information about each file, its constituent components called chunks, and the location of these chunks on various chunkservers. The chunkservers are where the actual data is stored, and the vast majority of network traffic takes place between the client and the chunkservers, to avoid the master as a bottleneck. We will give more detailed descriptions below by going through the GFS client, master, and chunkserver as implemented in python classes, and close with a test script and its output.

The client class is the only user-visible portion of the GFS library. It mediates all requests between the client for filesystem access and the master and chunkservers for data storage and retrieval. It is important to note that GFS appears very familiar to programmers of normal filesystems, there is no distributed knowledge required. All of this is abstracted away behind the client implementation. Of course there are some exceptions to this such as the localized chunk knowledge used to allocate processing of files most efficiently, such as in the map reduce algorithm, but we have avoided such complexity in this implementation. What is most critical is to note how the normal read, write, append, exist, and delete calls are available in their common forms, and how these are implemented by the client class; we also simplify open, close and create by subsuming them under the previous methods. The gist of each method is the same: ask the master for the metadata including chunk IDs and chunk locations on the chunkservers, then update any necessary metadata with the master, and finally transaction actual data flow only with the chunkservers.

class GFSClient:
    def __init__(self, master):
        self.master = master

    def write(self, filename, data): # filename is full namespace path
        if self.exists(filename): # if already exists, overwrite
            self.delete(filename)
        num_chunks = self.num_chunks(len(data))
        chunkuuids = self.master.alloc(filename, num_chunks)
        self.write_chunks(chunkuuids, data)

    def write_chunks(self, chunkuuids, data):
        chunks = [ data[x:x+self.master.chunksize] \
            for x in range(0, len(data), self.master.chunksize) ]
        chunkservers = self.master.get_chunkservers()
        for i in range(0, len(chunkuuids)): # write to each chunkserver
            chunkuuid = chunkuuids[i]
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunkservers[chunkloc].write(chunkuuid, chunks[i])

    def num_chunks(self, size):
        return (size // self.master.chunksize) \
            + (1 if size % self.master.chunksize > 0 else 0)

    def write_append(self, filename, data):
        if not self.exists(filename):
            raise Exception("append error, file does not exist: " \
                 + filename)
        num_append_chunks = self.num_chunks(len(data))
        append_chunkuuids = self.master.alloc_append(filename, \
            num_append_chunks)
        self.write_chunks(append_chunkuuids, data)

    def exists(self, filename):
        return self.master.exists(filename)

    def read(self, filename): # get metadata, then read chunks direct
        if not self.exists(filename):
            raise Exception("read error, file does not exist: " \
                + filename)
        chunks = []
        chunkuuids = self.master.get_chunkuuids(filename)
        chunkservers = self.master.get_chunkservers()
        for chunkuuid in chunkuuids:
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunk = chunkservers[chunkloc].read(chunkuuid)
            chunks.append(chunk)
        data = reduce(lambda x, y: x + y, chunks) # reassemble in order
        return data

    def delete(self, filename):
        self.master.delete(filename)

The master class simulates a GFS master server. This is where all the metadata is stored, the core node of the entire system. Client requests initiate with the master, then after metadata is retrieved, they talk directly to the individual chunkservers. This avoids the master being a bottleneck as the metadata is typically short and low latency. The metadata is implemented as a series of dictionaries, although in a real system you'd have filesystem backing of the dicts. The notification of chunkservers becoming available and unavailable via heartbeats, chunkserver authentication and localization info for efficient storage are all simplified here so that the master itself is allocating chunkservers. However we still preserve the direct client read/write to the chunkservers, bypassing the master, to show how the distributed system is working.

class GFSMaster:
    def __init__(self):
        self.num_chunkservers = 5
        self.max_chunkservers = 10
        self.max_chunksperfile = 100
        self.chunksize = 10
        self.chunkrobin = 0
        self.filetable = {} # file to chunk mapping
        self.chunktable = {} # chunkuuid to chunkloc mapping
        self.chunkservers = {} # loc id to chunkserver mapping
        self.init_chunkservers()

    def init_chunkservers(self):
        for i in range(0, self.num_chunkservers):
            chunkserver = GFSChunkserver(i)
            self.chunkservers[i] = chunkserver

    def get_chunkservers(self):
        return self.chunkservers

    def alloc(self, filename, num_chunks): # return ordered chunkuuid list
        chunkuuids = self.alloc_chunks(num_chunks)
        self.filetable[filename] = chunkuuids
        return chunkuuids

    def alloc_chunks(self, num_chunks):
        chunkuuids = []
        for i in range(0, num_chunks):
            chunkuuid = uuid.uuid1()
            chunkloc = self.chunkrobin
            self.chunktable[chunkuuid] = chunkloc
            chunkuuids.append(chunkuuid)
            self.chunkrobin = (self.chunkrobin + 1) % self.num_chunkservers
        return chunkuuids

    def alloc_append(self, filename, num_append_chunks): # append chunks
        chunkuuids = self.filetable[filename]
        append_chunkuuids = self.alloc_chunks(num_append_chunks)
        chunkuuids.extend(append_chunkuuids)
        return append_chunkuuids

    def get_chunkloc(self, chunkuuid):
        return self.chunktable[chunkuuid]

    def get_chunkuuids(self, filename):
        return self.filetable[filename]

    def exists(self, filename):
        return True if filename in self.filetable else False

    def delete(self, filename): # rename for later garbage collection
        chunkuuids = self.filetable[filename]
        del self.filetable[filename]
        timestamp = repr(time.time())
        deleted_filename = "/hidden/deleted/" + timestamp + filename
        self.filetable[deleted_filename] = chunkuuids
        print "deleted file: " + filename + " renamed to " + \
             deleted_filename + " ready for gc"

    def dump_metadata(self):
        print "Filetable:",
        for filename, chunkuuids in self.filetable.items():
            print filename, "with", len(chunkuuids),"chunks"
        print "Chunkservers: ", len(self.chunkservers)
        print "Chunkserver Data:"
        for chunkuuid, chunkloc in sorted(self.chunktable.iteritems(), key=operator.itemgetter(1)):
            chunk = self.chunkservers[chunkloc].read(chunkuuid)
            print chunkloc, chunkuuid, chunk

The chunkserver class is the smallest in this project. This represents an actual distinct box running in a massive datacenter, connected to a network reachable by the master and client. In GFS, the chunkservers are relatively "dumb" in that they know only about chunks, that is, the file data broken up into pieces. They don't see the whole picture of the entire file, where it is across the whole filesystem, the associated metadata, etc. We implement this class as a simple local storage, which you can examine after running the test code by looking at the directory path "/tmp/gfs/chunks". In a real system you'd want persistent storage of the chunk info for backup.

class GFSChunkserver:
    def __init__(self, chunkloc):
        self.chunkloc = chunkloc
        self.chunktable = {}
        self.local_filesystem_root = "/tmp/gfs/chunks/" + repr(chunkloc)
        if not os.access(self.local_filesystem_root, os.W_OK):
            os.makedirs(self.local_filesystem_root)

    def write(self, chunkuuid, chunk):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "w") as f:
            f.write(chunk)
        self.chunktable[chunkuuid] = local_filename

    def read(self, chunkuuid):
        data = None
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "r") as f:
            data = f.read()
        return data

    def chunk_filename(self, chunkuuid):
        local_filename = self.local_filesystem_root + "/" \
            + str(chunkuuid) + '.gfs'
        return local_filename

We use main() as a test for all the client methods, including exceptions. We first create a master and client object, then write a file. This write is performed by the client in the same way as the real GFS: first it gets the chunk metadata from the master, then writes chunks directly to each chunkserver. Append functions similarly. Delete is handled in the GFS fashion, where it renames the file to a hidden namespace and leaves it for later garbage collection. A dump displays metadata content. Note that this is a single-threaded test, as this demonstration program does not support concurrency, although that could be added with appropriate locks around the metadata.

def main():
    # test script for filesystem

    # setup
    master = GFSMaster()
    client = GFSClient(master)

    # test write, exist, read
    print "\nWriting..."
    client.write("/usr/python/readme.txt", """
        This file tells you all about python that you ever wanted to know.
        Not every README is as informative as this one, but we aim to please.
        Never yet has there been so much information in so little space.
        """)
    print "File exists? ", client.exists("/usr/python/readme.txt")
    print client.read("/usr/python/readme.txt")

    # test append, read after append
    print "\nAppending..."
    client.write_append("/usr/python/readme.txt", \
        "I'm a little sentence that just snuck in at the end.\n")
    print client.read("/usr/python/readme.txt")

    # test delete
    print "\nDeleting..."
    client.delete("/usr/python/readme.txt")
    print "File exists? ", client.exists("/usr/python/readme.txt")

    # test exceptions
    print "\nTesting Exceptions..."
    try:
        client.read("/usr/python/readme.txt")
    except Exception as e:
        print "This exception should be thrown:", e
    try:
        client.write_append("/usr/python/readme.txt", "foo")
    except Exception as e:
        print "This exception should be thrown:", e

    # show structure of the filesystem
    print "\nMetadata Dump..."
    print master.dump_metadata()

And putting it all together, here is the output of the test script run from the python interpreter. Pay special attention to the master metadata dump at the end, where you can see how the chunks are spread across chunkservers in jumbled order, only to be reassembled by the client in the order specified by the master metadata.

$ python gfs.py

Writing...
File exists?  True

        This file tells you all about python that you ever wanted to know.
        Not every README is as informative as this one, but we aim to please.
        Never yet has there been so much information in so little space.


Appending...

        This file tells you all about python that you ever wanted to know.
        Not every README is as informative as this one, but we aim to please.
        Never yet has there been so much information in so little space.
        I'm a little sentence that just snuck in at the end.


Deleting...
deleted file: /usr/python/readme.txt renamed to /hidden/deleted/1289928955.7363091/usr/python/readme.txt ready for gc
File exists?  False

Testing Exceptions...
This exception should be thrown: read error, file does not exist: /usr/python/readme.txt
This exception should be thrown: append error, file does not exist: /usr/python/readme.txt

Metadata Dump...
Filetable: /hidden/deleted/1289928955.7363091/usr/python/readme.txt with 30 chunks
Chunkservers:  5
Chunkserver Data:
0 f76734ce-f1a7-11df-b529-001d09d5b664 mation in
0 f7671750-f1a7-11df-b529-001d09d5b664  you ever
0 f7670bd4-f1a7-11df-b529-001d09d5b664
        T
0 f767b656-f1a7-11df-b529-001d09d5b664 le sentenc
0 f7672182-f1a7-11df-b529-001d09d5b664  is as inf
0 f7672b0a-f1a7-11df-b529-001d09d5b664 se.

1 f767b85e-f1a7-11df-b529-001d09d5b664 e that jus
1 f76736b8-f1a7-11df-b529-001d09d5b664 so little
1 f767193a-f1a7-11df-b529-001d09d5b664 wanted to
1 f7670f3a-f1a7-11df-b529-001d09d5b664 his file t
1 f767236c-f1a7-11df-b529-001d09d5b664 ormative a
1 f7672cf4-f1a7-11df-b529-001d09d5b664   Never ye
2 f7671b2e-f1a7-11df-b529-001d09d5b664 know.

2 f7671142-f1a7-11df-b529-001d09d5b664 ells you a
2 f767ba48-f1a7-11df-b529-001d09d5b664 t snuck in
2 f7672556-f1a7-11df-b529-001d09d5b664 s this one
2 f76738e8-f1a7-11df-b529-001d09d5b664 space.

2 f7672ee8-f1a7-11df-b529-001d09d5b664 t has ther
3 f767bcb4-f1a7-11df-b529-001d09d5b664  at the en
3 f7671d40-f1a7-11df-b529-001d09d5b664     Not ev
3 f76730d2-f1a7-11df-b529-001d09d5b664 e been so
3 f7673adc-f1a7-11df-b529-001d09d5b664
3 f767135e-f1a7-11df-b529-001d09d5b664 ll about p
3 f7672740-f1a7-11df-b529-001d09d5b664 , but we a
4 f7672920-f1a7-11df-b529-001d09d5b664 im to plea
4 f767b3f4-f1a7-11df-b529-001d09d5b664 I'm a litt
4 f767bea8-f1a7-11df-b529-001d09d5b664 d.

4 f7671552-f1a7-11df-b529-001d09d5b664 ython that
4 f76732e4-f1a7-11df-b529-001d09d5b664 much infor
4 f7671f8e-f1a7-11df-b529-001d09d5b664 ery README

Now of course we are lacking some of the complexities of GFS necessary for a fully functional system: metadata locking, chunk leases, replication, master failover, localization of data, chunkserver heartbeats, deleted file garbage collection.  But what we have here demonstrates the gist of GFS and will help give you a better understanding of the basics.  It can also be a starting point for your own explorations into more detailed distributed filesystem code in python.

Full source code can be downloaded here: gfs.py

11 comments:

  1. def num_chunks(self, size):
    return (size + self.master.chunksize - 1) // self.master.chunksize

    ReplyDelete
  2. Tried several anti-virus programs, none of them have helped at all. Options?

    ReplyDelete
  3. how to implement the realization/download of google ebooks from onix?

    ReplyDelete
  4. Why when I use Google and search for something the first page that comes up is a file on my system.?

    ReplyDelete
  5. How do I save a not responding Google Sketchup 8 file?

    ReplyDelete
  6. when trying to play a worldwinner.com game i get three errors error downloading file,error retrieving game con?

    ReplyDelete
  7. Fantastic post. Here’s a tool that lets you build your cloud database apps without coding http://www.caspio.com/

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. I attempted something similar and here's it is: https://superuser.blog/distributed-file-system-python/
    It's on GitHub also :)

    ReplyDelete