A brief summary of GFS is as follows. GFS consists of three components: a client, a master, and one or more chunkservers. The client is the only user-visible, that is, programmer-accessible, part of the system; it functions much like a standard POSIX file library. The master is a single server that holds all metadata for the filesystem, meaning the information about each file, its constituent pieces called chunks, and the location of these chunks on the various chunkservers. The chunkservers are where the actual data is stored, and the vast majority of network traffic takes place between the client and the chunkservers, to keep the master from becoming a bottleneck. We give more detailed descriptions below by walking through the GFS client, master, and chunkserver as implemented in Python classes, and close with a test script and its output.
The client class is the only user-visible portion of the GFS library. It mediates all requests between the caller, who wants filesystem access, and the master and chunkservers, which handle data storage and retrieval. It is important to note that GFS appears very familiar to programmers of ordinary filesystems; no knowledge of the distributed internals is required, since all of that is abstracted away behind the client implementation. There are exceptions, such as exposing chunk locality so that processing can be scheduled near the data, as in the MapReduce algorithm, but we have avoided such complexity in this implementation. What is most critical is to note how the usual read, write, append, exists, and delete calls are available in their common forms, and how the client class implements them; we also simplify open, close, and create by subsuming them under the previous methods. The gist of each method is the same: ask the master for the metadata, including chunk IDs and their locations on the chunkservers, update any necessary metadata with the master, and then transact the actual data flow only with the chunkservers.
```python
import os
import operator
import time
import uuid

class GFSClient:
    def __init__(self, master):
        self.master = master

    def write(self, filename, data):  # filename is full namespace path
        if self.exists(filename):  # if it already exists, overwrite
            self.delete(filename)
        num_chunks = self.num_chunks(len(data))
        chunkuuids = self.master.alloc(filename, num_chunks)
        self.write_chunks(chunkuuids, data)

    def write_chunks(self, chunkuuids, data):
        chunks = [data[x:x + self.master.chunksize]
                  for x in range(0, len(data), self.master.chunksize)]
        chunkservers = self.master.get_chunkservers()
        for i in range(len(chunkuuids)):  # write to each chunkserver
            chunkuuid = chunkuuids[i]
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunkservers[chunkloc].write(chunkuuid, chunks[i])

    def num_chunks(self, size):  # ceiling division by the chunk size
        return (size + self.master.chunksize - 1) // self.master.chunksize

    def write_append(self, filename, data):
        if not self.exists(filename):
            raise Exception("append error, file does not exist: " + filename)
        num_append_chunks = self.num_chunks(len(data))
        append_chunkuuids = self.master.alloc_append(filename, num_append_chunks)
        self.write_chunks(append_chunkuuids, data)

    def exists(self, filename):
        return self.master.exists(filename)

    def read(self, filename):  # get metadata, then read chunks direct
        if not self.exists(filename):
            raise Exception("read error, file does not exist: " + filename)
        chunks = []
        chunkuuids = self.master.get_chunkuuids(filename)
        chunkservers = self.master.get_chunkservers()
        for chunkuuid in chunkuuids:
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunks.append(chunkservers[chunkloc].read(chunkuuid))
        return "".join(chunks)  # reassemble in order

    def delete(self, filename):
        self.master.delete(filename)
```
The master class simulates a GFS master server. This is where all the metadata is stored, making it the core node of the entire system. Client requests start at the master; once the metadata is retrieved, the client talks directly to the individual chunkservers. This keeps the master from becoming a bottleneck, since metadata requests are small and low-latency. The metadata is implemented as a series of dictionaries, although in a real system you would back the dicts with persistent storage. Chunkserver heartbeats for tracking availability, chunkserver authentication, and locality information for efficient placement are all simplified here: the master itself allocates the chunkservers. However, we still preserve the direct client read/write path to the chunkservers, bypassing the master, to show how the distributed system works.
```python
class GFSMaster:
    def __init__(self):
        self.num_chunkservers = 5
        self.max_chunkservers = 10
        self.max_chunksperfile = 100
        self.chunksize = 10
        self.chunkrobin = 0
        self.filetable = {}    # file to chunk mapping
        self.chunktable = {}   # chunkuuid to chunkloc mapping
        self.chunkservers = {} # loc id to chunkserver mapping
        self.init_chunkservers()

    def init_chunkservers(self):
        for i in range(self.num_chunkservers):
            self.chunkservers[i] = GFSChunkserver(i)

    def get_chunkservers(self):
        return self.chunkservers

    def alloc(self, filename, num_chunks):  # return ordered chunkuuid list
        chunkuuids = self.alloc_chunks(num_chunks)
        self.filetable[filename] = chunkuuids
        return chunkuuids

    def alloc_chunks(self, num_chunks):
        chunkuuids = []
        for i in range(num_chunks):
            chunkuuid = uuid.uuid1()
            chunkloc = self.chunkrobin  # round-robin chunkserver placement
            self.chunktable[chunkuuid] = chunkloc
            chunkuuids.append(chunkuuid)
            self.chunkrobin = (self.chunkrobin + 1) % self.num_chunkservers
        return chunkuuids

    def alloc_append(self, filename, num_append_chunks):  # append chunks
        chunkuuids = self.filetable[filename]
        append_chunkuuids = self.alloc_chunks(num_append_chunks)
        chunkuuids.extend(append_chunkuuids)
        return append_chunkuuids

    def get_chunkloc(self, chunkuuid):
        return self.chunktable[chunkuuid]

    def get_chunkuuids(self, filename):
        return self.filetable[filename]

    def exists(self, filename):
        return filename in self.filetable

    def delete(self, filename):  # rename for later garbage collection
        chunkuuids = self.filetable[filename]
        del self.filetable[filename]
        timestamp = repr(time.time())
        deleted_filename = "/hidden/deleted/" + timestamp + filename
        self.filetable[deleted_filename] = chunkuuids
        print("deleted file: " + filename + " renamed to " +
              deleted_filename + " ready for gc")

    def dump_metadata(self):
        print("Filetable:")
        for filename, chunkuuids in self.filetable.items():
            print(filename, "with", len(chunkuuids), "chunks")
        print("Chunkservers:", len(self.chunkservers))
        print("Chunkserver Data:")
        for chunkuuid, chunkloc in sorted(self.chunktable.items(),
                                          key=operator.itemgetter(1)):
            chunk = self.chunkservers[chunkloc].read(chunkuuid)
            print(chunkloc, chunkuuid, chunk)
```
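To see the placement policy in isolation: alloc_chunks hands out one chunkserver location per chunk, cycling through the servers in order. The standalone sketch below (a hypothetical helper for illustration, not part of gfs.py) reproduces the same round-robin arithmetic:

```python
def round_robin_locations(num_chunks, num_servers, start=0):
    """Assign each of num_chunks chunks a server location id,
    cycling through the servers in order, as the master does."""
    return [(start + i) % num_servers for i in range(num_chunks)]

# With 5 chunkservers, 7 chunks land on servers 0-4, then wrap around:
# round_robin_locations(7, 5) → [0, 1, 2, 3, 4, 0, 1]
```

This is why the metadata dump later shows each chunkserver holding roughly the same number of chunks.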
The chunkserver class is the smallest in this project. It represents an actual distinct machine running in a massive datacenter, reachable over the network by the master and the client. In GFS, the chunkservers are relatively "dumb" in that they know only about chunks, that is, the file data broken up into pieces. They never see the whole picture of a file: where its other chunks live across the filesystem, the associated metadata, and so on. We implement this class as simple local storage, which you can examine after running the test code by looking at the directory "/tmp/gfs/chunks". In a real system you would want persistent storage of the chunk info for backup.
```python
class GFSChunkserver:
    def __init__(self, chunkloc):
        self.chunkloc = chunkloc
        self.chunktable = {}
        self.local_filesystem_root = "/tmp/gfs/chunks/" + repr(chunkloc)
        if not os.access(self.local_filesystem_root, os.W_OK):
            os.makedirs(self.local_filesystem_root)

    def write(self, chunkuuid, chunk):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "w") as f:
            f.write(chunk)
        self.chunktable[chunkuuid] = local_filename

    def read(self, chunkuuid):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "r") as f:
            return f.read()

    def chunk_filename(self, chunkuuid):
        return self.local_filesystem_root + "/" + str(chunkuuid) + ".gfs"
```
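To inspect what the chunkservers wrote, you can walk the storage directory. This small helper (illustrative only, not part of gfs.py) lists the chunk files under a given root such as "/tmp/gfs/chunks":

```python
import os

def list_chunk_files(root):
    """Return the paths of all .gfs chunk files under root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".gfs"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

After running the test script, each chunkserver's chunks appear as one .gfs file per chunk, grouped under a numbered directory per server location.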
We use main() as a test of all the client methods, including the exceptions. We first create a master and a client object, then write a file. The write is performed by the client in the same way as in the real GFS: first it gets the chunk metadata from the master, then it writes chunks directly to each chunkserver. Append works similarly. Delete is handled in the GFS fashion: the file is renamed into a hidden namespace and left for later garbage collection. A metadata dump displays the resulting state. Note that this is a single-threaded test, as this demonstration program does not support concurrency, although that could be added with appropriate locks around the metadata.
```python
def main():  # test script for filesystem
    # setup
    master = GFSMaster()
    client = GFSClient(master)

    # test write, exists, read
    print("\nWriting...")
    client.write("/usr/python/readme.txt", """
This file tells you all about python that you ever wanted to know.
Not every README is as informative as this one, but we aim to please.
Never yet has there been so much information in so little space.
""")
    print("File exists?", client.exists("/usr/python/readme.txt"))
    print(client.read("/usr/python/readme.txt"))

    # test append, read after append
    print("\nAppending...")
    client.write_append("/usr/python/readme.txt",
                        "I'm a little sentence that just snuck in at the end.\n")
    print(client.read("/usr/python/readme.txt"))

    # test delete
    print("\nDeleting...")
    client.delete("/usr/python/readme.txt")
    print("File exists?", client.exists("/usr/python/readme.txt"))

    # test exceptions
    print("\nTesting Exceptions...")
    try:
        client.read("/usr/python/readme.txt")
    except Exception as e:
        print("This exception should be thrown:", e)
    try:
        client.write_append("/usr/python/readme.txt", "foo")
    except Exception as e:
        print("This exception should be thrown:", e)

    # show structure of the filesystem
    print("\nMetadata Dump...")
    master.dump_metadata()

if __name__ == "__main__":
    main()
```
And putting it all together, here is the output of the test script run from the command line. Pay special attention to the master metadata dump at the end, where you can see how the chunks are spread across the chunkservers in jumbled order, only to be reassembled by the client in the order specified by the master's metadata.
```
$ python gfs.py

Writing...
File exists?  True

This file tells you all about python that you ever wanted to know.
Not every README is as informative as this one, but we aim to please.
Never yet has there been so much information in so little space.

Appending...

This file tells you all about python that you ever wanted to know.
Not every README is as informative as this one, but we aim to please.
Never yet has there been so much information in so little space.
I'm a little sentence that just snuck in at the end.

Deleting...
deleted file: /usr/python/readme.txt renamed to /hidden/deleted/1289928955.7363091/usr/python/readme.txt ready for gc
File exists?  False

Testing Exceptions...
This exception should be thrown: read error, file does not exist: /usr/python/readme.txt
This exception should be thrown: append error, file does not exist: /usr/python/readme.txt

Metadata Dump...
Filetable:
/hidden/deleted/1289928955.7363091/usr/python/readme.txt with 30 chunks
Chunkservers:  5
Chunkserver Data:
0 f76734ce-f1a7-11df-b529-001d09d5b664 mation in
0 f7671750-f1a7-11df-b529-001d09d5b664 you ever
0 f7670bd4-f1a7-11df-b529-001d09d5b664 T
0 f767b656-f1a7-11df-b529-001d09d5b664 le sentenc
0 f7672182-f1a7-11df-b529-001d09d5b664 is as inf
0 f7672b0a-f1a7-11df-b529-001d09d5b664 se.
1 f767b85e-f1a7-11df-b529-001d09d5b664 e that jus
1 f76736b8-f1a7-11df-b529-001d09d5b664 so little
1 f767193a-f1a7-11df-b529-001d09d5b664 wanted to
1 f7670f3a-f1a7-11df-b529-001d09d5b664 his file t
1 f767236c-f1a7-11df-b529-001d09d5b664 ormative a
1 f7672cf4-f1a7-11df-b529-001d09d5b664 Never ye
2 f7671b2e-f1a7-11df-b529-001d09d5b664 know.
2 f7671142-f1a7-11df-b529-001d09d5b664 ells you a
2 f767ba48-f1a7-11df-b529-001d09d5b664 t snuck in
2 f7672556-f1a7-11df-b529-001d09d5b664 s this one
2 f76738e8-f1a7-11df-b529-001d09d5b664 space.
2 f7672ee8-f1a7-11df-b529-001d09d5b664 t has ther
3 f767bcb4-f1a7-11df-b529-001d09d5b664 at the en
3 f7671d40-f1a7-11df-b529-001d09d5b664 Not ev
3 f76730d2-f1a7-11df-b529-001d09d5b664 e been so
3 f7673adc-f1a7-11df-b529-001d09d5b664
3 f767135e-f1a7-11df-b529-001d09d5b664 ll about p
3 f7672740-f1a7-11df-b529-001d09d5b664 , but we a
4 f7672920-f1a7-11df-b529-001d09d5b664 im to plea
4 f767b3f4-f1a7-11df-b529-001d09d5b664 I'm a litt
4 f767bea8-f1a7-11df-b529-001d09d5b664 d.
4 f7671552-f1a7-11df-b529-001d09d5b664 ython that
4 f76732e4-f1a7-11df-b529-001d09d5b664 much infor
4 f7671f8e-f1a7-11df-b529-001d09d5b664 ery README
```
Now of course we are lacking some of the complexities of GFS necessary for a fully functional system: metadata locking, chunk leases, replication, master failover, localization of data, chunkserver heartbeats, deleted file garbage collection. But what we have here demonstrates the gist of GFS and will help give you a better understanding of the basics. It can also be a starting point for your own explorations into more detailed distributed filesystem code in python.
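As one example of such an exploration, replication can be sketched by having the master hand out several locations per chunk, so a client would write to all of them and read from any one. The function below is a hypothetical variant of the master's allocation policy (the name and parameters are illustrative, not part of gfs.py) that assigns a configurable number of replica locations round-robin:

```python
def alloc_chunk_replicas(num_chunks, num_servers, num_replicas=3, start=0):
    """Round-robin replica placement: each chunk gets num_replicas
    distinct chunkserver locations, advancing the robin pointer by
    one per chunk so replicas spread evenly across servers."""
    placements = []
    robin = start
    for _ in range(num_chunks):
        locs = [(robin + r) % num_servers for r in range(num_replicas)]
        placements.append(locs)
        robin = (robin + 1) % num_servers
    return placements

# With 5 servers and 3 replicas, the first two chunks land on:
# alloc_chunk_replicas(2, 5) → [[0, 1, 2], [1, 2, 3]]
```

Plugging this into the master would also require the chunktable to map each chunkuuid to a list of locations, and the client's write_chunks and read to loop over that list.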
Full source code can be downloaded here: gfs.py