MongoDB GridFS using Python

Data is collected and stored in different forms depending on its nature: text, images, videos, audio and files in other formats. MongoDB can store all of these, but so far we have used it only for plain text information in MongoDB documents. As you know by now, a MongoDB document has a size limit of 16MB. Though 16MB is good enough in most cases, it looks tiny if you think of storing high-resolution images, PDF files, music, videos etc.

In this chapter, we are going to see how we can store more information in MongoDB. Till now we played around with storing text in JSON format but in this chapter we’ll see how to store images, PDFs, audio, video etc. I am sure you are going to enjoy this chapter. So, let’s get started!

Introduction
MongoDB provides a way to store arbitrarily large files through GridFS. Till now we have inserted documents in MongoDB that hold information in the form of key-value pairs. If you want to store files in MongoDB, it provides this functionality as well. Let me introduce you to GridFS.

GridFS is a specification for storing and retrieving arbitrarily large files. It divides one large file into multiple small chunks and stores each chunk as a separate document in MongoDB. The default size of a GridFS chunk is 255 KB (261,120 bytes). Once a chunk is stored as a document in MongoDB, all the operations that can be performed on regular MongoDB documents can be performed on it as well. Simply put, chunks are regular MongoDB documents.
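
To make the arithmetic concrete, here is a minimal Python sketch of how a file length maps onto chunks. It assumes the default chunk size of 255 KB; the function name chunk_layout and the sample length are only illustrative and are not part of any MongoDB API.

```python
CHUNK_SIZE = 255 * 1024  # 261,120 bytes, the default GridFS chunk size


def chunk_layout(file_length, chunk_size=CHUNK_SIZE):
    """Return (number of chunks, size of the last chunk) for a file length in bytes."""
    if file_length == 0:
        return 0, 0
    # Integer ceiling division: every chunk is full except possibly the last one.
    count = (file_length + chunk_size - 1) // chunk_size
    last = file_length - (count - 1) * chunk_size
    return count, last


# A 5,846,083-byte file (an arbitrary example size).
print(chunk_layout(5846083))  # (23, 101443)
```

In other words, a file of that size would occupy 22 full chunks plus one smaller final chunk.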

Now you must be wondering how we get the file back when needed. The driver reassembles the chunks for you, so you don’t need to worry. Luckily, when using GridFS, the driver does all the magic and saves you from headaches and sleepless nights.

GridFS uses two collections to save files. Both of them are under the default namespace of ‘fs’. Here are the two collections –

files – This collection has the information about the file, like – file name, chunk size, upload date etc.

chunks – This collection holds the chunks, as the name suggests. Documents in this collection host the binary data of the file. Every document contains the id of the file, binary data and chunk sequence.
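
If it helps, the relationship between the two collections can be imitated in plain Python without a database. This is only a rough sketch written for Python 3: the helper names split_into_documents and reassemble are made up for illustration, a random hex string stands in for the ObjectId that MongoDB would generate, and a real driver may store more fields than shown here.

```python
import datetime
import uuid


def split_into_documents(data, filename, chunk_size=255 * 1024):
    """Build one files-style document and a list of chunks-style documents."""
    file_id = uuid.uuid4().hex  # stand-in for a MongoDB ObjectId (assumption)
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "chunkSize": chunk_size,
        "length": len(data),
        "uploadDate": datetime.datetime.now(datetime.timezone.utc),
    }
    # Each chunk records the id of its file, its sequence number n and its bytes.
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs):
    """Join chunk payloads in sequence order, much as a driver does on read."""
    ordered = sorted(chunk_docs, key=lambda doc: doc["n"])
    return b"".join(doc["data"] for doc in ordered)


payload = bytes(range(256)) * 4  # 1,024 bytes of sample data
files_doc, chunk_docs = split_into_documents(payload, "sample.bin", chunk_size=300)
print(files_doc["length"], len(chunk_docs))  # 1024 4
print(reassemble(chunk_docs) == payload)     # True
```

A tiny chunk size of 300 bytes is used here only so that the sample data spans several chunks.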

Don’t worry if you find this a little complex. Things will become much clearer once we start playing around with examples.

Let’s see some of the benefits of storing files in MongoDB.

All data in one place
By storing files in MongoDB you have all your data in one place. This reduces maintenance overhead. Traditionally, files are stored in a folder somewhere in the file system and the path to each file is saved in the database. This requires you to take care of the files in the folder in addition to the database.

Easy backup
If you save your files in MongoDB then backing them up is not at all a problem. Create a replica set and you have a redundant copy as well as failover ready. You also don’t need to back up the database and the files folder separately. You just have to take a backup of MongoDB and you have all you need.

Store large number of files
You can save a huge number of files in MongoDB. Unlike a file system, MongoDB has no built-in limit on the number of documents it can handle. It depends solely on the resources you provide. You can also take advantage of sharding to distribute the load.

Random access to a file
MongoDB splits the file into multiple chunks. This is how GridFS works. By default, it divides the file into chunks of 255 KB. Every chunk is a document and it behaves just like a normal MongoDB document. This gives you some flexibility: you can ask for a specific chunk of a file, or you can skip chunks. This can be useful in video streaming services that allow skipping.
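
The skipping idea boils down to simple arithmetic on the chunk size. The sketch below assumes the default 255 KB chunks and uses illustrative helper names (locate_byte and chunks_for_range are not driver functions). It works out which chunk numbers would be needed to serve a given byte range.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # 261,120 bytes


def locate_byte(offset, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (chunk number n, position inside that chunk) for a byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, chunk_size)


def chunks_for_range(start, length, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return the chunk numbers needed to read `length` bytes from `start`."""
    if length <= 0:
        return []
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    return list(range(first, last + 1))


print(locate_byte(1000000))               # (3, 216640)
print(chunks_for_range(1000000, 500000))  # [3, 4, 5]
```

So a player that jumps to byte 1,000,000 would only need chunks 3 to 5 for the next 500,000 bytes and could leave the earlier chunks untouched. In practice you would rarely compute this by hand, since the file-like object returned by the driver can typically be positioned with seek() before calling read().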

Okay! Enough theory. This book is not about theory and I believe you, too, are more interested in practical code. What we are going to do now is look at some simple examples using the mongofiles tool. Afterwards, we’ll jump over to Python and see some examples there. So let’s jump right in.

Command line tool – mongofiles
When you install MongoDB on your system you get a few executable binaries, like mongo, mongod, mongodump, mongorestore etc. One of them is mongofiles. This tool is used to browse and modify GridFS files. It’s a very simple tool and we’ll use it to whet our appetite. Let’s see what we can do with mongofiles.

First, have a look at the options available with mongofiles. Open the terminal on Unix/Linux or the command prompt on Windows and run the following command.

$ mongofiles --help

The above command will show you a lot of options. I would suggest you go through all of them, at least once. Some of the options are common to all MongoDB tools. For instance, the ‘-d’ option for database, ‘-c’ for collection, ‘-h’ for hostname etc.

The commands that are specific to mongofiles are –

list – list all files
put – add a file with filename ‘gridfs filename’
get – get a file with filename ‘gridfs filename’
delete – delete all files with filename ‘gridfs filename’

Store a file
Let’s store our first file in MongoDB using the mongofiles tool. I have a picture on my system that I am going to store in MongoDB. Since mongofiles takes care of the details, I will just run the following command. One command and all done.

$ mongofiles -d 'mydb' put the-great-heads.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('54dbb22ca187e981e194a97c'), filename: "the-great-heads.jpg", chunkSize: 261120, uploadDate: new Date(1423684140782), md5: "f6b1abeb03c257e5ef1cf75e304fd5b9", length: 5846083 }
done!

You can see I passed my database name using the ‘-d’ option. Then I used the ‘put’ command, which stores the file. After ‘put’ I gave the name of the file, which is the name of the picture I have on my system. If the file is not in your current directory then you have to provide the full path to the file.

The put operation was successful, as it did not throw any error. You can see it returned a JSON-like object containing information about the file: the _id, filename, chunkSize, uploadDate, md5 and length of the file stored.

List files
Now, let’s use the ‘list’ command with mongofiles to list the files in GridFS.

$ mongofiles -d 'mydb' list
connected to: 127.0.0.1
the-great-heads.jpg 5846083

The output is very clear. I don’t think I need to explain this. You can see it is showing only one file because we stored only one file. The number shown against the file name is the size of the file in bytes.

Delete file
Let’s see how to delete a file. We can use the ‘delete’ command with mongofiles to delete a GridFS file. Here is an example –

$ mongofiles -d 'mydb' delete the-great-heads.jpg
connected to: 127.0.0.1
done!

Congratulations! We have just deleted the file we stored a minute ago.

GridFS collections
Before we go any further, let’s quickly look at the chunks and files collections so that you have a fair idea of the structure of the documents inside them.

Oops! We don’t have data in these collections. You may remember that we deleted the only file we stored. No problem. Let’s store the file again and then we’ll have a look at the GridFS collections from the mongo shell.

$ mongofiles -d 'mydb' put the-great-heads.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('54dbb9ca65b24efcec30a56c'), filename: "the-great-heads.jpg", chunkSize: 261120, uploadDate: new Date(1423686090205), md5: "f6b1abeb03c257e5ef1cf75e304fd5b9", length: 5846083 }
done!

Okay! The above command ran well and the file is stored again in ‘mydb’ database. Now, let’s go to mongo shell and see what’s inside.

$ mongo localhost/mydb
MongoDB shell version: 2.6.5
connecting to: localhost/mydb

The above command may look familiar to you. You have seen it a few times in previous chapters. In case you forgot what this command does, mongo is the MongoDB client used to connect to the database and perform operations. With the above command we are connected to the ‘mydb’ database and can perform operations on it.

Files collection
Files collection contains metadata of files. It doesn’t hold data itself. The real file is stored in chunks collection. Let’s check the files collection first and see what information is stored there.

> db.fs.files.find().pretty()
{
"_id" : ObjectId("54dbb9ca65b24efcec30a56c"),
"filename" : "the-great-heads.jpg",
"chunkSize" : 261120,
"uploadDate" : ISODate("2015-02-11T20:21:30.205Z"),
"md5" : "f6b1abeb03c257e5ef1cf75e304fd5b9",
"length" : 5846083
}

You can see I used the find() method on the fs.files collection. As I told you, GridFS saves file metadata and chunks as documents, so you can perform all the typical MongoDB operations on them as well. They are regular MongoDB documents. Our find() method returned only a single document because we have only one file right now. Let me explain the fields in brief. Don’t worry, I will do it quickly; I don’t want you to lose interest.

“_id” – This is an object id automatically created by MongoDB
“filename” – The name of the file. It’s better to give each file a different name.
“chunkSize” – The size of each chunk in bytes. The default is 255 KB (261120 bytes).
“uploadDate” – The date and time of file upload.
“md5” – An MD5 hash of the file content that can be used to check file integrity.
“length” – Length of the file in bytes.
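
As an illustration of the integrity check, the following sketch compares the MD5 hash of some bytes with the value kept in a files document. It uses only Python’s hashlib module; the helper names are hypothetical and the dictionary merely stands in for a document read from fs.files. It also assumes the md5 field is present, which may not be the case with newer driver versions.

```python
import hashlib


def md5_hex(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def matches_stored_md5(data, files_doc):
    """Check downloaded bytes against the md5 and length fields of a files document."""
    return len(data) == files_doc["length"] and md5_hex(data) == files_doc["md5"]


content = b"hello world"
# Stand-in for a document from the files collection (only the fields we need).
files_doc = {"length": len(content), "md5": md5_hex(content)}

print(matches_stored_md5(content, files_doc))         # True
print(matches_stored_md5(content + b"!", files_doc))  # False
```

If the check fails after a download, the bytes you received are not the bytes that were uploaded.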

Chunks collection
The chunks collection is the one that holds the file data. As you know by now, GridFS divides the file into multiple chunks and stores them as documents in the chunks collection. We can use the same technique that we used for the files collection to query the chunks collection. Because the chunks collection holds the real data in binary form, its documents are big compared to the documents in the files collection. So let’s get only one document from the chunks collection.

> db.fs.chunks.findOne()
{
"_id" : ObjectId("54dbb9cad26fa2234f917cf2"),
"files_id" : ObjectId("54dbb9ca65b24efcec30a56c"),
"n" : 7,
"data" : BinData(0,"SRbXIAVpU5LybByaEpP2jkDBQL5tVFad8sG4tlEbtn4voxBoJkuSrqamlMgQLYRjxhqMrxoMMo77tcQANm2YeGRAbNlocrtkuG0QyAGnEFtx0xEXK8Pi3cEqKHp44JHuajuKaeNEAyuEjJxBGMXbDpuckd+bkWs+I/a2w8IZAXza9bn8IHTDwVuw47NOEgUHIyjZZDyWK1aqOuTltuuw2K4o1PfK+PuaslgbLmFV98iCyvbzWKKVBO+E1zZYx3qLswIGTABLi5OK9nUNeWQNcnKjGljDvkweiSKK3qMiSAzWEEGpyQlbW1yqMa3YA9G1Yg+2CXJsiaOy0vy65DhoNpJkN1rhWG22RMuFpmBVhSA4ioy27SIAcljMo3O2W8JLLgtzctjlNbuNOJB2aDFDU98mRY2bxspEVJIyyJ4RRaQpMCehy6O3Nopa4K9OmTjRKOR2UyhJr2y3xNqQbJtotTbEDvYyLY4ndsBvo2XXN1Nqr0xA3Ty3C48Snvgo22SyCQU6ccvq2cY0GnVh075VxBxcgMeSkVZd8sEwWoEhsj8cjbkyPVpQBtgkSd2iIstqDXcYz5ORRHN3HhuemAS4hTELCeTbCgy47KQ5qVBPTGzWzDIQvYDqOmYkSTsxq1hbw65lAUspU0FLGnbJ2Bu1DmqGMU265Ucu+7kUC2PhpXrkSbOzcA1tXbr4ZI7L1W7DqKZK7WVdWwRgprsBfUUFcr36MwRzWVHbLSO9qIdWmQZANN8OCO7XI00K48ymItaTXY5cG0u/ZyNb2kRoNEEAeGAizuxMSXFV6nrgBrZyCAA12rhaDytwoR88r5NLTqB8ssxlfNykCnhgkCyA3stsPV3GwGAHhYEcRa4kD4euPECd2UTTuO253yXF3NpbYhRQYBZLKVtMQw3O+Mdjs1nzU164QGBiFtFII6nJAsYgFoq69iB45PiBZUQKIXRlQhJFSMhKFlpEdrWgSliafCcka5NY47PcuUlELEeOAmzs5OM8MbIcq+p8QO2MzTUayb8lSND1HbKTLvbMcSN24mCsw8RkSLasU/UfNfHvkZUHLgqlSMptulGnV4jfp44SLOzXIfJwRftHfCZEimrgB3K3inVsIkVHDW7liLbg7Y8YG7OML5KZQqeJ3XLRKxYQYmLljQtTKzOXNsjAE7rmi4nbpgjIkbspxALQiNQxBzIjPo0XSLhtFlpyFMx")
}

I used the findOne() method to get the first document that MongoDB finds in the collection. Since documents in the chunks collection can be up to 255 KB each, I had to remove most of the data field; otherwise the document would have taken 4-5 pages. Let’s see what these fields are.

“_id” – This is an object id automatically created by MongoDB
“files_id” – This is the value of _id field of the file object in the files collection. All chunks of the same file will have same files_id.
“n” – The sequence number of the chunk within the file. Numbering starts at 0.
“data” – This is the binary data of the chunk.

GridFS with Pymongo
Hope you are enjoying this interesting topic. But you must be wondering why we have not used Pymongo to interact with GridFS yet. I chose to explain GridFS with the mongofiles tool for two reasons. First, it is much simpler and so very effective for giving you a high-level overview. Second, my intent was to introduce you to the mongofiles tool. I am sure that otherwise you would not even have noticed that there is a tool called ‘mongofiles’ for handling GridFS files.

Okay! Your wait is over. Let’s have some more fun with GridFS using Pymongo. We’ll run our Python code in Python’s interactive shell, also called the REPL, in one terminal. We will also keep the mongo shell open in another so that we can have a quick look at the collections. Got it? Great! Let’s get started.

Open mongo shell like this –

$ mongo
MongoDB shell version: 2.6.5
connecting to: test
>

Now, open Python REPL –

$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

For using GridFS with Pymongo, we need to import some classes. Let’s import them first.

Import MongoClient so that we can connect to a running mongod instance.

>>> from pymongo import MongoClient

Import GridFS from the gridfs module so that we can put, get and delete files in MongoDB.

>>> from gridfs import GridFS

We also need to import objectid from bson because we are going to need it.

>>> from bson import objectid

All of the above should run without any error if you have been following along from the start. These are simple import statements and the only thing that can go wrong is a module not being installed. In that case, you can use ‘pip’ to install the missing Python module.

Storing strings
This time I’ll start with saving a simple string in MongoDB using GridFS. It’s pretty simple. Let me show you how.

1. Connect to mygrid database like this –

>>> db = MongoClient().mygrid

2. Get the GridFS object

>>> fs = GridFS(db)

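3. Store “hello world” in MongoDB using GridFS. The put method will return the ObjectId of the stored file. (This session uses Python 2. On Python 3, put() expects bytes such as b"hello world", or a str together with an encoding argument.)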

>>> ob = fs.put("hello world")

4. Now, let’s retrieve the ‘hello world’ we just stored. The get method is used to get the file. We’ll pass the ObjectId returned by the put() method. And, because the get() method returns a file-like object, we’ll use the read() method to read its content. Like this –

>>> fs.get(ob).read()
'hello world'

Wow! Interesting, right? Though storing strings in GridFS is not something you would want. If you are thinking of using GridFS then surely it’s not because you have problem storing strings otherwise. But I showed this just to give you an idea.

Namespace
In the mongo shell, if you go and check the collections in ‘mygrid’ database, you will find fs.files and fs.chunks collection. Nothing new, right? Do you remember I mentioned earlier that ‘fs’ is the default namespace for GridFS collections? You can change it to whatever you like. For instance, if I were using GridFS to store the invoices then I would want to have something like ‘invoice.files’ and ‘invoice.chunks’. Don’t you agree this makes more sense? Let me show you how to do this.

In this example, we’ll store an invoice that is in PDF form. Also, we’ll use different namespace for our GridFS collections.

1. Connect to mygrid database like this –

>>> db = MongoClient().mygrid

2. Get the GridFS object. First argument is database object and second argument is the namespace.

>>> fs = GridFS(db, "invoice")

3. Store the invoice located at /tmp/first-invoice.pdf. Open the file in binary mode (‘rb’) so that its bytes are read unchanged.

>>> with open('/tmp/first-invoice.pdf', 'rb') as f:
...     invoice = fs.put(f, content_type='application/pdf', filename='first-invoice.pdf')
...

Now if you go to the mongo shell and check the collections in the ‘mygrid’ database, you will find the ‘invoice.files’ and ‘invoice.chunks’ collections.
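
If you store files often, the steps above can be wrapped in a small helper. Treat this as a sketch under assumptions: store_file is a name I made up, and fs is expected to be a GridFS object (or anything else with a compatible put() method). The content type is guessed from the file extension when you don’t pass one.

```python
import mimetypes
import os


def store_file(fs, path, content_type=None):
    """Upload the file at `path` through a GridFS-like object and return its id."""
    if content_type is None:
        # Guess from the extension; this may still be None for unknown types.
        content_type, _ = mimetypes.guess_type(path)
    extra = {"filename": os.path.basename(path)}
    if content_type is not None:
        extra["content_type"] = content_type
    # Binary mode, so the bytes are stored exactly as they are on disk.
    with open(path, "rb") as handle:
        return fs.put(handle, **extra)
```

With the objects from this section, a call might look like store_file(GridFS(db, "invoice"), "/tmp/first-invoice.pdf", "application/pdf"), which should return the id of the newly stored invoice.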

Reading files
We have seen how to store a file in MongoDB using GridFS. We have also briefly used the get() method in previous examples to get the content of a file back. But I think reading files needs a little more attention. Let’s take a look at the following example –

>>> db = MongoClient().mygrid
>>> fs = GridFS(db)
>>> ob = fs.put("hello world")
>>> fs.get(ob).read()
'hello world'

Yes, you got it right. You have seen this example earlier. I used it a little while ago. This time I just removed the explanation for each line. I believe you know what each line of code does by now. In this example, we stored a file with ‘hello world’ as content. The put() method returned the file_id. We captured the file_id in the ob variable because we need it to identify the file in case we want to get the file back. Then, when we needed to read the file, we just passed the ob variable to the get() method. Great! But what about the read() method? Actually, the get() method returns a file-like object, and to see the content we can call the read() method on it. Wonderful! We have our highly impressive file content before us.

I know you have a question in your mind. You are wondering what to do if we don’t know the file_id. Yes, in most situations you would not know the file_id. Most probably you know only the name of the file. Remember that GridFS collections are just like regular MongoDB collections. What I would do in this situation is run a find() query on the fs.files collection and get the ‘_id’. This ‘_id’ is the same as the files_id in the fs.chunks collection. Once I have the file_id, I can go ahead and pass it to the get() method. Let’s see an example.

In this simple example we’ll store a string as a file in GridFS. There is nothing wrong with storing a real file, but I am storing a string purely for the sake of simplicity.

1. Connect to mygrid database like this –

>>> db = MongoClient().mygrid

2. Get the GridFS object. First argument is database object and second argument is the namespace.

>>> fs = GridFS(db, "stringfiles")

3. Store a new string with filename “mystory.txt”.

>>> fs.put("Here is the content of this file.", filename="mystory.txt")

Notice we didn’t capture the file_id returned by the put() method. Let’s find it in the ‘stringfiles.files’ collection.

>>> f_id = db.stringfiles.files.find_one({ "filename" : "mystory.txt" },{ "_id" : 1 })

By now, you should be able to easily understand what the above statement is doing. Let me quickly explain in case you don’t remember. Here I am asking MongoDB to find a document in the stringfiles.files collection that has the filename field set to mystory.txt. Also, I used a projection to return only the ‘_id’ field. So, the result would be something like this – {u'_id': ObjectId('54e4463bd6e6fe5c5c2b7171')}

Got it? Great! Now that we have the file_id, we can go ahead and read the file easily. Like this –

>>> fs.get(f_id['_id']).read()
'Here is the content of this file.'

Easy, right? I told you so. But this approach has a problem. The problem lies in the fact that file name doesn’t have to be unique. This means that if you insert a file multiple times then you will have multiple entries with same file name in fs.files collection. Let’s try it ourselves.

>>> fs = GridFS(db)
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bcdd6e6fe5c5c2b7173')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bced6e6fe5c5c2b7175')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bcfd6e6fe5c5c2b7177')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd0d6e6fe5c5c2b7179')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd1d6e6fe5c5c2b717b')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd2d6e6fe5c5c2b717d')
>>>

We just stored the same file with same content 6 times and every time it returned a file_id. If I go to my mongo shell then I can see similar results there also. See below the output of find() on mongo shell.

> db.fs.files.find()
{ "_id" : ObjectId("54e44bcdd6e6fe5c5c2b7173"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:37.029Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bced6e6fe5c5c2b7175"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:38.958Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bcfd6e6fe5c5c2b7177"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:39.878Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd0d6e6fe5c5c2b7179"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:40.789Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd1d6e6fe5c5c2b717b"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:41.684Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd2d6e6fe5c5c2b717d"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:42.581Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
>

The only fields that differ are ‘_id’ and ‘uploadDate’. The rest is exactly the same. Why, you may ask, does MongoDB do this? It turns out to be a handy way of versioning files. If you need to read the file, just get the latest one. If you don’t like the idea then you can give each file a unique name.
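
Picking the latest among several same-named files amounts to choosing the one with the newest uploadDate. Here is a small database-free sketch of that idea. The latest_version function is hypothetical, and the dictionaries (with plain integers as ids) only imitate documents from the files collection.

```python
import datetime


def latest_version(files_docs, filename):
    """Return the files document with the newest uploadDate for `filename`, or None."""
    matches = [doc for doc in files_docs if doc["filename"] == filename]
    if not matches:
        return None
    return max(matches, key=lambda doc: doc["uploadDate"])


# Stand-ins for documents in the files collection.
docs = [
    {"_id": 1, "filename": "mystory.txt",
     "uploadDate": datetime.datetime(2015, 2, 18, 8, 22, 37)},
    {"_id": 2, "filename": "mystory.txt",
     "uploadDate": datetime.datetime(2015, 2, 18, 8, 22, 42)},
    {"_id": 3, "filename": "other.txt",
     "uploadDate": datetime.datetime(2015, 2, 18, 9, 0, 0)},
]

print(latest_version(docs, "mystory.txt")["_id"])  # 2
```

The document with id 3 is newer overall but has a different name, so it is ignored.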

What should we do now? Don’t worry. We have an easy way to get the most recent file. Actually, we can get all the metadata for the most recent file. Here is how you do it. The good part is, you don’t need file_id to read the content of file. We just need to know the file name and we are good to go. See here –

>>> f = fs.get_last_version(filename="mystory.txt")

Read the content of the file.

>>> f.read()
'Here is the content of this file.'

Get the ‘_id’ of the file.

>>> f._id
ObjectId('54e4500fd6e6fe5c5c2b717f')

Get the upload date of the file.

>>> f.uploadDate
datetime.datetime(2015, 2, 18, 8, 40, 47, 629000)

Get the name of the file.

>>> f.filename
u'mystory.txt'

Interesting, right? I am enjoying this topic and I hope you are enjoying it.

Delete files
Deleting files is extremely simple. The delete() method deletes all data relating to the file_id passed as argument to the method. Let’s see a quick example.

>>> db = MongoClient().mygrid
>>> fs = GridFS(db)
>>> fileid = fs.put("This is the content of the file.", filename="newfile.txt")
>>> fs.get(fileid).read()
'This is the content of the file.'
>>> fs.delete(fileid)

In case you don’t know the file_id but you know the name of the file then you can delete the most recently uploaded file like this –

>>> fileid = fs.get_last_version(filename="newfile.txt")._id
>>> fs.delete(fileid)

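Note that this removes only the most recent version. If several files share the same name and you want them all gone, a sketch like the one below could help. It assumes fs is a GridFS object whose find() accepts a query on the files collection and yields objects with an _id attribute; delete_all_versions is a hypothetical helper, not part of PyMongo.

```python
def delete_all_versions(fs, filename):
    """Delete every stored file named `filename` and return how many were removed."""
    # Collect the ids first so we are not deleting while iterating the results.
    file_ids = [grid_out._id for grid_out in fs.find({"filename": filename})]
    for file_id in file_ids:
        fs.delete(file_id)
    return len(file_ids)
```

This mirrors the behaviour of the ‘delete’ command of mongofiles, which deletes all files with the given name.
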
I would highly recommend going through Python’s GridFS documentation, which can be found here – http://api.mongodb.org/python/current/api/gridfs/ and GridFS section of MongoDB documentation here – http://docs.mongodb.org/manual/core/gridfs/.