MongoDB GridFS using Python
Data is collected and stored in different forms depending on its nature: text, images, videos, audio files, and files in other formats. MongoDB can store all of these, but so far we have used it only for plain-text information in MongoDB documents. As you know by now, a MongoDB document has a size limit of 16 MB. That is enough in most cases, but it looks tiny when you think of storing high-resolution images, PDF files, music, videos, and so on.
In this chapter, we are going to see how to store larger files in MongoDB. Until now, we have played around with storing text in JSON format; in this chapter, we’ll see how to store images, PDFs, audio, video, etc. I am sure you are going to enjoy this chapter. So, let’s get started!
Introduction
MongoDB provides a way to store arbitrarily large files through GridFS. Until now, we have inserted documents in MongoDB that hold information in the form of key=value pairs. If you want to store files in MongoDB, it provides that functionality as well. Let me introduce you to GridFS.
GridFS is a specification for storing and retrieving arbitrarily large files. It divides one large file into multiple small chunks and stores each chunk as a separate document in MongoDB. The default size of a GridFS chunk is 255 KB (261,120 bytes). Because each chunk is stored as a document in MongoDB, all the operations that can be performed on regular MongoDB documents can be performed on these documents as well.
GridFS uses two collections to save files. Both of them are under the default namespace of fs. Here are the two collections:
files – This collection has information about the file, such as file name, chunk size, upload date, etc.
chunks – This collection holds the chunks, as the name suggests. Documents in this collection hold the binary data of the file.
Don’t worry if you find this a little complex. Things will become much clearer once we start playing around with examples.
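To make the two-collection layout a little more concrete, here is a minimal pure-Python sketch (written for Python 3) that mimics how a file could be split into one metadata document and several chunk documents. It does not talk to MongoDB at all. The function name split_into_gridfs_docs, the 600,000-byte payload, and the file_id value of 1 are assumptions made purely for illustration, and the real GridFS documents carry more fields than shown here.

```python
from datetime import datetime, timezone

DEFAULT_CHUNK_SIZE = 255 * 1024  # 261120 bytes, the default chunk size


def split_into_gridfs_docs(data, filename, file_id, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (files_doc, chunk_docs) shaped roughly like GridFS documents.

    Nothing is written to MongoDB; this only mimics the layout.
    """
    chunk_docs = []
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunk_docs.append({
            "files_id": file_id,                      # points back at the metadata document
            "n": n,                                   # position of this chunk, starting at 0
            "data": data[start:start + chunk_size],   # raw bytes of this chunk
        })
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "chunkSize": chunk_size,
        "uploadDate": datetime.now(timezone.utc),
        "length": len(data),
    }
    return files_doc, chunk_docs


payload = b"x" * 600000  # hypothetical 600,000-byte file
files_doc, chunk_docs = split_into_gridfs_docs(payload, "example.bin", file_id=1)
print(files_doc["length"], len(chunk_docs))    # 600000 3
print([len(c["data"]) for c in chunk_docs])    # [261120, 261120, 77760]
```

Only the last chunk is shorter than the chunk size; every other chunk is exactly 261,120 bytes.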
Let’s see what some of the benefits of storing files in MongoDB are.
All data in one place
By storing files in MongoDB, you have all your data in one place. This reduces maintenance overhead. Traditionally, the files are stored in a folder somewhere in the file system, and the path to the file is saved in the database. This requires you to take care of files in the folder in addition to the database.
Easy backup
If you save your files in MongoDB, then backing them up is not a problem at all. Create a replica set, and you have a redundant copy as well as failover ready. You also don’t need to back up the database and the files folder separately. You just take a backup of MongoDB, and you have all you need.
Store a large number of files
You can save a huge number of files in MongoDB. Unlike the file system, MongoDB has no limit on the number of documents it can handle. It depends solely on the resources you provide. You can also take advantage of sharding to distribute the load.
Random access to a file
GridFS splits the file into multiple chunks; this is how it works. By default, it divides the file into chunks of 255 KB. Every chunk is a document, and it behaves just like a normal MongoDB document. This gives you some flexibility, as you can ask for a specific chunk from a file or skip chunks. This can be useful in video streaming services that allow skipping.
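The arithmetic behind skipping ahead is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, assuming the default chunk size of 261,120 bytes; the helper names locate_byte and chunks_for_range are made up for this example and are not part of any MongoDB tool.

```python
CHUNK_SIZE = 255 * 1024  # 261120 bytes, the default chunk size


def locate_byte(offset, chunk_size=CHUNK_SIZE):
    """Return (chunk number n, position inside that chunk) for a byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, chunk_size)


def chunks_for_range(start, end, chunk_size=CHUNK_SIZE):
    """Return the chunk numbers needed to serve bytes start..end-1."""
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end")
    first = start // chunk_size
    last = (end - 1) // chunk_size
    return list(range(first, last + 1))


print(locate_byte(1000000))                # (3, 216640)
print(chunks_for_range(1000000, 1500000))  # [3, 4, 5]
```

So to serve half a megabyte starting at byte 1,000,000, only three chunk documents need to be read; chunks 0 to 2 can be skipped entirely.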
Okay! Enough theory. This book is not about theory, and I believe you are also more interested in practical code. We’ll first see some simple examples using the mongofiles tool. Afterward, we’ll jump over to Python and see some examples there. So let’s jump right in.
Command line tool – mongofiles
When you install MongoDB on your system, you get a few executable binaries such as mongo, mongod, mongodump, mongorestore, etc. One of them is mongofiles. This tool is used to browse and modify GridFS files. It’s a very simple tool, and we’ll use it to whet our appetite. Let’s see what we can do with mongofiles.
First, have a look at the options available with mongofiles. Open the Terminal on Unix/Linux or the Command Prompt on Windows and run the following command.
$ mongofiles
The above command will show you a lot of options. I would suggest you go through all of them at least once. Some of the options are common to all MongoDB tools, for instance the ‘-d’ option for the database, ‘-c’ for the collection, and ‘-h’ for the hostname.
The options that are specific to mongofiles are –
list – list all files
put – add a file with filename
get – get a file with filename
delete – delete all files with filename
Store a file
Let’s store our first file in MongoDB using the mongofiles tool. I have a picture on my system that I am going to store in MongoDB. With mongofiles, I don’t need to worry about anything; I just run the following command. One command and all done.
$ mongofiles -d 'mydb' put the-great-heads.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('54dbb22ca187e981e194a97c'), filename: "the-great-heads.jpg", chunkSize: 261120, uploadDate: new Date(1423684140782), md5: "f6b1abeb03c257e5ef1cf75e304fd5b9", length: 5846083 }
done!
You can see I passed my database name using the -d option. Then I used the put command, which stores the file. After put, I gave the name of the file, which is the name of the picture I have on my system. If the file is not in your current directory, then you have to provide the full path to the file.
The put operation was successful, as it did not throw any error. You can see it returned a JSON-like object containing information about the file: the _id, filename, chunkSize, uploadDate, md5, and length of the stored file.
List files
Now, let’s use the list command with mongofiles to list the files in GridFS.
$ mongofiles -d 'mydb' list
connected to: 127.0.0.1
the-great-heads.jpg 5846083
The output is clear enough. It shows only one file because we stored only one file. The number shown against the file name is the size of the file in bytes.
Delete file
Let’s see how to delete a file. We can use the delete command with mongofiles to delete a GridFS file. Here is an example –
$ mongofiles -d 'mydb' delete the-great-heads.jpg
connected to: 127.0.0.1
done!
Congratulations! We have just deleted the file we stored a minute ago.
GridFS collections
Before we go any further, let’s quickly look at the chunks and files collections so that you have a fair idea of the structure of the documents inside them.
Oops! We don’t have any data in these collections. You may remember that we deleted the only file we stored. No problem. Let’s store the file again, and then we’ll have a look at the GridFS collections from the mongo shell.
$ mongofiles -d 'mydb' put the-great-heads.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('54dbb9ca65b24efcec30a56c'), filename: "the-great-heads.jpg", chunkSize: 261120, uploadDate: new Date(1423686090205), md5: "f6b1abeb03c257e5ef1cf75e304fd5b9", length: 5846083 }
done!
Okay! The above command ran well, and the file is stored again in the mydb database. Now, let’s go to the mongo shell and see what’s inside.
$ mongo localhost/mydb
MongoDB shell version: 2.6.5
connecting to: localhost/mydb
The above command may look familiar to you. You have seen it a few times in previous chapters. In case you have forgotten what it does, mongo is the MongoDB client used to connect to the database and perform operations. With the above command, we are connected to the mydb database and can perform operations on it.
Files collection
The files collection contains the metadata of files. It doesn’t hold the data itself. The actual file content is stored in the chunks collection. Let’s check the files collection first and see what information is stored there.
> db.fs.files.find().pretty()
{
"_id" : ObjectId("54dbb9ca65b24efcec30a56c"),
"filename" : "the-great-heads.jpg",
"chunkSize" : 261120,
"uploadDate" : ISODate("2015-02-11T20:21:30.205Z"),
"md5" : "f6b1abeb03c257e5ef1cf75e304fd5b9",
"length" : 5846083
}
You can see I used the find() method on the fs.files collection. As I told you, GridFS saves file metadata and chunks as documents, so you can perform all the typical MongoDB operations on them as well. They are regular MongoDB documents. Our find() method returned only a single document because we have only one file right now. Let me explain the fields in brief. Don’t worry; I will do it quickly.
_id – This is an ObjectId automatically created by MongoDB.
filename – The name of the file. It’s better to give files different names.
chunkSize – The size of each chunk in bytes. The default is 255 KB (261,120 bytes).
uploadDate – The date and time of the file upload.
md5 – A hash that can be used to check file integrity.
length – The length of the file in bytes.
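The length, chunkSize, and md5 fields are enough for a couple of quick sanity checks. The sketch below is plain Python 3 and uses only the numbers from the document shown above; the helper names expected_chunk_count, last_chunk_size, and md5_matches are hypothetical. As far as I know, newer drivers may no longer fill in the md5 field, so treat the md5 check as an assumption that holds for the version used in this chapter.

```python
import hashlib


def expected_chunk_count(length, chunk_size):
    """Number of chunk documents a file of `length` bytes should have."""
    return (length + chunk_size - 1) // chunk_size  # ceiling division


def last_chunk_size(length, chunk_size):
    """Size in bytes of the final chunk (0 for an empty file)."""
    if length == 0:
        return 0
    remainder = length % chunk_size
    return remainder if remainder else chunk_size


def md5_matches(data, stored_md5):
    """Compare the MD5 of downloaded bytes with the md5 field of the files document."""
    return hashlib.md5(data).hexdigest() == stored_md5


# Values taken from the fs.files document shown above.
print(expected_chunk_count(5846083, 261120))  # 23
print(last_chunk_size(5846083, 261120))       # 101443
```

In other words, our 5.8 MB picture should occupy 23 documents in fs.chunks, numbered 0 to 22.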
Chunks collection
The chunks collection is the one that holds the file data. As you know by now, GridFS divides the file into multiple chunks and stores them as documents in the chunks collection. We can query it using the same technique that we used for the files collection. Because the chunks collection holds the real data in binary form, its documents are big compared to the documents in the files collection. So let’s get only one document from the chunks collection.
> db.fs.chunks.findOne()
{
"_id" : ObjectId("54dbb9cad26fa2234f917cf2"),
"files_id" : ObjectId("54dbb9ca65b24efcec30a56c"),
"n" : 7,
"data" : BinData(0,"SRbXIAVpU5LybByaEpP2jkDBQL5tVFad8sG4tlEbtn4voxBoJkuSrqamlMgQLYRjxhqMrxoMMo77tcQANm2YeGRAbNlocrtkuG0QyAGnEFtx0xEXK8Pi3cEqKHp44JHuajuKaeNEAyuEjJxBGMXbDpuckd+bkWs+I/a2w8IZAXza9bn8IHTDwVuw47NOEgUHIyjZZDyWK1aqOuTltuuw2K4o1PfK+PuaslgbLmFV98iCyvbzWKKVBO+E1zZYx3qLswIGTABLi5OK9nUNeWQNcnKjGljDvkweiSKK3qMiSAzWEEGpyQlbW1yqMa3YA9G1Yg+2CXJsiaOy0vy65DhoNpJkN1rhWG22RMuFpmBVhSA4ioy27SIAcljMo3O2W8JLLgtzctjlNbuNOJB2aDFDU98mRY2bxspEVJIyyJ4RRaQpMCehy6O3Nopa4K9OmTjRKOR2UyhJr2y3xNqQbJtotTbEDvYyLY4ndsBvo2XXN1Nqr0xA3Ty3C48Snvgo22SyCQU6ccvq2cY0GnVh075VxBxcgMeSkVZd8sEwWoEhsj8cjbkyPVpQBtgkSd2iIstqDXcYz5ORRHN3HhuemAS4hTELCeTbCgy47KQ5qVBPTGzWzDIQvYDqOmYkSTsxq1hbw65lAUspU0FLGnbJ2Bu1DmqGMU265Ucu+7kUC2PhpXrkSbOzcA1tXbr4ZI7L1W7DqKZK7WVdWwRgprsBfUUFcr36MwRzWVHbLSO9qIdWmQZANN8OCO7XI00K48ymItaTXY5cG0u/ZyNb2kRoNEEAeGAizuxMSXFV6nrgBrZyCAA12rhaDytwoR88r5NLTqB8ssxlfNykCnhgkCyA3stsPV3GwGAHhYEcRa4kD4euPECd2UTTuO253yXF3NpbYhRQYBZLKVtMQw3O+Mdjs1nzU164QGBiFtFII6nJAsYgFoq69iB45PiBZUQKIXRlQhJFSMhKFlpEdrWgSliafCcka5NY47PcuUlELEeOAmzs5OM8MbIcq+p8QO2MzTUayb8lSND1HbKTLvbMcSN24mCsw8RkSLasU/UfNfHvkZUHLgqlSMptulGnV4jfp44SLOzXIfJwRftHfCZEimrgB3K3inVsIkVHDW7liLbg7Y8YG7OML5KZQqeJ3XLRKxYQYmLljQtTKzOXNsjAE7rmi4nbpgjIkbspxALQiNQxBzIjPo0XSLhtFlpyFMx")
}
I used the findOne() method to get the first document that MongoDB finds in the collection. As documents in the chunks collection can be up to 255 KB, I had to remove most of the data field; otherwise, the document would have taken 4-5 pages. Let’s see what these fields are.
_id – This is an ObjectId automatically created by MongoDB.
files_id – This is the value of the _id field of the file document in the files collection. All chunks of the same file have the same files_id.
n – The sequence number of the chunk within the file. The first chunk has n equal to 0.
data – This is the binary data of the chunk.
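Putting these fields together, rebuilding a file amounts to collecting the chunks that share a files_id, ordering them by n, and joining their data. Here is a small pure-Python sketch of that idea. It works on plain dictionaries rather than a live collection, and the reassemble function and the sample values "A" and "B" are invented for illustration only.

```python
def reassemble(chunk_docs, files_id):
    """Rebuild a file's bytes from chunk documents that share `files_id`."""
    own = [c for c in chunk_docs if c["files_id"] == files_id]
    own.sort(key=lambda c: c["n"])  # chunks may arrive in any order
    for expected_n, chunk in enumerate(own):
        if chunk["n"] != expected_n:
            raise ValueError("missing chunk number %d" % expected_n)
    return b"".join(c["data"] for c in own)


# Hypothetical chunk documents for two different files, deliberately shuffled.
chunks = [
    {"files_id": "A", "n": 1, "data": b"world"},
    {"files_id": "B", "n": 0, "data": b"other file"},
    {"files_id": "A", "n": 0, "data": b"hello "},
]
print(reassemble(chunks, "A"))  # b'hello world'
```

The gap check matters because a missing chunk would otherwise silently produce a corrupted file.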
GridFS with PyMongo
I hope you are enjoying this interesting topic. But you must be wondering why we have not used PyMongo to interact with GridFS. I explained GridFS using the mongofiles tool for two reasons. First, it is much simpler and so a very effective way to give you a high-level overview. Second, my intent was to introduce you to the mongofiles tool. Otherwise, I am sure you would not even have noticed that there is a tool called ‘mongofiles’ for handling GridFS files.
Okay! Your wait is over. Let’s have some more fun with GridFS using PyMongo. We’ll run our Python code in Python’s interactive shell, also called the REPL, in one command-line window. We’ll keep the mongo shell open in another so that we can have a look at the collections quickly. Got it? Great! Let’s get started.
Open the mongo shell like this –
$ mongo
MongoDB shell version: 2.6.5
connecting to: test
>
Now, open the Python REPL –
$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
To use GridFS with PyMongo, we need to import some classes. Let’s import them first.
Import MongoClient so that we can connect to a running mongod instance.
>>> from pymongo import MongoClient
Import GridFS from the gridfs module so that we can put, get, and delete files in MongoDB.
>>> from gridfs import GridFS
We also need to import objectid from the bson package because we are going to need it.
>>> from bson import objectid
All of the above should run without any error if you are following along from the start. These are simple import statements, and the only thing that can go wrong is a module not being installed. In that case, you can use pip to install the Python module.
Storing strings
This time I’ll start with saving a simple string in MongoDB using GridFS. It’s pretty simple. Let me show you how.
- Connect to the mygrid database like this –
>>> db = MongoClient().mygrid
- Get the GridFS object
>>> fs = GridFS(db)
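- Store “hello world” in MongoDB using GridFS. The put method will return the ObjectId of the stored file. (On Python 3, pass bytes such as b"hello world" instead, since put() does not accept a plain str without an encoding argument.)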
>>> ob = fs.put("hello world")
- Now, let’s retrieve the ‘hello world’ we just stored. The get method is used to get the file. We’ll pass it the ObjectId returned by the put() method. And, because the get() method returns a file-like object, we’ll use its read() method to read the content. Like this –
>>> fs.get(ob).read()
'hello world'
Wow! Interesting, right? Storing strings in GridFS is not something you would normally want to do, though. If you are thinking of using GridFS, it is surely not because you have trouble storing strings otherwise. I showed this just to give you an idea.
Namespace
In the mongo shell, if you go and check the collections in the mygrid database, you will find the fs.files and fs.chunks collections. Nothing new, right? Do you remember I mentioned earlier that ‘fs’ is the default namespace for GridFS collections? You can change it to whatever you like. For instance, if I were using GridFS to store invoices, then I would want to have something like ‘invoice.files’ and ‘invoice.chunks’. Don’t you agree this makes more sense? Let me show you how to do this.
In this example, we’ll store an invoice that is in PDF form. Also, we’ll use a different namespace for our GridFS collections.
- Connect to the mygrid database like this –
>>> db = MongoClient().mygrid
- Get the GridFS object. The first argument is the database object, and the second argument is the namespace.
>>> fs = GridFS(db, "invoice")
- Store the invoice located at /tmp/first-invoice.pdf
>>> with open('/tmp/first-invoice.pdf', 'rb') as f:
...     invoice = fs.put(f, content_type='application/pdf', filename='first-invoice.pdf')
...
Now, if you go to the mongo shell and check the collections in the ‘mygrid’ database, you will find the ‘invoice.files’ and ‘invoice.chunks’ collections.
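If you store files like this often, you may want to wrap the steps in a small helper. The following is only a sketch for Python 3, not the way the library requires you to do it. The store_file name is made up, and the fs argument is assumed to be a GridFS object such as GridFS(db, "invoice") from the steps above; the function only relies on its put() method, called with the same content_type and filename keywords used earlier. The file is opened in binary mode because a PDF is not text, and the content type is guessed from the file extension with the standard mimetypes module.

```python
import mimetypes
import os


def store_file(fs, path):
    """Store the file at `path` through a GridFS-like object and return its id.

    `fs` is assumed to offer put(file_obj, **metadata), as GridFS does.
    """
    filename = os.path.basename(path)
    metadata = {"filename": filename}
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is not None:
        # Only record a content type when the extension is recognised.
        metadata["content_type"] = content_type
    with open(path, "rb") as f:  # binary mode: the content is raw bytes
        return fs.put(f, **metadata)


# Usage, assuming a running mongod and the objects created earlier:
#   fs = GridFS(MongoClient().mygrid, "invoice")
#   invoice = store_file(fs, "/tmp/first-invoice.pdf")
```

Keeping the GridFS object as a parameter means the same helper works for any namespace you choose.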
Reading files
We have seen how to store a file in MongoDB using GridFS. We have also briefly used the get() method in previous examples to get the content of a file back. But I think reading files needs a little more attention. Let’s take a look at the following example –
>>> db = MongoClient().mygrid
>>> fs = GridFS(db)
>>> ob = fs.put("hello world")
>>> fs.get(ob).read()
'hello world'
Yes, you got it right. You have seen this example earlier; I used it a little while ago. This time I just removed the explanation for each line, as I believe you know what each line of code does by now. In this example, we stored a file with ‘hello world’ as its content. The put() method returned the file_id. We captured the file_id in the ob variable because we need it to identify the file when we want to get the file back. Then, when we needed to read the file, we just passed the ob variable to the get() method. Great! But what about the read() method? The get() method returns a file-like object, and to see the content, we call the read() method on it. Wonderful! We have our highly impressive file content before us.
I know you have a question in your mind. You are wondering what happens if we don’t know the file_id. Yes, in most situations, you would not know the file_id. Most probably, you know the name of the file. Remember that GridFS collections are just like regular MongoDB collections. What I would do in such a situation is run a find() query on the fs.files collection and get the _id. This _id is the same as the files_id in the fs.chunks collection. Now that I have the file_id, I can go ahead and pass it to the get() method. Let’s see an example.
In this simple example, we’ll store a string as a file in GridFS. There is nothing wrong with storing a real file, but I am storing a string as a file purely for the sake of simplicity.
- Connect to mygrid database like this –
>>> db = MongoClient().mygrid
- Get the GridFS object. The first argument is the database object, and the second argument is the namespace.
>>> fs = GridFS(db, "stringfiles")
- Store a new string with filename “mystory.txt”.
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
Notice we didn’t capture the file_id returned by the put() method. Let’s find it in the stringfiles.files collection.
>>> f_id = db.stringfiles.files.find_one({ "filename" : "mystory.txt" },{ "_id" : 1 })
By now, you should be able to easily understand what the above statement is doing. Let me quickly explain, just in case you don’t remember. Here I am asking MongoDB to find a document in the collection named stringfiles.files that has its filename field set to mystory.txt. Also, I used a projection to return only the _id field. So, the result would be something like this –
{u'_id': ObjectId('54e4463bd6e6fe5c5c2b7171')}
Got it? Great! Now that we have the file_id, we can go ahead and read the file easily. Like this –
>>> fs.get(f_id['_id']).read()
'Here is the content of this file.'
Easy, right? I told you so. But this approach has a problem. The problem lies in the fact that the file name doesn’t have to be unique. This means that if you insert a file multiple times, then you will have multiple entries with the same file name in the fs.files collection. Let’s try it ourselves.
>>> fs = GridFS(db)
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bcdd6e6fe5c5c2b7173')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bced6e6fe5c5c2b7175')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bcfd6e6fe5c5c2b7177')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd0d6e6fe5c5c2b7179')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd1d6e6fe5c5c2b717b')
>>> fs.put("Here is the content of this file.", filename="mystory.txt")
ObjectId('54e44bd2d6e6fe5c5c2b717d')
>>>
We just stored the same file with the same content six times, and every time it returned a file_id. If I go to my mongo shell, then I can see similar results there also. See below the output of find() in the mongo shell.
> db.fs.files.find()
{ "_id" : ObjectId("54e44bcdd6e6fe5c5c2b7173"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:37.029Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bced6e6fe5c5c2b7175"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:38.958Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bcfd6e6fe5c5c2b7177"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:39.878Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd0d6e6fe5c5c2b7179"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:40.789Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd1d6e6fe5c5c2b717b"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:41.684Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
{ "_id" : ObjectId("54e44bd2d6e6fe5c5c2b717d"), "chunkSize" : 261120, "filename" : "mystory.txt", "length" : 33, "uploadDate" : ISODate("2015-02-18T08:22:42.581Z"), "md5" : "9f1a03ba9a31737a554ea79470f3a621" }
>
The only fields that differ are _id and uploadDate. The rest of the fields are exactly the same. Why, you may ask, does MongoDB do this? This is probably meant as a simple way of versioning files. If you need to read the file, just get the latest one. If you don’t like the idea, then you can give each file a unique name.
What should we do now? Don’t worry. We have an easy way to get the most recent file, along with all of its metadata. The good part is that you don’t need the file_id to read the content of the file. We just need to know the file name, and we are good to go. See here –
>>> f = fs.get_last_version(filename="mystory.txt")
Read the content of the file.
>>> f.read()
'Here is the content of this file.'
Get the ‘_id’ of the file.
>>> f._id
ObjectId('54e4500fd6e6fe5c5c2b717f')
Get the upload date of the file.
>>> f.uploadDate
datetime.datetime(2015, 2, 18, 8, 40, 47, 629000)
Get the name of the file.
>>> f.filename
u'mystory.txt'
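Conceptually, picking the last version boils down to choosing the metadata document with the most recent uploadDate among those that share a file name. The sketch below shows that idea in plain Python 3 on three of the documents listed earlier, written as dictionaries with the _id shown as a string for brevity. It is my own illustration of the concept, not PyMongo’s implementation, and the last_version function is a made-up name.

```python
from datetime import datetime

# Three of the fs.files documents shown above, reduced to the relevant fields.
files = [
    {"_id": "54e44bcdd6e6fe5c5c2b7173", "filename": "mystory.txt",
     "uploadDate": datetime(2015, 2, 18, 8, 22, 37, 29000)},
    {"_id": "54e44bd2d6e6fe5c5c2b717d", "filename": "mystory.txt",
     "uploadDate": datetime(2015, 2, 18, 8, 22, 42, 581000)},
    {"_id": "54e44bcfd6e6fe5c5c2b7177", "filename": "mystory.txt",
     "uploadDate": datetime(2015, 2, 18, 8, 22, 39, 878000)},
]


def last_version(files_docs, filename):
    """Return the metadata document with the newest uploadDate for `filename`."""
    matches = [d for d in files_docs if d["filename"] == filename]
    if not matches:
        raise LookupError("no file named %r" % filename)
    return max(matches, key=lambda d: d["uploadDate"])


print(last_version(files, "mystory.txt")["_id"])  # 54e44bd2d6e6fe5c5c2b717d
```

This is also why giving every upload the same name works as a crude versioning scheme: the upload date alone decides which copy counts as current.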
Interesting, right? I am enjoying this topic, and I hope you are enjoying it.
Delete files
Deleting files is extremely simple. The delete() method removes all data relating to the file_id passed as an argument to the method. Let’s see a quick example.
>>> db = MongoClient().mygrid
>>> fs = GridFS(db)
>>> fileid = fs.put("This is the content of the file.", filename="newfile.txt")
>>> fs.get(fileid).read()
'This is the content of the file.'
>>> fs.delete(fileid)
In case you don’t know the file_id but you know the name of the file, you can delete the most recently uploaded file like this –
>>> fileid = fs.get_last_version(filename="newfile.txt")._id
>>> fs.delete(fileid)
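Note that this removes only one stored copy, whereas the mongofiles delete command we saw earlier deletes all files with the given name. If you want the same behaviour from Python, one possible approach is sketched below. The delete_all_versions helper is hypothetical; it assumes fs is a GridFS object whose find() method accepts a filter document and yields objects with an _id attribute, and it reuses the delete() method shown above.

```python
def delete_all_versions(fs, filename):
    """Delete every stored file called `filename`; return how many were removed."""
    # Collect the ids first so that we are not deleting while still iterating.
    file_ids = [grid_out._id for grid_out in fs.find({"filename": filename})]
    for file_id in file_ids:
        fs.delete(file_id)
    return len(file_ids)


# Usage, assuming a running mongod and the objects created earlier:
#   fs = GridFS(MongoClient().mygrid)
#   removed = delete_all_versions(fs, "newfile.txt")
```

The return value makes it easy to tell whether anything was actually deleted.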
I would highly recommend going through PyMongo’s GridFS documentation.