Python and MongoDB

This primarily refers to usage of the pymongo package, however, the concepts apply to using MongoDB from Python in general. Interacting with MongoDB from Python has a couple of easy traps to fall into, and this looks at one in particular and how to avoid the pitfall via proper usage of indexes and careful construction of queries.

When querying MongoDB in general, it will attempt to use indexes based on the components of your query, in the order you specify them. That is, if you construct a query that looks as follows:

docs = db.collection.find({"stuff": 1, "things": 0})

If you use that as the filter for a .find() call, MongoDB will look for a compound index where the fields are specified in the order (stuff, things). If a compound index exists with both fields but the ordering is instead (things, stuff), then the index may not be used! I have experienced a number of situations where the query planner will select the correct index despite providing the query fields in a different order, however, that behavior should not be relied upon.

NOTE 1: make sure that you create compound indexes on your MongoDB collections with the fields specified in the order which you want to query them in and that you construct query filters using the same ordering as your indexes!

Great, so we understand that ordering is important on the MongoDB side of the equation, but there’s just one little problem. Almost all examples of using a library like pymongo to interact with MongoDB from Python use plain old dict objects to construct query filters; the thing is, Python dict objects are inherently an unordered data structure and thus provide no guarantee of maintaining any specific ordering. While that is not entirely true for Python 3.6+ (but that is a whole different topic) I would still argue that using the approach explained below conforms better to best practice and more explicitly communicates the intent of what the code is doing.

If the filter only contains a single field, then there won’t be any issues because there aren’t multiple fields. However, as soon as you introduce multiple fields into a filter using a normal dict, the order in which they might be presented to MongoDB becomes non-deterministic. This means that our example query might actually be serialized at runtime to:

{"things": 0, "stuff": 1}

Which could prevent MongoDB from utilizing a compound index on (stuff, things). That might not be an issue on very small collections, but try doing that on hundreds of thousands or millions of documents and the queries will start taking orders of magnitude longer.

There are certain edge-cases with some versions of Python where depending on whether hash randomization is enabled or not the ordering of dict``s may be preserved (or at least deterministic), however, the point here is that things like this **should not** ever be relied upon for ordering since the ``dict as a data structure provides no guarantees or ordering.

It is unfortunate that the pymongo documentation is not clearer on this point; though it does mention it, you essentially have to read a section about a method which takes a filter (like the collection find method), and notice that the filter argument says:

filter (optional): a SON object specifying elements which must be present for a document to be included in the result set

Well it says nothing about dict… so what is a SON object? The quoted description above provides no clickable link in the documentation, but if you dig around a little more you can come across the Tools for working with SON an ordered mapping page about the bson.son module which defines SON objects, and says (emphasis added by me):

Regular dictionaries can be used instead of SON objects, but not when the order of keys is important. A SON object can be used just like a normal Python dictionary.

NOTE 2: now we see that filters are expected to be SON objects, or speaking more broadly, dict objects can be used in specific cases, but as a general rule filters are expected to be an ordered mapping.

Right, so pymongo actually expects an ordered mapping, which makes sense based on the importance of ordering on the MongoDB side. How then do we ensure that ordering is preserved for queries? There are effectively two options; personally I’ve not found any compelling reasons to use bson.son.SON and prefer to use the already-familiar option that is built right into Python, collections.OrderedDict. OrderedDict acts almost exactly like a normal dict object, however, it also preserves the order in which keys were inserted. Using this approach, our example query filter could instead be constructed with:

from collections import OrderedDict

query_filter = OrderedDict()
query_filter['stuff'] = 1
query_filter['things'] = 0

docs = db.collection.find(query_filter)

It can also be built directly by utilizing the constructor but you have to keep in mind that initializing data by passing keyword arguments or another dict object will not preserve ordering. This is because OrderedDict can only preserve what ordering it has access to, and neither kwargs nor a dict provide an order. Instead, you will need to pass it an iterable of some sort (such as a list, tuple, or generator) of key-value pairs. Constructing the example filter in this manner would look like:

from collections import OrderedDict

query_filter = OrderedDict([('stuff', 1), ('things', 0)])

docs = db.collection.find(query_filter)

NOTE 3: initializing it may not look as clean or pretty as a dict, but you should almost always use OrderedDict for performing MongoDB queries in Python; you can initialize an OrderedDict properly by passing an iterable of key-value pairs to the constructor. This will ensure that ordering is preserved and ensure compound indexes with the same order will be utilized.