Python and MongoDB¶
This primarily refers to usage of the pymongo
package, however, the
concepts apply to using MongoDB from Python in general. Interacting with
MongoDB from Python has a couple of easy traps to fall into, and this looks
at one in particular and how to avoid the pitfall via proper usage of indexes
and careful construction of queries.
When querying MongoDB in general, it will attempt to use indexes based on the components of your query, in the order you specify them. That is, if you construct a query that looks as follows:
docs = db.collection.find({"stuff": 1, "things": 0})
If you use that as the filter for a .find()
call, MongoDB will look for a
compound index where the fields are specified in the order (stuff, things)
.
If a compound index exists with both fields but the ordering is instead
(things, stuff)
, then the index may not be used! I have experienced a
number of situations where the query planner will select the correct index
despite providing the query fields in a different order, however, that behavior
should not be relied upon.
NOTE 1: make sure that you create compound indexes on your MongoDB collections with the fields specified in the order which you want to query them in and that you construct query filters using the same ordering as your indexes!
Great, so we understand that ordering is important on the MongoDB side of the
equation, but there’s just one little problem. Almost all examples of using
a library like pymongo
to interact with MongoDB from Python use plain old
dict
objects to construct query filters; the thing is, Python dict
objects are inherently an unordered data structure and thus provide no
guarantee of maintaining any specific ordering. While that is not entirely true
for Python 3.6+ (but that is a whole different topic) I would still argue that
using the approach explained below conforms better to best practice and more
explicitly communicates the intent of what the code is doing.
If the filter only contains
a single field, then there won’t be any issues because there aren’t multiple
fields. However, as soon as you introduce multiple fields into a filter using
a normal dict
, the order in which they might be presented to MongoDB
becomes non-deterministic. This means that our example query might actually
be serialized at runtime to:
{"things": 0, "stuff": 1}
Which could prevent MongoDB from utilizing a compound index on
(stuff, things)
. That might not be an issue on very small
collections, but try doing that on hundreds of thousands or millions of
documents and the queries will start taking orders of magnitude longer.
There are certain edge-cases with some versions of Python where depending on
whether hash randomization is enabled or not the ordering of dict``s may be
preserved (or at least deterministic), however, the point here is that things
like this **should not** ever be relied upon for ordering since the ``dict
as a data structure provides no guarantees or ordering.
It is unfortunate that the pymongo documentation is not clearer on this point;
though it does mention it, you essentially have to read a section about a
method which takes a filter (like the collection
find
method), and notice that the filter
argument says:
filter (optional): a SON object specifying elements which must be present for a document to be included in the result set
Well it says nothing about dict
… so what is a SON object? The quoted
description above provides no clickable link in the documentation, but if you
dig around a little more you can come across the
Tools for working with SON an ordered mapping
page about the bson.son
module which defines SON
objects, and says
(emphasis added by me):
Regular dictionaries can be used instead of SON objects, but not when the order of keys is important. A SON object can be used just like a normal Python dictionary.
NOTE 2: now we see that filters are expected to be SON objects, or speaking
more broadly, dict
objects can be used in specific cases, but as a
general rule filters are expected to be an ordered mapping.
Right, so pymongo
actually expects an ordered mapping, which makes sense
based on the importance of ordering on the MongoDB side. How then do we ensure
that ordering is preserved for queries? There are effectively two options;
personally I’ve not found any compelling reasons to use bson.son.SON
and
prefer to use the already-familiar option that is built right into Python,
collections.OrderedDict
. OrderedDict
acts almost exactly like a normal
dict
object, however, it also preserves the order in which keys were
inserted. Using this approach, our example query filter could instead be
constructed with:
from collections import OrderedDict
query_filter = OrderedDict()
query_filter['stuff'] = 1
query_filter['things'] = 0
docs = db.collection.find(query_filter)
It can also be built directly by utilizing the constructor but you have to
keep in mind that initializing data by passing keyword arguments or another
dict object will not preserve ordering. This is because OrderedDict
can
only preserve what ordering it has access to, and neither kwargs nor a dict
provide an order. Instead, you will need to pass it an iterable of some sort
(such as a list, tuple, or generator) of key-value pairs. Constructing the
example filter in this manner would look like:
from collections import OrderedDict
query_filter = OrderedDict([('stuff', 1), ('things', 0)])
docs = db.collection.find(query_filter)
NOTE 3: initializing it may not look as clean or pretty as a dict
, but you
should almost always use OrderedDict
for performing MongoDB queries in
Python; you can initialize an OrderedDict
properly by passing an iterable
of key-value pairs to the constructor. This will ensure that ordering is
preserved and ensure compound indexes with the same order will be utilized.