Visual and auditory representations in the human brain have been studied with encoding, decoding and reconstruction models. Representations from convolutional neural networks have been used as explanatory models for these stimulus-induced hierarchical brain activations. However, none of the fMRI datasets currently available has adequate amounts of data for sufficiently sampling their representations. We recorded a densely sampled large fMRI dataset (TR=700 ms) in a single individual exposed to spatiotemporal visual and auditory naturalistic stimuli (30 episodes of BBC’s Doctor Who). The data consists of 120.830 whole-brain volumes (approx. 23 h) of single-presentation data (full episodes, training set) and 1.178 volumes (11 min) of repeated narrative short episodes (test set, 22 repetitions), recorded with fixation over a period of six months. This rich dataset can be used widely to study the way the brain represents audiovisual input across its sensory hierarchies.