Strategies for effective transfer learning for HRTEM analysis with neural networks

Generalization measures

In supervised learning, we use the validation loss (more formally, the generalization gap in empirical risk minimization settings) as our working metric for measuring in-distribution generalization $G_{\scriptscriptstyle\text{ID}}$. For a parameterized model $f_\theta: X \rightarrow Y$ optimized against a loss function $\mathcal{L}$,

$$G_{\scriptscriptstyle\text{ID}, \mathcal{X}}(f_\theta) = \mathcal{L}\left(f_\theta(X), Y\right) - \underset{S \sim \mathcal{X}}{\mathbb{E}}\left[ \mathcal{L}\left( f_\theta(X), Y \right) \right] \tag{4.1}$$

where $(X, Y)$ are the data and supervision labels (i.e., micrographs and segmentation masks) drawn from an underlying distribution $\mathcal{X}$, and $S$ is the finite sample of data used to optimize the model $f_\theta$. The loss $\mathcal{L}\left(f_\theta(X), Y\right)$ represents the performance of the model over the entire domain $\mathcal{X}$, but in practice we approximate this loss using a second finite dataset (the validation dataset), which should be drawn independently and identically distributed to the training sample $S$. During model training, we optimize the model weights $\theta$ against the training loss $\mathbb{E}_S\left[ \mathcal{L}\left( f_\theta(X), Y \right) \right]$; commonly, the final weights for a model are instead selected to be the parameters that minimize $G_{\scriptscriptstyle\text{ID}}(f_\theta)$.
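In practice, both terms are estimated as sample means of per-example losses. The sketch below illustrates this estimate; the helper name and the loss values are purely illustrative, not drawn from any particular experiment:

```python
def estimate_generalization_gap(val_losses, train_losses):
    """Finite-sample estimate of G_ID: the mean validation loss (a proxy
    for the loss over the full distribution X) minus the mean training
    loss computed on the sample S used to fit the model f_theta."""
    val = sum(val_losses) / len(val_losses)
    train = sum(train_losses) / len(train_losses)
    return val - train

# Hypothetical per-micrograph segmentation losses (illustrative numbers).
train_losses = [0.10, 0.12, 0.08, 0.11]  # losses on the training sample S
val_losses = [0.15, 0.18, 0.14, 0.17]    # losses on an i.i.d. validation set
gap = estimate_generalization_gap(val_losses, train_losses)
```

Tracking this quantity over training and keeping the checkpoint with the lowest validation loss is one standard way to implement the weight-selection criterion described above.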

Using the shorthand $\mathcal{L}_{\mathcal{X}}(f) = \mathcal{L}\left(f(X), Y\right)$, we can analogously define an out-of-distribution generalization metric as

$$G_{\scriptscriptstyle\text{OOD}, \mathcal{X} \rightarrow \mathcal{X}'}(f) = \mathcal{L}_{\mathcal{X}'}(f) - \mathcal{L}_{\mathcal{X}}(f) \tag{4.2}$$

where $\mathcal{X}$ is the training distribution of $f$ and $\mathcal{X}'$ is a new distribution of data which differs from $\mathcal{X}$, i.e., under some measure, $\mathcal{X} \neq \mathcal{X}'$. Both terms in Eq. (4.2) must be approximated with validation datasets in practice. We note, also, that Eq. (4.2) is not always nonnegative, unlike Eq. (4.1) (which, in theory, should be): one can imagine a scenario in which a model improves out of distribution.
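The same estimation strategy applies out of distribution: with a held-out set drawn from the training distribution and a second set drawn from the shifted distribution, the difference of mean losses estimates Eq. (4.2). A minimal sketch, with invented numbers chosen to show that the estimate can indeed be negative:

```python
def estimate_ood_gap(losses_new_dist, losses_train_dist):
    """Finite-sample estimate of G_OOD: mean loss on a validation set
    drawn from the shifted distribution X' minus mean loss on a held-out
    set drawn from the original training distribution X."""
    new = sum(losses_new_dist) / len(losses_new_dist)
    old = sum(losses_train_dist) / len(losses_train_dist)
    return new - old

# Hypothetical losses: here the model happens to do *better* on X',
# so the estimate is negative, unlike the in-distribution gap.
losses_on_x = [0.20, 0.22, 0.18]    # held-out data from X
losses_on_xp = [0.15, 0.16, 0.14]   # data from the shifted distribution X'
ood_gap = estimate_ood_gap(losses_on_xp, losses_on_x)
```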