## Abstract

We develop and explore a deep-learning-based single-shot ptychography reconstruction method. We show that a deep neural network, trained using only experimental data and without any model of the system, leads to reconstructions of natural real-valued images with higher spatial resolution and better resistance to systematic noise than common iterative algorithms.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Ptychography is a powerful coherent diffractive imaging technique, yielding label-free, high-contrast quantitative amplitude and phase images without the need for prior information (e.g. support) on the object and probe beam [1,2]. In a conventional ptychographic microscope, a complex-valued object is scanned in a stepwise fashion through a localized beam. In each step, the intensity diffraction pattern (DP) from the illuminated region on the object is measured on a Fraunhofer plane. Critically, the illumination spot in each step overlaps substantially with neighboring spots, resulting in significant redundancy in the measured data. The set of recorded DPs is used to reconstruct the object’s complex transfer function using an iterative phase retrieval algorithm. Recently, a neural network (NN) based reconstruction algorithm was developed for scanning-based ptychography [3], showing high-quality reconstructions with no prior knowledge about the probe beam and extremely low probe overlap. Notably, prior knowledge of the probes’ central positions is still required, and the NN algorithm was not demonstrated on experimental data.

A significant limitation of conventional ptychography is the relatively long acquisition time (due to the scanning), precluding its application to imaging of fast dynamics. Recently, single-shot ptychography (SSP) schemes, in which tens or hundreds of intensity DPs are recorded in a single CCD exposure, were proposed and demonstrated to overcome this limitation [4–7]. Remarkably, SSP allows ultrafast ptychography [8] as well as ultrahigh-speed imaging [8,9]. In the current reconstruction algorithm of SSP (which is based on iterative scanning ptychography), the recorded data is divided into zones; each zone contains approximately one DP, associated with a known localized region of the object. However, this post-detection partitioning results in resolution deterioration. For instance, an $N \times N$ square-lattice division limits the spatial resolution to $N \lambda / 2$ [5]. Measuring an object with Fourier components beyond this limit gives rise to crosstalk between different zones, which degrades the reconstruction quality, since the crosstalk information is treated as noise by current SSP reconstruction algorithms.

Here we develop and experimentally explore reconstruction of SSP data of natural real-valued images by employing deep neural networks (DNNs). More specifically, we train a convolutional encoder-decoder NN to learn the inverse mapping of the SSP measurement function using supervised learning, where the SSP recorded data is the datum and the sampled image is its label. Importantly, in contrast to all current ptychographic reconstruction algorithms (single-shot and scanning based, iterative and DNN), our DNN does not receive the connections between the DPs and their associated central coordinates on the object, or any other information about the microscopic system. We find that the network handles experimental noise and artifacts better than the current standard single-shot ptychographic reconstruction algorithms, the extended Ptychographic Iterative Engine (ePIE) [1,2] and the semi-implicit relaxed Douglas-Rachford algorithm (sDR) [10], which is a generalized version of the Difference Map (DM) algorithm [11], giving rise to higher spatial resolution and improved reconstructions, especially with a poorly aligned system.

## 2. Experimental setup

The optical setup we used is based on the 4fSSP configuration (see Fig. 1(a)) [5,8,9,12,13]. In 4fSSP, an array of pinholes is located at the input plane of a 4f system. Lens L1 focuses the light beams that diffract from the array onto the object, which is located at distance $d$ before the back focal plane of lens L1. The small displacement from the Fourier plane, $d$, creates a partial overlap between the beams which is essential in ptychography. Lens L2 collects the diffracted light from the object and transfers it to a camera on the output plane of the 4f system. Assuming that the spatial power spectrum of the object is largely confined to a low-frequency region, the camera measures an intensity pattern consisting of clearly distinguishable blocks. Each block contains a diffraction pattern associated with a beam originating from a single pinhole and contains spectral information about a specific region on the object plane.

In our experiment (Fig. 1(b)), the optical setup comprises a 520nm diode laser coupled to a single-mode fiber for spatial filtering; the beam is then spatially magnified by a telescope (not shown in Fig. 1) and enters a modified 4fSSP setup. In order to keep the design flexible and versatile, we replaced the static pinhole array with a reflective HOLOEYE PLUTO-2 phase-only spatial light modulator (SLM), denoted by SLM_{P}, that generates a tunable mask-like beam structure on the input plane of the 4f system. Specifically, we set the phase such that SLM_{P} acts as a micro-lens array (MLA), producing an effective pinhole array at a focal distance, $f_{MLA}$, downstream from the SLM. Hence, SLM_{P} is located $f_{MLA}$ before the input plane of the 4f system. An example of such a phase mask is shown in Fig. 1(c). It is an array of $N_X \times N_Y = 9 \times 9$ square phase masks, where the transmission of each mask is given by $\Phi (r)=\exp (i \pi r^2 / \lambda f_{MLA})$, where $\lambda = 520$nm is the illumination wavelength and $r$ is the distance from the center of a single micro-lens, defined locally in each square mask. We determined $f_{MLA}$ according to $f_{MLA} = f_{OL} b / W = 100$mm, where $f_{OL}=50$mm is the focal length of lens OL, $b=0.5$mm is the distance between consecutive lenses/pinholes and $W=250 \mu$m is the required single-probe spot size on the object plane. Lens OL replaces both lenses L1 and L2 in a double-pass configuration: the beams pass through OL twice – first as they propagate towards the object, and again when they are reflected from the object towards the camera. This configuration leads to a more compact setup, which is essential when the focal lengths of L1 and L2 are short and would raise practical mounting challenges.
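As a concrete illustration, the MLA phase mask described above could be generated numerically along the following lines (a sketch with our own variable names; the SLM's phase-to-gray-level calibration and mounting details are omitted):

```python
# Sketch of the SLM_P micro-lens-array phase mask described above.
# Variable names are ours; physical values follow the setup in the text.
import numpy as np

wavelength = 520e-9              # illumination wavelength (m)
f_OL, b, W = 50e-3, 0.5e-3, 250e-6
f_MLA = f_OL * b / W             # micro-lens focal length -> 100 mm
pixel = 8e-6                     # SLM_P pixel pitch (m)

n_px = int(round(b / pixel))     # pixels per square micro-lens (~62)
N = 9                            # 9 x 9 lens array

# Local coordinate inside one square micro-lens, centered at 0
x = (np.arange(n_px) - (n_px - 1) / 2) * pixel
X, Y = np.meshgrid(x, x)
r2 = X**2 + Y**2

# Quadratic lens phase, wrapped to [0, 2*pi): pi * r^2 / (lambda * f_MLA)
phase_single = np.mod(np.pi * r2 / (wavelength * f_MLA), 2 * np.pi)

# Tile to the full 9 x 9 array of identical micro-lenses
mask = np.tile(phase_single, (N, N))
```

Note that $b$ corresponds to 62.5 SLM pixels; the sketch rounds this to a whole number of pixels per lens, a discretization the real device also faces.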

Since training a DNN typically requires thousands of samples, we used an amplitude-only SLM (HOLOEYE HED 6001 monochrome LCOS microdisplay) as an object, allowing fast acquisition of thousands of different object images. Because our objects are displayed on an amplitude SLM, all the objects in this work are real and positive. The object SLM, denoted by SLM_{O}, is placed at a distance $d=7$mm before the focal plane of OL (which is the Fourier plane of the 4f system). This yields $\sim$75% overlap on the object plane between beams originating from neighboring lenses, and a field of view of FOV $\approx (N_X \times N_Y)b d / f_{OL} = 630 \mu$m $\times$ $630 \mu$m [5]. Both SLMs, SLM_{P} and SLM_{O}, have $1920 \times 1080$ pixels, a pixel size of $8 \mu$m and an 8-bit dynamic range.
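The quoted overlap and field of view follow directly from this geometry; a quick sanity check (an illustrative sketch, with linear overlap taken as one minus the probe-center spacing over the spot size):

```python
# Back-of-the-envelope check of the geometry quoted above.
f_OL = 50e-3      # focal length of OL (m)
b = 0.5e-3        # pinhole / micro-lens pitch (m)
d = 7e-3          # object offset from the Fourier plane (m)
W = 250e-6        # probe spot size on the object plane (m)
N = 9             # probes per axis

shift = b * d / f_OL        # spacing between neighboring probe centers: 70 um
overlap = 1 - shift / W     # linear overlap fraction: 0.72 (~75%)
fov = N * b * d / f_OL      # field of view per axis: 630 um

print(f"{shift*1e6:.0f} um, {overlap:.0%}, {fov*1e6:.0f} um")
```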

On the second pass, OL transforms the object’s plane exit-wave to the spatial frequency domain (up to an additional phase which is undetectable by the camera) at the exit plane of the 4f system, where the camera is placed. The mapping from real-space camera coordinates to spatial frequencies is given by $\pmb {\nu } = \mathbf {r} / \lambda f_{OL}$, where $\mathbf {r}$ is the spatial coordinates vector on the camera plane. Hence, the highest measurable frequency in each block (of width $b$) is given by:

$$\nu _{\mathit {cutoff}} = \frac {b}{2 \lambda f_{OL}} \tag {1}$$

The resulting images are captured with a Basler acA2440-75um camera that has $2448 \times 2048$ pixels, a pixel size of $3.45 \mu$m and a 12-bit dynamic range.

## 3. SspNet

#### 3.1 Architecture

The proposed NN (Fig. 2), denoted by SspNet, is a convolutional encoder-decoder comprised of 9 encoding units (the encoder) and 9 decoding units (the decoder). Each encoder unit consists of a convolution layer with a $4 \times 4$ kernel, stride 2 and 1-pixel padding, followed by a LeakyReLU activation with $\alpha = 0.2$ and a batch normalization layer (except the first unit, which does not contain batch normalization). Each decoder unit consists of a transposed convolution layer, followed by a ReLU activation and a batch normalization layer, except the last one, which contains only the transposed convolution layer. The first 3 transposed convolution layers have a $4 \times 4$ kernel, 1-pixel padding and stride 2, and the last 6 have $3 \times 3$ kernels, 1-pixel padding and stride 1 to keep the same image size. The first encoder unit has a single input channel and 4 output channels, and the number of output channels of every other encoder unit is twice the number of its input channels. The number of output channels of every decoder unit is half the number of its input channels, except the last unit, which has 1 output channel. Therefore, for an input image of size $1 \times 2448 \times 2048$, the feature encoding size (the size of the tensor in the middle of the network) is $1024 \times 4 \times 4$ and the output image size is $1 \times 32 \times 32$.
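The description above can be sketched in PyTorch as follows (our own naming and code, not the authors' implementation; the layer ordering inside each unit follows the text):

```python
import torch
import torch.nn as nn

class SspNet(nn.Module):
    """Convolutional encoder-decoder sketch following Sec. 3.1."""
    def __init__(self):
        super().__init__()
        enc = []
        in_ch = 1
        for i in range(9):
            out_ch = 4 * 2**i                    # 4, 8, ..., 1024 channels
            enc += [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                    nn.LeakyReLU(0.2)]
            if i > 0:                            # first unit: no batch norm
                enc.append(nn.BatchNorm2d(out_ch))
            in_ch = out_ch
        self.encoder = nn.Sequential(*enc)

        dec = []
        for i in range(9):
            out_ch = 1 if i == 8 else in_ch // 2  # 512, ..., 4, then 1
            if i < 3:   # upsampling units: 4x4 kernel, stride 2
                dec.append(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1))
            else:       # size-preserving units: 3x3 kernel, stride 1
                dec.append(nn.ConvTranspose2d(in_ch, out_ch, 3, stride=1, padding=1))
            if i < 8:   # last unit: transposed convolution only
                dec += [nn.ReLU(), nn.BatchNorm2d(out_ch)]
            in_ch = out_ch
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

With a $1 \times 2448 \times 2048$ input, the nine stride-2 convolutions reduce the spatial size to $4 \times 4$ at 1024 channels, and the three stride-2 transposed convolutions bring the output to $1 \times 32 \times 32$, matching the dimensions stated above.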

#### 3.2 Training

SspNet was trained to reconstruct an image from its SSP measurement. Unlike previous SSP reconstruction methods [4,5], we use the recorded diffraction pattern directly as an input to the algorithm without dividing it into $N \times N$ square blocks of single DPs. Avoiding the division is potentially beneficial for obtaining high resolution. Since information about frequencies higher than $\nu _{\mathit {cutoff}}$ (see Eq. (1)) exceeds the boundaries of a DP block, it is not available in algorithms that do divide the data into blocks, hence limiting the resolution. Moreover, in these algorithms, this information is not just lost, but practically becomes noise for neighboring DP blocks. Thus, our SspNet may lead to reconstructions with higher resolution than previous SSP algorithms.
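For contrast, the block partitioning that previous SSP algorithms perform on the camera frame (and that SspNet avoids) amounts to a simple reshape; the block size here is illustrative:

```python
import numpy as np

# Partition a camera frame into an N x N grid of diffraction-pattern
# blocks, as done by previous SSP reconstruction algorithms.
N, block = 9, 128
frame = np.random.rand(N * block, N * block)   # stand-in for a camera frame

# (N*block, N*block) -> (N, block, N, block) -> (N, N, block, block)
blocks = frame.reshape(N, block, N, block).swapaxes(1, 2)
```

Any signal from the object that extends past a block boundary lands inside a neighboring block, where these algorithms treat it as noise; SspNet sees the whole frame at once, so this information is not discarded.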

We trained SspNet (using PyTorch) on experimental measurements of 20,000 natural images from the CIFAR10 dataset [14], one of the most common benchmarks in computer vision. Before feeding a raw data image through SspNet, we take its square root (using the amplitude instead of the intensity of the measurement) and normalize it such that the value of the brightest pixel is 1. SspNet was trained to output, for every SSP measurement, an image that matches the original ground-truth image from CIFAR10 that was displayed on SLM_{O} (a grayscale version of the original RGB image). In other words, SspNet was optimized such that the function it implements is close to the inverse mapping of the SSP measurement. This is achieved using Adam [15], a variant of stochastic gradient descent, by minimizing the mean absolute error (L_{1} loss) between the output image and the ground truth. We trained with a learning rate of 0.0002 and $\beta _1=0.9$.
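A minimal sketch of this preprocessing and training step follows (the tiny `model` is a stand-in for SspNet so the snippet is self-contained; beta_2 is left at the Adam default, which the text does not state):

```python
import torch
import torch.nn.functional as F

def preprocess(raw):
    """Square root of the recorded intensity (-> amplitude), then
    normalize so the brightest pixel equals 1."""
    amp = torch.sqrt(raw.clamp(min=0))
    return amp / amp.max()

model = torch.nn.Conv2d(1, 1, 3, padding=1)   # stand-in for SspNet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))

def train_step(raw_measurement, ground_truth):
    """One optimization step: L1 loss between output and ground truth."""
    optimizer.zero_grad()
    pred = model(preprocess(raw_measurement))
    loss = F.l1_loss(pred, ground_truth)      # mean absolute error
    loss.backward()
    optimizer.step()
    return loss.item()
```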

## 4. Experimental results

In this work we compare SspNet with two different iterative reconstruction algorithms: ePIE [2] and sDR [10]. These algorithms belong to different classes: ePIE is an alternating-projection algorithm, while sDR is based on both projection and reflection [10]. Comparing against algorithms from different classes raises the confidence in our results. Since the objects are displayed on an amplitude-only SLM, we compare only the absolute value of the complex reconstructed images. The computation time of SspNet is 5ms per image, much lower than that of 100 iterations of ePIE or sDR – 150s (the algorithms run on a single NVIDIA GTX 1080Ti graphics card).

We evaluate the performance of the three methods according to three criteria: reconstruction quality (compared to the ground-truth image), resolution and noise resistance. All the comparisons are done using 100 test images from the CIFAR10 dataset that were not used while training SspNet and are fed into the network for inference only.

#### 4.1 Reconstruction quality

We compare the quality of reconstructed images according to two commonly used image quality metrics: the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [16]. These metrics indicate the visual similarity (in real space) between a retrieved image and an original object image. PSNR is an absolute error metric which is proportional to the logarithm of the mean squared error (MSE). It is measured in dB, and higher PSNR usually indicates better quality. SSIM is a perception-based metric that considers image degradation as changes in structural information. For real and positive images, as in this work, the SSIM index has a decimal value between 0 and 1, where 1 means identical images and 0 means no structural similarity between the images.
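PSNR is straightforward to compute for images normalized to [0, 1] (a sketch below); SSIM is more involved and is typically taken from a library such as scikit-image's `structural_similarity`:

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher usually means better."""
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

# Example: a uniform image versus a copy at half its brightness
a = np.ones((32, 32))
print(round(psnr(a, 0.5 * a), 2))   # 6.02 dB
```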

As shown in Fig. 3(a), SspNet achieves a significantly higher mean PSNR value over 100 test samples (27.1dB) than ePIE (18.6dB) and sDR (20dB). SspNet also gives a higher mean SSIM index (0.9) compared to ePIE (0.58) and sDR (0.63), as shown in Fig. 3(b). Unlike the PSNR distributions, where all three methods have a similar standard deviation, the SSIM distribution of SspNet is much narrower than those of the iterative methods. Three examples (out of 100 test samples) are displayed in Fig. 3(c) to demonstrate the visual differences. Besides the better visual quality that matches the higher PSNR and SSIM values, the SspNet reconstructions contain much finer resolvable details. The resolution enhancement of SspNet is explored in the next section.

#### 4.2 Resolution

In SSP, all the probe beams illuminate the object and are then detected simultaneously, forming a grid of DPs. As mentioned in sections 1 and 2, in previous reconstruction algorithms, the detected intensity pattern is first divided into blocks and the set of DPs is treated as standard scanning-based ptychographic data [4,5]. This division leads to an effectively smaller numerical aperture (NA), thus limiting the resolution of the reconstructed images. Specifically, with $N \times N$ square blocks the smallest resolvable feature is $N$ times larger than when the same optical system is used for scanning-based ptychography. In our system, where $b=0.5$mm and $\lambda =520$nm, and according to Eq. (1), the spatial resolution limit is:

$$\frac {1}{2 \nu _{\mathit {cutoff}}} = \frac {\lambda f_{OL}}{b} = 52 \mu \textrm {m} \tag {2}$$

Higher frequencies spill into neighboring blocks and are counted as noise. In order to explore the resolution of the reconstructions and compare SspNet to the other algorithms, we plot in Fig. 4 the averaged spatial spectra (over the 100 test samples) of the ground-truth and reconstructed images, together with a single-image example. The results in Fig. 4 show that the spectra of the SspNet reconstructions closely follow the spectra of the ground-truth images up to $1.25 \nu _{\mathit {cutoff}}$, i.e. beyond the cutoff frequency, and are only slightly attenuated thereafter. The spatial spectra of the ePIE and sDR reconstructions are similar to each other; both are significantly attenuated beyond $0.75 \nu _{\mathit {cutoff}}$ (sDR slightly earlier). These cutoffs define the available resolution – $41.6 \mu$m for SspNet and $69.3 \mu$m for ePIE and sDR. Visually, as shown in Fig. 3(c), finer details can be observed in the SspNet reconstructions compared to the ePIE and sDR reconstructions.

#### 4.3 Noise resistance

Systematic noise might be added to optical systems through various sources such as stray light, reflections, high diffraction orders, etc. We block most of this noise in our system by placing an iris in front of the camera. A measurement recorded using this configuration, denoted low-noise (LN) data, is shown in Fig. 5(a). By removing the iris, we can explore the robustness of each algorithm when the recorded data is very noisy. A measurement recorded using this configuration, denoted high-noise (HN) data, is shown in Fig. 5(b). In this part we compare four reconstruction methods on HN data: SspNet trained using HN data (SspNet-HN), SspNet trained using LN data (the nominal model parameters used in sections 4.1 and 4.2), ePIE and sDR. Example reconstruction results are shown in Fig. 5(e).

As shown in Fig. 5, despite the higher noise level (Fig. 5(b)), the reconstruction quality of SspNet changes only slightly if it is trained using data with the same noise profile (HN data) – PSNR = 27.3dB and SSIM = 0.9. However, this noise level is too high for ePIE and sDR, whose PSNR and SSIM values are significantly lower ($\sim$12.5dB, 0.21). The poor reconstruction quality is visually demonstrated in Fig. 5(e). We expected this quality degradation of ePIE and sDR, since they impose the measured DPs and cannot separate the signal from the noise. Since the nominal SspNet was not trained on HN data, and this noise level significantly degrades the performance of ePIE and sDR, one might expect the reconstruction quality of the nominal SspNet to deteriorate as well. However, the reconstruction quality of the nominal SspNet (17.1dB, 0.61) is surprisingly higher than that of the iterative algorithms, although still inferior to SspNet-HN.

## 5. Conclusion

In summary, we demonstrated reconstructions of experimental single-shot ptychography (SSP) measurements using a deep neural network, denoted SspNet. SspNet was trained using experimental data only, with no prior information or model of the physical system, such as the shape of the probe beams, the number of probe positions or their exact locations on the object plane. Notably, this is different from most deep-learning based methods for phase retrieval in optical imaging [17] and diagnostics of ultrashort pulses [18,19], where the training is performed on numerical data or uses prior knowledge about the physics of the system. We compared SspNet with two iterative reconstruction algorithms for ptychographic data – ePIE and sDR. The comparison focused on three properties: reconstructed image quality, spatial resolution and resistance to noise. SspNet was superior in each of these three aspects. Remarkably, SspNet successfully reconstructs spatial frequencies that are not accessible to the other algorithms, and it deals better with noise even when not trained specifically on noisy data. In this work we explored the application of deep neural networks to SSP of real-valued images. An important next step would be extending the algorithm to complex-valued images.

## Funding

H2020 European Research Council (819440-TIMP).

## Disclosures

The authors declare no conflicts of interest.

## References

**1. **J. Rodenburg, “Ptychography and related diffractive imaging methods,” in *Advances in Imaging and Electron Physics*, vol. 150, P. W. Hawkes, ed. (Elsevier, 2008), pp. 87–184.

**2. **A. M. Maiden and J. M. Rodenburg, “An improved ptychographical phase retrieval algorithm for diffractive imaging,” Ultramicroscopy **109**(10), 1256–1262 (2009). [CrossRef]

**3. **Z. Guan and E. H. Tsai, “Ptychonet: Fast and high quality phase retrieval for ptychography,” Tech. rep., Brookhaven National Lab. (BNL), Upton, NY, United States (2019).

**4. **X. Pan, C. Liu, and J. Zhu, “Single shot ptychographical iterative engine based on multi-beam illumination,” Appl. Phys. Lett. **103**(17), 171105 (2013). [CrossRef]

**5. **P. Sidorenko and O. Cohen, “Single-shot ptychography,” Optica **3**(1), 9–14 (2016). [CrossRef]

**6. **X. He, C. Liu, and J. Zhu, “Single-shot Fourier ptychography based on diffractive beam splitting,” Opt. Lett. **43**(2), 214–217 (2018). [CrossRef]

**7. **G. Ilan Haham, O. Peleg, P. Sidorenko, and O. Cohen, “High-resolution (diffraction limited) single-shot multiplexed coded-aperture ptychography,” J. Opt. (to be published), https://doi.org/10.1088/2040-8986/ab7f23.

**8. **P. Sidorenko, O. Lahav, and O. Cohen, “Ptychographic ultrahigh-speed imaging,” Opt. Express **25**(10), 10997–11008 (2017). [CrossRef]

**9. **O. Wengrowicz, O. Peleg, B. Loevsky, B. K. Chen, G. I. Haham, U. S. Sainadh, and O. Cohen, “Experimental time-resolved imaging by multiplexed ptychography,” Opt. Express **27**(17), 24568–24577 (2019). [CrossRef]

**10. **M. Pham, A. Rana, J. Miao, and S. Osher, “Semi-implicit relaxed Douglas-Rachford algorithm (sDR) for ptychography,” Opt. Express **27**(22), 31246–31260 (2019). [CrossRef]

**11. **P. Thibault, M. Dierolf, A. Menzel, O. Bunk, C. David, and F. Pfeiffer, “High-Resolution Scanning X-ray Diffraction Microscopy,” Science **321**(5887), 379–382 (2008). [CrossRef]

**12. **B. K. Chen, P. Sidorenko, O. Lahav, O. Peleg, and O. Cohen, “Multiplexed single-shot ptychography,” Opt. Lett. **43**(21), 5379–5382 (2018). [CrossRef]

**13. **W. Xu, H. Xu, Y. Luo, T. Li, and Y. Shi, “Optical watermarking based on single-shot-ptychography encoding,” Opt. Express **24**(24), 27922–27936 (2016). [CrossRef]

**14. **A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” Tech. rep. (2009).

**15. **D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

**16. **A. Hore and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in 2010 20th International Conference on Pattern Recognition, (IEEE, 2010), pp. 2366–2369.

**17. **G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica **6**(8), 921–943 (2019). [CrossRef]

**18. **T. Zahavy, A. Dikopoltsev, D. Moss, G. I. Haham, O. Cohen, S. Mannor, and M. Segev, “Deep learning reconstruction of ultrashort pulses,” Optica **5**(5), 666–673 (2018). [CrossRef]

**19. **R. Ziv, A. Dikopoltsev, T. Zahavy, I. Rubinstein, P. Sidorenko, O. Cohen, and M. Segev, “Deep learning reconstruction of ultrashort pulses from 2D spatial intensity patterns recorded by an all-in-line system in a single-shot,” Opt. Express **28**(5), 7528–7538 (2020). [CrossRef]