In this study, we evaluate the snow and seasonal water supply prediction skill of a retrospective version of the NWM system, driven by downscaled reanalysis atmospheric forcings (NLDAS-2) but otherwise mirroring the operational configuration. We first compare NWM snow water equivalent (SWE) estimates against in situ (SNOTEL) and analysis (SNODAS) products for the 2010-2017 water years across the western U.S. to characterize baseline snow model performance. Based on this benchmark, we focus on the Upper Colorado River basin, which is both a critical water supply source and a region where snow model skill varies widely. Because both SNOTEL and SNODAS suffer from representativeness errors, in this test region we supplement the SWE evaluation with a new, spatially and temporally consistent remote sensing product. Specifically, we compare NWM fractional snow cover and albedo against an experimental, gap-filled snow product suite derived from MODIS (Gap-Filled MODSCAG+MODDRFS) that corrects for viewing angle, cloud cover, and vegetation effects and explicitly accounts for dust deposition. In select sub-basins within the Upper Colorado test region, we calibrate the NWM snow depletion curve and snow albedo decay function parameters to match the experimental MODIS-based product, then assess the resulting changes in seasonal streamflow prediction skill (volume and timing) at USGS stream gages. Results from the snow model benchmark assessment provide baseline error characteristics useful for interpreting NWM snow estimates and forecasts. Results from the watershed-scale experiments indicate the feasibility of targeted snow model calibration within the NWM, as well as the potential benefits of real-time assimilation of remotely sensed snow data in the Upper Colorado region, a high priority for future NWM development.