Image recognition and classification deep learning models: A journey, problems, and the ways ahead
Transactions on Computing Science

Keywords

deep learning; CNNs; image recognition; image classification; computer vision; AI; model architecture; adversarial robustness; ethical AI; vision transformers; self-supervised learning; model efficiency

Abstract

The proliferation of digital imaging technology and the resulting abundance of visual data have made image recognition and classification one of the most active areas of artificial intelligence research, development, and application. Deep learning, and convolutional neural networks (CNNs) in particular, transformed the field, reaching and in some tasks surpassing human-level performance in object recognition, face identification, medical image interpretation, and driver assistance for autonomous vehicles. This paper presents a comprehensive, multi-faceted analysis of deep learning models for visual understanding. It traces the field's evolution from early neural networks through modern CNN architectures to vision transformers, and examines core architectural principles, advances in training methodology, and the demands of rigorous performance evaluation. The paper also investigates persistent, long-standing challenges: the continual demand for labeled data, the substantial computational and environmental costs of training, the "black box" nature of model decisions, vulnerability to deliberate adversarial attacks, and ethical concerns surrounding bias, fairness, and data privacy. Case studies in diagnostic imaging for healthcare and visual search for e-commerce illustrate how deployed models can be both transformative and tightly constrained. The paper concludes with a research roadmap spanning data-efficient learning, including few-shot and self-supervised approaches, automated neural architecture search (NAS), explainable AI (XAI), energy-efficient and sustainable "green AI" models, and sound ethical governance. The discussion extends to the emerging area of multimodal foundation models that combine vision, language, and other sensory modalities, an area that opens new possibilities while introducing new problems. Taken together, these findings suggest that sustained, full-spectrum innovation in novel algorithms, large-scale yet sustainable infrastructure, and principled ethics is needed to overcome current limitations and realize the transformative potential of deep learning across society.
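To make the architectural progression the abstract describes concrete, the following minimal sketch shows the convolution, pooling, and classification-head pattern that underlies the CNN architectures the survey traces. It is not taken from the paper; the use of PyTorch, the SmallCNN name, and all layer sizes are illustrative assumptions.

# A minimal sketch (not from the paper): a small CNN image classifier in PyTorch.
# Layer sizes and input resolution (32x32, e.g. CIFAR-10-sized images) are illustrative.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Two convolutional blocks extract local visual features.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
        )
        # A linear head maps the flattened feature map to class logits.
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

if __name__ == "__main__":
    model = SmallCNN(num_classes=10)
    dummy_batch = torch.randn(4, 3, 32, 32)   # batch of 4 RGB images
    logits = model(dummy_batch)
    print(logits.shape)                        # torch.Size([4, 10])

In the vision transformer designs the survey also covers, the convolutional feature extractor is replaced by patch embeddings processed with self-attention, while a classification head plays the same role as above.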

https://doi.org/10.63808/tcs.v2i1.291


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2026 Yang Liu