Deep Learning in C#: Preprocessing the Coin Detection Dataset

3 Nov 2020 | CPOL | 3 min read
In this article, we preprocess a dataset so that it can be fed to a machine learning model.
Here we will preprocess a coin dataset for later training in a supervised learning model. Preprocessing a dataset in machine learning usually involves tasks such as the following:
  • Clean the data - Filling in the holes that missing or corrupted data leave by averaging the values of the surrounding data or using some other strategy.
  • Normalize the data - Scaling values into a standard range, usually 0 to 1. Inputs whose values span very different ranges can make training slow or unstable, so we bring everything into a common range.
  • One Hot Encode labels - Encoding the labels, or classes, of the objects in the dataset as binary N-dimensional vectors, where N is the total number of classes. The array elements are all set to 0, except for the element that corresponds to the class of the object, which is set to 1. This means that in each array there is a single element whose value is 1.
  • Divide the input dataset into a training set and a validation set - The training set is used for training the model. The validation set is used for checking how accurate the training was, by evaluating the trained model against a subset of the original dataset that was not used for training.
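
The first two tasks can be sketched in plain C#, without any library. This is only an illustration on a one-dimensional array of doubles: the class and method names (PreprocessingSketch, FillMissing, MinMaxScale) are hypothetical and are not part of the article's code, and averaging the nearest valid neighbours is just one of several possible cleaning strategies.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the article's code.
public static class PreprocessingSketch
{
    // Replaces each NaN with the mean of its nearest valid neighbours
    // (one to the left, one to the right, whichever exist).
    public static double[] FillMissing(double[] values)
    {
        var result = (double[])values.Clone();
        for (var i = 0; i < values.Length; i++)
        {
            if (!double.IsNaN(values[i])) continue;

            double sum = 0;
            var count = 0;
            // Nearest valid value to the left.
            for (var l = i - 1; l >= 0; l--)
                if (!double.IsNaN(values[l])) { sum += values[l]; count++; break; }
            // Nearest valid value to the right.
            for (var r = i + 1; r < values.Length; r++)
                if (!double.IsNaN(values[r])) { sum += values[r]; count++; break; }

            if (count == 0) throw new ArgumentException("No valid values to average.");
            result[i] = sum / count;
        }
        return result;
    }

    // Min-max scaling: maps the smallest value to 0 and the largest to 1.
    public static double[] MinMaxScale(double[] values)
    {
        var min = values.Min();
        var max = values.Max();
        if (max == min) return values.Select(_ => 0.0).ToArray(); // constant input
        return values.Select(v => (v - min) / (max - min)).ToArray();
    }
}
```

For instance, FillMissing applied to { 1.0, NaN, 3.0 } fills the gap with 2.0, and MinMaxScale applied to { 10, 20, 30 } gives { 0, 0.5, 1 }.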

In this example we will use Numpy.NET, which is basically the .NET version of the popular Numpy library in Python. Numpy is a library focused on working with multidimensional arrays and matrices.

To implement our dataset processor, we create the Utils and DataSet classes in a PreProcessing folder. The Utils class incorporates a static Normalize method shown here:

C#
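using Keras.PreProcessing.Image; // ImageUtil (Keras.NET; namespace assumed)
using Numpy;                     // NDarray (Numpy.NET)

public class Utils
{
    // Loads an image, resizes it, and scales its pixel values to the range 0 to 1.
    public static NDarray Normalize(string path)
    {
        var colorMode = Settings.Channels == 3 ? "rgb" : "grayscale";
        var img = ImageUtil.LoadImg(path, color_mode: colorMode, target_size: (Settings.ImgWidth, Settings.ImgHeight));
        return ImageUtil.ImageToArray(img) / 255;
    }
}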

In this method, we load an image with a given color mode (RGB or grayscale) and resize it to a given width and height. Then we return the matrix that contains the image with each element divided by 255. Each channel of a pixel in an 8-bit image holds a value between 0 and 255, so dividing by 255 brings every element into the range 0 to 1, inclusive.
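
The arithmetic behind that division can be shown on a plain byte array. This is a minimal sketch that does not use Numpy.NET; the PixelScaling class and ToUnitRange method are hypothetical names introduced only for illustration.

```csharp
using System;

// Hypothetical helper, not part of the article's code.
public static class PixelScaling
{
    // Converts 8-bit channel values (0..255) to floats in the range 0..1.
    public static float[] ToUnitRange(byte[] pixels)
    {
        var scaled = new float[pixels.Length];
        for (var i = 0; i < pixels.Length; i++)
        {
            // Divide by 255f, not 255: integer division would truncate
            // every value below 255 to 0.
            scaled[i] = pixels[i] / 255f;
        }
        return scaled;
    }
}
```

With raw bytes the float literal matters, because an integer division would discard the fractional part. Calling ToUnitRange on { 0, 51, 255 } yields approximately { 0, 0.2, 1 }.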

The code also relies on a Settings class, which holds constants for parameters used across the application, such as the image width, height, and number of channels. The other class, DataSet, represents the dataset we are going to use to train the machine learning model. It has the following fields and properties:

  • _pathToFolder - The path to the folder containing the images.
  • _extList - The list of file extensions to consider.
  • _labels - The labels, or classes, of the images in _pathToFolder.
  • _objs - The images themselves, represented as a Numpy.NDarray.
  • _validationSplit - The fraction of the total number of images (for example, 0.2) that is set aside as the validation set. The remaining images form the training set.
  • NumberClasses - The total number of unique classes in the dataset.
  • TrainX - The training data, represented as a Numpy.NDarray.
  • TrainY - The training labels, represented as a Numpy.NDarray.
  • ValidationX - The validation data, represented as a Numpy.NDarray.
  • ValidationY - The validation labels, represented as a Numpy.NDarray.

Here’s the DataSet class:

C#
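using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Keras.Utils; // Util.ToCategorical (Keras.NET; namespace assumed)
using Numpy;       // NDarray, np (Numpy.NET)

public class DataSet
    {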
        private string _pathToFolder;
        private string[] _extList;
        private List<int> _labels;
        private List<NDarray> _objs;
        private double _validationSplit;
        public int NumberClasses { get; set; }
        public NDarray TrainX { get; set; }
        public NDarray ValidationX { get; set; }
        public NDarray TrainY { get; set; }
        public NDarray ValidationY { get; set; }

        public DataSet(string pathToFolder, string[] extList, int numberClasses, double validationSplit)
        {
            _pathToFolder = pathToFolder;
            _extList = extList;
            NumberClasses = numberClasses;
            _labels = new List<int>();
            _objs = new List<NDarray>();
            _validationSplit = validationSplit;
        }

        public void LoadDataSet()
        {
            // Process the list of files found in the directory.
            string[] fileEntries = Directory.GetFiles(_pathToFolder);
            foreach (string fileName in fileEntries)
                if (IsRequiredExtFile(fileName))
                    ProcessFile(fileName);

            MapToClassRange();
            GetTrainValidationData();
        }

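        private bool IsRequiredExtFile(string fileName)
        {
            foreach (var ext in _extList)
            {
                // Accept both "png" and ".png", and match only at the end of the file name.
                var suffix = "." + ext.TrimStart('.');
                if (fileName.EndsWith(suffix, System.StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }

            return false;
        }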

        private void MapToClassRange()
        {
            HashSet<int> uniqueLabels = _labels.ToHashSet();
            var uniqueLabelList = uniqueLabels.ToList();
            uniqueLabelList.Sort();

            _labels = _labels.Select(x => uniqueLabelList.IndexOf(x)).ToList();
        }

        private NDarray OneHotEncoding(List<int> labels)
        {
            var npLabels = np.array(labels.ToArray()).reshape(-1);
            return Util.ToCategorical(npLabels, num_classes: NumberClasses);
        }

        private void ProcessFile(string path)
        {
            _objs.Add(Utils.Normalize(path));
            ProcessLabel(Path.GetFileName(path));
        }

        private void ProcessLabel(string filename)
        {
            _labels.Add(int.Parse(ExtractClassFromFileName(filename)));
        }

        private string ExtractClassFromFileName(string filename)
        {
            return filename.Split('_')[0].Replace("class", "");
        }

        private void GetTrainValidationData()
        {
            var listIndices = Enumerable.Range(0, _labels.Count).ToList();
            var toValidate = _objs.Count * _validationSplit;
            var random = new Random();
            var xValResult = new List<NDarray>();
            var yValResult = new List<int>();
            var xTrainResult = new List<NDarray>();
            var yTrainResult = new List<int>();

            // Split validation data
            for (var i = 0; i < toValidate; i++)
            {
                var randomIndex = random.Next(0, listIndices.Count);
                var indexVal = listIndices[randomIndex];
                xValResult.Add(_objs[indexVal]);
                yValResult.Add(_labels[indexVal]);
                listIndices.RemoveAt(randomIndex);
            }

            // Split rest (training data)
            listIndices.ForEach(indexVal => 
            { 
                xTrainResult.Add(_objs[indexVal]);
                yTrainResult.Add(_labels[indexVal]);
            });

            TrainY = OneHotEncoding(yTrainResult);
            ValidationY = OneHotEncoding(yValResult);
            TrainX = np.array(xTrainResult);
            ValidationX = np.array(xValResult);
        }
    }

Here is an explanation of each method:

  • LoadDataSet() - The main method of the class that we call to load a dataset in _pathToFolder. It calls the other methods listed below to do this.
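  • IsRequiredExtFile(filename) - Checks whether a given file has one of the extensions (listed in _extList) that should be processed for this dataset.
  • MapToClassRange() - Builds the sorted list of unique labels in the dataset and replaces each label with its index in that list, so that the labels run from 0 to the number of unique classes minus 1.
  • OneHotEncoding(labels) - Converts a list of integer labels into one-hot encoded vectors with NumberClasses elements each.
  • ProcessFile(path) - Uses the Utils.Normalize method to normalize an image, adds the result to _objs, and calls the ProcessLabel method.
  • ProcessLabel(filename) - Parses the result of the ExtractClassFromFileName method as an integer and adds it to _labels.
  • ExtractClassFromFileName(filename) - Extracts the class from the filename of the image: it takes the text before the first underscore and removes the "class" prefix.
  • GetTrainValidationData() - Randomly divides the dataset into training and validation sub-datasets and one-hot encodes their labels.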
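
The label handling described above (extract, remap, one-hot encode) can be traced end to end in plain C#. This is a sketch under assumptions: the LabelSketch class is hypothetical, the file name "class12_0001.png" is a made-up example of the naming pattern the code expects, and OneHot is a hand-written stand-in that produces the kind of matrix a one-hot encoder returns, not the library call used in the DataSet class.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the article's code.
public static class LabelSketch
{
    // "coins/class12_0001.png" -> 12 (the file name is a made-up example).
    public static int ExtractClass(string path)
    {
        var name = Path.GetFileName(path);
        return int.Parse(name.Split('_')[0].Replace("class", ""));
    }

    // Maps arbitrary class ids to 0..K-1 by their rank among the sorted unique ids.
    public static List<int> MapToClassRange(IEnumerable<int> labels)
    {
        var list = labels.ToList();
        var unique = list.Distinct().OrderBy(x => x).ToList();
        var indexOf = unique
            .Select((label, index) => (label, index))
            .ToDictionary(p => p.label, p => p.index);
        return list.Select(l => indexOf[l]).ToList();
    }

    // One row per label; every element is 0 except the one at the label's index.
    public static float[,] OneHot(IReadOnlyList<int> labels, int numClasses)
    {
        var encoded = new float[labels.Count, numClasses]; // zero-initialised
        for (var row = 0; row < labels.Count; row++)
        {
            if (labels[row] < 0 || labels[row] >= numClasses)
                throw new ArgumentOutOfRangeException(nameof(labels));
            encoded[row, labels[row]] = 1f;
        }
        return encoded;
    }
}
```

The remapping step is what makes the encoding safe: raw ids such as { 30, 5, 30, 12 } become { 2, 0, 2, 1 }, which are valid column indices for a matrix with three classes. The dictionary lookup is a design choice for this sketch; it avoids the repeated linear IndexOf search over the list of unique labels.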

In this series we will be using the coin image dataset at https://cvl.tuwien.ac.at/research/cvl-databases/coin-image-dataset/.

To load the dataset we can include the following in the main class of our console application:

C#
var numberClasses = 60;
var fileExt = new string[] { "png" }; // no leading dot: IsRequiredExtFile prepends "." itself
var dataSetFilePath = @"C:/Users/arnal/Downloads/coin_dataset";
var dataSet = new PreProcessing.DataSet(dataSetFilePath, fileExt, numberClasses, 0.2);
dataSet.LoadDataSet();
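
The last argument, 0.2, is the validation split. The sketch below shows how such a fraction turns into two disjoint sets of image indices. It is an alternative formulation rather than the DataSet code itself: it uses a seeded Fisher-Yates shuffle instead of repeatedly removing random entries from a list, so that the split is reproducible between runs. The SplitSketch class, the Split method, and the seed parameter are all hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the article's code.
public static class SplitSketch
{
    // Returns disjoint index lists for the training and validation sets.
    public static (List<int> Train, List<int> Validation) Split(
        int count, double validationSplit, int seed)
    {
        if (validationSplit < 0 || validationSplit > 1)
            throw new ArgumentOutOfRangeException(nameof(validationSplit));

        var indices = Enumerable.Range(0, count).ToArray();
        var random = new Random(seed);

        // Fisher-Yates shuffle: every permutation is equally likely.
        for (var i = indices.Length - 1; i > 0; i--)
        {
            var j = random.Next(i + 1); // 0 <= j <= i
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        // Round up, matching a loop that runs while i < count * validationSplit.
        var validationCount = (int)Math.Ceiling(count * validationSplit);
        return (indices.Skip(validationCount).ToList(),
                indices.Take(validationCount).ToList());
    }
}
```

With 10 images and a split of 0.2, Split returns 2 validation indices and 8 training indices, and no index appears in both lists.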

Our data can now be fed into a machine learning model. The next article will go over the basics of supervised machine learning and what the training and verification phases consist of. It is intended for readers with little to no experience with AI.

This article is part of the series 'Deep Learning in C#'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Serbia
Computer Scientist and book author living in Belgrade and working for a German IT company. Author of Practical Artificial Intelligence: Machine Learning, Bots, and Agent Solutions Using C# (Apress, 2018) and PrestaShop Recipes (Apress, 2017). Lover of jazz and cinema.
