Microsoft.ML is the NuGet package for ML.NET, Microsoft’s open-source machine learning framework.
In this introduction I will create a stopword engine that removes unwanted words from a sentence. I know that it is overkill to use machine learning for this, but it serves as a great introduction to how to initialize and call Microsoft.ML.
STEP 1: THE NUGET PACKAGE
You need the following NuGet package: Microsoft.ML
STEP 2: CREATE A LIST OF STOPWORDS
We need a list of unwanted words to remove from the sentence:
public class StopWords
{
    internal static readonly string[] Custom =
    {
        "profanity",
        "swearing",
        "degrading"
    };
}
STEP 3: CREATE THE TEXTPROCESSING SERVICE
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public class TextProcessingService
{
    // The PredictionEngine is part of the Microsoft.ML package
    private readonly PredictionEngine<InputData, OutputData> _stopWordEngine;

    // The PredictionEngine receives an array of words
    private class InputData
    {
        public string[] Words { get; set; }
    }

    // The PredictionEngine returns an array of words
    private class OutputData : InputData
    {
        public string[] WordsWithoutStopWords { get; set; }
    }

    public TextProcessingService()
    {
        var context = new MLContext();

        // Get the list of words to remove from our sentence
        var stopWords = StopWords.Custom.ToArray();

        // Define the transformation: first remove the built-in English stop words,
        // then remove our custom stop words from what is left
        var transformerChain = context.Transforms.Text
            .RemoveDefaultStopWords(
                inputColumnName: "Words",
                outputColumnName: "WordsWithoutDefaultStopWords",
                language: StopWordsRemovingEstimator.Language.English)
            .Append(context.Transforms.Text.RemoveStopWords(
                inputColumnName: "WordsWithoutDefaultStopWords",
                outputColumnName: "WordsWithoutStopWords",
                stopwords: stopWords));

        // These transforms need no training data, so an empty data view is enough to fit them
        var emptySamples = new List<InputData>();
        var emptyDataView = context.Data.LoadFromEnumerable(emptySamples);
        var textTransformer = transformerChain.Fit(emptyDataView);

        _stopWordEngine = context.Model.CreatePredictionEngine<InputData, OutputData>(textTransformer);
    }

    public string[] ExtractWords(string text)
    {
        // Split the text on spaces and run the words through the stop word engine
        return _stopWordEngine.Predict(new InputData { Words = text.Split(' ') }).WordsWithoutStopWords;
    }
}
USAGE:
using System;

public class Program
{
    public static void Main()
    {
        var textProcessing = new TextProcessingService();
        var words = textProcessing.ExtractWords("my code removes swearing and degrading language");
        // Join the remaining words back into a sentence
        Console.WriteLine(string.Join(" ", words));
    }
}
The code above will generate the following output:
- code removes language
But why does it do that? The input string is “my code removes swearing and degrading language” and I have only defined “swearing” and “degrading” as words that need to be removed.
The answer lies in the RemoveDefaultStopWords call in the TextProcessingService constructor. It creates a StopWordsRemovingEstimator with the language set to English, and that Microsoft class comes pre-loaded with a number of stopwords, among them “my” and “and”. Those default words are removed first; the RemoveStopWords step then removes my custom words from what is left. In effect, my list of words just adds to the default list.
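To make the two-step behaviour easier to see, here is a minimal sketch in plain C# that mimics the chain without ML.NET. It is not how the library is implemented; it only shows two filters applied one after the other. The class name TwoStageStopWordSketch is hypothetical, the default list contains only the two words mentioned above (the real built-in English list is much longer), and the case-insensitive comparison is an assumption made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoStageStopWordSketch
{
    // Illustrative subset only: stands in for the built-in English stop words
    private static readonly HashSet<string> DefaultEnglish =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "my", "and" };

    // The custom list from the StopWords class in step 2
    private static readonly HashSet<string> Custom =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "profanity", "swearing", "degrading" };

    public static string[] Filter(string text)
    {
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Stage 1: drop the default stop words (the RemoveDefaultStopWords step)
        var withoutDefaults = words.Where(w => !DefaultEnglish.Contains(w));

        // Stage 2: drop the custom stop words from what is left (the RemoveStopWords step)
        return withoutDefaults.Where(w => !Custom.Contains(w)).ToArray();
    }

    public static void Main()
    {
        var result = Filter("my code removes swearing and degrading language");
        Console.WriteLine(string.Join(" ", result)); // prints "code removes language"
    }
}
```

If you only want your own words removed, the analogous change in the ML.NET pipeline would presumably be to leave out the RemoveDefaultStopWords step and let RemoveStopWords read directly from the "Words" column. I have not run that variant here.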
That’s it. Happy coding.