Machine Learning models are now widely used in many domains, including health. This tutorial shows how to build and train a heart attack predictor from a clinical dataset without writing any code. Along the way, you will explore what SmartPredict can do on a real use case.
This article was originally published by Smartpredict.ai.
Connect to the Platform
First, let’s connect to the platform at cloud.smartpredict.ai. If this is your first time logging in, you have to create an account. You will then be provided with a personal workspace in which you can view all your projects. You can download the dataset along with a request example for this tutorial from this repository.
On the homepage of SmartPredict, we have a list of prebuilt use cases and some sample projects that we have built based on a famous public dataset from the data science community.
In the My projects tab, you can see the list of all the projects you have created on the platform. There are also built-in project templates for creating a new project. You can choose one of these templates or build a new empty project from scratch.
Let’s create our new project. In the project creation pop-up, choose Empty Project as the Project Template, since we will build our model from scratch. Then choose a title for the project and, if you want, add a description. In the Mode section, choose Manualflow. Autoflow mode builds a model using AutoML; for now, AutoML models on SmartPredict are available for forecasting problems. That’s all: you can now validate the configuration of your new project.
Now, you are redirected to the Build Space of your project. Here is the space where you will perform data analysis, data preprocessing, and model training to build your Machine Learning Pipeline.
Now, let’s import our dataset. From the Datasets menu, choose to create a new dataset. Then, select Tabular as we will use a CSV source file.
You will be redirected to the dataset creation page. First, you should select a source from where the dataset will be uploaded. SmartPredict can upload datasets from various sources. For now, choose a Local file as the dataset source and upload the dataset from the “heart.csv” file that you have downloaded from the repository. When you finish choosing the file, the Encoding and the Separator in the CSV file should be defined. Finally, confirm your upload by Creating the Source.
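If you are wondering what the Encoding and Separator settings correspond to, here is a minimal sketch of the same idea in pandas. It is only an illustration, not what SmartPredict runs internally, and the tiny CSV content (column names and values) is made up so that the example is self-contained.

```python
import io

import pandas as pd

# Made-up stand-in for the contents of "heart.csv" (hypothetical columns and values).
csv_bytes = "age,sex,chol,output\n63,1,233,1\n37,1,250,1\n57,0,354,0\n".encode("utf-8")

# The encoding tells the reader how to decode the bytes,
# and the separator tells it how columns are delimited.
df = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8", sep=",")

print(df.head())
```

With a wrong separator (for example ";" on a comma-separated file), every row would end up in a single column, which is why these two settings are worth checking before creating the source.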
Wait a moment while SmartPredict uploads your dataset and creates its metadata. Then, before confirming the creation of the dataset, you can change the name under which it is displayed in the dataset list. You can also choose to keep only some columns if you don’t want the entire dataset. Finally, confirm your upload by clicking on Create A New Dataset.
Now, your dataset has been successfully created. You can add it to your workspace, explore it, or use it in an Autoflow project if suitable. For now, let’s return to the Build Space and begin to construct the build flowchart.
Creating the Build Flowchart
Building and deploying an ML solution with SmartPredict is based on flowcharts. These flowcharts describe the entire pipeline that makes the ML model operational. As mentioned earlier, the build flowchart (which is created in the Build Space) is where you perform data analysis and develop, train, and save your models. So let’s construct our build flowchart now.
Perform Data Analysis
The first step in a data science project is data analysis. SmartPredict has a module that we can use to perform this task. In the Build Space, take the dataset you just uploaded from the dataset tab and place it in the interface. Then, search for a module named Data Visualizer, take it, and link its input to the output of the dataset module. At this point, you can launch the project (by clicking on the Play button on the left) to generate some visualizations of the dataset. When you run a build or a deployment process on SmartPredict, you have to choose a Ressources Template (the options depend on your subscription type), and during the run you can monitor the pipeline by viewing the logs.
When the run is successful, you can visualize your dataset by opening the visualizer module. The Processing tab shows a sample of the dataset. The Profiling tab provides a lot of useful information about your dataset, including:
- An overview of the dataset: general statistics and counts of variable types
- Variable descriptions and distributions
- Interactive scatter plots that show interactions between variables
- A correlation heat map that can be calculated using different correlation measures
- The number of missing values in each column
- Further samples of the dataset (first rows, last rows, duplicate rows)
The Visualization tab is not available yet but it is coming soon.
Once you have sufficient insight into your dataset, you can delete or disable the Data Visualizer module so that the next runs take less time.
The next step is preprocessing. On SmartPredict, many data processing operations can be performed using the Dataset Processor module under the Data Preprocessing category of Core Modules.
One Hot Encode
The data visualization step showed that the dataset contains some categorical variables, which must be encoded before they can be used to train the ML model. For that, let’s select the One Hot Encoder processor in the Dataset Processor module. Then, choose the columns to encode; for this dataset, we should encode 7 variables. Under the Handle unknown parameter, choose ignore so that a value that was not seen during encoding does not raise an error in the deploy space. Save your parameters and close the settings pop-up.
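To see what the Handle unknown setting changes, here is a small sketch using scikit-learn's OneHotEncoder. This is an analogy for the no-code step, not SmartPredict's own implementation, and the column name and category values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column seen at build time.
train = pd.DataFrame({"chest_pain": ["typical", "atypical", "typical"]})
# A category that was never seen at build time, as could happen at deploy time.
new = pd.DataFrame({"chest_pain": ["asymptomatic"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded_train = encoder.transform(train).toarray()
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_)
print(encoded_train)
print(encoded_new)
```

The unseen value becomes a row of zeros, so a request containing an unexpected category can still be processed rather than failing.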
Scaling the dataset is an important step when building an ML model. Here let’s use a Min-Max Scaler to scale numerical variables. For that, use a Dataset Processor again and choose Normalize as the processor. Keep the default value (MinMaxScaler) in the Normalize with parameter. Then, at Columns to process selection, choose columns to normalize. In this dataset, we have 6 numerical variables. Save your settings and close the pop-up.
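For readers who want to see what min-max scaling does to the numbers, here is a short sketch with scikit-learn's MinMaxScaler. The two columns and their values are invented for the illustration; each value is mapped to (x - min) / (max - min), so every column ends up in the range [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented numerical columns for illustration only.
df = pd.DataFrame({"age": [40, 50, 60], "chol": [200, 300, 250]})

# Each column is rescaled independently to the [0, 1] range.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```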
Now, you should save these two preprocessing steps by using an Item saver module. In the module parameters, choose a name for the saved item, choose Processing pipeline as the Item type (since you are saving preprocessing steps), and choose Overwrite in case an item with the same name already exists.
Separating Features and Target
Now, you should separate the features from the target. This is necessary for the next step. Starting from the Normalize module output, drop the target column to get the features, and keep only that column to get the target. For that, use two Dataset Processor modules again: one to drop the output column and the other to keep only the output column. To transform the target variable into a one-dimensional array, use a Features selector module and put the output variable in the Selected Target parameter. At this point, you should have a flowchart like this one:
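As a rough code equivalent of this separation, here is a pandas sketch. Only the target column name, output, comes from the tutorial; the feature columns and values are made up.

```python
import pandas as pd

# Made-up preprocessed data; "output" is the target column.
df = pd.DataFrame(
    {"age": [0.0, 0.5, 1.0], "sex_1": [1.0, 0.0, 1.0], "output": [1, 0, 1]}
)

# Drop the target column to get the features (still a data frame).
features = df.drop(columns=["output"])

# Keep only the target column (a one-column data frame)...
target = df[["output"]]

# ...then turn it into a one-dimensional array of labels.
labels = target["output"].to_numpy()

print(features.shape, target.shape, labels.shape)
```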
Splitting Train and Test Dataset
The next step is to split the dataset into train and test parts. For that, use a Data splitter module and choose Train/Test Split as the Nature of the Split. You can change the test ratio (the percentage of the dataset to use as a test dataset). You can also use a stratified shuffle split or change the Random state.
When linking this module to the last modules, the first input (Features) should be connected to the Delete column output and the second input (Labels) should be connected to the Features selector second output.
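The options of the Data splitter map closely onto scikit-learn's train_test_split, so here is a minimal sketch of a stratified split on synthetic data. The 20% test ratio and the random state of 42 are arbitrary choices for the example, not values prescribed by the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, perfectly balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size is the test ratio; stratify=y keeps the class proportions
# identical in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(sorted(y_test.tolist()))
```

Stratification matters most when one class is much rarer than the other, since a purely random split could leave the test part with very few positive cases.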
Now, it’s time to train our model. The module that has been developed for this purpose is the Model trainer module. Choose an ML model from the Machine Learning Algorithms categories and link this module to the first input of the Model Trainer module. At the second and third inputs respectively, you should link the training features and training labels from the Train/Test Split module. The output of the Model Trainer module will be a trained ML model. In the picture below, I’ve chosen the K Nearest Neighbors classifier as an ML model. Finally, save the trained model as a Trained model by using an Item saver module.
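To give an idea of what the Model Trainer does with a K Nearest Neighbors classifier, here is a sketch using scikit-learn's KNeighborsClassifier on a toy dataset. The data points and the choice of 3 neighbors are assumptions for the example; the tutorial does not state which settings SmartPredict uses.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: two well-separated groups of points.
X_train = np.array(
    [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 0.9]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])

# Each prediction is a majority vote among the 3 closest training points.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

predictions = model.predict(np.array([[0.05, 0.05], [0.95, 0.95]]))
print(predictions)
```

Because KNN relies on distances between points, it benefits from the scaling step performed earlier: without it, columns with large ranges would dominate the distance.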
It’s time to evaluate our model. SmartPredict has a Model Evaluator module in which you can use different metrics according to the problem type. You can also use cross-validation if you want and choose the number of folds (K). Here, I’ve used Accuracy and F1 score as metrics, but you can experiment with other metrics as you want. Finally, you can use a Data Logger module to log the results of the evaluation.
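Here is how the two metrics used here are computed, sketched with scikit-learn on a handful of made-up labels and predictions. The sketch covers only accuracy and F1 score, not cross-validation.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy: share of correct predictions (5 out of 6 here).
accuracy = accuracy_score(y_true, y_pred)

# F1 score: harmonic mean of precision (3/3) and recall (3/4) on the positive class.
f1 = f1_score(y_true, y_pred)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

In a medical setting, a missed positive case is usually costlier than a false alarm, so it is worth looking at recall-sensitive metrics such as F1 in addition to accuracy.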
Running The Whole Process
Now, you can run the whole process and view logs while running. When the process succeeds, you can view the performance of your model in the logs.
Creating the Deploy Flowchart
When the build process is finished, create a deploy flowchart by translating this first one.
You will be provided with a deploy flowchart similar to this one:
Remove the Features Selector module from the flowchart as we have used the features as a data frame, not as a NumPy array. Then, deploy the flowchart by clicking on the deploy button. For now, confirm right after your deployment without entering any additional settings.
Then, you have to choose the deployment mode:
- The SERVER MODE is used when you want the server to be up every time whether a request is processed or not. This deployment type is faster when treating a request but it’s more expensive as your server is always up.
- On the other hand, the SERVERLESS MODE is used when you want the server to be up only when a request is processed. This deployment type is slower than the first one as the time to set the server up is added to the time of processing the request. However, this deployment type is cheaper than the first one.
So the choice depends on you. For this tutorial, let’s use the SERVERLESS MODE. Once again, you will be asked to choose a Ressources Template according to your subscription type. Afterward, while the web service is launching you can go directly to the MONITOR space.
On the MONITOR space, you can view the deployment type you have chosen. The URL to access the web services is also provided with a default access token. Another feature (that I like very much 🙂 ) of SmartPredict is that you are also provided with code snippets in different languages that can be used when you want to integrate the API into your development project.
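The snippets shown in the MONITOR space are the reference for calling your web service. Purely to illustrate the general shape of such a call, here is a sketch that prepares an HTTP POST request with Python's standard library. Everything specific in it is an assumption: the URL is a placeholder, the token is a dummy value, the bearer-style Authorization header is a guess, and the payload fields are invented. Use the URL, token, and request format given by the platform instead.

```python
import json
import urllib.request

# Placeholders: replace with the URL and access token shown in the MONITOR space.
API_URL = "https://example.invalid/your-web-service"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"

# Invented payload; the real format is the one in "request.json".
payload = {"age": 63, "sex": 1}

# Build (but do not send) a JSON POST request.
request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Assumed header scheme; check the platform's own snippet.
        "Authorization": f"Bearer {ACCESS_TOKEN}",
    },
    method="POST",
)

print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```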
Sending a Request
The final step is to test our pipeline by sending a request. For that, go to the PREDICT space, create a new request there, and choose a name for it. On SmartPredict, you can send requests in several ways:
- you can use the TABLE when you want to send multiple requests at one time
- you can use a JSON if your input is formatted as a JSON dictionary
- you can use a FILE if your inputs are located in a file from your computer
- finally, you can use a DATASET if your inputs are loaded on SmartPredict as a dataset
For now, let’s use the JSON option. I have already prepared a request example for you in the “request.json” file from the repository.
Copy and paste this request into the JSON tab and launch your request. If successful, you can view the result of your request as a table in the OUTPUT section. A confidence score is also provided here.
You can also view the result in the JSON tab. The JSON response comes with additional information such as the data types and data qualities.
In this tutorial, we have shown you how to implement a basic data science project on the SmartPredict platform, covering every step from creating the project to sending a request. It shows how SmartPredict can be used as a basic data science tool. In other tutorials, our team will show you more features and interesting projects to give you more ideas on what you can do with this fantastic platform.