Recognising anomalous patterns in incoming data with machine learning to improve present-day smart home security
An Intrusion Detection System (IDS) is a listen-only defence system used to detect malware and other threats in incoming network traffic. It can be host-based or network-based (HIDS/NIDS) depending on its location within the network. Existing IDS frameworks can be partitioned into two categories according to their detection approach: signature detection and anomaly detection. Signature-based systems attempt to match computer activity against stored signatures of known adversaries or attacks. Anomaly-based IDSs, by contrast, learn the normal behaviour of hosts on the network over time and raise an alert when they identify activity that strays from it.

Intrusion detection technology is not new to industry: it is already used to protect computer networks, large-scale industries, ad-hoc networks and more from compromise. The smart home, however, is a newer idea still in its nascent stage, and companies are focused more on developing new smart devices to connect than on the protocols needed to keep them secure. It is important to understand that our homes are the single biggest store of personal information, and anyone gaining access to it can do serious harm. Moreover, the home environment is a mixture of static and continually changing activities every day, and in such a setting a purely rule-based detection system cannot be expected to provide complete insulation from adversaries. An anomaly-based IDS is more useful here, since it can detect adverse behaviour that was not foreseen in advance. This project aims to develop an IDS that requires no prior, user-defined rules or baseline to identify suspicious traffic.
Since the signature-based approach is archaic and insufficient to fully insulate the smart home network, this paper discusses an Intrusion Detection System based on machine learning and pattern recognition techniques, which not only reports malware intrusions to the user but also identifies anomalies and alerts the owner to adversaries in real time. The whole idea is implemented on a virtual interface to avoid the complications of working with actual devices and the technicalities incurred while running the algorithms. The simulation conveys exactly the idea on which this security system is based and provides a clear understanding of the whole process. The platforms used are IBM Watson and Microsoft Azure Machine Learning Studio.

I. Methodology

This project is carried out in three parts; each part is described in the subsequent sections. A number of sensors are deployed in the home network. These sensors are automated devices with the capability to connect to the internet and communicate amongst themselves with minimal human intervention. In a complex smart home network there can be any number of such smart devices, which can be classified into groups according to their common dependency factors and functional parameters: for example, temperature, humidity and pressure sensors can be grouped into one class, Class A; automated lights and room heating and cooling into another; and so on. The interconnection of all these devices constitutes a smart home environment. A home atmosphere can be as heterogeneous as we can imagine because of the vast number of activities taking place in it, so it is practically infeasible to enumerate every adverse behaviour in advance and encode it into the security system. Not only would this be code-intensive, it would be almost impossible to make such an IDS completely insulated against adversaries.
This is where machine learning comes in: we can simply feed the data into the machine and let the system find the normal and anomalous patterns for us. As discussed earlier, IoT devices generate an enormous amount of data, and this data is used to train the IDS and later to test it. A computer can be trained in two broad ways: supervised and unsupervised learning. In the former, the computer is told what the data represents and what its parameters are, whereas in the latter, unlabelled data is fed into the system, which then tries to discover the patterns in it. We have used unsupervised learning in this project.

The data running between the devices is not necessarily in a state where it can be fed to the system and processed. Immense amounts of data are rarely perfect; records may have incomplete fields, for example. The most important preparation steps are to convert the data into an acceptable file format, clean it, and then explore it. Once the data is collected, cleaned and in an accessible format, we need to identify the useful fields that will actually be processed. In our case, we need only the sensor name, the reading it gives (the temperature in degrees Celsius, whether the light is on or off) and the time of the reading; the remaining headers and content are not required and can be deleted. The dataset should then be labelled properly so it can be identified during processing. Since anomalous behaviour consists of uncommon events, it can be hard to gather a large enough representative sample for modelling. The algorithms incorporated in this section have been specifically designed to address the core difficulty of building and training anomaly detection models on imbalanced datasets.
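The preparation step described above can be sketched in a few lines of Python. The CSV layout and column names here are hypothetical stand-ins for the real sensor export; only the field selection and labelling logic reflect the procedure in the text.

```python
# Minimal sketch of the data-preparation step: keep only the sensor name,
# reading and timestamp, and attach a Label column (1 = normal, 0 = abnormal).
# The sample CSV and its column names are illustrative, not the real dataset.
import csv
import io

raw = io.StringIO(
    "device_id,sensor,reading,unit,firmware,timestamp\n"
    "d1,temperature,21.5,C,1.0.2,2018-03-01T10:00:00\n"
    "d2,light,On,,1.0.2,2018-03-01T10:00:05\n"
)

keep = ["sensor", "reading", "timestamp"]  # the only fields the IDS needs
rows = []
for record in csv.DictReader(raw):
    cleaned = {k: record[k] for k in keep}   # drop the unneeded headers
    cleaned["Label"] = 1                     # 1 = normal behaviour
    rows.append(cleaned)
```

The same loop would also be the natural place to drop records with missing fields during cleaning.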
Microsoft Azure provides a cheat sheet, a graphical design outline that guides the choice of algorithm. Tools used: Node-RED, Microsoft Azure ML.

The flow starts by importing the dataset of our smart home environment, which was virtually implemented on the Node-RED platform. Microsoft Azure primarily accepts a .csv file; however, various modules can convert a file into the required format. The data contains 6 columns and over 4 million rows, with mixed numeric and categorical features. To segregate the training data from the test data, a final column named 'Label' is added with values 1 and 0, where 1 denotes normal behaviour and 0 denotes abnormal/risky behaviour. In the test data, both label values are used to evaluate the algorithm.

II. Design

A. Initializing the Internet of Things Smart Home Network on the Node-RED Platform

The Internet of Things network in the smart home environment is realised on a virtual platform to eliminate hardware complications and to keep the focus on its security. This virtual platform is designed with Node-RED, a visual tool for wiring the Internet of Things and other applications together and for assembling and monitoring flows of services. It is an open-source tool from the IBM Emerging Technology Organisation. It can be accessed via the IBM Bluemix IoT application starter, or through the terminal, where one can install and run it as a Node.js application. In Node-RED, IoT devices appear as nodes that can be dragged and dropped into flows; their functionality is customisable and can be monitored directly. In this project I have used the SenseHAT node offered by Node-RED, which allows us to create flows virtually, eliminating the need for hardware. It needs to be installed separately from the nodered.org library via the terminal.
This node has various sensors that can be used for input data, grouped into three sets:
- Motion events: acceleration, gyroscope, orientation and compass
- Environmental events: temperature, humidity and pressure values
- Joystick events: values emitted when the SenseHAT joystick is used

I have used the environmental sensor values for this project. These display the current temperature, humidity level and atmospheric pressure. In Figure 5 below, the SenseHAT Simulator window is shown, through which we set values for the three parameters to be monitored on the Node-RED platform.

Setting up Node-RED to send data to Azure IoT Hub: Node-RED provides access to a variety of nodes that can be used as devices and gateways. In this section we design the Node-RED platform in such a way that the data transmitted from the IoT network can be received on the Azure ML platform. In our Node-RED flow, the IoT device is the Simulator, and we add the Azure IoT Hub node to develop a connection between the two platforms. New nodes can easily be installed from the Manage Palette menu in Node-RED. The Azure node is then configured with the host name set up in Azure ML. This connects the two platforms instantly, and the transmitted data can be seen on the Azure dashboard.

Azure ML platform algorithm description. Edit Metadata: this item alters the metadata of the imported data. We use it to rename the last column to Label so that the rest of the algorithm can distinguish normal from abnormal data. The Split Data module divides the data in the ratio of 80% training to 20% test data. The anomaly detector is trained on data containing a single class (assumed to be "normal behaviour"), so we delete rows carrying the "anomalous behaviour" label using a Split module with a regular-expression filter on the Label column.
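The 80/20 split and the normal-only training filter described above can be sketched as follows. The data here is synthetic (it stands in for the 4-million-row sensor dataset), and the 0/1 labels follow the paper's convention of 1 = normal and 0 = anomalous.

```python
# Sketch of the Split Data step: 80% training / 20% test, then removal of
# anomalous rows from the training set, since the anomaly detector is
# trained on the "normal" class only. The data below is synthetic.
import random

random.seed(0)
# synthetic (reading, label) pairs: mostly normal temperatures, some outliers
data = [(random.gauss(21.0, 0.5), 1) for _ in range(800)] + \
       [(random.gauss(35.0, 2.0), 0) for _ in range(200)]
random.shuffle(data)

split = int(0.8 * len(data))
train, test = data[:split], data[split:]          # 80% / 20%

# keep only rows whose Label is 1 (normal) for training the detector;
# the test set keeps both label values, as in the experiment
train_normal = [row for row in train if row[1] == 1]
```

In Azure ML Studio the same filter is done declaratively (a Split module with a regular expression on the Label column) rather than in code.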
This procedure is repeated on the data used for the parameter sweep (a further split of the training data into training and validation sets). Learners: currently, Microsoft Azure Machine Learning supports the following anomaly detection algorithms.

**One-Class Support Vector Machine (One-Class SVM):** This module is helpful when a large amount of data depicting "normal behaviour" is available but relatively few instances of the irregularities to be detected. For instance, to identify a theft within the house, there will not be many example cases from which to build a training model, but there will be plenty of cases representing normal behaviour. Normally, an SVM is given a set of training examples belonging to one of two classes. The algorithm represents the examples as points in space, mapped so that the examples of the two classes are separated by a gap that is as wide as possible. New examples are then mapped into the same space and assigned to a class depending on which side of the gap they fall. In some cases, oversampling is used to duplicate the existing anomaly samples so that a two-class model can be built, but it is difficult to anticipate all new forms of anomaly or system fault from a limited set of examples. In One-Class SVM, therefore, the support vector model is trained on data containing only a single class, the "normal" class. It infers the properties of normal cases and from these can predict which cases are unlike them. This is useful for anomaly detection precisely because the scarcity of training examples is what defines an anomaly: there are typically very few examples of system intrusion, fraud, or other abnormal behaviour.
**Principal Component Analysis (PCA)-based:** This module is intended for cases where it is difficult to obtain training data for one class, for example when adequate samples of the targeted anomalies are hard to come by. With the PCA-Based Anomaly Detection module, we can train the model using the available features to establish what constitutes the "normal" class and then use distance metrics to identify cases that represent anomalies. PCA is often used in exploratory data analysis because it reveals the internal structure of the data and explains its variance. It works by analysing data that contains many, possibly correlated, variables and determining the combinations of values that best capture the differences between them; it then outputs these combinations as a new set of values called principal components. For anomaly detection, each new data point is first projected onto the eigenvectors, and the normalized reconstruction error is then computed. This normalized error is the anomaly score.

Trainer: the outputs of One-Class SVM and PCA are fed into the Sweep Parameters module to determine the optimal parameter settings. Learner parameters: the parameters of the learner can be specified in two different ways using the Create Trainer Mode option found in the module properties. Parameter sweep: individual values or ranges of values can be specified for each of the tuneable parameters. The Sweep Parameters module takes an untrained model along with training data and optional validation data and provides the optimum parameter settings for the experiment. Scoring: using the generic Score Model module, we obtain predictions from the anomaly detectors. The one-class SVM detectors output possibly unbounded, uncalibrated scores.
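The PCA reconstruction-error score described above can be sketched with scikit-learn's `PCA` as a stand-in for the Azure module. The correlated three-column data is synthetic, loosely mimicking coupled temperature/humidity/pressure readings; the score here is the raw (unnormalized) reconstruction error.

```python
# Sketch of the PCA anomaly score: project each point onto the principal
# components, reconstruct it, and use the reconstruction error as the score.
# The correlated 3-D data is synthetic; sklearn's PCA stands in for Azure ML.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(0, 1, size=(500, 1))
X = np.hstack([base,                                    # "temperature"
               0.8 * base + 0.1 * rng.normal(size=(500, 1)),   # "humidity"
               -0.5 * base + 0.1 * rng.normal(size=(500, 1))]) # "pressure"

pca = PCA(n_components=1).fit(X)               # keep the dominant component
recon = pca.inverse_transform(pca.transform(X))
err = np.linalg.norm(X - recon, axis=1)        # per-sample anomaly score

# a point that breaks the learned correlations scores far higher
outlier = np.array([[3.0, -3.0, 3.0]])
out_recon = pca.inverse_transform(pca.transform(outlier))
out_err = np.linalg.norm(outlier - out_recon, axis=1)
```

Points consistent with the learned correlation structure reconstruct almost perfectly, so their error (score) stays small; the outlier's error dwarfs every normal sample's.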
To match the dynamic range of the scores generated by the PCA anomaly detector, we normalize the one-class SVM scores. The area under the ROC curve provides a good measure of the discriminatory power of the anomaly detectors. In this experiment we set Entire Grid as the Parameter Sweeping mode, since we do not know in advance what the best parameter settings are and want to try many combinations; this option loops over a system-defined grid of settings. The parameter sweep is then executed to optimize the F-score of the detector; other classification metrics, such as AUC, ROC, and precision/recall, can likewise be optimized.

**Classification:** The visualisation of a scored dataset, as viewed in the Evaluate Model tab, gives the following values, which are useful for analysing the performance of the entire project:
- True Positive (TP): correctly identified case
- False Positive (FP): incorrectly identified case
- True Negative (TN): correctly rejected case
- False Negative (FN): incorrectly rejected case

Accuracy is the proportion of the total number of predictions that are correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

The F1 score is the harmonic mean of precision and recall.

F1 = 2TP / (2TP + FP + FN) (2)

The threshold is the value that separates the two classes. In this experiment, if a case's score is less than or equal to 0.5 (the threshold), it belongs to the anomaly class; otherwise it belongs to the normal class. The output and efficiency of the algorithm can also be analysed by observing the ROC (Receiver Operating Characteristic) plot in the Visualize option. This plot gives the corresponding Area Under the Curve (AUC) value, which can be used as an analysis metric. The performance of the whole experiment is reflected in the distance between the curve and the upper left corner of the graph.
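Equations (1) and (2) above, together with the precision and recall formulas defined later in this section, translate directly into code. The confusion-matrix counts below are hypothetical and chosen only to illustrate the arithmetic.

```python
# Direct transcription of the classification metrics; tp/tn/fp/fn are the
# confusion-matrix counts defined in the text. Sample counts are hypothetical.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)    # equation (1)

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)        # equation (2)

def precision(tp, fp):
    return tp / (tp + fp)                     # equation (3)

def recall(tp, fn):
    return tp / (tp + fn)                     # equation (4)

# hypothetical counts: no false positives, two missed anomalies
tp, tn, fp, fn = 90, 95, 0, 2
acc, p, r = accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn)
```

With fp = 0 the precision is exactly 1, mirroring the behaviour reported for this experiment's model below.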
The closer the curve is to that corner, the better the performance, i.e. the TP rate is maximised while the FP rate is minimised; conversely, the closer the curve is to the diagonal of the plot, the closer the predictions are to random guessing. The ROC plot of this experiment is shown, and as can be seen, the scored dataset curve almost touches the upper left corner of the graph, indicating near-optimal performance. Below is the performance of the model found by visualizing the output of the Evaluate Model module.

Precision: the proportion of identified cases that were identified correctly; it should be close to 1. As can be seen from Fig. 25, the Precision we obtain is 1.000, which signifies that our model is modelled correctly.

Precision = TP / (TP + FP) (3)

Recall: the proportion of actual positive cases that were identified correctly. Again, this value should be as close to 1 as possible. In our experiment the Recall is 0.978, which means there are few False Negatives.

Recall = TP / (TP + FN) (4)

**Conclusion and future scope**

From the above discussion it is apparent that a pattern-based IDS for a smart home system can be successful. Not only is it cost-effective, it is also a robust and effective security solution for an Internet of Things smart home network. The application of machine learning to safeguarding ad-hoc computer networks has been a success, and there is no reason the same approach cannot be used in smart home networks. As a future prospect, this analysis can be applied in the IDS node of the Node-RED platform: the IDS would first be fed the 80% training data, after which incoming traffic would be scrutinised according to what the IDS has learned from it. To apply this system in a practical situation, the training data will have to be collected from actual physical devices in the house.
This can be done by connecting SD memory cards to the microprocessor attached to each device, so that data is collected on the card. The data will have to be large enough to reveal recognisable trends in usage. Once collected, it will be pre-processed to make it suitable and legible for analysis, after which it can be run through the IDS in the home network for training. The algorithms that can be used are illustrated in this report in section III, C- 1&2, along with the reasons they should be used. This will ensure that the accuracy of the algorithm is maintained and remains close to what we obtained on the Azure ML platform. After the IDS is trained, it can be tested in real time as on the virtual platform, and the results verified. The transition from the virtual phase to an actual physical implementation is bound to surface many concealed issues, some of which are listed below:
• The protocol for collecting huge datasets on an SD card might be slow and inefficient.
• Integrating various devices requires unifying the different protocols on which each of them runs, which might be difficult.
• There will be significant changes in the algorithm, because the nature of the values and parameters of real IoT devices will differ from the ones simulated in this experiment.