Deep learning and cybersecurity
As more critical decisions depend on information systems and software reaches into more of the world, the security and integrity of those systems matter more than ever. This is especially true of systems that handle sensitive user data. The recent Equifax security breach, in which the personal information of roughly 143 million people was exposed, including Social Security numbers, paints a dismal picture of the future safety of our personal information.
As the volume of sensitive digital information grows, the systems that hold it become more attractive targets for attack. A shortage of qualified cybersecurity talent makes the problem worse.
While the problem seems intractable, machine learning can help, especially the powerful technique known as deep learning. This method can help reduce alerting noise and make sure that valuable analyst time is spent investigating actual threats. The idea is to amplify the analyst, not replace her.
Approach
Using data from the 1999 Knowledge Discovery and Data Mining Cup (KDD Cup 1999), we set out to classify network connection attempts as benign or malicious. Trained penetration testers simulated cyber attacks on a closed military network in order to generate this dataset. We also wanted our algorithm to determine the type of cyber attack used, if possible. The best systems keep improving as more data arrives and should be able to function in contexts where unstructured or semi-structured data is present.
Below, we’ve visualized a small sample from the dataset using a method known as t-SNE. Each point is colored according to its attack type.
[Interactive t-SNE diagram: https://pandatum.shinyapps.io/tsne_diagram/]
Algorithm
To apply recent advances in AI to this problem, we based our model on deep neural networks. Specifically, we used a modernized version of the feedforward network, as well as two recent advances to the state of the art: a deep residual network and a self-normalizing neural network. We combined these models with an ensembling method known as stacking.
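For readers who want a feel for the building blocks named above, here is a minimal sketch in plain Python. It is not our production model. The function names, layer sizes, and the choice of ReLU inside the residual block are assumptions made for illustration. The SELU constants are the ones published with self-normalizing networks.

```python
import math

# SELU constants from the self-normalizing neural networks paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """Scaled exponential linear unit, the activation behind self-normalizing networks."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def relu(x: float) -> float:
    """Rectified linear unit, a common activation in modern feedforward networks."""
    return max(0.0, x)


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def residual_block(inputs, weights, biases):
    """Residual connection: add the block's input back onto its transformed output.

    Assumes the layer keeps the same width as its input (illustrative choice).
    """
    transformed = [relu(v) for v in dense(inputs, weights, biases)]
    return [x + t for x, t in zip(inputs, transformed)]


def stack_features(base_model_probs):
    """Stacking, step one: concatenate every base model's class probabilities
    into a single feature vector that a second-level model would train on."""
    return [p for probs in base_model_probs for p in probs]


# With all-zero weights the residual block passes its input through unchanged.
x = [0.5, -1.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
passthrough = residual_block(x, zero_w, zero_b)

# Hypothetical outputs of two base models for one connection.
meta_input = stack_features([[0.9, 0.1], [0.2, 0.8]])
```

The pass-through behavior in the last lines is what makes deep residual networks easier to train: a block that has learned nothing yet does no harm to the signal. In stacking, the second-level model learns how much to trust each base model from feature vectors like `meta_input`.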
Results
Although we trained the model to identify each type of cyber attack, what we care about most is how well the model alerts humans to malicious attacks. That’s how we’ve presented our results in the graphic below.
The top nodes of the diagram represent the true connection types in the portion of data we set aside to evaluate our model, while the bottom nodes represent whether our model classified them as normal or malicious. Among the top nodes, the leftmost contains all of the normal connections, and each node to its right is a type of malicious attack. The size of the flow between a top node and a bottom node represents how much of the data for that connection type was classified as normal or malicious.
[Interactive Sankey diagram: https://pandatum.shinyapps.io/sankey_diagram/]
As you can see, it worked! Mouse over the flows to see results for each attack type. Our model was able to identify over 99.9% of all malicious connections. [1]
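The flows in the diagram are just counts of (true type, alert decision) pairs. The sketch below shows one way to compute them, assuming predictions arrive as attack-type labels. The helper names and the four sample connections are made up for illustration and are not taken from our test set.

```python
from collections import Counter

NORMAL = "normal"


def to_alert(label: str) -> str:
    """Collapse a predicted connection type into the decision an analyst sees."""
    return "normal" if label == NORMAL else "malicious"


def flow_counts(true_labels, predicted_labels):
    """Count connections flowing from each true type to each alert decision."""
    return Counter((t, to_alert(p)) for t, p in zip(true_labels, predicted_labels))


def malicious_detection_rate(true_labels, predicted_labels):
    """Share of truly malicious connections that were flagged as malicious."""
    attacks = [p for t, p in zip(true_labels, predicted_labels) if t != NORMAL]
    if not attacks:
        raise ValueError("no malicious connections to evaluate")
    return sum(1 for p in attacks if to_alert(p) == "malicious") / len(attacks)


# Illustrative labels only.
true_types = ["normal", "smurf", "neptune", "smurf"]
predicted = ["normal", "smurf", "smurf", "normal"]

flows = flow_counts(true_types, predicted)
rate = malicious_detection_rate(true_types, predicted)
```

Note that the third connection is given the wrong attack type but still counts as caught, because the analyst is alerted either way. That is why we report results at the normal-versus-malicious level.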
Conclusion
Although this work is a proof of value, we believe that this model can be used to augment cybersecurity investigations in security-focused organizations. Our model focuses analysts’ attention on real threats, enabling them to cut through the noise.
Deep learning is an exciting field, and its applications are everywhere. If you would like to find out how machine learning can improve your business, reach out to hello@pandata.co.
[1] If you’re a data science person, our ensemble had precision 0.998567 and recall 0.999736 on our test set. We compared that to a standard softmax classifier, which had precision 1.0 and recall 0.5. More details can be found at Jason’s GitHub repository.
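As a refresher on what those two numbers mean, here is a small sketch of the arithmetic. The counts are hypothetical and chosen only to reproduce the shape of the baseline result. They are not the counts from our test set.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision: share of alerts that were real attacks.
    Recall: share of real attacks that raised an alert."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical detector: flags 50 of 100 attacks and raises no false alarms.
baseline_like = precision_recall(true_positives=50, false_positives=0, false_negatives=50)
```

A detector like this never wastes an analyst's time, yet it misses half of the attacks. Perfect precision alone is not enough, so recall is the number to watch when the cost of a missed attack is high.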