Cite this asKhan H, Hanif S, Muhammad B (2021) A survey of machine learning applications in digital forensics. Trends Comput Sci Inf Technol 6(1): 020-024. DOI: 10.17352/tcsit.000034
We address the role of machine learning in digital forensics in this paper, in order to have a better understanding of where machine learning stand in today’s cyber security domain when it comes to collecting digital evidence. We started by talking about Digital Forensics and its past. Then, to illustrate the fields of digital forensics where machine learning methods have been used to date, we recommend a brief literature review. The aim of this paper is to promote machine learning applications in digital forensics. We went through different applications of machine learning in different areas and analysed how machine learning can potentially be used in other areas by considering its current applications and we believe that the ideas presented here will provide promising directions in the pursuit of more powerful and successful digital forensics tools.
ACM: Access Control Management; CNN: Convolution Neural Network; ML: Machine Learning; GPS: Global Positioning System; BSAF: Blockchain-assisted Shared Audit Framework; ODL: Optimal Deep Learning; IEHO: Improved Elephant Herd Optimization; API: Application Programming Interface
Since the last decade of the 20th century, we witnessed a world-changing digital revolution. Mobile phones, Internet and many of the various digital technologies have become a part of our daily lives. At first, we were unaware of how significant those tools will be in the future but today they play an essential role in our daily lives and so do in our professional lives. This makes it obvious that most of our valuable information is going to be stored in digital form. In this age of Information Technology, the needs to change and enforce the laws are also changing. As a result, conventional crimes, especially those involving finance and commerce, are constantly evolving in response to technological advancements. In almost any investigation, a detailed analysis of computer systems and digital devices is becoming increasingly necessary in deciding the evidence.
This is where Digital forensic comes into the picture, to respond to law enforcement’s clear and expressed needs in order to make the most of this emerging medium of electronic evidence, where machine learning can play a vital role by automating the time taking processes.
Digital forensics can be defined as a branch of forensic science dedicated to recover and investigate digital data. It is an evolving field that is always advancing to catchup with the changes in devices and how they are used for the identification, preservation, analysis, and recovery of data from computer systems and various other digital storages. The “Big digital forensic evidence” is today’s major challenge for law enforcement agencies. The amount of data that needs to be analysed is rising all the time. A issue that needs to be solved is how to easily detect relevant file artefacts. In order to encourage further research in this field, we delve deeper into the role of machine learning in digital forensics applications in this paper, and provide an overview of the work already done in this direction.
As a discipline, the study of cyber security started in the early 1970s. At the time, the discipline’s approach was very strict, with a focus on the creation of theoretical models rather than the implementation of realistic implementations.
The work proposed by de Denning  is unquestionably one of the pillars of implementing machine learning in computer security.
Since then, a slew of new machine learning frameworks for computer security have been proposed , a real-time intrusion detection system having the capability to detect break-ins and penetrations by monitoring system audit records. The model also represented subject’s behaviour with respect to object. There are numerous examples of effective machine learning applications in various computer security fields like the one proposed in  which clustered a network level behavioural malware by analysing malicious HTTP traffic for structural similarities. Since the first recorded internet-wide attack in 1988 (the “Morris Worm”), it has taken about 25 years of worldwide efforts (and multi-billion dollars) to get this far.
Machine learning algorithms have been used in a number of fields. A variety of strategies for designing intelligent systems to solve various digital forensics problems have been proposed in the field of information security. Grillo et al. proposed a method for identifying machine users in order to make it easier to distinguish confiscated devices . The method leveraged the habits of particular user, skills level, online interests and searches, etc. Following user profiling, five distinct types of users were identified: casual users, Internet chat users, office worker users, seasoned users, and hacker users. This method assisted in prioritising the examination of confiscated devices. which resulted in minimizing the analysis time as forensic examiners had to examine only potentially relevant hard drive. Metadata has also been used for automatic forensic analysis in recent years. Garfinkel and Rowe  defined an approach that uses (file name, extension, path, size, access and modification time, hash codes, status flag and fragmentation) on a large volume to identify anomalies and suspicions in a file artefact. In 2013, Raghavan and Raghavan  demonstrated the use of metadata associations to identify the origin of downloaded files. In 2020 Asaf Varol, et al.  explained the role ML can play and how it can be used in analysing large datasets and revealing criminal behaviour and criminal intents by learning from previous activities, and help in predicting future criminal intents. S. Baskar et al. proposed  Blockchain-assisted Shared Audit Framework (BSAF) for the analysis of digital forensics data in internet of things platform. The proposed work detected source/cause of data scavenging attacks in virtualized resources. Francisca, et al.  reviewed applications of ML in object detection and classification. Mohamed Alhoseni, et al.  proposed a new IoT-enabled Optimal-Deep-Learning based Convolutional-Neural-Network (ODL-CNN) to assist in the process of suspect identification. The Improved Elephant Herd Optimization (IEHO) algorithm was used to optimise the hyper parameters of the DL-CNN model.
For smart settings: Abbas Acar, et al. (2019) presented a novel automated forensic system for intelligent environments . Modifier (ITM) and Analyzer are the two key components of this system (ITA). The ITM examines smart applications in order to find forensically important data inside them. The smart apps are then instrumented by inserting complex logs that, at runtime, send the forensic data to a secure Database (ITD). And, in case of an investigation, the (ITA) applies data processing and machine learning techniques on the ITD data to acquire the overall status of the environment. In the first experiment, both time-dependent and time-independent user actions and forensic behaviours were inferred. Initial results proved the effectiveness of the framework and the work is still in progress.
Recently, the major cause of data theft and system compromises has been the misconfiguration of access control systems, as quantified in this report by security analysis  and demonstrated by the newsworthy incidents listed in Table 1.
One of the key features missing today in ACMs is System administrators may use continuous behaviour validation to ensure that a system is behaving as expected after changing configurations. It is never a one-time initiative, but rather a continual process of adapting policy to evolving aspects of data and resource sharing.
The following are some of the most common scenarios in which system administrators must adjust the access control policy to accommodate changes to users, data, features, or domains.
➢ User transition: Within an entity or project, new users can join, or current users may leave or change roles.
➢ Changes in data: Parts of the data can become sensitive or begin to contain sensitive information that must be shielded from users who previously accessed the data.
➢ New functions, accesses, or facilities are added for the general public or a certain community of users to use
➢ Domain reorganisation: Data must be reorganised into new domains or subdomains.
P-DIFF, a realistic method for inferring access control behaviour and adjustments from access logs , was introduced by Cindy Moore, et al. (2019). P-DIFF will support sysadmins with the following two vital tasks:
➢ Forensic analysis. P-DIFF can also monitor all of the changes in behaviour caused by a malicious access. This offers information about when and what improvements made the access possible. Which assists the post-mortem analysis upon a security incident, those clues can help administrators narrowing down the record they need to investigate.
P-DIFF was tested using data from five real systems, two of which were from industrial companies. PDIFF detected 86 percent–100 percent of rule changes with an accuracy of about 89 percent for shift validation. In 85 percent–98 percent of the analysed cases, P-DIFF can pinpoint the root-cause shift that makes the target access for forensic examination.
Classification: One of the most discussed challenge in digital forensics is the ever-increasing data. The majority of the data on the confiscated computers is usually unrelated to the investigation. Finding a needle in a haystack is equivalent to manually extracting vital data and suspicious files.
Du, Xiaoyu, and Mark Scanlon (2019) proposed a system for prioritising suspicious file artefacts automatically . Rather than presenting the final analysis outcome, the toolkit predicts and advises that an artefact is possibly suspicious. It employs a supervised machine learning method that draws on the results of previously processed instances. The paper covers feature extraction, dataset generation, training, and evaluation. And additionally, a data extraction toolkit from disk images is outlined, this makes it simpler to incorporate this approach into the traditional investigative process and have it function in an automated manner.
In summary, they suggested an automated digital forensic data processing technique for prioritising investigation file artefacts, and they validated the methodology using an example scenario. A toolkit for data extraction, dataset creation, and pre-processing was developed. As a result, the procedure can be carried out automatically during the investigation. And Analysed and discussed the proposed solution for accelerating the processing of large volumes of digital evidence.
In digital forensics, file fragment classification is a crucial step. Traditional machine learning is used to extract features such as Ngram, Shannon entropy, and Hamming weights, which is the most widely used method. These characteristics, however, are insufficient to distinguish file fragments. Qing Liao, et al. (2018) suggest a new method based on fragment to grayscale image conversion and deep learning to retrieve more hidden features and increase classification accuracy . The deep Convolution Neural Network (CNN) model can extract about ten thousand features using multi-layered feature maps and non-linear connections between neurons. On the public dataset, the model was trained and checked. During the experiments, a classification accuracy of 70.9 percent was obtained, which is higher than previous works.
To summarise, grayscale images and deep learning were used to enhance the classification accuracy of file fragments. The findings were compared to representative results from previous research, revealing that this approach was more accurate. This approach appears to be promising, and it is worth further refining the model and techniques as a potential project.
Forensic tasks: Deep learning research indicates that deep features can generalise to seemingly unrelated tasks in some cases. Owen Mayer, et al. (2018) created deep feature learning techniques that can be applied to a range of forensic tasks, such as image distortion detection and camera model recognition . They devised two methods for constructing deep forensic features for this purpose:
1. A transfer learning approach requires moving features from one task to another.
2. A multitask learning strategy, in which a single function extractor is designed for multiple tasks at the same time.
The experimental performance was evaluated in numerous scenarios, and it was found that:
1. Camera model identification tasks generalise well to manipulation detection tasks, but learned features tasks from manipulation detection do not generalise well to camera model identification tasks, meaning task asymmetry.
2. Shallower features are more task-specific, while deeper features are more task-general, indicating a feature hierarchy.
3. It is possible to learn a single, coherent feature extractor that is highly discriminative of multiple forensic tasks.
Furthermore, the findings revealed that when training data is small, unified feature extractors outperform targeted CNNs. These findings illustrate a few key points to keep in mind when using deep feature-based approaches.
The idea of a smart environment has been enabled by the cooperative use of IoT devices and sensors, and in these smart settings, a massive amount of data is produced as a result of the interactions between devices and users and their day-to-day activities. Such data can provide useful information about incidents and activities taking place within the system, and if analysed, can assist in identifying and holding those who violate security policies accountable. However, prior smart app programming frameworks lacked forensics capabilities for identifying, tracing, storing, and analysing IoT data. IoTDots, a novel automated forensic system for a smart world such as smart homes and smart offices , was developed by Leonardo Babun, et al. (2018).
1. IoTDots- Modifier.
IoTDots- Modifier analyses smart app source code at compile time, detects forensic-relevant content, and inserts tracing logs automatically. The logs are then stored in an IoTDots database at runtime. The IoTDots-Analyzer then processes data and applies machine learning techniques to retrieve useful and usable forensic knowledge from the devices’ operation in the event of a forensic investigation. IoTDots were tested in a realistic smart office environment with 22 devices and sensors.
Also considered were ten separate instances of forensic practises and behaviours from users, software, and computers. The evaluation results show that IoTDots can detect user behaviours with an average accuracy of over 98 percent and user, system, and application actions with an average accuracy of over 96 percent.
IoTDots output was found to have no overhead for smart devices and very little overhead for the cloud server. It’s the first lightweight forensic approach for IoT devices, combining the collection of forensically relevant data from a smart environment with forensic analysis using data processing and machine learning techniques. The IoTDots is available online
During digital forensics research, textual records and e-mails are unquestionably a significant source of evidence. Authorship authentication and attribution are essential tasks when dealing with emails. Several studies have previously been proposed to solve this issue by studying the structure of an e-mail document (for example, e-mail headers, words, lines, and sentences, etc.) as well as linguistic trends (for example number of characters, occurrences of punctuation, vocabulary, etc) [18.]. With promising results, clustering algorithms and SVM were used. For example, in  When extracting the e-mails of three separate writers from 156 addresses, an accuracy of 84 percent to 100 percent was achieved. Other methods have been suggested for the study of any textual document, not just emails.
The development of successful digital forensics techniques is now aided by the introduction of text clustering methods. However, the topic of an ever-increasing number of text sources and the amount of confiscated devices has become increasingly important over time. Many studies have suggested that this problem needs to be addressed right away. Bandar Almaslukh (2019) presented a thorough overview of text-clustering approaches of digital forensic research and looked into the complexities of high volume data on digital forensic techniques. Furthermore, a useful classification and comparison of text clustering methods commonly used for forensic analysis were given. The main problems, as well as solutions and potential research directions, are highlighted to help researchers in the field of digital forensics in the age of big data. He also described possible future work in the field of forensic analysis using text clustering in the age of big data, such as validating text clustering on real-world and large-scale data, investigating the automated method for cluster labelling and bilingual clustering, and so on. Based on clustering strategies, similar works are categorised into fifth groups. SSOM, Kernel k-means and subject-based clustering, LDA, and benchmarking various clustering algorithms are the groups. In the last two groups, several clustering strategies in the sense of digital forensics are compared. The applicability of clustering techniques used in the literature to analyse a large volume of text data, on the other hand, is investigated.
Forensics: Network forensics is a technique for tracking down cyber criminals by analysing and tracing back network data. As a minimum, network traffic collection tools like Iris, Net Intercept, Net Witness, SoleraDS5150, and Xplico should be deployed. Network forensics entails examining network traffic for the purpose of detecting intrusion and determining how the crime occurred, i.e., setting up a crime scene for analysis and replays. The process model was introduced by Kaur P (2018), and it was compared to current network forensic process models and frameworks. In addition, the study problems at different stages were highlighted .
Network traffic classification is also important for network surveillance, security analysis, and digital forensics. The computational demands imposed by analysing all IP traffic flows are huge without correct traffic classification. The number of flows that need to be analysed and prioritised for review can also be reduced by classifying them. An automated feature elimination method based on a feature correlation matrix  is presented by Jan Pluskal (2018). They also suggested an improved statistical protocol recognition system, which was compared to Bayesian network and random forests classification methods and found to be accurate and perform well. Each classification method is applied to a subset of features that are most suitable for the method. The methods are judged on their ability to recognise application layer protocols as well as the applications themselves. The random forests classifier produces the most promising results, while the proposed improved statistical protocol recognition approach offers an interesting trade-off between higher efficiency and slightly lower accuracy, according to the experiments.
In this paper, we looked at some of the ways machine learning can be used in digital forensics. How machine learning techniques can be further used in Digital forensics to reduce the hard work and we explored those various techniques to show what to expect in the future from machine learning applications of digital forensics. Finally, we looked at the areas of digital forensics that needs our attention so that we work in those areas in order to secure our working environment in all aspects.
Today there are many other areas in digital forensics that needs the application of machine learning to increase efficiency and effectiveness of the process. One such area is Cloud Forensics. Cloud infrastructure vomits a huge number of logs. We create Application Logs, Cloud Operations Logs, System Host Logs, Perimeter (VPN packet trace) Logs, CI/CD DevOps Build and Release logs, and 3rd party API access logs just by setting up a simple system.
Any interaction, access, or API call generates a log. A medium-sized infrastructure (say, 100 hosts across all environments) with reasonable commercial use (20,000 unique visitors per day) produces more than 20GB of logs per day. The method of evaluating and examining all of this is labor-intensive and vulnerable to error. We could no longer depend on humans, with their limited processing capacity and inflexibility.
Let’s go back to a daily log volume of 20 GB. It’s a massive amount of information. An operating infrastructure is almost always either anomaly-free or relatively anomaly-free. Unless we’re training an algorithm to differentiate between anomaly-full and anomaly-free logs, anomaly-free logs are typically tedious.
Computers can and will examine the entire 20 GB of data every day. Humans are unable to do so. A computer system’s learning accuracy would also improve over time. In operational defence, precision is the name of the game. Security data will grow, particularly in the cloud, where you can turn up unlimited capacity in exchange for money. Finding a needle in a haystack would be difficult. All of the information can be read by computers.
This is where machine learning, our sidekicks, and invaluable resources, can come in handy on a wide scale. Over the last ten years, the security industry has done an outstanding job of turning large data pipelines into massive data stores for analytics. However, we assume it is now time to conduct the actual analytics.
Subscribe to our articles alerts and stay tuned.