IEEE Network • March/April 2017
0890-8044/17/$25.00 © 2017 IEEE
As one of the most popular platforms for processing big data, Hadoop offers low cost, convenience, and high processing speed. However, it is also a significant target of data leakage attacks, as a growing number of businesses and individuals store and process their private data in it. How to investigate data leakage attacks in Hadoop is an important but long-neglected issue. This article first presents some possible data leakage attacks in Hadoop. Then an investigation framework is proposed and tested on some simulated cases.
Hadoop is one of the most popular platforms for big data storage and analysis. It is widely used in many fields, such as manufacturing, healthcare, insurance, and retail, because of its powerful processing capacity, huge storage capacity, scalability, and relatively low cost. Nowadays, a growing number of individuals and businesses store and process their private data in Hadoop, and this valuable data has become an important target of hackers [1].
In order to prevent such data leakage attacks, investigating them and reconstructing the entire attack scenario is very important. Based on the forensic results [5, 6], the vulnerabilities of Hadoop can be found and the attackers can be accused. Although the Hadoop investigator faces many challenges, there is still little research work in this area. Current challenges include, but are not limited to: how to locate the data leakage node among thousands of Hadoop nodes; how to obtain reliable evidence in a complex and rapidly changing Hadoop environment; and how to investigate attacks based on Hadoop audit logs, which usually contain a large quantity of redundant, multi-user data.
Considering these challenges, in this article we first present some possible data leakage attacks in Hadoop, and then analyze the difficulties of investigating them. After that, an investigation framework is proposed and tested on simulated cases. The framework is composed of a data collector and a data analyzer. The data collector gathers Hadoop logs, Fsimage files, our own monitor logs, and other information from each node, either actively or on demand, and transmits them to the data analyzer. The data analyzer then processes this data with automatic methods to find the stolen data, identify the attacker who stole it, and reconstruct the crime scenario.
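As a sketch of the kind of first step the data analyzer might take, the following Python fragment parses tab-separated key=value records of the general shape HDFS audit-log lines have, and flags open commands that touch a sensitive path. All function names, field names, and paths here are illustrative assumptions of ours, not details from the article:

```python
import re

# Hypothetical analyzer step: parse tab-separated key=value audit records
# (the general shape of HDFS audit-log lines) and flag reads of sensitive
# paths. Field and function names are illustrative, not from the article.
FIELD = re.compile(r"(\w+)=([^\t]*)")

def parse_audit_line(line):
    """Return the key=value fields of one audit-log line as a dict."""
    return dict(FIELD.findall(line.split("audit:", 1)[-1]))

def flag_sensitive_reads(lines, prefix="/data/private"):
    """Collect (ip, src) pairs for 'open' commands under a sensitive path."""
    hits = []
    for line in lines:
        rec = parse_audit_line(line)
        if rec.get("cmd") == "open" and rec.get("src", "").startswith(prefix):
            hits.append((rec.get("ip"), rec["src"]))
    return hits
```

In the full framework, output like this would then be correlated with the monitor logs collected from each node to reconstruct when and from where the data was accessed.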
Current Hadoop security research mainly includes trusted audit mechanisms, access control, data encryption, and so on. In [1], an architecture was proposed to fight against advanced persistent threats (APTs) targeting data stored in HDFS (the Hadoop Distributed File System). This architecture is based on the Trusted Platform Module (TPM) specified by the Trusted Computing Group (TCG). With its help, all of the operations triggered by users can be audited. In this way, suspicious actions can be discovered and evidence can be retrieved for future investigation. However, it has a serious impact on performance and produces huge logs, which makes it impractical.
In [2], a sophisticated access control mechanism was adopted to ensure the security of Hadoop. It protects data in Hadoop from unauthorized access, accidental leakage and loss, and breach of tenant confidentiality. ACL-based access control and Kerberos are two such mechanisms that have been adopted in recent versions of Hadoop. Although they make Hadoop more secure, they cannot prevent attacks in which criminals use legitimate user accounts, and they do nothing against direct data access at the operating system layer, because these mechanisms work only at the application layer.
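To make this limitation concrete, here is a toy Python model of an application-layer ACL check (entirely illustrative; this is not Hadoop code, and the names are ours): the check blocks unauthorized requests made through the application, but a process reading the underlying stored files at the operating-system layer never consults it.

```python
# Toy model of application-layer access control; all names are
# illustrative. The ACL is enforced only when reads go through the
# application entry point, mirroring the gap described above.
ACL = {"/data/salaries.csv": {"alice", "bob"}}

def app_layer_read(user, path):
    """Application entry point: consult the ACL before serving the read."""
    if user not in ACL.get(path, set()):
        raise PermissionError(f"{user} may not read {path}")
    return f"<contents of {path}>"

def os_layer_read(path):
    """Direct access to the stored blocks: the ACL is never consulted."""
    return f"<contents of {path}>"
```

Here `app_layer_read("eve", "/data/salaries.csv")` is refused, but `os_layer_read` returns the same bytes to anyone with operating-system access, which is exactly the gap the article points out.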
In [3], a secure Hadoop architecture was proposed that adds encryption and decryption functions to HDFS. With this method, data in HDFS remains unreadable even if it is stolen, because the attacker does not have the secret key. Although this is a fundamental solution for securing Hadoop, its impact on performance cannot be ignored, and the approach still does not guard against attackers who use a legitimate account.
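The property this approach relies on can be sketched with a toy stream cipher in Python (SHA-256 in counter mode, standing in for the real encryption layer; this is not the scheme of [3] and must not be used in practice): stored ciphertext is unreadable without the key, and every read or write pays an extra pass over the data.

```python
import hashlib

# Toy stream cipher for illustration only (NOT the scheme of [3] and not
# secure for real use): SHA-256 over key||counter yields a keystream that
# is XORed with the data, so the same call encrypts and decrypts.

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudorandom bytes from the key (SHA-256 counter mode)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
```

An attacker who copies the stored bytes but lacks the key recovers only noise, while decryption with the correct key restores the plaintext. As noted above, however, this does nothing against an attacker holding a legitimate account, who simply reads through the decrypting layer.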
Unlike [1], our work concentrates on investigating data leakage attacks in Hadoop, including evidence collection and analysis. Although [2] and [3] present good solutions for preventing data leakage, they are far from silver bullets. This type of attack still takes place frequently, so research on how to investigate such attacks after they happen is necessary and important. We have not found any other work on this issue.
Data leakage attacks in Hadoop mainly include, but are not limited to, the following categories.

Application-layer data leakage: Attackers can obtain private data through application-layer vulnerabilities or malware. For example, a weakness of the current Hadoop audit mechanism is that it records only the operation type, time, and content, but no information about who performed the operation.
this operation. Suppose in a company, Alice, Bob
and Cindy belong to a group named Hadoop,
which is responsible for managing their compa-
Security Threats to Hadoop: Data Leakage Attacks and Investigation

Xiao Fu, Yun Gao, Bin Luo, Xiaojiang Du, and Mohsen Guizani

Xiao Fu, Yun Gao, and Bin Luo are with Nanjing University. Xiaojiang Du is with Temple University. Mohsen Guizani is with the University of Idaho.