Big Data is certainly one of the biggest trends of the last few years. Leaving aside the conceptual debates about what big data actually is and how much data qualifies as such, any technology that becomes widespread this quickly is also relevant from a security standpoint.
A Bit of History
When talking about big data in practical terms, most of the time we are talking about Hadoop, which is probably the most widely adopted platform for processing huge amounts of data.
With the advent of cloud computing, Hadoop has caught the attention of cloud vendors and providers, which have started offering big data processing as a service under a pay-as-you-go model. Many companies have also deployed their own Hadoop clusters, in the cloud or on-premises.
However, despite being a great tool for processing big data, Hadoop was originally designed mainly for internal use, that is, for local clusters within the security perimeter of an organization. As a consequence, earlier versions of Hadoop (before 2.6) were not designed to withstand external threats: they were highly insecure and easy to compromise in the event of a breach. For instance, some of the weaknesses affecting those earlier versions were:
- very weak, or absent, authentication (it was performed through the whoami shell command);
- no support for transparent data encryption at rest (https://issues.apache.org/jira/browse/HDFS-6134), which is a strict requirement for enterprises that want to comply with security best practices;
- no credential provider (https://issues.apache.org/jira/browse/HADOOP-10607) and no key management server (https://issues.apache.org/jira/browse/HADOOP-10433);
- no support for extended and fine-grained HDFS ACLs;
- no support for secure communications in HDFS (encryption of data in transit using TLS), so data was sent in cleartext and could easily be sniffed on the network;
- and many more…
Current versions of Hadoop are far more secure than earlier ones. They are still not secure out of the box, however, so you need to go through the configuration files and make sure the security-relevant options are actually enabled.
It is also important to keep in mind that the only way to make your cluster secure is to protect it at several layers, from the lowest (OS-level security) up to application-level and network-level security.
Here is a minimum set of options we strongly recommend enabling in order to secure your Hadoop clusters:
- Enable HDFS extended ACLs by adding the following property to hdfs-site.xml
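A minimal snippet for this (dfs.namenode.acls.enabled is the flag gating ACL support in Hadoop 2.4 and later):

```xml
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
```

Once enabled, per-user and per-group ACL entries can be managed with hdfs dfs -setfacl and inspected with hdfs dfs -getfacl.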
- Enable the Hadoop security module and strong authentication (Kerberos) by adding the following properties to core-site.xml
```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```
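With these settings in place, every client interaction requires a valid Kerberos ticket. A quick smoke test (the principal name is illustrative):

```bash
# Obtain a ticket, confirm HDFS commands work, then destroy the
# ticket and confirm the same command is rejected.
kinit alice@YOUR-REALM.COM
hdfs dfs -ls /
kdestroy
hdfs dfs -ls /   # should now fail with an authentication error
```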
- Secure HDFS by adding the following properties to hdfs-site.xml (in particular, enable block access tokens and Kerberos/SPNEGO authentication for the NameNode and DataNode services)
```xml
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.secondary.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>700</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@YOUR-REALM.COM</value>
</property>
```
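Note that because dfs.datanode.address binds a privileged port (1004), secure DataNodes in Hadoop 2.x must be started as root and drop privileges via jsvc. A sketch of the relevant hadoop-env.sh settings (the jsvc path is an assumption, adjust it to your installation):

```bash
# hadoop-env.sh: run the secure DataNode as root, dropping to the
# hdfs user after binding the privileged ports (requires jsvc).
export HADOOP_SECURE_DN_USER=hdfs
export JSVC_HOME=/usr/lib/bigtop-utils   # assumed path; adjust as needed
```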
- If you use WebHDFS (the REST API for HDFS), make sure Kerberos authentication is on by adding the following properties to hdfs-site.xml
```xml
<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.keytab</name>
  <value>/etc/hadoop/conf/HTTP.keytab</value>
</property>
```
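To verify, a WebHDFS call should now require SPNEGO negotiation (the host name and path below are illustrative):

```bash
# Without a ticket this request returns 401; with one, curl
# authenticates via SPNEGO thanks to the --negotiate flag.
kinit alice@YOUR-REALM.COM
curl --negotiate -u : "http://namenode.example.com:50070/webhdfs/v1/user/alice?op=LISTSTATUS"
```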
- Enable transparent data encryption and configure the Key Provider (which will take care of generating and providing encryption keys). You can use the hdfs crypto command to test your configuration; the documentation is available at https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html#Configuration
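For example, once the KMS is configured, creating and checking an encryption zone looks roughly like this (the key and path names are illustrative):

```bash
# Create a key in the KMS, dedicate an empty directory to it as an
# encryption zone, then verify the zone is registered.
hadoop key create mykey
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure
hdfs crypto -listZones
```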
- Finally, as a general security recommendation, make sure that firewall rules are correctly set and restrict access from the Internet to the necessary services only. By default, Hadoop exposes two web interfaces, one for the ResourceManager (on port 8088) and one for the NameNode (on port 50070), that are not protected by authentication and can therefore leak sensitive and critical information.
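As an illustration, a set of iptables rules restricting those two ports to a trusted admin subnet could look like this (the 10.0.0.0/24 range is an assumption, adjust it to your network):

```bash
# Allow the ResourceManager (8088) and NameNode (50070) web UIs
# only from the assumed admin subnet, and drop everything else.
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 8088 -j ACCEPT
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 50070 -j ACCEPT
iptables -A INPUT -p tcp --dport 8088 -j DROP
iptables -A INPUT -p tcp --dport 50070 -j DROP
```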
Continuous Security and Vulnerability Assessment
Of course, monitoring a Hadoop cluster (let alone several of them) can quickly become a headache for system administrators and DevOps teams, so an automated tool can be of great help. That is why we embedded these Hadoop security checks (plus a few others) into our product, Elastic Workload Protector, to help you continuously monitor the security of your Hadoop cluster(s) and be notified as soon as there is a potential issue or misconfiguration.