MapR has a fantastic feature where it exposes the Hadoop file system directly via NFS. That means you can read and write to the Hadoop cluster directly from client machines, without having to learn a separate set of tools. Now, to be fair, Hortonworks and Cloudera have NFS share features too. The difference is that they buffer the incoming data before actually writing it to the cluster, whereas with MapR the data is written as it's received. MapR should have better NFS performance as a result.
The use cases for data warehousing (DWH) are pretty obvious: if you can export CSV files and have them written directly to the Hadoop cluster, you've cut out an entire step in the data workflow. No need for special ODBC drivers in SSIS to write data to Hadoop, and no need to move very large files around over your network. You can even mount the NFS share on your web servers and have them log directly to that mount point. As the data is laid down, it's queryable too.
It’s easy enough to mount an NFS share in Linux and OSX, but for Windows you have to either install some third party NFS client, or you can do this:
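On Linux, for example, it's a single mount command. A minimal sketch, assuming a hypothetical MapR node at 10.0.0.5 and a cluster named my.cluster.com (substitute your own values; `nolock` is the option MapR's docs recommend for its NFS server):

```shell
# Create a mount point and mount the cluster's NFS export (requires root).
sudo mkdir -p /mapr
sudo mount -t nfs -o nolock 10.0.0.5:/mapr /mapr

# Files under /mapr/my.cluster.com/ now read and write straight to the Hadoop cluster.
ls /mapr
```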
1. Enable the NFS client features in Windows (search for "Add Features").
2. Open regedit.exe.
3. Navigate to HKEY_LOCAL_MACHINE > SOFTWARE > Microsoft > ClientForNFS > Default and add two DWORDs:
   - AnonymousGid
   - AnonymousUid
   Both values need to be 0, which is what they're set to by default when created.
4. From a command line, run "nfsadmin client restart" (or reboot).
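If you'd rather not click through regedit, the same steps can be scripted. A sketch using the key path from the steps above (on some Windows versions the key may sit under an intermediate CurrentVersion subkey, so adjust accordingly; run from an elevated prompt):

```shell
:: Add the anonymous UID/GID mappings (0 = root) for the Windows NFS client.
reg add "HKLM\SOFTWARE\Microsoft\ClientForNFS\Default" /v AnonymousUid /t REG_DWORD /d 0 /f
reg add "HKLM\SOFTWARE\Microsoft\ClientForNFS\Default" /v AnonymousGid /t REG_DWORD /d 0 /f

:: Restart the NFS client so the new mappings take effect.
nfsadmin client stop
nfsadmin client start
```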
Now when you browse to your MapR machine's IP via Explorer, you will see the shared folders. You can copy files in or out, delete them, etc. Note that the fastest throughput I've ever seen this way is about 30 MB/s.
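Instead of browsing by UNC path, you can also map the share to a drive letter with the Windows NFS client's own mount command. A sketch, with a hypothetical server IP and paths:

```shell
:: Map the MapR NFS export to Z:, authenticating with the anonymous UID/GID set earlier.
mount -o anon \\10.0.0.5\mapr Z:

:: Copy a DWH export straight into the cluster (hypothetical paths).
copy C:\exports\sales.csv Z:\my.cluster.com\data\
```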
mapR docs on the subject:
Notes on security:
I haven't looked yet, but it would probably be a good idea to restrict which hosts can reach which folders on the Hadoop cluster. I haven't dug into how MapR exports the NFS shares, but IP-based restrictions on the exports should be a reasonable starting point.
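As a sketch of what that might look like: MapR's NFS server can read an exports file (on MapR installs this is typically /opt/mapr/conf/exports), using the standard NFS exports syntax. The entries below are hypothetical, just to illustrate per-subnet restrictions:

```
# Hypothetical entries; syntax follows the standard NFS exports format.
# Export the whole cluster read-write to the DWH subnet only:
/mapr/my.cluster.com 10.1.2.0/24(rw)
# Give the web servers access to just the logs volume:
/mapr/my.cluster.com/logs 10.1.3.0/24(rw)
```

Check the MapR docs linked above for the exact file location and supported options on your version before relying on this.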