How does the HBase Java API connect to HBase?

Many people probably run into this problem when they try to connect to HBase with the Java API:
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch
At the same time, a similar error like this one is thrown:
java.nio.channels.SocketChannel[connection-pending remote=HbaseNode1/192.168.0.105:50578]
Let's check how the code connects.

As you can see, we connect to ZooKeeper directly, not to HBase. We connect to the server ZOOKEEPERCOOR on port 2818, but the error message says
java.nio.channels.SocketChannel[connection-pending remote=HbaseNode1/192.168.0.105:50578].
So it seems we have to connect to HbaseNode1/192.168.0.105:50578? Yes, you may have got it already. The HBase API needs ZooKeeper to tell it which server is the HBase Master it can connect to, and that is why the response is HbaseNode1/192.168.0.105:50578. But if ZooKeeper gave us the right HBase Master, why do we still get the error? Because ZooKeeper returns the server's host name, not its IP address, so the client has to resolve that name itself. The local machine is not in the same subnet as HBase, so the DNS server cannot resolve the name, and the local machine has no entry for it in its hosts file either. The solution is to add the IP mapping for that host name to the hosts file. That's how I solved it. =]
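To confirm that name resolution is the culprit, it can help to check from the client machine whether the host name returned by ZooKeeper resolves at all. The following is a minimal sketch using only the standard `java.net.InetAddress` class; the class name `HostCheck` and the helper `resolve` are made up for illustration, and `HbaseNode1` is simply the host name from the error message above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

public class HostCheck {
    // Returns the IP address this machine resolves hostName to,
    // or empty if neither DNS nor the hosts file knows the name.
    static Optional<String> resolve(String hostName) {
        try {
            return Optional.of(InetAddress.getByName(hostName).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Default to the host name seen in the error message (an assumption
        // for this sketch); pass another name as the first argument.
        String host = args.length > 0 ? args[0] : "HbaseNode1";
        Optional<String> ip = resolve(host);
        System.out.println(ip.map(a -> host + " -> " + a)
                .orElse(host + " cannot be resolved; add it to the hosts file"));
    }
}
```

If the name cannot be resolved, a hosts file entry along the lines of `192.168.0.105 HbaseNode1` (the address and name shown in the error message) should make the lookup succeed, after which the check above should print that mapping.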

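[Translation] Optimizing Jaccard Similarity Computation for Big Data

Computing Jaccard similarity across a large number of data pairs is a hard problem, because its complexity is O(N^2). However, there are many optimization techniques that can significantly reduce the amount of computation. I spent about a week searching for relevant material and studying a large number of techniques. Below is a summary of the different approaches I found, along with links to some useful information. Two resources are worth mentioning as especially useful:
"Similarity Joins in Relational Databases", Chapter 6
"Mining of Massive Datasets", Chapter 3

  1. Token-based filtering:
    Idea: bucket the records by token, and only consider pairs of records that share at least one token.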
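The token-based filtering idea can be sketched roughly as follows. This is an illustrative sketch, not code from the original article: it assumes each record is already a set of string tokens, builds an inverted index from token to record ids, and only emits pairs that appear together under at least one token. The class and method names (`TokenFilterJaccard`, `candidatePairs`, `jaccard`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TokenFilterJaccard {
    // Jaccard similarity: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        int inter = 0;
        for (String t : a) if (b.contains(t)) inter++;
        return (double) inter / (a.size() + b.size() - inter);
    }

    // Returns pairs [i, j] with i < j whose records share at least one token.
    static Set<List<Integer>> candidatePairs(List<Set<String>> records) {
        // Inverted index: token -> ids of the records containing it.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < records.size(); i++)
            for (String t : records.get(i))
                index.computeIfAbsent(t, k -> new ArrayList<>()).add(i);

        // Every two records in the same bucket form a candidate pair;
        // the set removes pairs found under more than one token.
        Set<List<Integer>> pairs = new LinkedHashSet<>();
        for (List<Integer> ids : index.values())
            for (int x = 0; x < ids.size(); x++)
                for (int y = x + 1; y < ids.size(); y++)
                    pairs.add(Arrays.asList(ids.get(x), ids.get(y)));
        return pairs;
    }

    public static void main(String[] args) {
        List<Set<String>> records = new ArrayList<>();
        records.add(new HashSet<>(Arrays.asList("a", "b", "c")));
        records.add(new HashSet<>(Arrays.asList("b", "c", "d")));
        records.add(new HashSet<>(Arrays.asList("x", "y")));
        for (List<Integer> p : candidatePairs(records))
            System.out.println(p + " " + jaccard(records.get(p.get(0)), records.get(p.get(1))));
    }
}
```

With the three sample records in `main`, the only candidate pair is records 0 and 1 (similarity 0.5); record 2 shares no token with the others and is never compared. Pairs with no common token have Jaccard similarity 0, so skipping them loses nothing, but a very frequent token can still produce a large bucket, so the worst case remains quadratic.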
