HBase scan with compare filters has long delay when returning last row

Developer https://www.devze.com 2023-03-30 06:06 Source: Internet
I have HBase running in standalone mode and have run into a problem when querying tables through the Java API. The table has several million entries (and might grow to billions), with row keys in the following format:

<UUID>-<Tag>-<Timestamp>

I use two compare-operation filters to query a specific row range which represents a time interval.

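import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;

Scan scan = new Scan();

// Upper bound (exclusive): row key < securityId + eventType + intervalEnd
RowFilter upperRowFilter = new RowFilter(CompareOp.LESS,
    new BinaryComparator((securityId + eventType + intervalEnd)
        .getBytes()));

// Lower bound (inclusive): row key >= securityId + eventType + intervalStart
RowFilter lowerRowFilter = new RowFilter(CompareOp.GREATER_OR_EQUAL,
    new BinaryComparator((securityId + eventType + intervalStart)
        .getBytes()));

// A FilterList defaults to MUST_PASS_ALL, so a row has to pass both filters
FilterList filterList = new FilterList();
filterList.addFilter(lowerRowFilter);
filterList.addFilter(upperRowFilter);

scan.setFilter(filterList);
scanner = table.getScanner(scan);
result = scanner.next();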

When I call the ResultScanner#next() method everything works fine until it gets to the last row of the key range which is specified through the filters. It takes up to 40 seconds until the ResultScanner returns the last row, which is lexically smaller than the upper row range limit.

When I change the order of the filters in the filterList from

filterList.addFilter(lowerRowFilter);
filterList.addFilter(upperRowFilter);

to

filterList.addFilter(upperRowFilter);
filterList.addFilter(lowerRowFilter);

the scanner takes up to 40 seconds before it starts returning any results, but there is no longer a delay on the last row. So I figured that the delay comes from the CompareOp.LESS filter.

The only way I know of to get around this delay is to omit the upperRowFilter and check manually whether the row keys are out of range. But I am sure something must be wrong, because I found nothing about this problem when searching the internet.

I also tried to get rid of the delay with scanner caching. With a cache size smaller than the number of rows returned, nothing changes. With a cache size larger than the number of rows returned, the delay is still there, but again it occurs before any results are returned.

Do you have any idea what could cause that kind of behaviour? Am I doing it wrong or is there something that I'm missing?

Thanks in advance!


The problem is that your scanner is scanning the entire table and throwing away the results that don't match your query. You need to explicitly set a stop row of (securityId + eventType + intervalEnd). If you set a corresponding start row of (securityId + eventType + intervalStart), then you won't need a filter at all and the scan will be efficient no matter the size of your data set.
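To make the difference concrete, here is a minimal sketch that models a table as a sorted in-memory map and counts how many rows each approach has to look at. It is not HBase code: the class RangeScanSketch, its methods, the "uuidA-tag" style keys and the zero-padded timestamps are all made up for illustration, and a TreeMap only stands in for a table that is sorted by row key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by row key. Hypothetical names, not the HBase API.
class RangeScanSketch {

    // Number of rows the most recent scan had to look at.
    static int rowsExamined;

    // Builds a sorted "table" with keys such as "uuidB-tag-0000000100".
    static TreeMap<String, String> buildTable(int rowsPerPrefix) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String prefix : new String[] {"uuidA-tag", "uuidB-tag", "uuidC-tag"}) {
            for (int ts = 0; ts < rowsPerPrefix; ts++) {
                // Zero-padding keeps string order identical to numeric order.
                String key = String.format(Locale.ROOT, "%s-%010d", prefix, ts);
                table.put(key, "value" + ts);
            }
        }
        return table;
    }

    // Filter-only scan: walks every row and throws away the ones out of range.
    static List<String> filterScan(TreeMap<String, String> table, String start, String stop) {
        rowsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.keySet()) {
            rowsExamined++;
            if (key.compareTo(start) >= 0 && key.compareTo(stop) < 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    // Range scan: jumps to the start key (inclusive) and ends at the stop key (exclusive).
    static List<String> rangeScan(TreeMap<String, String> table, String start, String stop) {
        rowsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.subMap(start, true, stop, false).keySet()) {
            rowsExamined++;
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = buildTable(1000);
        String start = "uuidB-tag-0000000100";
        String stop = "uuidB-tag-0000000200";

        List<String> viaFilter = filterScan(table, start, stop);
        int filterCost = rowsExamined;
        List<String> viaRange = rangeScan(table, start, stop);
        int rangeCost = rowsExamined;

        System.out.println("same rows returned: " + viaFilter.equals(viaRange));
        System.out.println("filter scan examined " + filterCost + " rows");
        System.out.println("range scan examined " + rangeCost + " rows");
    }
}
```

Both methods return the same rows, but the filter-only version touches every row in the table while the range version touches only the rows inside the interval. In HBase the start and stop rows are set on the Scan itself (Scan.setStartRow and Scan.setStopRow in the older client API, withStartRow and withStopRow in newer ones). The stop row is exclusive by default, which matches the CompareOp.LESS bound in the question.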
