Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

haskell - Curious about the HashTable performance issues

I read that hash tables in Haskell had performance issues (on the Haskell-Cafe in 2006 and Flying Frog Consultancy's blog in 2009), and since I like Haskell it worried me.

That was a year ago, what is the status now (June 2010)? Has the "hash table problem" been fixed in GHC?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The problem was that the garbage collector is required to traverse mutable arrays of pointers ("boxed arrays") looking for pointers to data that might be ready to deallocate. Boxed, mutable arrays are the main mechanism for implementing a hashtable, so that particular structure showed up the GC traversal issue. This is common to many languages. The symptom is excessive garbage collection (up to 95% of time spent in GC).

The fix was to implement "card marking" in the GC for mutable arrays of pointers, which occured in late 2009. You shouldn't see excessive GC when using mutable arrays of pointers in Haskell now. On the simple benchmarks, hashtable insertion for large hashes improved by 10x.

Note that the GC walking issue doesn't affect purely functional structures, nor unboxed arrays (like most data parallel arrays, or vector-like arrays, in Haskell. Nor does it affect hashtables stored on the C heap (like judy). Meaning that it didn't affect day-to-day Haskellers not using imperative hash tables.

If you are using hashtables in Haskell, you shouldn't observe any issue now. Here, for example, is a simple hashtable program that inserts 10 million ints into a hash. I'll do the benchmarking, since the original citation doesn't present any code or benchmarks.

import Control.Monad
import qualified Data.HashTable as H
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  m <- H.new (==) H.hashInt
  forM_ [1..size] $ 
 -> H.insert m n n
  v <- H.lookup m 100
  print v

With GHC 6.10.2, before the fix, inserting 10M ints:

$ time ./A 10000000 +RTS -s
...
47s.

With GHC 6.13, after the fix:

./A 10000000 +RTS -s 
...
8s

Increasing the default heap area:

./A +RTS -s -A2G
...
2.3s

Avoiding hashtables and using an IntMap:

import Control.Monad
import Data.List
import qualified Data.IntMap as I
import System.Environment

main = do
  [size] <- fmap (fmap read) getArgs
  let k = foldl' (m n -> I.insert n n m) I.empty [1..size]
  print $ I.lookup 100 k

And we get:

$ time ./A 10000000 +RTS -s        
./A 10000000 +RTS -s
6s

Or, alternatively, using a judy array (which is a Haskell wrapper calling C code through the foreign-function interface):

import Control.Monad
import Data.List
import System.Environment
import qualified Data.Judy as J

main = do
  [size] <- fmap (fmap read) getArgs
  j <- J.new :: IO (J.JudyL Int)
  forM_ [1..size] $ 
 -> J.insert (fromIntegral n) n j
  print =<< J.lookup 100 j

Running this,

$ time ./A 10000000 +RTS -s
...
2.1s

So, as you can see, the GC issue with hashtables is fixed, and there have always been other libraries and data structures which were perfectly suitable. In summary, this is a non-issue.

Note: as of 2013, you should probably just use the hashtables package, which supports a range of mutable hashtables natively.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...