Samuel Williams Monday, 20 April 2009

Python is a great language, and I use it frequently. However, I also use Ruby. One thing that recently became a bit of a problem was hashing. Python has a magic function called hash which works on strings and produces a signed integer hash. Because this was used in a script for storing files in the file-system, I've got a problem when I wanted to write a script in ruby, as it was impossible for me to calculate the Python hash in Ruby.

The basic problem was as follows. Here is the working Python code.

path = "..."
cache_path = hash(path) % 255

Here is the ruby code:

path = "..."
cache_path = ????(path) % 255

Because hash function is specific to Python, this was a bit of a problem. Well, I found a great page explaining the python hash function.

  1. Python Hash Algorithms - Python Implementation of hash
  2. python hash() function - C implementation of hash

The only complicated part in porting this code to Ruby was the signed arithmetic required. Basically, the above code relies on the behaviour of 2s-compliment overflow for 32bit signed integers. Because Ruby doesn't (as far as I know) have facilities for this, I've added in a function unsigned_to_signed which can do the conversion for arbitrary sized integers.

#!/usr/bin/env ruby
 
def unsigned_to_signed(h, mask)
  # h must be within the unsigned range we are dealing with, i.e. [0, mask] inclusive
  h = h % (mask + 1)
  # calculate the maximum positive signed value
  max = (mask / 2) + 1
  # if h is bigger than the maximum unsigned value, we need to wrap it around.
  h > max ? (h % max) - max : h
end
 
def signed_32bit_multiply(a, b)
  unsigned_to_signed((a * b), 0xFFFFFFFF)
end
 
class String
  def python_hash
    return 0 if size == 0
 
    sum = self[0] << 7
    (0...size).each do |i|
      sum = signed_32bit_multiply(1000003, sum) ^ self[i]
    end
 
    sum = sum ^ size
    return sum == -1 ? -2 : sum
  end
end

This is a simple program you can use pick out some random words and compare the hash between Python and Ruby.

def check_hash(s)
  puts " Hashing #{s.dump} ".center(64, "-")
 
  py_val = `python -c 'print hash(#{s.dump})'`.strip.to_i
  puts "Python hash value is: ".rjust(25) + " #{py_val}"
 
  rb_val = s.python_hash
  puts "Our hash value is: ".rjust(25) + " #{rb_val}"
 
  if (py_val != rb_val)
    puts " ERROR HASH VALUE MISMATCH ".center(64, "*")
  end
end
 
dict = File.open("/usr/share/dict/words", "r").readlines
 
(0...10).each do |i|
  word = dict[rand * dict.size-1].strip
 
  check_hash(word)
end

This code will be helpful for what I am trying to achieve, however if you are storing hash functions, make sure you are using standard functions such as MD5, SHA1, etc. This way, logic is easy to port across multiple platforms and code bases.

Comments

Leave a comment

Please note, comments must be formatted using Markdown. Links can be enclosed in angle brackets, e.g. <www.codeotaku.com>.