ngmoco:)


At ngmoco, cultivating an intellectually honest exchange of ideas and dialogue around the development of iPhone games is important to everyone at the company. The purpose of this blog is to highlight what we're learning as a company. In this space, expect to see interviews with leaders in iPhone gamemaking, data analysis and market trends on the iPhone games business, post-mortems, case studies, development techniques and code samples from ngmoco’s games, and regular blog entries on a variety of topics germane to making iPhone games.
Mar 02 2011 :)

Saved by Replicas

There have been a few notes about Redis presharding lately from places such as Craigslist. The one thing these have in common is that you “planned” to go big from the start. However, what if you didn’t? What happens when a little feature becomes a raging success and your Redis servers are getting hammered? Well, if you’re using the Ruby Redis client, here’s how the built-in replicas feature can save your ass.

To start, let’s set up the scenario. You implement a feature for your game/web site that uses Redis sets. You launched with two Redis nodes. Your YAML file looks something like this:

production:
    hosts:
      - redis01:6379
      - redis02:6379

This goes live and it’s a success. Everything is fine. Then it becomes a bigger success and you start to see some timeouts and slowdowns. Crap. You know you need to grow the cluster, but how do you do that without losing existing data? You could create a client that moves data from one node to another while serving up the data, but that’s not good for sets. Crap. Then you notice this method in Redis::HashRing:

    # Adds a `node` to the hash ring (including a number of replicas).
    def add_node(node)
      @nodes << node
      @replicas.times do |i|
        key = Zlib.crc32("#{node}:#{i}")
        @ring[key] = node
        @sorted_keys << key
      end
      @sorted_keys.sort!
    end

Sweet gods, @replicas defaults to 160, so with two nodes there are 320 keys in the ring. The damn client presharded for you. Yeah. Wait, how do I get new nodes to take over existing keys in the ring? Crap. Well, here you go.
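
To make the problem concrete, here is a minimal sketch of a hash ring that uses the same point formula as the method above. SketchRing, node_for and the key names are made up for illustration, and the clockwise lookup is an assumption (the real client's search may differ), but the effect is the same: a third node added the naive way takes over a slice of keys whose data still lives on the old servers.

```ruby
require 'zlib'

# Simplified stand-in for Redis::HashRing (hypothetical, not the gem's code).
class SketchRing
  attr_reader :points

  def initialize(nodes, replicas = 160)
    @replicas = replicas
    @ring = {}      # ring point => node
    @points = []    # sorted ring points
    nodes.each { |node| add_node(node) }
  end

  # Same point formula as the add_node method quoted above.
  def add_node(node)
    @replicas.times do |i|
      point = Zlib.crc32("#{node}:#{i}")
      @ring[point] = node
      @points << point
    end
    @points.sort!
  end

  # Assumed lookup: first point at or after the key's hash, wrapping
  # around to the start of the ring when the hash is past the last point.
  def node_for(key)
    hash = Zlib.crc32(key)
    point = @points.bsearch { |pt| pt >= hash } || @points.first
    @ring[point]
  end
end

# Hypothetical key names, only used to exercise the ring.
keys = (1..1000).map { |n| "user:#{n}:friends" }

two = SketchRing.new(%w[redis01:6379 redis02:6379])
before = keys.map { |k| [k, two.node_for(k)] }.to_h

three = SketchRing.new(%w[redis01:6379 redis02:6379 redis03:6379])
moved = keys.reject { |k| three.node_for(k) == before[k] }

puts "ring points with two nodes: #{two.points.size}"
puts "keys remapped by a naive third node: #{moved.size} of #{keys.size}"
```

Every remapped key lands on the new, empty node, which is exactly the data loss the rest of this post works around.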

Modify the Redis::Client class to accept name and replicas_start attributes.

vendor/gems/redis-1.0.7/lib/redis/client.rb
@@ -155,10 +155,15 @@ class Redis
       @sock = nil
       @pubsub = false
+      @name = options[:name]
+      @replicas_start = options[:replicas_start] || 0
 
       log(self)
     end
 
+    def replicas_start
+      @replicas_start
+    end
+   
     def to_s
+      if @name
+        "Redis Client connected to #{@name} against DB #{@db}"

Modify Redis::Distributed to pass along the name and replicas_start attributes to the client. Also, modify it to pass points_per_server to Redis::HashRing.

vendor/gems/redis-1.0.7/lib/redis/distributed.rb

@@ -13,8 +13,15 @@ class Redis
 
       if opts[:hosts].is_a?(Hash)
         opts[:hosts].each_pair do |k,v|
+          if v.is_a?(String)
           host, port = v.split(':')
           hosts << Client.new(:host => host, :port => port, :db => db, :timeout => timeout, :name => k)
+          elsif v.is_a?(Array)
+            v.each do |client|
+              host, port, replicas_start = client.split(':')
+              hosts << Client.new(:host => host, :port => port, :db => db, :timeout => timeout, :name => k, :replicas_start => replicas_start.to_i)
+            end
+          end
         end
       else
         opts[:hosts].each do |h|
@@ -23,8 +30,12 @@ class Redis
         end
       end
 
+      if opts[:points_per_server]
+        @ring = HashRing.new(hosts, opts[:points_per_server])
+      else
       @ring = HashRing.new hosts
     end
+    end
 
     def inspect
       to_s

Modify Redis::HashRing to use the replicas_start attribute of the client.

vendor/gems/redis-1.0.7/lib/redis/hash_ring.rb
@@ -23,8 +23,9 @@ class Redis
     # Adds a `node` to the hash ring (including a number of replicas).
     def add_node(node)
       @nodes << node
+      start = node.respond_to?(:replicas_start) ? node.replicas_start : 0
       @replicas.times do |i|
-        key = Zlib.crc32("#{node}:#{i}")
+        key = Zlib.crc32("#{node}:#{start + i}")
         @ring[key] = node
         @sorted_keys << key
       end
@@ -33,8 +34,9 @@ class Redis
 
     def remove_node(node)
       @nodes.reject!{|n| n.to_s == node.to_s}
+      start = node.respond_to?(:replicas_start) ? node.replicas_start : 0
       @replicas.times do |i|
-        key = Zlib.crc32("#{node}:#{i}")
+        key = Zlib.crc32("#{node}:#{start + i}")
         @ring.delete(key)
         @sorted_keys.reject! {|k| k == key}
       end

Lastly, modify the YAML file to pass in the correct options.

production:
    points_per_server: 80
    hosts:
      redis01:6379:
        - redis01:6379:0
        - redis03:6379:80
      redis02:6379:
        - redis02:6379:0
        - redis04:6379:80
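
If you want to convince yourself this is safe, the sketch below rebuilds both rings and compares them. FakeClient and build_ring are hypothetical stand-ins, not the gem's classes. It also assumes an unnamed client stringifies its host:port the same way a named client stringifies its name, so the old ring points for redis01 are the same ones that redis01 and redis03 now split between them.

```ruby
require 'zlib'

# Hypothetical stand-in for the patched Redis::Client. to_s depends only on
# the name, so a server and the new server sharing its name hash alike.
FakeClient = Struct.new(:name, :address, :replicas_start) do
  def to_s
    "Redis Client connected to #{name} against DB 0"
  end
end

# Builds { ring point => client } using the patched add_node formula.
def build_ring(clients, points_per_server)
  ring = {}
  clients.each do |client|
    points_per_server.times do |i|
      ring[Zlib.crc32("#{client}:#{client.replicas_start + i}")] = client
    end
  end
  ring
end

# Original deployment: two servers, 160 points each.
old_ring = build_ring([
  FakeClient.new("redis01:6379", "redis01:6379", 0),
  FakeClient.new("redis02:6379", "redis02:6379", 0)
], 160)

# New deployment from the YAML above: four servers, 80 points each.
new_ring = build_ring([
  FakeClient.new("redis01:6379", "redis01:6379", 0),
  FakeClient.new("redis01:6379", "redis03:6379", 80),
  FakeClient.new("redis02:6379", "redis02:6379", 0),
  FakeClient.new("redis02:6379", "redis04:6379", 80)
], 80)

same_points  = old_ring.keys.sort == new_ring.keys.sort
same_lineage = new_ring.all? { |point, client| old_ring[point].name == client.name }
taken_over   = new_ring.values.count { |client| client.address != client.name }

puts "identical ring points: #{same_points}"
puts "every point stays under its original name: #{same_lineage}"
puts "points now served by the new servers: #{taken_over}"
```

The ring itself does not change. Each new server only takes over points that used to belong to the server it shares a name with, which is why copying that server's data to it is enough.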

Now replicate the data to the new servers (redis01 to redis03 and redis02 to redis04) and deploy the code. Bingo. Now get some sleep.

Ashley Martens
Server Engineer

Aug 16 2010 :)

Using Cassandra from Erlang

I spent some time trying to access Cassandra from Erlang and ran into a bit of difficulty. To start, Cassandra uses Thrift for its RPC and wire protocol, and using Thrift from Erlang is very poorly documented. I was eventually able to get it to work with the help of some nice folks in the chat room. In the interest of returning the favor, I’ve put together a little sample application and posted it on GitHub.

Since I was already familiar with using Thrift in Erlang, the real sticking point for me was figuring out how to assemble the rat’s nest of custom record types into working Cassandra requests. Look at the code and README file for details, but I’ll repeat the main roadblocks I hit here.

  • Use framed mode for Cassandra 0.7 or greater
  • Do not use framed mode for Cassandra 0.6 or below
  • The map type in the Cassandra IDL expects an Erlang dict type

-dave

May 05 2010 :)

Loosely related issues

As most of our users know, we have been experiencing server issues with We Rule and Plus+ lately. The server teams have been poring over logs and graphs for the past weeks, trying to figure out where our issues are and implementing fixes for them.

One of the issues we recently “dealt” with was a problem on our NFS server: it was swapping and had really high load. This machine had become unstable but was critical to sharing certain static files between our web servers. We are currently in the process of replacing this machine, but right now we need to keep it running, so we decided to stop all non-production-critical access to the NFS shares on this machine. As soon as one of our analytic processes stopped writing to the share, the machine became much more stable. Great.

A few hours later we found that Plus+ and We Rule were responding really well. In fact our response times were great and had been holding level since we solved the NFS issue. Holy $#!@ that’s awesome but why?

Well, it turns out that nginx was serving up static files directly from NFS, and when NFS was having problems, nginx workers would start blocking as they waited to read the files. This resulted in requests being queued in nginx, which, when the block cleared, would send a large batch to our application processes. This would flood the database with requests, which would block as it worked through them. All this resulted in choppy response times.

Long story short: serving web files directly from NFS is a bad idea. Let’s not do that anymore.

Long live fast Plus+ response times.

- Ashley

Mar 24 2010 :)

When Memcached Doesn’t Scale

Memcached has been one of the greatest tools in the battle to scale web applications, especially in RoR. From write-through query caching to action caching to custom caching... it’s been the bee’s knees. When memcached doesn’t work, because you need persistence for example, an alternative presents itself: Redis.

At some point, however, even the speed of memcached cannot save you from coding issues that were fine with a limited number of processes but cripple your application when the throughput increases. One of our experiences with this was I18n_db. There were two problems with this plugin that we ran into: a per-request check and a database firestorm.

First, there was a check of the last updated_at integer during each request. This should be a super fast call to memcached which would return an integer. No big deal, until you make 1000 requests per second for this value. At this point you have introduced a lock to your distributed, stateless code. You also get a really bad/neat side effect (and the way we [Scott] found the problem) of maxing out the gigabit ethernet controller (900 Mbps). This causes other queries to this memcached server to also be slow, which cascades throughout your application, causing strange slow queries.

Second, if this key was deleted the translations were reloaded by each process from the database. Good idea until you have 800 concurrent requests for the same data from the database. The database becomes completely locked as every connection tries to get this data and update memcached with it.

Solutions:

1) It’s okay for the translations to be a bit stale, so only check for an update once a minute or five.

2) Update the translations in memcached when they change in the database and create a background job to refresh the cached translations once a day or so. Whatever it takes, do not allow either of those memcached keys to be removed from memory.
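
Solution 1 can be sketched roughly like this. It is not the I18n_db code. ThrottledTranslations and its parameters are hypothetical names, and the memcached and database calls are replaced by lambdas so the sketch runs on its own.

```ruby
# Keeps translations in process memory and re-checks the shared version
# stamp at most once per interval. All names here are hypothetical.
class ThrottledTranslations
  def initialize(interval:, clock:, fetch_version:, load_translations:)
    @interval = interval                    # seconds between version checks
    @clock = clock                          # returns the current time in seconds
    @fetch_version = fetch_version          # stands in for the memcached read
    @load_translations = load_translations  # stands in for the reload
    @checked_at = nil
    @version = nil
    @translations = nil
  end

  def translations
    now = @clock.call
    if @checked_at.nil? || now - @checked_at >= @interval
      @checked_at = now
      latest = @fetch_version.call
      if latest != @version
        # Only reload when the version stamp actually changed.
        @version = latest
        @translations = @load_translations.call
      end
    end
    @translations
  end
end

# A fake clock and counters stand in for real time and memcached traffic.
now = 0
version_checks = 0
loads = 0

cache = ThrottledTranslations.new(
  interval: 60,
  clock: -> { now },
  fetch_version: -> { version_checks += 1; 1 },
  load_translations: -> { loads += 1; { "hello" => "bonjour" } }
)

1000.times { cache.translations }  # a burst of requests within one interval
now = 61
cache.translations                 # first request after the interval expires

puts "version checks: #{version_checks}, reloads: #{loads}"
```

With a check like this, each process reads the version stamp once per interval instead of once per request, at the price of translations that can be up to one interval stale.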

-Ashley
