Sidekiq doesn't load Rails code changes automatically, be aware of it

It’s easy for Rails developers to assume that Sidekiq picks up their code changes automatically, because they are so used to Rails’s autoload feature. That assumption is wrong. If you fix a bug in a job’s code but don’t restart the Sidekiq workers, the bug may still be there and you will be left wondering “WHAT”.

Sidekiq doesn’t reload the code because Rails’s autoloading is a development-only feature and is not thread-safe, so it is disabled. That means there’s no autoloading in Sidekiq workers.

Conclusion

Restart the Sidekiq workers whenever you change job code.

Dangerous attribute auto assignment by cancan that Rails developers should be aware of!

Cancan can save you a lot of time, but sometimes it makes you careless enough to produce hard-to-find bugs.

In web development, an application usually has multiple user roles such as admin, customer… Each role has a particular permission on some actions and objects. Permission management has become a well-known problem.

As a Rails developer, I use cancancan, the community-maintained version of cancan, to manage user permissions, and I know a lot of Rails developers out there use it too. It’s a handy gem that is easy to use and provides permission checking on model objects out of the box.

In a controller, cancan lets you load the corresponding resource and check the permissions of current_user (if you are using devise). If current_user doesn’t have permission for a particular action (such as update), it immediately stops the request and does something like a redirect, or whatever you want it to do. With cancan, your code looks more or less like this:

You define users’ permissions in the Ability model:

class Ability
  include CanCan::Ability

  def initialize(user)
    # initialize user object if user is nil
    user ||= User.new

    # admin can do whatever he wants on every object
    if user.admin?
      can :manage, :all
    else
      # other users can manage only their posts
      can :manage, Post, user_id: user.id
    end
  end
end

In your controller, you are likely to load the resource and authorize it:

class PostsController < ApplicationController
  # get the post with params[:id] and assign it to @post
  # after that, check if @post.user_id is equal to current_user.id or not
  # if it is, the code runs normally, otherwise redirect to root or something
  # else you want.
  load_and_authorize_resource :post

  def create
    # @post is created by cancan before
    if @post.save
      redirect_to @post, notice: 'Your post has been created.'
    else
      render :new
    end
  end

  private
  def post_params
    params.require(:post).permit(:text)
  end

end

The cancan mysterious assignment

As you can see in the code above, I didn’t assign the user to that post. Really? This is so wrong. Let’s run the test.

describe PostsController do
  describe "POST create" do
    it "creates a post with current_user associated" do
      user = FactoryGirl.create(:user, role: "customer")
      # log the user in
      sign_in user
      valid_attributes = {
        text: "This is a test post body"
      }
      post :create, post: valid_attributes

      # get the post created inside controller
      created_post = assigns(:post)

      expect(created_post.persisted?).to be true
      # the post's user is expected to be the logged in user
      expect(created_post.user_id).to eq user.id
    end
  end
end

I ran this test and it passed. HAHA. So I didn’t have to assign the user to @post myself; cancan did it transparently. It’s amazing. Is it? Maybe. I felt a little confused and doubtful here.

The problem

After running the test, I thought my code worked properly, but I wanted to check it one last time in the browser. This time I was already signed in as an admin user. I tried to create a post and it could not be saved.

What? Why? Confused, I ran the test repeatedly with stupid print-outs and wondered: why did the test pass when the code doesn’t actually work?

It turned out that user_id was never assigned on @post; it was nil, so the post didn’t pass validation. But wait, user_id is assigned in the test, so why isn’t it here?

How the hell does cancan know that I want to assign user to current_user? What causes this inconsistency?

After a while reading the cancan source code, I found that it automatically assigns some attributes to reasonable values without telling the developer. In my case, it assigns user_id to current_user.id. Why? Because I defined a rule to check current_user’s permission on this @post. The rule says :user_id => user.id, so cancan assumes that I want user_id on @post to be current_user.id, and it’s right. I do want that.

But that rule applies only when current_user is not an admin. In the other case, where current_user is an admin, cancan does no pre-assignment, so user_id is never set. The difference is hard for developers to notice because cancan does it silently.

Test against that problem

To write a failing test for this, you have to test the action for both the admin and customer roles, like this:

describe PostsController do
  describe "POST create" do
    context "customer" do
      it "creates a post with current_user associated" do
        user = FactoryGirl.create(:user, role: "customer")
        test_post_create(user)
      end
    end

    context "admin" do
      it "creates a post with current_user associated" do
        user = FactoryGirl.create(:user, role: "admin")
        test_post_create(user)
      end
    end

    def test_post_create(user)
      # log the user in
      sign_in user
      valid_attributes = {
        text: "This is a test post body"
      }
      post :create, post: valid_attributes

      # get the post created inside controller
      created_post = assigns(:post)

      expect(created_post.persisted?).to be true
      # the post's user is expected to be the logged in user
      expect(created_post.user_id).to eq user.id
    end
  end
end

Conclusion

cancan might pre-assign the attributes it needs for permission checking to the values it compares against. If you don’t assign them yourself, cancan might assign them, or it might not, so your test might pass or fail depending on the role.
Always assign object attributes to the expected values yourself instead of depending on cancan.
It’s good to test the same action with different user roles.

Translate IP to Country using Maxmind database (reinvent the wheel)

This blog post explains the algorithm I used in the previous post.

The algorithm determines which country an IP comes from.

Input: an IPv4 string (e.g. 199.191.56.130)

Output: country name (e.g. United States)

Existing libraries

Why did I have to “reinvent the wheel”?

They are all good at translating an IP to a country, and they provide easy-to-use APIs. However, I didn’t find a way to use them efficiently with Apache Spark. I had to import them whenever I wanted to process something (e.g. a line from a log file), and the repeated library initialization costs too much performance.

For that reason, I needed to write my own piece of code to translate an IP to a country. It should be simple, fast and able to be broadcast by Apache Spark.

MaxMind Database

There’s no direct algorithm to determine which country an IP comes from. I have to use a database that maps IP ranges to countries, find the range the IP belongs to, and read the country from it.

MaxMind provides exactly this kind of database: simple, accurate and frequently updated. They offer GeoIP2 and GeoIP Legacy databases in binary and CSV formats. I used the GeoIP2 CSV version because of its simplicity and readability.

Download the database

You can download it from here, please choose GeoLite Country.

Database format

In this database, there’s only one CSV file containing many lines. Each line is a mapping from IP range to country and it looks like this:

"2.17.252.0","2.17.255.255","34733056","34734079","DE","Germany"

It starts with two IPs that define the IP range, followed by two numbers that are the same two IPs in integer form. It ends with the country code and country name. So that line means every IP from 2.17.252.0 to 2.17.255.255 comes from Germany. Equivalently, every IP whose integer form is between 34733056 and 34734079 is in Germany.

Working with integers is simpler and faster than working with strings, so I decided to use the integer form of the IP and kept only:

"34733056" "34734079" "Germany"

to define an IP range.
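As a minimal sketch of that reduction (written for Python 3; the name `parse_range` is only for illustration and is not part of the original code), the sample line above can be read with the csv module and trimmed to the three fields that matter:

```python
import csv
import io

# The sample line from the database, held in memory instead of a file.
SAMPLE = '"2.17.252.0","2.17.255.255","34733056","34734079","DE","Germany"\n'

def parse_range(row):
    """Keep only (start_int, end_int, country_name) from one CSV row."""
    return (int(row[2]), int(row[3]), row[5])

reader = csv.reader(io.StringIO(SAMPLE))
ranges = [parse_range(row) for row in reader]
print(ranges)
```

Columns 0, 1 and 4 (the dotted IPs and the country code) are simply dropped, since the lookup only needs the integer bounds and the country name.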

Our problem

Given a specific IP, our problem now becomes finding the range that the IP belongs to. So if “2.17.252.10” is our IP, the algorithm should find that it belongs to ["2.17.252.0", "2.17.255.255"], so it comes from “Germany”.

The algorithm

Because I decided to represent IP addresses as integers, I need a way to convert an IP address to an integer. The formula is pretty simple:

Integer for o1.o2.o3.o4: 16777216 * o1 + 65536 * o2 + 256 * o3 + o4
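The multipliers are just powers of 256 (256^3 = 16777216 and 256^2 = 65536), one per octet. A small Python 3 sketch of the formula, checked against the range boundaries from the sample database line (the function name `ip_to_int` is an illustrative choice):

```python
def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer form."""
    o1, o2, o3, o4 = (int(part) for part in ip.split("."))
    return 16777216 * o1 + 65536 * o2 + 256 * o3 + o4

print(ip_to_int("2.17.252.0"))    # 34733056
print(ip_to_int("2.17.255.255"))  # 34734079
```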

Next, I have to determine which range that integer x belongs to.

A simple way is to loop through every given range [a, b]. If a <= x <= b, that range is the result and I know the corresponding country name. This algorithm is simple but slow, because its lookup complexity is O(n), where n is the number of ranges (about 100k in the database).

That’s too slow for me. I looked at the database and guessed that the ranges don’t overlap. I checked whether I was right with this piece of code:

>>> import csv
>>> ranges = []
>>> with open('GeoIPCountryWhois.csv', 'rb') as file:
...     reader = csv.reader(file)
...     for row in reader:
...         ranges.append((int(row[2]), int(row[3]), row[5]))
...
>>> ranges.sort(key=lambda e: e[0])
>>> overlapped = False
>>> for i in range(0, len(ranges)):
...     if i > 0 and ranges[i-1][1] > ranges[i][0]:
...         overlapped = True
...         break
...
>>> overlapped
False

I was right. That means if I sort the ranges [ai, bi] by ai (for i in 1..n), I have

a1 <= b1 < a2 <= b2 < a3 <= b3 < ... < an <= bn.

If I find a range that contains the IP, no other range contains it.

Take the range [ai, bi] where ai is the greatest starting value that is smaller than or equal to x among a1, a2, …, an. If x <= bi, x belongs to that range; otherwise x doesn’t belong to any range.

To find the greatest number that is smaller than or equal to x, I can use binary search, which is much faster than the previous algorithm, with O(log n) complexity.

Python provides the bisect library with a binary search algorithm built in, so I didn’t have to implement it myself.
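To see how bisect behaves before reading the full implementation, here is a small Python 3 sketch on a made-up set of three ranges. The ranges, country labels and the name `find_country` are hypothetical and only illustrate the idea:

```python
import bisect

# Hypothetical, tiny set of sorted, non-overlapping ranges: (start, end, country).
ranges = [(10, 19, "A"), (30, 39, "B"), (50, 50, "C")]
keys = [r[0] for r in ranges]

def find_country(x):
    # bisect.bisect returns the insertion point to the right of any key equal
    # to x, so keys[index - 1] is the greatest starting value <= x.
    index = bisect.bisect(keys, x)
    if index == 0:
        return "Unknown"
    start, end, country = ranges[index - 1]
    return country if x <= end else "Unknown"

for x in (5, 10, 19, 25, 50, 51):
    print(x, find_country(x))
```

The value 25 falls in the gap between two ranges and 5 lies before the first one, so both come back as "Unknown", while both boundaries of a range (10 and 19) are treated as inside it.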

Implementation

This is my code using the algorithm above:

import bisect
import csv

def ip_to_country(ip):
  # collection of ranges, each range is a tuple starting with
  # (starting ip, ending ip, country)
  ranges = []
  # array of starting ip for searching
  keys = []

  # read from database to collect ip ranges
  with open('GeoIPCountryWhois.csv', 'rb') as f:
      reader = csv.reader(f)
      for row in reader:
          ip_start = int(row[2])
          ip_end = int(row[3])
          ranges.append((ip_start, ip_end, row[5], row[0], row[1]))
          keys.append(ip_start)

  # convert IP to integer
  (o1, o2, o3, o4) = ip.split('.')
  i_ip = 16777216 * int(o1) + 65536 * int(o2) + 256 * int(o3) + int(o4)

  result = None

  # I don't have to sort the ranges because they are already sorted in the
  # MaxMind database

  # use bisect to find the position where every key before it is smaller than
  # or equal to i_ip
  index = bisect.bisect(keys, i_ip)

  # if the position is greater than 0, the key right before it is the greatest
  # one that is smaller than or equal to i_ip
  if index > 0:

      # get that range
      found_range = ranges[index - 1]

      # if ending ip is greater or equal to i_ip, return the country name
      if found_range[1] >= i_ip:
          result = found_range[2]

      # otherwise, the IP doesn't belong to any ranges
      else:
          result = "Unknown"
  # if the position is 0, the IP doesn't belong to any ranges
  else:
      result = "Unknown"

  return result

Conclusion

This implementation is simple, fast and easy to understand. Most importantly, it can be used effectively with Apache Spark by broadcasting the ranges and keys arrays.

I used this implementation with Apache Spark to process a 5 GB log file in 110s. It took an hour to process the same file using existing libraries.

Geographic report with Apache Spark from Nginx access log, experience from a newbie

There are many blog posts and articles around the web saying that Spark is blazing fast because of in-memory distributed processing. I wanted to see that for myself, so I ran some experiments.

Note: This might be a long post, and I’m sorry for not getting straight to the point. I just want to write it down so that in a couple of months I can remember what I did.

Experiment description

In this experiment, I will try a very well-known log mining task: generating a geographic report from an Nginx access log.

Environment specifications:

2.3 GHz Intel Core i7 (4 cores)
16 GB 1600 MHz DDR3
50 MB access log file

How do we do this experiment

Apache Spark provides APIs in Java, Scala and Python; I chose Python for its readability.

Write a regular Python program and see how long it takes to generate the report
Run the same task with Apache Spark and see how many times faster it is compared to the first one.

Here we go

Now it’s time for us to start doing the experiment.

How do we generate the report

It’s pretty straightforward: we read the log line by line, determine which country each request comes from, and increase that country’s counter. The result is a dictionary with country names as keys and numbers of requests as values. I think it looks pretty much like this:

  {
    'Canada': 123,
    'Unknown': 123,
    'Lithuania': 123,
    'Argentina': 123,
    'Bahrain': 123,
    'Saudi Arabia': 123,
    ...
  }
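The counting step on its own can be sketched in Python 3 as follows. The log lines and the `FAKE_COUNTRIES` table are made up for illustration, and `country_of` is a hypothetical stand-in for whichever IP-to-country translator is used:

```python
from collections import Counter

# Hypothetical lookup table standing in for a real IP-to-country translator.
FAKE_COUNTRIES = {"1.1.1.1": "Canada", "2.2.2.2": "Argentina"}

def country_of(ip):
    return FAKE_COUNTRIES.get(ip, "Unknown")

def count_countries(lines):
    """Count requests per country; the IP is the first space-separated field."""
    counts = Counter()
    for line in lines:
        ip = line.split(" ")[0]
        counts[country_of(ip)] += 1
    return dict(counts)

# Made-up access log lines, only the leading IP matters here.
sample_log = [
    '1.1.1.1 - - [02/Mar/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 612',
    '2.2.2.2 - - [02/Mar/2015:10:00:01 +0000] "GET /a HTTP/1.1" 200 100',
    '1.1.1.1 - - [02/Mar/2015:10:00:02 +0000] "GET /b HTTP/1.1" 404 0',
    '9.9.9.9 - - [02/Mar/2015:10:00:03 +0000] "GET /c HTTP/1.1" 200 5',
]
print(count_countries(sample_log))
```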

Regular program

I decided to use the python-geoip package to translate an IP to a country name.

You can install the package using pip like this:

pip install python-geoip
pip install python-geoip-geolite2

This is my regular python program.

import datetime
from geoip import geolite2

begining = datetime.datetime.now()
if __name__ == "__main__":
    ips = {}
    for line in open('access.log'):
        # getting ip from log line, it's the first non-space string
        ip = line.split(' ')[0]
        match = geolite2.lookup(ip)
        if match is not None:
            ips[match.country] = (ips.get(match.country) or 0) + 1
        else:
            ips["Unknown"] = (ips.get("Unknown") or 0) + 1
    ending = datetime.datetime.now()
    print ips
    print "processed in: %s" % (ending - begining)

After running this script, we got:

{'BD': 695, 'BE': 18171, 'BG': 139, 'BA': 27, 'BL': 21, 'Unknown': 4723, 'BN': 2, 'JP': 2140, 'BI': 1, 'JM': 6, 'JO': 5, 'BQ': 71, 'BR': 1569, 'BY': 67, 'RU': 487, 'RS': 88, 'RO': 817, 'GT': 3, 'GR': 64, 'BH': 1, 'GY': 30, 'GE': 4, 'GB': 13164, 'GL': 1, 'GI': 19, 'GH': 2, 'OM': 4, 'TN': 1, 'BW': 22, 'HR': 4, 'HT': 4, 'HU': 65, 'HK': 112, 'HN': 1, 'PS': 65, 'PT': 267, 'PY': 23, 'PA': 23, 'PE': 6, 'PK': 810, 'PH': 498, 'PL': 406, 'ZM': 54, 'EE': 61, 'EG': 111, 'ZA': 29501, 'EC': 97, 'IT': 579, 'AO': 16, 'ZW': 57, 'SA': 87, 'ES': 355, 'MD': 52, 'MA': 54, 'MO': 31, 'MK': 9, 'UA': 817, 'MX': 130, 'IL': 361, 'FR': 3720, 'FI': 58, 'FJ': 2, 'NL': 6867, 'NO': 494, 'NA': 44, 'NG': 3, 'NZ': 2789, 'CH': 590, 'CO': 152, 'CN': 3713, 'CL': 63, 'CA': 14187, 'CD': 2, 'CZ': 175, 'CR': 48, 'CW': 1227, 'KE': 5, 'SR': 1, 'SK': 24, 'KR': 86, 'SI': 31, 'SN': 21, 'SC': 51, 'KZ': 3, None: 1169, 'SG': 616, 'SE': 1366, 'DO': 2, 'DK': 923, 'DE': 5865, 'DZ': 56, 'US': 49443, 'LB': 1, 'TW': 87, 'TT': 2, 'TR': 153, 'LK': 50, 'LI': 1, 'LV': 62, 'LT': 70, 'LU': 30, 'TH': 270, 'AE': 41, 'VE': 53, 'IQ': 45, 'IS': 2, 'IR': 21, 'AM': 3, 'AL': 5, 'VN': 20978, 'AR': 46, 'AU': 2232, 'AT': 337, 'IN': 9469, 'IE': 73, 'ID': 329, 'MY': 8, 'QA': 1}
processed in: 0:00:58.259646

As you can see, it took about 58s to analyze a 50 MB log file. Hmm, pretty slow, but whatever, I thought, using Spark would make it blazing fast.

The program using Apache Spark

The code:

import sys

from pyspark import SparkContext

def get_country_from_line(line):
    try:
        from geoip import geolite2
        ip = line.split(' ')[0]
        match = geolite2.lookup(ip)
        if match is not None:
            return match.country
        else:
            return "Unknown"
    except IndexError:
        return "Error"

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print ips
    sc.stop()

You may find it weird that I import geolite2 inside a function, but it is necessary to make this code run in Spark, because Spark ships the function get_country_from_line to its workers, which otherwise wouldn’t know what geolite2 is. Weird, right?

The result is:

Job 0 finished: countByValue at /Users/victor/Downloads/spark-1.2.1-bin-hadoop2.4/examples/src/main/python/access_log_analyzer.py:20,
took 33.927496 s
defaultdict(<type 'int'>, {'BD': 695, 'BE': 18171, 'BG': 139, 'BA': 27, 'BL': 21, 'Unknown': 4723, 'BN': 2, 'JP': 2140, 'BI': 1, 'JM': 6, 'JO': 5, 'BQ': 71, 'BR': 1569, 'BY': 67, 'RU': 487, 'RS': 88, 'RO': 817, 'GT': 3, 'GR': 64, 'BH': 1, 'GY': 30, 'GE': 4, 'GB': 13164, 'GL': 1, 'GI': 19, 'GH': 2, 'OM': 4, 'BW': 22, None: 1169, 'HR': 4, 'HT': 4, 'HU': 65, 'HK': 112, 'HN': 1, 'PS': 65, 'PT': 267, 'PY': 23, 'PA': 23, 'PE': 6, 'PK': 810, 'PH': 498, 'PL': 406, 'ZM': 54, 'EE': 61, 'EG': 111, 'ZA': 29501, 'EC': 97, 'AL': 5, 'VN': 20978, 'ZW': 57, 'ES': 355, 'MD': 52, 'MA': 54, 'MO': 31, 'US': 49443, 'UA': 817, 'MX': 130, 'IL': 361, 'FR': 3720, 'FI': 58, 'FJ': 2, 'NL': 6867, 'NO': 494, 'NA': 44, 'NG': 3, 'NZ': 2789, 'CH': 590, 'CO': 152, 'CN': 3713, 'CL': 63, 'CA': 14187, 'CD': 2, 'CZ': 175, 'CR': 48, 'CW': 1227, 'KE': 5, 'SR': 1, 'SK': 24, 'KR': 86, 'SI': 31, 'SN': 21, 'SC': 51, 'KZ': 3, 'SA': 87, 'SG': 616, 'SE': 1366, 'DO': 2, 'DK': 923, 'DE': 5865, 'AT': 337, 'DZ': 56, 'MK': 9, 'LB': 1, 'TW': 87, 'TT': 2, 'TR': 153, 'LK': 50, 'LI': 1, 'TN': 1, 'LT': 70, 'LU': 30, 'TH': 270, 'AE': 41, 'VE': 53, 'IQ': 45, 'IS': 2, 'IR': 21, 'AM': 3, 'IT': 579, 'AO': 16, 'AR': 46, 'AU': 2232, 'LV': 62, 'IN': 9469, 'IE': 73, 'ID': 329, 'MY': 8, 'QA': 1})

Wow wow wow, WTF, 33.9s. Not impressed at all, how slow!

What is the problem?

The problem is that I was right to feel weird about importing geolite2 inside a function. That function is not just shipped to the workers once; it is called every time the program processes a log line. Thus we ran this stupid import far too many times, which slowed the program down significantly.

How do we fix this

Spark provides a way to broadcast variables to its workers so that the workers don’t need to reload those variables. I tried to broadcast geolite2, but it doesn’t work: geolite2 is a package that references threads and other complex things internally, so Spark can’t handle it.
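One likely reason, offered here as an assumption rather than something verified against Spark's source: PySpark serializes broadcast values with pickle, and objects that hold locks, threads or open files generally cannot be pickled. The Python 3 sketch below uses only the standard library to show the difference between plain data and such an object. `FakeGeoDatabase` is a hypothetical stand-in, not the real geolite2 object.

```python
import pickle
import threading

class FakeGeoDatabase:
    """Hypothetical stand-in for an object that keeps a lock internally."""
    def __init__(self):
        self._lock = threading.Lock()

def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

plain_ranges = [(34733056, 34734079, "Germany")]
print(can_pickle(plain_ranges))       # plain tuples and lists pickle fine
print(can_pickle(FakeGeoDatabase()))  # the lock inside makes pickling fail
```

Plain lists of tuples, like the ranges and keys arrays, serialize without trouble, which is why a hand-written lookup table is easy to ship to workers.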

If you find a way to broadcast geolite2, or to make the program faster while still using geolite2, please leave a comment. I would really appreciate it.

Eventually I concluded that to fix this, I would have to implement my own IP to country translator that can be broadcast by Spark.

Fix and Improve the Spark version

I came up with this code:

import sys
import csv
import bisect

from pyspark import SparkContext

ranges = []
keys = []

with open('/Users/victor/Downloads/GeoIPCountryWhois.csv', 'rb') as f:
  reader = csv.reader(f)
  for row in reader:
      ip_start = int(row[2])
      ip_end = int(row[3])
      ranges.append((ip_start, ip_end, row[5], row[0], row[1]))
      keys.append(ip_start)

def lookup(ip, r, k):
    (o1, o2, o3, o4) = ip.split('.')
    i_ip = 16777216 * int(o1) + 65536 * int(o2) + 256 * int(o3) + int(o4)

    index = bisect.bisect(k, i_ip)
    result = None
    if index > 0:
        index -= 1
        found_range = r[index]

        if found_range[1] >= i_ip:
            result = found_range[2]
        else:
            result = "Unknown"
    else:
        result = "Unknown"
    return result

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    ip_to_country_db = sc.broadcast(ranges)
    ip_to_country_keys = sc.broadcast(keys)
    def get_country_from_line(line):
        try:
            ip = line.split(' ')[0]
            match = lookup(ip, ip_to_country_db.value, ip_to_country_keys.value)
            return match
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print ips
    sc.stop()

This code might be confusing, but I’m not going to explain it here because the post is already too long. I will write a separate post to explain it; it contains a few small algorithmic tricks that might interest you.

And now, the result is:

Job 0 finished: countByValue at /Users/victor/Downloads/spark-1.2.1-bin-hadoop2.4/examples/src/main/python/access_log_analyzer_with_custom_translator.py:49,
took 2.267526 s

Hah, bingo, 2.26s. Pretty fast.

Bigger data

I ran this improved version on a bigger file, 5 GB, and it took about 119.2s. The previous version took over an hour. The regular program took forever; I couldn’t wait for it to finish.

Conclusion

Spark helps solve this kind of problem much faster by parallelizing the processing. However, we have to be aware of how variables are shipped to workers to make it efficient; otherwise it will still be slow.

The regular program needs 58s to analyze a 50 MB log file.
The bad Spark version needs 33s to analyze a 50 MB log file.
The good Spark version needs 2.26s to analyze a 50 MB log file.
The bad Spark version needs an hour to analyze a 5 GB log file.
The good Spark version needs 119.2s to analyze a 5 GB log file.


Using Vim to convert Jade html to Slim

For Rails developers, dealing with the erb template engine can be messy and uncomfortable, so they often choose another template engine such as haml or slim. Designers, however, often work in a Node.js environment and tend to use the jade template engine. Vim and a little regex substitution can help you convert a jade file to a slim file really easily.

Differences between jade and slim we can deal with

Slim doesn’t use , (comma) to separate attributes as jade does; it uses spaces.

Slim prefers attributes to be defined after the id and classes.

Slim way:

div#id.classes data-attr1="something" data-attr2="something"

Jade can be like this:

div#id(data-attr1="something", data-attr2="something").classes

Jade allows non-value attributes, Slim doesn’t

Slim way:
```
div#id.classes data-attr1="true"
```
Jade way:
```
div#id.classes data-attr1
```

Using Vim to turn a Jade file to Slim

1. Move all classes and ids in front of the attributes

```
:%s/\((.\+)\)\(\.[^ ]*\)/\2 \1/g
```

2. Remove commas between attributes

```
:%s/\(".\{-}"\|data-\w\+\),/\1 /g
```

3. Remove parentheses around attributes

```
:%s/(\(.\{-}=".\{-}"\(\s\+.\{-}=*\(".\{-}"\)*\)*\))/ \1/g
```
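To make the first two substitutions easier to follow, here is a rough Python 3 translation using the `re` module. This is a sketch, not the author's script: Python's regex syntax differs from Vim's (for example `.*?` instead of `.\{-}`), and the function names are made up for illustration.

```python
import re

def classes_before_attributes(line):
    # Step 1: move ".class" suffixes in front of the "(...)" attribute group.
    return re.sub(r'(\(.+\))(\.[^ ]*)', r'\2 \1', line)

def remove_commas(line):
    # Step 2: replace the comma after a quoted value (or a bare data-* name)
    # with a space.
    return re.sub(r'(".*?"|data-\w+),', r'\1 ', line)

print(classes_before_attributes('div(style="width: 640px; margin: 20px auto").cl-wr'))
print(remove_commas('a(href="index.html", target="_blank") Recommended by'))
```

The comma is replaced by a space while the space that followed it stays, which is why the converted output ends up with two spaces between attributes.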

Result

From this:

link(rel="stylesheet", href="css/ads.css")

div(style="width: 640px; margin: 20px auto").cl-wr
  .cl-ab
    a(href="index.html", target="_blank") Recommended by
  ul.cl-awi.cl-gr-3
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/1.jpg)")
        h4.cl-t [MWC 2015] HTC RE có phụ kiện mới: Đế khủng long, gậy tự sướng, cáp cứng...
        p.cl-s Tinh tế
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/2.jpg)")
        h4.cl-t [MWC 2015] Trên tay Galaxy S6 Edge màn hình cong 2 phía và S6 thường: nhẹ, đẹp, cao cấp
        p.cl-s Tinh tế
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/3.jpg)")
        h4.cl-t [MWC 2015] Trên tay HTC One M9: Máy đẹp, cấu hình tốt, phần mềm ngon
        p.cl-s Tinh tế
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/4.jpg)")
        h4.cl-t [MWC 2015] HTC ra mắt Grip - vòng đeo tay theo dõi sức khoẻ, màn hình cong, có GPS, 199$
        p.cl-s Tinh tế
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/5.jpg)")
        h4.cl-t [MWC 2015] Đang tường thuật sự kiện ra mắt Samsung Galaxy S6 / S6 Edge
        p.cl-s Tinh tế
    li
      a(href="", target="_blank")
        .cl-i(style="background-image: url(images/ads/6.jpg)")
        h4.cl-t Rò rỉ hình ảnh Samsung Galaxy S6 Edge
        p.cl-s Tinh tế
  .cl-ab
    a(href="index.html", target="_blank") Recommended by

to this:

link rel="stylesheet"  href="css/ads.css"

div.cl-wr  style="width: 640px; margin: 20px auto"
  .cl-ab
    a href="index.html"  target="_blank" Recommended by
  ul.cl-awi.cl-gr-3
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/1.jpg)"
        h4.cl-t [MWC 2015] HTC RE có phụ kiện mới: Đế khủng long, gậy tự sướng, cáp cứng...
        p.cl-s Tinh tế
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/2.jpg)"
        h4.cl-t [MWC 2015] Trên tay Galaxy S6 Edge màn hình cong 2 phía và S6 thường: nhẹ, đẹp, cao cấp
        p.cl-s Tinh tế
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/3.jpg)"
        h4.cl-t [MWC 2015] Trên tay HTC One M9: Máy đẹp, cấu hình tốt, phần mềm ngon
        p.cl-s Tinh tế
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/4.jpg)"
        h4.cl-t [MWC 2015] HTC ra mắt Grip - vòng đeo tay theo dõi sức khoẻ, màn hình cong, có GPS, 199$
        p.cl-s Tinh tế
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/5.jpg)"
        h4.cl-t [MWC 2015] Đang tường thuật sự kiện ra mắt Samsung Galaxy S6 / S6 Edge
        p.cl-s Tinh tế
    li
      a href=""  target="_blank"
        .cl-i style="background-image: url(images/ads/6.jpg)"
        h4.cl-t Rò rỉ hình ảnh Samsung Galaxy S6 Edge
        p.cl-s Tinh tế
  .cl-ab
    a href="index.html"  target="_blank" Recommended by

Making it a script for reuse

Open a new file to write the script in. You can also open the Vim command history to collect the commands we used.

The file would look like this:

    function! JadeToSlimFunction()
      silent! :%s/\((.\+)\)\(\.[^ ]*\)/\2 \1/g
      silent! %s/\(".\{-}"\|data-\w\+\),/\1 /g
      silent! %s/(\(.\{-}=".\{-}"\(\s\+.\{-}=*\(".\{-}"\)*\)*\))/ \1/g
      silent! normal gg
    endfunction

    command! JadeToSlim call JadeToSlimFunction()

You can use this script on a Jade file to turn it to Slim with just:

  :JadeToSlim

Conclusion

It’s always cool and fun to play with text, especially with Vim. If you have heard of Vim but still don’t dare to use it, be brave and learn it. Once you have Vim in your inventory, you will forget all other editors.

Weird thing about the Rails date API

Rails has excellent support for time zones, which makes working with time convenient for developers. Two of the most well-known date APIs are Date.tomorrow and Date.yesterday.

About a month ago, I ran into a problem where I had to determine whether a particular point in time belongs to yesterday, today, or tomorrow. At the time, I reached for the three APIs Date.yesterday, Date.today and Date.tomorrow without a doubt. They should work just as they sound. However, they actually don’t!

Tomorrow can be equal to Today!

It’s weird to see Date.tomorrow equal to Date.today, which makes no sense, but it’s true. Open your rails c:

# I set the time zone to Pacific Time (US & Canada) in the Rails config file
# but my system time zone is UTC+7

2.1.3 :001 > Time.zone
 => #<ActiveSupport::TimeZone:0x007f8446262380 @name="Pacific Time (US & Canada)", @utc_offset=nil, @tzinfo=#<TZInfo::TimezoneProxy: America/Los_Angeles>, @current_period=nil>
2.1.3 :002 > Date.today
 => Mon, 02 Mar 2015
2.1.3 :003 > Date.current
 => Sun, 01 Mar 2015
2.1.3 :004 > Date.yesterday
 => Sat, 28 Feb 2015
2.1.3 :005 > Date.tomorrow
 => Mon, 02 Mar 2015

As you can see, Date.today and Date.tomorrow are equal. Why? It turns out that Ruby itself already has a Date.today API, and the Rails team didn’t want to redefine that method, so they created a new method called Date.current. As a result, Date.current, Date.yesterday and Date.tomorrow are consistent with each other, but Date.today is not.

Date.today doesn’t respect the Rails time zone configuration; it uses the system time zone.

Date.current, on the other hand, is defined by Rails, so it uses the Rails time zone.
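The same effect can be reproduced outside Rails. The Python 3 sketch below is only an analogy, not Rails code: it takes one fixed instant and asks what "today" is in two zones, using fixed UTC offsets as stand-ins for the configured application zone and the system zone from the console session above.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the example deterministic; real zones also have DST rules.
PACIFIC = timezone(timedelta(hours=-8))   # stands in for the configured app zone
INDOCHINA = timezone(timedelta(hours=7))  # stands in for the system zone (UTC+7)

# One instant in time: 1 March 2015, 18:00 UTC.
instant = datetime(2015, 3, 1, 18, 0, tzinfo=timezone.utc)

app_today = instant.astimezone(PACIFIC).date()       # 2015-03-01
system_today = instant.astimezone(INDOCHINA).date()  # 2015-03-02
app_tomorrow = app_today + timedelta(days=1)         # 2015-03-02

print(app_today, system_today, app_tomorrow)
print(system_today == app_tomorrow)  # True
```

"Today" computed in the system zone coincides with "tomorrow" computed in the application zone, which mirrors what the console session shows.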

Conclusion

When working with Rails, always use Date.current together with Date.yesterday and Date.tomorrow instead of Date.today.

Some common Linux kernel tuning for a Linux server

Using an AWS EC2 Linux server gives you full control of the server; however, you usually have to tune the kernel so the server can handle many network connections and open files. These are some common tunings based on Bryan Veal and Annie Foong (Performance Scalability of a Multi-Core Server, Nov 2007, page 4/10).

Add all the following lines to the file /etc/sysctl.conf:

  fs.file-max = 5000000
  net.core.netdev_max_backlog = 400000
  net.core.optmem_max = 10000000
  net.core.rmem_default = 10000000
  net.core.rmem_max = 10000000
  net.core.somaxconn = 100000
  net.core.wmem_default = 10000000
  net.core.wmem_max = 10000000
  net.ipv4.conf.all.rp_filter = 1
  net.ipv4.conf.default.rp_filter = 1
  net.ipv4.tcp_congestion_control = bic
  net.ipv4.tcp_ecn = 0
  net.ipv4.tcp_max_syn_backlog = 12000
  net.ipv4.tcp_max_tw_buckets = 2000000
  net.ipv4.tcp_mem = 30000000 30000000 30000000
  net.ipv4.tcp_rmem = 30000000 30000000 30000000
  net.ipv4.tcp_sack = 1
  net.ipv4.tcp_syncookies = 0
  net.ipv4.tcp_timestamps = 1
  net.ipv4.tcp_wmem = 30000000 30000000 30000000
  #
  # Optionally, avoid TIME_WAIT states on localhost no-HTTP Keep-Alive tests:
  # “error: connect() failed: Cannot assign requested address (99)”
  # On Linux, the 2MSL time is hardcoded to 60 seconds in /include/net/tcp.h:
  # define TCP_TIMEWAIT_LEN (60*HZ). This option is safe to use in production.
  #
  net.ipv4.tcp_tw_reuse = 1
  #
  # WARNING:
  # --------
  # The option below lets you reduce TIME_WAITs by several orders of magnitude
  # but this option is for benchmarks, NOT for production servers (NAT issues)
  # So, uncomment the line below if you know what you’re doing.
  #
  #net.ipv4.tcp_tw_recycle = 1
  #
  net.ipv4.ip_local_port_range = 1024 65535
  net.ipv4.ip_forward = 0
  net.ipv4.tcp_dsack = 0
  net.ipv4.tcp_fack = 0
  net.ipv4.tcp_fin_timeout = 30
  net.ipv4.tcp_orphan_retries = 0
  net.ipv4.tcp_keepalive_time = 120
  net.ipv4.tcp_keepalive_probes = 3
  net.ipv4.tcp_keepalive_intvl = 10
  net.ipv4.tcp_retries2 = 15
  net.ipv4.tcp_retries1 = 3
  net.ipv4.tcp_synack_retries = 5
  net.ipv4.tcp_syn_retries = 5
  net.ipv4.tcp_moderate_rcvbuf = 1
  kernel.sysrq = 0
  kernel.shmmax = 67108864
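Settings in /etc/sysctl.conf are normally loaded at boot, or on demand with `sysctl -p`. To check what the running kernel currently uses, the values can be read back from /proc/sys. A small Python 3 sketch, assuming a Linux host; the helper names are made up, and the simple dot-to-slash mapping does not handle keys whose components themselves contain dots:

```python
def sysctl_path(key):
    """Map a sysctl key such as 'net.core.somaxconn' to its /proc/sys path."""
    return "/proc/sys/" + key.replace(".", "/")

def read_sysctl(key):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        with open(sysctl_path(key)) as f:
            return f.read().strip()
    except OSError:
        return None

for key in ("fs.file-max", "net.core.somaxconn", "net.ipv4.tcp_fin_timeout"):
    print(key, "=", read_sysctl(key))
```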

Then also add the following 2 lines to the file /etc/security/limits.conf:

* soft nofile 1000000
* hard nofile 1000000
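The new limits usually apply only to sessions started after the change, so it is worth checking them from a fresh login. A minimal Python 3 sketch for Unix-like systems that prints the open-file limits seen by the current process:

```python
import resource

# RLIMIT_NOFILE is the maximum number of open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft nofile limit:", soft)
print("hard nofile limit:", hard)
```

Running `ulimit -n` in the shell reports the same soft limit.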

Happy Lunar New Year, the Goat Year, my year!

Happy Lunar New Year from Vietnam. I wish you all a happy new year, full of luck and success.

DONT WORRY, BE HAPPY!

Hanging out with my friends before Tet holiday

I’m in the middle of 9 days off work for the Lunar New Year holiday. It was so good to hang out with my friends. We talked about everyone’s jobs and discussed Vietnamese personal income tax when transferring money from the US to Vietnam. It’s good to know that we can receive an unlimited amount of dollar income from the US.

Hanging out with friends

From left to right:

Thanh: BIDV employee
Thao: Finance student
Thiem: My co-worker
Me
Lan Anh: Accounting student
Trang: little sister
Quang: Auditor
Viet: Market Researcher

lion-attr, a handy gem for manipulating Mongoid document attribute values with Redis

At ClickLion, we use Redis heavily to track all kinds of stats in real time, such as the pageviews of an article, the number of requests handled for a particular session, the number of actions users have made during the session, and many more metrics.

Interacting with Redis is simple but involves a lot of repeated steps. To update a value, we basically have to do the following:

make connection to Redis
decide how to store the stats (key or hash)
choose a key to store that value
update the value
close the connection

For fetching the value, the steps might be:

make connection to Redis
remember corresponding key of a specific one or more stats
get those values from Redis
close the connection

Working through all of those steps every time you want to update or fetch a stat is not a good idea, so they should be abstracted away. The lion-attr gem does that abstraction for you.
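The repeated steps above can be sketched as follows. This only illustrates the pattern and is not lion-attr’s implementation: FakeRedis is a hypothetical in-memory stand-in so the snippet runs without a Redis server (its hincrby and hget methods mimic the Redis commands of the same name), and StatStore is an assumed name for the wrapper.

```ruby
# Hypothetical in-memory stand-in for a Redis connection, so this sketch
# runs without a server. It mimics only the hash commands used below.
class FakeRedis
  STORE = Hash.new { |store, key| store[key] = Hash.new(0) }

  def hincrby(key, field, amount)
    STORE[key][field] += amount
  end

  # Like Redis, return the value as a string, or nil when the field is unset.
  def hget(key, field)
    STORE[key].key?(field) ? STORE[key][field].to_s : nil
  end

  def close; end
end

# Assumed wrapper name: hides connect / choose key / update / close.
class StatStore
  def initialize(namespace)
    @namespace = namespace
  end

  # Update: connect, increment one field of the hash, close.
  def incr(id, stat, by = 1)
    with_connection { |redis| redis.hincrby(key_for(id), stat.to_s, by) }
  end

  # Fetch: connect, read the field, close. Returns 0 when unset.
  def get(id, stat)
    with_connection { |redis| redis.hget(key_for(id), stat.to_s).to_i }
  end

  private

  def key_for(id)
    "#{@namespace}:#{id}"
  end

  def with_connection
    redis = FakeRedis.new
    yield redis
  ensure
    redis&.close
  end
end

stats = StatStore.new("article")
stats.incr("54d5f10d5675730bd1050000", :view)
stats.incr("54d5f10d5675730bd1050000", :view)
puts stats.get("54d5f10d5675730bd1050000", :view)
```

Callers only see incr and get; the key naming and connection handling live in one place.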

Install

gem install lion-attr

Usage

class Article
  include Mongoid::Document
  include LionAttr
  field :url, type: String
  field :view, type: Integer

  # field :view will be stored in Redis and saved back to Mongodb later
  live :view
end

# fetch the object from Redis
article = Article.fetch('54d5f10d5675730bd1050000')
# increase its view without database hits
article.incr(:view)
#=> 10

# view is updated without object save, its updated value is available
# even when you query the object from database
article_from_another_session = Article.find('54d5f10d5675730bd1050000')
article_from_another_session.view
#=> 10

Live Attribute

That counter is usually an attribute of a model object; the difference is that the attribute gets its value from Redis instead of the database. We call it a live attribute because it tends to update its value very frequently.

Cache Object

Including LionAttr will set an after save callback to create/update the object cache in Redis.

LionAttr stores each object as a JSON string in a Redis hash whose key is the full class name of the object. Objects of the same class are stored in the same hash, each under its own identity.

Object identity is the object’s id by default, but you can tell LionAttr to use a different field’s value as the object identity.

class Article
  include Mongoid::Document
  include LionAttr
  field :url, type: String
  field :view, type: Integer

  # field :view will be stored in Redis and saved back to Mongodb later
  live :view
  # using :url as the key
  self.live_key = :url
end

# fetch the object using custom key
Article.fetch('http://uniqueurl.com')
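As a rough illustration of the storage layout described above (one Redis hash per class, one JSON string per object, keyed by the object’s identity), the sketch below uses a plain Ruby Hash in place of Redis. ObjectCache, write and fetch here are hypothetical names for illustration, not LionAttr’s internals.

```ruby
require "json"

# Plain Hash standing in for Redis: { hash key => { identity => JSON string } }
REDIS_HASHES = Hash.new { |hashes, key| hashes[key] = {} }

# Hypothetical helper, not part of LionAttr.
module ObjectCache
  # Store the attributes as a JSON string in the hash named after the class.
  # live_key selects which attribute is used as the object identity.
  def self.write(class_name, attributes, live_key: :id)
    identity = attributes.fetch(live_key).to_s
    REDIS_HASHES[class_name][identity] = JSON.generate(attributes)
  end

  # Read the JSON string back and parse it; nil when nothing is cached.
  def self.fetch(class_name, identity)
    json = REDIS_HASHES[class_name][identity.to_s]
    json && JSON.parse(json, symbolize_names: true)
  end
end

article = { id: "54d5f10d5675730bd1050000", url: "http://uniqueurl.com", view: 10 }
ObjectCache.write("Article", article, live_key: :url)
p ObjectCache.fetch("Article", "http://uniqueurl.com")
```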

LionAttr connections

LionAttr connections figure

Increment

LionAttr focuses on the well-known problem of increment operations in web applications, such as tracking pageviews, actions, impressions and clicks. It provides an incr method at both class and object scope.

article.incr(:view)
Article.incr('54d5f10d5675730bd1050000', :view)
# also works with a custom key
Article.incr('http://uniqueurl.com', :view)

This operator only interacts with Redis. No database hits.

The incr class method is very useful when you don’t want to fetch the object from the database or Redis and just want to increase the counter.

Note: This operator will return the increased value if the field type is Integer or Float, otherwise a warning string will be returned.
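The type check in the note above could look roughly like this. It is a sketch under assumptions: FIELD_TYPES, COUNTERS and incr_field are hypothetical names, a Ruby Hash replaces Redis, and the exact warning text LionAttr returns is not shown in this post, so the message below is a placeholder.

```ruby
# Hypothetical field declarations, mirroring `field :view, type: Integer`.
FIELD_TYPES = { view: Integer, score: Float, url: String }.freeze

# Plain Hash standing in for the Redis counters.
COUNTERS = Hash.new(0)

# Increment a numeric field and return the new value; for any other
# field type, return a warning string instead (placeholder wording).
def incr_field(identity, field, by = 1)
  type = FIELD_TYPES[field]
  unless [Integer, Float].include?(type)
    return "#{field} is not an Integer or Float field, nothing was incremented"
  end

  COUNTERS[[identity, field]] += by
end

p incr_field("http://uniqueurl.com", :view)
p incr_field("http://uniqueurl.com", :url)
```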

Save Back to Database

You might want to save the value back to the database. LionAttr provides update_db to do so.

article.update_db

It’s good practice to do this periodically, for example with the whenever gem.

A simple code snippet to save back all published objects of the Article class:

Article.where(published: true).each(&:update_db)
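With the whenever gem, a periodic save-back might be scheduled like this in config/schedule.rb. The 10-minute interval is an arbitrary assumption, and the published filter is copied from the snippet above; adjust both to your application.

```ruby
# config/schedule.rb (whenever gem)
# Assumption: flushing every 10 minutes is frequent enough for your stats.
every 10.minutes do
  runner "Article.where(published: true).each(&:update_db)"
end
```

Running whenever --update-crontab then writes the schedule into the crontab.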

Redis Connection Pool

LionAttr uses Connection Pool to keep connections to Redis. Mongoid uses connection-pool too.
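To show what a connection pool buys you, here is a minimal sketch built only on Ruby’s thread-safe Queue. TinyPool is a hypothetical class for illustration, not the code LionAttr or Mongoid use, and the strings stand in for real Redis connections.

```ruby
# Minimal fixed-size pool built on Ruby's thread-safe Queue.
class TinyPool
  # The block creates one connection; it is called `size` times up front.
  def initialize(size)
    @connections = Queue.new
    size.times { @connections << yield }
  end

  # Check a connection out, yield it, and always return it to the pool.
  def with
    connection = @connections.pop # blocks while every connection is in use
    yield connection
  ensure
    @connections << connection if connection
  end
end

created = 0
pool = TinyPool.new(2) { created += 1; "connection-#{created}" }

# Five threads share the two connections instead of opening five.
threads = 5.times.map do
  Thread.new { pool.with { |conn| conn.upcase } }
end
threads.each(&:join)
puts created
```

The number of open connections stays fixed no matter how many threads need one, which is the point of pooling.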

Configuration

LionAttr uses the redis gem as its Redis Ruby driver.

You can pass your redis configuration to LionAttr.

LionAttr.configure do |config|
  # Tell LionAttr to use Redis db 16
  config.redis_config = { :db => 16 }
end

Any configuration option accepted by the redis gem is valid here.