With the Microsoft Surface Pro out, I’ve been rethinking a lot of things. I have a long and storied history with both PCs and Macs that I won’t get into right now. Based on my experience with other PCs, here’s my breakdown of why I use one over the other. Keep in mind that I’ve yet to use the Surface extensively. I’m willing to update this post if I get a chance to use it for real work.

I love PCs because:
  • Visual Studio
  • Microsoft Excel: I can get along without the rest of the Office suite, but I haven’t found an alternative to Excel.
I love my MacBook Pro because:
  • Amazing performance: I normally have about a dozen desktop apps open and 20 or more Chrome tabs
  • Incredible battery life: 6-8 hours typically under heavy use
  • Lack of malware issues: Anti-malware programs are resource hogs.
  • Excellent memory management: minimal paging (only pages when absolutely necessary)
  • Software lifecycle management: zero residue install/uninstall
  • Hassle-free OS updates: painless compared to weekly Windows updates and reboots.
  • No vendor advertisements to install stuff: I’ve never gotten a notification from Apple that I didn’t need.
  • Thermal management: I only hear the fan when running Windows on Parallels.
  • Durability: solid cast aluminum, glass screen, stiff edges with no seams, only the keys are plastic, breakaway charger, well-seated ports.
  • Longevity: I’ve owned many Macs that have lasted more than 5 years without major issues, OS reinstalls, or major slowdowns. I was a certified Apple technician for a year in college and regularly serviced Macs that were 5+ years old and put them back into service.

Many of these benefits come from the seamless cohesion of Apple’s OS and hardware. The new Surface might have the same thing going for it, so maybe it closes the gap? I’m VERY curious.

I recently attended a talk by Scott Hanselman, and he was running his presentation on a Surface Pro. I was amazed by how well it performed. I’ve had a few $1,500+ Dells for business and I’ve never seen that kind of performance, even right out of the box. I’ve always had to use $2,500+ desktops to get Windows and Windows-based software to run without lagging, and that’s only after turning off the Windows animation fluff.

Apache Solr with Apache Tomcat on Linux

These are excellent instructions on how to install Solr and Apache Tomcat on Linux.  Just be sure to read my comment about the SLF4J jar files:

You also need to add the SLF4J logging jars to /home/solrdev/apache-tomcat-7.0.39/lib. Otherwise you will see this error when you try to start the app: “Application at context path /solr could not be started”

You can download them here: http://www.slf4j.org/download.html

The Content-Type header consists of the MIME type of the resource plus an optional character set specification.

I came up with a regular expression to split the content-type header value into the respective fields:

(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?

In Python you would use it like so. Notice the use of re.IGNORECASE:

import re

contentTypePattern = re.compile(r"(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?", re.IGNORECASE)

m = contentTypePattern.search(contentTypeHeader)
contentType = m.group('type')
charset = m.group('charset')

This is what I used for my test input:

image/x-ms-bmp
text/html; charset=GB2312
application/postscript
video/quicktime
image/png
image/vnd.microsoft.icon
text/xml;charset=UTF-8
image/jpeg; charset=utf-8
text/html;charset=utf-8
text/html;charset=euc-kr
application/zip;charset=ISO-8859-1
text/plain;charset=UTF-8
image/x-png
application/x-zip-compressed
text/javascript; Charset=utf-8
video/webm
text/x-vCalendar;charset=UTF-8
text/javascript;charset=utf-8
application/javascript;charset=utf-8
text/plain; charset=utf-8
video/x-flv
application/rtf
text/xml; Charset=utf-8
text/html
text/xml;;charset=UTF-8
text/plain; charset=UTF-8
application/x-x509-ca-cert
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/pdf; charset=utf-8
text/calendar
text/xml; charset=UTF-8
application/ogg
text/xml
application/x-javascript;charset=UTF-8
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
image/x-icon; charset=utf-8
text/css; Charset=UTF-8
image/x-icon
Application/doc
application/pdf; charset=UTF-8
application/vnd.wap.xhtml+xml; charset=utf-8
application/vnd.ms-word.document.12
application/x-javascript;charset=utf-8
text/html; charset=euc-kr
Application/ppt
text/html; charset=utf-8;
application/x-shockwave-flash
text/html; charset=ISO-8859-1
text/css; charset=utf-8
application/xml
text/plain; charset=ISO-8859-1
APPLICATION/XML; charset=utf-8
image/jpeg;charset=ISO-8859-1
text/plain; charset=ISO-8859-15
text/html; charset=UTF-8
text/html; charset=iso-8859-1
text/html; charset=utf-8
application/octet-stream;charset=UTF-8;
image/png;charset=UTF-8
application/octet-stream
text/xml; charset=utf-8
text/x-js
text/plain
application/ms-download
text/css;charset=utf-8
application/rss+xml; charset=UTF-8
text/html;charset=GB2312
video/mp4
application/x-javascript; charset=utf-8
application/rss+xml; charset=utf-8
pdf
image/pjpeg
image/svg+xml
text/html; Charset=utf-8
image/gif; charset=utf-8
text/x-component
application/pdf
text/css;charset=UTF-8
text/css; charset=UTF-8
Application/mp3
application/x-msdownload
image/gif
application/javascript
img/gif
Text/html; charset=utf-8
image/tiff
application/x-rar-compressed
application/pdf;charset=UTF-8
text/js
Application/flv
application/xhtml+xml; charset=utf-8
application/x-javascript
text/html;charset=windows-1250
image/Jpeg
text/javascript
video/ogg
video/mpeg
text/html;charset=UTF-8
application/step
text/css
application/xhtml+xml;charset=UTF-8
text/html; charset=gb2312
image/jpeg
image/ico
image/gif;charset=UTF-8
application/pkix-crl
image/gif;charset=ISO-8859-1
application/zip
application/flv
image/vnd.wap.wbmp
text/html;charset=utf-8; charset=utf-8
text/x-vcalendar
text/html;;charset=UTF-8
application/pdf;charset=ISO-8859-1
image/dxf
application/vnd.ms-excel
image/bmp
image/png;charset=ISO-8859-1
text/javascript; charset=utf-8
text/javascript;charset=UTF-8
text/plain;charset=utf-8
audio/mpeg
text/html; charset=windows-1252
application/x-pkcs7-certificates
image/svg
application/msword
audio/x-ms-wma
application/vnd.ms-powerpoint
text/html; charset=EUC-KR
video/x-ms-wmv
text/html; Charset=UTF-8
application/rss+xml
text/rtf
video/x-msvideo
text/html; charset=ISO-8859-2
text/javascript;charset=ISO-8859-1
Application/swf
image/jpeg;charset=UTF-8
text/html;charset=ISO-8859-1
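
Applied to a few of the sample headers above, the pattern behaves like this (a quick sketch; when a header carries no charset, the charset group simply comes back as None):

```python
import re

# Same pattern as above; re.IGNORECASE also handles "Charset=" variants.
contentTypePattern = re.compile(
    r"(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?",
    re.IGNORECASE,
)

for header in ["text/html; charset=GB2312",
               "text/javascript; Charset=utf-8",
               "image/png",
               "text/xml;charset=UTF-8"]:
    m = contentTypePattern.search(header)
    print(m.group("type"), "->", m.group("charset"))
```

One thing to watch for: malformed values with a doubled semicolon, like text/xml;;charset=UTF-8 in the list above, defeat the optional charset group, so the charset comes back as None for those.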

The following are some ways to find, or at least get an idea of, the modification and/or creation date of a given web page:

1.) First and most obvious: look for dates on the web page itself.
You can get creative with your browser’s in-page search to find them. For instance, search for “2012”.

2.) Get the date of the last time Google indexed the page.
This date will fall somewhere between the actual “last modified” date and roughly three months after it, depending on where the page falls in Google’s indexing priority. Google gets to most of the pages it will add to its index at least every three months, and the more popular the page, the closer this date will be to the actual “last modified” date. Google doesn’t update this date unless the content of the page changed. This is about as good as it gets for dynamically generated web pages if you are looking at the page as a whole.

The best way to do this is to enter this in your browser’s address bar and look at the date next to the result.
https://www.google.com/search?q=inurl:<web address>&as_qdr=y15&safe=active

Here is an example:
https://www.google.com/search?q=inurl:https://codeinchinese.wordpress.com&as_qdr=y15&safe=active
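
If you find yourself doing this often, the lookup URL is easy to build programmatically. A small sketch using only Python’s standard library (the page is the example above; urlencode takes care of percent-encoding the address inside the inurl: term):

```python
from urllib.parse import urlencode

# Example page from above; swap in any address you want to check.
page = "https://codeinchinese.wordpress.com"
params = {"q": "inurl:" + page, "as_qdr": "y15", "safe": "active"}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```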

3.) Look at the HTTP headers.
For static web pages, or static pieces of web pages, you can sometimes get the “last modified” and, if you are lucky, the “creation” dates. If you want this information for the website as a whole you can go to http://www.statscrop.com, enter the URL of the page you are interested in, and scroll down to the “HTTP Header Analysis” section. This will show you the header for the main response to the URL request.

Sometimes this information isn’t provided in the HTTP response header for that particular request, or the page has content that is dynamic enough to make it useless (i.e. it always returns today). In that case you can look at other resource requests. The average page nowadays is actually made up of dozens of resources that are all acquired by the browser with separate requests, and the web server’s response to each of these requests contains its own header. You can see these other headers by opening the Chrome developer tools, clicking on the Network tab, and putting the page’s URL in the browser’s address bar. Inside the Network tab there is another tab called Headers; you are interested in the Response headers. Try to select the resource on the left side that matches the content you are interested in. It may be a picture, a CSS file, a PHP or HTML file, etc.

Note: not all web servers are configured to send “creation” or “last modified” dates in the HTTP response header, so don’t be surprised if you don’t see these fields.
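
When a “last modified” date does show up, Python’s standard library can parse the date format these headers use into something you can compute with. A small sketch (the header value here is made up):

```python
from email.utils import parsedate_to_datetime

# A made-up Last-Modified value in the usual HTTP date format.
last_modified = "Wed, 15 May 2013 08:30:00 GMT"
dt = parsedate_to_datetime(last_modified)
print(dt.year, dt.month, dt.day)
```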

4.) Other services
http://www.cubestat.com provides a service that shows some additional web page indexing information. I’m not entirely sure where it gets its data; I expected the dates to match what Google shows, but that isn’t always so. It looks at multiple indexers (Google, Yahoo, Live (Bing)). It could be getting the data from Alexa, Quantcast, or MajesticSEO, or it could be running its own crawler, but that is doubtful.

5.) DNS registration
Another interesting piece of information is when the domain name was registered to a given IP address or owner. http://www.statscrop.com will also give you this information.

When you run ‘apt-cache search’ it returns anything that has the search text in either the package name or the description. What if you only want to search within the package name? The following pattern will help you do that:

apt-cache search mysql | grep -P "^[^\s]*server[^\s]*"

In general, it can be used to search within the first column of space-delimited text.

Excel Adjacency List to Dot on CodePlex

I created an open source project for my Excel add-in that converts an adjacency list entered into an Excel worksheet into a GraphViz dot file.  Check it out here:

https://exceladjlisttodot.codeplex.com/

RDLC Project on CodePlex

January 31, 2012

I finally got around to making this C# library for programmatically creating Visual Studio client report definition (.rdlc) files open source. You can download it here. If you have an interest, feel free to improve it.

http://rdlc2005.codeplex.com/

Upcoming Stanford Courses

January 25, 2012

Stanford is offering a number of free online courses this semester. Two that I am interested in are Machine Learning and Natural Language Processing. If the course load is too much I will stick with NLP. I think they even give a certificate of completion.

I use my Amazon.com wish list a lot. I have hundreds of items in it. I wanted to see if I had any books in my wish list on LaTeX or TeX, so I thought this would be a good use of the Amazon.com API. Unfortunately, I found out that they recently removed the wish-list querying methods from their API. Bummer. Then I came upon this post: http://bililite.nfshost.com/blog/2010/10/31/hacking-my-way-to-an-amazon-wishlist-widget/ . The solution is the layout=compact URL parameter.

First, open one of your wish lists on Amazon.com. Add ?layout=compact to the end of the URL in the address bar. This will put all of the wish list items on a single page. Then use the browser’s search feature (ctrl+f) to search the page. If you want to search multiple wish lists you need to click on each of them and repeat, but it sure beats paging through the items.

This is a hidden feature since as far as I can tell there is no way to change the layout through the UI.

I did a little research and wasn’t able to find a solution. I played around with tesseract-ocr using VietOCR. It wasn’t great, but it’s better than nothing. Apparently Tesseract is the best library out there for this, and I think it is what Google uses. I may be able to use tesseract-ocr for my implementation. Check out my StackOverflow question here:

http://stackoverflow.com/questions/9007370/searching-the-file-system-for-text-in-an-image