Parsing content-type header

May 14, 2013

The Content-Type header consists of the MIME type of the resource plus an optional character set specification.

I came up with a regular expression to split the content-type header value into the respective fields:

(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?
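To make the pieces of the pattern easier to follow, here is the same expression spread out with Python's re.VERBOSE flag and a comment on each part. This is only an annotated sketch of the pattern above; the name CONTENT_TYPE_VERBOSE is made up for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
# CONTENT_TYPE_VERBOSE is a hypothetical name used only in this example.
CONTENT_TYPE_VERBOSE = re.compile(
    r"""
    (?P<type>.*?)          # media type: shortest possible run of characters
    (;|$)                  # ended by the first semicolon or the end of the string
    (                      # optional charset parameter
        \s?                #   at most one whitespace character after the semicolon
        charset=           #   the literal parameter name
        (?P<charset>.*?)   #   charset value, also matched lazily
        (;|$)              #   ended by a semicolon or the end of the string
    )?
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = CONTENT_TYPE_VERBOSE.search("text/html; charset=GB2312")
print(m.group("type"), m.group("charset"))
```

Both quantifiers are lazy (.*?), so each named group stops at the first semicolon it meets rather than swallowing the rest of the header.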

In Python you would use it like so. Notice the use of re.IGNORECASE, which lets the pattern match Charset= as well as charset=:

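import re

contentTypePattern = re.compile(r"(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?", re.IGNORECASE)

# contentTypeHeader holds the raw header value, e.g. "text/html; charset=utf-8"
m = contentTypePattern.search(contentTypeHeader)
contentType = m.group('type')
charset = m.group('charset')  # None when the header carries no charset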
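Since the charset part is optional, the charset group is None for headers such as image/png, and the casing of both fields varies from server to server. One possible way to wrap the pattern is sketched below. The function name split_content_type and its default_charset parameter are my own assumptions, and lowercasing relies on MIME types and charset names being case-insensitive.

```python
import re

CONTENT_TYPE_RE = re.compile(
    r"(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?",
    re.IGNORECASE,
)


def split_content_type(header, default_charset=None):
    """Return (media_type, charset) for a raw Content-Type header value.

    Hypothetical helper: both fields are lowercased, and default_charset
    is returned when the header has no usable charset parameter.
    """
    # The pattern can match an empty string, so search() always finds a match.
    m = CONTENT_TYPE_RE.search(header.strip())
    media_type = m.group("type").strip().lower()
    charset = m.group("charset")
    if charset is None or not charset.strip():
        return media_type, default_charset
    return media_type, charset.strip().lower()


print(split_content_type("Text/html; charset=UTF-8"))
print(split_content_type("image/png"))
```

Returning a tuple keeps the call site short, e.g. media_type, charset = split_content_type(value).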

This is what I used for my test input:

image/x-ms-bmp
text/html; charset=GB2312
application/postscript
video/quicktime
image/png
image/vnd.microsoft.icon
text/xml;charset=UTF-8
image/jpeg; charset=utf-8
text/html;charset=utf-8
text/html;charset=euc-kr
application/zip;charset=ISO-8859-1
text/plain;charset=UTF-8
image/x-png
application/x-zip-compressed
text/javascript; Charset=utf-8
video/webm
text/x-vCalendar;charset=UTF-8
text/javascript;charset=utf-8
application/javascript;charset=utf-8
text/plain; charset=utf-8
video/x-flv
application/rtf
text/xml; Charset=utf-8
text/html
text/xml;;charset=UTF-8
text/plain; charset=UTF-8
application/x-x509-ca-cert
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/pdf; charset=utf-8
text/calendar
text/xml; charset=UTF-8
application/ogg
text/xml
application/x-javascript;charset=UTF-8
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
image/x-icon; charset=utf-8
text/css; Charset=UTF-8
image/x-icon
Application/doc
application/pdf; charset=UTF-8
application/vnd.wap.xhtml+xml; charset=utf-8
application/vnd.ms-word.document.12
application/x-javascript;charset=utf-8
text/html; charset=euc-kr
Application/ppt
text/html; charset=utf-8;
application/x-shockwave-flash
text/html; charset=ISO-8859-1
text/css; charset=utf-8
application/xml
text/plain; charset=ISO-8859-1
APPLICATION/XML; charset=utf-8
image/jpeg;charset=ISO-8859-1
text/plain; charset=ISO-8859-15
text/html; charset=UTF-8
text/html; charset=iso-8859-1
text/html; charset=utf-8
application/octet-stream;charset=UTF-8;
image/png;charset=UTF-8
application/octet-stream
text/xml; charset=utf-8
text/x-js
text/plain
application/ms-download
text/css;charset=utf-8
application/rss+xml; charset=UTF-8
text/html;charset=GB2312
video/mp4
application/x-javascript; charset=utf-8
application/rss+xml; charset=utf-8
pdf
image/pjpeg
image/svg+xml
text/html; Charset=utf-8
image/gif; charset=utf-8
text/x-component
application/pdf
text/css;charset=UTF-8
text/css; charset=UTF-8
Application/mp3
application/x-msdownload
image/gif
application/javascript
img/gif
Text/html; charset=utf-8
image/tiff
application/x-rar-compressed
application/pdf;charset=UTF-8
text/js
Application/flv
application/xhtml+xml; charset=utf-8
application/x-javascript
text/html;charset=windows-1250
image/Jpeg
text/javascript
video/ogg
video/mpeg
text/html;charset=UTF-8
application/step
text/css
application/xhtml+xml;charset=UTF-8
text/html; charset=gb2312
image/jpeg
image/ico
image/gif;charset=UTF-8
application/pkix-crl
image/gif;charset=ISO-8859-1
application/zip
application/flv
image/vnd.wap.wbmp
text/html;charset=utf-8; charset=utf-8
text/x-vcalendar
text/html;;charset=UTF-8
application/pdf;charset=ISO-8859-1
image/dxf
application/vnd.ms-excel
image/bmp
image/png;charset=ISO-8859-1
text/javascript; charset=utf-8
text/javascript;charset=UTF-8
text/plain;charset=utf-8
audio/mpeg
text/html; charset=windows-1252
application/x-pkcs7-certificates
image/svg
application/msword
audio/x-ms-wma
application/vnd.ms-powerpoint
text/html; charset=EUC-KR
video/x-ms-wmv
text/html; Charset=UTF-8
application/rss+xml
text/rtf
video/x-msvideo
text/html; charset=ISO-8859-2
text/javascript;charset=ISO-8859-1
Application/swf
image/jpeg;charset=UTF-8
text/html;charset=ISO-8859-1
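A few of these inputs are malformed (trailing semicolons, a repeated charset, double semicolons, a bare "pdf"), so it is worth checking how the pattern treats them. The sketch below runs the pattern over a hand-picked subset and, for comparison, a simple split-on-semicolon parser. That parser (parse_by_split) is not part of the original approach; it is a hypothetical alternative shown only to contrast the results.

```python
import re

PATTERN = re.compile(
    r"(?P<type>.*?)(;|$)(\s?charset=(?P<charset>.*?)(;|$))?",
    re.IGNORECASE,
)


def parse_by_regex(header):
    """Apply the pattern from this post and return (type, charset)."""
    m = PATTERN.search(header)
    return m.group("type"), m.group("charset")


def parse_by_split(header):
    """Hypothetical alternative: split on ';' and look for a charset parameter."""
    media_type, *params = [part.strip() for part in header.split(";")]
    for param in params:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return media_type, value.strip()  # first charset wins
    return media_type, None


# Hand-picked subset of the test input above.
SAMPLES = [
    "text/html; charset=utf-8;",
    "text/html;charset=utf-8; charset=utf-8",
    "text/xml;;charset=UTF-8",
    "pdf",
]

for header in SAMPLES:
    print(repr(header), parse_by_regex(header), parse_by_split(header))
```

With the pattern as written, the double-semicolon inputs (text/xml;;charset=UTF-8 and text/html;;charset=UTF-8) come back with a charset of None, because the optional group expects charset= right after the first semicolon, with at most one whitespace character in between. The split-based version picks the charset up in that case, at the cost of a few more lines.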
