
How To Recognize Valid URL from Text

Twitter, a host of Twitter clients, and tons of other programs have to recognize valid URLs (defined in RFC 1738) in plain text and hyperlink them. Unfortunately, due to laziness (or lack of knowledge) on the part of the programmers, such URL detection schemes are often hare-brained and fail to recognize valid URLs properly.

A URL can use a wide variety of characters, and you need to recognize all of them to properly identify and isolate a URL from the surrounding text. Here is a simple guide for programmers (based on RFC 1738, obviously):
   In general, URLs are written as follows:

       <scheme>:<scheme-specific-part>

   A URL contains the name of the scheme being used (<scheme>) followed
   by a colon and then a string (the <scheme-specific-part>) whose
   interpretation depends on the scheme.

   Scheme names consist of a sequence of characters. The lower case
   letters "a"--"z", digits, and the characters plus ("+"), period
   ("."), and hyphen ("-") are allowed. For resiliency, programs
   interpreting URLs should treat upper case letters as equivalent to
   lower case in scheme names (e.g., allow "HTTP" as well as "http").
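
The scheme rule above is easy to check in code. Here is a minimal sketch in Python; the names `SCHEME_RE` and `split_scheme` are made up for illustration and are not part of the RFC or of any library.

```python
import re

# Scheme characters per the quoted text: letters, digits, "+", "." and "-".
# re.IGNORECASE treats "HTTP" the same as "http", as the RFC recommends.
SCHEME_RE = re.compile(r"[a-z0-9+.\-]+", re.IGNORECASE)

def split_scheme(url):
    """Return (scheme, scheme_specific_part), or None if there is no valid scheme."""
    scheme, sep, rest = url.partition(":")
    if not sep or not SCHEME_RE.fullmatch(scheme):
        return None
    # Normalize the scheme to lower case; the rest is left untouched.
    return scheme.lower(), rest
```

For example, `split_scheme("HTTP://example.com/")` yields the scheme `http` and the scheme-specific part `//example.com/`.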

URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.
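
As a rough illustration of the octet ranges above, the following sketch percent-encodes every octet in 00-1F, 7F and 80-FF and leaves everything else alone. The function name `encode_non_graphic` is hypothetical, and turning text into octets with UTF-8 first is an assumption of the usage example, not something RFC 1738 specifies.

```python
def encode_non_graphic(data: bytes) -> str:
    """Percent-encode control octets (00-1F, 7F) and non-ASCII octets (80-FF)."""
    out = []
    for octet in data:
        if octet <= 0x1F or octet >= 0x7F:
            # "%" followed by two hexadecimal digits
            out.append("%{:02X}".format(octet))
        else:
            out.append(chr(octet))
    return "".join(out)

# Assumption: the text is converted to octets as UTF-8 before encoding.
print(encode_non_graphic("caf\u00e9".encode("utf-8")))
```

Note that this only covers the octets named in the paragraph above; the space and the other unsafe characters discussed next are not touched by it.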

   Unsafe:

   Characters can be unsafe for a number of reasons.  The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.
   The characters "<" and ">" are unsafe because they are used as the
   delimiters around URLs in free text; the quote mark (""") is used to
   delimit URLs in some systems.  The character "#" is unsafe and should
   always be encoded because it is used in World Wide Web and in other
   systems to delimit a URL from a fragment/anchor identifier that might
   follow it.  The character "%" is unsafe because it is used for
   encodings of other characters.  Other characters are unsafe because
   gateways and other transport agents are known to sometimes modify
   such characters. These characters are "{", "}", "|", "\", "^", "~",
   "[", "]", and "`".

   All unsafe characters must always be encoded within a URL. For
   example, the character "#" must be encoded within URLs even in
   systems that do not normally deal with fragment or anchor
   identifiers, so that if the URL is copied into another system that
   does use them, it will not be necessary to change the URL encoding.
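
For a URL detector, the unsafe list matters mostly as a list of characters that end a URL in free text (a space, "<", ">" or a quote mark). A small sketch that reports where unsafe characters occur in a string follows; `UNSAFE` and `find_unsafe` are illustrative names only.

```python
# The unsafe characters listed in the quoted text, including the space.
UNSAFE = set(' <>"#%{}|\\^~[]`')

def find_unsafe(text):
    """Return (index, character) pairs for every unsafe character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in UNSAFE]
```

Keep in mind that "%" and "#" do show up in real URLs, as the encoding marker and the fragment delimiter respectively, so a detector probably should not stop at them even though they appear in this list.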

   Reserved:

   Many URL schemes reserve certain characters for a special meaning:
   their appearance in the scheme-specific part of the URL has a
   designated semantics. If the character corresponding to an octet is
   reserved in a scheme, the octet must be encoded.  The characters ";",
   "/", "?", ":", "@", "=" and "&" are the characters which may be
   reserved for special meaning within a scheme. No other characters may
   be reserved within a scheme.

   Usually a URL has the same interpretation when an octet is
   represented by a character and when it encoded. However, this is not
   true for reserved characters: encoding a character reserved for a
   particular scheme may change the semantics of a URL.

   Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
   reserved characters used for their reserved purposes may be used
   unencoded within a URL.

   On the other hand, characters that are not required to be encoded
   (including alphanumerics) may be encoded within the scheme-specific
   part of a URL, as long as they are not being used for a reserved
   purpose.
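
Putting the character rules together, a detector can be sketched as a regular expression whose character class is exactly the set the RFC allows unencoded. This is a simplified sketch, not a full RFC 1738 parser: limiting it to the `http`, `https` and `ftp` schemes and adding "%" and "#" to the class are assumptions made here, and `URL_RE` and `find_urls` are made-up names.

```python
import re

# Allowed unencoded per the quoted text: alphanumerics, the specials
# $-_.+!*'(), and the reserved characters ;/?:@=&
# Assumption: "%" (percent-encodings) and "#" (fragments) are also accepted.
URL_CHARS = r"A-Za-z0-9$\-_.+!*'(),;/?:@=&%#"

# Assumption: only these three schemes are detected in this sketch.
URL_RE = re.compile(r"\b(?:https?|ftp)://[" + URL_CHARS + r"]+", re.IGNORECASE)

def find_urls(text):
    """Return every substring of text that looks like a URL."""
    return URL_RE.findall(text)

print(find_urls("see http://en.wikipedia.org/wiki/URL_(disambiguation) now"))
```

Because "(" and ")" are in the character class, a URL with a parenthesis in the middle is kept whole, and unsafe delimiters such as a space or ">" end the match.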
 
Twitter itself doesn't recognize some of these characters when they appear at the end of a URL. Twitter clients such as Tweetdeck and Brizzly fail to recognize a ")" even in the middle!
Is compliance with standards too much to expect from a 100 million dollar company?
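
The hard part is the end of a URL, where a valid character such as ")" or "." may also be ordinary sentence punctuation. One possible heuristic, which RFC 1738 does not prescribe, is to drop trailing punctuation but keep a closing parenthesis that has a matching opener. The function name `trim_trailing` and the exact punctuation set are assumptions of this sketch.

```python
def trim_trailing(url):
    """Strip trailing characters that are more likely sentence punctuation."""
    while url:
        last = url[-1]
        if last in ".,;:!?'":
            # Trailing sentence punctuation: assume it is not part of the URL.
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # An unmatched ")" probably closes a parenthesis in the prose.
            url = url[:-1]
        else:
            break
    return url
```

With this, `http://example.com/a_(b)` is left intact, while `http://example.com/a).` is trimmed to `http://example.com/a`. A URL that really does end in one of these characters will be cut short, which is the trade-off of any such heuristic.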

source : http://blog.taragana.com/index.php/archive/how-to-recognize-valid-url-from-text/
