A taste of tag soup

TagSoup is a library for parsing non-compliant HTML.

To explain why you might want this, let’s start by considering the following table.

<table id="test" border="1">
<tr>
<td>test
</tr>
</table>

Notice the missing closing td tag.

In a browser, this still renders as a table, border and all.

Though I don’t have the actual stats, judging by how often I make mistakes in my own HTML, I think this is a valid enough reason to suspect that the HTML you try to parse may not be compliant.

The problem is that a strict XML parser would not parse this. What we need is a parser which is lenient enough to parse malformed HTML with some degree of usefulness. Remote controlling a browser session would be a novel solution, but it incurs a few overheads: it is slower, harder to build and harder to code. Let’s see how far we can get with some common XML parsers.

For example, the popular HaXml produces the following error when trying to parse this source document.

Prelude Text.XML.HaXml.Parse>  xmlParse "" "<b>hello world</i></b>"                                                                                                                          
*** Exception: in element tag b,                                                                                                                                                                                   
tag <b> terminated by </i>                                                                                                                                                                                         
  at file   at line 1 col 15  

This library is way too strict to parse bad HTML. Let’s try another. This time we will try the package known on Hackage simply as xml.

Prelude Text.XML.Light> parseXML "<b>hello world</i></b>"
[Elem                                                                                                                                                                                                              
   (Element{elName =                                                                                                                                                                                               
              QName{qName = "b", qURI = Nothing, qPrefix = Nothing},                                                                                                                                               
            elAttribs = [],                                                                                                                                                                                        
            elContent =                                                                                                                                                                                            
              [Text                                                                                                                                                                                                
                 (CData{cdVerbatim = CDataText, cdData = "hello world</i>",                                                                                                                                        
                        cdLine = Just 1})],                                                                                                                                                                        
            elLine = Just 1})]     

Slightly better, but as you can see, the stray closing i tag is counted as text, just like hello world.

Now, this time, we’ll try using TagSoup.

What is TagSoup? TagSoup is a library for parsing and re-rendering HTML.

To get it,

cabal install tagsoup

After having done that, you can do some scraping. But first you might want to do some reading on the library by doing an online search for “tag soup haskell”.

Also, if you do not know what “tag soup” is, you might want to read the page about it on Wikipedia.

With that out of the way, first import TagSoup.

import Text.HTML.TagSoup

Now we can try and parse the broken HTML again.

Prelude Text.HTML.TagSoup> parseTags "<b>hello world</i></b>"
[TagOpen "b" [],TagText "hello world",TagClose "i",TagClose "b"]

…Now we’re getting somewhere.

As a small demonstration of the library, let’s extract just the text from the HTML document.

Prelude Text.HTML.TagSoup> concatMap fromTagText $ filter isTagText $ parseTags "<b>hello world</i></b>"
"hello world"
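The same extraction can be sketched as a small standalone program. This is only an illustration, assuming the tagsoup package is installed; extractText is a name made up for this example, and it leans on the library's innerText helper, which concatenates the contents of all text tags.

```haskell
import Text.HTML.TagSoup (innerText, parseTags)

-- | Collect all text from a possibly malformed HTML fragment.
-- (extractText is a hypothetical helper name, not part of TagSoup.)
extractText :: String -> String
extractText = innerText . parseTags

main :: IO ()
main = do
  -- The mismatched </i> tag does not stop the parse.
  putStrLn (extractText "<b>hello world</i></b>")
  -- The table from the start of the article, missing its closing td tag.
  putStrLn (extractText "<table id=\"test\" border=\"1\"><tr><td>test</tr></table>")
```

Because parseTags never fails on bad markup, extractText needs no error handling of its own.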

So here you see a library which is capable of handling broken HTML documents in a more forgiving way than the usual XML library.

The cross compatibility/targeted system trade off.

Why do some people use Android and others iOS?

Why do some people like Windows and others macOS?

Why do some people like to write apps in Java while others prefer .NET?

Why do some people like web apps while others like native apps?

You will find that apps made for one system just aren’t available in the same way on other systems, and one reason for this is that some environments are more cross compatible than others. This compatibility, however, comes at a price.

Often cross-platform code is slower because it has to run through an interpreter. The JavaScript in web applications is one example. Even Java bytecode has to go through extra interpretation before being run.

Often the abstraction layers do not provide the complete functionality of every system they target, only a common subset. Someone developing for Windows using .NET might be able to add context menus to File Explorer, but someone using Java may not, because they are trying to create a consistent experience across systems, including ones where custom context menus are not possible. Also, web apps do not currently allow as much data to be stored locally as native apps. A cross-platform developer might also have to support a wider array of hardware, such as monitors with different resolutions, so they might settle for a simpler UI for speed of development.

The more targeted developer can make more assumptions about RAM availability. While developers in the past have tried to put “Recommended Hardware Requirements” on the box, it is hard to expect users to check them. Even for the experienced software buyer, such requirements greatly complicate working out which software is actually available to them.

In summary.

  • If each individual who uses your app will use it a lot of the time, use native, as the experience can be better custom fit.
  • If you want your app to be fast, use native.
  • Otherwise, if you want to target a large audience, use an environment which is more cross platform.

 

Google Search Bar dilemma

Note: this is an old article I wrote around 2004 which I have transferred to this blog.

Scenario:

You go over to your friend’s (some friend…) house, or anywhere which has a stock installation of IE, and you realize:
“Oh no, my Google Search bar has gone. That means I will have to traverse one more link before arriving at my destination” (xxl0litaz.net etc).

This tutorial is an attempt at avoiding the usual outbreak of rage which typically occurs in such situations.

Solution:

Type in the address bar:
google.com/search?q=<search-query>

eg. google.com/search?q=chuck+norris
note: terms are delimited by “+” signs.

eg2. google.com/search?q=%22chuck+norris%22
note: quotes are specified by the string “%22” (“%20” would be a space).

eg3. google.com/search?q=chuck+norris+facts&btnI*
note: The btnI specifies the magic Google option I’m Feeling
Lucky, which can effectively save 2 links.
*remember the “I” in “btnI” is case sensitive.

eg4. google.com/groups/search?q=chuck+norris
note: adding /groups/ makes it search within Google Groups.

eg5. google.com/group/alt.whateva
note: this is the quick way to access your fav group

eg6. google.com/groups/dir?q=chuck+norris
note: this is the way to search for a particular group.
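The URL patterns above can also be built programmatically. The following is a minimal sketch using only the base library; encodeChar, searchUrl and luckyUrl are names invented for this example, and the encoder assumes plain ASCII queries. Spaces become “+” and a double quote becomes “%22”.

```haskell
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- | Encode one character for use in a query string.
-- Assumption: ASCII input only; other text would need UTF-8 encoding first.
encodeChar :: Char -> String
encodeChar c
  | isAscii c && isAlphaNum c = [c]
  | c `elem` "-_.~"           = [c]
  | c == ' '                  = "+"
  | otherwise                 = '%' : twoHexDigits (ord c)
  where
    -- Upper-case hex, padded to two digits.
    twoHexDigits n =
      let hex = map toUpper (showHex n "")
      in if length hex < 2 then '0' : hex else hex

-- | Build a search URL in the same form as the examples above.
searchUrl :: String -> String
searchUrl query = "google.com/search?q=" ++ concatMap encodeChar query

-- | Same as searchUrl, with the case-sensitive btnI parameter appended.
luckyUrl :: String -> String
luckyUrl query = searchUrl query ++ "&btnI"

main :: IO ()
main = do
  putStrLn (searchUrl "chuck norris")       -- google.com/search?q=chuck+norris
  putStrLn (searchUrl "\"chuck norris\"")   -- google.com/search?q=%22chuck+norris%22
  putStrLn (luckyUrl "chuck norris facts")  -- google.com/search?q=chuck+norris+facts&btnI
```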

There is heaps more you can do with Google searches if you
go to the site and click help. This stuff is just
undocumented by them.

The difference between excessive, redundant and superfluous.

The words redundant, superfluous and excessive all imply that a quantity greater than the minimum amount needed has exceeded some threshold. But they all suggest different implications.

Excessive indicates that a quantity is so high it might be a bad thing. It could be said, for example, that excessive rainfall will cause a tank to overflow. If you wanted the tank to overflow to feed the plants around it, you might just call it sufficient instead.

Redundant implies that the extra abundance can be used as a means of substitution. When you buy a packet of, say, 100 sheets of paper, they may give you 103. The extra 3 are redundant, in case only 99 of the regular sheets are provided. Animals often have organs in pairs: lungs, kidneys etc. It’s said that one kidney can perform the job of two, so in that sense the second is there for redundancy reasons.

If your job is to be terminated, one of the tactful ways to express this is that your position is redundant. This implies that the company found a way to do your job without you. To simply say that the person is no longer needed might imply that they can no longer do their job, which is only sometimes the case and could be a harsh thing to do considering their situation.

Superfluous makes no judgement on whether something is worse or better; indeed, it suggests the matter is of little importance to the rest of the message being conveyed. Although the etymology of superfluous means overflowing water, you would not use the term if you were concerned with wasting water or things getting wet. Instead you would say it’s excessive. If, however, you were describing the motion of water in a physics class and you filled up a cup of water which overflowed into a basin, you might describe the overflowing water as superfluous so as not to import any concerns other than the motion of the water. In a practical scenario, saying excessive when you meant superfluous will almost always be forgiven, as superfluous is a much less used word than excessive and they can both get a common point across. This would be like saying big when you meant ample. Both are matters of connotation which generally fall under fairly subjective interpretation.

As a further explanation, when acronyms are spoken, sometimes both the last letter and the word the last letter stands for are pronounced. For example, ATM machine, which expands to automatic teller machine machine. Notice machine is said twice. This, I’ve noticed, is a contentious issue in communication. If you think it’s a good thing you would call the extra M redundant. If you think it’s a bad thing you would call it excessive. If you think it’s neither bad nor good you would call it superfluous.

EDIT: there is also extraneous. This word is similar to excessive but works on type and not quantity: it describes extra objects of the wrong type.

Why RAM is slower than some things like CPU cache.

A software person should know enough about hardware to write software easily, but I confess I only recently figured out that different types of memory exist, namely RAM and CPU cache.

Why both RAM and drive storage exist is an easier thing to understand, although it was recently made harder by the arrival of flash drives. RAM stands for random access memory. HDDs certainly are not random access, but flash drives are. This is why HDD performance can be improved with defragging but flash drive performance cannot. Flash drives are also referred to as SSDs. One reason RAM has been faster than SSDs is that SSDs are non-volatile whereas RAM is volatile. Making the memory in your computer stick when you shut it down requires altering the hardware, which takes longer than changing something that is merely held by the flow of current, a flow which inevitably stops when the computer is turned off.

Now on to CPU cache. It turns out the speed of light is not infinite. Therefore, the time it takes for information to travel depends on how far it has to go. In the case of computers, this is current running along a conductive path. Given that there is also a limit to how small manufacturers are capable of making hardware, there is a delay between memory and computations on that memory. The more memory you are trying to access, the farther away it will potentially be, because memory cells can only be packed so tightly together. Thus, if you work with a small amount of memory, provided your algorithm allows it, you can compute faster. CPU cache is made of smaller divisions of memory, specifically for executing things fast. It often serves as a cache of information in RAM, but CPU registers could be considered a form of memory too. CPU caches come in different sizes for algorithms requiring more memory, namely L1, L2 cache etc. The lower the level, the smaller the cache and the faster your code can execute with it.
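To put a rough number on the distance argument, here is a small back-of-the-envelope sketch. It computes how far light travels in a vacuum during one clock cycle; distancePerCycle is a name made up for this example, and real signals in a conductor are slower than this, so treat the result as an upper bound only.

```haskell
-- | Speed of light in a vacuum, in metres per second.
speedOfLight :: Double
speedOfLight = 299792458

-- | Upper bound on the distance (in metres) a signal can cover in one
-- clock cycle at the given clock frequency in GHz.
distancePerCycle :: Double -> Double
distancePerCycle clockGHz = speedOfLight / (clockGHz * 1e9)

main :: IO ()
main = mapM_ report [1, 3, 5]
  where
    -- Print the bound in centimetres for each clock speed.
    report ghz =
      putStrLn (show ghz ++ " GHz: at most "
                ++ show (distancePerCycle ghz * 100) ++ " cm per cycle")
```

At 3 GHz this comes to roughly 10 cm per cycle, and a request to memory has to make the trip there and back, which is one way to see why memory that must be read within a cycle or two has to sit very close to the core.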

Http vs plain old sockets

In the old days, most services did not use HTTP. FTP, IRC and SMTP each specify their own language for communication. Nowadays, many more things are delivered via HTTP: social media apps, RSS, Git over HTTP, email APIs over HTTP. This site, for example.

I can think of some reasons why this is the case.

  1. HTTP performs slowly, so restricted networks are more likely to allow it, as its users are less likely to connect to things like file sharing networks.
  2. HTTP is connection-less (or conventionally so; see long polling), which reduces memory consumption.
  3. HTTP can be tested in a web browser which is slightly easier than using telnet.
  4. HTTP supports compression and encryption headers.
  5. HTTP has a standard for submitting the domain name, allowing for Virtual Hosts.
  6. HTTP supports basic auth.
  7. Probably other stuff.

Plain sockets do not provide this stuff out of the box, so you would have to roll your own. WebSockets are the connectionful version of HTTP.
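To make the comparison concrete, here is a rough sketch of the text an HTTP/1.1 client writes to a plain socket for a simple request. It only builds the request string; buildGetRequest is a name invented for this example, and the choice of headers is an assumption for illustration. The Host line is what makes virtual hosts possible, and Accept-Encoding is how compression is negotiated.

```haskell
import Data.List (intercalate)

-- | Build the raw text of a minimal HTTP/1.1 GET request.
-- Lines are separated by CRLF and the header section ends with a blank line.
buildGetRequest :: String -> String -> String
buildGetRequest host path =
  intercalate "\r\n"
    [ "GET " ++ path ++ " HTTP/1.1"  -- method, path and protocol version
    , "Host: " ++ host               -- lets one server host many domains
    , "Accept-Encoding: gzip"        -- offer to accept a compressed response
    , "Connection: close"            -- ask the server to close afterwards
    , ""                             -- blank line terminates the headers
    , ""
    ]

main :: IO ()
main = putStr (buildGetRequest "example.com" "/")
```

A service using its own protocol over a raw socket would have to define and parse an equivalent of each of these lines itself.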

HTTP of course adds its own overhead, with the HTTP method names like GET, but if you can get past that, it can be pretty lucrative.