Package pylal :: Module scrapeHtmlUtils :: Class scrapePage

Class scrapePage

This class is responsible for taking in our expected HTML-formatted file and allowing us to manipulate the central table of interest, while keeping the rest of the HTML available for later writing to disk.

Instance Methods
 
__init__(self)
 
setContextKeys(self, newStartKey="", newEndKey="")
Calling self.setContextKeys lets you specify two new context keys that select a single table from a parsed HTML file.
 
readfile(self, filename)
Reads in a text html file given a filename.
 
__createTableObject__(self, inputHTML=None)
Given a list of text strings, manipulate these strings to create a table object.
 
getColumnByText(self, textString='', colNum=1)
Finds the row whose Column #1 matches the given text string and returns the entry in the column specified by colNum.
 
showRows(self)
Call this method after reading the html to literally see the row labels inside the HTML table we are manipulating.
 
getRowList(self)
Returns the list of rows in the table held by this instance.
 
showCols(self)
Call this method after reading the html to literally see the column labels inside of the html table we are manipulating.
 
getColumnByCoord(self, RowNum, ColNum)
Given a row number and column number return that element in the table.
 
insertTextAtCoord(self, RowNum, ColNum, Text)
Given a row number and column number insert the argument text over what currently exists.
 
insertTextGivenText(self, matchText, colNum, Text)
Looks for the row whose column 1 matches the given text.
 
__buildMiddleOfPage__(self)
This method should not be called explicitly.
 
buildTableHTML(self, formattingTxt="")
Call this method to build a single string corresponding to the HTML you want, beginning with <table> and ending with </table>.
 
writeTableHTML(self, filename="table.html", formattingTxt="")
Call this method to write just the html for creating the table to a file.
 
__stripHTMLTags__(self, stringIN)
Takes the input string and removes all tags enclosed in < > delimiters.
 
__stripRowNumber__(self, stringA)
Takes the string representing the table row number.
 
__compareKeyWords__(self, stringA="", stringB="", exact=False)
Break stringA into keywords minus html tags.
 
writeHTML(self, filename)
Writes the manipulated HTML out to the file filename.
Method Details

setContextKeys(self, newStartKey="", newEndKey="")

Calling self.setContextKeys lets you specify two new context keys that select a single table from a parsed HTML file. Each of the two arguments must be a key that is exactly one entire line of the source HTML file from which you want to extract the table. This lets the code save the surrounding HTML and lets you edit the entries in the more natural form table[row][col]. Do not set the keys to partial line matches or non-printing characters; doing so will almost certainly cause the HTML parser to fail.
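The whole-line matching described above can be illustrated with a minimal sketch. It is not the pylal implementation: split_on_context_keys and the sample page are hypothetical, and the sketch simply assumes that the page is held as a list of lines and that a key matches only when it equals an entire line.

```python
def split_on_context_keys(lines, start_key, end_key):
    """Split lines into (top, middle, bottom) around whole-line keys.

    Hypothetical helper, not part of pylal. A key matches only when it
    equals an entire line (ignoring the trailing newline).
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    try:
        start = stripped.index(start_key)
        end = stripped.index(end_key, start + 1)
    except ValueError:
        # A key that is only a partial line match selects no table.
        return lines, [], []
    top = lines[:start + 1]        # everything up to and including the start key
    middle = lines[start + 1:end]  # the table we want to edit
    bottom = lines[end:]           # the end key and everything after it
    return top, middle, bottom


# Made-up sample page; the comment lines play the role of context keys.
page = [
    "<html><body>\n",
    "<!-- table begin -->\n",
    "<table>\n",
    "<tr><td>#01 Item</td><td>value</td></tr>\n",
    "</table>\n",
    "<!-- table end -->\n",
    "</body></html>\n",
]
top, middle, bottom = split_on_context_keys(
    page, "<!-- table begin -->", "<!-- table end -->")
```

Because the three pieces concatenate back to the original list, the surrounding HTML survives untouched while only the middle piece is edited.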

__createTableObject__(self, inputHTML=None)

Given a list of text strings, manipulate these strings to create a table object. If inputHTML is None, the method works with the self.middleOfPage variable.

getColumnByText(self, textString='', colNum=1)

Finds the row whose Column #1 matches the given text string and returns the entry in the column specified by colNum. If nothing is found, an empty string is returned.
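As a rough illustration of this lookup, the sketch below assumes the table is a plain list of rows, each a list of cell strings, that "Column #1" is the first cell of a row, that colNum counts from 1, and that a match means the text occurs somewhere in that first cell. The function get_column_by_text is hypothetical and only mirrors the documented behavior; the real method may index or match differently.

```python
def get_column_by_text(table, text, col_num=1):
    """Return cell col_num (counted from 1) of the first row whose
    first cell contains text, or "" when nothing matches.

    Hypothetical stand-in for getColumnByText; indexing is an assumption.
    """
    for row in table:
        if row and text in row[0]:
            if 1 <= col_num <= len(row):
                return row[col_num - 1]
            return ""  # matching row exists, but the column does not
    return ""          # no row matched


# Made-up sample table.
table = [["#01 Alpha", "yes"], ["#02 Beta", "no"]]
found = get_column_by_text(table, "Beta", 2)
missing = get_column_by_text(table, "Gamma", 2)
```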

getRowList(self)

Returns the list of rows in the table held by this instance. The data is returned as a list of two-element lists, for example [[a,b],[c,d],...,[y,z]].

getColumnByCoord(self, RowNum, ColNum)

Given a row number and column number, return that element of the table. If the coordinates do not exist, return an empty string.

insertTextAtCoord(self, RowNum, ColNum, Text)

Given a row number and column number, insert the argument text over what currently exists. If RowNum or ColNum is out of bounds, do nothing.
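The out-of-bounds behavior of the two coordinate methods (getColumnByCoord and insertTextAtCoord) can be sketched as follows. This is an illustration under assumptions, not the pylal code: the helper names are hypothetical, the table is taken to be a list of lists, and coordinates are assumed to count from 0.

```python
def in_bounds(table, row_num, col_num):
    """True when (row_num, col_num) addresses an existing cell.

    Negative indices are rejected explicitly, since Python would
    otherwise wrap them around to the end of the list.
    """
    return 0 <= row_num < len(table) and 0 <= col_num < len(table[row_num])


def get_by_coord(table, row_num, col_num):
    """Return the cell at the coordinates, or "" if they do not exist."""
    return table[row_num][col_num] if in_bounds(table, row_num, col_num) else ""


def insert_at_coord(table, row_num, col_num, text):
    """Overwrite the cell at the coordinates; do nothing if out of bounds."""
    if in_bounds(table, row_num, col_num):
        table[row_num][col_num] = text


# Made-up sample table.
table = [["#01 Alpha", "yes"], ["#02 Beta", "no"]]
insert_at_coord(table, 0, 1, "maybe")  # replaces "yes"
insert_at_coord(table, 0, 9, "lost")   # silently ignored
```

Failing silently keeps calling code short, at the cost of hiding a mistyped coordinate, so a caller that needs certainty can read the cell back afterwards.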

insertTextGivenText(self, matchText, colNum, Text)

Looks for the row whose column 1 matches the given text, then inserts Text into the column specified by colNum. If there is no match or colNum is out of bounds, nothing is done.

__buildMiddleOfPage__(self)

This method should not be called explicitly. It rebuilds the table object variable into a chunk of HTML for writing to disk.
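To make the idea of rebuilding HTML from the table object concrete, here is a minimal sketch. It assumes the table is a list of rows of cell strings and that any formatting text belongs inside the opening table tag. Both points are assumptions, and build_table_html is a hypothetical name rather than the class's own method.

```python
def build_table_html(table, formatting_txt=""):
    """Serialize a list-of-lists table into an HTML table string.

    Hypothetical sketch: treating formatting_txt as attribute text in
    the opening tag is an assumption. Cell contents are written as-is,
    so any markup already present in a cell is preserved.
    """
    attrs = " " + formatting_txt if formatting_txt else ""
    parts = ["<table%s>" % attrs]
    for row in table:
        cells = "".join("<td>%s</td>" % cell for cell in row)
        parts.append("<tr>%s</tr>" % cells)
    parts.append("</table>")
    return "\n".join(parts)


html_text = build_table_html([["#01 Alpha", "yes"], ["#02 Beta", "no"]])
```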

__stripRowNumber__(self, stringA)

Takes the string representing the table row number and strips the number string from the front. The input string is assumed to have the form "#?? Word Words More Words", where the only number is #??. Ideally this method should only be called by self.__compareKeyWords__().
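One way to strip such a prefix is a single regular expression. The sketch below is a hypothetical stand-in for the method and assumes that "#??" means a hash sign followed by one or more digits.

```python
import re


def strip_row_number(label):
    """Drop a leading '#NN' token and the whitespace around it.

    Hypothetical sketch; assumes the row number is '#' plus digits.
    Strings without such a prefix are returned unchanged.
    """
    return re.sub(r"^\s*#\d+\s*", "", label)


cleaned = strip_row_number("#12 Word Words More Words")
```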

__compareKeyWords__(self, stringA="", stringB="", exact=False)

Break stringA into keywords minus HTML tags, then make sure these words exist inside stringB. If exact is True, a string like "Big blue bird" will not match "Big blue pretty bird". If exact is left at its default (False), the strings match, since all the words in the first string are contained in the second.
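The difference between the two modes can be sketched as below. The function names are hypothetical, and the sketch assumes that keywords are whitespace-separated words, that comparison is case-sensitive, and that an exact match means the two word lists are identical once tags are removed. The actual method may differ on any of these points.

```python
import re


def strip_html_tags(text):
    """Remove everything enclosed in < > delimiters."""
    return re.sub(r"<[^>]*>", "", text)


def compare_keywords(string_a, string_b, exact=False):
    """Compare the words of string_a against string_b, ignoring HTML tags.

    Hypothetical sketch. With exact=False every word of string_a must
    appear among the words of string_b; with exact=True the two word
    lists must be identical.
    """
    words_a = strip_html_tags(string_a).split()
    words_b = strip_html_tags(string_b).split()
    if exact:
        return words_a == words_b
    return all(word in words_b for word in words_a)


loose = compare_keywords("<b>Big</b> blue bird", "Big blue pretty bird")
strict = compare_keywords("Big blue bird", "Big blue pretty bird", exact=True)
```

Note that in this sketch an empty first string matches anything in the default mode, because there are no words left to check.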