Need htm code guru


I'm writing a small program (actually a VBScript) that strips the following from htm documents:

-scripts

-reference to scripts

-missing/unsaved images

-etc

The main goal is to remove every reference that triggers an Internet connection when the page is opened, and also to get rid of any useless garbage.

Which code should or shouldn't I delete? I want the script to process the file line by line, since that is faster than going character by character...

Copy and paste the code into Notepad and save it as cleanhtm.vbs to try it.

I think my method is quite radical, and of course some pages behave weirdly after cleaning, but so far I haven't noticed any loss of useful data.

Sometimes stray bits of leftover code show up in the page view... :)

On a few pages it failed to remove all the auto-connecting references...

'Clear garbage from HTML files
'Written by ***Fredledingue***  fredledingo@yahoo.com
'-----------------------------------------------------------
Const ForReading = 1, ForWriting = 2, ForAppending = 3
Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0
Const OPEN_FILE_FOR_APPENDING = 8
Title = "HTML Cleaner 1.0"
'--------browse for a folder---------
set oShell = CreateObject("Shell.Application")
Set oF = oShell.BrowseForFolder(0,  "Select the folder where the HTML files are located" , 0, "My Computer")
'BrowseForFolder returns Nothing when the dialog is canceled
If oF Is Nothing Then
MsgBox "Action canceled.",,Title
wscript.quit
End If
path = oF.Items.Item.path
Set fso = CreateObject("Scripting.FileSystemObject")
Set Folder = fso.getfolder(Path)
Set Files = folder.files
'------Check if Dirty_htm exists, and if not create it--------------
DirtyHtm = path & "\Dirty_Htm"
If Fso.FolderExists(DirtyHtm) Then
MsgBox "The original files will be saved in the folder """ & DirtyHtm & """",,Title
Else
Fso.CreateFolder(DirtyHtm)
MsgBox "A folder """ & DirtyHtm & """ was created to save the original files.",,Title
End If
'------------Ask to confirm for each file-------
If (MsgBox("Confirm for each file?", vbYesNo, Title) = vbYes) Then
AskToConfirm = True
Else
AskToConfirm = False
End If
'-------------Process files-----------------
For Each File in files
FileName = file.name
FilePath = file.path
GoAhead = False
If LCase(fso.GetExtensionName(FilePath))="htm" Or LCase(fso.GetExtensionName(FilePath))="html" Then
'------confirm-----
If AskToConfirm = True Then
 If (MsgBox("Clean """ & FileName & """ ?", vbYesNo, Title) = vbYes) Then
 GoAhead = True
 End If
Else
GoAhead = True
End If
'---------process the file line by line-------------
If GoAhead = True Then
NewHtm = "<!--version modified by Fredledingue's html cleaner -->" & VbCrlf
LineCount = 0
DelLineCount = 0
DelImgCount = 0
IsScript = False
AlreadyDone = False
Set f = fso.GetFile(FilePath)
Set ts = f.OpenAsTextStream(ForReading, Tristatefalse)
 Do Until ts.AtEndOfStream
 t = ts.ReadLine
  If t = "<!--version modified by Fredledingue's html cleaner -->" Then
  AlreadyDone = True
  Exit Do
  End If
 t=Replace(t,"<NOSCRIPT","<br")
 t=Replace(t,"<noscript","<br")
 t=Replace(t,"</NOSCRIPT","<br")
 t=Replace(t,"</noscript","<br")
 t=Replace(t,"text/css","text(slash)css")
 t=Replace(t,"@import url","(at)import(space)url")
 t=Replace(t,"TEXT/CSS","TEXT(slash)CSS")
 t=Replace(t,"@IMPORT URL","(at)IMPORT(space)URL")
 t=Replace(t,".css","(dot)css")
 t=Replace(t,".js","(dot)js")
 t=Replace(t,".vbs","(dot)vbs")
 t=Replace(t,".CSS","(dot)CSS")
 t=Replace(t,".JS","(dot)JS")
 t=Replace(t,".VBS","(dot)VBS")
 LineCount = LineCount +1
  '-------remove script-----
  If IsScript = True Then
  DelLineCount = DelLineCount +1
   If InStr(t,"</script")>0 Or InStr(t,"</SCRIPT")>0 Then
   IsScript = False
   End If
  Else
  If InStr(t,"<script")>0 Or InStr(t,"<SCRIPT")>0 Then
  DelLineCount = DelLineCount +1
   If InStr(t,"</script")=0 And InStr(t,"</SCRIPT")=0 Then
   IsScript = True
   End If
  Else
  If InStr(t,"href='javascript")>0 Then
  DelLineCount = DelLineCount +1
  Else
  '--------remove missing images------
  If InStr(t,"src=""")>0 Then
  u=InStr(t,"src=""")+5
  v=Mid(t,u)
  w=InStr(v,"""")
  img = Left(v,w-1)
   If Fso.FileExists(path & "/" & img) Then
    If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
    t= ">" & Replace(t, "src=""", "<img src=""")
    End If
   NewHtm = NewHtm & VbCrlf & t
   Else
   DelLineCount = DelLineCount +1
   DelImgCount = DelImgCount +1
   End If
  img = ""
  Else
  If InStr(t,"SRC=""")>0 Then
  u=InStr(t,"SRC=""")+5
  v=Mid(t,u)
  w=InStr(v,"""")
  img = Left(v,w-1)
   If Fso.FileExists(path & "/" & img) Then
    If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
    t= ">" & Replace(t, "SRC=""", "<IMG SRC=""")
    End If
   NewHtm = NewHtm & VbCrlf & t
   Else
   DelLineCount = DelLineCount +1
   DelImgCount = DelImgCount +1
   End If
  img = ""
  Else
  '-------
  If InStr(t,"src=")>0 Then
  u=InStr(t,"src=")+4
  v=Mid(t,u)
  w=InStr(v,".")
  img = Left(v,w+3)
   If Fso.FileExists(path & "/" & img) Then
    If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
    t= ">" & Replace(t, "src=", "<img src=")
    End If
   NewHtm = NewHtm & VbCrlf & t
   Else
   DelLineCount = DelLineCount +1
   DelImgCount = DelImgCount +1
   End If
  img = ""
  Else
  If InStr(t,"SRC=")>0 Then
  u=InStr(t,"SRC=")+4
  v=Mid(t,u)
  w=InStr(v,".")
  img = Left(v,w+3)
   If Fso.FileExists(path & "/" & img) Then
    If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
    t= ">" & Replace(t, "SRC=", "<IMG SRC=")
    End If
   NewHtm = NewHtm & VbCrlf & t
   Else
   DelLineCount = DelLineCount +1
   DelImgCount = DelImgCount +1
   End If
  img = ""
  Else
  '------
  If InStr(t,"><img")>0 Then
  t= Replace(t,"><img","")
  t= replace(t,"<nobr","<a")
  NewHtm = NewHtm & VbCrlf & t
  Else
  If InStr(t,"><IMG")>0 Then
  t= Replace(t,"><IMG","")
  t= replace(t,"<NOBR","<A")
  NewHtm = NewHtm & VbCrlf & t
  '------
  Else  
  NewHtm = NewHtm & VbCrlf & t
  End If
  End If
  End If
  End If
  End If
  End If
  End If
  End If
  End If
 '----------------------------
 Loop
ts.Close
 If AlreadyDone = True Then
 Msg = Msg & """" & FileName & """: " & VbTab & "File already cleaned." & VbCrlf
 Else
 Msg = Msg & """" & FileName & """: " & VbTab & DelLineCount & " of " & LineCount & " lines removed (" & DelImgCount & " images missing)" & VbCrlf
 '--------Move original file to Dirty_htm-----
 f.Move DirtyHtm & "\" & FileName
 '--------write to file-----------
 Set objOutputFile = fso.CreateTextFile(path & "\" & FileName, True)
 objOutputFile.Write NewHtm
 objOutputFile.Close
 '---------------------------------
 End If     '-------refers to "If AlreadyDone = True"
End if       '------refers to "If GoAhead = True"
End If      '-----refers to "If fso.GetExtensionName(FilePath)="htm"..."
Next
MsgBox "Result:" & VbCrlf & Msg,, Title
'---------End of Script-----------------



The script reads the htm file line by line.

Each line is processed in the variable t.

The parts of the script below are applied to each line, inside a loop.

This is the part that deals with the htm code (what comes before and after it is just interface stuff).

Here, pieces of htm code are simply replaced by something that disables them.

It's the equivalent of "Edit --> Replace --> Replace All" in Word.

For example, I replace <NOSCRIPT with <br .

t=Replace(t,"<NOSCRIPT","<br")
t=Replace(t,"<noscript","<br")
t=Replace(t,"</NOSCRIPT","<br")
t=Replace(t,"</noscript","<br")
t=Replace(t,"text/css","text(slash)css")
t=Replace(t,"@import url","(at)import(space)url")
t=Replace(t,"TEXT/CSS","TEXT(slash)CSS")
t=Replace(t,"@IMPORT URL","(at)IMPORT(space)URL")
t=Replace(t,".css","(dot)css")
t=Replace(t,".js","(dot)js")
t=Replace(t,".vbs","(dot)vbs")
t=Replace(t,".CSS","(dot)CSS")
t=Replace(t,".JS","(dot)JS")
t=Replace(t,".VBS","(dot)VBS")
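
The upper/lowercase pairs above could be collapsed by passing vbTextCompare to Replace, which makes the match case-insensitive. This is only a sketch: NeutralizeRefs is a hypothetical helper name, not part of the script, and unlike the original lines it always writes the replacement in lowercase.

```vbscript
' Sketch only: NeutralizeRefs is a hypothetical helper, not part of the original script.
Function NeutralizeRefs(ByVal t)
    Dim pairs, parts, i
    ' Each entry is "find|replacement"; the list mirrors the replacements above.
    pairs = Array("<noscript|<br", "</noscript|<br", _
                  "text/css|text(slash)css", "@import url|(at)import(space)url", _
                  ".css|(dot)css", ".js|(dot)js", ".vbs|(dot)vbs")
    For i = 0 To UBound(pairs)
        parts = Split(pairs(i), "|")
        ' start = 1, count = -1 (all occurrences), vbTextCompare = ignore case
        t = Replace(t, parts(0), parts(1), 1, -1, vbTextCompare)
    Next
    NeutralizeRefs = t
End Function
```

In the loop it would be called as t = NeutralizeRefs(t), which also catches mixed-case spellings such as ".Css" that the paired lines miss.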

This just counts lines (not important for now).

Below you will see DelLineCount, which counts the lines that have been dropped.

LineCount = LineCount +1

This removes scripts from the page. It detects where a script ends ( </script )

 '-------remove script-----
 If IsScript = True Then
 DelLineCount = DelLineCount +1
  If InStr(t,"</script")>0 Or InStr(t,"</SCRIPT")>0 Then
  IsScript = False
  End If

and where it starts ( <script ).

(For programming reasons, I decided to put the end check before the start check.)

 Else
 If InStr(t,"<script")>0 Or InStr(t,"<SCRIPT")>0 Then
 DelLineCount = DelLineCount +1
  If InStr(t,"</script")=0 And InStr(t,"</SCRIPT")=0 Then
  IsScript = True
  End If
 Else
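
The same start/end logic can be written once with a case-insensitive InStr, so that spellings like <Script> are caught too. A minimal sketch, assuming two hypothetical helpers (HasTag and IsScriptLine) that are not in the original script:

```vbscript
' Sketch only: HasTag and IsScriptLine are hypothetical names.
Function HasTag(ByVal t, ByVal tag)
    ' vbTextCompare makes InStr ignore case, so "<Script" and "<SCRIPT" both match.
    HasTag = (InStr(1, t, tag, vbTextCompare) > 0)
End Function

' Returns True when the line belongs to a script block and should be dropped.
' inScript is passed ByRef and carries the state from one line to the next.
Function IsScriptLine(ByVal t, ByRef inScript)
    If inScript Then
        IsScriptLine = True
        If HasTag(t, "</script") Then inScript = False
    ElseIf HasTag(t, "<script") Then
        IsScriptLine = True
        ' Stay in script mode only if the block does not close on the same line.
        If Not HasTag(t, "</script") Then inScript = True
    Else
        IsScriptLine = False
    End If
End Function
```

The end check still comes first, as in the script: the function has to know whether it is already inside a script block before it looks for a new opening tag.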

This detects href='javascript and then drops the line that contains it.

 If InStr(t,"href='javascript")>0 Then
 DelLineCount = DelLineCount +1
 Else
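
The test above only matches a single-quoted, lowercase href='javascript. A sketch of a looser check using the VBScript RegExp object follows; HasJavascriptHref is a hypothetical name, and the pattern (which also requires the colon after "javascript") is an assumption about what the links look like, not something taken from the script.

```vbscript
' Sketch only: HasJavascriptHref is a hypothetical helper.
Function HasJavascriptHref(ByVal t)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = True
    ' href, optional spaces around "=", an optional quote of either kind, then "javascript:"
    re.Pattern = "href\s*=\s*[""']?\s*javascript:"
    HasJavascriptHref = re.Test(t)
End Function
```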

This checks whether the image in a reference exists (meaning it was saved with the page, so the link to it should remain).

It detects src=" and src=

Once in lowercase, once in uppercase.

If the file exists, it adds <img before src= if it is not already present.

Because htm code can be split and wrapped in many ways, the <img of a tag can sit on one line while its src= is on the next.

So on that next line the script must add <img back.

 '--------remove missing images------
 If InStr(t,"src=""")>0 Then
 u=InStr(t,"src=""")+5
 v=Mid(t,u)
 w=InStr(v,"""")
 img = Left(v,w-1)
  If Fso.FileExists(path & "/" & img) Then
   If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
   t= ">" & Replace(t, "src=""", "<img src=""")
   End If
  NewHtm = NewHtm & VbCrlf & t
  Else
  DelLineCount = DelLineCount +1
  DelImgCount = DelImgCount +1
  End If
 img = ""
 Else
 If InStr(t,"SRC=""")>0 Then
 u=InStr(t,"SRC=""")+5
 v=Mid(t,u)
 w=InStr(v,"""")
 img = Left(v,w-1)
  If Fso.FileExists(path & "/" & img) Then
   If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
   t= ">" & Replace(t, "SRC=""", "<IMG SRC=""")
   End If
  NewHtm = NewHtm & VbCrlf & t
  Else
  DelLineCount = DelLineCount +1
  DelImgCount = DelImgCount +1
  End If
 img = ""
 Else
 '-------
 If InStr(t,"src=")>0 Then
 u=InStr(t,"src=")+4
 v=Mid(t,u)
 w=InStr(v,".")
 img = Left(v,w+3)
  If Fso.FileExists(path & "/" & img) Then
   If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
   t= ">" & Replace(t, "src=", "<img src=")
   End If
  NewHtm = NewHtm & VbCrlf & t
  Else
  DelLineCount = DelLineCount +1
  DelImgCount = DelImgCount +1
  End If
 img = ""
 Else
 If InStr(t,"SRC=")>0 Then
 u=InStr(t,"SRC=")+4
 v=Mid(t,u)
 w=InStr(v,".")
 img = Left(v,w+3)
  If Fso.FileExists(path & "/" & img) Then
   If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
   t= ">" & Replace(t, "SRC=", "<IMG SRC=")
   End If
  NewHtm = NewHtm & VbCrlf & t
  Else
  DelLineCount = DelLineCount +1
  DelImgCount = DelImgCount +1
  End If
 img = ""
 Else
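
The four branches above differ only in letter case and quoting, so the file name could be pulled out in one place. This is a sketch under assumptions: GetSrcValue is a hypothetical helper, and the pattern stops at the first quote, space or ">", so a file name containing spaces would be cut short.

```vbscript
' Sketch only: GetSrcValue is a hypothetical helper.
Function GetSrcValue(ByVal t)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    ' src, optional spaces around "=", an optional quote, then capture up to the next quote, space or ">"
    re.Pattern = "src\s*=\s*[""']?([^""'\s>]+)"
    Set matches = re.Execute(t)
    If matches.Count > 0 Then
        GetSrcValue = matches(0).SubMatches(0)
    Else
        GetSrcValue = ""
    End If
End Function
```

The result could then be passed to Fso.FileExists(path & "/" & img) exactly as the script does now. Because the capture ends at the closing quote or ">", it also avoids the Left(v,w+3) guess that every extension has three letters.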

This deletes ><img when there is no src= on the line (src= may be on the next line).

 '------
 If InStr(t,"><img")>0 Then
 t= Replace(t,"><img","")
 t= replace(t,"<nobr","<a")
 NewHtm = NewHtm & VbCrlf & t
 Else
 If InStr(t,"><IMG")>0 Then
 t= Replace(t,"><IMG","")
 t= replace(t,"<NOBR","<A")
 NewHtm = NewHtm & VbCrlf & t
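
Since tags can wrap across lines, an alternative to line-by-line processing is to read the whole file (for example with ts.ReadAll) and apply one pattern to the full text. A minimal sketch for script blocks only, assuming a hypothetical helper named StripScriptBlocks; it is not the method the script uses, and I have not measured how its speed compares.

```vbscript
' Sketch only: StripScriptBlocks is a hypothetical helper.
Function StripScriptBlocks(ByVal html)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    ' [\s\S] matches any character including line breaks; *? stops at the first closing tag
    re.Pattern = "<script[\s\S]*?</script\s*>"
    StripScriptBlocks = re.Replace(html, "")
End Function
```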
 

I hope you understand at least a little of what I wrote... :blink:
