Fredledingue Posted May 7, 2005

I'm writing a program (actually a VBScript) to strip the following from .htm documents:
- scripts
- references to scripts
- missing/unsaved images
- etc.

The main goal is to remove every reference that triggers an internet connection, along with any useless garbage.

Which code should I delete, and which should I leave alone? I want the script to process the file line by line, since that's faster than going character by character...

Copy and paste the code into Notepad and save it as cleanhtm.vbs to try it.

I think my method is quite radical. Some pages do behave weirdly after cleaning, but so far I haven't noticed any loss of useful data. Sometimes stray bits of leftover code show up in the page view... On a few pages it didn't remove all the auto-connecting references...

    'Clear garbage from html files
    'Written by ***Fredledingue*** fredledingo@yahoo.com
    '-----------------------------------------------------------
    Const ForReading = 1, ForWriting = 2, ForAppending = 3
    Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0
    Const OPEN_FILE_FOR_APPENDING = 8
    Title = "HTML Cleaner 1.0"
    '--------browse for a folder---------
    'Error trapping is needed here: if the dialog is canceled, oF is Nothing
    On Error Resume Next
    Set oShell = CreateObject("Shell.Application")
    Set oF = oShell.BrowseForFolder(0, "Select the folder where the html files are located", 0, "My Computer")
    path = oF.Items.Item.path
    If Err.Number <> 0 Then
        MsgBox "Action canceled.",,Title
        WScript.Quit
    End If
    On Error GoTo 0
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Folder = fso.GetFolder(path)
    Set Files = Folder.Files
    '------Check if Dirty_htm exists, and if not create it--------------
    DirtyHtm = path & "\Dirty_Htm"
    If fso.FolderExists(DirtyHtm) Then
        MsgBox "The original files will be saved in the folder """ & DirtyHtm & """",,Title
    Else
        fso.CreateFolder(DirtyHtm)
        MsgBox "A folder """ & DirtyHtm & """ was created to save the original files.",,Title
    End If
    '------------Ask to confirm for each file-------
    If (MsgBox("Confirm for each file?", vbYesNo, Title) = vbYes) Then
        AskToConfirm = True
    Else
        AskToConfirm = False
    End If
    '-------------Process files-----------------
    For Each File In Files
    FileName = File.Name
    FilePath = File.Path
    GoAhead = False
    If fso.GetExtensionName(FilePath)="htm" Or fso.GetExtensionName(FilePath)="html" Then
        '------confirm-----
        If AskToConfirm = True Then
            If (MsgBox("Clean """ & FileName & """?", vbYesNo, Title) = vbYes) Then
                GoAhead = True
            End If
        Else
            GoAhead = True
        End If
        '---------process the file line by line-------------
        If GoAhead = True Then
            NewHtm = "<!--version modified by Fredledingue's html cleaner -->" & VbCrlf
            LineCount = 0
            DelLineCount = 0
            DelImgCount = 0
            IsScript = False
            AlreadyDone = False
            Set f = fso.GetFile(FilePath)
            Set ts = f.OpenAsTextStream(ForReading, TristateFalse)
            Do Until ts.AtEndOfStream
                t = ts.ReadLine
                If t = "<!--version modified by Fredledingue's html cleaner -->" Then
                    AlreadyDone = True
                    Exit Do
                End If
                t=Replace(t,"<NOSCRIPT","<br")
                t=Replace(t,"<noscript","<br")
                t=Replace(t,"</NOSCRIPT","<br")
                t=Replace(t,"</noscript","<br")
                t=Replace(t,"text/css","text(slash)css")
                t=Replace(t,"@import url","(at)import(space)url")
                t=Replace(t,"TEXT/CSS","TEXT(slash)CSS")
                t=Replace(t,"@IMPORT URL","(at)IMPORT(space)URL")
                t=Replace(t,".css","(dot)css")
                t=Replace(t,".js","(dot)js")
                t=Replace(t,".vbs","(dot)vbs")
                t=Replace(t,".CSS","(dot)CSS")
                t=Replace(t,".JS","(dot)JS")
                t=Replace(t,".VBS","(dot)VBS")
                LineCount = LineCount +1
                '-------remove script-----
                If IsScript = True Then
                    DelLineCount = DelLineCount +1
                    If InStr(t,"</script")>0 Or InStr(t,"</SCRIPT")>0 Then
                        IsScript = False
                    End If
                ElseIf InStr(t,"<script")>0 Or InStr(t,"<SCRIPT")>0 Then
                    DelLineCount = DelLineCount +1
                    If InStr(t,"</script")=0 And InStr(t,"</SCRIPT")=0 Then
                        IsScript = True
                    End If
                ElseIf InStr(t,"href='javascript")>0 Then
                    DelLineCount = DelLineCount +1
                '--------remove missing images------
                ElseIf InStr(t,"src=""")>0 Then
                    u=InStr(t,"src=""")+5
                    v=Mid(t,u)
                    w=InStr(v,"""")
                    img = Left(v,w-1)
                    If fso.FileExists(path & "/" & img) Then
                        If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                            t= ">" & Replace(t, "src=""", "<img src=""")
                        End If
                        NewHtm = NewHtm & VbCrlf & t
                    Else
                        DelLineCount = DelLineCount +1
                        DelImgCount = DelImgCount +1
                    End If
                    img = ""
                ElseIf InStr(t,"SRC=""")>0 Then
                    u=InStr(t,"SRC=""")+5
                    v=Mid(t,u)
                    w=InStr(v,"""")
                    img = Left(v,w-1)
                    If fso.FileExists(path & "/" & img) Then
                        If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                            t= ">" & Replace(t, "SRC=""", "<IMG SRC=""")
                        End If
                        NewHtm = NewHtm & VbCrlf & t
                    Else
                        DelLineCount = DelLineCount +1
                        DelImgCount = DelImgCount +1
                    End If
                    img = ""
                '-------
                ElseIf InStr(t,"src=")>0 Then
                    u=InStr(t,"src=")+4
                    v=Mid(t,u)
                    w=InStr(v,".")
                    img = Left(v,w+3)
                    If fso.FileExists(path & "/" & img) Then
                        If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                            t= ">" & Replace(t, "src=", "<img src=")
                        End If
                        NewHtm = NewHtm & VbCrlf & t
                    Else
                        DelLineCount = DelLineCount +1
                        DelImgCount = DelImgCount +1
                    End If
                    img = ""
                ElseIf InStr(t,"SRC=")>0 Then
                    u=InStr(t,"SRC=")+4
                    v=Mid(t,u)
                    w=InStr(v,".")
                    img = Left(v,w+3)
                    If fso.FileExists(path & "/" & img) Then
                        If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                            t= ">" & Replace(t, "SRC=", "<IMG SRC=")
                        End If
                        NewHtm = NewHtm & VbCrlf & t
                    Else
                        DelLineCount = DelLineCount +1
                        DelImgCount = DelImgCount +1
                    End If
                    img = ""
                '------
                ElseIf InStr(t,"><img")>0 Then
                    t= Replace(t,"><img","")
                    t= Replace(t,"<nobr","<a")
                    NewHtm = NewHtm & VbCrlf & t
                ElseIf InStr(t,"><IMG")>0 Then
                    t= Replace(t,"><IMG","")
                    t= Replace(t,"<NOBR","<A")
                    NewHtm = NewHtm & VbCrlf & t
                '------
                Else
                    NewHtm = NewHtm & VbCrlf & t
                End If
                '----------------------------
            Loop
            ts.Close
            If AlreadyDone = True Then
                Msg = Msg & """" & FileName & """: " & VbTab & "File already cleaned." & VbCrlf
            Else
                Msg = Msg & """" & FileName & """: " & VbTab & DelLineCount & " of " & LineCount & " lines removed (" & DelImgCount & " images missing)" & VbCrlf
                '--------Move original file to Dirty_htm-----
                f.Move DirtyHtm & "\" & FileName
                '--------write to file-----------
                Set objOutputFile = fso.CreateTextFile(path & "\" & FileName, True)
                objOutputFile.Write NewHtm
                objOutputFile.Close
                '---------------------------------
            End If '-------refers to "If AlreadyDone = True"
        End If '------refers to "If GoAhead = True"
    End If '-----refers to "If fso.GetExtensionName(FilePath)="htm"..."
    Next
    MsgBox "Result:" & VbCrlf & Msg,, Title
    '---------End of Script-----------------
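One possible reason some references survive the cleaning: VBScript's `InStr` and `Replace` compare case-sensitively by default, so the tests in the script catch `<script` and `<SCRIPT` but not a mixed-case spelling such as `<Script`. The sketch below shows how the optional compare argument (`vbTextCompare`) could cover every spelling in one call. The helper names `NeutralizeLine` and `ContainsText` are hypothetical and not part of the posted script, and the replacement text always comes out in lowercase, which differs slightly from the original's separate upper/lowercase pairs.

```vbscript
Option Explicit

' Hypothetical helper (not in the posted script): apply the same
' "kill" replacements to one line, ignoring letter case.
Function NeutralizeLine(ByVal t)
    Dim pairs, i
    ' find/replace pairs taken from the posted script (lowercase forms only)
    pairs = Array( _
        "<noscript", "<br", _
        "</noscript", "<br", _
        "text/css", "text(slash)css", _
        "@import url", "(at)import(space)url", _
        ".css", "(dot)css", _
        ".js", "(dot)js", _
        ".vbs", "(dot)vbs")
    For i = 0 To UBound(pairs) Step 2
        ' start = 1, count = -1 (replace all), compare = vbTextCompare
        t = Replace(t, pairs(i), pairs(i + 1), 1, -1, vbTextCompare)
    Next
    NeutralizeLine = t
End Function

' Hypothetical helper: case-insensitive "does the line contain this text?"
Function ContainsText(ByVal t, ByVal needle)
    ' when a compare mode is given, InStr also needs the start position
    ContainsText = (InStr(1, t, needle, vbTextCompare) > 0)
End Function

' Example: mixed-case markup is matched as well
WScript.Echo NeutralizeLine("<NoScript><LINK href=""a.CSS"">")
WScript.Echo ContainsText("<Script Language=JavaScript>", "<script")
```

With helpers like these, a test such as `InStr(t,"<script")>0 Or InStr(t,"<SCRIPT")>0` could become `ContainsText(t, "<script")`. Whether that is worth the change depends on how much mixed-case markup the saved pages actually contain.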
matrix0978 Posted May 8, 2005

Wait, so what exactly does it do?
Fredledingue Posted May 8, 2005

The script reads the .htm file line by line. Each line is held in the variable t. The parts of the script below are applied to every line, inside a loop. This is the part that deals with the HTML code (everything before and after it is just interface stuff).

Here, parts of the HTML code are simply replaced by something that disables them. It's the equivalent of "Edit --> Replace --> Replace All" in Word. For example, I replace <NOSCRIPT with <br .

    t=Replace(t,"<NOSCRIPT","<br")
    t=Replace(t,"<noscript","<br")
    t=Replace(t,"</NOSCRIPT","<br")
    t=Replace(t,"</noscript","<br")
    t=Replace(t,"text/css","text(slash)css")
    t=Replace(t,"@import url","(at)import(space)url")
    t=Replace(t,"TEXT/CSS","TEXT(slash)CSS")
    t=Replace(t,"@IMPORT URL","(at)IMPORT(space)URL")
    t=Replace(t,".css","(dot)css")
    t=Replace(t,".js","(dot)js")
    t=Replace(t,".vbs","(dot)vbs")
    t=Replace(t,".CSS","(dot)CSS")
    t=Replace(t,".JS","(dot)JS")
    t=Replace(t,".VBS","(dot)VBS")

This just counts lines (not important for now). Further down you will see DelLineCount, which counts the lines that have been dropped.

    LineCount = LineCount +1

This removes scripts from the page. It detects where a script ends ( </script )

    '-------remove script-----
    If IsScript = True Then
        DelLineCount = DelLineCount +1
        If InStr(t,"</script")>0 Or InStr(t,"</SCRIPT")>0 Then
            IsScript = False
        End If

and where it starts ( <script ). For programming reasons, I decided to test for the end before the start: once IsScript is set, every following line is dropped until a line containing </script switches it off again.

    ElseIf InStr(t,"<script")>0 Or InStr(t,"<SCRIPT")>0 Then
        DelLineCount = DelLineCount +1
        If InStr(t,"</script")=0 And InStr(t,"</SCRIPT")=0 Then
            IsScript = True
        End If

This detects href='javascript and drops the line that contains it.

    ElseIf InStr(t,"href='javascript")>0 Then
        DelLineCount = DelLineCount +1

This checks whether the image in a reference exists (meaning it was saved with the page, so the link to it should stay). It detects src=" and src= , once in lowercase and once in uppercase. If the file exists, it adds <img before src= when it's not already present. Because HTML code can be split and wrapped in many ways, an <img tag may sit on one line while its src= is on the next one. In that case the <img has to be added on that next line.

    '--------remove missing images------
    ElseIf InStr(t,"src=""")>0 Then
        u=InStr(t,"src=""")+5
        v=Mid(t,u)
        w=InStr(v,"""")
        img = Left(v,w-1)
        If Fso.FileExists(path & "/" & img) Then
            If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                t= ">" & Replace(t, "src=""", "<img src=""")
            End If
            NewHtm = NewHtm & VbCrlf & t
        Else
            DelLineCount = DelLineCount +1
            DelImgCount = DelImgCount +1
        End If
        img = ""
    ElseIf InStr(t,"SRC=""")>0 Then
        u=InStr(t,"SRC=""")+5
        v=Mid(t,u)
        w=InStr(v,"""")
        img = Left(v,w-1)
        If Fso.FileExists(path & "/" & img) Then
            If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                t= ">" & Replace(t, "SRC=""", "<IMG SRC=""")
            End If
            NewHtm = NewHtm & VbCrlf & t
        Else
            DelLineCount = DelLineCount +1
            DelImgCount = DelImgCount +1
        End If
        img = ""
    '-------
    ElseIf InStr(t,"src=")>0 Then
        u=InStr(t,"src=")+4
        v=Mid(t,u)
        w=InStr(v,".")
        img = Left(v,w+3)
        If Fso.FileExists(path & "/" & img) Then
            If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                t= ">" & Replace(t, "src=", "<img src=")
            End If
            NewHtm = NewHtm & VbCrlf & t
        Else
            DelLineCount = DelLineCount +1
            DelImgCount = DelImgCount +1
        End If
        img = ""
    ElseIf InStr(t,"SRC=")>0 Then
        u=InStr(t,"SRC=")+4
        v=Mid(t,u)
        w=InStr(v,".")
        img = Left(v,w+3)
        If Fso.FileExists(path & "/" & img) Then
            If InStr(t,"<img ")=0 And InStr(t,"<IMG ")=0 Then
                t= ">" & Replace(t, "SRC=", "<IMG SRC=")
            End If
            NewHtm = NewHtm & VbCrlf & t
        Else
            DelLineCount = DelLineCount +1
            DelImgCount = DelImgCount +1
        End If
        img = ""

This deletes ><img when there is no src= on the line (because src= might be on the next line):

    '------
    ElseIf InStr(t,"><img")>0 Then
        t= Replace(t,"><img","")
        t= Replace(t,"<nobr","<a")
        NewHtm = NewHtm & VbCrlf & t
    ElseIf InStr(t,"><IMG")>0 Then
        t= Replace(t,"><IMG","")
        t= Replace(t,"<NOBR","<A")
        NewHtm = NewHtm & VbCrlf & t

I hope that you understand a little bit of what I wrote...
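The four src branches described above differ only in letter case and in whether the value is quoted, and the unquoted ones assume a three-letter extension (`Left(v,w+3)`). As a rough sketch of an alternative, the built-in `RegExp` object could pull out the src value in one step. The function name `GetSrc` is hypothetical and not part of the posted script, and the pattern only looks at a single line, so an src= whose value continues on the next line would still be missed.

```vbscript
Option Explicit

' Hypothetical helper (not in the posted script): return the value of the
' first src attribute found on a line, or "" when there is none.
Function GetSrc(ByVal t)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True   ' matches src=, SRC=, Src= ...
    re.Global = False      ' only the first match is needed
    ' Three alternatives: "double-quoted", 'single-quoted',
    ' or unquoted up to the next space, quote or >
    re.Pattern = "\bsrc\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>""']+))"
    GetSrc = ""
    Set matches = re.Execute(t)
    If matches.Count > 0 Then
        ' Only one of the three groups takes part in a match;
        ' the others concatenate as empty strings.
        With matches(0)
            GetSrc = .SubMatches(0) & .SubMatches(1) & .SubMatches(2)
        End With
    End If
End Function

' Example calls
WScript.Echo GetSrc("<IMG SRC=""pics/a.gif"" width=1>")
WScript.Echo GetSrc("<img src=logo.jpeg>")
```

The returned value could then be handed to `Fso.FileExists(path & "/" & img)` exactly as in the original branches, with an empty string meaning that the line has no src= at all.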