CoffeeFiend (posted February 21, 2012):

Edit: forget it then. My solution (now removed) made too much sense, I guess: parse the main page for the URLs of its 5 child pages, parse the child pages for the PDFs linked in them, and then download the PDFs, saving them with the proper names in the first place (using VBScript as the language, ServerXMLHTTP to make HTTP requests, and regular expressions to extract information from the HTML).

Have fun stringing hunks of archaic 1980s technology (cryptic batch files) together with non-built-in tools, only to end up with the same results via many unnecessary steps and over-complications, adding external dependencies in the process, while steering clear of anything remotely modern, if that's your thing.
DosCode (thread author, posted February 21, 2012; edited February 21, 2012 by DosCode):

I don't download the documents manually; I have a script that uses CMD and wget. I also rename the files, and I process all the information from the site with CMD. I simply want a batch file to be the main executable. The files I have on my PC are renamed by the script: when I downloaded EK_GEN_0_1_en.pdf I renamed it to 0_1_en.pdf, and when I downloaded gen_4_2.pdf I renamed it to 4_2.pdf.

Quote: If you have Windows, which supports a plethora of modern scripting technologies, then why would you pick an archaic 1980s MS-DOS (i.e. pre-Windows) solution? There is far better and more modern stuff built in, like VBScript, JScript and now PowerShell as well (and even compilers for C# & VB on recent versions of Windows). I don't really see how they would make the job easier or better, and if you want to learn "the Windows way" then it's typically not the way we do it (pretending it works like Linux, and using its tools instead of what Windows already has).

I want to do it in a way that I like and that is close to what I already know. I don't want to use VB or anything that is too far from what I have learned so far.
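The renaming described in the post above can be illustrated with a short sketch. It is written in Python purely for illustration (the thread itself uses batch and VBScript), and the rule it applies, "drop everything up to and including a leading gen_ prefix, in any letter case", is only an assumption inferred from the two file names given in the post. The function name `short_name` is hypothetical.

```python
import re

# Assumed rule, inferred from the two examples in the post: remove everything
# up to and including the first "gen_" (case-insensitive) at the start of the name.
_PREFIX = re.compile(r"^.*?gen_", re.IGNORECASE)


def short_name(filename: str) -> str:
    """Return the file name without its leading '...gen_' prefix, if it has one."""
    return _PREFIX.sub("", filename, count=1)


print(short_name("EK_GEN_0_1_en.pdf"))  # 0_1_en.pdf
print(short_name("gen_4_2.pdf"))        # 4_2.pdf
```

A name without the prefix is returned unchanged, so the function could be applied to every downloaded file without checking first.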
Yzöwl (posted February 21, 2012):

The methods you want to use are not really suitable for the task. You may be able to get your car up the street by pulling it with its built-in winch, but it's a darn sight faster and more assured if you use its built-in engine!
bphlpt (posted February 21, 2012; edited February 21, 2012 by bphlpt):

CoffeeFiend, even if the OP wants to do it his way, which is perfectly fine, I would love to see an example of what you have described. I can think of other possible situations where knowing how to do this would be handy. I know CMD and wget, can make do with VBScript or JScript, and can muddle through regular expressions. But I'm not familiar with ServerXMLHTTP, and I'm always willing to learn new things. I personally learn best with a couple of examples I can edit and model, along with links to reference documents where I can look things up. If you don't want to post an example here, since the OP has no interest in it, would you mind sending it to me via PM? I'd be much obliged.

If you do decide to post it here, who knows? Maybe the OP will see the error of his ways and see that it's not as difficult as he imagines.

Cheers and Regards
DosCode (thread author, posted February 21, 2012; edited February 21, 2012 by DosCode):

Here is where I am so far:

    @echo off
    setlocal EnableDelayedExpansion
    set "source=GEN 0 GENERAL.html"
    set "pdf=0_1_en.pdf"
    echo In file:%source%
    echo Look for anchor:%pdf%
    rem Process each line in %source% file:
    for /F "usebackq delims=" %%c in ("%source%") do (
        set "line=%%c"
        REM Test if the line contains pdf file I look for:
        SET "pdfline=!line:%pdf%=!"
        if not "!pdfline!" == "!line!" (
            cls
            echo Line: !line!
            REM Test if the pdfline contains tag b
            if not "!pdfline:*><b>=!" == "!pdfline!" (
                cls
                echo I HAVE IT!
                set "tag=!pdfline:<b>=$!"
                set "tag=!tag:</b>=$!"
                for /F "tokens=2 delims=$" %%b in ("!tag!") do set title=%%b
                echo Title found: "!title!"
                pause
            )
        )
    )
    pause

There are still other ways to do this part, but I think it is your turn now.
CoffeeFiend (posted February 21, 2012):

Quote: I would love to see an example as you have described. I can think of other possible situations where knowing how to do this would be handy.

That's a good point. Here's the VBScript again for those it might help at some point:

    Option Explicit
    Dim oXmlHttp, oRegExp, oMatch, adoStr, sChildPages(), i, url

    'fetch the main page
    Set oXmlHttp = CreateObject("Msxml2.ServerXMLHTTP.6.0")
    oXmlHttp.Open "GET", "http://www.slv.dk/Dokumenter/dsweb/View/Collection-357", False
    oXmlHttp.Send

    'extract the URLs of the child pages
    Set oRegExp = New RegExp
    oRegExp.IgnoreCase = True
    oRegExp.Global = True
    oRegExp.Pattern = "<a\shref=""(/Dokumenter/dsweb/View/Collection-\d*)"">"
    Set oMatch = oRegExp.Execute(oXmlHttp.ResponseText)
    If oMatch.Count = 0 Then WScript.Quit

    'really ugly hack where we skip the first child page found (itself)
    ReDim sChildPages(oMatch.Count-2)
    For i = 1 to oMatch.Count-1
        sChildPages(i-1) = "http://www.slv.dk" & oMatch.Item(i).Submatches(0)
    Next

    'fetch each child page and download every PDF linked in it
    oRegExp.Pattern = "<a\shref=""(/Dokumenter/dsweb/Get/Document.*pdf)""\sclass=""uline""><b>(.*?)</b>"
    For Each url in sChildPages
        oXmlHttp.Open "GET", url, False
        oXmlHttp.Send
        Set oMatch = oRegExp.Execute(oXmlHttp.ResponseText)
        For i = 0 to oMatch.Count-1
            DownloadBinaryFile "http://www.slv.dk" & oMatch.Item(i).Submatches(0), oMatch.Item(i).Submatches(1) & ".pdf"
        Next
    Next

    Function DownloadBinaryFile(sUrl, sFileName)
        oXmlHttp.Open "GET", sUrl, False
        oXmlHttp.Send
        Set adoStr = CreateObject("ADODB.Stream")
        adoStr.Type = 1 'adTypeBinary
        adoStr.Open
        adoStr.Write oXmlHttp.ResponseBody
        adoStr.SaveToFile sFileName, 2 'adSaveCreateOverWrite
        adoStr.Close
    End Function

It's pretty ugly, there's no error handling of any kind and all that, but it gets the job done. Writing essentially the same thing in other languages should be pretty straightforward too (most of the work here is getting the regular expressions right). And in most cases it would be nicer, better and simpler too (VBScript data structures suck hard, downloading binary files here is a bit of a hack, error handling is beyond awful, etc.).
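As a rough illustration of the remark that the same thing is straightforward in other languages, here is a minimal Python sketch of the script's two parsing steps. It reuses the two regular expressions from the VBScript unchanged. Everything else is an assumption made for the example: the function name `find_pdfs`, the idea of passing in a `fetch` callable instead of doing real HTTP requests, and the made-up pages in `FAKE_SITE`, which only stand in for the real site so the sketch runs offline. It lists what would be downloaded rather than downloading anything.

```python
import re
from typing import Callable, Iterator, Tuple

BASE = "http://www.slv.dk"

# The same two patterns as in the VBScript above.
CHILD_RE = re.compile(
    r'<a\shref="(/Dokumenter/dsweb/View/Collection-\d*)">', re.IGNORECASE)
PDF_RE = re.compile(
    r'<a\shref="(/Dokumenter/dsweb/Get/Document.*pdf)"\sclass="uline"><b>(.*?)</b>',
    re.IGNORECASE)


def find_pdfs(fetch: Callable[[str], str], start_url: str) -> Iterator[Tuple[str, str]]:
    """Yield (pdf_url, file_name) pairs found on the child pages of start_url."""
    child_paths = CHILD_RE.findall(fetch(start_url))
    # Like the VBScript, skip the first match (the page's link to itself).
    for path in child_paths[1:]:
        for pdf_path, title in PDF_RE.findall(fetch(BASE + path)):
            yield BASE + pdf_path, title + ".pdf"


# Hypothetical pages (not the real site's HTML), keyed by URL.
FAKE_SITE = {
    BASE + "/Dokumenter/dsweb/View/Collection-357":
        '<a href="/Dokumenter/dsweb/View/Collection-357">\n'
        '<a href="/Dokumenter/dsweb/View/Collection-358">\n',
    BASE + "/Dokumenter/dsweb/View/Collection-358":
        '<a href="/Dokumenter/dsweb/Get/Document-1/EK_GEN_0_1_en.pdf"'
        ' class="uline"><b>GEN 0.1</b></a>\n',
}

start = BASE + "/Dokumenter/dsweb/View/Collection-357"
for url, name in find_pdfs(FAKE_SITE.__getitem__, start):
    print(url, "->", name)
```

To use it against a live server, `fetch` would have to be replaced by a function that performs the HTTP GET and returns the page text, and each yielded URL would then be saved under the yielded file name.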
DosCode (thread author, posted February 21, 2012; edited February 21, 2012 by DosCode):

Check my last post. I edited it; it now contains code that could be merged into the code I posted before, although one small change still needs to be made.
bphlpt (posted February 21, 2012; edited February 21, 2012 by bphlpt):

CoffeeFiend - That is marvelous! It certainly does get the job done!

I copied your script above and saved it as "GetPdf.vbs". I put the file in an otherwise empty folder named "Temp", opened a command prompt in that folder, ran the file with the command

    cscript getpdf.vbs

and waited. The script ran silently. When it completed and I got the command prompt back, I refreshed the contents of "Temp" in Windows Explorer, and all the PDF files that the OP was looking for were in the folder and named correctly! The files opened correctly in Foxit Reader, my PDF reader of choice. Absolutely no problems at all. No extra files, nothing to rename, and no external apps required that weren't already part of Windows 7. It all looked great!

Now that I know it WORKS, I've just got to do some reading so I can understand WHY it works and HOW I need to modify it to meet future needs. I hate to ask for more after you've put this together, but to save time blindly using Google, would you mind pointing me to a few links where I can read about the key parts of your script? Maybe The Scripting Guys address something similar?

Many thanks!

Cheers and Regards
jaclaz (posted February 21, 2012):

This might do:

    @echo off
    setlocal EnableDelayedExpansion
    set "source=GEN 0 GENERAL.html"
    echo In file:%source%
    ECHO.
    FOR /F "tokens=* delims=" %%A IN ('FIND ".pdf" "%source%" ^|FIND "href"^|FIND "class="') DO (
    SET Line="%%A"
    CALL :Line_process
    )
    GOTO :EOF

    :Line_process
    SET Line=!Line:^<=§!
    SET Line=!Line:^>=§!
    :loop_href
    IF NOT "!line:~1,4!"=="href" SET Line="!Line:~2,-1!"&GOTO :loop_href
    SET File="!Line:~7,-1!"
    CALL :File_name !File!
    SET Line="!File:%Filepath%=!"
    :loop_class
    IF NOT "!line:~1,4!"=="§§b§" SET Line="!Line:~2,-1!"&GOTO :loop_class
    SET File="!Line:~4,-1!"
    FOR /F "tokens=1 delims=§" %%B IN (!File!) DO SET FileTitle=%%B&SET File=
    SET File
    GOTO :EOF

    :File_name
    SET Filepath="%~1"
    SET Filename="%~nx1"
    GOTO :EOF

jaclaz
bphlpt (posted February 21, 2012; edited February 21, 2012 by bphlpt):

DosCode - No offense was meant, and I'm a CMD script fan. I've written CMD scripts that are almost 3000 lines long. There are definitely cases, IMO, where CMD script is faster and more flexible than other options. There's a reason it has existed since before Windows up to the present day. It can be very powerful. I encourage you to pursue learning how to accomplish what you want using CMD script.

But I would also STRONGLY suggest that you try CoffeeFiend's script at least once, using the instructions I listed in my last post. CoffeeFiend accomplished in that one post everything you have asked for in every post you have made here since you became a member. Everything. CoffeeFiend and I don't need to continue on. That is all you need. If nothing else, you can keep it as a backup approach.

The post is appropriate to leave in this thread. This section deals with all types of scripting, not just CMD script, and that script deals directly with what you wanted to accomplish as an end goal. Others who read this thread might be interested in alternative approaches, as I was. You don't have to listen to our advice. But the threads here are for the benefit of all readers, not just you. Our two posts do not distract from your overall goal as much as your own posts, which have been scattered over multiple threads and have yet to produce a working solution. You have been asking about bits and pieces for a week now, and we didn't even know what your overall goal was until 18 hours ago.

Now that jaclaz, our CMD script wizard, has stepped in, I'm sure he can help you come up with a script that meets your needs, and I wish you well. But that does not make the alternatives less valid.

Cheers and Regards
CoffeeFiend (posted February 21, 2012):

Quote: I hate to ask for more after you've put this together, but to save time blindly using Google, would you mind pointing me to a few links where I can read about the key parts of your script?

There is no central place for all of this that I'm aware of, nor am I a VBScript guru (I've mostly given up on it, and most of my VBScript knowledge dates back to the Win2k era, part of it being from writing classic ASP pages). As such, I'm not certain what the best resources out there are today. But here are some bits and pieces that might help:

Msxml2.ServerXMLHTTP.6.0 is one of several objects you can use to get content from the web (just like web pages that use "AJAX" stuff). The Open method initializes the request: it sets which HTTP verb to use and the URL. The Send method is what actually makes the request and gets the response (HTML here) back in the object's ResponseText property, which I then parse using regular expressions.

As for using regular expressions, the idea is to design them to have submatches for the content you want (the desired chunks surrounded by parentheses). Then you already have the info you want without further parsing or processing.

And finally, the regular expressions explained:

    <a\shref="(/Dokumenter/dsweb/View/Collection-\d*)">

    <a        matches literal text
    \s        matches a whitespace character (here, a space)
    href="    matches more literal text
    (         marks the beginning of the information I'm interested in (the submatch, which here is the URL of a child page)
    /Dokumenter/dsweb/View/Collection-    matches some more literal text
    \d        matches a numeric digit (0 to 9)
    *         means that the preceding digit can be present any number of times (zero or more)
    )         marks the end of the information I care about
    ">        matches literal text

    <a\shref="(/Dokumenter/dsweb/Get/Document.*pdf)"\sclass="uline"><b>(.*?)</b>

    <a        matches literal text
    \s        matches a whitespace character (here, a space)
    href="    matches more literal text
    (         marks the beginning of the information I'm interested in (the submatch, which here is the URL of the PDF)
    /Dokumenter/dsweb/Get/Document    matches literal text
    .         matches any character
    *         which is there zero times or more
    pdf       matches more literal text
    )         marks the end of the information I care about
    "         literal text
    \s        whitespace (a space)
    class="uline"><b>    literal text
    (         marks the beginning of the second group of info I want (the next numbered submatch, which is the desired filename for the PDF)
    .         is still any old character
    *?        is a "fancier" version of * which matches any number of times, but keeps the selection as short as possible
    )         marks the end of the 2nd group
    </b>      literal text

I think this covers the most interesting parts. Not that I use this myself for page scraping/parsing, mind you.
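The difference between `*` and `*?`, and the way submatches are read back, can be tried out in a few lines. The sketch below uses Python's `re` module rather than VBScript's RegExp object, and the HTML line in it is invented for the demonstration. One detail that differs between the two: VBScript numbers submatches from 0 (`Submatches(0)`), while Python numbers groups from 1 (`group(1)`).

```python
import re

# Invented sample line with two <b> elements, to make the difference visible.
html = ('<a href="/Dokumenter/dsweb/Get/Document-1/a.pdf" class="uline">'
        '<b>First</b> and <b>Second</b>')

# Lazy (.*?): stops at the first </b> it can reach.
lazy = re.search(r"<b>(.*?)</b>", html).group(1)
# Greedy (.*): runs as far as possible, so it stops at the last </b>.
greedy = re.search(r"<b>(.*)</b>", html).group(1)
print(lazy)    # First
print(greedy)  # First</b> and <b>Second

# The second pattern from the post, with its two submatches.
PDF_RE = re.compile(
    r'<a\shref="(/Dokumenter/dsweb/Get/Document.*pdf)"\sclass="uline"><b>(.*?)</b>')
match = PDF_RE.search(html)
print(match.group(1))  # /Dokumenter/dsweb/Get/Document-1/a.pdf
print(match.group(2))  # First
```

Note that the first group uses the greedy `.*`, so on a line containing several PDF links it would stretch to the last one that is followed by the `class="uline"` text; the lazy form avoids that kind of over-matching.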
bphlpt (posted February 22, 2012):

Thank you.

Cheers and Regards
Aacini (posted February 22, 2012; edited February 23, 2012 by Aacini):

Hi. I am a newbie at this forum. I was invited here by DosCode to post my latest solution to this problem, which was developed in detail on another site. So here it is:

    @echo off
    setlocal EnableDelayedExpansion
    set "source=GEN 0 GENERAL.html"
    set "pdf=0_1_en.pdf"
    echo In file: "%source%"
    echo Look for anchor: "%pdf%"
    for /F "delims=" %%c in ('findstr /C:"<a " "%source%" ^| findstr /C:"%pdf%"') do (
        set "tag=%%c"
        rem Get the value of "<b>" sub-tag
        set "tag=!tag:<b>=$!"
        set "tag=!tag:</b>=$!"
        for /F "tokens=2 delims=$" %%b in ("!tag!") do set title=%%b
        echo Title found: "!title!"
    )

If you wish, you may review the development of this solution at this site.

Regards...

Antonio

EDIT: I fixed a detail in my code (changed "<a>" to "<a ") that prevented it from running correctly...
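For comparison, the same lookup (find the line that holds an anchor to a given PDF, then take the text between `<b>` and `</b>`) can be sketched in Python. This is only an illustration of the technique, not a replacement for the batch file: the function name `find_title` and the sample HTML, including the titles in it, are made up for the example, and unlike the batch loop it returns only the first matching line.

```python
import re
from typing import Optional


def find_title(html: str, pdf: str) -> Optional[str]:
    """Return the <b>...</b> text of the first line with an anchor to pdf."""
    for line in html.splitlines():
        # Same two filters as the chained findstr calls: "<a " and the PDF name.
        if "<a " in line and pdf in line:
            match = re.search(r"<b>(.*?)</b>", line)
            if match:
                return match.group(1)
    return None


# Hypothetical page content; the real "GEN 0 GENERAL.html" is not shown in the thread.
sample = (
    '<p>Index</p>\n'
    '<a href="docs/0_1_en.pdf" class="uline"><b>GEN 0.1 Preface</b></a>\n'
    '<a href="docs/0_2_en.pdf" class="uline"><b>GEN 0.2 Amendments</b></a>\n'
)

print(find_title(sample, "0_1_en.pdf"))  # GEN 0.1 Preface
```

When no line matches, the function returns `None`, which makes the "not found" case explicit instead of leaving a variable unset as in the batch version.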
Yzöwl (posted February 22, 2012):

It is not the solution! Unless the original remit has changed, it is simply a sub-function.

BTW, I'm unable to access the link to take a look at your .html (I cannot access 'http://www.slv.dk/Dokumenter/' at all; I receive a "Connection closed by remote server" message).
DosCode (thread author, posted February 22, 2012):

Quote: It is not the solution! Unless the original remit has changed, it is simply a sub-function. BTW, I'm unable to access the link to take a look at your .html (I cannot access 'http://www.slv.dk/Dokumenter/' at all; I receive a "Connection closed by remote server" message).

It is a solution for the sub-function. I guess you tested the Visual Basic script from the guy who posted here, and the admin has blocked your access. I did not tell anyone to go and download from or overload their server. If anybody did so, it is their own fault. I am not glad this happened. But to be exact, I did not test or check that VB script; I just guess it contained instructions to download the PDF files. I will paste my code here after your answer.