Text.finder looking for values in a webpage

Status
Not open for further replies.

dr pepper

Well-Known Member
Most Helpful Member
I want to parse some numeric values off a site.
I used text.finder on a small site on one of my servers, and it works.
Now I'm trying it on a secure HTTPS client on an ESP8266. It works, but there are some strings it doesn't find.
It seems that it doesn't 'look' inside CSS directives or values inside <> tags. I thought the function just looked at the data stream rather than discriminating between text and tags.
Can anyone tell me more about this?
 
Do you have a link to this Text.finder (part of HTTP.server?)
 
Looking at the linked page, it looks like all it does is string matching, i.e. there is no parsing of the stream when you use the find() function. I'm guessing that there's some text or whitespace in the HTML that isn't accounted for in the string passed to find().

Can you provide code + html?
 
Yes, that's what I thought.
The docs say that finder.getvalue() looks for the next valid numeric value after the search string is found, so I was thinking that so long as the first value after the search string is actually the value I need, it should work.
I'm now thinking that using a site with ads is probably not a good idea, as the finder might get mixed up with a matching string in an ad. Also, in the EU we get a message about cookies that we have to agree to, and that could be messing things up.
I'm going to try using a .org site that doesn't use ads and see if that works.
The code is on my PC at home, and I'm at work right now.
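
For anyone following along, the "find the string, then take the next number" behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the library's actual code. The function name `value_after` and the page fragment are made up for the example.

```python
import re

# An optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def value_after(text, key):
    """Return the first number that appears after `key`, or None."""
    pos = text.find(key)
    if pos == -1:
        return None  # the key is not in the downloaded data at all
    match = _NUMBER.search(text, pos + len(key))
    return float(match.group()) if match else None


# Hypothetical page fragment, invented for illustration.
page = '<span class="wind-speed">12.5</span> mph'
print(value_after(page, "wind-speed"))
```

Note that the first number after the key wins, so any digits that sit between the key and the wanted value (in an attribute, an ad, or a cookie banner) would be returned instead.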
 
Just been playing with Firefox's web page HTML editor (press F12).
Looking at a site with the editor, I found some data for wind speed. Searching for wind-speed as it appears in the HTML found it using the editor's search function (along with other unwanted instances of wind-speed). However, when I searched for "wind-speed", which is exactly how it appears in the page, the search couldn't find it.
I wonder if this is some silly Unicode thing where there's more than one code for the " symbol and my search string is looking for the wrong one. I'm using the \ escape character in the IDE so as not to confuse the compiler.
I'll try a double search, where it looks for another string on the same line first, then looks for the wanted data. That way I can search without using the " symbol and filter out unwanted results.
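
The double search described above could look roughly like this in Python. It is a sketch under assumptions: the helper name `find_after_anchor`, the anchor string, and the page text are all invented for the example, and a real page would need its own anchor.

```python
def find_after_anchor(text, anchor, target):
    """Return the index just past `target`, searching only after `anchor`.

    Returns -1 if either string is missing.
    """
    start = text.find(anchor)
    if start == -1:
        return -1
    pos = text.find(target, start + len(anchor))
    return -1 if pos == -1 else pos + len(target)


# Hypothetical page: the first wind-speed hits are unwanted link text.
page = (
    '<a href="/wind-speed-help">About wind-speed</a>'
    '<div id=current>wind-speed: 14 mph</div>'
)
end = find_after_anchor(page, "id=current", "wind-speed")
print(page[end:end + 8])
```

Neither search string contains a " character, so the quoting question never comes up, and the earlier unwanted matches are skipped because the second search starts after the anchor.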
 
Quotes ["] are encoded as html entities, so a literal search will not find them in the page source.

Try this:

&quot;wind-speed&quot;
 
If you're looking at a site that uses JavaScript to update fields on the page, you can't just download the HTML. Try downloading the site using curl, and check whether any of the data you want is there.
 
JR, yes, that's another way of doing it. However, if I've got it right, using "\"" should pass a single " to the system.
That said, your way works. My IDE might need updating.

doug, I'm not familiar with curl. I have written some simple JavaScript, and I thought that it was still downloaded as text before it was compiled and executed. Maybe I'm wrong, I'm new to this stuff.

Anyway, I sussed it. I replaced the search function with one that just prints the downloaded data to the text terminal, and guess what I got: access denied. For some reason the SSL function doesn't work on some sites, so I tried another site, and this time it printed out the entire page. When I then used the find function, it worked perfectly.

Thanks all.
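
The debugging step that cracked it (look at what was actually downloaded before searching it) can be sketched in Python as below. The function names `fetch_text` and `diagnose` and the sample responses are made up for illustration. The second sample also covers the earlier point about pages that fill in their data with JavaScript.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Download `url` and return (status, body) without raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body, e.g. an "access denied" page.
        return err.code, err.read().decode("utf-8", errors="replace")


def diagnose(status, body, key):
    """Classify why a search for `key` would succeed or fail."""
    if status != 200:
        return f"blocked: server answered {status}"
    if key not in body:
        return "missing: page downloaded, but the key is not in the raw HTML"
    return "found"


# Made-up responses for illustration; no network access is needed to run these.
print(diagnose(403, "Access Denied", "wind-speed"))
print(diagnose(200, "<div id=app></div><script src=app.js></script>", "wind-speed"))
print(diagnose(200, "<td>wind-speed</td><td>14</td>", "wind-speed"))
```

With a real site you would pass the result of `fetch_text(url)` into `diagnose`. Some servers send their refusal page with a 200 status, so printing the body itself is still the surest check.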
 
Python has all kinds of web scraping tools if you want to go that route.
 
It's good to have done this, as I now understand pages and string searches better.
However, I think I'll do this the same way everyone else does and use an API.
I guess that's why such things were put in place.
 
JR, yes, that's another way of doing it. However, if I've got it right, using "\"" should pass a single " to the system.

To clarify, the double quote never exists as a visible character in a properly formatted web page source.
The page source text has &quot; which the web browser translates on the fly to "

No matter how you enter a " in your comparison search string, it's not in the page to be compared to.


Just use "view source" on the page you are trying to search and have a look at the unformatted text, to see if it uses any entities rather than literal characters.
 
Yes I think I got it.
I might see a ", but it's been converted from &quot;.
That explains why I couldn't search for it.
 
If I download an HTML page I get lots of double quotes. It's only the ones in actual content that are &quot;.

Or did I miss something?

Mike.
 
&quot; or &#34; will display a quote mark. (&#38; is the ampersand.)
 
If I download an HTML page I get lots of double quotes. It's only the ones in actual content that are &quot;.

Or did I miss something?

You are correct.
The actual quotes in the source are part of the html syntax, ones that are to be displayed are encoded as the &quot; entity so they are not confused with the internal markup.
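
A quick way to see the difference between markup quotes and content quotes is the Python snippet below. The HTML line is made up for illustration. `html.unescape` is the standard-library call that turns entities back into characters.

```python
import html

# Hypothetical source line: literal quotes in the markup, entities in the content.
source = '<p class="reading">The field &quot;wind-speed&quot; reads 12</p>'

# Literal quotes that belong to the HTML syntax are present as-is.
markup_hit = 'class="reading"' in source

# A quoted word in the displayed text is NOT there with literal quotes...
literal_hit = '"wind-speed"' in source

# ...but it is there in its entity form,
entity_hit = '&quot;wind-speed&quot;' in source

# and the literal form matches once the entities are decoded.
decoded_hit = '"wind-speed"' in html.unescape(source)

print(markup_hit, literal_hit, entity_hit, decoded_hit)
```

On a small microcontroller that searches the stream as it arrives, decoding the whole page first may not be practical, so searching for the entity form directly (as suggested earlier in the thread) is the likely alternative.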
 
I had a go at writing a web page a while back. I don't think I used &quot;, but I remember &nbsp; and a few others like that.
 