Search Engine


Perhaps the worst SEO post title I could possibly use. If you are still wondering what SEO is ask Nitin Parmar ;)

Still lost? SEO is Search Engine Optimisation. I’ve had a long-standing interest in SEO, primarily in selfish terms to try and get this blog more widely read, but also in a wider ‘educational resource discovery’ context. I was the author of the ‘SEO and discoverability’ chapter in Into the wild – Technology for open educational resources, which highlights the importance of SEO and the UKOER programme’s experiments with it.

So it’s perhaps no surprise that I agree with Tony:

Something I’m increasingly becoming aware of is that SEO is not just about having the right metadata on your webpage; in fact, arguably this is the least important aspect. The area I’m particularly interested in is the tools and techniques the SEO community use to gain ‘actionable insight’.

Ironically this is an area I’ve been contributing to without really knowing it. Someone who spotted this early was Wil Reynolds, founder of SEER Interactive:

SEER Interactive offer services in Paid Search Marketing and Search Engine Optimization, but what’s particularly interesting is their commitment to being “an Analytics first company, and we will not take on projects where we can’t analyze our impact on your business”.

So what do I do that’s of interest to the SEO community? Well, it seems that, like me, SEOers like a good old-fashioned spreadsheet. They also like a good old-fashioned spreadsheet that they can hook into social network channels. A recent example of this is the work Richard Baxter (CEO and founder of SEOgadget) presented at MOZCon, which extends TAGS (my Twitter Archiving Google Spreadsheet solution) to demonstrate How To Use Twitter Data for Really Targeted Outreach. The general synopsis is:

an alternative method to find sites that our target audiences may be sharing on Twitter. With that data, you can build content strategy, understand your market a little better, and construct an alternative outreach plan based on what real people are sharing and engaging with, rather than starting with sites that just rank for guest post queries.

It was really interesting to read how Richard had used the output from TAGS, ingesting it into Excel where additional free Excel-based SEO tools could be used to gain that all-important ‘actionable insight’.

So ‘learning tech and library folk’ if you are planning your next phase of CPD maybe you should be looking at some SEO training and perhaps I’ll see you at MOZCon next year ;)

One of the nice things about working for CETIS is that when organisations like the Association for Learning Technology (ALT) approach you to contribute to their projects it’s often an opportunity to field test cutting-edge innovation. This was the case when ALT recruited me* for their Maths Apps Index project, which is part of the maths4us initiative being co-ordinated by NIACE.

*I only work 0.8 FTE for CETIS, giving me room to explore other projects.

Maths App Index is designed as a community review site for maths-related resources. The idea builds on the Jisc-funded ‘Community-led Evaluation and Dissemination of Support Resources – Pilot’, which I was also involved with. One of the recommendations made as part of this pilot was better display and indexing of resource reviews. When ALT asked me for guidance on the latest project, and having kept abreast of the work my colleague Phil Barker was doing with new learning resource metadata standards, my immediate response was to use the Learning Resource Metadata Initiative (LRMI) properties being proposed for schema.org.

LRMI/schema.org

For those unfamiliar with LRMI/schema.org below is a short briefing note I prepared as part of the project:

There is a general trend in webpages away from hidden metadata (keywords and descriptions contained in the header of a page) towards structured markup. This is in part a move by search sites to prevent the manipulation of search rankings through hidden metadata. The solution has been to move towards combining human-oriented resource description and machine-readable metadata.

An example of this is the way Creative Commons embeds information about licenses in webpages. Their ‘license chooser’ tool generates extra HTML code for you to include in your distributed work. As well as the human-readable icon and/or text, the machine-readable markup includes the rel="license" attribute shown in Figure 1.

[Figure 1: Example of RDFa markup used in a Creative Commons license]
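For readers who can’t see the figure, the chooser’s embed code looks roughly like the snippet below (an illustrative sketch rather than the exact generated output; CC BY 3.0 is used as the example licence). The human-readable text and icon sit alongside the machine-readable rel="license" attribute:

    This work is licensed under a
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
      Creative Commons Attribution 3.0 Unported License</a>.
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
      <img alt="Creative Commons Licence" style="border-width:0"
           src="http://i.creativecommons.org/l/by/3.0/88x31.png" />
    </a>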

The inclusion of rel="license" allows search engines to identify that a resource might be released under a specific license, information which can then be used to facet search results.

Schema.org

A development in this area of particular significance is schema.org, an initiative involving Google, Yahoo!, Yandex and Microsoft Bing that aims to:

"… improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results." (schema.org, 2013)

There are two aspects to schema.org: a syntax for encoding parts of a page to identify additional metadata, and a shared schema of item types and their properties that makes it easier for search engines to consistently index information.
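As a minimal sketch (the item and values here are invented for illustration, not taken from any of the sites mentioned), schema.org markup wraps the visible, human-readable parts of a page in itemscope/itemtype/itemprop attributes so that search engines can pick out the same information a person reads:

    <div itemscope itemtype="http://schema.org/CreativeWork">
      <h1 itemprop="name">Introduction to Fractions</h1>
      <p itemprop="description">A short interactive tutorial on adding and subtracting fractions.</p>
      <a itemprop="url" href="http://example.com/fractions">View the resource</a>
    </div>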

Learning Resource Metadata Initiative (LRMI)

The Learning Resource Metadata Initiative (LRMI) is working to extend the controlled vocabulary used in describing educational resources in a way that is compatible with schema.org and other systems. This will mean search engines will be able to understand the information on web pages describing learning resources, making it easier for users to find them.

Table 1 is an extract from the draft LRMI Specification version 1.0[1] and lists the properties that can be used to describe a learning resource. Within schema.org/LRMI all properties are optional.

Table 1 LRMI Specification version 1.0

educationalAlignment: An alignment to an established educational framework.

educationalUse: The purpose of the work in the context of education. Ex: "assignment", "group work"

intendedEndUserRole: The individual or group for which the work in question was produced. Ex: "student", "teacher"

interactivityType: The predominant mode of learning supported by the learning resource. Acceptable values are active, expositive, or mixed. Ex: "active", "mixed"

isBasedOnUrl: A resource that was used in the creation of this resource. This term can be repeated for multiple sources. Ex: "http://example.com/great-multiplication-intro.html"

learningResourceType: The predominant type or kind characterizing the learning resource. Ex: "presentation", "handout"

timeRequired: Approximate or typical time it takes to work with or through this learning resource for the typical intended target audience. Ex: "P30M", "P1H25M"

typicalAgeRange: The typical range of ages of the content's intendedEndUser. Ex: "7-9", "18-"

useRightsUrl: The URL where the owner specifies permissions for using the resource. Ex: "http://creativecommons.org/licenses/by/3.0/", "http://publisher.com/content-use-description"


[1] http://wiki.creativecommons.org/LRMI/Properties

It’s worth noting that this was the draft specification; since then intendedEndUserRole has become educationalRole and, as noted by Phil, useRightsUrl hasn’t currently made the cut.

In action

Because we thought it was unrealistic for a reviewer to supply data like educationalAlignment, we opted for a subset of the LRMI markup. As the review site is a WordPress installation, each review is a user-submitted blog post, with the TDO Mini Forms plugin used to capture the additional metadata. This is then rendered in a modified version of the Sampression Lite theme. Below is an example of how LRMI is included within a review.

[Image: example of LRMI/schema.org markup in a review]
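Since the screenshot doesn’t reproduce here, the following is an illustrative sketch of the kind of markup involved (the property names are genuine LRMI/schema.org properties, but the app name, values and page structure are invented rather than copied from the live site):

    <article itemscope itemtype="http://schema.org/CreativeWork">
      <h1 itemprop="name">Example Maths App</h1>
      <p itemprop="description">A community review of an app for practising algebra.</p>
      <span itemprop="learningResourceType">interactive resource</span>
      <span itemprop="educationalUse">independent learning</span>
      <span itemprop="typicalAgeRange">16-19</span>
    </article>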

Where to next

When we plug a review page into Google’s Structured Data Testing Tool we can see Google is detecting all our lovely metadata. Within a Google Custom Search Engine (CSE) we can even use this to filter a search. For example, here are all the reviews with educationalUse set to independent learning. The problem, however, is that Google CSE doesn’t currently provide any tools to allow easy faceting of a custom search. Within the Maths App Index site we do have filtering for LRMI properties, but currently it’s not integrated with search (e.g. independent learning reviews). So, with little perceived benefit, why are we capturing this data now? It’s primarily about future-proofing: I for one wouldn’t like to go back over thousands of reviews adding the metadata.


If you haven’t already, you should check out Jorum's 2012 Summer of Enhancements and you’ll see it’s a lot more than a spring clean. In summary, there are four major projects going on:

  • JDEP - Improving discoverability through semantic technology
  • JEAP - Expanding Jorum’s collection through aggregation projects
  • JPEP - Exposing activity data and paradata
  • JUEP - Improving the front-end UI and user experience (UI/UX)
[Image: SEO the Game by Subtle Network Design - The Apprentice Card. Image copyright subtlenetwork.com]

As I was tasked to write the chapter on OER Search Engine Optimisation (SEO) and Discoverability as part of our recent OER Booksprint, I thought I’d share some personal reflections on the JDEP (Improving discoverability through semantic technology) project, touching upon JEAP (Expanding Jorum’s collection through aggregation projects).

Looking through JDEP, the focus appears to be mainly on improving internal discoverability within Jorum through better indexing. There are some very interesting developments in this area, most of which are beyond my realm of expertise.

Autonomy IDOL

The first aspect is deploying Autonomy IDOL, which uses “meaning-based search to unlock significant research material”. Autonomy is an HP-owned company, and IDOL (Intelligent Data Operating Layer) was recently used in a project by Mimas, JISC Collections and the British Library to unlock hidden collections. With Autonomy IDOL it means that:

rather than searching simply by a specific keyword or phrase that could have a number of definitions or interpretations, our interface aims to understand relationships between documents and information and recognize the meaning behind the search query.

This is achieved by:

  • clustering search results around related conceptual themes
  • full-text indexing of documents and associated materials
  • text-mining of full-text documents
  • dynamic clustering and serendipitous browsing
  • visualisation approaches to search results

An aspect of Autonomy IDOL that caught my eye was:

 conceptual clustering capability of text, video and speech

Will Jorum be able to index resources using Autonomy's Speech Analytics solution?

If so, that would be very useful; the issue may be how Jorum resources are packaged and where they are hosted. If you would like to see Autonomy IDOL in action you can try the Institutional Repository Search, which searches across 160 UK repositories.

Will Jorum be implementing an Amazon style recommendation system?

One thing it’ll be interesting to see (and this is perhaps more of a future aspiration) is the integration of an Amazon-style recommendation system. The CORE project has already published a similar documents plugin, but given Jorum already has single sign-on I wonder how easy it would be to integrate a solution that makes resource recommendations based on usage data (here’s a paper on A Recommender System for the DSpace Open Repository Platform).

Elasticsearch

This is a term I’ve heard of but don’t really know enough about to comment on. I’m mentioning it here mainly to highlight the report Cottage Labs prepared, Investigating the suitability of Apache Solr and Elasticsearch for Mimas Jorum / Dashboard, which outlines the problem and a solution for indexing and statistical querying.

External discoverability and SEO

Will Jorum be improving search engine optimisation?

From the forthcoming chapter on OER SEO and Discoverability:

Why SEO and discoverability are important

In common with other types of web resources, the majority of people will use a search engine to find open educational resources, therefore it is important to ensure that OERs feature prominently in search engine results. In addition to ensuring that resources can be found by general search engines, it is also important to make sure they are easily discoverable in sites that are content- or type-specific, e.g. iTunes, YouTube, Flickr.

Although search engine optimisation can be complex, particularly given that search engines may change their algorithms with little or no prior warning or documentation, there is growing awareness that if institutions, projects or individuals wish to have a visible web presence and to disseminate their resources efficiently and effectively, search engine optimisation and ranking cannot be ignored.

The statistics are compelling:

  • Over 80% of web searches are performed using Google [Ref 1]
  • Traffic from Google searches varies from repository to repository, but figures in the 50-80% range are not uncommon [Ref 2]
  • As an indication 83% of college students begin their information search in a search engine [Ref 3]

Given the current dominance of Google as the preferred search engine, it is important to understand how to optimise open educational resources to be discovered via Google Search. However, SEO techniques are not specific to Google and are equally applicable to optimising resource discovery by other search engines.

By all accounts the only way for Jorum is up, as it was recently reported on the JISCMail REPOSITORIES-LIST that “just over 5% of Jorum traffic comes directly from Google referrals”. So what is going wrong?

I’m not an SEO expert, but a quick check using a search for site:dspace.jorum.ac.uk returns 135,000 results, so content is being indexed (Jorum should have access to Google Webmaster Tools to get detailed index and ranking data). Resource pages include metadata such as DC.creator, DC.subject and more. One thing I noticed was missing from Jorum resource pages was <meta name="description" content="A description of the page" />. Why might this be important? Google will ignore meta tags it doesn't know (and here is the list of meta tags Google knows), so the Dublin Core tags do little on their own, whereas the description tag is one Google understands and can use for the snippet shown in search results.
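As a sketch of what that would look like (the title and description below are invented, not taken from an actual Jorum record), adding something like the highlighted line to the <head> of each resource page would give Google a summary it recognises:

    <head>
      <title>Introduction to Differentiation - Jorum</title>
      <meta name="DC.creator" content="Example Author" />
      <meta name="DC.subject" content="Mathematics" />
      <!-- the missing piece: a description Google can use for its result snippet -->
      <meta name="description" content="An open educational resource introducing differentiation, with worked examples and exercises." />
    </head>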

Another factor might be that Google apparently (I can’t find a reference) places more trust in metadata that is human-readable, marked up inline using RDFa. So instead of hiding meta tags in the <head> of a page, Google might weight the data better if it appeared as inline markup:

[Image: current Jorum resource HTML source]

[Image: the same resource page with example RDFa markup]
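As the screenshots don’t reproduce here, the contrast is roughly the following (an illustrative sketch, not the actual Jorum source; the creator value is invented). Hidden in the <head>:

    <meta name="DC.creator" content="Jane Smith" />

versus the same statement made inline in the visible page with RDFa:

    <p xmlns:dc="http://purl.org/dc/elements/1.1/">
      Created by <span property="dc:creator">Jane Smith</span>
    </p>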

[Taking this one step further, Jorum might want to use schema.org to improve how resources are displayed in search results.]

It will be interesting to see if JEAP (Expanding Jorum’s collection through aggregation projects) will improve SEO because of the backlink love.

Looking further ahead

Will there be an LTI interface to allow institutions to integrate Jorum into their VLE?

Final thought: it's been interesting to see Blackboard enter the repository marketplace with xpLor (see Michael Feldstein’s Blackboard’s New Platform Strategy for details). A feature of this cloud service that particularly caught my eye was the use of IMS Learning Tools Interoperability (LTI) to allow institutions to integrate a repository within their existing VLE (see the CETIS IMS Learning Tools Interoperability briefing paper). As I understand it, this would let institutions seamlessly deposit and search for resources from within their VLE. I wonder: is this type of solution on the Jorum roadmap, or do you feel there would be a lack of appetite within the sector for such a solution?

Fin

Those are my thoughts anyway. I know Jorum would welcome additional feedback on their Summer of Enhancements. I also welcome any thoughts on my thoughts ;)

BTW, here's a nice presentation on Improving Institutional Repository Search Engine Visibility in Google and Google Scholar.

We've seen a huge increase in the amount of audio and video being integrated into e-learning. Sites like TeacherTube make it very easy to upload custom content or reuse existing material created by others. An issue with this type of media is how you sift through all the junk to find the content you want. Tagging helps to a degree, but you are reliant on the individual attaching the right data, and what if the content you are interested in is a 30-second clip in a 10-minute video? Google think they have solved this problem with their Gaudi audio indexing and search service.

Gaudi is designed to search through clips, initially on YouTube, and automatically catalogue spoken words into a text transcript. This index will not only be searchable, but users will also be able to jump to the parts of the original video where the words were uttered. So far this new tool is only being used to index political speeches in the current American Presidential election (groan), but hopefully it will be unleashed to index all of YouTube, Google Video and beyond. You can try out the new service at http://labs.google.com/gaudi.